ZODB Blob Proposal (First Cut)

We hope to get a chance to work on native "blob" support for ZODB at the upcoming PyCon sprints. This proposal can serve as a point of reference for participants in that sprint. Comments are obviously welcome!

Overview

We'd like to provide a mechanism for ZODB users to save large, infrequently-changing chunks of binary data (> ~1MB) into storage in a way that allows later accesses to the object to be more memory- and space-efficient than the current strategies for accessing large objects.

Rationale

Storing large, seldom-changing binary objects in ZODB requires somewhat complicated application logic. When large objects are stored, they currently need to be broken up into many smaller ZODB records, primarily in order to reduce the amount of memory consumed when users wish to deal with the "blob" as a unit.

An example of using "straight" ZODB to store large binary content: The Zope 2 "Image"/"File" content objects use a "Pdata" class to service this requirement. Each Pdata object stores up to 64K of data. The "Image"/"File" object has a method which converts a file stream into multiple Pdata objects which are linked together in succession. A Pdata chain that is created from a 202KB file stream will end up looking like this:

               next             next             next             next
     64K Pdata ---->  64K Pdata ---->  64K Pdata ---->  10K Pdata ---->  None

The None at the end of the Pdata chain delineates the end of the chain.
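
To make this concrete, here is a minimal sketch of the Pdata-chain idea in Python. The names and 64K chunk size follow the description above, but this is illustrative rather than the actual OFS.Image code:

    from persistent import Persistent

    CHUNK_SIZE = 1 << 16  # 64K per Pdata record

    class Pdata(Persistent):
        def __init__(self, data):
            self.data = data
            self.next = None  # link to the next chunk; None ends the chain

    def chain_from_stream(stream):
        # Read the stream 64K at a time, linking each chunk to the next.
        head = tail = None
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break
            node = Pdata(chunk)
            if head is None:
                head = node
            else:
                tail.next = node
            tail = node
        return head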

When a Zope "Image"/"File" object is rendered to the browser, the File object has a method which iterates over all the Pdata objects, sending each in turn to the remote browser via a Zope RESPONSE.write method call. It stops when the None pointer at the end of the chain is encountered. This is fast and robust.
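
The traversal itself might look like the following sketch, assuming a response object with a write method:

    def send_chain(head, response):
        # Walk the linked chunks, writing each to the client; the
        # trailing None marks the end of the chain.
        node = head
        while node is not None:
            response.write(node.data)
            node = node.next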

However, storing binary data in this way comes at a price. In order to increase the overall speed of access to objects stored within a ZODB storage, ZODB keeps an in-memory LRU cache of objects. When Pdata objects are accessed, they are placed in the cache and, if the cache is full, their insertion may evict other (possibly frequently-accessed) objects from the cache.

Also, in the case of Zope, when it serves a large "file" stored in ZODB, it tends to buffer the data into a temporary file before actually sending content out to a remote browser. This requires extra disk I/O and CPU time.

It would be more efficient to store large objects as actual files (or at least as objects which can be opened as a stream) instead of storing them in a ZODB storage as many distinct objects that must be reassembled into a unit at processing time.

Goals

  • Provide a "blob" API to ZODB users which allows them to store, maniuplate, and retrieve file-like objects much like they would access files via "straight" Python.
  • Provide transactional integrity when a user stores or modifies a blob.
  • Allow many blobs to be accessed in the course of a single transaction.
  • Allow blob accesses to work from applications which serve the same data from separate physical machines via ZEO.
  • Prevent blob data from evicting "more important" data that lives in a ZODB or ZEO cache.
  • Allow blobs to participate in ZODB undo/pack machinery (maybe optionally).

Proposal

We'd like to create a common API for blob-like objects, allowing people to create their own blob object implementations. We'd also like to create a "reference" Blob implementation.

The API for creating a blob may be used something like this within a method of a Persistent object:

    def createMyBlob(self):
        from ZODB.blob import FileBlob
        self.blob = FileBlob()
        fh = self.blob.open('wb')
        fh.write('some_data')
        fh.close()
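
Presumably, in keeping with the transactional-integrity goal above, a write like this only becomes durable at the transaction boundary. A hypothetical caller might look like this (assuming the standard transaction package):

    import transaction

    obj.createMyBlob()    # stage the blob data on a persistent object
    transaction.commit()  # the blob contents become durable here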

The API for retrieving all data from a blob (not a common thing to want to do, but useful for demonstration) within a method of a Persistent object might look something like this:

    def getAllMyBlobData(self):
        fh = self.blob.open('rb')
        data = fh.read()
        fh.close()
        return data

Essentially, a "FileBlob" in the above is a persistent object that has an "open" method which returns a file-like object. This is the minimum required API for blob objects.
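
Spelled as an interface, that minimum API might look like the sketch below; it uses zope.interface, and the interface name is illustrative rather than part of the proposal:

    from zope.interface import Interface

    class IBlob(Interface):
        def open(mode):
            """Return a file-like object onto the blob's data.

            'mode' follows the usual file-open modes, e.g. 'rb' or 'wb'.
            """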

References

Zope 2 blob product

Created by chrism
Last modified 2005-03-17 12:24 PM

Streamable

I assume you're going to want to be able to trivially and efficiently create a StreamIterator for one of these blobs, so it can be served as fast as possible.
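
For illustration, a minimal stream iterator over a blob might look like this sketch (the class name and chunk size are assumptions, and it uses the Python 2 iterator protocol):

    class BlobStreamIterator:
        def __init__(self, blob, chunk_size=1 << 16):
            self.fh = blob.open('rb')
            self.chunk_size = chunk_size

        def __iter__(self):
            return self

        def next(self):
            # Yield successive chunks until the blob is exhausted.
            chunk = self.fh.read(self.chunk_size)
            if not chunk:
                self.fh.close()
                raise StopIteration
            return chunk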

ZEO

fwiw, some benchmarks I ran a while ago indicated that the current Pdata solution for OFS.Image suffers greatly when used with ZEO. If the data is already in the ZEO client cache, you're fine; performance is comparable to FileStorage. If it's not, I've measured performance roughly 25 times slower.
See the bottom of http://www.slinkp.com/code/zopestuff/blobnotes

I will be THRILLED if you can work out a way to transparently handle blobs through ZEO without such a big hit.

yup...

Yup, streamiterator...

And yes, it should be reasonably fast access to streams representing blobs (from multiple physically distinct systems). I imagine a ZEO server may or may not be involved in actually doing the serving of these streams (depending on the particular Blob implementation). In the beginning, personally, I think I'd just like to use something like NFS to actually get a handle to a stream/file.

Also want to be able to use result of open after connection close

at least (probably most) for reads.

Ideally, I suspect, an open() or open('r') should return an object that is disconnected from the original object. This is a slight deviation from "normal" file semantics.

IOW (contrary to what I suggested in person), we should not try to reflect changes in the same transaction when files are opened read-only.

This makes it simpler to manage the underlying temporary files.
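
One simple way to get that disconnection, sketched below under the assumption that a blob's data lives in an ordinary file on disk, is to copy the current contents into an anonymous temporary file at open time; the returned handle then survives later writes and connection close:

    import shutil
    import tempfile

    def open_detached(path):
        # Copy the blob's current bytes into an unnamed temporary file;
        # the handle stays valid even if the original file changes or
        # the ZODB connection is closed.
        tmp = tempfile.TemporaryFile()
        src = open(path, 'rb')
        shutil.copyfileobj(src, tmp)
        src.close()
        tmp.seek(0)
        return tmp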