
Why I Like ZODB

Why I like ZODB better than other persistence systems for writing real-world web applications.

It's not a state secret that I like ZODB. It's a very civilized way to store web application data. I'll try to enumerate some of the most important reasons I like ZODB here, and why I prefer it to other NoSQL systems and relational systems.

For the record, the stuff in this blog post is in the context of writing a web application. I don't mean it in the context of an OLAP system, or some data warehousing system. I mean it in the context of writing your typical web application, which needs to support fewer than a couple thousand requests per second, a small fraction of which are write requests.

Transactions

ZODB uses transactions. When you write to a ZODB database, you change a bunch of objects, then you commit the changes. Until you commit your changes, other threads and processes accessing the database won't see those changes.

I take transactions entirely for granted when writing an application. Wrapping a set of actions which mutate persistent data in a transaction makes a whole class of really hard problems disappear. Many existing NoSQL databases, like MongoDB, do not offer transactions.

Any speed or feature benefit in using a non-transactional data store would just be lost in the noise of needing to cope with the loss of transactionality for anything except the most immense, purpose-built application (e.g. you're writing Twitter, or GitHub). If an application is anticipated to serve fewer than, say, 50 write requests per second or so, it's pretty foolish to even momentarily consider using a system without transactions. Even much busier systems can be engineered to use databases with transactions, albeit with lots of fancy coding.

Truth be told, I'm really not clever enough to write a system without transactions where I needed to have any level of confidence that I'd stored inter-related data correctly. I'm also pretty sure I don't want to be that clever, and a job that required me to be that clever would be a very long job indeed. I certainly wouldn't willingly reach into my toolbox and pick out a data store without transactions for a run-of-the-mill, low-traffic web application.

ZODB gives me transactions, and I appreciate that.

Caching

The way ZODB operates is pretty simple. Any pickleable object created in memory can be persisted. It's persisted by attaching it (via __setattr__ or __setitem__) to another persistent object. Once persisted, every object has its own identifier that can be used as a cache key.

Because data in the database is organized the same way it would otherwise be organized in memory, ZODB has a natural caching system that is simple and robust. Each persistent object, once loaded, remains in a per-thread memory cache until evicted. It's evicted when it hasn't been accessed in a while and RAM is needed to load a more recently accessed object into the cache, or when another thread or process has changed the object. There's very little cost associated with asking for the same set of objects over and over once that set of objects has been asked for initially.

If your working set is smaller than your available RAM, the way ZODB caches its objects effectively means that accessing an object loaded from ZODB is almost costless after the object is first loaded. That object won't be evicted from the in-memory cache until it changes. In such cases, your application will work at "RAM speed" rather than at "disk speed". If your working set is larger than available RAM, accessing the same set of objects over and over won't be costless, but if you ask for that set of objects often enough, and they don't change very often, and you have an appropriate amount of memory in the machine, they will almost never be evicted from the cache.

Because most requests in a typical web application are read requests (they do not mutate persistent data), this built-in caching is extremely effective in a real-world sense. You usually don't have to employ a third-party caching system to make pages render quickly. They just render quickly in the first place without degrading too much as load increases. There are certainly times you have to think harder about it, such as when you begin to have too much data in your working set to effectively fit into the cache, but often applications never reach this level of exposure, and when they do, the answer is often to just increase the memory in the system and the cache size. In the worst case scenario, you do what everyone else who doesn't use ZODB does: you use a global external cache to make lookup speeds acceptable. In ten years of ZODB use, I can count the number of times I've had to do this on one finger. These days, instead of using an external cache, I might just try to change ZODB instead and improve its caching.

Testability

Of the things I really like about ZODB, the one I like the most is the ability to write easily unit-testable code.

I like the majority of the tests I write to be unit tests. Note: unit tests. Not functional tests. Not integration tests. I just want to test the code I'm trying to test, not its integration with the rest of the system. Functional and integration tests are useful and important too, and you need them for any serious application, but you get better coverage and a much faster test suite when the majority of your tests are unit tests. I don't think most people know what the difference between these styles of testing is. But if you do know, you really know, and you care.

I have a deep respect for the amount of effort put into making object-relational mappings work. SQLAlchemy and other such systems are well written and thoughtfully executed. However, I find it unappealing that every ORM system I've run across effectively requires that you write integration and/or functional tests to actually test the code in your application. These tests need the code to run against a real database. The test code actually causes data to be modified in that database. Between each test case, the database needs to be reset to a baseline state.

This, in practice, utterly blows. As a limitation, it seems to be, at least in part, a function of the query syntax exposed by the ORM. It's an awfully hard syntax to mock out. To my knowledge, it's so hard that no one has even tried; folks just cave and use a "real" database, and write all of their tests as functional tests, or at least all of the tests that come near the database. Most real-world code comes near the database, so it's often diminishing returns to even attempt to discern which tests could be non-integration tests, and people tend to just use the same setup and teardown for all their tests, even the ones that never come close to the database.

Unit tests run much, much, much faster than integration and functional tests. When I see blog posts where people are trying to parallelize their test runs across multiple machines because their test suite takes so long to run, I want to weep, because it's very likely that they'd get just as much of a sense of comfort from a set of unit tests that ran thousands of times faster with a few functional and integration tests thrown in for the sake of sanity. But they can't, because their toolchain doesn't really support it.

ZODB makes it very easy to write the majority of your tests as unit tests. Because the object graph is just a tree of Python objects, and because each of those objects can be instantiated without any particular root or parent, and because often the "query syntax" of ZODB is just plain old Python item or attribute access, you can often just write test code like this:

  import unittest
  import StringIO

  class TestFile(unittest.TestCase):
      def _makeOne(self, stream, mimetype):
          from .. import File
          return File(stream, mimetype)

      def test_ctor_no_stream(self):
          inst = self._makeOne(None, None)
          self.assertEqual(inst.mimetype, 'application/octet-stream')

      def test_ctor_with_stream_mimetype_None(self):
          stream = StringIO.StringIO('abc')
          inst = self._makeOne(stream, None)
          self.assertEqual(inst.mimetype, 'application/octet-stream')
          fp = inst.blob.open('r')
          fp.seek(0)
          self.assertEqual(fp.read(), 'abc')

The above test tests an implementation of a ZODB object representing a file (using the ZODB "blob" functionality). The file object it's testing would be considered a "model" by most people used to "MVC" terminology. Model code is actually pretty easy to test, and I suspect that with careful factoring and stubbing it's possible to test most ORM model code without actually using a database connection.

Let's try something harder. The hardest code to test in any web application is view code. View code is the code that responds to an invocation of a particular URL within your system. It's the code that ties everything together: the model objects that represent persistent storage, the various functional subsystems of the application like the HTTP request, sessions, mailers and filesystems, etc. It's hard to test because it's "where the rubber meets the road". Consider the following Pyramid view function, which is taken from an application that uses ZODB:

  from pyramid.httpexceptions import HTTPFound

  from myapp import File
  from myapp import get_appstruct  # application-specific form helper

  def add_file(context, request):
      appstruct = get_appstruct(request)
      name = appstruct['name']
      filedata = appstruct['file']
      stream = None
      filename = None
      if filedata:
          filename = filedata['filename']
          stream = filedata['fp']
          if stream:
              stream.seek(0)
          else:
              stream = None
      name = name or filename
      fileob = request.registry.content.create(File, stream)
      context[name] = fileob
      return HTTPFound(request.mgmt_path(fileob, '@@properties'))

The above view function accepts a context object (a ZODB object representing a "folder", in this case, which is just a container of other ZODB objects), and a request object. It returns a Response. Here's a test for the function in the case where there's no file data provided:

  import unittest
  from pyramid import testing

  class Test_add_file(unittest.TestCase):
      def test_no_filedata(self):
          from .. import add_file
          created = testing.DummyResource()
          context = testing.DummyResource()
          request = testing.DummyRequest()
          request.mgmt_path = lambda *arg: '/mgmt'
          request.registry.content = DummyContent(created)
          appstruct = {'name':'abc', 'file':None}
          request.appstruct = appstruct
          result = add_file(context, request)
          self.assertEqual(result.location, '/mgmt')
          self.assertEqual(context['abc'], created)

  class DummyContent(object):
      def __init__(self, result):
          self.result = result

      def create(self, *arg, **kw):
          return self.result

Note that the test creates "dummy" (aka stub) objects for the context, the request, and the object that is returned from the call to create. It asserts that the created object is added to the context under the name abc. And that's all it does. In the running application, when a new file object is created by the view, it is persisted. But in the test, no database setup is required because the "query API" is limited to a single call: __setitem__, which seats the object into its parent. We can mock this up without much of a problem. In this case, the DummyResource has a suitable __setitem__ already, so we didn't need to do any mocking (it was already done for us). This test will run in microseconds.

There are definitely far more complex cases, requiring more stubbing, such as code that uses an indexing and querying system like repoze.catalog to look up persistent objects efficiently by asking a centralized index for all objects with such-and-such attribute. In those cases, the "query API" is not nearly as simple. But it's still simple enough to mock up without ever requiring a "real" ZODB database connection. The tests get longer, and the mocking and stubbing code becomes more complex. But it's not hopeless, like it seems to be in ORM systems.
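To illustrate (with entirely made-up names, not the real repoze.catalog API), a catalog-style stub takes more code than DummyResource's ready-made __setitem__, but it still needs no database connection:

```python
# Hypothetical view that looks up documents through a catalog-like
# index; both the view and the catalog API here are illustrative.
def find_by_title(context, request):
    catalog = context['catalog']
    num, docids = catalog.query(('title', request.params['title']))
    return {'count': num, 'docids': list(docids)}

class DummyCatalog(object):
    # Stub: pretends the query matched a fixed set of document ids.
    def __init__(self, results):
        self.results = results
    def query(self, q):
        return len(self.results), self.results

class DummyRequest(object):
    params = {'title': 'abc'}

context = {'catalog': DummyCatalog([1, 2])}
result = find_by_title(context, DummyRequest())
assert result == {'count': 2, 'docids': [1, 2]}
```

The stub grows with the query API, but the test still runs in microseconds.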

Comparable idiomatic ORM code would construct the file object the same way, but would then tend to call e.g. add on a semimagical threadlocal "session" object. So you'd need to at least mock out the thread-local session, or come up with some other context-sensitive way to get at the session without the thread-local. It would also need to ensure that a file object with the same name didn't already exist in the database before adding it blindly, which would imply some sort of existence query. Something like this (I realize my syntax is likely terrible):

  from pyramid.httpexceptions import HTTPFound

  from myapp import Session
  from myapp import File
  from myapp import get_appstruct  # application-specific form helper

  def add_file(request):
      appstruct = get_appstruct(request)
      name = appstruct['name']
      filedata = appstruct['file']
      stream = None
      filename = None
      if filedata:
          filename = filedata['filename']
          stream = filedata['fp']
          if stream:
              stream.seek(0)
          else:
              stream = None
      name = name or filename
      # delete any existing file with the same name before adding
      exists = Session.query(File).filter_by(name=name).first()
      if exists:
          Session.query(File).filter_by(name=name).delete()
      fileob = request.registry.content.create(File, stream)
      Session.add(fileob)
      return HTTPFound(request.mgmt_path(fileob, '@@properties'))

A test is nowhere near as straightforward to write when the view looks like this. You've obtained a global Session object from an import, which needs to either be mocked or have a stub passed in specially. This is purely convention, and you could construct a system that passed the session in like the "context" in a ZODB app, so that's not really a deep concern. But still, the contract of the Session object is complex. It needs to support a query method that accepts an argument; the object that call returns needs a filter_by method; the object filter_by returns needs methods like first (or one) and delete; and so on. The session is the source of truth in this view, so mutations done through it need to be reflected in later queries.

This example doesn't even take into account more common cases where one query depends on the result of another data-mutating query, or methods of the session like flush, which adds attributes (such as automatically computed primary keys) to recently added objects, and so on.
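For comparison, here is roughly what the stubbing might start to look like. All names here are hypothetical test doubles, and this assumes the view uses filter_by(...).first(); with .one(), the stub would have to raise when nothing matched:

```python
# Test doubles sketching the Session contract the ORM view needs:
# query() -> filter_by() -> first()/delete(), plus add().
class DummyQuery(object):
    def __init__(self, existing):
        self.existing = existing
        self.deleted = False
    def filter_by(self, **kw):
        return self
    def first(self):
        return self.existing
    def delete(self):
        self.deleted = True

class DummySession(object):
    def __init__(self, existing=None):
        self._query = DummyQuery(existing)
        self.added = []
    def query(self, cls):
        return self._query
    def add(self, obj):
        self.added.append(obj)

session = DummySession()   # no existing file with that name
session.add('fileob')
# ...and this still ignores flush(), autogenerated primary keys,
# and queries whose results must reflect earlier mutations.
```

Compare this to the single-method __setitem__ contract in the ZODB version of the view.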

It's no wonder that no one bothers to try to mock it out, and just punts back to always testing functionally. When you test functionally, your test runs in milliseconds rather than microseconds. And that difference adds up across lots of tests. You get a lot of power from the ORM, especially the power to do very ad-hoc queries in a sort of stream-of-consciousness way which is very useful in highly dynamic, ill-defined web applications. But you're paying a price. And often you don't need the ad-hocness supplied by the query syntax. You know exactly what you're looking for and where to find it. ZODB lends itself well to such applications, and the complexity curve seems more adjustable on a per-view basis.

For what it's worth, I'd love to be wrong about needing to always write functional tests when an ORM is used. It would mean I could write applications that use an ORM in a style that suits my historical application writing patterns, and in a style that supports very fast test suites. Let me know if you've tried and succeeded. In the meantime, I much prefer to write ZODB applications for testing purposes. Note that it's not just ORMs that have this issue; database bindings for other NoSQL databases (e.g. PyMongo) have similar problems: their APIs are very complicated and are difficult to mock. Often the features you gain from that complexity are not worth the price you pay.

Created by chrism
Last modified 2012-05-15 03:19 AM

Web scale?

But is ZODB web scale? MongoDB is web scale. You should use MongoDB.

Dauntless!!!

ZODB rocks the box. Many of the finest minds of Python have blessed it with their contributions. Yet hipster adoption is low.

I contend that the following things are required for ZODB to truly dominate the Hacker News world.

1. Wu-Tang flavored marketing
2. Ability to use trendy new-school DBs as backends
3. A decent query language
4. Non-Python-specific serialization of objects
5. (related to #2) Operational tooling: replication, clustering, etc.