[Python-Dev] RE: [spambayes-dev] improving dumbdbm's survival chances...

Tim Peters tim.one@comcast.net
Sun, 13 Jul 2003 15:16:56 -0400


[Skip]

I realize we (the Spambayes folks) want to discourage people from using dumbdbm, but for those who are either stuck with it or don't realize they are using it, I wonder if we can do a little something to help them out.

As I understand it, if a machine crashes or is shut down without exiting Outlook, there's a good chance that the dumbdbm's commit method won't have been called and the directory and data files will be out-of-sync.

This is so. Worse, because spambayes never calls close() on its Shelf object, it implicitly relies on dumbdbm.__del__ to rewrite the dir file, but dumbdbm.__del__ can easily trigger a shutdown race in dumbdbm._commit (referencing the global "_os" that has already been rebound to None by shutdown cleanup), and the .dir file and .dat files on disk remain inconsistent in that case. (I fixed this race for 2.3 final, BTW.)

It seems that dumbdbm doesn't support a sync() method which shelve likes to call. Shelve's sync method gets called from time-to-time by the Spambayes storage code. dumbdbm.sync has this statement:

No, you're quoting shelve.py here:

    if hasattr(self.dict, 'sync'):
        self.dict.sync()

so maybe it's as simple (short-term) as modifying dbmstorage.open_dumbdbm() to

    def open_dumbdbm(*args):
        """Open a dumbdbm database."""
        import dumbdbm
        db = dumbdbm.open(*args)
        if not hasattr(db, "sync"):
            db.sync = db._commit
        return db

That would help spambayes a lot, because DBDictClassifier.store() does call self.db.sync() on its Shelf at the important times. It wouldn't stop the shutdown race in dumbdbm._commit() from bombing out with an exception, but for spambayes that would no longer matter to on-disk database integrity. People using dumbdbm with spambayes would still be a lot better off using a plain in-memory dict, though (on all counts: it would consume less memory, consume less disk space for the dict pickle, and run faster).
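The interaction between Shelf.sync() and an aliased sync can be sketched without dumbdbm itself. In the example below, FakeDumbDB and add_sync are hypothetical names: the former only counts _commit() calls instead of rewriting a directory file, and the latter applies the hasattr check from the proposed change. It is written for Python 3, where shelve.Shelf.sync() still performs the same hasattr test.

```python
import shelve

class FakeDumbDB(dict):
    """Hypothetical stand-in for a dumbdbm database: has _commit(), lacks sync()."""

    def __init__(self):
        super().__init__()
        self.commits = 0

    def _commit(self):
        # The real method rewrites the directory file; here we only count calls.
        self.commits += 1

def add_sync(db):
    """Alias sync to _commit when the database has no sync() of its own."""
    if not hasattr(db, "sync"):
        db.sync = db._commit
    return db

plain = FakeDumbDB()
shelve.Shelf(plain).sync()    # no sync attribute, so nothing is committed

patched = add_sync(FakeDumbDB())
shelf = shelve.Shelf(patched)
shelf["word"] = (3, 1)
shelf.sync()                  # now reaches patched._commit()
```

Without the alias, Shelf.sync() silently does nothing for such a database, which is why the directory file can go stale between stores.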

The above should help. Meanwhile, it appears that would be a good method to add to dumbdbm databases both for 2.3 and the 2.2 maintenance branch.

Fine by me, although I doubt a 2.2.4 will get released. Adding

sync = _commit

to the 2.3 code (+ docs + test) should be sufficient.
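As a minimal sketch of what a class-level alias looks like, assuming a hypothetical TinyDB class whose _commit() merely counts calls:

```python
class TinyDB:
    """Hypothetical class; only the aliasing pattern mirrors the suggestion."""

    def __init__(self):
        self.commits = 0

    def _commit(self):
        # A real database would rewrite its directory file here.
        self.commits += 1

    # Class-level alias: every instance gains a public sync() method.
    sync = _commit

db = TinyDB()
db.sync()    # runs the same function as db._commit()
```

One thing to keep in mind with this pattern: the alias is bound when the class body runs, so a subclass that overrides _commit() would need to rebind sync as well.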

BTW, this code in the spambayes storage.py is revolting (having one module change the documented default behavior of another module is almost always indefensible -- I can't see any reason for this abuse in spambayes):

"""
# Make shelve use binary pickles by default.
oldShelvePickler = shelve.Pickler
def binaryDefaultPickler(f, binary=1):
    return oldShelvePickler(f, binary)
shelve.Pickler = binaryDefaultPickler
"""
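For contrast, a caller that wants binary pickles can ask for them on its own Shelf rather than rebinding shelve.Pickler for every user of the module. The sketch below assumes a Python whose shelve accepts a protocol argument (Python 3 is used here); the in-memory backing dict and the key are illustrative only.

```python
import shelve

backing = {}                                # illustrative in-memory backing store
shelf = shelve.Shelf(backing, protocol=2)   # binary pickles for this shelf only
shelf["word"] = [3, 1]

# Shelf stores keys as encoded bytes and values as pickles.
raw = backing[b"word"]
```

Other code importing shelve keeps the documented default behavior, since nothing at module level has been changed.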