[Python-Dev] Dictionary sparseness (original) (raw)

Jack Diederich jack@performancedrivers.com
Mon, 12 May 2003 02:16:57 -0400


On Sun, May 11, 2003 at 09:47:33PM -0400, Tim Peters wrote:

[Agthorr] > An alternate optimization would be the additional of an immutable > dictionary type to the language, initialized from a mutable dictionary > type. Upon creation, this dictionary would optimize itself, in a > manner similar to "gperf" program which creates (nearly) minimal > zero-collision hash tables.

For symbol table apps, Bentley & Sedgewick rehabilitated the idea of ternary search trees a few years ago, and I think you'd enjoy reading their papers: http://www.cs.princeton.edu/~rs/strings/ In particular, they're faster than hashing in the failing-lookup case.

They nest well too. And you can do some caching if the higher level trees are unchaning (local scope can shortcut into builtins).

I have a pure-python ternary tree and a C w/python wrappers of ternary trees lying around. They were written with symbol tables is mind, I haven't touched em since my presentation proposal on the topic [ternary trees in general, replacing python symbol dict w/ t-trees as the closing example] was declined for the Portland OReilly thingy (bruised ego, sour grapes, et al).

Cut-n-paste from an off-list for this undying thread below. Hettinger's idea of treaps is a good one. A ternary-treap would also be possible.

-jack

[Raymond]

My thought is to use a treap. The binary search side would scan the hash values while the heap part would organize from most frequent to least frequently accessed key. It could even be dynamic and re-arrange the heap according to usage patterns.

[me] treaps would probably be a better fit than ternary trees, espcially for builtins for the reasons you mention. A good default ordering would go a long way.

[me, about ternary trees] They nest nicely, a valid 'next' node can be another ternary tree, so pseudo code for import would be

newmodule = import('mymodule')

assume module_symdict is the module's symbol table

module_symdict['mymodule.'] = newmodule.module_symdict

a lookup for 'mymodule.some_function' would happily run from the current module's tree into the 'mymodule' tree. The '.' seperator would only remain speical from a user's point of view.

If symbols don't share leading characters, ternary trees are just binary trees that require additional bookkeeping. This is probably the case, so ternary trees become less neat [even if they do make for prettier pictures].