[Python-Dev] Dictionary sparseness (original) (raw)

Agthorr agthorr@barsoom.org
Sun, 11 May 2003 20:28:18 -0700


On Sun, May 11, 2003 at 09:47:33PM -0400, Tim Peters wrote:

Possibly, but it's fraught with difficulties. For example, Python dicts can be indexed by lots of things besides 8-bit strings, and you generally need to know a great deal about the internal structure of a key type to generate a sensible hash function.

A more fundamental problem is that minimality can be harmful when failing lookups are frequent: a sparse table has a good chance of hitting a null entry immediately then, but a minimal table never does. In the former case full-blown key comparison can be skipped when a null entry is hit, in the latter case full-blown key comparison is always needed on a failing lookup.

Both good observations.

For symbol table apps, Bentley & Sedgewick rehabilitated the idea of ternary search trees a few years ago, and I think you'd enjoy reading their papers:

http://www.cs.princeton.edu/~rs/strings/ In particular, they're faster than hashing in the failing-lookup case.

hhmmm.. yes, those are interesting. Thanks :-) A few months ago I implemented suffix trees for fun and practice. Suffix trees are based on tries, and I used a binary-tree for each node to keep track of its children (which the papers point out is an equivalent way of doing ternary trees).

(Suffix trees let you input a set of strings of total length n. This has a cost of O(n) time and O(n) memory. Then, you can look to see if a string of length m is a substring of any of the strings in the set in O(m) time; this is impressive since the number and size of the set of strings only matters for the setup operation; it has no effect on the lookup speed whatsoever.)

Ternary search trees seem like a good approach for string-only dictionaries. These seem like an inelegant optimization that might yield performance improvements for places where non-string keys are syntactically disallowed anyway (such as the members of a class or module).

-- Agthorr