[Python-Dev] Simple dicts (original) (raw)

Tim Peters tim.one@comcast.net
Thu, 15 May 2003 10:33:56 -0400


[Brett C.]

When I took data structures I was taught that chaining was actually the easiest way to do hash tables and they still had good performance compared to open addressing. Because of this taught bias I always wondered why Python used open addressing; can someone tell me?

malloc overhead is a major drag; as my msg said, the feasibility depends on pymalloc's very low overhead, and that dictentry nodes are 12 bytes apiece even now; pymalloc didn't exist back then, and Python wasn't originally micro-optimized as it is now (e.g., there wasn't the current zoo of dedicated free-lists, or pre-allocation strategies, and dicts could only be indexed by strings so collision only had to worry about the kinds of problems a single known hash function was prone to).

I am interested in seeing how this would pan out, but I am unfortunately going to be busy for the next three days (if anyone is going to be at E3 Thursday or Friday for some odd reason let me know since I will be there). If someone takes this up please let me know; I am interested in helping if I can. Perhaps this should be a sandbox thing?

There's no rush , and I'd be surprised if Python adopted a different scheme in the end anyway. It's likely a just-for-fun to-see-what-happens project.

Note one nasty class of problem: in chaining only primary collisions exist. The current dict implementation turned the problem of open-addressing's secondary collisions into "a feature", which will become clear when you contemplate dictobject.c's

[i << 16 for i in range(20000)]

example. Python's original dict design didn't have a problem with this because it used prime numbers for table sizes and reduced 32-bit hashes via mod-by-a-prime. The current scheme of just grabbing the last i bits is both much faster and more delicate than that, and we really rely on the collision resolution strategy now to protect against unlucky bit patterns.

Another way of looking at this is that the current scheme has a way to get all 32 bits of a hash code to participate in which table slot gets selected; mod-by-an-odd-prime also gets all bits into play; peel-off-the-last-i-bits does not.