[Python-Dev] Tuning Python dicts

Reid Kleckner rnk at mit.edu
Sat Apr 10 20:06:59 CEST 2010


Hey folks,

I was looking at tuning Python dicts for a data structures class final project. I've looked through Objects/dictnotes.txt, and obviously there's already a large body of work on this topic. My idea was to alter dict collision resolution as described in the hopscotch hashing paper[1]. I think the PDF I have came from behind a paywall, so I can't find a link to the original paper.

[1] http://en.wikipedia.org/wiki/Hopscotch_hashing

Just to be clear, this is an experiment I'm doing for a class. If it's successful, which I think is unlikely since Python dicts are already well-tuned, I might try to contribute it back to CPython over the summer.

The basic idea of hopscotch hashing is linear probing with a cutoff (H), but instead of resizing and rehashing the table when the probe fails, you find the next empty slot in the table and hop it backwards into the neighborhood of the original hash index by displacing entries that can still reach their own neighborhoods. This means insertion can spend potentially a lot of extra time, but it makes lookups very fast, because H is usually chosen so that the entire probe spans at most two cache lines. That's much better than the current pseudo-random probing scheme (what's the right name for the current approach?), which potentially does a handful of scattered accesses into the table.
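To make that concrete, here's a minimal Python sketch of what I have in mind (the class name, H = 8, the power-of-two table, and the load-factor threshold are my own choices, not taken from the paper): each home bucket keeps an H-bit bitmap of which of the next H slots hold items that hashed to it, so a lookup scans only those slots, while an insert linear-probes for the first empty slot and then hops it backwards into range.

H = 8  # neighborhood size; small enough that a probe spans ~2 cache lines

class HopscotchDict:
    """Sketch only: no deletion, no error handling."""

    def __init__(self, size=16):
        self.size = size            # must stay a power of two
        self.keys = [None] * size
        self.vals = [None] * size
        self.hops = [0] * size      # H-bit bitmap per home bucket
        self.used = 0

    def _home(self, key):
        return hash(key) & (self.size - 1)

    def lookup(self, key):
        home = self._home(key)
        bits = self.hops[home]
        # Only the H slots starting at `home` can hold this key.
        for i in range(H):
            if bits & (1 << i):
                j = (home + i) & (self.size - 1)
                if self.keys[j] == key:
                    return self.vals[j]
        raise KeyError(key)

    def insert(self, key, val):
        if self.used * 4 >= self.size * 3:   # keep load factor under 3/4
            self._resize()
        home = self._home(key)
        bits = self.hops[home]
        for i in range(H):                   # update in place if present
            if bits & (1 << i):
                j = (home + i) & (self.size - 1)
                if self.keys[j] == key:
                    self.vals[j] = val
                    return
        free = home                          # linear-probe for an empty slot
        while self.keys[free] is not None:
            free = (free + 1) & (self.size - 1)
        # Hop the empty slot backwards until it lands within H of `home`.
        while ((free - home) & (self.size - 1)) >= H:
            free = self._hop_back(free)
            if free is None:                 # nothing movable: grow and retry
                self._resize()
                self.insert(key, val)
                return
        self.keys[free] = key
        self.vals[free] = val
        self.hops[home] |= 1 << ((free - home) & (self.size - 1))
        self.used += 1

    def _hop_back(self, free):
        # Find an entry shortly before `free` whose home neighborhood also
        # covers `free`, and move it forward into the empty slot.
        for dist in range(H - 1, 0, -1):
            cand = (free - dist) & (self.size - 1)
            bits = self.hops[cand]
            for i in range(dist):
                if bits & (1 << i):
                    src = (cand + i) & (self.size - 1)
                    self.keys[free] = self.keys[src]
                    self.vals[free] = self.vals[src]
                    self.keys[src] = None
                    self.vals[src] = None
                    self.hops[cand] &= ~(1 << i)
                    self.hops[cand] |= 1 << dist
                    return src               # src is the new empty slot
        return None

    def _resize(self):
        items = [(k, v) for k, v in zip(self.keys, self.vals)
                 if k is not None]
        self.size *= 2
        self.keys = [None] * self.size
        self.vals = [None] * self.size
        self.hops = [0] * self.size
        self.used = 0
        for k, v in items:
            self.insert(k, v)

(Note this sketch runs into exactly the caveat at the end of this mail: once more than H keys share a single hash, _hop_back can never succeed and _resize loops forever.)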

Looking at dictnotes.txt, I can see that people have experimented with taking advantage of cache locality. I was wondering what benchmarks were used to glean these lessons before I write my own. Python obviously has very particular workloads that need to be modeled appropriately, such as namespaces and **kwargs dicts.
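Until I find the real benchmarks, I've been sketching workload shapes like the following (the loop counts and key sets here are made up, just to model a read-heavy namespace dict versus a build-once **kwargs dict):

import timeit

# Read-heavy namespace dict: a small dict of string keys that gets
# looked up over and over, like attribute/method lookup.
namespace = {name: object() for name in
             ("append", "extend", "insert", "remove", "sort", "pop")}

def namespace_lookups():
    for _ in range(100):
        _ = namespace["append"]
        _ = namespace["sort"]
        _ = namespace["pop"]

# **kwargs churn: a tiny dict built once, read once per key, and
# immediately discarded.
def kwargs_churn():
    d = dict(sep=", ", end="\n", file=None, flush=False)
    for k in ("sep", "end", "file", "flush"):
        _ = d[k]

for fn in (namespace_lookups, kwargs_churn):
    print(fn.__name__, timeit.timeit(fn, number=10000))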

Any other advice would also be helpful.

Thanks, Reid

One caveat I need to work out: if more than H items hash to the same bucket, you need to resize and rehash. However, with a particularly evil hash function that always returns zero, no amount of rehashing will ever fit all the items into the first H buckets, so insertion would loop forever, whereas I believe the current scheme just degrades to terrible (linear-time) performance. IMO the solution is to increase H for the table if the rehash fails, but realistically this will never happen unless the programmer is being evil. I'd probably skip this detail for the class implementation.
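For reference, the evil case is easy to write down in pure Python (Evil is just a contrived key type for illustration, not something from the paper):

# A key type whose hash is always zero: every key lands in the same
# home bucket. CPython's current probing degrades to O(n) per
# operation but still terminates; a naive hopscotch insert would
# keep resizing (or loop) once more than H such keys exist.
class Evil(object):
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return self is other

d = {}
for _ in range(1000):
    d[Evil()] = True   # quadratic overall in CPython, but it finishes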


