[Python-Dev] Dictionary tuning

Tim Peters <tim.one@comcast.net>
Wed, 30 Apr 2003 22:13:46 -0400


[Raymond Hettinger]

... I worked on similar approaches last month and found them wanting. The concept was that a 64-byte cache line held 5.3 dict entries, and that probing those was much less expensive than making a random probe into memory outside of the cache.

The first thing I learned was that the random probes were necessary to reduce collisions. Checking the adjacent space is like a single step of linear probing; it increases the number of collisions.

Yes, I believe that any regularity will.
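For reference, the 5.3-entries-per-cache-line figure above is just the entry size doing the math. A back-of-the-envelope sketch, assuming the three-word entry layout of the time on a 32-bit build (the struct below is a stand-in, not the real dictobject.h definition):

    /* Rough check of the "5.3 entries per 64-byte cache line" figure.
       Assumes a 32-bit build: a 4-byte hash plus two 4-byte pointers. */
    #include <stdio.h>

    typedef struct {
        long  me_hash;   /* cached hash code of me_key */
        void *me_key;    /* PyObject * in the real thing */
        void *me_value;  /* PyObject * in the real thing */
    } dictentry;         /* 12 bytes on a typical 32-bit build */

    int main(void)
    {
        printf("entries per 64-byte cache line: %.1f\n",
               64.0 / sizeof(dictentry));   /* ~5.3 on 32-bit */
        return 0;
    }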

That would be fine if the cost were offset by decreased memory access time; however, for small dicts, the whole dict is already in cache and having more collisions degrades performance with no compensating gain.

The next bright idea was to have a separate lookup function for small dicts and for larger dictionaries. I set the large dict lookup to search adjacent entries. The good news is that an artificial test of big dicts showed a substantial improvement (around 25%). The bad news is that real programs were worse off than before.

You should qualify that to "some real programs", or perhaps "all real programs I've tried". On the other side, I have real programs that access large dicts in random order, so if you tossed those into your mix, a 25% gain on those would more than wipe out the 1-2% losses you saw elsewhere.

A day of investigation showed the cause. The artificial test accessed keys randomly and showed the anticipated benefit. However, real programs access some keys more frequently than others (I believe Zipf's law applies.)

Some real programs do, and, for all I know, most real programs. It's not the case that all real programs do. The counterexamples that sprang instantly to my mind are those using dicts to accumulate stats for testing random number generators. Those have predictable access patterns only when the RNG they're testing sucks.

Those keys and their collision chains are likely already in the cache. So, big dicts had the same limitation as small dicts: You always lose when you accept more collisions in return for exploiting cache locality.

Apart from that "always" ain't always so, I really like that as a summary!

The conclusion was clear: the best way to gain performance was to have fewer collisions in the first place. Hence, I resumed experiments on sparsification.

How many collisions are you seeing? For int-keyed dicts, all experiments I ran said Python's dicts collided less often than a perfectly random hash table would (the reasons are explained in dictobject.c; briefly, int-keyed dicts tend to use a contiguous range of ints as keys, and contiguous int keys map to distinct slots).
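The int-keyed case is easy to see from the first probe alone. A minimal sketch -- not CPython code -- assuming the usual rule that small ints hash to themselves:

    /* Why a contiguous range of int keys barely collides:  the first
       probe is hash & mask, and small ints hash to themselves, so keys
       0..n-1 land in n distinct slots of any table with at least n slots. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long mask = 8 - 1;        /* an 8-slot table, as for a small dict */
        unsigned long key;
        for (key = 0; key < 5; ++key) {
            unsigned long hash = key;      /* assumed: hash(int) == int for small ints */
            printf("key %lu -> slot %lu\n", key, hash & mask);
        }
        return 0;
    }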

For string-keyed dicts, extensive experiments said collision behavior was indistinguishable from a perfectly random hash table.

I never cared enough about other kinds of keys to time 'em, at least not since systematic flaws were fixed in the tuple and float hash functions (e.g., the tuple hash function used to xor the tuple's elements' hash codes, so that all permutations of a given tuple had the same hash code; that's necessary for unordered sets, but tuples are ordered).

If someone wants to experiment with that in lookdict_string(), stick a new ++i; before the for loop, and move the existing i = (i << 2) + i + perturb + 1; to the bottom of that loop. Likewise for lookdict().
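For concreteness, here is a sketch of what that change does to the probe order only; it is not the real lookup function, which also has to deal with dummies, free slots, and comparison errors:

    /* Probe order with the suggested tweak:  the second probe hits the
       slot adjacent to the first, and only then does the usual scattered
       perturb sequence take over. */
    #include <stdio.h>

    #define PERTURB_SHIFT 5

    static void print_probes(unsigned long hash, unsigned long mask, int nprobes)
    {
        unsigned long i = hash & mask;
        unsigned long perturb;
        int n = 0;

        printf("%lu", i & mask);             /* first probe: hash & mask */
        ++i;                                 /* the experiment: adjacent slot next */
        for (perturb = hash; ++n < nprobes; perturb >>= PERTURB_SHIFT) {
            printf(" %lu", i & mask);
            i = (i << 2) + i + perturb + 1;  /* recurrence moved to the loop bottom */
        }
        printf("\n");
    }

    int main(void)
    {
        print_probes(123456789UL, 32 - 1, 6);  /* a 32-slot table, first 6 probes */
        return 0;
    }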

PyStone gains 1%.
PyBench loses 1%.
timecell gains 2% (spreadsheet benchmark).
timemat loses 2% (pure Python matrix package benchmark).
timepuzzle loses 1% (class-based graph traverser).

You'll forgive me if I'm skeptical: they're such small differences that, if I saw them, I'd consider them to be a wash -- in the noise. What kind of platform are you running on that has timing behavior repeatable enough to believe 1-2% blips?

P.S. There is one other way to improve cache behavior but it involves touching code throughout dictobject.c.

Heh -- that wouldn't even be considered a minor nuisance to the truly obsessed.

Move the entry values into a separate array from the key/hash pairs. That way, you get 8 entries per cache line.

What use case would that help? You said before that small dicts are all in cache anyway, so it wouldn't help those. The jumps in large dicts are so extreme that it doesn't seem to matter if the cache line size du jour holds 1 slot or 100. To the contrary, at the end of the large dict lookup, it sounds like it would incur an additional cache miss to fetch the value after the key was found (since that value would no longer ever ride along with the key and hashcode).

I can think of a different reason for considering this: sets have no use for the value slot, and wouldn't need to allocate space for 'em.
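For the record, a sketch of the split layout described above, with made-up names (nothing like this is in dictobject.c) and assuming 32-bit pointers:

    /* Split layout:  the probed array holds only (hash, key) pairs, 8 bytes
       apiece on a 32-bit build, so 8 of them fit in a 64-byte cache line;
       values sit in a parallel array and are touched only after a hit. */
    #include <stdio.h>

    typedef struct {
        long  me_hash;         /* cached hash code */
        void *me_key;          /* PyObject * in the real thing */
    } keyslot;                 /* 8 bytes on a typical 32-bit build */

    typedef struct {
        keyslot      *ks_table;   /* probed array: hash/key pairs only */
        void        **ks_values;  /* value for ks_table[i] lives in ks_values[i] */
        unsigned long ks_mask;
    } splitdict;

    int main(void)
    {
        printf("key slots per 64-byte cache line: %lu\n",
               64 / (unsigned long)sizeof(keyslot));  /* 8 on 32-bit */
        return 0;
    }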

P.P.S. One other idea is to use a different search pattern for small dictionaries. Store entries in a self-organizing list with no holes. Dummy fields aren't needed, which saves a test in the linear search loop. When an entry is found, move it one closer to the head of the list so that the most common entries get found instantly.

I don't see how more than just one can be found instantly; if "instantly" here means "in no more than a few tries", that's usually true of dicts too -- but is still an odd meaning for "instantly".

Since there are no holes, all eight cells can be used instead of the current maximum of five. Like the current arrangement, the whole small dict fits into just two cache lines.

Neil Schemenauer suggested that a few years ago, but I don't recall it going anywhere. I don't know how to predict for which apps it would be faster. If people are keen to optimize very small search tables, think about schemes that avoid the expense of hashing too.
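And, for what it's worth, a minimal sketch of the self-organizing small-table idea -- made-up names, plain long keys, and no hashing at all:

    /* Dense self-organizing small table:  no holes, no dummies, and a hit
       is swapped one slot toward the front, so frequently used keys drift
       into the first few cells. */
    #include <stdio.h>

    #define SMALL_SIZE 8

    typedef struct { long key; long value; } smallentry;

    typedef struct {
        int        used;
        smallentry e[SMALL_SIZE];
    } smalltable;

    static smallentry *small_lookup(smalltable *t, long key)
    {
        int i;
        for (i = 0; i < t->used; i++) {
            if (t->e[i].key == key) {
                if (i > 0) {               /* move one step toward the front */
                    smallentry tmp = t->e[i - 1];
                    t->e[i - 1] = t->e[i];
                    t->e[i] = tmp;
                    return &t->e[i - 1];
                }
                return &t->e[i];
            }
        }
        return NULL;                       /* not found; caller may insert at e[used] */
    }

    int main(void)
    {
        smalltable t = { 3, { {10, 100}, {20, 200}, {30, 300} } };
        small_lookup(&t, 30);              /* key 30 moves from slot 2 to slot 1 */
        printf("slot 1 now holds key %ld\n", t.e[1].key);
        return 0;
    }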