[Python-3000] Performance Notes - new hash algorithm (original) (raw)

Jim Jewett jimjjewett at gmail.com
Mon Sep 10 02:58:34 CEST 2007


On 9/8/07, Tim Peters <tim.peters at gmail.com> wrote:

in comments in dictobject.c. As it notes there, hashing the strings "namea", "nameb", "namec", and "named" currently produces (on a sizeof(long) == 4 box):

-1658398457 -1658398460 -1658398459 -1658398462

That the hash codes are very close but not identical is "a feature", since the dict implementation only looks at the last k bits (for various more-or-less small values of k): this gives "better than random" dict collision behavior for input strings very close together.

The proposed hash produces instead:

1892683363 -970432008 51735791 1567337715

Obviously much closer to "random" behavior, but that's not necessarily a good thing for dicts.

To spell this out a bit more:

For cryptography, you want a "random" has function. For hash tables, you just want one that spreads out your actual input. For strings, this tends to mean short strings that look like possible variable names. Because they often are variable names, they are sometimes sequential, like var_a, var_b, var_c.

In the current CPython implementation, dicts start as a size-8 smalldict, and most dicts never grow beyond that. So the effective hash is really (hash%8)

When adding four entries to an 8-slot table, a truly random hash would have at least one collision (0/8 + 1/8 + 2/8 + 3/8 =) 3/4 of the time. As expected, the proposed hash does have a collision for those four values (the first and fourth).

The current hash function does not collide for strings that change only one character to the "next" in ASCIIbetical order until the 9th string -- at which time you need to resize anyhow.

For larger tables, having them close still doesn't cause a problem, and may even be useful if you do decide to sort the keys. (CPython lists use a "timsort" that takes advantage of partially sorted input, so if the iterator gets them close to sorted initially, that can help.)

-jJ



More information about the Python-3000 mailing list