[Python-Dev] String hash function multiplier (original) (raw)

Tim Peters tim_one at email.msn.com
Tue Apr 13 23:40:32 EDT 2004


[Raymond]

Does anyone have any issues with changing the hash multiplier for the string and Unicode hash functions?

Don't touch it unless you can prove major benefits -- it's a remarkable fact of life that the current multiplier hasn't resulted in any real-life (but non-contrived) pathological cases.

Instead of 1000003, I would like to use Aho's 65599, a prime near 2**16 that is efficiently expressible as (x << 6) + (x << 16) - x. This replaces a multiply with fast adds and shifts (with the two shifts potentially being executed in parallel).

It's unclear why that's a good thing. Perhaps you think shifts and adds are faster? I wouldn't -- the imul instruction on modern Pentiums is very fast.

It is clear why it may be a bad thing: that it can be expressed as just a couple shifts and adds makes it suspect as a good scrambler (read Knuth).

Googling for "hash 65599" shows a long history of widespread use and testing without any problems.

Testing in the context of Python string hashing? I didn't think so . The current multiplier has been studied extensively in Python, both via real-life use and via large-scale focused statistical testing (we got some money to do that during BeOpen.com's brief life -- a surprise that came out of that was that CRC made a terrible string hash function as the number of input strings got large). The right thing to compare Python's string hash to is "the standard" Fowler-Noll-Vo string hash, which was developed independently, but which also ended up using a large multiplier.



More information about the Python-Dev mailing list