[Python-Dev] Simple dicts (original) (raw)

Tim Peters tim_one@email.msn.com
Tue, 20 May 2003 23:56:11 -0400


[damien morton]

... I need to address Tim's concerns about the poor hash function used for python integers, but I think this can be addressed easily enough.

Caution: in the context of the current scheme, it's an excellent hash function. No hash function could be cheaper to compute, and in the common case of dicts indexed by a contiguous range of integers, there are no collisions at all. Christian Tismer contributed the pathological case in the dictobject.c comments, but I don't know that any such case has been seen in real life; the current scheme does OK with it.

I would welcome some guidance about what hash functions need to be addressed though. Is it just integers?

Because, e.g., 42 == 42.0 == 42L, and objects that compare equal must have equal hashcodes, what we do for ints has to be duplicated for at least some floats and longs too, and more generally for user-defined numeric types that can call themselves equal to ints (for example, rationals). For this reason it may not be possible to change the hash code for integers (although it would be possible to scramble the incoming hash code when mapping to a table slot, which is effectively what the current scheme does but only when a primary collision occurs).

The string hash code is regular for "consecutive" strings, too (like "ab1", "ab2", "ab3", ...).

Instances of user-defined classes that don't define their own hash effectively use the memory address as the hash code, and of course that's also very regular across objects at times.

(theres a great article on integer hash functions at www.cris.com/~Ttwang/tech/inthash.htm)

Cool! I hadn't seen that before -- thanks for the link.