[Python-Dev] http://mail.python.org/pipermail/python-dev/2011-December/115172.html

Jim Jewett jimjjewett at gmail.com
Mon Jan 2 04:28:13 CET 2012


On Sun, Jan 1, 2012 at 10:00 PM, PJ Eby <pje at telecommunity.com> wrote:

On Sun, Jan 1, 2012 at 7:37 PM, Jim Jewett <jimjjewett at gmail.com> wrote:

Well, there is nothing wrong with switching to a different hash function after N collisions, rather than "in the first place".  The perturbation effectively does this by shoving the high-order bits through the part of the hash that survives the mask.

Since these are true hash collisions, they will all have the same high-order bits.  So, the usefulness of the perturbation is limited mainly to the common case where true collisions are rare.
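For readers following along, the probe sequence being discussed looks roughly like this. It is a simplified sketch of the recurrence in CPython's Objects/dictobject.c (PERTURB_SHIFT is 5 there); the table size and the toy hash values below are illustrative only:

```python
def probe_sequence(h, mask, max_probes=8):
    """Yield the first few slot indices a dict lookup would examine
    for a key with hash value h, in a table of size mask + 1."""
    perturb = h
    i = h & mask
    for _ in range(max_probes):
        yield i
        # Each step folds five more high-order bits of the hash into
        # the index, so keys that collide only on the low bits (i.e.
        # not "true" collisions) diverge after a few probes.
        perturb >>= 5
        i = (5 * i + perturb + 1) & mask

# Two hashes that agree in the low 3 bits (mask = 7) but differ overall:
a = list(probe_sequence(0b1010_1101, 7))
b = list(probe_sequence(0b0111_0101, 7))
assert a[0] == b[0]   # they collide on the first slot...
assert a != b         # ...but the perturbation separates them
```

Note that if the two hashes were fully equal (a true collision), `perturb` would also be equal and the two sequences would be identical, which is exactly the limitation described above.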

That is only because the perturbation is based solely on the hash. Switching to an entirely new hash after the 5th collision (for a given lookup) would resolve that (after the 5th collision); the question is whether the cost is worthwhile.
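As a sketch of what that proposal might look like: probe with the normal hash, and on the 5th collision in a single lookup, jump to a slot derived from an independent hash, so true collisions in the primary hash no longer share a probe sequence. The threshold, the helper names, and the use of sha256 as the second hash are all hypothetical choices for illustration:

```python
import hashlib

FALLBACK_THRESHOLD = 5  # "5th collision", per the proposal above

def fallback_slot(key, mask):
    # An independent second hash; a salted hash would be the realistic
    # choice, sha256 just keeps this sketch dependency-free.
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:8], "little") & mask

def find_slot(table, key, mask):
    """Return the index where key lives, or the empty slot it would use."""
    h = hash(key)
    perturb = h
    i = h & mask
    collisions = 0
    while table[i] is not None and table[i] != key:
        collisions += 1
        if collisions == FALLBACK_THRESHOLD:
            i = fallback_slot(key, mask)  # switch hash functions here
            continue
        perturb >>= 5
        i = (5 * i + perturb + 1) & mask
    return i

table = [None] * 8
for k in ("a", "b", "c"):
    table[find_slot(table, k, 7)] = k
assert table[find_slot(table, "a", 7)] == "a"
```

The cost question raised above is visible even in the sketch: every dict would pay for the collision counter on each lookup, to help only in the rare true-collision case.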

> (Well, technically, you could use trees or some other O(log n) data structure as a fallback once you have too many collisions, for some value of "too many".  Seems a bit wasteful for the purpose, though.)
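For concreteness, the O(log n) fallback idea can be sketched as a per-bucket structure that starts as a short unsorted list and, past a threshold, keeps itself sorted so membership tests can binary-search it. (Java 8's HashMap later shipped a similar list-to-tree conversion.) The threshold and the assumption that colliding keys are mutually orderable are inventions of this sketch:

```python
import bisect

FALLBACK_THRESHOLD = 8  # illustrative; "some value of 'too many'"

class Bucket:
    """Collision bucket: O(n) list until it grows too large,
    then a sorted list with O(log n) binary-search lookups."""

    def __init__(self):
        self.keys = []
        self.is_sorted = False

    def add(self, key):
        if self.is_sorted:
            i = bisect.bisect_left(self.keys, key)
            if i == len(self.keys) or self.keys[i] != key:
                self.keys.insert(i, key)
        else:
            if key not in self.keys:
                self.keys.append(key)
            if len(self.keys) > FALLBACK_THRESHOLD:
                self.keys.sort()  # fall back to the O(log n) form
                self.is_sorted = True

    def __contains__(self, key):
        if self.is_sorted:
            i = bisect.bisect_left(self.keys, key)
            return i < len(self.keys) and self.keys[i] == key
        return key in self.keys
```

The "wasteful" objection is fair: the machinery only pays off in buckets an attacker has deliberately flooded.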

Your WSGI specification <http://www.python.org/dev/peps/pep-0333/> requires using a real dictionary for compatibility; storing some of the values outside the values array would violate that.

When I said "use some other data structure", I was referring to the internal implementation of the dict type, not to user code.  The only user-visible difference (even at C API level) would be the order of keys() et al.

Given the wording requiring a real dictionary, I would have assumed that it was OK (if perhaps not sensible) to do pointer arithmetic and access the keys/values/hashes directly.  (Though if the breakage was between Python versions, I would feel guilty about griping too loudly.)

-jJ


