[Python-Dev] Algoritmic Complexity Attack on Python (original) (raw)

Tim Peters tim.one@comcast.net
Fri, 30 May 2003 19:07:38 -0400


[Scott Crosby]

... Have you done benchmarking to prove that stringhash is in fact an important hotspot in the python interpreter?

It depends on the specific application, of course. The overall speed of dicts is crucial, but in many "namespace" dict apps the strings are interned, implying (among other things) that a hash code is computed only once per string. In those apps the speed of the string hash function isn't so important. Overall, though, I'd welcome a faster string hash, and I agree that Python's isn't particularly zippy.

OTOH, it's just a couple lines of C that run fine on everything from Palm Pilots to Crays, and have created no portability or maintenance headaches. Browsing your site, you've got over 100KB of snaky C code to implement hashing functions, some with known bugs, others with cautions about open questions wrt platforms with different endianness and word sizes than the code was initially tested on. Compared to what Python is using now, that's a maintenance nightmare.

Note that Python's hash API doesn't return 32 bits, it returns a hash code of the same size as the native C long. The multiplication gimmick doesn't require any pain to do that.

Other points that arise in practical deployment:

If so, and doing one modulo operation per string is unacceptable,

If it's mod by a prime, probably. Some architectures Python runs on require hundreds of cycles to do an integer mod, and we don't have the resources to construct custom mod-by-an-int shortcut code for dozens of obscure architectures.

then you may wish to consider Jenkin's hash. The linux kernel people are switching to using a keyed veriant of Jenkin's hash. However, Jenkin's, AFAIK, has no proofs that it is in fact universal. It, however, probably is safe.

Nobody writing a Python program has to use a dict. That dicts have quadratic-time worst-case behavior isn't hidden, and there's no cure for that short of switching to a data structure with better worst-case bounds. I certainly agree it's something for programmers to be aware of. I still don't see any reason for the core language to care about this, though.