[Python-Dev] String hash function multiplier
Tim Peters tim_one at email.msn.com
Wed Apr 14 23:13:45 EDT 2004
- Previous message: [Python-Dev] Optimization targets
- Next message: [Python-Dev] String hash function multiplier
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
[Raymond]
On the P4, the documented latency went up from 4 cycles to 14 cycles while shifts and adds went down to 0.5 cycles and 1 cycle respectively. Timings confirm the result.
What makes the current stepping of some flavor of P4 overwhelmingly important <0.3 wink>? Tradeoffs at this level change with every chip generation, and within a generation vary across vendors. In particular, they can make int multiply as fast as they want just by throwing more highly regular silicon at it (which is why its relative performance varies so much across chip generations, and is usually worst in the earliest iterations of a new generation).
My overwhelming concern isn't micro-efficiency of the generated code for the string-hash loop, it's that the resulting hash have high quality. I don't know that the "little prime" is vulnerable to bad cases in non-contrived real life, but we have years of evidence suggesting that the "big prime" isn't. If we can develop evidence that the little prime isn't either, great, then I lose all objections to using it.
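For reference, here's a sketch of the loop in question, essentially the string hash as it appeared in CPython's stringobject.c at the time (the standalone signature here is an assumption for illustration); 1000003 is the "big prime" multiplier:

    static long
    string_hash(const unsigned char *p, long len)
    {
        long i = len;
        long x = *p << 7;       /* assumes a NUL-terminated buffer, as
                                   ob_sval is, so this is safe for len == 0 */

        while (--i >= 0)
            x = (1000003 * x) ^ *p++;   /* one multiply per character */
        x ^= len;
        if (x == -1)            /* -1 is reserved to signal an error */
            x = -2;
        return x;
    }

Each iteration's multiply depends on the previous iteration's result, so the loop's throughput is bounded by the latency of imul, not by its issue rate.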
[Raymond]
It looks like the best bet is to try to speed up the code without changing the multiplier. Intel's software optimization cookbook recommends a partial unrolling and elimination of data dependencies so that a second multiply can start 4 cycles after the previous one started. If practice bears out the theory, the timings could show a threefold or fourfold speedup without changing the multiplier.
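To make the tradeoff concrete, here's one hypothetical way (a sketch, not code from the thread or from Intel's cookbook) to break the multiply dependency chain: carry two accumulators over alternating characters so that their multiplies can overlap in the pipeline, then fold them together at the end. Note that this necessarily computes different hash values than the simple loop, since multiplication doesn't distribute over xor -- which is why the quality question above can't be dodged:

    #define MULTIPLIER 1000003UL

    static unsigned long
    unrolled_string_hash(const unsigned char *p, size_t len)
    {
        unsigned long x, y = 0;
        size_t i = 0;

        if (len == 0)
            return 0;
        x = p[0] << 7;
        for (; i + 1 < len; i += 2) {
            x = (MULTIPLIER * x) ^ p[i];      /* these two multiplies are  */
            y = (MULTIPLIER * y) ^ p[i + 1];  /* independent of each other */
        }
        if (i < len)                          /* leftover odd character */
            x = (MULTIPLIER * x) ^ p[i];
        x ^= MULTIPLIER * y;                  /* fold in the second accumulator */
        x ^= len;
        return x;
    }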
Or leave it alone. Hand unrolling a loop all but guarantees that the next generation of compiler optimizer won't be able to reverse-engineer the original "natural" loop structure, and so won't be able to do the code transformation that works best for the next generation of hardware tricks. IOW, once you start down the path of hand-optimizing for a specific compiler + HW combo, there are two outcomes: (1) someone makes this their job forever after; or, (2) the performance of the code actually gets worse over time (compared to what it would have been if the original code had been left dirt simple).
I happen to have an app (spambayes) that makes very heavy use of string-keyed dicts, and where interning is impractical (so the string hashes can't be avoided). Still, speed up string hashing by a factor of 10, and it won't make enough difference in that app's throughput to mention. I also have a hyper-threaded box today, and the imul stalls in this P4 give great opportunities for other processes to get work done <0.9 wink>.