Review Request CR#7118743 : Alternative Hashing for String with Hash-based Maps (original) (raw)

Doug Lea dl at cs.oswego.edu
Wed May 30 13:49:08 UTC 2012


On 05/29/12 17:25, Mike Duigou wrote:

An intrinsic and/or native version might be added later if it turns out that the Java murmur3 implementation is a performance bottleneck.

I don't think that going native will make much difference. The main disadvantages of using Murmur3 for char[]s are that it takes twice as long as for byte[]s (for which it was developed), and that its statistical collision properties when using chars in which the top byte is usually 0 (as is the case) are probably weaker than otherwise.

Backing up: There's a history of obsession with these sorts of details in JDK hash table classes. Here's some background.

Many small-looking factors can make a big difference in application-level performance, and even more so in some "important" but poorly conceived benchmarks (most notably, the previous version of specJBB). Some of this is due to to the fact that some user and JDK hashCode algorithms (e.g. for class Float) are so poorly distributed that some intervention is required inside the hash table classes to avoid horrible performance. The added wrinkle in the current efforts is to also avoid horrible performance when Strings are based on external inputs that may be crafted to cause collisions. The use of a random seed for String hashing addresses this pretty well. But since this cannot (per the JLS) be done for all uses of String.hashCode(), it opens up further exploration of exactly which random-seeded algorithm to use.

This is similar to previous efforts that took the hashCode() as a given, and then looked at: cost(convertHashToIndexAndLoadBin) + sum(i=0, inf, cost(compareElement) * prob >= i collisions))

Here, the main question is how much effort to expend to convert poorly distributed hash codes via techniques that spread bits across a 32bit hash, resolved to an index via a power-of-two mask. Bit-spreading (via those shifts, xors, etc you see in hash classes) gives fewer collisions, but with diminishing returns with more effort, in part because this code occurs at a choke-point: Loading the bin (and key) at the index may entail a cache miss, stalling the processor, so the code leading up to this point is on a critical path.

You can model some of this analytically, as deviations from ideal Poisson distribution, where the probability of at least one collision is roughly 1 / (8 * N), where N is number of elements, under default resizing policies and pure random keys. But models don't give you any answer beyond illustrating that the tradeoff between mixing cost and collision rates generally favors cheaper mixing. Beyond that, the best you can do is empirically evaluate particular choices against a wide range of key types and use cases, (There are other non-analytic concerns here as well. For example, the current bit-spreading in class HashMap additionally has the property of bounding cache miss rates during traversal when used for adjacent consecutive integer keys. Yes, this is a weird consideration but was important to some people at the time.)

The new String efforts differ in that the hashCode function is up for grabs, and additionally is cacheable. So the costs have both one-time initial hash computation, and recurring (for, say, "U" usages of a key) components. But because we are in control of initial hash, we can ensure that we don't need further bit-spreading. so, for U usages, the cost is:

cost(initialHash(key)) + U * sum(i=0, inf, cost(compareElement) * prob >= i collisions))

Back-of-envelope analytic modeling shows that the tradeoffs between initial cost and collision rate depend on both U (usages per key) and N (number of elements). There are too many unknown constants to make any absolute conclusions about choice of algorithm. But plugging in a few sample values shows that the break even point for increasing initial hash cost vs better collision rate is further out than you might expect. For example, guestimating that the murmur code in current patch is twice as expensive as the "DJB" algorithm in String hashCode (which I think is a lower bound) but gives 10% lower collision probability (which I think is upper bound), you'd only choose it if U > c * N, where "c" absorbs a lot of slop but is likely greater than 1. Which indicates that this is probably not the most common use case. (But it is a common case in most benchmarks, that reuse keys a lot to get better time estimates.)

So it does seem worth exploring algorithms with different speed/quality tradeoffs. For example, some similar guestimation (and a bit of empirical confirmation) suggests that a good choice might be a random-seeded version of the well-established Jenkins "one-time" algorithm (http://en.wikipedia.org/wiki/Jenkins_hash_function), which is only a little more expensive as DJB but doesn't have as good collision rates as Murmur3. (Although I'm not at all sure it is not as good because of the top-byte-usually-zero issue for Murmur3 on char[]).

In fact, it is not clear to me at the moment whether just using random-seeded version of current DJB would be about as good as any other choice. If that were the case, some of the upcoming changes could be further simplified. And could be vastly simplified if it were OK to have a VM switch -UseRandomStringHashSeed. If so, the main impact would be to change one line in String.hashCode. I'm sure the JLS purists out there would not be happy about this though.

(BTW, see some related discussions on StackExchange: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/)

There are other, completely different, approaches to dealing with bad user hash codes, such as the use of separate data structures to handle colliding codes. We've rejected them in the past, but I'm trying some out in the in-progress overhaul of ConcurrentHashMap.

Aside: I discovered while investigating some of this that explicitly omitting the bit-spreading step for String keys in hash table classes under any of these algorithms works well if you are going to explicitly do a (key instanceof String) check anyway. (This partially accounts for results I noted in previous mail about impact of Murmur3.) While it doesn't hurt to further bit-spread any of these String hash codes, it doesn't help, and since the test is there anyway, it is better to also use it to skip spread step. It may be is worth adding this check even if not otherwise needed because it helps compilers resolve the hashCode call along this path, and Strings are the most common kinds of keys anyway.

-Doug



More information about the core-libs-dev mailing list