HashMap - Hybrid approach (original) (raw)

Alex Yakovlev alex14n at gmail.com
Sat Jun 13 23:47:52 UTC 2009


Doug,

On Fri, Jun 12, 2009 at 6:23 PM, Doug Lea<dl at cs.oswego.edu> wrote:

This was fun to figure out. Performance on some of our tests involving get() is sensitive to the serial correlation of slot indices for the keys used across successive calls.

Are there any tests that have get() order the same as put() order? Data in my map is stored almost in insertion order :)

As another quick experiment, I just tried a hacked version of IdentityHashMap carrying hashes array, and did see around a 20% improvement in some tests for get() versus version with index-dependence. So some variant of direct access still seems likely to be worth pursuing.

Sure. But checking stored hashcode always have limited time, and user's equals method is unpredictable and can be slow. We could accumulate nanoseconds statistics for equals calls and choose either to check hashcode first or try direct key.equals, don't know. In my observations it depends on locality of user's data, if equals operates only local properties - direct equals is faster, and even String with it's remove char[] is already slower than checking saved hashcode first. Another way is to check couple of keys contents with reflection, if all contents are just primitive fields - it's save to call direct equals, if not - check hashcode first. But user can access some static classes, have some heave computations, etc.

I'll try to put together a more serious version along the lines of my suggestions, although maybe not for a few days.

Waiting with a lot of interest.

Meanwhile I've tried an open map approach with my smart bits, like marking 'end-of-list', putting first hash bin element always on it's place, etc. In some tests it gives significant performance improvement due to sequental access in all arrays - max 2 cache misses in get. It certainly needs another bit distribution function and load factor management. But on some data it gets piled up very heavilly with awful performance.

Maybe this can solved with some bit flags, not only 'end of list' but also choosing between +1 step and some large step - like those suggested by Ismael (in python). Or even a bit flag to look into some external overflow storage.

Now I've implemented a bit of this approach in my map - when next element in hash table is empty it is used by its neighbours for faster access. It gives some improvement on data with bad hashcodes distribution (when there are many second lookups).

But source code becomes more and more complex and I'm afraid too much conditions and branches could harm performance.

Alex



More information about the core-libs-dev mailing list