[Python-Dev] More compact dictionaries with faster iteration

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Dec 13 08:26:47 CET 2012


On 12/13/2012 06:11 AM, PJ Eby wrote:
> On Wed, Dec 12, 2012 at 5:06 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>> Perfect hashing already separates "hash table" from "contents" (sort
>> of), and saves the memory in much the same way.
>>
>> Yes, you can repeat the trick and have 2 levels of indirection, but
>> that then requires an additional table of small ints which is pure
>> overhead present for the sorting; in short, it's no longer an
>> optimization but just overhead for the sortability.
>
> I'm confused. I understood your algorithm to require repacking, rather
> than it being a suitable algorithm for incremental change to an existing
> dictionary. ISTM that that would mean you still pay some sort of
> overhead (either in time or space) while the dictionary is still being
> mutated.

As-is, the algorithm just assumes all key-value pairs are available at creation time.

So yes, if you don't reallocate when making the dict perfect then it could make sense to combine it with the scheme discussed in this thread.

If one does leave some free slots open there's some probability of an insertion working without complete repacking, but the probability is smaller than with a normal dict. Hybrid schemes and trade-offs in this direction could be possible.
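For reference, here's a minimal Python sketch of the scheme discussed in this thread as I understand it: a sparse table of small integers indexing into a dense, insertion-ordered entry array. The CompactDict name and the simplifications (no resizing, no deletion, plain linear probing) are mine, just to make the data layout concrete:

    class CompactDict:
        FREE = -1

        def __init__(self, size=8):
            # size must be a power of two for the mask trick below
            self.indices = [self.FREE] * size   # sparse table of small ints
            self.entries = []                   # dense list of (hash, key, value)

        def _probe(self, key):
            # plain linear probing for brevity; a real dict perturbs the probe
            mask = len(self.indices) - 1
            i = hash(key) & mask
            while True:
                yield i
                i = (i + 1) & mask

        def __setitem__(self, key, value):
            # no resizing: assumes fewer insertions than there are slots
            for slot in self._probe(key):
                idx = self.indices[slot]
                if idx == self.FREE:
                    self.indices[slot] = len(self.entries)
                    self.entries.append((hash(key), key, value))
                    return
                if self.entries[idx][1] == key:
                    self.entries[idx] = (hash(key), key, value)
                    return

        def __getitem__(self, key):
            for slot in self._probe(key):
                idx = self.indices[slot]
                if idx == self.FREE:
                    raise KeyError(key)
                if self.entries[idx][1] == key:
                    return self.entries[idx][2]

        def __iter__(self):
            # iteration walks the dense array: cache-friendly, insertion order
            return (k for _, k, _ in self.entries)

    d = CompactDict()
    d["x"] = 1
    d["y"] = 2
    assert list(d) == ["x", "y"]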

> Also, I'm not sure how 2 levels of indirection come into it. The
> algorithm you describe has, as I understand it, only up to 12
> perturbation values ("bins"), so it's a constant space overhead, not a
> variable one. What's more, you can possibly avoid the extra memory
> access by using a different perfect hashing algorithm, at the cost of a
> slower optimization step or using a little more memory.

I said there are k perturbation values; you need an additional array

some_int_t d[k]

where some_int_t is large enough to hold n (the number of entries). Just like what's proposed in this thread.

The paper recommends k > 2*n, but in my experiments I could get away with k = n in 99.9% of the cases (given perfect entropy in the hashes...). So the overhead is roughly the same as what's proposed here.
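For concreteness, lookup with such a displacement array might look roughly like this in Python (the construction of d, which is where the paper's actual work goes, is omitted, and the exact way I derive two values from one hash here is just illustrative, not the paper's functions):

    def ph_lookup(key, d, entries):
        # d: k displacement ints; entries: dense list of n (key, value) pairs,
        # packed so that every stored key lands in its own slot
        h = hash(key)
        k = len(d)
        n = len(entries)
        h1 = h % k                        # selects a displacement value
        slot = ((h >> 16) ^ d[h1]) % n    # one computed slot, no probing
        stored_key, value = entries[slot]
        if stored_key != key:
            raise KeyError(key)           # key is not in the packed table
        return value

So the per-dict cost is the d array itself, an int per bin that must be able to hold n, which is why I say the overhead is roughly the same as the index table proposed here.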

I think the most promising thing would be to always have a single integer table and either use it for indirection (the usual case) or for perfect hash function parameters (say, after a pack() method has been called and before new insertions).

>> Note: I'm NOT suggesting the use of perfect hashing, just making sure
>> its existence is mentioned and that one is aware that if always-ordered
>> dicts become the language standard it precludes this option far off in
>> the future.
>
> Not really. It means that some forms of perfect hashing might require
> adding a few more ints worth of overhead for the dictionaries that use
> it. If it's really a performance benefit for very-frequently-used
> dictionaries, that might still be worthwhile.

As mentioned above, the overhead is larger.

I think the main challenge is to switch to a string hashing scheme with higher entropy, like murmurhash3. Having lots of zero bits in the hashes of short strings will kill the scheme, since it needs several attempts to succeed (the "r" parameter). So string hashing is slowed down a bit (given that string hashes are cached, I don't know how important this is).
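Roughly speaking (my framing, not the paper's exact construction), the packing step derives several trial mappings from each key's hash by slicing different bit ranges out of it, so those bits need to actually vary:

    def trial_slots(h, n, r=4, bits_per_trial=16):
        # derive r candidate slots from one hash by using different bit fields
        mask = (1 << bits_per_trial) - 1
        return [((h >> (i * bits_per_trial)) & mask) % n for i in range(r)]

    print(trial_slots(0x9e3779b97f4a7c15, 11))   # well-mixed hash: [8, 4, 9, 1]
    print(trial_slots(0x0000000000000a1f, 11))   # mostly-zero hash: [6, 0, 0, 0]

When the later trials carry no new information, the extra attempts are wasted, which is why a well-mixed hash like murmurhash3 matters here.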

Ideally one should make sure the hashes are 64-bit on 64-bit platforms too (IIUC long is 32-bit on Windows, but I don't know Windows well).

But the main reason I say I'm not proposing it is that I don't have time to code it up for demonstration, and people like to have something to look at when they get proposals :-)

Dag Sverre


