[Python-Dev] PEP for new dictionary implementation

"Martin v. Löwis" martin at v.loewis.de
Thu Feb 16 22:34:00 CET 2012


On 16.02.2012 19:24, Jim J. Jewett wrote:

> PEP author Mark Shannon wrote (in
> http://mail.python.org/pipermail/python-dev/attachments/20120208/05be469a/attachment.txt):
>
>> ... allows ... (the __dict__ attribute of an object) to share keys with
>> other attribute dictionaries of instances of the same class.
>
> Is "the same class" a deliberate restriction, or just a convenience of
> implementation?

It's about the implementation: the class keeps a pointer to the key set. A subclass has a separate pointer for that.
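
To make the per-class pointer concrete, here is a toy pure-Python model (an
illustration only, not the PEP's C implementation; all names are made up):
each class owns one key table, every instance stores only a values array, and
a subclass deliberately gets its own table, so there is no sharing across
classes.

    class KeyTable:
        def __init__(self):
            self.slots = {}                 # attribute name -> slot index

        def slot_for(self, name):
            # grow the shared table on first use of a new attribute name
            return self.slots.setdefault(name, len(self.slots))

    class SplitInstance:
        key_table = KeyTable()              # one table, shared by all instances of this class

        def __init__(self):
            self.values = []                # per-instance values, parallel to key_table

        def set(self, name, value):
            idx = type(self).key_table.slot_for(name)
            while len(self.values) <= idx:
                self.values.append(None)
            self.values[idx] = value

        def get(self, name):
            return self.values[type(self).key_table.slots[name]]

    class SplitSubclass(SplitInstance):
        key_table = KeyTable()              # separate pointer: no sharing across classes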

> I have often created subclasses (or even families of subclasses) where
> instances (as opposed to the type) aren't likely to have additional
> attributes.  These would benefit from key-sharing across classes, but I
> grant that it is a minority use case that isn't worth optimizing if it
> complicates the implementation.

In particular, the potential savings are small: the instances of each subclass still share one key set per class. So if you have S subclasses, you could save at most S additional key sets, whereas you are already saving N-S-1 key sets (assuming a total of N objects across all classes).
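
A quick back-of-the-envelope with made-up numbers (N and S here are arbitrary)
shows why the extra sharing buys little:

    N, S = 100000, 5                 # made-up: N instances, S subclasses plus one base class
    classes = S + 1
    saved_already = N - classes      # per-class sharing: one key set per class, not per instance
    extra_possible = classes - 1     # cross-class sharing could merge S+1 key sets into 1
    print(saved_already, extra_possible)   # 99994 key sets saved already, at most 5 more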

> Have you timed not storing the hash (in the dict) at all, at least for
> (unicode) str-only dicts?  Going to the string for its own cached hash
> breaks locality a bit more, but saves 1/3 of the memory for combined
> tables, and may make a big difference for classes that have relatively
> few instances.

I'd be in favor of that, but it is actually an unrelated change: whether or not you share key sets is unrelated to whether or not str-only dicts drop the cached hash. Given a dict, it may be tricky to determine whether or not it is str-only, i.e. what layout to use.
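
For what the hash-dropping idea would look like, here is a rough pure-Python
sketch (assumptions only, not a proposal for the actual C layout): entries
store just (key, value), and every probe calls hash() on the key, relying on
the fact that str objects cache their hash after the first computation.

    class StrOnlyTable:
        def __init__(self, size=8):
            self.table = [None] * size      # open addressing, linear probing; no resizing,
                                            # so the sketch assumes the table never fills up

        def insert(self, key, value):
            assert type(key) is str         # this layout would only be valid for exact-str keys
            i = hash(key) % len(self.table) # hash() hits the str object's cached hash
            while self.table[i] is not None and self.table[i][0] != key:
                i = (i + 1) % len(self.table)
            self.table[i] = (key, value)    # two words per entry instead of three

        def lookup(self, key):
            i = hash(key) % len(self.table)
            while self.table[i] is not None:
                if self.table[i][0] == key:
                    return self.table[i][1]
                i = (i + 1) % len(self.table)
            raise KeyError(key)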

>> Reduction in memory use is directly related to the number of dictionaries
>> with shared keys in existence at any time.  These dictionaries are
>> typically half the size of the current dictionary implementation.
>
> How do you measure that?  The limit for huge N across huge numbers of dicts
> should be 1/3 (because both hashes and keys are shared); I assume that gets
> swamped by object overhead in typical small dicts.

It's more difficult than that. He also drops the smalltable (which I think is a good idea), so accounting for how this all plays together is tricky.
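
To illustrate why the accounting is murky, here is a deliberately crude
estimate (all constants are assumptions: a 64-bit build, 8-byte words,
3-word hash/key/value entries, an 8-slot table, and the fixed dict header
ignored entirely):

    WORD = 8                          # assumed pointer size
    SLOTS = 8                         # assumed table size for a small instance dict

    def combined_bytes(n_dicts):
        # old layout: every dict owns hash/key/value storage for each slot
        return n_dicts * SLOTS * 3 * WORD

    def split_bytes(n_dicts):
        # split layout: hashes and keys live once in the shared key set,
        # each dict keeps only a values array
        return SLOTS * 2 * WORD + n_dicts * SLOTS * WORD

    for n in (1, 10, 1000):
        print(n, split_bytes(n) / combined_bytes(n))   # 1.0, 0.4, then toward 1/3

In this model the ratio only heads toward Jim's 1/3 once many instances share
one key set; a lone dict saves nothing, and the dropped smalltable and the
fixed header shift the numbers again, which is why a single "half the size"
figure is hard to verify.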

>> If a table is split the values in the keys table are ignored, instead the
>> values are held in a separate array.
>
> If they're just dead weight, then why not use them to hold indices into the
> array, so that values arrays only have to be as long as the number of keys,
> rather than rounding them up to a large-enough power-of-two?  (On average,
> this should save half the slots.)

Good idea. However, how do you track per-dict how large the table is?
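
For concreteness, here is how Jim's suggestion might look in the same toy
pure-Python style (again an illustration, not the PEP): the shared key table
hands out compact indices, each dict keeps only a short values array, and it
also has to carry an explicit per-dict count of how many compact slots it has
allocated, which is exactly the bookkeeping the question above is about.

    class SharedKeys:
        def __init__(self):
            self.index_of = {}              # key -> compact index (reusing the otherwise-dead value slot)

        def index(self, key):
            return self.index_of.setdefault(key, len(self.index_of))

    class SplitDict:
        def __init__(self, shared):
            self.shared = shared
            self.values = []                # only as long as the largest index actually used
            self.usable = 0                 # the per-dict size in question

        def __setitem__(self, key, value):
            idx = self.shared.index(key)
            if idx >= self.usable:
                self.values.extend([None] * (idx + 1 - self.usable))
                self.usable = idx + 1
            self.values[idx] = value

        def __getitem__(self, key):
            idx = self.shared.index_of[key]             # raises KeyError for unknown keys
            if idx >= self.usable or self.values[idx] is None:
                raise KeyError(key)                     # None doubles as "empty" in this sketch
            return self.values[idx]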

Regards, Martin


