[Python-Dev] Rethinking intern() and its data structure (original) (raw)

John Arbash Meinel john.arbash.meinel at gmail.com
Thu Apr 9 20:20:11 CEST 2009


Alexander Belopolsky wrote:

On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel <john at arbash-meinel.com> wrote: ...

a) Don't keep a double reference to both key and value to the same object (1 pointer per entry), this could be as simple as using a Set() instead of a dict()

There is a rejected patch implementing just that: http://bugs.python.org/issue1507011 .

Thanks for the heads up.

So reading that thread, the final reason it was rejected was 2 part:

Without reviewing the patch again, I also doubt it is capable of getting rid of the reference count cheating: essentially, this cheating enables the interning dictionary to have weak references to strings, this is important to allow automatic collection of certain interned strings. This feature needs to be preserved, so the cheating in the reference count must continue.

That specific argument was invalid. Because the patch just changed the refcount trickery to use +- 1. And I'm pretty sure Alexander's argument was just that +- 2 was weird, not that the "weakref" behavior was bad.

The other argument against the patch was based on the idea that: The operation "give me the member equal but not identical to E" is conceptually a lookup operation; the mathematical set construct has no such operation, and the Python set models it closely. IOW, set is not a dict with key==value.

I don't know if there was any consensus reached on this, since only Martin responded this way.

I can say that for my "do some work with a medium size code base", the overhead of "interned" as a dictionary was 1.5MB out of 20MB total memory.

Simply changing it to a Set would drop this to 1.0MB. I have no proof about the impact on performance, since I haven't benchmarked it yet.

Changing it to a StringSet could further drop it to 0.5MB. I would guess that any performance impact would depend on whether the total size of 'interned' would fit inside L2 cache or not.

There is a small bug in the original patch adding the string to the set failed. Namely it would return "t == NULL" which would be "t != s" and the intern in place would end up setting your pointer to NULL rather than doing nothing and clearing the error code.

So I guess some of it comes down to whether "loweis" would also reject this change on the basis that mathematically a "set is not a dict". Though given that his claim "nobody else is speaking in favor of the patch", while at least Colin Winter has expressed some interest at this point.

John =:->



More information about the Python-Dev mailing list