[Python-Dev] Rethinking intern() and its data structure

John Arbash Meinel john at arbash-meinel.com
Thu Apr 9 17:02:14 CEST 2009



I've been doing some memory profiling of my application, and I've found some interesting results in how intern() works. I was pretty surprised to see that the "interned" dict was actually consuming a significant amount of total memory. To give specific values: after running "bzr branch A B" on a small project, total memory consumption is ~21MB.

Of that, the largest single object is the 'interned' dict, at 1.57MB, containing 22k strings. One interesting bit: the size of the dict plus all the strings it references is only 2.4MB, so the "interned" dict by itself is two thirds the size of the dict + strings combined.

It also means that the average referenced string is 37.4 bytes. A 'str' has 24 bytes of overhead, so the average string is about 13.5 characters long. In other words, to hold references to 13.5*22k ~ 300kB of character data, we are paying 2.4MB, or about 8:1 overhead.
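
For anyone who wants to check the arithmetic, here it is as runnable Python (the 24-byte header is what a str carries on the 32-bit CPython 2.x build I measured; other builds will differ):

    dict_size = 1.57e6            # the 'interned' dict itself
    total_size = 2.4e6            # dict + the strings it references
    n_strings = 22000
    avg_string = (total_size - dict_size) / n_strings   # ~37 bytes each
    avg_chars = avg_string - 24                         # ~13.5 characters
    char_data = avg_chars * n_strings                   # ~300kB of real text
    print(total_size / char_data)                       # roughly 8:1 overhead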

When I looked at the actual references from interned, I saw mostly variable names. That makes sense, considering that every variable name goes through the Python intern dict. And when you look at the intern function, it doesn't use setdefault logic; it actually does a get() followed by a set(), which means the cost of interning is 1-2 lookups depending on whether the string is already present. (I also saw a whole lot of strings that are error-code names from win32all / winerror.py, and Windows error-code names tend to be longer than the average variable name.)
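
To make the get()-then-set() point concrete, here is a rough pure-Python model of the current behaviour (the names are mine for illustration, not CPython's actual C identifiers):

    interned = {}

    def intern_two_lookups(s):
        t = interned.get(s)    # first hash + probe
        if t is not None:
            return t
        interned[s] = s        # second hash + probe for the insert
        return s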

Anyway, I think the internals of intern() could be done a bit better. Here are some concrete things:

a) Don't keep a double reference (both key and value) to the same object; that saves one pointer per entry. This could be as simple as using a set() instead of a dict().
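
A quick, unscientific way to see the container-level difference is to compare sizes directly (numbers vary a lot by Python version and platform, and a builtin set entry still stores a cached hash next to the pointer, so this only shows the key/value duplication, not (b) below):

    import sys
    words = ['var%d' % i for i in range(22000)]
    as_dict = dict((w, w) for w in words)   # today's layout: key + value
    as_set = set(words)                     # idea (a): one pointer per entry
    print(sys.getsizeof(as_dict))
    print(sys.getsizeof(as_set))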

b) Don't cache the hash value in the set entry, as strings already cache their own hash (saves one long per entry; a quick demonstration follows the note below). This is a big win for space, but it would need to be balanced against lookup and collision-resolution speed.

 My guess is that reducing the size of the set will actually improve speed, because more items can fit in cache. It depends on how many times you need to resolve a collision. If the string hash is sufficiently well distributed and the load factor is reasonable, then when you actually find an item in the set it will likely be the item you want, and you need to bring the string object into cache anyway to do the final string comparison (rather than just a hash comparison).
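
As an aside, it is easy to see from Python that a str memoizes its hash, which is what makes dropping the cached copy plausible (CPython-specific behaviour):

    import time
    s = 'x' * 10**7
    t0 = time.time(); h1 = hash(s)    # first call walks all 10M chars
    t1 = time.time(); h2 = hash(s)    # second call returns the cached value
    t2 = time.time()
    print(t1 - t0, t2 - t1)           # the second delta is near zero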

c) Use the existing lookup function only once (PySet->lookup()). Sets already have a lookup routine that is optimized for strings and returns a pointer to the slot where the object is, or where it would go. That means intern() can do a single lookup, resolving any collisions along the way, and either return the existing object or insert the new one without a second lookup.
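
At the Python level, dict.setdefault() already expresses this intent in one call (whether it is a single probe internally depends on the CPython version); (c) is about guaranteeing the same thing in the C implementation. Reusing the hypothetical model from above:

    def intern_one_lookup(s):
        # one conceptual lookup: insert s if absent, else return the stored one
        return interned.setdefault(s, s)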

d) Having a special-purpose structure would also allow separately tuning things like default size, growth rate, load factor, etc. A lot of this could be tuned with the specific knowledge that there is really only one of these objects, and that it is going to point at a lot of strings that are < 50 bytes long.

 If hashes of variable-name strings are well distributed, we could probably get away with a load factor of 2. If we know we are likely to have lots and lots of entries that never go away (you rarely *unload* modules, and all variable names are in the intern dict), that suggests a large initial size, and probably an aggressive growth factor to avoid spending a lot of time resizing the set. (Some back-of-the-envelope sizing follows.)
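
Here is the kind of sizing arithmetic I mean, for a hypothetical open-addressed table of bare pointers (4 bytes per slot on 32-bit) with power-of-two table sizes:

    def table_bytes(n_entries, max_fill, slot_bytes=4):
        # smallest power-of-two table that keeps the fill under max_fill
        slots = 8
        while float(n_entries) / slots > max_fill:
            slots *= 2
        return slots, slots * slot_bytes

    for fill in (0.5, 2 / 3.0, 0.8):
        print(fill, table_bytes(22000, fill))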

e) How well tuned is String.hash() for the fact that most of these strings are going to be ASCII text? (I know that Python wants to support non-ASCII variable names, but I still think there is going to be an overwhelming bias towards characters in the range 65-122 ('A'-'z').)
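
For reference while thinking about (e), here is a pure-Python transcription of the algorithm CPython 2.x uses to hash a str (simplified: the real C code relies on long wraparound and reserves -1 as an error value):

    def str_hash_model(s):
        if not s:
            return 0
        x = ord(s[0]) << 7
        for ch in s:
            x = (1000003 * x) ^ ord(ch)
        x ^= len(s)
        return x & 0xFFFFFFFF    # emulate 32-bit wraparound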

Also note that the overhead of the "interned" dict gets even worse on 64-bit platforms, where the size of a 'dictentry' doubles but the average length of a variable name doesn't change.
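
The per-entry arithmetic behind that, comparing today's dictentry with a set entry and a bare-pointer table (assuming the cached hash and each pointer are one machine word, as on common 32-bit and LP64 builds):

    for word in (4, 8):                 # 32-bit vs 64-bit
        dictentry = 3 * word            # me_hash, me_key, me_value
        setentry = 2 * word             # hash, key
        bare_ptr = 1 * word             # ideas (a) + (b): pointer only
        print(word * 8, dictentry, setentry, bare_ptr)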

Anyway, I would be happy to implement something along the lines of a "StringSet" or an "InternSet". I just wanted to check whether people would be interested.

John =:->

PS> I'm not yet subscribed to python-dev, so if you could make sure to CC me in replies, I would appreciate it.



