[Python-Dev] Rethinking intern() and its data structure
John Arbash Meinel john.arbash.meinel at gmail.com
Thu Apr 9 21:59:02 CEST 2009
...
> I like your rationale (save memory) much more, and was asking in the
> tracker for specific numbers, which weren't forthcoming.
...
> Now that you brought up specific numbers, I tried to verify them and
> found them correct (although a bit unfortunate); please see my test
> script below. Up to 21800 interned strings, the dict takes (only)
> 384KiB. It then grows, requiring 1536KiB. Whether or not having 22k
> interned strings is "typical", I still don't know.
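For reference, those two figures line up with CPython's dict internals on a 32-bit build: each entry is 12 bytes (hash, key pointer, value pointer), so a 32768-slot table is 384KiB, and once the table passes two-thirds full (~21845 entries) it is quadrupled to 131072 slots, i.e. 1536KiB. A minimal sketch of such a measurement (it needs Python 2.6+ for sys.getsizeof, and is not the original test script):

import sys

d = {}
for i in xrange(25000):
    s = "s%07d" % i  # stand-ins for interned strings
    d[s] = s         # intern() stores each string as both key and value
    if len(d) in (21800, 21900):
        # roughly 32768 slots * 12 bytes = 384KiB before the resize,
        # 131072 slots * 12 bytes = 1536KiB after it (32-bit build)
        print len(d), sys.getsizeof(d)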
Given that every variable name in any file is interned, the dict can grow pretty rapidly. As an extreme case, consider the file "win32/lib/winerror.py", which tracks all possible win32 errors.
>>> import winerror
>>> print len(winerror.__dict__)
1872
So a single error file has 1.9k strings.
My python version (2.5.2) doesn't have 'sys.getsizeof()', but otherwise your code looks correct.
If all I do is find the interned dict, I see:
>>> print len(d)
5037
So stock Python, without importing much extra (just os, sys, gc, etc.), already has almost 5k interned strings.
I don't have a great regex yet for just extracting how many unique strings there are in a given bit of source code.
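As a rough substitute, something tokenize-based would probably come close. Here is a sketch (count_unique_names is a made-up helper, and NAME tokens only approximate what compilation interns, since keywords get counted too):

import token, tokenize

def count_unique_names(path):
    # Collect the distinct identifier tokens in a source file; every
    # NAME that survives compilation ends up interned.
    names = set()
    f = open(path)
    try:
        for tok in tokenize.generate_tokens(f.readline):
            if tok[0] == token.NAME:
                names.add(tok[1])
    finally:
        f.close()
    return len(names)

print count_unique_names("winerror.py")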
However, if I do:
import gc, sys

def find_interned_dict():
    # The interned dict maps each string to itself, and it must contain
    # this function's name (interned at compile time), which rules out
    # other identity-mapping dicts.
    cand = None
    for o in gc.get_objects():
        if not isinstance(o, dict):
            continue
        if "find_interned_dict" not in o:
            continue
        for k, v in o.iteritems():
            if k is not v:
                break
        else:
            assert not cand
            cand = o
    return cand

d = find_interned_dict()
print len(d)
# Just import a few of the core structures
from bzrlib import branch, repository, workingtree, builtins
print len(d)
I start at 5k strings, and after just importing the important bits of bzrlib, I'm at 19,316.
Now, the bzrlib source code isn't particularly huge. It is about 3.7MB / 91k lines of .py files (not counting the test suite).
Memory consumption with just importing bzrlib shows up at 15MB, with 300kB taken up by the intern dict.
If I then import some extra bits of bzrlib, like http support, ftp support, and sftp support (which brings in Python's httplib, and paramiko, an ssh/sftp implementation), I'm up to:
>>> print len(d)
25186
Memory has jumped to 23MB (interned is now 1.57MB), and I haven't actually done anything but import Python code yet. If I sum the size of the PyString objects held in intern(), it amounts to 940KB, though they refer to only 335KB of char data (an average of 13 bytes per string).
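Without sys.getsizeof(), those sums can still be approximated from the string layout. A sketch, where the 24 bytes of per-object overhead (header plus trailing NUL, on a 32-bit build) is my assumption:

# assumes d is the dict returned by find_interned_dict() above
OVERHEAD = 24  # assumed per-string cost: refcount, type, size, hash, state, NUL
char_data = sum(len(s) for s in d)
object_data = sum(OVERHEAD + len(s) for s in d)
print "%dKB of PyString objects over %dKB of char data" % (
    object_data / 1024, char_data / 1024)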
> Wrt. your proposed change, I would be worried about maintainability,
> in particular if it would copy parts of the set implementation.
Right, so for the first part, I would just use a set, as it could then save a third of the memory it uses today (dropping down to 1MB from 1.5MB).
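The saving comes straight from the entry layouts: a dict entry stores (hash, key, value) where a set entry stores only (hash, key), i.e. 12 vs 8 bytes per slot on a 32-bit build. A quick comparison sketch (it needs a sys.getsizeof that accounts for the containers' internal tables; 2.7 does):

import sys

words = ["s%07d" % i for i in xrange(25000)]  # stand-ins for interned strings
as_dict = dict((w, w) for w in words)  # how intern() stores strings today
as_set = set(words)                    # the proposed storage
# expect roughly 1.5MB for the dict vs 1MB for the set on a 32-bit build
print sys.getsizeof(as_dict), sys.getsizeof(as_set)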
I don't have numbers on how much that would improve CPU times; I would imagine improving intern() would impact import times more than run times, simply because import time is when a lot of strings get interned.
Though honestly, Bazaar would really like this, because startup overhead for us is almost 400ms to 'do nothing', which is a lot for a command line app.
John =:->