[Python-Dev] undesireable unpickle behavior, proposed fix (original) (raw)

Jake McGuire jake at youtube.com
Tue Jan 27 10:49:29 CET 2009


Instance attribute names are normally interned - this is done in
PyObject_SetAttr (among other places). Unpickling (in pickle and
cPickle) directly updates dict on the instance object. This
bypasses the interning so you end up with many copies of the strings
representing your attribute names, which wastes a lot of space, both
in RAM and in pickles of sequences of objects created from pickles.
Note that the native python memcached client uses pickle to serialize
objects.

import pickle class C(object): ... def init(self, x): ... self.long_attribute_name = x ... len(pickle.dumps([pickle.loads(pickle.dumps(C(None),
pickle.HIGHEST_PROTOCOL)) for i in range(100)],
pickle.HIGHEST_PROTOCOL)) 3658 len(pickle.dumps([C(None) for i in range(100)],
pickle.HIGHEST_PROTOCOL)) 1441

Interning the strings on unpickling makes the pickles smaller, and at
least for cPickle actually makes unpickling sequences of many objects
slightly faster. I have included proposed patches to cPickle.c and
pickle.py, and would appreciate any feedback.

dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Modules/ cPickle.c cPickle.c --- Downloads/Python-2.4.3/Modules/cPickle.c 2004-07-26
22:22:33.000000000 -0700 +++ cPickle.c 2009-01-26 23:30:31.000000000 -0800 @@ -4258,6 +4258,8 @@ PyObject *state, *inst, *slotstate; PyObject *setstate; PyObject *d_key, *d_value; + PyObject *name; + char * key_str; int i; int res = -1;

@@ -4319,8 +4321,24 @@

      i = 0;
      while (PyDict_Next(state, &i, &d_key, &d_value)) {

dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/ pickle.py pickle.py --- Downloads/Python-2.4.3/Lib/pickle.py 2009-01-27 01:41:43.000000000
-0800 +++ pickle.py 2009-01-27 01:41:31.000000000 -0800 @@ -1241,7 +1241,15 @@ state, slotstate = state if state: try:



More information about the Python-Dev mailing list