| msg80670 - (view) |
Author: Jake McGuire (jakemcguire) |
Date: 2009-01-27 21:52 |
| Instance attribute names are normally interned - this is done in PyObject_SetAttr (among other places). Unpickling (in pickle and cPickle) directly updates __dict__ on the instance object. This bypasses the interning so you end up with many copies of the strings representing your attribute names, which wastes a lot of space, both in RAM and in pickles of sequences of objects created from pickles. Note that the native python memcached client uses pickle to serialize objects. >>> import pickle >>> class C(object): ... def __init__(self, x): ... self.long_attribute_name = x ... >>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None), pickle.HIGHEST_PROTOCOL)) for i in range(100)], pickle.HIGHEST_PROTOCOL)) 3658 >>> len(pickle.dumps([C(None) for i in range(100)], pickle.HIGHEST_PROTOCOL)) 1441 >>> Interning the strings on unpickling makes the pickles smaller, and at least for cPickle actually makes unpickling sequences of many objects slightly faster. I have included proposed patches to cPickle.c and pickle.py, and would appreciate any feedback. |
|
|
| msg80678 - (view) |
Author: Gabriel Genellina (ggenellina) |
Date: 2009-01-27 23:30 |
| Either my browser got crazy, or you uploaded the same patch (.py) twice. |
|
|
| msg80688 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) *  |
Date: 2009-01-28 02:54 |
| The patch for cPickle doesn't do the same thing as the pickle one. In the cPickle one, you are only interning slot attributes, which, I believe, is not what you intended. :-) |
|
|
| msg80689 - (view) |
Author: Jake McGuire (jakemcguire) |
Date: 2009-01-28 03:00 |
| Are you sure? I may not have enough context in my diff, which I should have done against an anonymous svn checkout, but it looks like the slot attributes get set several lines after my diff. "while (PyDict_Next(slotstate, ...))" as opposed to the "while (PyDict_Next(state, ...))" in my change... -jake On Jan 27, 2009, at 6:54 PM, Alexandre Vassalotti wrote: > > Alexandre Vassalotti <alexandre@peadrop.com> added the comment: > > The patch for cPickle doesn't do the same thing as the pickle one. In > the cPickle one, you are only interning slot attributes, which, I > believe, is not what you intended. :-) > > ---------- > assignee: -> alexandre.vassalotti > nosy: +alexandre.vassalotti > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue5084> > _______________________________________ |
|
|
| msg80690 - (view) |
Author: Alexandre Vassalotti (alexandre.vassalotti) *  |
Date: 2009-01-28 03:03 |
| Oh, you are right. I was looking at py3k's version of pickle, which uses PyDict_Update instead of a while loop; that is why I got confused. |
|
|
| msg80918 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-01 20:53 |
| Why do you call PyString_AsString() followed by PyString_FromString()? Strings are immutable so you shouldn't neek to take a copy. Besides, it would be nice to add a test. |
|
|
| msg82686 - (view) |
Author: Jake McGuire (jakemcguire) |
Date: 2009-02-24 23:32 |
| The fromstring/asstring dance was due to my incomplete understanding of refcounting. PyDict_Next returns a borrowed reference but PyString_InternInPlace expects an owned reference. Thanks to Kirk McDonald, I have a new patch that does the refcounting correctly. What sort of test did you have in mind? |
|
|
| msg82690 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-25 02:03 |
| The test should check that the unpickled strings have been interned. |
|
|
| msg82691 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-25 02:07 |
| In your patch, I'm not sure where the `name` variable is coming from. Have you checked it works? |
|
|
| msg82692 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-02-25 02:11 |
| To give an example of what the test could check: >>> class C(object): ... def __init__(self): ... self.some_long_attribute_name = 5 ... >>> c = C() >>> c.__dict__ {'some_long_attribute_name': 5} >>> sorted(map(id, c.__dict__)) [140371243499696] >>> import pickle >>> d = pickle.loads(pickle.dumps(c, -1)) >>> d <__main__.C object at 0x7faaba1b0390> >>> d.__dict__ {'some_long_attribute_name': 5} >>> sorted(map(id, d.__dict__)) [140371243501232] The `sorted(map(id, d.__dict__))` should have been the same before and after pickling. |
|
|
| msg82720 - (view) |
Author: Jake McGuire (jakemcguire) |
Date: 2009-02-25 21:17 |
| Ugh. Clearly I didn't check to see if it worked or not, as it didn't even compile. A new diff, tested and verified to work, is attached. I'll work on creating a test. |
|
|
| msg86919 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-05-01 21:54 |
| Jake, are you string working on a test? |
|
|
| msg86920 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-05-01 21:55 |
| (s/string/still/, of course...) |
|
|
| msg86930 - (view) |
Author: Jake McGuire (jakemcguire) |
Date: 2009-05-02 02:38 |
| This fell through the cracks. But the following unit test seems like it does the right thing (fails with the old module, works with the new ones). |
|
|
| msg86939 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-05-02 11:41 |
| Thanks, I'll take a look very soon. |
|
|
| msg86982 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-05-02 21:42 |
| Committed in r72223, r72224. Thanks! |
|
|