msg148627
Author: Irmen de Jong (irmen)
Date: 2011-11-30 00:09
Pickling of bytearrays is quite inefficient right now, because bytearray's __reduce__ encodes the bytes of the bytearray into a str. A pickle made from a bytearray is quite a bit larger than necessary because of this, and it also takes a lot more processing to create it and to convert it back into the actual bytearray when unpickling (because it uses bytearray's str-based initializer with an encoding).

I've attached a patch (both for the default 3.x branch and the 2.7 branch) that changes this to use the bytes type instead. A pickle made from a bytearray with this patch applied now uses the BINBYTES/BINSTRING pickle opcodes, which are a lot more efficient than the BINUNICODE opcode used now. The reconstruction of the bytearray now uses bytearray's initializer that takes a bytes object.

I don't think additional unit tests are needed because test_bytes already performs pickle tests on bytearrays. A bit more info can be found in my recent post on comp.lang.python about this, see http://mail.python.org/pipermail/python-list/2011-November/1283668.html

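The BINBYTES-versus-BINUNICODE difference can be made visible on a current Python 3, where (as the later patch in this thread arranges) protocol 3 uses the compact bytes format and protocol 2 keeps the old text-based format. A small sketch, not part of the original report:

```python
import pickle
import pickletools

b = bytearray(b'\xc8\xc9\xca')

# Protocol 3 stores the payload as a compact binary string, so the raw
# bytes appear verbatim in the pickle.
p3 = pickle.dumps(b, 3)
assert b'\xc8\xc9\xca' in p3

# Protocol 2 stores the payload as text; each byte >= 0x80 becomes a
# two-byte UTF-8 sequence inside the pickled string.
p2 = pickle.dumps(b, 2)
assert b'\xc8\xc9\xca' not in p2
assert b'\xc3\x88\xc3\x89\xc3\x8a' in p2  # UTF-8 for U+00C8..U+00CA

# Print the opcode stream of the protocol-2 pickle for inspection.
pickletools.dis(p2)
```

pickletools.dis() prints one opcode per line, so the difference in how the payload is encoded at each protocol is immediately visible.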
msg148628
Author: Irmen de Jong (irmen)
Date: 2011-11-30 00:15
BTW, I'm aware of PEP 3154, but I don't think this particular patch requires a pickle protocol bump.

msg148633
Author: Antoine Pitrou (pitrou) *
Date: 2011-11-30 02:07
Since it's a performance improvement, it's Python 3.3-only. The patch looks OK, but there's an unrelated issue. If I first pickle a bytearray under 3.3 using protocol 2:

>>> b = bytearray(b'xyz')
>>> pickle.dumps(b, protocol=2)
b'\x80\x02c__builtin__\nbytearray\nq\x00c__builtin__\nbytes\nq\x01]q\x02(KxKyKze\x85q\x03Rq\x04\x85q\x05Rq\x06.'

and then unpickle it under 2.7, I get:

>>> pickle.loads(b'\x80\x02c__builtin__\nbytearray\nq\x00c__builtin__\nbytes\nq\x01]q\x02(KxKyKze\x85q\x03Rq\x04\x85q\x05Rq\x06.')
bytearray(b'[120, 121, 122]')

... which is wrong. It seems pickling bytes objects with protocol 2 and unpickling them with Python 2.x is broken.

msg148634
Author: Antoine Pitrou (pitrou) *
Date: 2011-11-30 02:14
The irony is that with protocol < 3, bytes should piggy-back on bytearray and not the reverse (since the bytearray constructor has the same semantics under 2.x and 3.x). It also means that the latin-1 encoding solution should probably be kept for protocol < 3. Using __reduce_ex__ should allow for such a combination, AFAIK. I think a separate issue should be opened for bytes.

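The protocol-dependent reduction Antoine describes can be sketched on a user-defined type. `Blob` below is a hypothetical stand-in, not the actual bytearray patch:

```python
import pickle

class Blob:
    """Hypothetical container showing a protocol-aware __reduce_ex__."""

    def __init__(self, payload):
        # Accept either text (old-format pickles) or bytes.
        if isinstance(payload, str):
            payload = payload.encode('latin-1')
        self.payload = bytearray(payload)

    def __reduce_ex__(self, protocol):
        if protocol >= 3:
            # Protocol 3+ has a native bytes opcode: store the payload as-is.
            return (Blob, (bytes(self.payload),))
        # Older protocols: fall back to latin-1 text, which Python 2 and 3
        # unpickle with identical semantics.
        return (Blob, (self.payload.decode('latin-1'),))

    def __eq__(self, other):
        return isinstance(other, Blob) and self.payload == other.payload

blob = Blob(b'\xc8\xc9\xca')
for proto in (0, 1, 2, 3):
    assert pickle.loads(pickle.dumps(blob, proto)) == blob
```

This is the same shape the committed fix ended up with: the compact bytes form from protocol 3 on, and the 2.x-compatible text form below that.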
msg148854
Author: Irmen de Jong (irmen)
Date: 2011-12-04 19:14
Added a new patch that only does the new reduction when the protocol is 3 or higher.

msg148885
Author: Roundup Robot (python-dev)
Date: 2011-12-05 19:45
New changeset e2959a6a1440 by Antoine Pitrou in branch 'default':
Issue #13503: Use a more efficient reduction format for bytearrays with
http://hg.python.org/cpython/rev/e2959a6a1440

msg148886
Author: Antoine Pitrou (pitrou) *
Date: 2011-12-05 19:46
The patch wasn't entirely PEP 7-compliant (spaces around operators, space after if), but I've fixed that when committing. Thank you!

msg148902
Author: Alexandre Vassalotti (alexandre.vassalotti) *
Date: 2011-12-06 02:17
Antoine, do we have unit tests for this code path?

msg149411
Author: Irmen de Jong (irmen)
Date: 2011-12-13 22:04
Alexandre: the existing test_bytes already performs bytearray pickle tests.

msg188480
Author: Raymond Hettinger (rhettinger) *
Date: 2013-05-06 01:15
Reopening this one because there is a size issue, not just speed. My clients are bumping into this issue repeatedly. There is a reasonable expectation that pickling a bytearray will result in a pickle about the same size as the bytearray (not a 50% to 100% expansion depending on the content). Likewise, the size shouldn't double when switching from protocol 0 to the presumably more efficient protocol 2:

>>> # Example using Python 2.7.4 on Mac OS X 10.8
>>> from pickle import dumps
>>> print len(dumps(bytearray([200] * 10000), 0))
10055
>>> print len(dumps(bytearray([200] * 10000), 2))
20052
>>> print len(dumps(bytearray([100] * 10000), 2))
10052
>>> print len(dumps(bytearray([100, 200] * 5000), 2))
15052

An attractive feature of bytearrays is their compact representation of data. An attractive feature of the binary pickle protocol is improved compactness and speed. Currently, it isn't living up to expectations.

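For comparison, a sketch of the same kind of measurement on Python 3, where the fix from this issue applies from protocol 3 on (exact sizes vary slightly across versions):

```python
import pickle

data = bytearray([200] * 10000)

# Protocol 2 keeps the old text-based format: every 0xC8 byte becomes a
# two-byte UTF-8 sequence inside a BINUNICODE opcode, roughly doubling
# the size for high-byte payloads.
size_p2 = len(pickle.dumps(data, 2))

# Protocol 3 stores the payload 1:1 as bytes (BINBYTES) plus a few
# dozen bytes of framing.
size_p3 = len(pickle.dumps(data, 3))

print(size_p2, size_p3)
```

So the doubling Raymond measures under 2.7 persists on Python 3 for protocol 2 pickles; only protocol 3 and later get the compact form.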
msg188491
Author: Antoine Pitrou (pitrou) *
Date: 2013-05-06 06:25
Raymond, we can't just backport this without breaking compatibility with installed Python versions.

msg188494
Author: Alexandre Vassalotti (alexandre.vassalotti) *
Date: 2013-05-06 07:26
Our hands are pretty much tied here. The pickle-bytearray-as-unicode hack is likely the best we can do without breaking pickling compatibility between Python 2 and 3. I can't think of a solution that could work here. For example:

1. Pickling bytearrays as a Python 2 str doesn't work because Python 2 strs are unpickled as unicode in Python 3.
2. Pickling bytearrays as lists of ints makes the growth factor much worse: 2x instead of the expected 1.5x.
3. Using a custom constructor breaks pickling compatibility with all the minor releases that don't implement the custom constructor.
4. Implementing special support in pickle for bytearrays requires a pickle protocol bump, which is disallowed for bugfix releases.
5. Creating a special tag type to pickle Python 2 strs as bytes in Python 3 has the same problem as #3.

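The growth factors in points 1 and 2 can be confirmed directly (a sketch; an all-0xC8 payload is the worst case for the text encoding):

```python
import pickle

payload = bytes([200] * 10000)

# The "bytearray as unicode" hack: latin-1 text stored as UTF-8, so
# every byte >= 0x80 costs two bytes (2x worst case, roughly 1.5x on
# average for random byte values).
as_unicode = len(payload.decode('latin-1').encode('utf-8'))

# Option 2, a list of ints: protocol 2 spends a BININT1 opcode byte
# plus one payload byte per element, i.e. 2x regardless of content.
as_int_list = len(pickle.dumps(list(payload), 2))

print(as_unicode, as_int_list)
```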
msg188590
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2013-05-06 20:39
How about base64?

self.save_reduce(base64.b64decode,
                 (str(base64.b64encode(obj), 'ascii'),), obj=obj)

>>> len(dumps(bytes([200] * 10000), 2))
13372

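The 13372 figure is base64's fixed 4/3 expansion plus a few dozen bytes of pickle framing; a quick check of the arithmetic:

```python
import base64

payload = bytes([200] * 10000)
encoded = base64.b64encode(payload)

# base64 emits 4 output characters per 3 input bytes, padded to a
# multiple of 4: 4 * ceil(10000 / 3) = 13336.
assert len(encoded) == 4 * ((len(payload) + 2) // 3) == 13336

# The decoder recovers the original bytes exactly.
assert base64.b64decode(encoded) == payload
```

The remaining 36 bytes in Serhiy's measurement are the pickled reference to base64.b64decode plus the usual protocol framing.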
msg188591
Author: Antoine Pitrou (pitrou) *
Date: 2013-05-06 20:45
Serhiy Storchaka added the comment:
> How about base64?
>
> self.save_reduce(base64.b64decode,
>                  (str(base64.b64encode(obj), 'ascii'),), obj=obj)
>
> >>> len(dumps(bytes([200] * 10000), 2))
> 13372

It's statistically better (a fixed 1.33x expansion instead of a 1.5x average), but ASCII bytearrays will pickle less efficiently than they currently do. There's something else: we had enough regressions in the latest 2.7.x release, and we shouldn't risk adding new ones in the next release. Really, let's close this issue.

msg188615
Author: Raymond Hettinger (rhettinger) *
Date: 2013-05-07 01:25
It looks like this is an unfortunate dead end.
