Author: Alberto Planas Domínguez (Alberto.Planas.Domínguez)
Date: 2010-05-17 10:45
Sometimes, when I use cPickle to serialize tuples of strings, I get different dumps() results for the same tuple:

    import cPickle
    t = ('', 'JOHN')
    s1 = cPickle.dumps(t)
    s2 = cPickle.dumps(cPickle.loads(cPickle.dumps(t)))
    assert s1 == s2  # AssertionError

With cPickle, it doesn't matter which protocol I use for dumps(). The assertion passes if I use the pickle module instead of cPickle. This means that I can't use a serialized object as a key in a map/dict object.
I don't think you can expect serialized results to always be equal. It can depend on specifics of the internal algorithm, such as optimizations or dict iteration order.
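To illustrate the point, here is a minimal Python 3 sketch (the thread itself uses 2.x cPickle). It assumes two different protocols as a stand-in for any encoder-level variation, and the `cache` dict is a hypothetical name used only for illustration. The two byte strings differ, yet they load back to equal values, and the tuple itself is hashable, so it can be the dict key directly.

```python
import pickle

t = ('', 'JOHN')

# Two valid pickles of the same tuple. Different protocols are used here
# only as a stand-in for any variation in how the encoder writes the data.
s1 = pickle.dumps(t, 0)
s2 = pickle.dumps(t, 2)

print(s1 == s2)                              # byte-level comparison
print(pickle.loads(s1) == pickle.loads(s2))  # value-level comparison

# A tuple of strings is hashable, so it can serve as the dict key itself.
# `cache` is a hypothetical name for illustration.
cache = {t: "some value"}
print(cache[pickle.loads(s2)])
```

Comparing or keying on the loaded value avoids depending on the exact bytes that a particular pickler happens to emit.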
There seems to be a bug somewhere in 2.x cPickle. Here is a somewhat simpler way to demonstrate the bug. The following code

    from pickletools import dis
    import cPickle
    t = 1L,  # use long for easy 3.x comparison
    s1 = cPickle.dumps(t)
    s2 = cPickle.dumps(cPickle.loads(s1))
    print(s1 == s2)
    dis(s1)
    dis(s2)

prints

    False
        0: (    MARK
        1: L        LONG       1L
        5: t        TUPLE      (MARK at 0)
        6: p    PUT        1
        9: .    STOP
    highest protocol among opcodes = 0
        0: (    MARK
        1: L        LONG       1L
        5: t        TUPLE      (MARK at 0)
        6: .    STOP
    highest protocol among opcodes = 0

The difference is probably immaterial because nothing in the pickle uses the tuple again and PUT is redundant, but the difference does not show up when the pure-Python pickle module is used instead of cPickle, and it is not present in py3k.

The comparable py3k code:

    from pickletools import dis
    import pickle
    t = 1,
    s1 = pickle.dumps(t, 0)
    s2 = pickle.dumps(pickle.loads(s1), 0)
    print(s1 == s2)
    dis(s1)
    dis(s2)

produces

    True
        0: (    MARK
        1: L        LONG       1
        5: t        TUPLE      (MARK at 0)
        6: p    PUT        0
        9: .    STOP
    highest protocol among opcodes = 0
        0: (    MARK
        1: L        LONG       1
        5: t        TUPLE      (MARK at 0)
        6: p    PUT        0
        9: .    STOP
    highest protocol among opcodes = 0

Most likely the bug is benign and not worth fixing, but I would like to figure out what's going on and what changed in 3.x.
OK, the 2.7 behavior is explainable and correct. cPickle checks the reference count and does not generate PUT for objects that don't have references:

    >>> from pickletools import dis
    >>> from cPickle import dumps
    >>> dis(dumps(tuple([1])))
        0: (    MARK
        1: I        INT        1
        4: t        TUPLE      (MARK at 0)
        5: .    STOP
    highest protocol among opcodes = 0
    >>> t = 1,
    >>> dis(dumps(t))
        0: (    MARK
        1: I        INT        1
        4: t        TUPLE      (MARK at 0)
        5: p    PUT        1
        8: .    STOP
    highest protocol among opcodes = 0

This optimization is not available from pure Python, of course, so pickle.py behaves differently. The remaining question is why this optimization was removed from 3.x.
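The difference between the two pickles can also be checked programmatically rather than by reading dis() output. The following is a minimal Python 3 sketch, assuming the two 2.7 byte strings correspond to the disassemblies quoted above; `count_memo_writes` and `MEMO_WRITE_OPS` are hypothetical names introduced for illustration. It uses pickletools.genops to count opcodes that write to the memo.

```python
import pickletools

# Opcode names that store an object in the unpickler's memo.
MEMO_WRITE_OPS = {"PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"}

def count_memo_writes(data):
    """Return how many memo-writing opcodes appear in the pickle `data`."""
    return sum(
        1
        for opcode, arg, pos in pickletools.genops(data)
        if opcode.name in MEMO_WRITE_OPS
    )

# Protocol 0 pickles matching the two disassemblies quoted from 2.7
# (assumed byte-for-byte reconstruction of those dumps).
without_put = b"(I1\nt."      # dumps(tuple([1])): no other references
with_put = b"(I1\ntp1\n."     # dumps(t): the tuple is bound to a name

print(count_memo_writes(without_put))
print(count_memo_writes(with_put))
```

A check like this makes it clear that the two strings differ only in bookkeeping opcodes, not in the data they describe.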
I am speculating here; Alexandre probably knows the answer. The optimization that skips PUT on unreferenced objects was probably removed because removing it makes the _pickle module behave more like pickle, and because pickletools now has an optimize() function, which provides a more thorough removal of unused PUT opcodes.

Closing as "invalid".
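As a rough illustration of what pickletools.optimize() does, the Python 3 sketch below applies it to the two protocol 0 pickles discussed earlier in the thread (the byte strings are an assumed reconstruction of the quoted 2.7 dumps). It shows only that the unused PUT is dropped for this small case; it is not a claim that optimize() yields a canonical form for arbitrary objects.

```python
import pickle
import pickletools

# Assumed reconstruction of the two 2.7 pickles of the tuple (1,).
with_put = b"(I1\ntp1\n."     # contains a PUT that no GET ever uses
without_put = b"(I1\nt."      # same data, no PUT

# optimize() rewrites a pickle, dropping PUT opcodes that are never read.
slim_a = pickletools.optimize(with_put)
slim_b = pickletools.optimize(without_put)

print(len(with_put), len(slim_a))  # the optimized pickle is shorter
print(slim_a == slim_b)            # compare the two optimized pickles
print(pickle.loads(slim_a))        # the data is unchanged
```

For this case the two optimized strings can be compared directly, but comparing loaded values remains the safer way to test equality of pickled data.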
History
Date | User | Action | Args
2022-04-11 14:57:01 | admin | set | github: 52984
2010-07-15 05:50:20 | belopolsky | set | status: open -> closed; versions: + Python 2.7, - Python 3.2; messages: + ; resolution: not a bug; stage: resolved