[Python-Dev] Unpickling memory usage problem, and a proposed solution
Dan Gindikin dgindikin at gmail.com
Fri Apr 23 23:44:50 CEST 2010
Collin Winter <collinwinter at google.com> writes:
> I don't think it's possible in general to remove any PUTs if the pickle is being written to a file-like object. It is possible to reuse a single Pickler to pickle multiple objects: this causes the Pickler's memo dict to be shared between the objects being pickled. If you pickle foo, bar, and baz, foo may not have any GETs, but bar and baz may have GETs that reference data added to the memo by foo's PUT operations. Because you can't know what will be written to the file-like object later, you can't remove any of the PUT instructions in this scenario.
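To make that scenario concrete, here is a minimal sketch (assuming Python 3's pickle and pickletools; the object names are just for illustration) of a reused Pickler whose second pickle only works because of PUTs emitted during the first dump:

```python
import io
import pickle
import pickletools

shared = ["some shared data"]
foo = {"first": shared}
bar = {"second": shared}    # refers to the very same list as foo

buf = io.BytesIO()
pickler = pickle.Pickler(buf, protocol=2)
pickler.dump(foo)   # PUTs for foo and `shared` go into the shared memo
pickler.dump(bar)   # emits a GET that points back at a PUT from foo's dump

# Disassemble both pickles with a shared memo: the second pickle's GET has
# no PUT of its own, so stripping PUTs from the first pickle would break it.
memo = {}
buf.seek(0)
pickletools.dis(buf, memo=memo)   # foo's pickle
pickletools.dis(buf, memo=memo)   # bar's pickle
```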
Hmm, that is a good point. A possible solution would be for the two-pass optimizer to scan through the entire file, going right through '.' (STOP) opcodes. That would handle the case you are describing, but not the case where the user "maliciously" wrote some other stuff into the file in between pickle dumps, all the while reusing the same pickler.
I think a better solution would be to make sure that the '.' is the last thing in the file and die otherwise. That would at least ensure correctness, and it would detect the cases the optimizer cannot handle.
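Roughly the shape of what I mean, sketched on top of pickletools rather than inside cPickle (strip_unused_puts is a made-up name, and this assumes protocol <= 2 streams with no framing):

```python
import io
import pickletools

def strip_unused_puts(data: bytes) -> bytes:
    """Two-pass PUT stripper for a stream of concatenated pickles
    (hypothetical helper; assumes protocol <= 2, no MEMOIZE/framing)."""
    needed = set()      # memo keys that some GET in the stream actually fetches
    spans = []          # (start, end, opcode_name, arg) for every opcode
    stream = io.BytesIO(data)
    # Pass 1: scan the whole file, going right through '.' (STOP) opcodes.
    # genops() stops at each STOP, so we loop until the stream is exhausted;
    # anything that does not parse as pickle opcodes raises here, which is
    # the "die otherwise" behaviour.
    while stream.tell() < len(data):
        ops = list(pickletools.genops(stream))
        for i, (opcode, arg, start) in enumerate(ops):
            end = ops[i + 1][2] if i + 1 < len(ops) else stream.tell()
            spans.append((start, end, opcode.name, arg))
            if opcode.name in ("GET", "BINGET", "LONG_BINGET"):
                needed.add(arg)
    # Pass 2: copy everything except PUTs whose memo key is never fetched.
    out = bytearray()
    for start, end, name, arg in spans:
        if name in ("PUT", "BINPUT", "LONG_BINPUT") and arg not in needed:
            continue
        out += data[start:end]
    return bytes(out)
```

The point of doing the full scan first is that GETs emitted by a later dump against memo entries created by an earlier dump are all seen before any PUT is dropped, which covers the shared-memo case you describe above; in the worst case it merely keeps a PUT it didn't have to.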
> don't break cvs2svn, it's not fun to fix :). I added some basic tests for this support in cPython's Lib/test/pickletester.py.
Thanks for the warning :)
> There might be room for app-specific optimizations that do this, but I'm not sure it would work for a general-usage cPickle that needs to stay compatible with the current system.
That may well be true. Still, when trying to deal with large data you really need something like this. Our situation was made worse because we had extension types: as they were allocated, they got interspersed with the temporaries generated by the spurious PUTs, and that is what really fragmented the memory. However, it's probably not a stretch to assume that if you are dealing with large data through Python, you are going to have extension types in the mix.