[Python-Dev] pickling of large arrays

Ralf W. Grosse-Kunstleve rwgk@yahoo.com
Thu, 20 Feb 2003 04:38:58 -0800 (PST)


This question is related to PEP 307, "Extensions to the pickle protocol", http://www.python.org/peps/pep-0307.html .

Apparently the new pickle "protocol 2" provides a mechanism for avoiding large temporaries, but only for lists and dicts (see the section "Pickling of large lists and dicts" near the end of the PEP). I am wondering whether the new protocol could also help us eliminate large temporaries when pickling Boost.Python extension classes.

We wrote an open-source C++ array library with Boost.Python bindings. For pickling we use the getstate/setstate protocol. As it stands, pickling involves converting the arrays to Python strings, similar to what is done in Numpy.
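For context, here is a minimal sketch of what the Boost.Python hookup looks like. The array type, the converter functions, and the module name are hypothetical stand-ins for our library's types; pickle_suite, def_pickle, getstate and setstate are the actual Boost.Python pickle interface:

    #include <boost/python.hpp>
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <vector>

    namespace bp = boost::python;

    // Hypothetical stand-in for one of our array types.
    struct int_array { std::vector<int> data; };

    // Hypothetical string round trip; the real conversions are the
    // "single buffered" / "double buffered" mechanisms described below.
    std::string array_to_string(int_array const& a)
    {
      std::string s;
      char buf[32];
      for (std::size_t i = 0; i < a.data.size(); i++) {
        std::sprintf(buf, "%d,", a.data[i]); // trailing comma after each element
        s += buf;
      }
      return s;
    }

    int_array array_from_string(std::string const& s)
    {
      int_array a;
      std::size_t pos = 0;
      while (pos < s.size()) { // relies on the trailing comma written above
        a.data.push_back(std::atoi(s.c_str() + pos));
        pos = s.find(',', pos) + 1;
      }
      return a;
    }

    struct int_array_pickle_suite : bp::pickle_suite
    {
      static bp::tuple getstate(int_array const& a)
      {
        return bp::make_tuple(array_to_string(a));
      }
      static void setstate(int_array& a, bp::tuple state)
      {
        a = array_from_string(bp::extract<std::string>(state[0]));
      }
    };

    BOOST_PYTHON_MODULE(array_ext)
    {
      bp::class_<int_array>("int_array")
        .def_pickle(int_array_pickle_suite());
    }

There are two mechanisms for building the string (both are sketched in code after the list):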

  1. "single buffered":

    For numeric types (int, long, double, etc.) a Python string is allocated based on an upper estimate of the required size (PyString_FromStringAndSize). The entire numeric array is converted directly into that string. Finally, the Python string is resized down to the actual length (_PyString_Resize). With this mechanism there are 2 copies of the array in memory:

    • the original array and
    • the Python string.
  2. "double buffered":

    For some user-defined element types it is very difficult to estimate an upper limit on the size of the string representation. Therefore the array is first converted to a dynamically growing C++ std::string, which is then copied into a Python string. With this mechanism there are 3 copies of the array in memory:

    • the original array,
    • the std::string, and
    • the Python string.
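Concretely, the two mechanisms look roughly like the sketch below. This is illustrative only: pickle_numeric, pickle_user_defined, user_defined and the size estimate are hypothetical, but PyString_FromStringAndSize, PyString_AS_STRING and _PyString_Resize are the actual Python/C API calls involved:

    #include <Python.h>
    #include <cstdio>
    #include <string>

    // 1. "single buffered": allocate one Python string from an upper
    // size estimate, convert directly into it, then shrink it in place.
    // Peak memory: the original array plus this one string.
    PyObject* pickle_numeric(double const* data, int n)
    {
      int const max_chars_per_element = 32; // assumed upper estimate
      PyObject* result = PyString_FromStringAndSize(
        0, n * max_chars_per_element); // uninitialized buffer
      if (result == 0) return 0;
      char* begin = PyString_AS_STRING(result);
      char* p = begin;
      for (int i = 0; i < n; i++) {
        p += std::sprintf(p, "%.12g,", data[i]); // convert in place
      }
      // shrink the over-allocated string to the bytes actually written
      if (_PyString_Resize(&result, int(p - begin)) != 0) return 0;
      return result;
    }

    // Hypothetical user-defined element type whose string form has no
    // easy upper size bound.
    struct user_defined
    {
      std::string repr; // stand-in state; real types are more complex
      std::string as_string() const { return repr; }
    };

    // 2. "double buffered": grow a std::string first, then copy it
    // into a Python string afterwards. Peak memory: the original
    // array plus two string copies.
    PyObject* pickle_user_defined(user_defined const* data, int n)
    {
      std::string buffer; // second copy, grows as needed
      for (int i = 0; i < n; i++) {
        buffer += data[i].as_string();
        buffer += ',';
      }
      // third copy: the Python string duplicates the buffer's bytes
      return PyString_FromStringAndSize(buffer.data(), int(buffer.size()));
    }

In principle the std::string stage could be avoided by growing the Python string itself with repeated _PyString_Resize calls, but even then we would be left with the two-copy overhead of the single-buffered case.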

For very large arrays, the memory overhead can be a limiting factor. Could the new protocol 2 help us in some way?

Thank you in advance, Ralf

