Serializing Pandas Functions
In recent efforts using Pandas on multiple machines I've found that some of the functions are tricky to serialize. This seems to be because they are generated at runtime rather than defined at module level. Here are a few examples of serialization breaking, occasionally in unpleasant ways:
    In [1]: import pandas as pd

    In [2]: import pickle

    In [3]: pd.read_csv
    Out[3]: <function pandas.io.parsers._make_parser_function.<locals>.parser_f>

    In [4]: pickle.loads(pickle.dumps(pd.read_csv))
    AttributeError: Can't pickle local object '_make_parser_function.<locals>.parser_f'
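The same failure can be reproduced without pandas. The sketch below is only an illustration: `_make_stat_function` and `stat_func` are hypothetical stand-ins named after the identifiers in the error messages, not the actual pandas code. It shows that a function built inside a factory carries a `<locals>` qualified name, which pickle cannot look up again by attribute access.

```python
import pickle


def _make_stat_function(name):
    """Hypothetical factory; a stand-in, not the pandas implementation."""
    def stat_func(values):
        return sum(values)
    return stat_func


generated = _make_stat_function("sum")

# pickle stores a plain function by reference (module plus qualified name)
# and resolves that reference again on load. The qualified name of a nested
# function contains "<locals>", which attribute lookup can never resolve.
print(generated.__qualname__)

try:
    pickle.dumps(generated)
    outcome = "pickled"
except (AttributeError, pickle.PicklingError) as exc:
    # The exact exception type differs between Python versions.
    outcome = f"failed: {type(exc).__name__}"

print(outcome)
```

Note that the `name` argument is never used to label the returned function, so every function produced by this factory reports itself as `stat_func`.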
Lest you think that this is just a problem with pickle (which has many flaws), dill, a much more robust function serialization library, also fails, although this failure occurs on py35 only. (cc @mmckerns)
    In [5]: import dill

    In [6]: dill.loads(dill.dumps(pd.read_csv))
    PicklingError: Can't pickle <function _make_parser_function.<locals>.parser_f at 0x7f71f5ec1158>: it's not found as pandas.io.parsers._make_parser_function.<locals>.parser_f
In this particular case though cloudpickle will work.
Other functions have this problem as well. Consider the series methods:
    In [7]: pickle.loads(pickle.dumps(pd.Series.sum))
    AttributeError: Can't pickle local object '_make_stat_function.<locals>.stat_func'
In this case, concerningly, cloudpickle completes but returns the wrong result:
    In [9]: import cloudpickle

    In [11]: pd.Series.sum
    Out[11]: <function pandas.core.generic._make_stat_function.<locals>.stat_func>

    In [12]: cloudpickle.loads(cloudpickle.dumps(pd.Series.sum))
    Out[12]:
I've been able to fix some of these in cloudpipe/cloudpickle#46, but generally speaking I'm running into a number of problems here. It would be useful if, during the generation of these functions, we could at least pay attention to assigning metadata like __name__ correctly. This one in particular confused me for a while:
    In [15]: pd.Series.cumsum.__name__
    Out[15]: 'sum'
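To make the metadata point concrete, here is a minimal sketch of a factory that stamps the right name on each function it generates. Everything in it (`MiniSeries`, `_make_cum_function`) is a hypothetical toy, not pandas code. With the name set correctly, a `getattr(cls, func.__name__)` lookup, which is the kind of lookup a by-reference serializer relies on, lands on the function it started from.

```python
import itertools
import operator


def _make_cum_function(name, op):
    """Hypothetical factory that labels its output with identifying metadata."""
    def cum_func(self):
        return list(itertools.accumulate(self.values, op))

    # Without these two assignments every generated function would report
    # the name "cum_func" and a "<locals>" qualified name.
    cum_func.__name__ = name
    cum_func.__qualname__ = f"MiniSeries.{name}"
    return cum_func


class MiniSeries:
    """Toy stand-in for a Series; an assumption made for illustration."""

    def __init__(self, values):
        self.values = list(values)

    cumsum = _make_cum_function("cumsum", operator.add)
    cumprod = _make_cum_function("cumprod", operator.mul)


# Looking the function up by its own name now finds the same object.
for func in (MiniSeries.cumsum, MiniSeries.cumprod):
    assert getattr(MiniSeries, func.__name__) is func

print(MiniSeries.cumsum.__name__, MiniSeries.cumprod.__name__)
```

If `cumprod` were instead labeled `"cumsum"`, the same lookup would silently return a different function, which is the kind of wrong result that is hard to notice.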
What would help?
1. Testing that most of the API is serializable
2. Looking at what metadata the serialization libraries use, and making sure that this metadata is enough to properly identify the function. Some relevant snippets from cloudpickle follow:
    def save_instancemethod(self, obj):
        # Memoization rarely is ever useful due to python bounding
        if obj.__self__ is None:
            self.save_reduce(getattr, (obj.im_class, obj.__name__))
        else:
            if PY3:
                self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
            else:
                self.save_reduce(types.MethodType, (obj.__func__, obj.__self__, obj.__self__.__class__), obj=obj)

    def _reduce_method_descriptor(obj):
        return (getattr, (obj.__objclass__, obj.__name__))
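As for the first suggestion, testing that most of the API is serializable, a rough check might look like the sketch below. `find_unpicklable` is a hypothetical helper, not something that exists in pandas or cloudpickle, and it is demonstrated against the standard library `json` module and a toy namespace rather than pandas itself. It attempts a pickle round trip on each public callable and also flags round trips that succeed but hand back a different object.

```python
import json
import pickle
import types


def find_unpicklable(namespace, names=None):
    """Return {name: reason} for callables that fail a pickle round trip."""
    if names is None:
        names = [n for n in dir(namespace) if not n.startswith("_")]
    failures = {}
    for name in names:
        obj = getattr(namespace, name)
        if not callable(obj):
            continue
        try:
            restored = pickle.loads(pickle.dumps(obj))
        except Exception as exc:  # pickle raises several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
            continue
        # Functions and classes are pickled by reference, so a correct
        # round trip should return the very same object.
        by_reference = (types.FunctionType, types.BuiltinFunctionType, type)
        if isinstance(obj, by_reference) and restored is not obj:
            failures[name] = "round trip returned a different object"
    return failures


def _make_stat_function(name):
    """Hypothetical factory producing a closure, as in the examples above."""
    def stat_func(values):
        return sum(values)
    return stat_func


# Toy namespace: one generated closure and one ordinary builtin.
fake_api = types.SimpleNamespace(total=_make_stat_function("total"), length=len)

print(find_unpicklable(json, json.__all__))
print(sorted(find_unpicklable(fake_api)))
```

Swapping `pickle` for another serializer would only require changing the two calls inside the `try` block, so the same loop could be run once per library.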