HDFStore appending for mixed datatypes, including NumPy arrays · Issue #3032 · pandas-dev/pandas

A pandas DataFrame of mine contains image data, recorded from a camera during a behavioral experiment. A simplified version looks like this:

import numpy as np
from pandas import DataFrame

num_frames = 100
mouse = [{"velocity": np.random.random((1,))[0],
          "image": np.random.random((80, 80)).astype("float32"),
          "spine": np.r_[0:80].astype("float32"),
          # "time": millisec(i * 33),
          "mouse_id": "mouse1",
          "special": i} for i in range(num_frames)]
df = DataFrame(mouse)

I understand I can't query over the image or spine entries. Of course, I can easily query for low-velocity frames, like this:

low_velocity = df[df['velocity'] < 0.5]

However, there is a lot of this data (several hundred gigabytes), so I'd like to keep it in an HDF5 file and pull frames from disk only as needed.
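For reference, the disk-backed workflow I'm after does work today if I drop the array columns first. A minimal sketch (file and node names are made up; velocity is marked as a data column so it's queryable on disk, using the Term syntax from this pandas version):

from pandas import HDFStore
from pandas.io.pytables import Term

# keep only the scalar columns, which HDFStore can append
scalars = df[["velocity", "mouse_id", "special"]]
store = HDFStore("mouse_scalars.h5", "w")
store.append("mouse", scalars, data_columns=["velocity"])

# pull only the low-velocity rows from disk
low_velocity = store.select("mouse", [Term("velocity", "<", 0.5)])
store.close()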

In v0.10, I understand that "mixed-type" frames can now be appended to the HDFStore. However, I get an error when I try to append the full DataFrame, arrays included:

store = HDFStore("mouse.h5", "w")
store.append("mouse", df)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-30-8f0da271e75f> in <module>()
      1 store = HDFStore("mouse.h5", "w")
----> 2 store.append("mouse", df)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in append(self, key, value, columns, **kwargs)
    543             raise Exception("columns is not a supported keyword in append, try data_columns")
    544 
--> 545         self._write_to_group(key, value, table=True, append=True, **kwargs)
    546 
    547     def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, **kwargs):

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
    799             raise ValueError('Compression not supported on non-table')
    800 
--> 801         s.write(obj = value, append=append, complib=complib, **kwargs)
    802         if s.is_table and index:
    803             s.create_index(columns = index)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
   2537         # create the axes
   2538         self.create_axes(axes=axes, obj=obj, validate=append,
-> 2539                          min_itemsize=min_itemsize, **kwargs)
   2540 
   2541         if not self.is_exists:

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2279                 raise
   2280             except (Exception), detail:
-> 2281                 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
   2282             j += 1
   2283 

Exception: cannot find the correct atom type -> [dtype->object,items->Index([image, mouse_id, spine], dtype=object)] cannot set an array element with a sequence

From the exception message, it looks like the object-dtype block holding image, spine, and mouse_id can't be mapped to a single PyTables atom, since each entry is itself an array.

I'm working with a recent development build of pandas:

pandas.__version__
'0.11.0.dev-95a5326'

import tables
tables.__version__
'2.4.0+1.dev'

It would be immensely convenient to have a single repository for all of this data, instead of splitting just the queryable parts off into separate nodes.
Is this currently possible with some workaround (maybe with record arrays), and will it be supported officially in the future?
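To make the question concrete, the sort of workaround I'm imagining would park the fixed-shape arrays in a separate PyTables node, aligned with the scalar table by row number. A rough sketch, assuming every image is 80x80 float32 (file and node names are again made up; this uses the PyTables 2.4 camelCase API):

import numpy as np
import tables

h5 = tables.openFile("mouse_arrays.h5", "w")
# extendable array, growing along axis 0 (one slot per frame)
images = h5.createEArray(h5.root, "images", tables.Float32Atom(),
                         shape=(0, 80, 80))
for frame in mouse:
    images.append(frame["image"][np.newaxis])  # row i <-> df row i
h5.close()

# later: find frame indices via the scalar table, then read images
h5 = tables.openFile("mouse_arrays.h5", "r")
imgs = np.array([h5.root.images[i] for i in low_velocity.index])
h5.close()

That keeps everything in HDF5, but it is exactly the fragmentation across nodes I was hoping to avoid.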

As a side note, this kind of heterogeneous data ("ragged" arrays) is incredibly widespread in neurobiology and the biological sciences in general. Any extra support along these lines would be very well received.