HDFStore appending for mixed datatypes, including NumPy arrays · Issue #3032 · pandas-dev/pandas
A pandas DataFrame I have contains some image data, recorded from a camera during a behavioral experiment. A simplified version looks like this:
```python
import numpy as np
from pandas import DataFrame

num_frames = 100
mouse = [{"velocity": np.random.random((1,))[0],
          "image": np.random.random((80, 80)).astype('float32'),
          "spine": np.r_[0:80].astype('float32'),
          #"time": millisec(i*33),
          "mouse_id": "mouse1",
          "special": i} for i in range(num_frames)]
df = DataFrame(mouse)
```
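(For what it's worth, inspecting the dtypes shows where this goes wrong later: the array-valued columns come out as object dtype, with each cell holding a full ndarray — which, as the traceback below shows, HDFStore cannot map to a fixed-size atom. The inspection code here is my own aside, not part of the original script.)

```python
import numpy as np
from pandas import DataFrame

num_frames = 100
mouse = [{"velocity": np.random.random((1,))[0],
          "image": np.random.random((80, 80)).astype('float32'),
          "spine": np.r_[0:80].astype('float32'),
          "mouse_id": "mouse1",
          "special": i} for i in range(num_frames)]
df = DataFrame(mouse)

# The array-valued columns are object dtype: each cell holds an ndarray.
print(df.dtypes)
print(type(df['image'].iloc[0]), df['image'].iloc[0].shape)
```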
I understand I can't query over the `image` or `spine` entries. Of course, I can easily query for low-velocity frames, like this:
```python
low_velocity = df[df['velocity'] < 0.5]
```
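(Pulling the actual image stack for those frames works fine in memory, too — `np.stack` here is my own glue code, not a pandas feature; the seed is only there to make the example deterministic:)

```python
import numpy as np
from pandas import DataFrame

np.random.seed(0)  # deterministic, so the low-velocity subset is non-empty
num_frames = 100
mouse = [{"velocity": np.random.random((1,))[0],
          "image": np.random.random((80, 80)).astype('float32'),
          "special": i} for i in range(num_frames)]
df = DataFrame(mouse)

low_velocity = df[df['velocity'] < 0.5]
# Collect the per-row image arrays back into one (n, 80, 80) stack.
low_images = np.stack(low_velocity['image'].tolist())
print(low_images.shape)
```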
However, there is a lot of this data (several hundred gigabytes), so I'd like to keep it in an HDF5 file, and pull up frames only as needed from disk.
I understand that as of v0.10, "mixed-type" frames can now be appended to an HDFStore. However, I get an error when trying to append this DataFrame:
```python
store = HDFStore("mouse.h5", "w")
store.append("mouse", df)
```
```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-30-8f0da271e75f> in <module>()
      1 store = HDFStore("mouse.h5", "w")
----> 2 store.append("mouse", df)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in append(self, key, value, columns, **kwargs)
    543             raise Exception("columns is not a supported keyword in append, try data_columns")
    544
--> 545         self._write_to_group(key, value, table=True, append=True, **kwargs)
    546
    547     def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, **kwargs):

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
    799             raise ValueError('Compression not supported on non-table')
    800
--> 801         s.write(obj = value, append=append, complib=complib, **kwargs)
    802         if s.is_table and index:
    803             s.create_index(columns = index)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
   2537         # create the axes
   2538         self.create_axes(axes=axes, obj=obj, validate=append,
-> 2539                          min_itemsize=min_itemsize, **kwargs)
   2540
   2541         if not self.is_exists:

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2279                 raise
   2280             except (Exception), detail:
-> 2281                 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
   2282                 j += 1
   2283

Exception: cannot find the correct atom type -> [dtype->object,items->Index([image, mouse_id, spine], dtype=object)] cannot set an array element with a sequence
```
I'm working with a relatively new build of pandas:

```python
>>> pandas.__version__
'0.11.0.dev-95a5326'
>>> import tables
>>> tables.__version__
'2.4.0+1.dev'
```
It would be immensely convenient to have a single repository for all of this data, instead of fragmenting just the queryable parts off to separate nodes.
Is this possible currently with some work-around (maybe with record arrays), and will this be supported officially in the future?
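One workaround I can imagine (a sketch only — the node names and the scalar/array split are my own invention, and it assumes using PyTables directly alongside HDFStore in the same file): keep the scalar, queryable columns in an HDFStore table, store the fixed-shape image stack as a plain array node in the same HDF5 file, and join the two by row index.

```python
import numpy as np
import pandas as pd
import tables

num_frames = 10
images = np.random.random((num_frames, 80, 80)).astype('float32')
scalars = pd.DataFrame({"velocity": np.linspace(0, 1, num_frames),
                        "mouse_id": "mouse1",
                        "special": np.arange(num_frames)})

# Scalars go into a queryable table node...
with pd.HDFStore("mouse.h5", "w") as store:
    store.append("scalars", scalars, data_columns=["velocity"])

# ...and the images into a plain array node in the same file.
with tables.open_file("mouse.h5", "a") as f:
    f.create_array("/", "images", images)

# Query the table on disk, then read back only the matching frames.
with pd.HDFStore("mouse.h5", "r") as store:
    low = store.select("scalars", "velocity < 0.5")
with tables.open_file("mouse.h5", "r") as f:
    low_images = np.stack([f.root.images[i] for i in low.index])
```

This keeps everything in one file, but it only works because the images all share one shape; truly ragged arrays would still need something else (variable-length arrays, or one node per record).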
As a side note, this kind of heterogeneous data ("ragged" arrays) is incredibly widespread in neurobiology and the biological sciences in general. Any extra support along these lines would be very well received.