HDF5: empty groups and keys · Issue #29916 · pandas-dev/pandas

Hi,

With some of my HDF5 files, pandas.HDFStore.groups() returns an empty list, as does .keys(), which iterates over the groups. However, the data are still accessible via .get() or .get_node().

This is related to #21543 and #21372, where the .groups() logic was changed to use self._handle.walk_groups() instead of self._handle.walk_nodes(). The relevant line is now:

for g in self._handle.walk_groups()
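To illustrate why the choice of walker might matter, here is a toy model in plain Python. It is not PyTables or pandas code: the Group and Leaf classes and both walker functions are made up for illustration. It only assumes that a group-only walk never yields leaf nodes such as tables, so any filter applied to its results cannot find them.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Leaf:
    """Toy stand-in for a dataset (e.g. a table) stored in the file."""
    path: str


@dataclass
class Group:
    """Toy stand-in for a group, which may hold leaves and sub-groups."""
    path: str
    children: List[Union["Group", Leaf]] = field(default_factory=list)


def walk_groups(root: Group) -> Iterator[Group]:
    """Yield the root and every nested group, but never a leaf."""
    yield root
    for child in root.children:
        if isinstance(child, Group):
            yield from walk_groups(child)


def walk_nodes(root: Group) -> Iterator[Union[Group, Leaf]]:
    """Yield every node in the tree: groups and leaves alike."""
    yield root
    for child in root.children:
        if isinstance(child, Group):
            yield from walk_nodes(child)
        else:
            yield child


# Hypothetical layout loosely modelled on the paths in this report.
tree = Group("/", [
    Group("/Data", [Leaf("/Data/Table Layout")]),
    Group("/Metadata", [Leaf("/Metadata/Data Parameters")]),
])

# The same "is this a leaf?" filter, applied to both walkers.
leaves_via_groups = [n.path for n in walk_groups(tree) if isinstance(n, Leaf)]
leaves_via_nodes = [n.path for n in walk_nodes(tree) if isinstance(n, Leaf)]

print(leaves_via_groups)  # []
print(leaves_via_nodes)   # ['/Data/Table Layout', '/Metadata/Data Parameters']
```

Whether this is exactly what happens inside pandas for these files is an assumption on my part, but it matches the symptom: the filter itself can be fine and still return nothing if the walker never hands it a table.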

Current Output

Both hdf.groups() and hdf.keys() return an empty list.

Expected Output

List of groups and keys as visible with e.g. h5dump.
Note: Changing the aforementioned line back to use .walk_nodes() fixes the issue and lists the groups and keys properly:

hdf.groups()
[/Data/Table Layout (Table(69462,), zlib(4)) ''
  description := {
  ...
/Data/Array Layout/2D Parameters/Data Parameters (Table(15,)) ''
  description := {
  "mnemonic": StringCol(itemsize=8, shape=(), dflt=b'', pos=0),
  "description": StringCol(itemsize=48, shape=(), dflt=b'', pos=1),
  "isError": Int64Col(shape=(), dflt=0, pos=2),
  "units": StringCol(itemsize=7, shape=(), dflt=b'', pos=3),
  "category": StringCol(itemsize=31, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (642,)]]

hdf.keys()
['/Data/Table Layout',
 '/Metadata/Data Parameters',
 '/Metadata/Experiment Notes',
 '/Metadata/Experiment Parameters',
 '/Metadata/Independent Spatial Parameters',
 '/Metadata/_record_layout',
 '/Data/Array Layout/Layout Description',
 '/Data/Array Layout/1D Parameters/Data Parameters',
 '/Data/Array Layout/2D Parameters/Data Parameters']

Fix

One solution would be (I guess) to revert #21543. Another would be to change at least .keys() to use ._handle.walk_nodes() instead of .groups() in:

return [n._v_pathname for n in self.groups()]
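Until this is resolved, a possible user-side workaround is to walk the nodes of the underlying PyTables file directly. The sketch below is not the pandas implementation; list_table_keys is a hypothetical helper name, and it assumes the handle is an open tables.File (for instance the private _handle attribute of an open HDFStore, which may change between pandas versions).

```python
def list_table_keys(handle):
    """Return the path of every Table leaf found below the root.

    `handle` is assumed to be an open PyTables `tables.File`, for example
    the private `_handle` attribute of an open `pandas.HDFStore`.
    """
    # File.walk_nodes() visits groups and leaves; classname="Table"
    # restricts the walk to Table nodes.
    return [node._v_pathname for node in handle.walk_nodes("/", classname="Table")]


# Hypothetical usage ("example.h5" is a placeholder file name):
#
#   import pandas as pd
#   with pd.HDFStore("example.h5", mode="r") as store:
#       print(list_table_keys(store._handle))
```

Note that for stores written by pandas itself in table format, this would presumably list the inner table nodes rather than the pandas keys, so it is only meant for files like the ones described here.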

It could also be a bug in PyTables.

Problem background

I was trying to figure out why some hdf5 files open fine with pandas but fail with dask.
The reason is that dask allows wildcards and iterates over the keys to find valid ones. If .keys() is empty, reading the files with dask fails.
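A minimal sketch of that failure mode, using only the standard library. This is not dask's actual code; match_keys is a hypothetical helper that assumes wildcard patterns are resolved by shell-style matching against whatever .keys() reports.

```python
from fnmatch import fnmatchcase


def match_keys(available_keys, pattern):
    """Return the keys that match a shell-style wildcard pattern."""
    return [key for key in available_keys if fnmatchcase(key, pattern)]


# What .keys() reports for the affected files versus what is in the file
# (two of the paths from this report, used as sample data).
keys_reported = []
keys_expected = ["/Data/Table Layout", "/Metadata/Data Parameters"]

print(match_keys(keys_reported, "/Data/*"))  # []
print(match_keys(keys_expected, "/Data/*"))  # ['/Data/Table Layout']
```

With an empty key list, no pattern can match, not even the exact path of a table that .get() can read.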

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.27.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.1.post20191125
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : 3.6.1
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None