HDF5: empty groups and keys · Issue #29916 · pandas-dev/pandas
Hi,
With some of the hdf5 files I have, pandas.HDFStore.groups() returns an empty list (as does .keys(), which iterates over the groups). However, the data are accessible via .get() or .get_node().
This is related to #21543 and #21372, where the .groups() logic was changed, in particular to use self._handle.walk_groups() instead of self._handle.walk_nodes(). The relevant line is now:
for g in self._handle.walk_groups()
Current Output
Expected Output
List of groups and keys, as visible with e.g. h5dump.
Note: Changing the aforementioned line back to use .walk_nodes() fixes the issue and lists the groups and keys properly:
hdf.groups()
[/Data/Table Layout (Table(69462,), zlib(4)) ''
  description := {
  ...
/Data/Array Layout/2D Parameters/Data Parameters (Table(15,)) ''
  description := {
  "mnemonic": StringCol(itemsize=8, shape=(), dflt=b'', pos=0),
  "description": StringCol(itemsize=48, shape=(), dflt=b'', pos=1),
  "isError": Int64Col(shape=(), dflt=0, pos=2),
  "units": StringCol(itemsize=7, shape=(), dflt=b'', pos=3),
  "category": StringCol(itemsize=31, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (642,)]]
hdf.keys()
['/Data/Table Layout',
 '/Metadata/Data Parameters',
 '/Metadata/Experiment Notes',
 '/Metadata/Experiment Parameters',
 '/Metadata/Independent Spatial Parameters',
 '/Metadata/_record_layout',
 '/Data/Array Layout/Layout Description',
 '/Data/Array Layout/1D Parameters/Data Parameters',
 '/Data/Array Layout/2D Parameters/Data Parameters']
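The difference between the two walkers can be illustrated directly with PyTables. This is a minimal sketch, not the pandas implementation: the file layout, the Record description, and the helper names build_example and list_paths are hypothetical. It assumes the affected files hold plain PyTables tables written without pandas metadata, which this report does not confirm. walk_groups() only yields Group nodes, so a Table leaf never appears in its output, while walk_nodes() also visits leaves.

```python
import os
import tempfile

import tables


class Record(tables.IsDescription):
    # Hypothetical two-column layout, used only for illustration.
    mnemonic = tables.StringCol(8)
    value = tables.Int64Col()


def build_example(path):
    """Write a file holding a plain PyTables Table (no pandas metadata)."""
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "Metadata")
        h5.create_table(group, "Parameters", Record)


def list_paths(path):
    """Return (group paths, table paths) as seen by the two walkers."""
    with tables.open_file(path, mode="r") as h5:
        # walk_groups() visits Group nodes only.
        groups = [g._v_pathname for g in h5.walk_groups("/")]
        # walk_nodes() visits every node; classname restricts it to Tables.
        table_paths = [
            n._v_pathname for n in h5.walk_nodes("/", classname="Table")
        ]
    return groups, table_paths


path = os.path.join(tempfile.mkdtemp(), "example.h5")
build_example(path)
groups, table_paths = list_paths(path)
print(groups)
print(table_paths)
```

In this sketch the table path shows up only in the walk_nodes() result, so any filter that tests for Table instances while iterating over walk_groups() has nothing to match.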
Fix
One solution would be (I guess) to revert #21543; another would be to fix at least .keys() to use ._handle.walk_nodes() instead of .groups() in:
return [n._v_pathname for n in self.groups()]
It could also be a bug in pytables.
Problem background
I was trying to figure out why some hdf5 files open fine with pandas but fail with dask.
The reason is that dask allows wildcards and iterates over the keys to find valid ones. If .keys() is empty, reading the files with dask fails.
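The failure mode can be modelled with a small wildcard matcher. This is a simplified sketch and not dask's actual code: match_keys is a hypothetical helper, and the use of fnmatch and the error message are assumptions made for illustration. It shows why an empty key list leaves nothing for a pattern to match.

```python
from fnmatch import fnmatch


def match_keys(available_keys, pattern):
    """Return the keys matching a wildcard pattern; raise if none match."""
    matched = [key for key in available_keys if fnmatch(key, pattern)]
    if not matched:
        # With an empty key list this branch is always taken.
        raise ValueError(f"No object matching {pattern!r} found in file")
    return matched


# Two of the keys listed in this report once .keys() works.
keys = ["/Data/Table Layout", "/Metadata/Data Parameters"]
print(match_keys(keys, "/Metadata/*"))

# What happens when .keys() returns an empty list.
try:
    match_keys([], "/Metadata/*")
except ValueError as err:
    print(err)
```

Under these assumptions even an exact key such as /Data/Table Layout fails to resolve, because the lookup goes through the key list rather than through .get_node().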
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.27.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.1.post20191125
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : 3.6.1
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None