Presence of softlink in HDF5 file breaks HDFStore.keys() · Issue #20523 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
#! /path/to/python3.6
import pandas as pd
df = pd.DataFrame({"a": [1], "b": [2]})
print(df.to_string())
hdf = pd.HDFStore("/tmp/test.hdf", mode="w")
hdf.put("/test/key", df)
# Brittle
hdf._handle.create_soft_link(hdf._handle.root.test, "symlink", "/test/key")
hdf.close()
print("Successful write")
hdf = pd.HDFStore("/tmp/test.hdf", mode="r")
'''
Traceback (most recent call last):
  File "snippet.py", line 31, in <module>
    print(hdf.keys())
  File "python3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 529, in keys
    return [n._v_pathname for n in self.groups()]
  File "python3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1077, in groups
    g for g in self._handle.walk_nodes()
  File "python3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1078, in <listcomp>
    if (getattr(g._v_attrs, 'pandas_type', None) or
  File "python3.6.3/lib/python3.6/site-packages/tables/link.py", line 79, in __getattr__
    "`%s` instance" % self.__class__.__name__)
KeyError: 'you cannot get attributes from this `NoAttrs` instance'
'''
print(hdf.keys())  # causes exception
hdf.close()
print("Successful read")
Problem description
I know I have an esoteric problem, but I'm building an HDF5 file using Pandas and then using PyTables to soft-link to the Pandas DataFrame. I understand this is unsupported and brittle, but for my use case I haven't been able to come up with a better or simpler solution.
This issue is similar to: #6019
The root cause is that HDFStore.keys() calls HDFStore.groups(), which reads g._v_attrs for every node g returned by walk_nodes() on the PyTables File.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L1076
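But looking up an attribute on g._v_attrs when g is a tables.link.SoftLink raises a KeyError, which the three-argument getattr does not swallow (its default only applies to AttributeError). The KeyError comes from: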
https://github.com/PyTables/PyTables/blob/develop/tables/link.py#L76
There doesn't appear to be a way to guard against an instance of NoAttrs, since that class is defined inside the method. One solution may be to check whether g is a Link before touching its attributes:
return [
    g for g in self._handle.walk_nodes()
    if (not isinstance(g, _table_mod.link.Link) and
        (getattr(g._v_attrs, 'pandas_type', None) or
         getattr(g, 'table', None) or
         (isinstance(g, _table_mod.table.Table) and
          g._v_name != u('table'))))
]
I'd be happy to write a PR and tests if you find this change acceptable.
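To illustrate why the default passed to getattr doesn't help here, and why an isinstance check placed first would, here is a minimal sketch that runs without PyTables. Every class in it (FakeNoAttrs, FakeAttrs, FakeNode, FakeLink) is a hypothetical stand-in that only mimics the behaviour shown in the traceback. None of them is the real tables API, and the filter is simplified to the pandas_type check alone.

```python
# Stand-in classes only: they mimic the behaviour described in this issue
# and are NOT the real PyTables types.


class FakeNoAttrs:
    """Mimics a soft link's attribute set: any lookup raises KeyError."""

    def __getattr__(self, name):
        raise KeyError("you cannot get attributes from this "
                       "`%s` instance" % self.__class__.__name__)


class FakeAttrs:
    """Mimics a normal attribute set carrying a pandas_type marker."""

    pandas_type = "frame"


class FakeNode:
    """Stand-in for a regular node with a path and an attribute set."""

    def __init__(self, pathname, attrs):
        self._v_pathname = pathname
        self._v_attrs = attrs


class FakeLink(FakeNode):
    """Stand-in for a link node (hypothetical, not tables.link.Link)."""

    def __init__(self, pathname):
        super().__init__(pathname, FakeNoAttrs())


nodes = [FakeNode("/test/key", FakeAttrs()), FakeLink("/test/symlink")]

# getattr's default only covers AttributeError, so the KeyError propagates.
try:
    getattr(nodes[1]._v_attrs, "pandas_type", None)
    link_lookup_failed = False
except KeyError:
    link_lookup_failed = True

# Checking the node type first short-circuits before _v_attrs is touched.
keys = [
    n._v_pathname for n in nodes
    if not isinstance(n, FakeLink)
    and getattr(n._v_attrs, "pandas_type", None)
]

print(link_lookup_failed, keys)
```

Because `and` short-circuits, the link node is rejected before its attribute set is ever read, which is the same ordering the proposed change relies on.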
Expected Output
   a  b
0  1  2
Successful write
['/test/key']
Successful read
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf-8
LANG: en_US.utf-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None