Presence of softlink in HDF5 file breaks HDFStore.keys() · Issue #20523 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

#! /path/to/python3.6

import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
print(df.to_string())

hdf = pd.HDFStore("/tmp/test.hdf", mode="w")
hdf.put("/test/key", df)

# Brittle
hdf._handle.create_soft_link(hdf._handle.root.test, "symlink", "/test/key")
hdf.close()
print("Successful write")

hdf = pd.HDFStore("/tmp/test.hdf", mode="r")
'''
Traceback (most recent call last):
  File "snippet.py", line 31, in <module>
    print(hdf.keys())
  File "python3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 529, in keys
    return [n._v_pathname for n in self.groups()]
  File "python3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1077, in groups
    g for g in self._handle.walk_nodes()
  File "python3.6.3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1078, in <listcomp>
    if (getattr(g._v_attrs, 'pandas_type', None) or
  File "python3.6.3/lib/python3.6/site-packages/tables/link.py", line 79, in __getattr__
    "%s instance" % self.__class__.__name__)
KeyError: 'you cannot get attributes from this NoAttrs instance'
'''
print(hdf.keys())  # causes exception
hdf.close()

print("Successful read")

Problem description

I know this is an esoteric problem, but I'm building an HDF5 file using pandas and then using PyTables to create a soft link to the pandas DataFrame. I understand this is unsupported and brittle, but for my use case I haven't been able to come up with a better or simpler solution.

This issue is similar to: #6019

The root cause is that HDFStore.keys() calls HDFStore.groups(), which reads g._v_attrs on each node returned by walk_nodes() on the PyTables File:

https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L1076

But reading an attribute from g._v_attrs when g is a tables.link.SoftLink raises a KeyError because of:

https://github.com/PyTables/PyTables/blob/develop/tables/link.py#L76
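The `None` default passed to `getattr` does not help here, because that default only suppresses `AttributeError`, not `KeyError`. The sketch below illustrates this with stand-in classes; `NoAttrsStandIn`, `SoftLinkStandIn` and `read_pandas_type` are hypothetical names that imitate the behaviour seen in the traceback and are not the real PyTables code.

```python
# Stand-ins only: these classes imitate the behaviour described in the
# traceback and are NOT the real PyTables implementation.
class NoAttrsStandIn:
    def __getattr__(self, name):
        # Raises KeyError (not AttributeError) for every attribute lookup.
        raise KeyError(
            "you cannot get attributes from this %s instance"
            % self.__class__.__name__
        )


class SoftLinkStandIn:
    @property
    def _v_attrs(self):
        return NoAttrsStandIn()


def read_pandas_type(node):
    # getattr's default only suppresses AttributeError, so KeyError escapes.
    return getattr(node._v_attrs, "pandas_type", None)


try:
    read_pandas_type(SoftLinkStandIn())
    outcome = "no error"
except KeyError as exc:
    outcome = "KeyError: %s" % exc.args[0]

print(outcome)
```

If the stand-in raised `AttributeError` instead, `read_pandas_type` would simply return `None` and the node would be filtered out without an exception.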

There doesn't appear to be a way to guard against an instance of NoAttrs, since that class is defined within the method. One solution may be to check whether g is an instance of Link:

        return [
            g for g in self._handle.walk_nodes()
            if (not isinstance(g, _table_mod.link.Link) and
                (getattr(g._v_attrs, 'pandas_type', None) or
                 getattr(g, 'table', None) or
                (isinstance(g, _table_mod.table.Table) and
                 g._v_name != u('table'))))
        ]

I'd be happy to write a PR and tests if you find this change acceptable.
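As a rough illustration of the proposed check, here is a minimal sketch that runs without PyTables. `LinkStandIn`, `GroupStandIn` and `keys_skipping_links` are hypothetical names used only for this example, and the link stand-in is simplified so that it raises as soon as `_v_attrs` is touched.

```python
from types import SimpleNamespace


class LinkStandIn:
    """Hypothetical stand-in for a link node (not a PyTables class)."""

    _v_pathname = "/test/symlink"

    @property
    def _v_attrs(self):
        # Simplification: fail as soon as the attribute set is requested.
        raise KeyError("you cannot get attributes from this NoAttrs instance")


class GroupStandIn:
    """Hypothetical stand-in for a group node, optionally written by pandas."""

    def __init__(self, pathname, pandas_type=None):
        self._v_pathname = pathname
        self._v_attrs = SimpleNamespace(pandas_type=pandas_type)


def keys_skipping_links(nodes):
    # The isinstance test runs first and short-circuits the `and`,
    # so _v_attrs is never read on a link node.
    return [
        n._v_pathname
        for n in nodes
        if not isinstance(n, LinkStandIn)
        and getattr(n._v_attrs, "pandas_type", None)
    ]


nodes = [
    GroupStandIn("/test"),
    GroupStandIn("/test/key", "frame"),
    LinkStandIn(),
]
print(keys_skipping_links(nodes))
```

The order of the conditions matters: putting the `isinstance` check after the `getattr` call would still raise on the link node.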

Expected Output

   a  b
0  1  2
Successful write
['/test/key']
Successful read

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf-8
LANG: en_US.utf-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None