BUG: Categorical data fails to load from hdf when all columns are NaN · Issue #18413 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', np.nan], 'b': [np.nan, np.nan, np.nan, np.nan]})
df['a'] = df.a.astype('category')
df['b'] = df.b.astype('category')

df.to_hdf('foo.h5', 'bar', format='table')
pd.read_hdf('foo.h5', 'bar')
Problem description
While storing an HDF file with categorical data containing np.nan values works fine, loading the file back into a DataFrame raises an exception.
File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 372, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 742, in select
return it.get_result()
File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 1449, in get_result
results = self.func(self.start, self.stop, where)
File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 735, in func
columns=columns, **kwargs)
File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 4124, in read
if not self.read_axes(where=where, **kwargs):
File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 3329, in read_axes
a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 2133, in convert
if mask.any():
AttributeError: 'bool' object has no attribute 'any'
This exception is related to storing a column that contains only np.nan values (column a stores and loads fine on its own).
The problem could already be in the way the metadata for column b (the np.nan-only column) is stored when calling df.to_hdf(), as the metadata is None when loading. The relevant code for pd.read_hdf() in DataCol.convert:
elif meta == u('category'):

    # we have a categorical
    categories = self.metadata
    codes = self.data.ravel()

    # if we have stored a NaN in the categories
    # then strip it; in theory we could have BOTH
    # -1s in the codes and nulls :<
    mask = isnull(categories)
    if mask.any():
        categories = categories[~mask]
        codes[codes != -1] -= mask.astype(int).cumsum().values
Here metadata is None (self.metadata == categories == None), which in turn makes mask (= isnull(None)) a scalar bool rather than an array (True, since isnull treats None as missing), and thus .any() fails.
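The scalar-versus-array behaviour can be checked outside of pytables. The snippet below is a minimal sketch, not the pandas fix: null_mask is a hypothetical helper, and treating missing metadata as "no categories stored" is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd

# For a scalar such as None, isnull returns a plain Python bool,
# which has no .any() method (hence the AttributeError in the traceback).
scalar_mask = pd.isnull(None)  # True


def null_mask(categories):
    """Hypothetical helper: always return a boolean array, even for None."""
    if categories is None:
        # Assumption: missing metadata means no categories were stored.
        return np.zeros(0, dtype=bool)
    return np.asarray(pd.isnull(categories))


# An empty boolean array supports .any() and reports no nulls to strip.
empty_mask = null_mask(None)
# A regular set of categories yields one flag per category.
index_mask = null_mask(pd.Index(['a', np.nan]))
```

With a guard of this kind, the mask.any() check would be skipped for the np.nan-only column instead of raising.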
Expected Output
No exception; dataframe loads as it would without categorical data.
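Until the loading path handles this case, one possible workaround is to find the affected columns before writing. The sketch below is untested against HDF storage: all_nan_categoricals is a hypothetical helper, and the idea that casting such columns to a non-categorical dtype avoids the failure is an assumption based on the observation that the error only involves the np.nan-only categorical column.

```python
import numpy as np
import pandas as pd


def all_nan_categoricals(df):
    """Hypothetical helper: names of categorical columns holding only NaN."""
    return [col for col in df.columns
            if df[col].dtype.name == 'category' and df[col].isnull().all()]


df = pd.DataFrame({'a': ['a', 'b', 'c', np.nan],
                   'b': [np.nan, np.nan, np.nan, np.nan]})
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

problem_columns = all_nan_categoricals(df)

# Assumption: storing these columns as plain floats sidesteps the
# missing-metadata path on read; this is not verified here.
safe = df.copy()
for col in problem_columns:
    safe[col] = safe[col].astype('float64')
```

The resulting frame (safe) keeps column a categorical and would be the one passed to to_hdf(); the categorical dtype of column b would have to be restored after reading.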
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None
pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.16.0
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None