BUG: Categorical data fails to load from hdf when all columns are NaN · Issue #18413 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', np.nan],
                   'b': [np.nan, np.nan, np.nan, np.nan]})
df['a'] = df.a.astype('category')
df['b'] = df.b.astype('category')

df.to_hdf('foo.h5', 'bar', format='table')
pd.read_hdf('foo.h5', 'bar')

Problem description

Storing an HDF file with categorical data that contains np.nan values works fine, but loading the file back into a DataFrame raises an exception:

  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 372, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 742, in select
    return it.get_result()
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 1449, in get_result
    results = self.func(self.start, self.stop, where)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 735, in func
    columns=columns, **kwargs)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 4124, in read
    if not self.read_axes(where=where, **kwargs):
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 3329, in read_axes
    a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
  File "~/env/lib/python2.7/site-packages/pandas/io/pytables.py", line 2133, in convert
    if mask.any():
AttributeError: 'bool' object has no attribute 'any'

The exception is triggered by a column that contains only np.nan values (column a stores and loads fine on its own).
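To check which columns of a frame fall into this case, a minimal sketch follows. The helper name empty_categorical_columns is hypothetical and not part of pandas; it assumes that an all-NaN column cast to 'category' ends up with zero categories, which is the situation described here.

```python
import numpy as np
import pandas as pd


def empty_categorical_columns(df):
    """Return names of categorical columns that have no categories at all.

    Hypothetical helper for illustration; not part of pandas.
    """
    return [
        name
        for name in df.columns
        # dtype.name is 'category' for categorical columns
        if df[name].dtype.name == 'category'
        and len(df[name].cat.categories) == 0
    ]


df = pd.DataFrame({'a': ['a', 'b', 'c', np.nan],
                   'b': [np.nan, np.nan, np.nan, np.nan]})
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

# Columns that would hit the read_hdf() failure described in this report
affected = empty_categorical_columns(df)
```

Here affected names only column b, since column a still has the categories 'a', 'b' and 'c'.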

The problem may already lie in the way the metadata for column b (the np.nan-only column) is stored when calling df.to_hdf(), since the metadata is None when loading. The relevant code for pd.read_hdf() in DataCol.convert:

        elif meta == u('category'):

            # we have a categorical
            categories = self.metadata
            codes = self.data.ravel()

            # if we have stored a NaN in the categories
            # then strip it; in theory we could have BOTH
            # -1s in the codes and nulls :<
            mask = isnull(categories)
            if mask.any():
                categories = categories[~mask]
                codes[codes != -1] -= mask.astype(int).cumsum().values

In this case self.metadata, and therefore categories, is None. As a result mask = isnull(None) is a plain scalar bool (True, because None itself counts as null) rather than a boolean array, so calling .any() on it fails.
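One possible way to avoid the failure is to treat missing metadata as an empty set of categories before any null check runs. The sketch below illustrates that idea outside of pandas internals. The function decode_categorical is hypothetical, it omits the NaN-stripping step from the snippet above, and it assumes that None metadata always means the column had no categories.

```python
import numpy as np
import pandas as pd

# The null check on missing metadata yields a plain Python bool,
# which has no .any() method; hence the AttributeError.
scalar_mask = pd.isnull(None)


def decode_categorical(codes, categories):
    """Rebuild a Categorical from stored codes and possibly missing categories.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    codes = np.asarray(codes).ravel()
    if categories is None:
        # Assumption: no stored metadata means the column had no categories,
        # so every code is expected to be -1 (missing).
        categories = pd.Index([], dtype=np.float64)
    return pd.Categorical.from_codes(codes, categories=categories)


# An all-NaN column: only -1 codes and no stored categories
all_nan = decode_categorical([-1, -1, -1, -1], None)

# A regular column: codes index into the stored categories
regular = decode_categorical([0, 1, 2, -1], ['a', 'b', 'c'])
```

With this guard the all-NaN column comes back as a categorical of four missing values with no categories, and the regular column is unaffected.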

Expected Output

No exception; dataframe loads as it would without categorical data.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.16.0
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 1.4.3
openpyxl: 1.8.6
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None