BUG: queries on categorical string columns in read_hdf return unexpected results · Issue #39189 · pandas-dev/pandas


Code Sample, a copy-pastable example

import pandas as pd

h5_path = 'test.h5'
df = pd.DataFrame({'col': ['a', 'b', 's']})
# sorted because of https://github.com/pandas-dev/pandas/issues/16623
categorical_values = list(sorted(list(df.col.unique())))
# will also set this column to be a "data column"
max_widths = {'col': 1}
df.col = df.col.astype('category')
df.col.cat.set_categories(categorical_values, inplace=True)
df.to_hdf(h5_path, 'main', mode='a', format='table', append=True, min_itemsize=max_widths)

read_df = pd.read_hdf(h5_path, where='col == "q"')  # returns df with index=2, col=s

h5_path = 'test.h5'
df = pd.DataFrame({'col': ['Word', 'Test']})
categorical_values = list(sorted(list(df.col.unique())))
max_widths = {'col': 4}
df.col = df.col.astype('category')
df.col.cat.set_categories(categorical_values, inplace=True)
df.to_hdf(h5_path, 'main', mode='a', format='table', append=True, min_itemsize=max_widths)

read_df = pd.read_hdf(h5_path, where='col == "W"')  # returns df with index=0, col=Word
read_df = pd.read_hdf(h5_path, where='col == "T"')  # returns empty df (as expected)

Problem description

Using the where clause for on-disk HDF queries sometimes returns incorrect results. From what I have tested, this only happens for columns that are both string-based and categorical. This matters because the returned rows do not match the query at all, which makes the feature mostly unusable for these column types. I should note, however, that I have not seen issues when querying for values that are actually present in the dataframe.

Expected Output

For all of the read_hdf calls above, the expected output is an empty dataframe, since none of the queried values ("q", "W", "T") is present in the column.
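For comparison, here is a minimal sketch of the same equality filters evaluated purely in memory, with no HDF file involved. The helper name `filter_in_memory` is hypothetical and not part of pandas; it only shows the result one would expect the where clause to match.

```python
import pandas as pd


def filter_in_memory(df, column, value):
    """Hypothetical helper: return the rows of df whose column equals value."""
    return df[df[column] == value]


# Same data as the first example, kept in memory only.
letters = pd.DataFrame({"col": ["a", "b", "s"]})
letters["col"] = letters["col"].astype("category")

# Same data as the second example.
words = pd.DataFrame({"col": ["Word", "Test"]})
words["col"] = words["col"].astype("category")

absent_q = filter_in_memory(letters, "col", "q")   # "q" is not a category
absent_w = filter_in_memory(words, "col", "W")     # "W" is only a prefix of "Word"
present_s = filter_in_memory(letters, "col", "s")  # "s" is present at index 2

print(len(absent_q), len(absent_w), len(present_s))
```

The two queries for absent values come back empty in memory, which is the baseline the on-disk query is being compared against. Reading the table without a where clause and filtering like this may serve as a stopgap, at the cost of loading the whole table.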

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-514.21.2.el7.x86_64
Version : #1 SMP Tue Jun 20 12:24:47 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.0.post20201006
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None