BUG: entries missing when reading from pytables hdf store using "where" statement
When I select from an HDF store using a "where" string (to locate entries in which one field matches a particular string value), the function returns fewer rows than when I load the entire DataFrame into memory and then match on that field. Below is some code that reproduces the problem; unfortunately, I can't easily provide the code that generates the source HDF store, but I'm happy to provide the kept_tids_20150310.h5 file if it would help. There are no NaN values in the DataFrame.
Running ptrepack on the HDF file solves the problem, but I don't believe this should happen in the first place.
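For anyone who wants to try the same workaround, here is a minimal sketch of how the repack step could be scripted. It assumes `ptrepack` (the command-line tool shipped with PyTables) is on the PATH and is run with no extra options; the report does not say which file was repacked or which flags were used, so the helper names and the output file name below are placeholders.

```python
import subprocess


def build_ptrepack_command(src, dst):
    """Return the argv list for a plain ptrepack copy of src into dst.

    No options are passed: the exact invocation is not stated in the
    report, so this is an assumption.
    """
    return ["ptrepack", src, dst]


def repack(src, dst):
    """Run ptrepack and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(build_ptrepack_command(src, dst))


# Hypothetical usage (the output file name is a placeholder):
# repack("kept_tids_20150310_resave.h5", "kept_tids_20150310_repacked.h5")
```

The repacked copy can then be queried with the same `where` string to check whether the row counts agree.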
I am using pandas 0.15.2 but have not tried 0.16.0.
import pandas as pd
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.19.2
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 2.4.0
sphinx: 1.2.3
patsy: 0.2.1
dateutil: 2.4.1
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: 0.7.2
xlsxwriter: None
lxml: 2.3.2
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.7.2
apiclient: None
rpy2: 2.4.2
sqlalchemy: None
pymysql: None
psycopg2: None
kept_tids = pd.read_hdf('kept_tids_20150310.h5', 'kept_tids', mode='r')
kept_tids.to_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='w', format='t', data_columns=True)
chroms = kept_tids['chrom'].drop_duplicates().order().tolist()
print chroms
['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chrM', 'chrX', 'chrY']
len(kept_tids)
202836
sum(len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == '%s'"%x)) for x in chroms)
193757
(kept_tids['chrom']=='chr16').sum()
10157
len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == '%s'"%'chr16'.replace("chr16", "chr16")) if False else pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == 'chr16'"))
6278
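To narrow down which values are affected, the comparison above could be generalized to every value of the column. The following is only a rough sketch: the helper names `count_mismatches` and `compare_where_counts` are made up for illustration, and it assumes the table was written with `data_columns=True` so that the column can be queried with `where`.

```python
def count_mismatches(expected, queried):
    """Return {value: (expected_count, queried_count)} where the counts differ.

    A value missing from `queried` is treated as a count of 0.
    """
    return {
        key: (n, queried.get(key, 0))
        for key, n in expected.items()
        if queried.get(key, 0) != n
    }


def compare_where_counts(path, key, column):
    """Compare in-memory counts per value of `column` with where= query counts."""
    import pandas as pd  # imported lazily so count_mismatches has no dependencies

    df = pd.read_hdf(path, key, mode='r')
    # Row count per distinct value, computed entirely in memory.
    expected = df[column].value_counts().to_dict()
    # Row count per distinct value, computed with a "where" selection on disk.
    queried = {
        value: len(pd.read_hdf(path, key, mode='r',
                               where="%s == '%s'" % (column, value)))
        for value in expected
    }
    return count_mismatches(expected, queried)
```

Calling `compare_where_counts('kept_tids_20150310_resave.h5', 'kept_tids', 'chrom')` should return an empty dict if the two approaches agree; any entries it returns are the values for which the `where` selection and the in-memory match disagree.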