BUG: ValueError when doing HDFStore.Select of contiguous mixed-data table ft. VLArray · Issue #17021 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas as pd

myDf = pd.DataFrame({'a': pd.Series([1443525810, 1443540836, 1443609470]),
                     'b': pd.Series(['ab', 'cd', 'ab'])})
myDf.to_hdf('test.h5', 'test')

with pd.HDFStore('test.h5') as myFile:
    df = myFile.select('/test', start=0, stop=2)  # omit "start=0, stop=2" to prevent error
    display(df)
Problem description
ValueError: Shape of passed values is (2, 3), indices imply (2, 2)
Expected Output
            a   b
0  1443525810  ab
1  1443540836  cd
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Windows
OS-release: 2012ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Other remarks:
- Please be gentle, this is my first GitHub interaction :)
- A notebook is attached that contains the problem and the solution output.
- My guess is that read_array in pytables.py takes the one-dimensional behavior of VLArray into account too late: only after slicing with "data = node[start:stop]", which results in the slice returning the whole column. My following implementation of the method seems to fix it.
# Proposed replacement for the read_array method in pandas/io/pytables.py.
# It relies on names from that module (np, u, _set_tz) and is not standalone.
def read_array(self, key, start=None, stop=None):
    """ read an array for the specified node (off of group) """
    import tables
    node = getattr(self.group, key)
    attrs = node._v_attrs

    transposed = getattr(attrs, 'transposed', False)

    if isinstance(node, tables.VLArray):
        # take the single VLArray row first, then apply start/stop to its values
        ret = node[0][start:stop]
    else:
        dtype = getattr(attrs, 'value_type', None)
        shape = getattr(attrs, 'shape', None)

        if shape is not None:
            # length 0 axis
            ret = np.empty(shape, dtype=dtype)
        else:
            ret = node[start:stop]

        if dtype == u('datetime64'):
            # reconstruct a timezone if indicated
            ret = _set_tz(ret, getattr(attrs, 'tz', None), coerce=True)
        elif dtype == u('timedelta64'):
            ret = np.asarray(ret, dtype='m8[ns]')

    if transposed:
        return ret.T
    else:
        return ret
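
The difference between the two slicing orders can be sketched without PyTables. This is only an illustration under the assumption made in the remark above, namely that the string column is stored as a VLArray with a single row holding the whole column. The nested list `node` is a hypothetical stand-in, not a real tables.VLArray, and the variable names are made up for this sketch.

```python
# Hypothetical stand-in for a VLArray node: one row that holds the whole column.
node = [['ab', 'cd', 'ab']]
start, stop = 0, 2

# Slicing the node first selects rows of the node (there is only one row),
# so taking row 0 afterwards still yields every value of the column.
sliced_too_early = node[start:stop][0]

# Taking row 0 first and slicing afterwards selects only the requested values.
sliced_after = node[0][start:stop]

print(len(sliced_too_early))  # 3
print(len(sliced_after))      # 2
```

If the assumption holds, the three values returned by the first order against the two requested rows would be consistent with the "(2, 3)" versus "(2, 2)" mismatch in the error message.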