Performance: pd.HDFStore().keys() slow · Issue #17593 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
```python
import pandas as pd, numpy as np

path = 'test.h5'
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)

%timeit store = pd.HDFStore(path).keys()
```
Problem description
The performance of pd.HDFStore().keys() is incredibly slow for a large store containing many dataframes: the code above takes 10.6 s just to get the list of keys in the store.
It appears the issue is related to the path walk in PyTables, which requires every single node to be loaded just to check whether it is a group.
tables/file.py:

```python
def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""
    group = self.get_node(where)  # Does the parent exist?
    self._check_group(group)      # Is it a group?
    return group._f_iter_nodes(classname)
```
```
%lprun -f store._handle.iter_nodes store.keys()

Timer unit: 2.56e-07 s

Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                           def iter_nodes(self, where, classname=None):
  1999                                               """Iterate over children nodes hanging from where.
  2000
  2001                                               Parameters
  2002                                               ----------
  2003                                               where
  2004                                                   This argument works as in :meth:`File.get_node`, referencing the
  2005                                                   node to be acted upon.
  2006                                               classname
  2007                                                   If the name of a class derived from
  2008                                                   Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                   that class (or subclasses of it) will be returned.
  2010
  2011                                               Notes
  2012                                               -----
  2013                                               The returned nodes are alphanumerically sorted by their name.
  2014                                               This is an iterator version of :meth:`File.list_nodes`.
  2015
  2016                                               """
  2017
  2018      6001       125237     20.9     75.4      group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0      self._check_group(group)  # Is it a group?
  2020
  2021      6001        14216      2.4      8.6      return group._f_iter_nodes(classname)
```
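To make the cost pattern concrete, here is a small self-contained sketch. The class and method names are invented for illustration, not the real PyTables API; it contrasts returning child *names* (cheap) with materialising a node object per child, which is what the profile shows `iter_nodes` effectively doing:

```python
# Illustrative sketch only: FakeGroup/FakeNode are hypothetical names,
# not PyTables classes.

class FakeNode:
    def __init__(self, name):
        self.name = name  # stands in for an expensive metadata load


class FakeGroup:
    def __init__(self, names):
        self._names = list(names)  # cheap: just strings
        self.loads = 0             # counts simulated node loads

    def list_names(self):
        # Fast path: no node objects are created at all.
        return list(self._names)

    def iter_nodes(self):
        # Slow path: one "load" per child, mirroring the profile above.
        for name in self._names:
            self.loads += 1
            yield FakeNode(name)


group = FakeGroup("test" + str(i) for i in range(3000))
names = group.list_names()        # loads nothing
nodes = list(group.iter_nodes())  # loads every child
```

With 3000 frames in the store, the slow path performs one load per child, while the fast path touches only the name strings.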
Therefore, if the dataframes are large and you have many of them in one store, just listing the keys can take a very long time (my real-life code takes about a minute to do this). My version of pandas is older, but I don't believe this has been fixed in subsequent versions.
I'm also not sure whether this should be raised against pandas or PyTables.
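As a possible workaround (my assumption, not something pandas documents), the top-level group names can be read straight from the HDF5 file with h5py, which lists member names without instantiating a node object per child:

```python
import os
import tempfile

import h5py  # assumed available; reads HDF5 metadata directly

# Build a small stand-in file with a few top-level groups (in the real
# case these would be the groups written by HDFStore.put).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    for i in range(5):
        f.create_group("test" + str(i))

# h5py's Group.keys() returns member names only, so this stays cheap
# even for stores with thousands of frames.
with h5py.File(path, "r") as f:
    keys = sorted("/" + name for name in f.keys())

print(keys)
```

Note this bypasses pandas entirely, so it returns raw HDF5 group paths; HDFStore.keys() may apply its own normalisation on top of these.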
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None