Performance pd.HDFStore().keys() slow · Issue #17593 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd, numpy as np

path = 'test.h5'
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]

with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)

%timeit store = pd.HDFStore(path).keys()

Problem description

pd.HDFStore().keys() is extremely slow for a large store containing many dataframes: the code above takes 10.6 s just to get the list of keys in the store.

It appears the issue is related to the path walk in tables (PyTables), which requires every single node to be loaded just to check whether it is a group.
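
A rough, paraphrased sketch of the pandas side (not the exact 0.19.x source, just its shape): keys() delegates to groups(), which calls walk_nodes() on the underlying tables.File and checks each node's attributes, so every leaf array in the file gets visited and instantiated just to decide that it is not a pandas group.

def keys(store):
    # paraphrased sketch; the real implementation lives in pandas.io.pytables
    return [g._v_pathname for g in groups(store)]

def groups(store):
    return [
        g for g in store._handle.walk_nodes()        # visits and instantiates every node, leaves included
        if getattr(g._v_attrs, 'pandas_type', None)  # per-node attribute check
    ]

With 3000 fixed-format frames written by put(), each key is a group holding several leaf arrays, so the walk touches thousands of nodes, which is consistent with the 6001 hits per line in the profile further below.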

/tables/file.py

def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where.
    ...
    """
    group = self.get_node(where)  # Does the parent exist?
    self._check_group(group)      # Is it a group?

    return group._f_iter_nodes(classname)

%lprun -f store._handle.iter_nodes store.keys()

Timer unit: 2.56e-07 s

Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                           def iter_nodes(self, where, classname=None):
  1999                                               """Iterate over children nodes hanging from where.
  2000
  2001                                               Parameters
  2002                                               ----------
  2003                                               where
  2004                                                   This argument works as in :meth:`File.get_node`, referencing the
  2005                                                   node to be acted upon.
  2006                                               classname
  2007                                                   If the name of a class derived from
  2008                                                   Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                   that class (or subclasses of it) will be returned.
  2010
  2011                                               Notes
  2012                                               -----
  2013                                               The returned nodes are alphanumerically sorted by their name.
  2014                                               This is an iterator version of :meth:`File.list_nodes`.
  2015
  2016                                               """
  2017
  2018      6001       125237     20.9     75.4      group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0      self._check_group(group)  # Is it a group?
  2020
  2021      6001        14216      2.4      8.6      return group._f_iter_nodes(classname)

Therefore, if the dataframes are large and there are many of them in one store, this can take a very long time (my real-life code takes about a minute to do this). My version of pandas is older, but I don't think this has been fixed in subsequent versions.
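
As a possible workaround (a sketch only, and it leans on the private store._handle attribute, which is the underlying tables.File), walking only the Group nodes with PyTables' walk_groups() avoids instantiating the per-key leaf arrays and returns the key paths much faster for a store laid out like the one above:

import pandas as pd

path = 'test.h5'  # same file as in the example above

with pd.HDFStore(path) as store:
    # walk_groups() yields only Group nodes, so the leaf arrays inside
    # each key's group (axis0, block0_values, ...) are never opened.
    keys = [
        g._v_pathname
        for g in store._handle.walk_groups()
        if getattr(g._v_attrs, 'pandas_type', None) is not None
    ]

For a store written entirely by pandas this should give the same paths as store.keys(), but since it relies on internals it is a stop-gap rather than a fix.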

I'm also not sure whether this should be raised against pandas or PyTables.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None