Improve HDFStore.groups performance · Issue #21372 · pandas-dev/pandas
I ran into this issue (#17593) recently, when I was trying to figure out what was causing the order-of-magnitude difference between `h5ls` and `HDFStore.groups()` or `HDFStore.keys()`. It was shrugged off as a PyTables problem at the time and then closed. I've come to the conclusion that this is actually a pandas problem; hopefully I can convince you of that as well.
Consider a large flat HDF5 file, one where I have only created pandas dataframes at the root node `/`.
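(A store of roughly this shape can be built like the sketch below; the frame count and sizes are illustrative stand-ins, not my actual file.)

```python
import numpy as np
import pandas as pd

# Build a "flat" store: many fixed-format dataframes, all at the root node /.
store = pd.HDFStore("flat_store.h5", mode="w")
for i in range(1000):
    store[f"df_{i}"] = pd.DataFrame(np.random.randn(100, 4))
```

With such a store open, scanning only the first level of nodes with `iter_nodes`: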
```python
%%timeit
groups_iter_nodes = set()
for x in store._handle.iter_nodes(store._handle.root):
    groups_iter_nodes |= {x._v_pathname}
```
```
78.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
whereas a full recursive walk with `walk_nodes`:
```python
%%timeit
groups_walk_nodes = set()
for x in store._handle.walk_nodes(store._handle.root):
    groups_walk_nodes |= {x._v_pathname}
```
```
7.08 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
And the `iter_nodes` method has found every relevant node. The bottom version is an (abbreviated) version of what pandas does, while the top version only looks at the first level of keys.

Obviously, just looking at the root level of the HDF file doesn't work for the vast majority of cases, but I think there is still significant room for improvement.
The root cause is that there are a whole bunch of leaves under these groups that don't need to be looked at, because we already know the answer. In my file, each of those groups has the following arrays under it: `['axis0', 'axis1', 'block0_items', 'block0_values']`. These arrays are where the data for the pandas dataframe lives. In the example above, `walk_nodes` must actually visit every one of these nodes, while the `iter_nodes` version doesn't.
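To see this concretely, here is one way to inspect a stored frame's group (assuming a fixed-format dataframe such as `/df_0` from the sketch above; the printed values are what I'd expect, not verified output):

```python
# Grab the group for one stored dataframe and list its children:
# these are exactly the data-carrying leaves that walk_nodes visits.
node = store._handle.get_node("/df_0")
print(sorted(node._v_children))   # ['axis0', 'axis1', 'block0_items', 'block0_values']

# The group's own attributes already say what it is, so descending
# into the child arrays tells us nothing new.
print(node._v_attrs.pandas_type)  # 'frame' for a fixed-format DataFrame
```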
If we check if there are any children not in this list:
```python
%%timeit
groups_iter_nodes = set()
for x in store._handle.iter_nodes(store._handle.root):
    if len(set(x._v_children.keys()) - {'axis0', 'axis1', 'block0_items', 'block0_values'}) > 0:
        print("oops")
    groups_iter_nodes |= {x._v_pathname}
```
```
88.4 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
This is still much faster than actually visiting the nodes in order to determine that they aren't pandas dataframes.
I'm not entirely sure how pandas lays out dataframes in HDF files, but if pandas really never stores anything except as children of groups, then this is a better way to find all the groups:
```python
%%timeit
groups_walk_groups = set()
for x in store._handle.walk_groups(store._handle.root):
    groups_walk_groups |= {x._v_pathname}
```
```
149 ms ± 2.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Ultimately, I don't know enough about the history of pandas and `HDFStore`, or the details of how pandas stores dataframes in HDF files, to fix this myself -- and I expect there is a lot of historical backwards-compatibility baggage that needs to be navigated -- but I think it is clear that `HDFStore` could be much smarter about how it finds groups.
If backwards compatibility isn't an issue, I would be happy to submit a pull request for this. Switching to `walk_groups` seems like the least that could be done to improve things.
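As a rough sketch of the direction I mean (not a tested patch; `groups_via_walk_groups` is a hypothetical stand-in for `HDFStore.groups`, and the `pandas_type` check mirrors the attribute pandas writes on its groups):

```python
def groups_via_walk_groups(handle):
    """Sketch: collect pandas storage groups without visiting the
    data-carrying leaf arrays that dominate walk_nodes' runtime."""
    found = []
    # walk_groups yields only Group nodes, so the axis0/axis1/block0_*
    # leaves are never touched.
    for g in handle.walk_groups(handle.root):
        # pandas tags the groups it writes with a 'pandas_type' attribute;
        # anything without one is not a pandas storage object.
        if getattr(g._v_attrs, "pandas_type", None) is not None:
            found.append(g)
    return found

paths = {g._v_pathname for g in groups_via_walk_groups(store._handle)}
```

This ignores whatever legacy layouts (e.g. bare `Table` nodes without the attribute) the current implementation also has to handle, which is exactly the backwards-compatibility baggage I mentioned.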