PERF: Saving many datasets in a single group slows down with each new addition · Issue #58248 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I found some strange behaviour that seems to appear only when writing through PyTables via pandas.
Saving many datasets within a single group becomes progressively slower.
import random
import string
import time

import matplotlib.pyplot as plt
import pandas as pd
import tqdm

df = pd.DataFrame({'A': [1.0] * 1000})
df = pd.concat([df] * 13, axis=1, ignore_index=True)

size = 5000
timings = []
for i in tqdm.tqdm(range(size), total=size):
    key = ''.join(random.choices(string.ascii_uppercase, k=20))
    start = time.time()
    df.to_hdf('test.h5', key=key, mode='a', complevel=9)
    timings.append(time.time() - start)

plt.plot(timings[10:])
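As a diagnostic (my own sketch, not part of the original report), the same loop can also be run with a single pd.HDFStore kept open across iterations. This helps separate per-call file open/close overhead from the cost of adding more nodes; the file name test_store.h5 is arbitrary.

# Diagnostic sketch: keep one HDFStore open for the whole loop instead of
# letting df.to_hdf() reopen the file on every call. store.put() writes a
# fixed-format dataset, which is also what df.to_hdf() does by default.
# Reuses df, tqdm, random, string, time and plt from the snippet above.
size = 5000
timings = []
with pd.HDFStore('test_store.h5', mode='a', complevel=9, complib='zlib') as store:
    for i in tqdm.tqdm(range(size), total=size):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))
        start = time.time()
        store.put(key, df)
        timings.append(time.time() - start)
plt.plot(timings[10:])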
This is not the case for h5py, which can easily write 100× as many datasets without slowing down.
import h5py

# Reuses df, tqdm, random, string, time and plt from the pandas snippet above.
size = 500000
timings = []
with h5py.File('test2.h5', 'w', libver='latest') as hf:
    group = hf.create_group('group')
    for i in tqdm.tqdm(range(size), total=size):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))
        start = time.time()
        group.create_dataset(key, data=df.values, compression="gzip", compression_opts=9)
        timings.append(time.time() - start)

plt.plot(timings[10:])
Installed Versions
Replace this line with the output of pd.show_versions()
Prior Performance
I raised this issue in the PyTables repo first, and it seems it is actually an issue with pandas: https://github.com/PyTables/PyTables/issues/1155
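For anyone triaging this, a small profiling sketch (my own suggestion, not from either issue thread) that times one additional to_hdf call against the already-populated test.h5 from the reproducer above may show which pandas/PyTables calls grow with the number of existing nodes:

# Profile a single extra write into the file that already contains many keys.
# Reuses df, random and string from the reproducer above.
import cProfile
import pstats

key = ''.join(random.choices(string.ascii_uppercase, k=20))
profiler = cProfile.Profile()
profiler.enable()
df.to_hdf('test.h5', key=key, mode='a', complevel=9)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)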