PERF: Saving many datasets in a single group slows down with each new addition · Issue #58248 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I found some strange behaviour that seems to appear only when writing through PyTables via pandas.
Saving many datasets within a single group becomes progressively slower.
import random
import string
import time

import matplotlib.pyplot as plt
import pandas as pd
import tqdm

df = pd.DataFrame({'A': [1.0] * 1000})
df = pd.concat([df] * 13, axis=1, ignore_index=True)

size = 5000
timings = []
for i in tqdm.tqdm(range(size), total=size):
    key = ''.join(random.choices(string.ascii_uppercase, k=20))
    start = time.time()
    df.to_hdf('test.h5', key=key, mode='a', complevel=9)
    timings.append(time.time() - start)

plt.plot(timings[10:])
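As a diagnostic (my own sketch, not part of the original report), the same loop can also be run with a single pd.HDFStore kept open across iterations. This helps separate per-call file open/close overhead from the cost of adding more nodes; the file name test_store.h5 is arbitrary.

# Diagnostic sketch: keep one HDFStore open for the whole loop instead of
# letting df.to_hdf() reopen the file on every call. store.put() writes a
# fixed-format dataset, which is also what df.to_hdf() does by default.
# Reuses df, tqdm, random, string, time and plt from the snippet above.
size = 5000
timings = []
with pd.HDFStore('test_store.h5', mode='a', complevel=9, complib='zlib') as store:
    for i in tqdm.tqdm(range(size), total=size):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))
        start = time.time()
        store.put(key, df)
        timings.append(time.time() - start)
plt.plot(timings[10:])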
This is not the case for h5py, which can easily write 100× as many datasets without slowing down.
import h5py

# Reuses df, tqdm, random, string, time and plt from the pandas snippet above.
size = 500000
timings = []
with h5py.File('test2.h5', 'w', libver='latest') as hf:
    group = hf.create_group('group')
    for i in tqdm.tqdm(range(size), total=size):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))
        start = time.time()
        group.create_dataset(key, data=df.values, compression="gzip", compression_opts=9)
        timings.append(time.time() - start)

plt.plot(timings[10:])
Installed Versions
Replace this line with the output of pd.show_versions()
Prior Performance
I raised this issue in the PyTables repo first, and it seems it is actually an issue with pandas: https://github.com/PyTables/PyTables/issues/1155
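For anyone triaging this, a small profiling sketch (my own suggestion, not from either issue thread) that times one additional to_hdf call against the already-populated test.h5 from the reproducer above may show which pandas/PyTables calls grow with the number of existing nodes:

# Profile a single extra write into the file that already contains many keys.
# Reuses df, random and string from the reproducer above.
import cProfile
import pstats

key = ''.join(random.choices(string.ascii_uppercase, k=20))
profiler = cProfile.Profile()
profiler.enable()
df.to_hdf('test.h5', key=key, mode='a', complevel=9)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)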