PERF: optimize memory usage for to_hdf by jreback · Pull Request #9648 · pandas-dev/pandas

The bug is back: to_hdf with format='table' again needs roughly an extra full copy of the frame.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 500))  # ~3.7 GiB of float64
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 500 entries, 0 to 499
dtypes: float64(500)
memory usage: 3.7 GB

%memit -r 1 df.to_hdf('test.h5', 'df', format='table', mode='w')
peak memory: 7934.20 MiB, increment: 3823.80 MiB
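For anyone reproducing this outside IPython, here is a minimal sketch using memory_profiler directly (the same package that provides the %memit magic; 'test.h5' is just a scratch file):

import numpy as np
import pandas as pd
from memory_profiler import memory_usage

df = pd.DataFrame(np.random.rand(1000000, 500))

def write():
    df.to_hdf('test.h5', 'df', format='table', mode='w')

# max_usage=True returns the peak RSS in MiB (a float in recent
# memory_profiler versions, a one-element list in older ones)
peak = memory_usage((write, (), {}), max_usage=True)
print('peak memory:', peak, 'MiB')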

pd.__version__
'0.24.2'
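If the spike comes from materializing the whole table in memory before it reaches pytables, one workaround sketch is to append in row chunks, which bounds the transient buffers to the chunk size rather than the full frame (the 100000-row chunk size here is an arbitrary choice):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 500))

chunk_rows = 100000  # tune: smaller chunks -> lower peak, slower write
with pd.HDFStore('test.h5', mode='w') as store:
    for start in range(0, len(df), chunk_rows):
        # append() always writes table format, so chunks land in one table
        store.append('df', df.iloc[start:start + chunk_rows])

The result should be the same single table node that to_hdf with format='table' produces.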

With a more complex structure (an object-dtype string column), things are much worse.

data_ifa.info()
<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, b88d3b87-3432-43cc-8219-f45d97389d8f to eb705297-94e8-4ccf-a910-5f3b9734d572
Data columns (total 2 columns):
bundles        100000 non-null object
bundles_len    100000 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.3+ MB

%memit -r 1 data_ifa.to_hdf(full_file_name_hd5, key='data_ifa', encoding='utf-8', complevel=9, mode='w', format='table')
peak memory: 22106.07 MiB, increment: 21324.53 MiB
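If bundles holds long variable-length strings, part of this is expected behavior on top of the regression: with format='table', object columns are written as fixed-width strings sized to the longest value, so a single long entry inflates the buffer for every row. A diagnostic plus chunked-write sketch (column and variable names taken from the report above; the 10000-row chunk size is illustrative, and str.len counts characters, which understates the byte width for non-ASCII utf-8 data):

# how wide pytables will make the string column
max_len = int(data_ifa['bundles'].str.len().max())
print('longest bundles value:', max_len, 'chars,',
      'roughly', max_len * len(data_ifa) // 2**20, 'MiB of buffer')

chunk_rows = 10000  # bounds transient buffers to ~chunk_rows * max_len bytes
with pd.HDFStore(full_file_name_hd5, mode='w', complevel=9) as store:
    for start in range(0, len(data_ifa), chunk_rows):
        store.append('data_ifa',
                     data_ifa.iloc[start:start + chunk_rows],
                     min_itemsize={'bundles': max_len},
                     encoding='utf-8')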