PERF: MultiIndex.memory_usage shouldn't trigger the index engine by GianlucaFicarelli · Pull Request #58385 · pandas-dev/pandas

When computing memory usage, ignore the index engine if it isn't already cached, so that the call doesn't build it.
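The idea can be illustrated with a minimal, self-contained sketch. It is not the pandas implementation: `ToyIndex`, `engine`, and `engine_builds` are hypothetical names, and `functools.cached_property` stands in for whatever caching mechanism the real index uses. The point is only that the size calculation checks whether the engine is already cached instead of accessing it, since accessing it would build it.

```python
import sys
from functools import cached_property


class ToyIndex:
    """Hypothetical stand-in for an index with a lazily built engine."""

    def __init__(self, values):
        self.values = list(values)
        self.engine_builds = 0  # counts how often the expensive engine is built

    @cached_property
    def engine(self):
        # Expensive step: build a lookup table over all values.
        self.engine_builds += 1
        return {v: i for i, v in enumerate(self.values)}

    def memory_usage(self):
        total = sys.getsizeof(self.values)
        # cached_property stores its result in the instance __dict__,
        # so this membership test does not trigger the build.
        if "engine" in self.__dict__:
            total += sys.getsizeof(self.engine)
        return total


idx = ToyIndex(range(1000))
before = idx.memory_usage()      # engine not cached: it is skipped, not built
builds_after_first_call = idx.engine_builds

_ = idx.engine                   # some other operation builds the engine
after = idx.memory_usage()       # engine cached: its size is now included
```

With this approach the reported size can change once the engine has been built by another operation, which is the trade-off for keeping the call cheap.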

Reproducible Example

Calling memory_usage() on a large MultiIndex can be unexpectedly slow, because it triggers the index engine.

Using the main branch:

In [3]: %time idx = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000), np.arange(100)], names=["x0", "x1", "x2"])
CPU times: user 247 ms, sys: 170 ms, total: 417 ms
Wall time: 419 ms

In [4]: %time idx.memory_usage()
CPU times: user 2.91 s, sys: 4.09 s, total: 7 s
Wall time: 8.81 s
Out[4]: 500016953

Using this PR branch:

In [3]: %time idx = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000), np.arange(100)], names=["x0", "x1", "x2"])
CPU times: user 238 ms, sys: 148 ms, total: 386 ms
Wall time: 385 ms

In [4]: %time idx.memory_usage()
CPU times: user 112 µs, sys: 1e+03 ns, total: 113 µs
Wall time: 118 µs
Out[4]: 500016953

Side note: index._engine.sizeof() doesn't account for the contents of index._engine.values. If it should, a separate issue can be opened. In the example above, the additional memory that goes unreported would be:

In [8]: idx._engine.values
Out[8]:
array([ 262402, 262403, 262404, ..., 131331299, 131331300, 131331301], dtype=uint64)

In [9]: idx._engine.values.shape
Out[9]: (100000000,)

In [15]: idx._engine.values.nbytes
Out[15]: 800000000
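The 800000000 figure follows directly from the shape and dtype shown above: one uint64 (8 bytes) per row of the product index. A short worked check, using only the numbers from this example and the standard library (the variable names are illustrative, not part of pandas):

```python
import struct

# Level sizes from the from_product call in the example above.
level_sizes = (1000, 1000, 100)

# Number of rows in the product index: one engine value per row.
n_rows = 1
for size in level_sizes:
    n_rows *= size

# "=Q" is the standard-size unsigned 64-bit format, i.e. 8 bytes like uint64.
itemsize = struct.calcsize("=Q")

engine_values_nbytes = n_rows * itemsize

# Value returned by memory_usage() in Out[4] above.
reported_bytes = 500016953

print(n_rows)
print(engine_values_nbytes)
print(engine_values_nbytes / reported_bytes)
```

So the engine values alone would take more memory than the total currently reported by memory_usage().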