PERF: MultiIndex.memory_usage shouldn't trigger the index engine by GianlucaFicarelli · Pull Request #58385 · pandas-dev/pandas
Ignore the index engine when it isn't already cached.
- closes #xxxx (Replace xxxx with the GitHub issue number)
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Reproducible Example
Calling memory_usage() can be unexpectedly slow on a big MultiIndex.
Using the main branch:
In [3]: %time idx = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000), np.arange(100)], names=["x0", "x1", "x2"])
CPU times: user 247 ms, sys: 170 ms, total: 417 ms
Wall time: 419 ms

In [4]: %time idx.memory_usage()
CPU times: user 2.91 s, sys: 4.09 s, total: 7 s
Wall time: 8.81 s
Out[4]: 500016953
Using this PR branch:
In [3]: %time idx = pd.MultiIndex.from_product([np.arange(1000), np.arange(1000), np.arange(100)], names=["x0", "x1", "x2"])
CPU times: user 238 ms, sys: 148 ms, total: 386 ms
Wall time: 385 ms

In [4]: %time idx.memory_usage()
CPU times: user 112 µs, sys: 1e+03 ns, total: 113 µs
Wall time: 118 µs
Out[4]: 500016953
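The idea of counting the engine only when it is already cached can be illustrated with a small standalone sketch. This is not the pandas implementation: ToyIndex, its engine property, and the engine_builds counter are hypothetical names, and functools.cached_property stands in for whatever caching mechanism pandas uses internally.

```python
import sys
from functools import cached_property


class ToyIndex:
    """Hypothetical stand-in for an index with a lazily built engine."""

    def __init__(self, values):
        self.values = list(values)
        self.engine_builds = 0  # counts how often the engine is constructed

    @cached_property
    def engine(self):
        # Expensive in a real index; here just a value -> position lookup.
        self.engine_builds += 1
        return {v: i for i, v in enumerate(self.values)}

    def memory_usage(self):
        total = sys.getsizeof(self.values)
        # cached_property stores its result in the instance __dict__, so this
        # membership test detects a cached engine without building one.
        if "engine" in self.__dict__:
            total += sys.getsizeof(self.engine)
        return total


idx = ToyIndex(range(1000))
before = idx.memory_usage()  # engine not built, so it is not counted
_ = idx.engine               # first access builds and caches the engine
after = idx.memory_usage()   # engine is cached, so it is counted now
```

In this sketch, asking for the memory usage never constructs the engine as a side effect; the engine is included in the total only after something else has already built it.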
Side note: index._engine.sizeof() doesn't account for the contents of index._engine.values. If it should, a separate issue can be opened. In the example above, the additional memory that is used but not reported would be:
In [8]: idx._engine.values
Out[8]: array([ 262402, 262403, 262404, ..., 131331299, 131331300, 131331301], dtype=uint64)

In [9]: idx._engine.values.shape
Out[9]: (100000000,)

In [15]: idx._engine.values.nbytes
Out[15]: 800000000
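As a sanity check, the nbytes figure follows directly from the shape and dtype shown above: one uint64 per row of the product. The snippet below only redoes that arithmetic with the numbers from this example (the variable names are illustrative) and compares it with the value memory_usage() reported.

```python
# Sizes taken from the example above.
n_rows = 1000 * 1000 * 100           # length of the MultiIndex product
bytes_per_value = 8                   # itemsize of uint64
engine_values_nbytes = n_rows * bytes_per_value

reported = 500016953                  # value returned by idx.memory_usage()
ratio = engine_values_nbytes / reported

print(engine_values_nbytes)           # 800000000
print(round(ratio, 2))
```

So the unreported array would be roughly 1.6 times the size that memory_usage() currently reports for this index.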