PERF: don't call RangeIndex._data unnecessarily by topper-123 · Pull Request #26565 · pandas-dev/pandas

I've looked into RangeIndex and found that it creates and caches an int64 array the first time the RangeIndex._data property is accessed. This means that in many cases a RangeIndex has the same memory consumption and the same speed as an Int64Index.
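To illustrate the caching behaviour described above, here is a minimal toy sketch. `ToyRangeIndex` is a hypothetical stand-in written for this example, not pandas code, and it uses a plain Python list where pandas would use an int64 array. It only shows the general pattern: the range itself is tiny, but the first access to `_data` materializes every element and keeps the result on the instance.

```python
import sys
from functools import cached_property


class ToyRangeIndex:
    """Hypothetical stand-in for a range-backed index; not pandas code."""

    def __init__(self, n: int) -> None:
        self._range = range(n)

    @cached_property
    def _data(self) -> list:
        # Materializes every element on first access and keeps the result.
        return list(self._range)


idx = ToyRangeIndex(1_000_000)
size_before = sys.getsizeof(idx._range)   # small, independent of n
cached_before = "_data" in vars(idx)      # nothing has been cached yet
_ = idx._data                             # first access builds the full list
cached_after = "_data" in vars(idx)       # the list now lives on the instance
size_after = sys.getsizeof(idx._data)     # grows with the number of elements
```

Once any method touches `_data`, the memory advantage of the range-backed representation is gone for the lifetime of the object, which is why avoiding that access matters.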

This PR improves on that situation by giving RangeIndex custom .get_loc and ._format_with_header methods. These avoid the calls to ._data in some cases, which improves both speed and memory consumption (see the performance improvements below). There are probably other cases where RangeIndex._data can be avoided, which I'll investigate over the coming days.
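As a rough sketch of why a custom lookup can skip the array entirely: the position of a label in a range follows from its start, stop, and step by plain arithmetic. The helper below, `range_get_loc`, is a hypothetical name used for illustration and is not the code in this PR. It assumes integer keys and a non-zero step.

```python
def range_get_loc(start: int, stop: int, step: int, key: int) -> int:
    """Hypothetical helper: position of `key` in range(start, stop, step).

    Computed arithmetically, so no array is ever built.
    """
    # How many steps away from `start` is the key, and does it land exactly?
    pos, rem = divmod(key - start, step)
    # The key is absent if it falls between steps or outside the range.
    if rem != 0 or pos < 0 or pos >= len(range(start, stop, step)):
        raise KeyError(key)
    return pos


pos_forward = range_get_loc(0, 1_000_000, 1, 900_000)   # label equals position
pos_backward = range_get_loc(10, 0, -2, 6)              # range is 10, 8, 6, 4, 2

try:
    range_get_loc(0, 10, 2, 3)   # 3 falls between the steps 2 and 4
    missing_raises = False
except KeyError:
    missing_raises = True
```

The cost is a few integer operations regardless of the index length, whereas a lookup that goes through a materialized array first has to allocate and fill that array.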

    %timeit pd.RangeIndex(1_000_000).get_loc(900_000)
    8.95 ms ± 485 µs per loop  # master
    4.31 µs ± 303 ns per loop  # this PR

    rng = pd.RangeIndex(1_000_000)
    %timeit rng.get_loc(900_000)
    17.3 µs ± 392 ns per loop  # master
    547 ns ± 8.26 ns per loop  # this PR. get_loc is now lightning fast

    df = pd.DataFrame({'a': range(1_000_000)})
    %timeit df.loc[800_000: 900_000]
    132 µs ± 5.79 µs per loop  # master
    89 µs ± 2.95 µs per loop  # this PR