PERF: don't call RangeIndex._data unnecessarily by topper-123 · Pull Request #26565 · pandas-dev/pandas
- closes #xxxx
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
I've looked into RangeIndex and found that the index type creates and caches an int64 array whenever the RangeIndex._data property is accessed. This means that in many cases a RangeIndex has the same memory consumption and the same speed as an Int64Index.
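The caching behaviour described above can be pictured with a minimal pure-Python sketch. This is not pandas code: the class name `LazyRangeIndex` is hypothetical, and a plain list stands in for the int64 array. It only shows how a cached property turns a cheap range into a fully materialized container on first access.

```python
from functools import cached_property


class LazyRangeIndex:
    """Hypothetical stand-in for a range-backed index (not the pandas class)."""

    def __init__(self, stop: int) -> None:
        # Only start/stop/step are stored, so memory use does not grow with stop.
        self._range = range(stop)

    @cached_property
    def _data(self) -> list:
        # First access materializes every value and caches the result on the
        # instance, so from then on the object holds one element per position.
        return list(self._range)


idx = LazyRangeIndex(1_000)
materialized_before = "_data" in vars(idx)  # nothing cached yet
_ = idx._data                               # triggers materialization
materialized_after = "_data" in vars(idx)   # cached on the instance now
```

Any method that goes through `_data` pays the materialization cost once and keeps the memory afterwards, which is why avoiding that access matters.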
This PR improves on that situation by giving RangeIndex custom .get_loc and ._format_with_header methods. These avoid the calls to ._data in some cases, which improves both speed and memory consumption (see performance improvements below). There are probably other cases where RangeIndex._data can be avoided, which I'll investigate over the coming days.
```python
%timeit pd.RangeIndex(1_000_000).get_loc(900_000)
8.95 ms ± 485 µs per loop  # master
4.31 µs ± 303 ns per loop  # this PR

rng = pd.RangeIndex(1_000_000)
%timeit rng.get_loc(900_000)
17.3 µs ± 392 ns per loop  # master
547 ns ± 8.26 ns per loop  # this PR. get_loc is now lightning fast

df = pd.DataFrame({'a': range(1_000_000)})
%timeit df.loc[800_000: 900_000]
132 µs ± 5.79 µs per loop  # master
89 µs ± 2.95 µs per loop  # this PR
```
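To illustrate why a lookup on a range-backed index needs no materialized array, here is a minimal sketch using only Python's built-in `range`. The function name `range_get_loc` is hypothetical and this is not necessarily how the PR implements `.get_loc`. It shows that the position of a label follows from start, stop and step alone.

```python
def range_get_loc(rng: range, key: int) -> int:
    """Return the position of `key` in `rng` without building an array.

    Hypothetical helper for illustration; raises KeyError for missing labels,
    mirroring the usual index lookup convention.
    """
    # Reject bools and non-integers up front so that e.g. True is not
    # silently treated as the label 1.
    if isinstance(key, bool) or not isinstance(key, int):
        raise KeyError(key)
    try:
        # For plain ints, CPython's range.index computes the position
        # arithmetically from start/stop/step rather than scanning.
        return rng.index(key)
    except ValueError as err:
        raise KeyError(key) from err


position = range_get_loc(range(1_000_000), 900_000)
stepped_position = range_get_loc(range(10, 100, 5), 25)
```

Because no array is created, the cost of such a lookup does not depend on the length of the range.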