PERF: Index.__getitem__ performance issue · Issue #6370 · pandas-dev/pandas

Once again, caused by #6328 investigation.

There's something very strange with how Index objects handle slices:

In [1]: import pandas.util.testing as tm

In [2]: idx = tm.makeStringIndex(1000000)

In [3]: timeit idx[:-1]
100000 loops, best of 3: 2 µs per loop

In [4]: timeit idx[slice(None,-1)]
100 loops, best of 3: 6.5 ms per loop

Obviously, this happens because Index doesn't override the __getslice__ provided by ndarray. As a result, idx[:-1] is executed via ndarray.__getslice__ -> Index.__array_finalize__, while idx[slice(None, -1)] goes via Index.__getitem__ -> Index.__new__.
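
The dispatch difference can be checked with a toy class. This is a minimal sketch, not pandas code: `Recorder` and its `calls` list are hypothetical names made up for illustration. It only records which hook the interpreter picks for each spelling. Note that `__getslice__` exists only on Python 2; Python 3 removed it, so there both spellings reach `__getitem__`.

```python
import sys


class Recorder(object):
    """Hypothetical toy class (not pandas code) that logs which hook handles indexing."""

    def __init__(self):
        self.calls = []

    def __getitem__(self, key):
        self.calls.append("__getitem__")
        return key

    # Python 2 consults this hook for plain obj[a:b] syntax.
    # Python 3 never calls it; obj[a:b] goes to __getitem__ with a slice object.
    def __getslice__(self, start, stop):
        self.calls.append("__getslice__")
        return (start, stop)


r = Recorder()
r[:-1]               # slice syntax
r[slice(None, -1)]   # explicit slice object
hooks = list(r.calls)
print(sys.version_info[0], hooks)
```

An explicit slice object always lands in `__getitem__`, which is why the programmatic spelling cannot avoid whatever work `Index.__getitem__` does.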

__getitem__ is made more than 1000x slower (about 3000x in the timings above) by trying to infer the data type of the sliced result and convert it to a different subclass. The irony is that the interactive form idx[:-1], where a milliseconds-vs-microseconds difference doesn't matter, is likely to miss this feature, because it's dispatched via __getslice__. The programmatic form idx[slice(None, -1)], where the cost does matter, hits this soft spot, and I'd argue that the type conversion magic is not at all necessary there.
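
To make the suggestion concrete, here is a rough sketch of what skipping the inference for slices could look like. It is a pure-Python stand-in, not the pandas implementation: `ToyIndex`, `kind`, and `inference_runs` are hypothetical names, and the O(n) scan in `__init__` only imitates the cost of type inference in `Index.__new__`.

```python
class ToyIndex(object):
    """Hypothetical stand-in for an index type; not the pandas implementation."""

    inference_runs = 0  # counts how often the expensive O(n) scan happens

    def __init__(self, values, kind=None):
        self._values = list(values)
        if kind is None:
            # Expensive path: scan every element to infer a common type.
            ToyIndex.inference_runs += 1
            kinds = {type(v).__name__ for v in self._values}
            kind = kinds.pop() if len(kinds) == 1 else "object"
        self.kind = kind

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Fast path: reuse the already-known kind for the sliced result
            # instead of scanning the elements again.
            return ToyIndex(self._values[key], kind=self.kind)
        return self._values[key]


idx = ToyIndex(["a", "b", "c", "d"])
runs_before = ToyIndex.inference_runs
sub = idx[slice(None, -1)]
print(sub.kind, len(sub), ToyIndex.inference_runs - runs_before)
```

The trade-off is that a mixed ("object") index stays "object" after slicing even if the slice happens to be homogeneous, which is exactly the conversion magic being questioned here. The list copy in the toy is still O(n); a real ndarray slice would be a view.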

Is there a rationale behind this?