PERF: optimize index.__getitem__ for slice & boolean mask indexers by immerrr · Pull Request #6440 · pandas-dev/pandas

This patch fixes the performance issues discussed in #6370 and closes it.

There is an API change, though: type inference is no longer performed on the resulting index. For example, this is the behaviour on current master:

In [1]: pd.Index([1,2,3, 'a', 'b', 'c'])
Out[1]: Index([1, 2, 3, u'a', u'b', u'c'], dtype='object')

In [2]: _1[[0,1,2]]
Out[2]: Int64Index([1, 2, 3], dtype='int64')

And this is how it looks with the patch:

In [1]: pd.Index([1,2,3, 'a', 'b', 'c'])
Out[1]: Index([1, 2, 3, u'a', u'b', u'c'], dtype='object')

In [2]: _1[[0,1,2]]
Out[2]: Index([1, 2, 3], dtype='object')
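If the inferred type is still wanted after indexing, it can presumably be recovered by rebuilding the index from its values. The following is a minimal sketch, assuming a pandas version that includes this change; the variable names are illustrative and not part of the patch.

```python
import pandas as pd

mixed = pd.Index([1, 2, 3, 'a', 'b', 'c'])

# With the patch, indexing keeps the parent's object dtype.
subset = mixed[[0, 1, 2]]

# Rebuilding from a plain list lets the Index constructor infer the dtype again.
reinferred = pd.Index(subset.tolist())

print(subset.dtype)
print(reinferred.dtype)
```

The extra constructor call makes the inference cost explicit and opt-in, rather than something every `__getitem__` call pays for.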

On the bright side, mixed-type indices seem to be a fairly rare scenario, so I'd say the performance improvement is worth it.
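One way to see why skipping inference can pay off is to compare what each operation has to touch. This sketch works on a bare NumPy object array rather than on pandas internals, so it is an interpretation of the trade-off and not a description of the patch itself; all names are illustrative.

```python
import numpy as np
from pandas.api.types import infer_dtype

values = np.array([1, 2, 3, 'a', 'b', 'c'], dtype=object)

# Basic slicing returns a view: no elements are copied or inspected.
sliced = values[::2]
is_view = np.shares_memory(values, sliced)

# Boolean-mask indexing has to copy the selected elements.
mask = np.arange(len(values)) % 3 == 0
masked = values[mask]
is_copy = not np.shares_memory(values, masked)

# Type inference has to look at the elements themselves.
kind = infer_dtype(values[:3], skipna=False)

print(is_view, is_copy, kind)
```

A slice that stays a view costs the same regardless of length, whereas any step that scans or copies elements grows with the size of the result.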

Performance Before:

In [1]: import pandas.util.testing as tm

In [2]: idx = tm.makeStringIndex(1000000)

In [3]: mask = np.arange(1000000) % 3 == 0

In [4]: series_mask = pd.Series(mask); mask
Out[4]: array([ True, False, False, ..., False, False, True], dtype=bool)

In [5]: timeit idx[:-1]
100000 loops, best of 3: 1.57 µs per loop

In [6]: timeit idx[::2]
100 loops, best of 3: 14.6 ms per loop

In [7]: timeit idx[mask]
100 loops, best of 3: 15.4 ms per loop

In [8]: timeit idx[series_mask]
100 loops, best of 3: 15.4 ms per loop

In [9]: pd.__version__
Out[9]: '0.13.1-278-gaf63b99'

Performance After:

In [1]: import pandas.util.testing as tm

In [2]: idx = tm.makeStringIndex(1000000)

In [3]: mask = np.arange(1000000) % 3 == 0

In [4]: series_mask = pd.Series(mask)

In [5]: timeit idx[:-1]
1000000 loops, best of 3: 1.56 µs per loop

In [6]: timeit idx[::2]
100000 loops, best of 3: 4.89 µs per loop

In [7]: timeit idx[mask]
100 loops, best of 3: 11.4 ms per loop

In [8]: timeit idx[series_mask]
100 loops, best of 3: 11.4 ms per loop

In [9]: pd.__version__
Out[9]: '0.13.1-279-gbc810f0'
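For anyone who wants to repeat these measurements outside IPython, here is a rough standalone sketch using the standard `timeit` module. The helper `make_string_index` is a hypothetical stand-in for `tm.makeStringIndex` (it builds sequential rather than random strings), and `bench` is likewise an illustrative name; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_string_index(n):
    # Hypothetical stand-in for tm.makeStringIndex: n distinct strings.
    return pd.Index(['s%08d' % i for i in range(n)])


def bench(n=1000000, number=10, repeat=3):
    idx = make_string_index(n)
    mask = np.arange(n) % 3 == 0
    series_mask = pd.Series(mask)
    cases = {
        'idx[:-1]': lambda: idx[:-1],
        'idx[::2]': lambda: idx[::2],
        'idx[mask]': lambda: idx[mask],
        'idx[series_mask]': lambda: idx[series_mask],
    }
    results = {}
    for label, fn in cases.items():
        # Best of `repeat` runs, reported as seconds per call.
        best = min(timeit.repeat(fn, number=number, repeat=repeat))
        results[label] = best / number
    return results


if __name__ == '__main__':
    for label, seconds in bench().items():
        print('%-18s %.3g s per call' % (label, seconds))
```

Running the script on two checkouts (before and after the patch) gives a like-for-like comparison of the four indexers timed above.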