PERF: optimize index.__getitem__ for slice & boolean mask indexers by immerrr · Pull Request #6440 · pandas-dev/pandas
This patch fixes the performance issues discussed in, and closes, #6370.
There's an API change though: type inference no longer happens on the resulting index. For example, this is the behavior on current master:
In [1]: pd.Index([1,2,3, 'a', 'b', 'c'])
Out[1]: Index([1, 2, 3, u'a', u'b', u'c'], dtype='object')
In [2]: _1[[0,1,2]]
Out[2]: Int64Index([1, 2, 3], dtype='int64')
And this is how it looks with the patch:
In [1]: pd.Index([1,2,3, 'a', 'b', 'c'])
Out[1]: Index([1, 2, 3, u'a', u'b', u'c'], dtype='object')
In [2]: _1[[0,1,2]]
Out[2]: Index([1, 2, 3], dtype='object')
On the bright side, mixed-type indices seem to be quite rare, and I'd say the performance improvement is worth it.
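For code that relied on the old inference, the narrower dtype can still be recovered explicitly. The following is a minimal sketch and not part of the patch: rebuilding the index from a plain list via `tolist()` is an assumed workaround, the variable names are illustrative, and the printed dtypes assume a pandas version that includes this change.

```python
import pandas as pd

# Mixed-type index: stored with dtype=object.
idx = pd.Index([1, 2, 3, 'a', 'b', 'c'])

# With the patch, positional selection keeps the parent's dtype
# instead of re-inferring it from the selected values.
sub = idx[[0, 1, 2]]

# Assumed workaround: rebuild from a plain Python list so that the
# Index constructor runs its usual type inference again.
reinferred = pd.Index(sub.tolist())

print(sub.dtype)         # object
print(reinferred.dtype)  # int64
```

The extra pass over the values is paid only where the caller asks for it, rather than on every `__getitem__` call.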
Performance Before:
In [1]: import pandas.util.testing as tm
In [2]: idx = tm.makeStringIndex(1000000)
In [3]: mask = np.arange(1000000) % 3 == 0
In [4]: series_mask = pd.Series(mask); mask
Out[4]: array([ True, False, False, ..., False, False, True], dtype=bool)
In [5]: timeit idx[:-1]
100000 loops, best of 3: 1.57 µs per loop
In [6]: timeit idx[::2]
100 loops, best of 3: 14.6 ms per loop
In [7]: timeit idx[mask]
100 loops, best of 3: 15.4 ms per loop
In [8]: timeit idx[series_mask]
100 loops, best of 3: 15.4 ms per loop
In [9]: pd.__version__
Out[9]: '0.13.1-278-gaf63b99'
Performance After:
In [1]: import pandas.util.testing as tm
In [2]: idx = tm.makeStringIndex(1000000)
In [3]: mask = np.arange(1000000) % 3 == 0
In [4]: series_mask = pd.Series(mask)
In [5]: timeit idx[:-1]
1000000 loops, best of 3: 1.56 µs per loop
In [6]: timeit idx[::2]
100000 loops, best of 3: 4.89 µs per loop
In [7]: timeit idx[mask]
100 loops, best of 3: 11.4 ms per loop
In [8]: timeit idx[series_mask]
100 loops, best of 3: 11.4 ms per loop
In [9]: pd.__version__
Out[9]: '0.13.1-279-gbc810f0'
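The comparison above can also be rerun outside IPython with the standard `timeit` module. This is a rough sketch under several assumptions: the index is built with a list comprehension as a stand-in for `tm.makeStringIndex` (which may not be available in every pandas version), the size is reduced to keep the run short, and `bench` is a hypothetical helper. Absolute timings will differ by machine and pandas version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000  # smaller than the 1,000,000 used above, to keep the run short

# Stand-in for tm.makeStringIndex(N): N distinct string labels.
idx = pd.Index(["label_%d" % i for i in range(N)])
mask = np.arange(N) % 3 == 0
series_mask = pd.Series(mask)


def bench(func, number=100, repeat=3):
    """Hypothetical helper: best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Same four indexers as in the sessions above.
results = {
    "idx[:-1]": bench(lambda: idx[:-1]),
    "idx[::2]": bench(lambda: idx[::2]),
    "idx[mask]": bench(lambda: idx[mask]),
    "idx[series_mask]": bench(lambda: idx[series_mask]),
}

for label, seconds in results.items():
    print("%-18s %10.2f us per call" % (label, seconds * 1e6))
```

Running the script once on each commit gives a before/after table comparable to the two sessions above.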