loc very slow on sorted, non-unique index with list of labels as argument · Issue #9466 · pandas-dev/pandas

In [1]: import pandas, numpy

In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))

In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop

In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop

... makes sense to me.

In [5]: df.index = list(range(99999)) + [55555]

In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop

In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop

Non-unique index, slower (the second call probably has to scan the whole index): still makes sense to me. Sorting should improve things...

In [8]: df.sort(inplace=True)

In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop

In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop

... here I'm lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly, both calls return the same rows:

In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]: 
          0     1     2     3
55555  True  True  True  True
55555  True  True  True  True
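To illustrate why sorting could be expected to help a list lookup as well, here is a minimal pure-Python sketch. It is not pandas' actual implementation. The helper names `positions_sorted` and `positions_scan` are made up for this example. On a sorted sequence with duplicates, two binary searches find every position of a label without scanning the whole index:

```python
from bisect import bisect_left, bisect_right

def positions_sorted(index, label):
    """Positions of `label` in a SORTED list, found with two binary searches."""
    lo = bisect_left(index, label)    # first position where label could go
    hi = bisect_right(index, label)   # one past the last occurrence of label
    return list(range(lo, hi))

def positions_scan(index, label):
    """Positions of `label` found by scanning every entry (works unsorted)."""
    return [i for i, value in enumerate(index) if value == label]

# Same index as in the session above, after sorting: 55555 appears twice.
index = sorted(list(range(99999)) + [55555])

print(positions_sorted(index, 55555))  # [55555, 56556 - 1000] -> [55555, 55556]
print(positions_sorted(index, 55555) == positions_scan(index, 55555))  # True
```

Both helpers return the same positions, but the sorted variant only needs a number of comparisons logarithmic in the index length, which is why a lookup with a one-element list might be expected to cost about the same as the scalar lookup once the index is sorted.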

(As a side note: the reason I'm making calls such as df.loc[[a_label]] is that df.loc[a_label] sometimes returns a Series and sometimes a DataFrame, depending on whether the label occurs once or several times. I currently work around this with df.loc[df.index == a_label], which is ~3x slower than df.loc[a_label] but much faster than the df.loc[[a_label]] above.)
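The boolean-mask workaround from the side note can be sketched as follows. This is a small illustration on a toy frame, not a benchmark. The helper name `rows_for` is hypothetical, and the small index is chosen only so that one label is unique and another is duplicated:

```python
import numpy as np
import pandas as pd

# Toy frame with a sorted, non-unique index: label 2 occurs twice.
df = pd.DataFrame(np.random.random((6, 2)), index=[0, 1, 2, 2, 3, 4])

def rows_for(frame, label):
    """Select rows by label with a boolean mask; the result is always a DataFrame."""
    return frame.loc[frame.index == label]

# Scalar .loc changes its return type with the number of matches...
print(type(df.loc[1]).__name__)  # Series
print(type(df.loc[2]).__name__)  # DataFrame

# ...while the mask-based helper does not.
single = rows_for(df, 1)
repeated = rows_for(df, 2)
print(single.shape, repeated.shape)  # (1, 2) (2, 2)
```

The mask compares every index entry against the label, so it costs a full pass over the index, but it gives a consistent return type regardless of how many rows match.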