loc very slow on sorted, non-unique index with list of labels ar argument · Issue #9466 · pandas-dev/pandas (original) (raw)
In [1]: import pandas, numpy
In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))
In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop
In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop
... makes sense to me.
In [5]: df.index = list(range(99999)) + [55555]
In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop
In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop
Non-unique index, slower (the second call probably has to scan all the index): still makes sense to me. Sorting should improve things...
In [8]: df.sort(inplace=True)
In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop
In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop
... here I'm lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,
In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]:
0 1 2 3
55555 True True True True
55555 True True True True
(As a sidenote: the reason why I'm doing calls such as df.loc[[a_label]] is that df.loc[a_label] will return sometimes a Series, sometimes a DataFrame. I currently solve this by using df.loc[df.index == a_label], which is however ~3x slower than df.loc[a_label] - but much faster than the above df.loc[[a_label]].)