PERF: speed up CategoricalIndex.get_loc by topper-123 · Pull Request #23235 · pandas-dev/pandas

closes #20395 (PERF: df.loc is 100x slower for CategoricalIndex than for normal Index)
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is the final piece of the puzzle for #20395. It speeds up CategoricalIndex.get_loc for monotonic increasing indexes.

This PR supersedes #21699, so #21699 should be closed.

The problem with the current get_loc implementation is that CategoricalIndex._engine constantly recodes int8 arrays to int64 arrays:

@cache_readonly
def _engine(self):

    # we are going to look things up with the codes themselves
    return self._engine_type(lambda: self.codes.astype('i8'), len(self))

Notice the lambda: self.codes.astype('i8') part: every time _engine.vgetter is called, a new int64 array is created. This is expensive. Ideally we would always use the codes array's original dtype (int8, int16, etc.) and avoid the conversion.
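To illustrate the cost described above, here is a minimal NumPy-only sketch. The names `codes` and `values_getter` are stand-ins invented for this example, not pandas internals. It only shows that `astype('i8')` on an int8 array allocates a fresh array that is eight times larger on every call.

```python
import numpy as np

# Stand-in for CategoricalIndex.codes: small integer codes stored as int8.
codes = np.repeat(np.array([0, 1, 2], dtype=np.int8), 1000)


def values_getter():
    # Mirrors the lambda passed to the engine: a new int64 array per call.
    return codes.astype('i8')


first = values_getter()
second = values_getter()

# Each call returns its own buffer; nothing is shared or cached.
copies_are_distinct = first is not second and not np.shares_memory(first, second)
shares_with_codes = np.shares_memory(first, codes)

# int64 uses 8 bytes per element, int8 uses 1.
size_ratio = first.nbytes // codes.nbytes
```

For a multi-million-element index, that allocation and copy is repeated on each such call, which is the overhead this PR aims to remove.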

A complicating issue is that it is not enough to just avoid the int64 conversion, because array.searchsorted apparently needs an input whose dtype is compatible with the array's dtype, or else it is also very slow:

n = 1_000_000
ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
%timeit ci.codes.searchsorted(1)  # search for left location of 'b'
7.38 ms  # slow
code = np.int8(1)
%timeit ci.codes.searchsorted(code)
2.57 µs  # fast
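The fast path in the timing above amounts to casting the key to the array's own dtype before searching. A small sketch of that idea follows. The helper name `searchsorted_same_dtype` is hypothetical, and the explanation that NumPy otherwise converts the whole array to a common dtype is an assumption rather than something stated in this PR.

```python
import numpy as np

# Sorted int8 codes, as a monotonic CategoricalIndex would have.
codes = np.repeat(np.array([0, 1, 2], dtype=np.int8), 5)


def searchsorted_same_dtype(sorted_codes, key, side="left"):
    # Build a scalar of the array's dtype (np.int8 here) so the array
    # itself does not need to be converted for the comparison.
    typed_key = sorted_codes.dtype.type(key)
    return int(sorted_codes.searchsorted(typed_key, side=side))


left = searchsorted_same_dtype(codes, 1)                 # first position of code 1
right = searchsorted_same_dtype(codes, 1, side="right")  # one past the last position
```

The result is the same as with a plain Python int key. Only the dtype of the key changes.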

Solution

As CategoricalIndex.codes may be int8, int16, etc., the solution must
(1) have an indexing engine for each integer dtype and
(2) have the code for the key be translated into the same dtype as the codes array before calling searchsorted.

This PR does that, essentially.
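As a rough illustration of point (2), the lookup on a monotonic codes array can be sketched in plain NumPy. This is not the engine code from this PR: `get_loc_monotonic` and its arguments are hypothetical, the categories are modelled as a plain list, and the per-dtype engines of point (1) are not shown.

```python
import numpy as np


def get_loc_monotonic(categories, codes, key):
    """Hypothetical helper: locate `key` in sorted `codes` as a slice."""
    # Translate the key into its integer code.
    try:
        code = categories.index(key)
    except ValueError:
        raise KeyError(key) from None

    # Cast the code to the codes array's dtype before searching.
    typed_code = codes.dtype.type(code)
    left = int(codes.searchsorted(typed_code, side="left"))
    right = int(codes.searchsorted(typed_code, side="right"))

    # An empty range means the category exists but has no entries.
    if left == right:
        raise KeyError(key)
    return slice(left, right)


categories = ["a", "b", "c"]
codes = np.repeat(np.array([0, 1, 2], dtype=np.int8), 4)
loc = get_loc_monotonic(categories, codes, "b")
```

Both searches are binary searches on the original int8 array, so no int64 copy of the codes is ever made.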

Performance improvement examples

n = 100_000
ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
%timeit ci.get_loc('b')
2.05 ms  # master
8.96 µs  # this PR

n = 1_000_000
ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
%timeit ci.get_loc('b')
18.7 ms  # master
9.09 µs  # this PR

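So in these measurements get_loc goes from O(n) scaling on master to effectively constant time with this PR: a 10x larger index takes about 9x longer on master but about the same time here.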

The indexing_engines.py ASV results:

[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running indexing_engines.NumericEngineIndexing.time_get_loc                                                        7.18±1μs;...
[100.00%] ··· Running indexing_engines.ObjectEngineIndexing.time_get_loc                                                       7.04±0.4μs;...