PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 by topper-123 · Pull Request #21699 · pandas-dev/pandas

This is the final piece of the puzzle for #20395.

The problem with the current implementation is that CategoricalIndex._engine constantly recodes int8 arrays to int64 arrays:

    @cache_readonly
    def _engine(self):
        # we are going to look things up with the codes themselves
        return self._engine_type(lambda: self.codes.astype('i8'), len(self))

Notice the lambda: self.codes.astype('i8') part: every time _engine.vgetter is called, a new int64 array is created. This is very expensive; ideally we would always use the original array dtype (int8, int16 or whatever) and avoid the conversion.
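To make the cost concrete, here is a minimal sketch in plain NumPy. The `codes` array and the `get_values` name are stand-ins invented for illustration; they only mimic the shape of the callable passed to the engine above and are not pandas internals.

```python
import numpy as np

# Stand-in for CategoricalIndex.codes: three categories stored as int8.
codes = np.repeat(np.array([0, 1, 2], dtype=np.int8), 1_000_000)

# Same shape as the callable handed to the engine in the snippet above
# (hypothetical name, for illustration only).
get_values = lambda: codes.astype('i8')

first = get_values()
second = get_values()

# Each call allocates an independent int64 copy of the codes ...
copies_are_distinct = not np.shares_memory(first, second)
# ... and every copy is eight times the size of the int8 original.
size_ratio = first.nbytes // codes.nbytes

print(copies_are_distinct, size_ratio)
```

With three million int8 codes, each call therefore allocates and fills roughly 24 MB, regardless of how small the lookup itself is.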

A complicating issue is that avoiding the int64 conversion is not enough on its own, as array.searchsorted apparently needs a key of a compatible dtype, or else it is also very slow:

    n = 1_000_000
    ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
    %timeit ci.codes.searchsorted(1)  # search for left location of 'b'
    7.38 ms  # slow
    code = np.int8(1)
    %timeit ci.codes.searchsorted(code)
    2.57 µs  # fast
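One way to guarantee the fast path from the timing above is to cast the key to the dtype of the codes before searching. The helper below is a hypothetical sketch (`match_key_dtype` is not a pandas or NumPy function), and it assumes the codes are sorted:

```python
import numpy as np

def match_key_dtype(codes, key):
    """Hypothetical helper: return `key` as a scalar with the dtype of `codes`."""
    info = np.iinfo(codes.dtype)
    if not info.min <= key <= info.max:
        # A code outside the dtype's range cannot occur in the array.
        raise KeyError(key)
    return codes.dtype.type(key)

codes = np.repeat(np.array([0, 1, 2], dtype=np.int8), 1000)  # sorted codes
code = match_key_dtype(codes, 1)  # np.int8(1) rather than a Python int

# Both bounds are found without converting `codes` to another dtype.
left = int(codes.searchsorted(code, side="left"))
right = int(codes.searchsorted(code, side="right"))
print(left, right)  # 1000 2000
```

The range check is a design choice of this sketch: it turns an out-of-range key into a KeyError instead of relying on NumPy's overflow behaviour, which differs between versions.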

Solution options

As CategoricalIndex.codes may be int8, int16, etc., the solution must (1) provide an indexing engine for each integer dtype, or one engine that accepts all integer dtypes rather than only int64, and (2) translate the key into the same dtype as the codes array before calling searchsorted. So either:

  1. Change Int64Engine into an IntEngine (i.e. one that accepts all integer dtypes)
  2. Make new IntEngine classes that are flexible enough to accept all integer dtypes, but defer to the Int64 version if/when needed (e.g. if codes is int8, but we only have algos.is_monotonic_int64 for checking monotonicity)
  3. Do everything in Python.

I assume option 1 is not desired, and presumably neither is option 3. In the updated PR I've made a proposal in Cython that attains the needed speed.
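For readers who want a feel for the shape of option 2, the following pure-Python sketch picks an engine class by the dtype of the codes and passes the codes through without a cast. Everything here (`_SketchEngine`, `ENGINE_BY_DTYPE`, `make_engine`) is a hypothetical illustration, not the Cython code in this PR, and it only handles sorted codes:

```python
import numpy as np

class _SketchEngine:
    """Hypothetical per-dtype engine; illustrative only, not the pandas class."""
    dtype = None

    def __init__(self, vgetter, n):
        self.vgetter = vgetter  # zero-argument callable returning the codes
        self.n = n

    def get_loc(self, key):
        values = self.vgetter()
        if values.dtype != self.dtype:
            raise TypeError("engine and codes dtype differ")
        key = values.dtype.type(key)  # match the key to the codes dtype
        left = int(values.searchsorted(key, side="left"))
        right = int(values.searchsorted(key, side="right"))
        if left == right:
            raise KeyError(key)
        return left if right - left == 1 else slice(left, right)

# One subclass per integer width (class names are made up for this sketch).
ENGINE_BY_DTYPE = {
    np.dtype(t): type(
        f"Sketch{np.dtype(t).name.capitalize()}Engine",
        (_SketchEngine,),
        {"dtype": np.dtype(t)},
    )
    for t in (np.int8, np.int16, np.int32, np.int64)
}

def make_engine(codes):
    # Choose the engine matching the codes dtype and hand the codes over as-is.
    return ENGINE_BY_DTYPE[codes.dtype](lambda: codes, len(codes))

codes = np.repeat(np.array([0, 1, 2], dtype=np.int8), 1000)  # sorted codes
engine = make_engine(codes)
loc = engine.get_loc(1)
print(type(engine).__name__, loc)
```

Here the value getter returns the original int8 array on every call, so no int64 copy is made, and the key is cast once per lookup.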

Benchmarks from asv_bench/indexing.py

      before           after         ratio
     [dc45fbaf]       [bc03b8bd]
-     1.54±0.02ms         9.80±0μs     0.01  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_incr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.