PERF: speed up CategoricalIndex.get_loc by topper-123 · Pull Request #23235 · pandas-dev/pandas
- closes PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
This is the final piece of the puzzle for #20395 and speeds up CategoricalIndex.get_loc
for monotonic increasing indexes.
This PR supersedes #21699, so #21699 should be closed.
The problem with the current get_loc implementation is that CategoricalIndex._engine
constantly recodes int8 arrays to int64 arrays:
    @cache_readonly
    def _engine(self):
        # we are going to look things up with the codes themselves
        return self._engine_type(lambda: self.codes.astype('i8'), len(self))
Notice the lambda: self.codes.astype('i8') part, which means that every time _engine.vgetter is called, a new int64 array is created. This is expensive; ideally we would always use the original array dtype (int8, int16 or whatever) and avoid this conversion.
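To illustrate the cost described above, here is a minimal NumPy-only sketch. The small `codes` array and the standalone `vgetter` function are stand-ins invented for this example; they are not the actual pandas engine objects.

```python
import numpy as np

# Small stand-in for CategoricalIndex.codes (int8 when there are few categories).
codes = np.array([0, 0, 1, 1, 2, 2], dtype=np.int8)


def vgetter():
    # Same expression as the lambda handed to the engine.
    return codes.astype('i8')


first = vgetter()
second = vgetter()

# Each call allocates a brand-new int64 array instead of reusing one.
fresh_copy_each_call = not np.shares_memory(first, second)
# The converted array takes 8 bytes per element instead of 1.
size_ratio = first.nbytes // codes.nbytes
```

Because every call copies and widens the whole array, the work grows with the length of the index even when only a single key is looked up.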
A complicating issue is that it is not enough to just avoid the int64 conversion, as array.searchsorted
apparently needs a dtype-compatible input, or else it is also very slow:
    n = 1_000_000
    ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))

    %timeit ci.codes.searchsorted(1)  # search for left location of 'b'
    7.38 ms  # slow

    code = np.int8(1)
    %timeit ci.codes.searchsorted(code)
    2.57 µs  # fast
Solution
As CategoricalIndex.codes may be int8, int16, etc, the solution must
(1) have an indexing engine for each integer dtype and
(2) have the code for the key be translated into the same dtype as the codes array before calling searchsorted.
This PR does that, essentially.
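Point (2) can be sketched with plain NumPy as follows. This is only an illustration of the idea, not the engine code added in this PR; the helper name `get_loc_monotonic` is hypothetical, and it assumes the codes are monotonic increasing.

```python
import numpy as np


def get_loc_monotonic(codes, code):
    """Return the slice of positions where codes == code.

    Hypothetical helper; assumes `codes` is monotonic increasing.
    """
    # Cast the key to the dtype of the codes array (int8, int16, ...),
    # so searchsorted does not have to deal with mismatched dtypes.
    key = codes.dtype.type(code)
    left = codes.searchsorted(key, side='left')
    right = codes.searchsorted(key, side='right')
    if left == right:
        # The code does not occur in the array.
        raise KeyError(code)
    return slice(int(left), int(right))


n = 1000
# Codes for 'a' * n + 'b' * n + 'c' * n, stored as int8.
codes = np.repeat(np.array([0, 1, 2], dtype=np.int8), n)
loc = get_loc_monotonic(codes, 1)  # positions of category code 1 ('b')
```

Both searchsorted calls are binary searches on the original int8 array, so no int64 copy is made during the lookup.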
Performance improvement examples
    n = 100_000
    ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
    %timeit ci.get_loc('b')
    2.05 ms  # master
    8.96 µs  # this PR

    n = 1_000_000
    ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
    %timeit ci.get_loc('b')
    18.7 ms  # master
    9.09 µs  # this PR
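So lookup time no longer grows with n: we go from O(n) performance to effectively O(1) performance (the remaining searchsorted call is a binary search, so strictly O(log n)).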
The indexing_engines.py ASV results:
[ 0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running indexing_engines.NumericEngineIndexing.time_get_loc 7.18±1μs;...
[100.00%] ··· Running indexing_engines.ObjectEngineIndexing.time_get_loc 7.04±0.4μs;...