PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed by topper-123 · Pull Request #21659 · pandas-dev/pandas
- progress towards #20395 (PERF: df.loc is 100x slower for CategoricalIndex than for normal Index)
- ASVs added
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
Currently, a lot of indexing/slicing operations on CategoricalIndex go through CategoricalIndex._create_categorical, which can be slow because the data._set_dtype call it makes is expensive. This PR improves performance by calling data._set_dtype less often, specifically skipping it when the new and old dtypes are equal.
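As a rough illustration, the change amounts to a fast path like the following. This is a minimal sketch using the names mentioned above, not the actual pandas internals, which handle more cases and have a different signature:

```python
def _create_categorical(data, dtype=None):
    """Sketch of the fast path only; the real method is a classmethod."""
    if dtype is None or data.dtype == dtype:
        # The requested dtype equals the existing one: reuse `data`
        # as-is instead of recoding it through the slow _set_dtype.
        return data
    return data._set_dtype(dtype)
```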
A complicating issue was that comparisons of CategoricalDtype were quite slow, so the improvements I saw in #20395 were offset by slowdowns in other places. To avoid this, CategoricalDtype.__eq__ needed to become smarter, and pandas has to pass around existing dtypes rather than only categories and ordered. This is ok, as CategoricalDtype is immutable. Reusing dtypes minimizes the need to call the expensive CategoricalDtype.__hash__ method in CategoricalDtype.__eq__ and makes comparisons much faster.
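The idea behind the faster comparison is roughly the following. This is a hedged sketch of the ordering of checks, not the actual implementation, which handles more cases (e.g. comparison against strings):

```python
def __eq__(self, other):
    # Cheap checks first; only compare the categories (the expensive
    # part) as a last resort.
    if self is other:
        # Identity is free, and hits often once the same dtype
        # instance is passed around instead of being rebuilt from
        # (categories, ordered) at each call site.
        return True
    if not isinstance(other, type(self)):
        return False
    if self.ordered != other.ordered:
        return False
    return self.categories.equals(other.categories)
```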
Some notable results
First setup:
```python
import pandas as pd

n = 100_000
c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n))
ci = pd.CategoricalIndex(c)
df = pd.DataFrame({'A': range(n * 3)}, index=ci)
sl = slice(n, n * 2)
```
Results:
```
%timeit c[sl]
13.9 µs  # master
4.43 µs  # this PR

%timeit ci[sl]
740 µs   # master
12.7 µs  # this PR

%timeit df.iloc[sl]
855 µs   # master
72.2 µs  # this PR

%timeit df.loc['b']
3.23 ms  # master
1.62 ms  # this PR
```
Benchmarks
benchmarks/indexing.py:
```
       before           after         ratio
     [36422a88]       [e0c62df0]
+      53.4±0μs         61.1±0μs     1.14  indexing.NumericSeriesIndexing.time_iloc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>)
-       477±4ms          414±4ms     0.87  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-      476±40ns         381±20ns     0.80  indexing.MethodLookup.time_lookup_iloc
-    1.23±0.2ms          367±0μs     0.30  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_decr')
-    1.29±0.2ms          344±5μs     0.27  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_incr')
-       115±8μs         19.5±2μs     0.17  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_decr')
-       122±2μs         19.5±0μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('non_monotonic')
-       125±8μs         19.8±2μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_incr')
-       121±2μs         13.4±0μs     0.11  indexing.CategoricalIndexIndexing.time_getitem_slice('non_monotonic')
-       129±2μs         12.5±0μs     0.10  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_incr')
-      195±20μs       13.4±0.2μs     0.07  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
```
benchmarks/categoricals.py:
```
       before           after         ratio
     [a620e725]       [04667f67]
+      10.4±0ms       11.7±0.5ms     1.12  categoricals.Concat.time_union
+        3.42μs           3.76μs     1.10  categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')
-        3.66μs           3.20μs     0.87  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')
-    20.8±0.7ms         13.9±0ms     0.67  categoricals.ValueCounts.time_value_counts(False)
-      23.4±0ms         12.2±0ms     0.52  categoricals.ValueCounts.time_value_counts(True)
-        13.3μs           5.86μs     0.44  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
-        13.3μs           4.88μs     0.37  categoricals.CategoricalSlicing.time_getitem_slice('non_monotonic')
-        17.1μs           6.12μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_incr')
-        17.1μs           6.11μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
-        15.5μs           4.39μs     0.28  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')
-        22.0μs           6.11μs     0.28  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
```
I haven't run the whole test suite, as that takes a long time (4-5 hours?) on my machine. I'd appreciate input first, and if I do need to run the whole suite, a pointer to a smarter way to do it.