PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed by topper-123 · Pull Request #21659 · pandas-dev/pandas
- progress towards #20395 (PERF: df.loc is 100x slower for CategoricalIndex than for normal Index)
- ASVs added
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
Currently, a lot of indexing/slicing operations on CategoricalIndex go through CategoricalIndex._create_categorical, which can be slow because the data._set_dtype call it makes is expensive. This PR improves performance by calling data._set_dtype less often, specifically skipping it when the new and old dtypes are equal.
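As a rough illustration, the change amounts to a fast path like the following. This is a minimal sketch using the names mentioned above, not the actual pandas internals, which handle more cases and have a different signature:

```python
def _create_categorical(data, dtype=None):
    """Sketch of the fast path only; the real method is a classmethod."""
    if dtype is None or data.dtype == dtype:
        # The requested dtype equals the existing one: reuse `data`
        # as-is instead of recoding it through the slow _set_dtype.
        return data
    return data._set_dtype(dtype)
```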
A complicating issue was that comparisons of CategoricalDtype were quite slow, so the improvements I saw in #20395 were offset by slowdowns in other places. To avoid this, CategoricalDtype.__eq__ needed to become smarter, and pandas has to pass around existing dtypes rather than only categories and ordered. This is ok, as CategoricalDtype is immutable. Reusing dtypes minimizes the need to call the expensive CategoricalDtype.__hash__ method in CategoricalDtype.__eq__ and makes comparisons much faster.
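The idea behind the faster comparison is roughly the following. This is a hedged sketch of the ordering of checks, not the actual implementation, which handles more cases (e.g. comparison against strings):

```python
def __eq__(self, other):
    # Cheap checks first; only compare the categories (the expensive
    # part) as a last resort.
    if self is other:
        # Identity is free, and hits often once the same dtype
        # instance is passed around instead of being rebuilt from
        # (categories, ordered) at each call site.
        return True
    if not isinstance(other, type(self)):
        return False
    if self.ordered != other.ordered:
        return False
    return self.categories.equals(other.categories)
```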
Some notable results
First setup:
```python
import pandas as pd

n = 100_000
c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n))
ci = pd.CategoricalIndex(c)
df = pd.DataFrame({'A': range(n * 3)}, index=ci)
sl = slice(n, n * 2)
```
Results:
```
%timeit c[sl]
13.9 µs  # master
4.43 µs  # this PR

%timeit ci[sl]
740 µs   # master
12.7 µs  # this PR

%timeit df.iloc[sl]
855 µs   # master
72.2 µs  # this PR

%timeit df.loc['b']
3.23 ms  # master
1.62 ms  # this PR
```
Benchmarks
benchmarks/indexing.py:
```
       before           after         ratio
     [36422a88]       [e0c62df0]
+      53.4±0μs         61.1±0μs     1.14  indexing.NumericSeriesIndexing.time_iloc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>)
-       477±4ms          414±4ms     0.87  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-      476±40ns         381±20ns     0.80  indexing.MethodLookup.time_lookup_iloc
-    1.23±0.2ms          367±0μs     0.30  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_decr')
-    1.29±0.2ms          344±5μs     0.27  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_incr')
-       115±8μs         19.5±2μs     0.17  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_decr')
-       122±2μs         19.5±0μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('non_monotonic')
-       125±8μs         19.8±2μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_incr')
-       121±2μs         13.4±0μs     0.11  indexing.CategoricalIndexIndexing.time_getitem_slice('non_monotonic')
-       129±2μs         12.5±0μs     0.10  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_incr')
-      195±20μs       13.4±0.2μs     0.07  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
```
benchmarks/categoricals.py:
```
       before           after         ratio
     [a620e725]       [04667f67]
+      10.4±0ms       11.7±0.5ms     1.12  categoricals.Concat.time_union
+        3.42μs           3.76μs     1.10  categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')
-        3.66μs           3.20μs     0.87  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')
-    20.8±0.7ms         13.9±0ms     0.67  categoricals.ValueCounts.time_value_counts(False)
-      23.4±0ms         12.2±0ms     0.52  categoricals.ValueCounts.time_value_counts(True)
-        13.3μs           5.86μs     0.44  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
-        13.3μs           4.88μs     0.37  categoricals.CategoricalSlicing.time_getitem_slice('non_monotonic')
-        17.1μs           6.12μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_incr')
-        17.1μs           6.11μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
-        15.5μs           4.39μs     0.28  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')
-        22.0μs           6.11μs     0.28  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
```
I haven't run the whole test suite, as that takes a long time (4-5 hours?) on my machine. I'd appreciate input first, and if I do need to run the whole suite, a pointer to a smarter way to do it.