PERF: Improve performance of CategoricalIndex.is_unique by topper-123 · Pull Request #21107 · pandas-dev/pandas

CategoricalIndex.is_unique creates an extraneous boolean array. Changing CategoricalIndex.is_unique to use CategoricalIndex._engine.is_unique avoids creating this array. As a side effect, is_monotonic* gets set for free, which saves time if that property is accessed later.

Demonstration

Setup:

>>> import pandas as pd
>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))

Currently, ci.is_unique is about the same (disregarding @cache_readonly) as:

>>> from pandas._libs.hashtable import duplicated_int64
>>> not duplicated_int64(ci.codes.astype('int64')).any()
False
>>> %timeit duplicated_int64(ci.codes.astype('int64')).any()
46.7 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Notice that duplicated_int64() creates a boolean array, which is not needed and slows the operation down.
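To make the cost concrete, here is a minimal pure-Python sketch of the mask-based strategy. It is not the Cython code in pandas; `duplicated_mask` and `is_unique_via_mask` are hypothetical names used only for illustration. The point is that a full-length boolean mask is materialized just to be reduced to a single bool.

```python
def duplicated_mask(codes):
    """Mark every value that repeats an earlier one (hypothetical helper)."""
    seen = set()
    mask = []  # one boolean per element: this is the extra allocation
    for code in codes:
        mask.append(code in seen)
        seen.add(code)
    return mask


def is_unique_via_mask(codes):
    """Uniqueness check that first builds the whole mask, then reduces it."""
    return not any(duplicated_mask(codes))


codes = [0, 0, 1, 1, 2, 2]  # small stand-in for ci.codes
print(duplicated_mask(codes))     # [False, True, False, True, False, True]
print(is_unique_via_mask(codes))  # False
```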

If we instead use ci._engine.is_unique to check for uniqueness, the check is roughly similar to:

>>> from pandas._libs.algos import is_monotonic_int64
>>> is_monotonic_int64(ci.codes.astype('int64'), False)
(True, False, False)  # (is_monotonic_inc, is_monotonic_dec, is_unique)
>>> %timeit is_monotonic_int64(ci.codes.astype('int64'), False)
23.3 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is faster than the other version, as the intermediate boolean array is not created. It is also (IMO) more idiomatic, as index._engine is in general supposed to be used for this kind of index content check.
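The single-pass idea can be sketched in pure Python as follows. This is an illustration under assumptions, not the pandas implementation; `monotonic_flags` is a hypothetical name. It walks adjacent pairs once, keeps three flags, and allocates no per-element array.

```python
def monotonic_flags(codes):
    """Return (is_monotonic_inc, is_monotonic_dec, is_unique) in one pass.

    In this sketch the third flag is only conclusive when the sequence is
    monotonic, because duplicates are detected by comparing neighbours.
    """
    is_inc = is_dec = is_unique = True
    for prev, cur in zip(codes, codes[1:]):
        if cur < prev:
            is_inc = False
        elif cur > prev:
            is_dec = False
        else:
            is_unique = False  # equal neighbours mean a duplicate
        if not is_inc and not is_dec:
            break  # neither monotonic direction holds, stop early
    return is_inc, is_dec, is_unique and (is_inc or is_dec)


print(monotonic_flags([0, 0, 1, 1, 2, 2]))  # (True, False, False)
print(monotonic_flags([0, 1, 2]))           # (True, False, True)
```

For non-monotonic input such as `[2, 0, 1]` the sketch returns `(False, False, False)` even though the values are distinct, so a caller would need a separate fallback check in that case.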