PERF: CategoricalDtype.update_dtype by mroeschke · Pull Request #59647 · pandas-dev/pandas

If a CategoricalDtype is passed to CategoricalDtype.update_dtype, the method unnecessarily re-validates the categories whenever they are not None.

CategoricalDtype.update_dtype is called in constructors like Categorical.__init__ and Categorical._simple_new, which attempt to update the passed dtype with ordered=False if it was None. A fully validated CategoricalDtype should just be returned as is when passed to update_dtype.
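The idea can be sketched with a toy class. This is only an illustration under stated assumptions: ToyCategoricalDtype, its _validate helper, and the validations counter are hypothetical stand-ins and do not reproduce the pandas implementation or the actual diff in this PR.

```python
from __future__ import annotations


class ToyCategoricalDtype:
    """Hypothetical stand-in for a categorical dtype; not the pandas class."""

    validations = 0  # counts how many times categories were validated

    def __init__(self, categories=None, ordered=None):
        if categories is not None:
            categories = self._validate(categories)
        self.categories = categories
        self.ordered = ordered

    @classmethod
    def _validate(cls, categories):
        # O(n) uniqueness check: the kind of work worth skipping
        # when the dtype has already been validated once.
        cls.validations += 1
        categories = tuple(categories)
        if len(set(categories)) != len(categories):
            raise ValueError("categories must be unique")
        return categories

    def update_dtype(self, dtype: ToyCategoricalDtype) -> ToyCategoricalDtype:
        # Fast path: the passed dtype is fully specified, so return it
        # unchanged instead of rebuilding (and re-validating) it.
        if dtype.categories is not None and dtype.ordered is not None:
            return dtype
        # Slow path: fill the missing pieces from self and rebuild.
        categories = dtype.categories if dtype.categories is not None else self.categories
        ordered = dtype.ordered if dtype.ordered is not None else self.ordered
        return ToyCategoricalDtype(categories, ordered)


full = ToyCategoricalDtype(categories=range(1000), ordered=True)
base = ToyCategoricalDtype(ordered=False)

before = ToyCategoricalDtype.validations
result = base.update_dtype(full)
```

In this sketch, result is the very same object as full and the validation counter does not change, while a dtype with ordered=None still goes through the slow path and picks up ordered=False from base.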

In [1]: import pandas as pd

In [2]: cdtype = pd.CategoricalDtype(categories=list(range(100_000)), ordered=True)

In [3]: base_dtype = pd.CategoricalDtype(ordered=False)

In [4]: %timeit base_dtype.update_dtype(cdtype)
2.5 μs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit base_dtype.update_dtype(cdtype)
865 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)