PERF: avoid unnecessary copies factorize by jbrockmendel · Pull Request #46109 · pandas-dev/pandas
The core.algorithms change should only affect non-64-bit cases. The Categorical change could help across the board.
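For background on why dtype width can matter for copies: converting an array to a different dtype has to allocate a new buffer, while a same-dtype conversion with `copy=False` can hand back the input itself. The NumPy-only sketch below illustrates that general behaviour. It is an assumption about why non-64-bit inputs are the affected ones, and it does not reproduce the pandas internals touched by this PR.

```python
import numpy as np

arr32 = np.arange(10, dtype="uint32")
arr64 = np.arange(10, dtype="uint64")

# Widening uint32 -> uint64 must allocate a new buffer, even with copy=False.
widened = arr32.astype("uint64", copy=False)
widened_shares = np.shares_memory(arr32, widened)

# A same-dtype conversion with copy=False returns the input array itself.
same = arr64.astype("uint64", copy=False)
same_shares = np.shares_memory(arr64, same)

print(widened_shares, same_shares)
```

Any code path that upcasts before doing its real work pays for that extra allocation, so skipping a redundant conversion would be expected to show up mainly on narrower dtypes.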
import numpy as np
import pandas as pd
arr = np.arange(10**5, dtype="uint32")
%timeit pd.factorize(arr)
2.27 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # <- main
1.08 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # <- PR
cat = pd.Categorical(np.arange(10**5))
%timeit cat.factorize()
2.67 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) # <- main
1.06 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # <- PR
cat2 = pd.Categorical(["A", "B", "C"] * 1000)
%timeit cat2.factorize()
108 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # <- main
30.9 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) # <- PR
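The timings above use IPython's `%timeit` magic. To rerun the comparison from a plain script, something like the following sketch with the standard-library `timeit` module should work. The helper `best_ms` and the repeat counts are hypothetical choices, not part of the PR, and absolute numbers will differ by machine and pandas version. The final lines also check that each result round-trips to its input, since a copy-avoiding change should not alter the output.

```python
import timeit

import numpy as np
import pandas as pd


def best_ms(func, number=20, repeat=5):
    """Best per-call time in milliseconds (hypothetical helper, not from the PR)."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e3


# Same inputs as the benchmarks in the PR description.
arr = np.arange(10**5, dtype="uint32")
cat = pd.Categorical(np.arange(10**5))
cat2 = pd.Categorical(["A", "B", "C"] * 1000)

timings = {
    "pd.factorize(uint32)": best_ms(lambda: pd.factorize(arr)),
    "cat.factorize()": best_ms(cat.factorize),
    "cat2.factorize()": best_ms(cat2.factorize),
}
for label, ms in timings.items():
    print(f"{label}: {ms:.3f} ms")

# Sanity check: codes index into uniques and reconstruct the original values.
codes, uniques = pd.factorize(arr)
roundtrip_ok = np.array_equal(uniques.take(codes), arr)

cat_codes, cat_uniques = cat2.factorize()
cat_roundtrip_ok = list(cat_uniques.take(cat_codes)) == list(cat2)
print(roundtrip_ok, cat_roundtrip_ok)
```

Running the script once on main and once on the PR branch gives the same before/after comparison as the `%timeit` figures.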