PERF: Series.nunique can compute unique, then remove na

Currently we first remove NaNs, then take the len of the result of Series.unique. Except for Series that are mostly null, it is more performant to switch the order of these operations:

import numpy as np
import pandas as pd

n = 100_000
part_nan = 10
# Each repeated block holds part_nan NaNs followed by 100 distinct ints.
ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)

%timeit ser.nunique()
%timeit (~np.isnan(ser.unique())).sum()

gives

104 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67 ms ± 567 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Changing part_nan to 100 gives

126 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
96.5 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On my machine, the two approaches are about equal when part_nan is 250 (~70% null values).
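For reference, the proposed order of operations can be sketched as a small helper; `nunique_unique_first` is a hypothetical name, not an existing pandas API, and `pd.isna` is used instead of `np.isnan` so the sketch also works for extension dtypes:

```python
import numpy as np
import pandas as pd

def nunique_unique_first(ser: pd.Series) -> int:
    # Proposed order: compute unique values first, then discard NaN,
    # rather than dropping NaNs before computing uniques.
    uniques = ser.unique()
    return int((~pd.isna(uniques)).sum())

ser = pd.Series([1.0, np.nan, 2.0, 2.0, np.nan])
# Matches the existing behavior (nunique drops NA by default).
assert nunique_unique_first(ser) == ser.nunique()
```

The result is unchanged; only the order of the unique computation and the NA filtering differs, which is what shifts the crossover point with the fraction of null values.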