PERF: Avoid materializing entire IntervalIndex when using cut · Issue #27668 · pandas-dev/pandas

When using cut with an IntervalIndex for bins, the result is first materialized as a full-length IntervalIndex and only then converted to a Categorical:

    if isinstance(bins, IntervalIndex):
        # we have a fast-path here
        ids = bins.get_indexer(x)
        result = algos.take_nd(bins, ids)
        result = Categorical(result, categories=bins, ordered=True)
        return result, bins

It seems like it would be both faster and lighter on memory to skip the intermediate IntervalIndex built by take_nd and instead construct the Categorical directly from ids via Categorical.from_codes.
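A minimal sketch of the idea, using only public pandas API. The helper name cut_with_interval_bins is hypothetical and this is not the actual pandas source; it assumes the bins are non-overlapping and that x contains no missing values.

```python
import numpy as np
import pandas as pd


def cut_with_interval_bins(x, bins):
    """Hypothetical helper sketching the proposed fast path (not pandas' own code)."""
    # Position of the interval containing each value, or -1 when no bin matches.
    codes = bins.get_indexer(x)
    # -1 is also the Categorical code for "missing", so the indexer can be used
    # directly as codes without building a full-length IntervalIndex first.
    return pd.Categorical.from_codes(codes, categories=bins, ordered=True)


bins = pd.interval_range(0, 4)        # (0, 1], (1, 2], (2, 3], (3, 4]
x = np.array([0.5, 1.5, 3.5, 10.0])   # 10.0 lies outside every bin
result = cut_with_interval_bins(x, bins)
expected = pd.cut(x, bins)
```

Here result and expected should carry the same codes and categories; the difference is only in how the Categorical gets built.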

Some ad hoc measurements on master:

In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 278.39 MiB, increment: 130.76 MiB

And the same measurements with the Categorical.from_codes fix:

In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 145.81 MiB, increment: 15.98 MiB