PERF: Avoid materializing entire IntervalIndex when using cut
When using cut with an IntervalIndex for bins, the result is first materialized as an IntervalIndex and then converted to a Categorical:

    if isinstance(bins, IntervalIndex):
        # we have a fast-path here
        ids = bins.get_indexer(x)
        result = algos.take_nd(bins, ids)
        result = Categorical(result, categories=bins, ordered=True)
        return result, bins

It seems like it would be both faster and lighter on memory to skip the intermediate IntervalIndex built by take_nd and instead construct the Categorical directly from the indexer result via Categorical.from_codes.
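A rough sketch of the idea, using only public pandas API. The helper name cut_with_interval_bins is made up for illustration and is not the actual patch. It assumes the bins are non-overlapping, since get_indexer needs that, and relies on from_codes treating a code of -1 as missing:

```python
import numpy as np
import pandas as pd


def cut_with_interval_bins(x, bins):
    """Hypothetical helper: bin x into an IntervalIndex without take_nd."""
    # Position of the interval containing each value; -1 if no interval matches.
    codes = bins.get_indexer(x)
    # Build the Categorical straight from the integer codes, so no
    # per-element IntervalIndex is materialized along the way.
    return pd.Categorical.from_codes(codes, categories=bins, ordered=True)


ii = pd.interval_range(0, 20)
values = np.linspace(0, 20, 100)
result = cut_with_interval_bins(values, ii)
```

The codes already carry everything the Categorical needs, so the only per-element allocation should be the integer code array.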
Some ad hoc measurements on master:

    In [3]: ii = pd.interval_range(0, 20)

    In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

    In [5]: %timeit pd.cut(values, ii)
    7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [6]: %memit pd.cut(values, ii)
    peak memory: 278.39 MiB, increment: 130.76 MiB

And the same measurements with the Categorical.from_codes fix:

    In [3]: ii = pd.interval_range(0, 20)

    In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

    In [5]: %timeit pd.cut(values, ii)
    1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [6]: %memit pd.cut(values, ii)
    peak memory: 145.81 MiB, increment: 15.98 MiB