PERF: Avoid materializing entire IntervalIndex when using cut · Issue #27668 · pandas-dev/pandas
When using cut with an IntervalIndex for bins, the result of the cut is first materialized as an IntervalIndex and then converted to a Categorical:
    if isinstance(bins, IntervalIndex):
        # we have a fast-path here
        ids = bins.get_indexer(x)
        result = algos.take_nd(bins, ids)
        result = Categorical(result, categories=bins, ordered=True)
        return result, bins
It seems like it would be more performant, in both computation and memory, to bypass the intermediate construction of an IntervalIndex via take_nd and instead construct the Categorical directly via Categorical.from_codes.
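A minimal sketch of the idea, using only public pandas API rather than the internal code path quoted above. The variable names (`codes`, `direct`, `expected`) are illustrative and not taken from the pandas source. The sketch assumes that the integer positions returned by `get_indexer` can serve directly as category codes, with `-1` marking values that fall outside every bin:

```python
import numpy as np
import pandas as pd

bins = pd.interval_range(0, 20)   # right-closed intervals (0, 1], ..., (19, 20]
x = np.linspace(0, 20, 100)       # x[0] == 0.0 lies outside every bin

# Position of the containing interval for each value; -1 means "no bin".
codes = bins.get_indexer(x)

# Build the Categorical straight from the codes, without first
# materializing an IntervalIndex that is as long as x.
direct = pd.Categorical.from_codes(codes, categories=bins, ordered=True)

# Reference result from the public entry point.
expected = pd.cut(x, bins)

print(np.array_equal(direct.codes, expected.codes))
print(direct.categories.equals(expected.categories))
```

Since `Categorical.from_codes` treats `-1` as missing, out-of-range values should end up as NaN, just as they do in the result of `pd.cut`.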
Some ad hoc measurements on master:
In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 278.39 MiB, increment: 130.76 MiB
And the same measurements with the Categorical.from_codes fix:
In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 145.81 MiB, increment: 15.98 MiB
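For anyone who wants to compare the two strategies without IPython magics, here is a rough standalone sketch. It is not the benchmark used above: the helper names (`via_take`, `via_from_codes`, `best_of`) are hypothetical, `bins.take` stands in for the internal `algos.take_nd` call, the input is smaller, and every value is chosen to fall inside a bin so that no `-1` indices arise. Timings will vary by pandas version and machine.

```python
import time

import numpy as np
import pandas as pd

bins = pd.interval_range(0, 20)
# Start at 0.5 so every value falls inside one of the right-closed bins.
values = np.linspace(0.5, 20, 100).repeat(10**3)


def via_take(x, bins):
    """Materialize an IntervalIndex as long as x, then convert it."""
    ids = bins.get_indexer(x)
    materialized = bins.take(ids)  # public stand-in for algos.take_nd
    return pd.Categorical(materialized, categories=bins, ordered=True)


def via_from_codes(x, bins):
    """Use the indexer output directly as category codes."""
    ids = bins.get_indexer(x)
    return pd.Categorical.from_codes(ids, categories=bins, ordered=True)


def best_of(func, repeats=3):
    """Return the fastest wall-clock time over a few repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(values, bins)
        timings.append(time.perf_counter() - start)
    return min(timings)


slow = via_take(values, bins)
fast = via_from_codes(values, bins)

print(f"take + Categorical:     {best_of(via_take):.4f} s")
print(f"Categorical.from_codes: {best_of(via_from_codes):.4f} s")
```

Both helpers should produce the same codes and categories. They differ only in whether an intermediate IntervalIndex the length of the input is built.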