PERF: Avoid materializing values in Categorical.set_categories · Issue #17508 · pandas-dev/pandas
In Categorical.set_categories, we allocate an array of the values, which may be expensive.
It should be possible to do this operation by just manipulating the codes.
In [6]: c = pd.Categorical(['a'] * 100000)
In [7]: c.set_categories(['a', 'b'])
Out[7]:
[a, a, a, a, a, ..., a, a, a, a, a]
Length: 100000
Categories (2, object): [a, b]
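For illustration, here is a rough sketch of what a codes-only version could look like. It is not the code from 5ab0123 or #16015; the helper name `set_categories_via_codes` is hypothetical, and it uses only public pandas and NumPy calls. The idea is to build a lookup table from old category positions to new ones, then recode the integer codes without ever building the array of values.

```python
import numpy as np
import pandas as pd


def set_categories_via_codes(cat, new_categories):
    """Hypothetical helper: recode `cat` to `new_categories` using only codes."""
    old = pd.Index(cat.categories)
    new = pd.Index(new_categories)

    # Position of each old category in the new categories (-1 if dropped).
    indexer = new.get_indexer(old)

    # Append a trailing -1 so that a missing-value code of -1 indexes the
    # last slot of the lookup table and therefore stays -1.
    lookup = np.append(indexer, -1)

    # One integer take over the codes; the values are never materialized.
    new_codes = lookup[cat.codes]

    return pd.Categorical.from_codes(new_codes, categories=new, ordered=cat.ordered)


c = pd.Categorical(['a'] * 100000)
result = set_categories_via_codes(c, ['a', 'b'])
```

The work here is proportional to the number of categories plus one pass over the integer codes, which is why it should be cheaper than allocating the values when there are few categories and many rows.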
See 5ab0123 for how this might work. That commit will probably be squashed, but it contains the implementation of Categorical._set_dtype in #16015.
I may get to this as a followup to that PR.