PERF: Avoid materializing values in Categorical.set_categories · Issue #17508 · pandas-dev/pandas

In Categorical.set_categories, we allocate an array of the values, which may be expensive.

It should be possible to do this operation by just manipulating the codes.

In [6]: c = pd.Categorical(['a'] * 100000)

In [7]: c.set_categories(['a', 'b'])
Out[7]:
[a, a, a, a, a, ..., a, a, a, a, a]
Length: 100000
Categories (2, object): [a, b]
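
For illustration, here is a rough sketch of what a codes-only approach could look like. It is not the code from the commit referenced in this issue; `recode_for_new_categories` is a hypothetical helper name, and the sketch assumes both sets of categories are unique. It builds a small lookup table from old codes to new codes and applies it to the integer codes, so the values are never materialized.

```python
import numpy as np
import pandas as pd


def recode_for_new_categories(codes, old_categories, new_categories):
    """Hypothetical helper: translate codes from old to new categories."""
    old_categories = pd.Index(old_categories)
    new_categories = pd.Index(new_categories)
    # Position of each old category within the new categories;
    # -1 where an old category was dropped.
    indexer = new_categories.get_indexer(old_categories)
    # Append a trailing -1 so that a missing-value code (-1) indexes
    # the last slot of the lookup table and stays -1.
    lookup = np.append(indexer, -1)
    return lookup[codes]


c = pd.Categorical(['a'] * 100000)
new_codes = recode_for_new_categories(c.codes, c.categories, ['a', 'b'])
result = pd.Categorical.from_codes(new_codes, categories=['a', 'b'])
```

The lookup table has one entry per old category, so the work on the values side scales with the number of categories rather than the length of the array.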

See 5ab0123 for how this might work. That commit will probably be squashed, but it contains the implementation of Categorical._set_dtype in #16015.

I may get to this as a followup to that PR.