API/PERF: Don't reorder categoricals when grouping by an unordered categorical and sort=False
· Issue #48749 · pandas-dev/pandas
TLDR: When calling df.groupby(key=categorical<ordered=False>, sort=False, observed=False)
the resulting CategoricalIndex
will have both its values and its categories reordered into order of first appearance.
In [1]: df = DataFrame(
...: [
...: ["(7.5, 10]", 10, 10],
...: ["(7.5, 10]", 8, 20],
...: ["(2.5, 5]", 5, 30],
...: ["(5, 7.5]", 6, 40],
...: ["(2.5, 5]", 4, 50],
...: ["(0, 2.5]", 1, 60],
...: ["(5, 7.5]", 7, 70],
...: ],
...: columns=["range", "foo", "bar"],
...: )
In [2]: col = "range"
In [3]: df["range"] = Categorical(df["range"], ordered=False)
In [4]: df.groupby(col, sort=True, observed=False).first().index
Out[4]: CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')
In [5]: df.groupby(col, sort=False, observed=False).first().index
Out[5]: CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], ordered=False, dtype='category', name='range')
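The session above can be turned into a small script that reports whether the installed pandas reorders the categories. This is only a sketch: the variable names (`original`, `sorted_idx`, `unsorted_idx`, `categories_reordered`) are made up for illustration, and the printed result may differ between pandas versions, so no expected output is shown.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "range": ["(7.5, 10]", "(7.5, 10]", "(2.5, 5]", "(5, 7.5]",
                  "(2.5, 5]", "(0, 2.5]", "(5, 7.5]"],
        "foo": [10, 8, 5, 6, 4, 1, 7],
    }
)
df["range"] = pd.Categorical(df["range"], ordered=False)

# Categories as stored on the grouping column before any groupby.
original = list(df["range"].cat.categories)

sorted_idx = df.groupby("range", sort=True, observed=False).first().index
unsorted_idx = df.groupby("range", sort=False, observed=False).first().index

# True if the sort=False result carries categories in a different order
# than the input column did (the behaviour this issue is about).
categories_reordered = list(unsorted_idx.categories) != original

print("values (sort=False):    ", list(unsorted_idx))
print("categories (sort=False):", list(unsorted_idx.categories))
print("categories reordered:   ", categories_reordered)
```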
It's reasonable that the values are not sorted, but a lot of extra work can be spent reordering the categories in:
This may have been an outcome of fixing #8868, but if the sort=False result
can be produced without reordering the categories, there would probably be a nice performance benefit.
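As a rough illustration of the idea (not pandas' internal implementation), the group labels can be put in order of first appearance by working on the integer codes alone, leaving the categories untouched. The helper name `groups_in_first_appearance_order` is hypothetical, and placing unobserved categories at the end is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd


def groups_in_first_appearance_order(cat: pd.Categorical) -> pd.CategoricalIndex:
    """Hypothetical helper: order group labels by first appearance
    while keeping cat.categories exactly as they are."""
    codes = np.array(cat.codes)          # copy of the integer codes
    observed = pd.unique(codes)          # unique codes, in order of appearance
    observed = observed[observed != -1]  # -1 marks missing values

    # Assumption: unobserved categories go last, in category order.
    all_codes = np.arange(len(cat.categories), dtype=observed.dtype)
    unobserved = np.setdiff1d(all_codes, observed)

    result_codes = np.concatenate([observed, unobserved])
    values = pd.Categorical.from_codes(
        result_codes, categories=cat.categories, ordered=cat.ordered
    )
    return pd.CategoricalIndex(values)


cat = pd.Categorical(
    ["(7.5, 10]", "(7.5, 10]", "(2.5, 5]", "(5, 7.5]",
     "(2.5, 5]", "(0, 2.5]", "(5, 7.5]"],
    ordered=False,
)
idx = groups_in_first_appearance_order(cat)
print(list(idx))             # values in order of first appearance
print(list(idx.categories))  # categories in their original order
```

Only the small array of codes is rearranged here; the categories object is passed through as-is, which is where the potential saving would come from.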