TLDR: When calling df.groupby(key=categorical<ordered=False>, sort=False, observed=False) the resulting CategoricalIndex will have both its values and its categories in order of appearance rather than sorted.

In [1]:     df = DataFrame(
   ...:         [
   ...:             ["(7.5, 10]", 10, 10],
   ...:             ["(7.5, 10]", 8, 20],
   ...:             ["(2.5, 5]", 5, 30],
   ...:             ["(5, 7.5]", 6, 40],
   ...:             ["(2.5, 5]", 4, 50],
   ...:             ["(0, 2.5]", 1, 60],
   ...:             ["(5, 7.5]", 7, 70],
   ...:         ],
   ...:         columns=["range", "foo", "bar"],
   ...:     )

In [2]: col = "range"

In [3]: df["range"] = Categorical(df["range"], ordered=False)

In [4]: df.groupby(col, sort=True, observed=False).first().index
Out[4]: CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')

In [5]: df.groupby(col, sort=False, observed=False).first().index
Out[5]: CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], ordered=False, dtype='category', name='range')

It's reasonable that the values are not sorted, but a lot of extra work can be spent un-ordering the categories in:

This may have been an outcome of fixing #8868, but if the values for grouping with sort=False can be obtained without reordering the categories, there would probably be a nice performance benefit.
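
To make the suggestion concrete, here is a rough sketch of how the group values could be taken in order of appearance while the categories are left untouched. This is not how pandas implements groupby internally; the names `first_seen_codes` and `group_index` are made up for illustration, and the sketch assumes the column has no missing values.

```python
import pandas as pd

# Same "range" column as in the example above, as an unordered categorical.
cat = pd.Categorical(
    ["(7.5, 10]", "(7.5, 10]", "(2.5, 5]", "(5, 7.5]", "(2.5, 5]", "(0, 2.5]", "(5, 7.5]"],
    ordered=False,
)

# Integer codes of the distinct values, in order of first appearance.
# Assumption: no missing values (a code of -1 would have to be filtered out).
first_seen_codes = pd.unique(cat.codes)

# Build the group index from those codes and reuse the original categories
# as-is, so no category reordering takes place.
group_index = pd.CategoricalIndex(
    pd.Categorical.from_codes(first_seen_codes, categories=cat.categories, ordered=False),
    name="range",
)

print(list(group_index))             # values, in order of appearance
print(list(group_index.categories))  # categories, same order as in `cat`
```

The values come out in the same order as in Out[5], but the categories stay exactly as they were on the input column, which is the part that would avoid the extra work.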

API/PERF: Don't reorder categoricals when grouping by an unordered categorical and sort=False · Issue #48749 · pandas-dev/pandas