groupby transform misbehaving with categoricals · Issue #9921 · pandas-dev/pandas
Starting from this SO question, we can see that something is odd when doing a groupby using a Series of category dtype. (Today's trunk, 0.16.0-163-gf9f88b2.)
>>> df = pd.DataFrame({"a": [5,15,25]})
>>> c = pd.cut(df.a, bins=[0,10,20,30,40])
>>> c
0     (0, 10]
1    (10, 20]
2    (20, 30]
Name: a, dtype: category
Categories (4, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]
>>> df.groupby(c).sum()
           a
a
(0, 10]    5
(10, 20]  15
(20, 30]  25
(30, 40] NaN
>>> df.groupby(c).transform(sum)
Traceback (most recent call last):
  File "<ipython-input-81-1eb61986eff3>", line 1, in <module>
    df.groupby(c).transform(sum)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3017, in transform
    indexer = indices[name]
KeyError: '(30, 40]'
>>> df.a.groupby(c).transform(max)
Traceback (most recent call last):
  File "<ipython-input-87-9759e8c1e070>", line 1, in <module>
    df.a.groupby(c).transform(max)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2422, in transform
    return self._transform_fast(cyfunc)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2461, in _transform_fast
    values = np.repeat(values, com._ensure_platform_int(counts))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 393, in repeat
    return repeat(repeats, axis)
ValueError: count < 0
It looks like we're somehow using the categories themselves, not the values. Not sure if this is a consequence of the special-casing of grouping on categories.
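The gap between the declared categories and the values actually present can be checked directly. This is only a small sketch that rebuilds `df` and `c` from the example; the names `declared` and `present` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25]})
c = pd.cut(df.a, bins=[0, 10, 20, 30, 40])

# Every bin is declared as a category, including the empty (30, 40] bin.
declared = list(c.cat.categories)

# Only the bins that actually occur in the data show up among the values.
present = list(c.unique())

print(len(declared), len(present))
```

If groupby iterates over `declared` rather than `present`, it ends up with a group that has no rows, which would be consistent with the `KeyError` on `'(30, 40]'`.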
ISTM that if a Series with categorical dtype is passed to groupby, we should group on the values and not return the missing elements. If you want the missing elements, you should pass the Categorical object itself. Admittedly I haven't thought through all the consequences, but I was pretty surprised when I read the docs and saw this was the intended behaviour.
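In the meantime, one possible workaround is to drop the empty categories before grouping, so the grouper only contains bins that occur in the data. This is a sketch, not a fix for the underlying behaviour; it assumes discarding the unused bins is acceptable, and the names `kept` and `result` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25]})
c = pd.cut(df.a, bins=[0, 10, 20, 30, 40])

# Remove the (30, 40] bin, which has no rows, from the categorical grouper.
kept = c.cat.remove_unused_categories()

# Each row falls in its own bin here, so the per-group max equals the input.
result = df.a.groupby(kept).transform("max")
print(result)
```

Later pandas releases also added an `observed` argument to `groupby`, which controls whether unobserved categories appear as groups.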