Dataframe.groupby aggregations with categorical columns lead to incorrect results.

Code Sample

In[2]:

import pandas as pd

def create_df():
    df = pd.DataFrame(
        {
            'major_id': [1, 2, 1, 2, 2],
            'minor_id': ['a', 'b', 'c', 'd', 'e'],
            'values': [1, 2, 3, 4, 5]
        }
    )
    return df

def groupby(df):
    df['max_value'] = (
        df
        .groupby(['major_id', 'minor_id'])
        ['values']
        .transform('max')
    )
    return df

In[3]:

# correct result
df = create_df()
groupby(df)

Out[3]

   major_id minor_id  values  max_value
0         1        a       1          1
1         2        b       2          2
2         1        c       3          3
3         2        d       4          4
4         2        e       5          5

In[4]:

# incorrect result: groupby with one non-categorical column and one categorical column
df = create_df()
df = df.astype({'minor_id': 'category'})
groupby(df)

Out[4]

   major_id minor_id  values  max_value
0         1        a       1        1.0
1         2        b       2        3.0
2         1        c       3        NaN
3         2        d       4        NaN
4         2        e       5        NaN

Problem description

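Grouping by one non-categorical column and one categorical column and then calling transform('max') gives incorrect results. Every (major_id, minor_id) pair in the sample is unique, so max_value should equal values in each row, as in Out[3]. Instead, Out[4] shows a wrong value in row 1 (3.0 instead of 2) and NaN in rows 2 to 4, and the column is upcast to float.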
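A possible workaround, sketched here as an assumption rather than a fix confirmed in this report: the object-dtype run (In[3]) gives the expected values, so you can group on an object-dtype copy of the categorical key and leave the column itself categorical. The helper name `max_value_with_object_key` is hypothetical and only for illustration.

```python
import pandas as pd


def max_value_with_object_key(df):
    """Hypothetical helper: compute the per-group max while grouping on an
    object-dtype copy of 'minor_id', so no categorical grouper is involved."""
    out = df.copy()
    # Assumption: casting the key to object sidesteps the categorical code path.
    minor_as_object = out['minor_id'].astype(object)
    out['max_value'] = (
        out
        .groupby(['major_id', minor_as_object])
        ['values']
        .transform('max')
    )
    return out


# Same data as in the report, with 'minor_id' stored as a categorical column.
df = pd.DataFrame(
    {
        'major_id': [1, 2, 1, 2, 2],
        'minor_id': ['a', 'b', 'c', 'd', 'e'],
        'values': [1, 2, 3, 4, 5],
    }
).astype({'minor_id': 'category'})

result = max_value_with_object_key(df)
print(result)
```

The helper works on a copy, so the input frame is not modified, and `minor_id` keeps its category dtype in the returned frame. Only the grouping key is cast.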

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : None
pytest : 5.2.1
hypothesis : 5.5.4
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : 0.3.3
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : None
tabulate : 0.8.5
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0