Dataframe.groupby aggregations with categorical columns lead to incorrect results. · Issue #32494 · pandas-dev/pandas
Code Sample
In[2]:
import pandas as pd

def create_df():
    df = pd.DataFrame(
        {
            'major_id': [1, 2, 1, 2, 2],
            'minor_id': ['a', 'b', 'c', 'd', 'e'],
            'values': [1, 2, 3, 4, 5]
        }
    )
    return df

def groupby(df):
    df['max_value'] = (
        df
        .groupby(['major_id', 'minor_id'])
        ['values']
        .transform('max')
    )
    return df
In[3]:
# correct result
df = create_df()
groupby(df)
Out[3]
   major_id minor_id  values  max_value
0         1        a       1          1
1         2        b       2          2
2         1        c       3          3
3         2        d       4          4
4         2        e       5          5
In[4]:
# incorrect result: groupby with one non-categorical column and one categorical column
df = create_df()
df = df.astype({'minor_id': 'category'})
groupby(df)
Out[4]
   major_id minor_id  values  max_value
0         1        a       1        1.0
1         2        b       2        3.0
2         1        c       3        NaN
3         2        d       4        NaN
4         2        e       5        NaN
Problem description
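Calling groupby with one non-categorical column and one categorical column, followed by transform('max'), leads to incorrect aggregations (wrong values or NaNs). Both calls above should return the same max_value column; only the dtype of minor_id differs.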
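Every (major_id, minor_id) pair in the sample is unique, so each group holds exactly one row and the grouped maximum must equal that row's own value. A minimal plain-Python sketch of that expectation, independent of pandas, might look like the following. The helper name expected_group_max is hypothetical and is not part of pandas.

```python
def expected_group_max(keys, values):
    """Return, for each row, the maximum of `values` within that row's group.

    Hypothetical reference helper: it mimics what a grouped max transform
    is expected to produce, without using pandas.
    """
    group_max = {}
    for key, value in zip(keys, values):
        # Track the largest value seen so far for each group key.
        if key not in group_max or value > group_max[key]:
            group_max[key] = value
    # Broadcast each group's maximum back to the original row order.
    return [group_max[key] for key in keys]


# Same sample data as create_df() in the report.
major_id = [1, 2, 1, 2, 2]
minor_id = ['a', 'b', 'c', 'd', 'e']
values = [1, 2, 3, 4, 5]

# Group on the (major_id, minor_id) pair, as the report does.
keys = list(zip(major_id, minor_id))
result = expected_group_max(keys, values)
print(result)  # [1, 2, 3, 4, 5]
```

This matches Out[3]. Grouping on major_id alone would give [3, 5, 3, 5, 5], so the values in Out[4] do not correspond to either grouping.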
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : None
pytest : 5.2.1
hypothesis : 5.5.4
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : 0.3.3
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : None
tabulate : 0.8.5
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0