Unexpected behaviour of groupby.transform when using 'fillna' · Issue #30918 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'baz'],
        'B': [1, 2, np.nan, 3, 3, np.nan, 4],
        'C': [np.nan] * 7,
        'D': [0, 1, 2, 3, 4, 5, 6],
        'E': [np.nan]
        + [datetime.datetime(2020, 1, 1)] * 3
        + [datetime.datetime(2020, 1, 2)] * 2
        + [datetime.datetime(2020, 1, 3)],
        'F': list('abcdefg'),
        'G': list('abc') + [np.nan] + list('efg'),
        'id': range(0, 7),
    }
).set_index('id')

df.groupby('A').transform('fillna', value=9999)

Output

B       C       D  E                    F  G
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
1.0     9999.0  0  9999                 a  a
1.0     9999.0  0  9999                 a  a
2.0     9999.0  1  2020-01-01 00:00:00  b  b

Problem description

When using GroupBy.transform with 'fillna', I expected it to behave like GroupBy.transform with lambda x: x.fillna(9999). Instead, it also changes values that are not NaN. Even worse, it seems to shuffle contents between groups: for example, the first four rows of the output above are all copies of the filled row with D = 2.

Is this how it is expected to work?
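
The complaint above can be stated as a checkable property: a fill operation should leave every non-NaN value untouched. Below is a minimal sketch of such a check, restricted to column B for brevity. The helper name only_nans_changed is hypothetical (not part of pandas), and the "reported" values are copied by hand from column B of the Output section rather than recomputed, since the behaviour of transform('fillna') may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def only_nans_changed(before: pd.Series, after: pd.Series) -> bool:
    """Return True if `after` equals `before` wherever `before` is not NaN.

    Hypothetical helper for illustration; not a pandas API.
    """
    keep = before.notna()
    return bool((before[keep] == after[keep]).all())


df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
    }
)

# The lambda form from the "Expected Output" section, applied to column B only.
expected = df.groupby("A")["B"].transform(lambda x: x.fillna(9999))

# Column B as shown in the "Output" section of this report (copied, not recomputed).
reported = pd.Series([9999.0, 9999.0, 9999.0, 9999.0, 1.0, 1.0, 2.0])

print(only_nans_changed(df["B"], expected))  # True
print(only_nans_changed(df["B"], reported))  # False
```

The check fails for the reported column because the original values 1, 2, 3, 3 and 4 were replaced, not just the two NaN entries.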

Expected Output

df.groupby('A').transform(lambda x: x.fillna(9999))

B       C       D  E                    F  G
1.0     9999.0  0  9999                 a  a
2.0     9999.0  1  2020-01-01 00:00:00  b  b
9999.0  9999.0  2  2020-01-01 00:00:00  c  c
3.0     9999.0  3  2020-01-01 00:00:00  d  9999
3.0     9999.0  4  2020-01-02 00:00:00  e  e
9999.0  9999.0  5  2020-01-02 00:00:00  f  f
4.0     9999.0  6  2020-01-03 00:00:00  g  g
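
As a possible workaround while this is open: filling with a constant does not depend on the groups at all, so a plain DataFrame.fillna on the non-key columns should give the same values as the lambda form. The sketch below is an assumption-laden illustration on a reduced frame (only the numeric columns B and D from the sample), not a statement about how transform('fillna') is meant to behave.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
        "D": [0, 1, 2, 3, 4, 5, 6],
    }
)

# Lambda form from the expected output: fill within each group.
via_transform = df.groupby("A").transform(lambda x: x.fillna(9999))

# Group-free form: drop the key column and fill directly.
via_plain = df.drop(columns="A").fillna(9999)

# Compare the values column by column.
for col in ["B", "D"]:
    same = via_transform[col].tolist() == via_plain[col].tolist()
    print(col, same)
```

For group-dependent fills (for example, a per-group mean), the groupby is still needed and the lambda form remains the safer choice.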

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None