BUG: groupby.transform calls the user function ~1.5 times more than necessary · Issue #44977 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd

df = pd.DataFrame({'key': ['z', 'z'], 'a': [1, 2], 'b': [3, 4]})

def f(x):
    print(x)
    return x.sum()

df.groupby('key').transform(f)
Issue Description
For every group the user function is first called with every series of this group (which is correct), but then with the group as a whole (which is not right, as the result is not used anywhere at all).
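The call pattern described above can also be checked without reading printed output, by tallying the type of each argument the function receives. This is a minimal sketch: the names calls and counting_sum are illustrative and not part of pandas, and the counts it prints depend on the installed pandas version.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'key': ['z', 'z'], 'a': [1, 2], 'b': [3, 4]})

# Hypothetical helper state: how often each argument type was passed in.
calls = Counter()


def counting_sum(x):
    # Record whether we were handed a Series (one column) or a DataFrame (whole group).
    calls[type(x).__name__] += 1
    return x.sum()


result = df.groupby('key').transform(counting_sum)

print(dict(calls))
print(result)
```

Going by the printed output in this report, pandas 1.3.5 would be expected to record one DataFrame call for the single group in addition to the two per-column Series calls. Other versions may behave differently.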
After digging through the source code I've found remnants of the 'fast path' and 'slow path' – an optimization that has long been gone, but those extra calls to user function are still there.
The commit that removed this optimization is b8b6471, "ENH: Add numba engine to groupby.transform" (#32854).
After it was merged, the path value returned by _choose_path is no longer used anywhere. The extra call to the user function, which was needed only to select the path, is therefore unnecessary as well:
-        path = None
         for name, group in gen:
             object.__setattr__(group, "name", name)
-            if path is None:
+            if engine == "numba":
+                values, index = split_for_numba(group)
+                res = numba_func(values, index, *args)
+                if func not in self._numba_func_cache:
+                    self._numba_func_cache[func] = numba_func
+                # Return the result as a DataFrame for concatenation later
+                res = DataFrame(res, index=group.index, columns=group.columns)
+            else:
                 # Try slow path and fast path.
                 try:
                     path, res = self._choose_path(fast_path, slow_path, group)
@@ -1376,8 +1422,6 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
                 except ValueError as err:
                     msg = "transform must return a scalar value for each group"
                     raise ValueError(msg) from err
-            else:
-                res = path(group)
Alternatively, the removal may have been unintentional, in which case the deletion of the res = path(group) line should be reverted.
After this commit this line from the docs is no longer valid:
If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.
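To make the documented behaviour concrete, here is a simplified model of a fast/slow path choice in which the path selected on the first chunk is reused for later chunks. It is a sketch under stated assumptions, not pandas' actual implementation: transform_with_path_choice, shift_to_min, calls and demo are hypothetical names, and the model assumes a unique index and a function that returns a like-indexed result.

```python
from collections import Counter

import pandas as pd

calls = Counter()  # hypothetical tally of argument types seen by the user function


def shift_to_min(x):
    """Example transform that works on a Series and on a whole DataFrame."""
    calls[type(x).__name__] += 1
    return x - x.min()


def transform_with_path_choice(df, key, func):
    """Simplified model of the fast/slow path idea; NOT pandas' actual code."""

    def slow_path(group):
        # Apply the function to each column separately.
        return group.apply(func, axis=0)

    def fast_path(group):
        # Apply the function to the whole subframe at once.
        return func(group)

    path = None
    pieces = []
    for _, group in df.groupby(key):
        group = group.drop(columns=key)
        if path is None:
            # First chunk: the slow path always supplies the result.
            res = slow_path(group)
            path = slow_path
            try:
                res_fast = fast_path(group)
            except Exception:
                res_fast = None
            # Switch to the fast path only if it reproduces the slow result.
            if isinstance(res_fast, pd.DataFrame) and res_fast.equals(res):
                path = fast_path
        else:
            # Later chunks reuse the chosen path, one call per chunk.
            res = path(group)
        pieces.append(res)
    return pd.concat(pieces).reindex(df.index)


demo = pd.DataFrame(
    {'key': ['x', 'x', 'y', 'y'], 'a': [1, 2, 5, 7], 'b': [3, 4, 10, 20]}
)
out = transform_with_path_choice(demo, 'key', shift_to_min)
print(out)
print(dict(calls))
```

In this model the whole-subframe call on the first chunk is a one-time probe, and its cost is recovered on later chunks only if the chosen path is actually reused. Without the else branch, the probe would be repeated for every chunk and its outcome discarded, which is the situation this report describes.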
Expected Behavior
0    1                  # <-- this is correct
1    2
Name: a, dtype: int64
0    3                  # <-- this is correct
1    4
Name: b, dtype: int64
   a  b                 # <-- this is wrong (result is ignored)
0  1  3
1  2  4
   a  b                 # <-- the result is correct
0  3  7
1  3  7
Installed Versions
INSTALLED VERSIONS
commit : 66e3805
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.3.5
numpy : 1.21.2
pytz : 2019.2
dateutil : 2.8.0
pip : 21.0.1
setuptools : 41.1.0
Cython : 0.29.14
pytest : 5.1.3
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.16.2
xlrd : 1.2.0
xlwt : None
numba : 0.51.2