BUG: groupby.transform calls the user function ~1.5 times more than necessary · Issue #44977 · pandas-dev/pandas

Reproducible Example

import pandas as pd

df = pd.DataFrame({'key': ['z', 'z'], 'a': [1, 2], 'b': [3, 4]})

def f(x):
    print(x)
    return x.sum()

df.groupby('key').transform(f)

Issue Description

For every group, the user function is first called once per column of that group, each passed as a Series (which is correct), and then once more with the whole group as a DataFrame. That last call is unnecessary, because its result is not used anywhere.
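The call pattern described above can be checked without reading printed frames by tallying the type of each argument the function receives. This is a minimal sketch: `counting_sum` and `calls` are names introduced here for illustration, and the tally depends on the installed pandas version, so no specific counts are assumed.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"key": ["z", "z"], "a": [1, 2], "b": [3, 4]})

# Tally calls by the type of object handed to the user function.
calls = Counter()


def counting_sum(x):
    # Illustrative helper: record the argument type, then reduce as in the report.
    calls[type(x).__name__] += 1
    return x.sum()


result = df.groupby("key").transform(counting_sum)

print(dict(calls))  # the tally varies with the pandas version
print(result)
```

A "DataFrame" entry in the tally corresponds to the whole-group call whose result the report says is discarded.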

After digging through the source code I've found remnants of the 'fast path' and 'slow path', an optimization that has long been gone. The extra calls to the user function that it required are still there.

The commit that removed this optimization is
b8b6471
ENH: Add numba engine to groupby.transform (#32854)

After it was merged, the path value returned by _choose_path is no longer used anywhere. The extra call to the user function, which was only needed to set up path, is therefore no longer necessary either:

-        path = None
         for name, group in gen:
             object.__setattr__(group, "name", name)

-            if path is None:
+            if engine == "numba":
+                values, index = split_for_numba(group)
+                res = numba_func(values, index, *args)
+                if func not in self._numba_func_cache:
+                    self._numba_func_cache[func] = numba_func
+                # Return the result as a DataFrame for concatenation later
+                res = DataFrame(res, index=group.index, columns=group.columns)
+            else:
                 # Try slow path and fast path.
                 try:
                     path, res = self._choose_path(fast_path, slow_path, group)
@@ -1376,8 +1422,6 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
                 except ValueError as err:
                     msg = "transform must return a scalar value for each group"
                     raise ValueError(msg) from err
-            else:
-                res = path(group)

Or maybe the res = path(group) line was deleted by mistake, and that deletion should be reverted.

After this commit this line from the docs is no longer valid:

If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.
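The difference between the documented behaviour (probe on the first chunk, then reuse the chosen path) and probing on every group can be modelled in a few lines. This is a simplified sketch, not the pandas implementation: `choose_path`, `transform_sketch`, `run` and the `cache_path` flag are hypothetical names introduced here, and the groups are plain DataFrames split by hand.

```python
import pandas as pd


def choose_path(fast_path, slow_path, group):
    # Hypothetical stand-in for the path probe: run both candidates on one
    # group and keep the fast one only if it runs and matches the slow result.
    res = slow_path(group)
    try:
        res_fast = fast_path(group)
    except Exception:
        return slow_path, res
    if isinstance(res_fast, pd.DataFrame) and res_fast.equals(res):
        return fast_path, res_fast
    return slow_path, res


def transform_sketch(groups, func, cache_path):
    fast_path = lambda g: func(g)        # whole sub-frame in one call
    slow_path = lambda g: g.apply(func)  # one call per column
    path = None
    pieces = []
    for group in groups:
        if path is None or not cache_path:
            # Probe: one slow pass plus one fast pass over this group.
            path, res = choose_path(fast_path, slow_path, group)
        else:
            # Reuse the path chosen on the first group.
            res = path(group)
        pieces.append(res)
    return pd.concat(pieces)


def run(cache_path):
    calls = []

    def double(x):
        # Works on both a Series and a DataFrame, so the fast path applies.
        calls.append(type(x).__name__)
        return x * 2

    df = pd.DataFrame({"key": ["y", "y", "z", "z"],
                       "a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
    groups = [g[["a", "b"]] for _, g in df.groupby("key")]
    return transform_sketch(groups, double, cache_path), len(calls)


cached_result, cached_calls = run(cache_path=True)
uncached_result, uncached_calls = run(cache_path=False)
print(cached_calls, uncached_calls)
```

Both variants produce the same frame. With `cache_path=False` the probe runs for every group, so the user function is called more often, which is the overhead this report is about.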

Expected Behavior

0    1    # <-- this is correct
1    2
Name: a, dtype: int64
0    3    # <-- this is correct
1    4
Name: b, dtype: int64
   a  b    # <-- this is wrong (result is ignored)
0  1  3
1  2  4
   a  b    # <-- the result is correct
0  3  7
1  3  7

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.3.5
numpy : 1.21.2
pytz : 2019.2
dateutil : 2.8.0
pip : 21.0.1
setuptools : 41.1.0
Cython : 0.29.14
pytest : 5.1.3
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.16.2
xlrd : 1.2.0
xlwt : None
numba : 0.51.2