DEPR: deprecate relableling dicts in groupby.agg by jreback · Pull Request #15931 · pandas-dev/pandas
@TomAugspurger Thanks for proposing alternatives, but they fail to address the real issue.
The problem is not whether the renaming is possible, or how. It is possible, and your proposed solutions bring no more value than the other example above (such as assigning directly to mydf.columns), except the burden of adding two more methods to the already long list of methods of the DataFrame class.
The real issue is that this change forces us to separate the place where the renaming is defined from the definition of the corresponding aggregate function.
Semantically this is really annoying, because now we have to keep track of two lists and keep them in sync when we want to add another aggregate column. We must track down in which order the new aggregate column will land and where in the renaming list to update after adding one more aggregate function...
So in a nutshell:
- Before, column renaming and the definition of the operation were together, so they are naturally in sync.
- Now, first you define the aggregate callable, and afterward you have to rename and be very careful about the resulting column order.
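The bookkeeping described above can be reduced by declaring names and functions in one place and deriving both structures from it. The following is only a sketch: the `spec` layout and the `split_spec` helper are hypothetical (not part of pandas), and it assumes the aggregated columns come out in the same order as the dict keys and function lists.

```python
# Hypothetical helper: keep each output name next to its aggregate,
# then derive the two structures that the list-based API needs.
def split_spec(spec):
    """Return (agg_arg, new_names) from {column: {new_name: func}}."""
    # Dict of lists, suitable as the argument to .agg()
    agg_arg = {col: list(named.values()) for col, named in spec.items()}
    # Flat list of new column names, in declaration order
    new_names = [name for named in spec.values() for name in named]
    return agg_arg, new_names


spec = {
    'energy': {'total_energy': 'sum', 'energy_max': 'max'},
    'distance': {'total_distance': 'sum', 'average_distance': 'mean'},
}
agg_arg, new_names = split_spec(spec)
```

The result could then be used as `mydf_agg = mydf.groupby('cat').agg(agg_arg)` followed by `mydf_agg.columns = new_names`. This only automates the synchronisation; the renaming still happens in a separate step and still depends on the resulting column order.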
And no one has even begun to address the issue of custom aggregates. Custom callables may share the same __name__ attribute, which results in an exception: partial functions inherit the name of the parent function, and it cannot be set at creation; all lambda functions are named <lambda>, which is worse because afaik there's no way to define the name of a lambda.
Thus this is a backward-incompatible change with no easy workaround (there exist tricky workarounds that add the __name__ attribute).
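The name collision and the `__name__` workaround can be illustrated in plain Python. In this sketch, `callable_name` is a hypothetical stand-in that approximates how a name might be resolved for a callable (falling back to the wrapped function for a `partial`); it is not the pandas implementation. `spread` is likewise a made-up aggregate used in place of `smrb.mad`.

```python
from functools import partial


def callable_name(func):
    """Hypothetical name lookup: own __name__, else the wrapped function's."""
    name = getattr(func, '__name__', None)
    if name is None and isinstance(func, partial):
        return callable_name(func.func)
    return name


def spread(values, c=0.5):
    # Made-up aggregate standing in for smrb.mad
    return (max(values) - min(values)) / c


highest = lambda x: max(x)
lowest = lambda x: min(x)
spread_c1 = partial(spread, c=1)

# Both lambdas share one name, and the partial reports its parent's name
names_before = [callable_name(f) for f in (highest, lowest, spread, spread_c1)]

# Workaround: assign distinct names after creation
highest.__name__ = 'highest'
lowest.__name__ = 'lowest'
spread_c1.__name__ = 'spread_c1'

names_after = [callable_name(f) for f in (highest, lowest, spread, spread_c1)]
```

In plain Python the assignment works for lambdas as well as partials, but it remains an extra step that lives apart from the definition of the aggregate.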
Slightly extended example from above with lambda and partial:
(please note, this is a crafted example for the purpose of demonstrating the problem, but all of the demonstrated issues here did bite me in real life since the change)
Before:
# easy and works as expected
from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),
        'energy_p17': percentile17,
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,
        'distance_mad_c1': mad_c1,
    },
})
results in
          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000
and all that is left is:
# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)
After:
from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        'sum',
        lambda x: np.percentile(x, 98),
        percentile17,
    ],
    'distance': [
        'sum',
        'mean',
        smrb.mad,
        mad_c1,
    ],
})
The above breaks because the lambda functions all produce columns named <lambda>, which results in
SpecificationError: Function names must be unique, found multiple named <lambda>
Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.
If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function, which inherits its name from the original function:
SpecificationError: Function names must be unique, found multiple named mad
Finally, after overwriting the __name__ attribute of the partial (mad_c1.__name__ = 'mad_c1') we get:
    energy          distance
       sum <lambda>      sum  mean      mad mad_c1
cat
A     5.79   1.8510     4.44 1.480 0.355825  0.240
B     2.85   1.3095     1.83 0.915 0.140847  0.095
C     1.01   1.0100     0.60 0.600 0.000000  0.000
with the renaming still left to deal with.