DEPR: deprecate relableling dicts in groupby.agg by jreback · Pull Request #15931 · pandas-dev/pandas

@TomAugspurger Thanks for proposing alternatives, but they fail to address the real issue.

The problem is not whether the renaming is possible, or how to do it. It is possible, and your proposed solutions bring no more value than the other examples above (such as assigning directly to mydf.columns); they only add the burden of two more methods to the already long list of DataFrame methods.

The real issue is that this change forces us to separate the place where the renaming is defined from the definition of the corresponding aggregate function.

Semantically this is really annoying, because we now have to maintain two lists and keep them in sync whenever we add another aggregate column. We must work out where in the output the new aggregate column will land, and then update the matching position in the renaming list.

So in a nutshell:

Before, column renaming and the definition of the operation were together, so they are naturally in sync.
Now, first you define the aggregate callable, and afterward you have to rename and be very careful about the resulting column order.
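To make the synchronization burden concrete, here is a minimal sketch of the two-step approach. The sample data is an assumption, chosen to be consistent with the totals shown later in this comment, and the names `aggregations` and `new_names` are only illustrative.

```python
import pandas as pd

# Assumed sample data, consistent with the sums and means shown in this comment.
mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# List 1: which aggregations to run on which source column.
aggregations = {
    "energy": ["sum"],
    "distance": ["sum", "mean"],
}

# List 2: the output names, matched to list 1 purely by position.
# Adding, say, "mean" to "energy" shifts every later position,
# so this list has to be edited at exactly the right index.
new_names = ["total_energy", "total_distance", "average_distance"]

mydf_agg = mydf.groupby("cat").agg(aggregations)

# Guard against the two lists drifting apart in length (not in order).
assert len(new_names) == len(mydf_agg.columns)
mydf_agg.columns = new_names
```

The length check catches a forgotten name, but nothing catches two names that were swapped, which is the fragile part.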

And no one has even begun to address the issue of custom aggregates. Custom callables may share the same __name__, and this results in an exception: partial functions take the name of the function they wrap and cannot be given a name at creation, and all lambda functions are named <lambda>, which is worse because afaik there is no way to name a lambda when it is defined.

Thus this is a backward-incompatible change, and one with no easy workaround. (There exist tricky workarounds that set the __name__ attribute.)
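For illustration, one form such a workaround could take is sketched below. The helper `with_name` is hypothetical (it is not part of pandas), the sample data is assumed, and `np.percentile` stands in for any custom aggregate.

```python
from functools import partial

import numpy as np
import pandas as pd


def with_name(func, name):
    """Hypothetical helper: set __name__ on a callable and return it."""
    func.__name__ = name
    return func


# Assumed sample data, consistent with the values shown in this comment.
mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
})

# A lambda and a partial, each given a distinct name after creation.
energy_p98 = with_name(lambda x: np.percentile(x, 98), "energy_p98")
energy_p17 = with_name(partial(np.percentile, q=17), "energy_p17")

mydf_agg = mydf.groupby("cat").agg({"energy": ["sum", energy_p98, energy_p17]})
column_names = list(mydf_agg.columns)
```

This works, but the name now lives in a wrapper call instead of next to the aggregation it labels, which is the kind of indirection being objected to.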

Slightly extended example from above with lambda and partial:

(please note, this is a crafted example for the purpose of demonstrating the problem, but all of the demonstrated issues here did bite me in real life since the change)

Before:

# easy and works as expected
from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

# nested dicts: {source column: {new column name: aggregation}}
mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),
        'energy_p17': percentile17,
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,
        'distance_mad_c1': mad_c1,
    },
})

results in

          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000

and all that is left is:

# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

After:

from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

# lists of aggregations: the output names are no longer given here
mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        'sum',
        lambda x: np.percentile(x, 98),
        percentile17,
    ],
    'distance': [
        'sum',
        'mean',
        smrb.mad,
        mad_c1,
    ],
})

The above breaks because all of the lambda functions produce columns named <lambda>, which results in

SpecificationError: Function names must be unique, found multiple named <lambda>

Backward-incompatible regression: one can no longer apply two different lambdas to the same original column.

If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function, which inherits its name from the original function:

SpecificationError: Function names must be unique, found multiple named mad
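The two errors above can be traced back to the names the callables carry. The sketch below shows this without pandas; `mad` here is a small stand-in written for this example (an assumption, to avoid the statsmodels dependency), not the statsmodels implementation.

```python
from functools import partial

import numpy as np

# Two different lambdas, as in the aggregation above.
p98 = lambda x: np.percentile(x, 98)
p17 = lambda x: np.percentile(x, 17)


def mad(values, c=0.6745):
    """Stand-in median absolute deviation, scaled by 1/c (illustrative only)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values))) / c


mad_c1 = partial(mad, c=1)

# Both lambdas report the same name, so they cannot be told apart by name.
lambda_names = {p98.__name__, p17.__name__}

# The partial wraps `mad`, and the wrapped function's name is still "mad".
wrapped_name = mad_c1.func.__name__

print(lambda_names)   # a set with a single element
print(wrapped_name)
```

Any scheme that labels output columns by callable name will therefore see a clash between the two lambdas, and between `mad` and `mad_c1`.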

Finally, after overwriting the __name__ attribute of the partial (mad_c1.__name__ = 'mad_c1') we get:

    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000

and we still have the renaming to deal with.
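One way to make that remaining renaming step less order-sensitive is sketched here, under assumptions: the sample data is assumed, `rename_map` is a hypothetical name, and the custom aggregate is written as a named `def` so that its column label is predictable. It keys the new names by (source column, function label) instead of by position.

```python
import numpy as np
import pandas as pd

# Assumed sample data, consistent with the values shown in this comment.
mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})


def percentile17(values):
    """17th percentile, defined with `def` so it has a usable name."""
    return np.percentile(values, 17)


mydf_agg = mydf.groupby("cat").agg({
    "energy": ["sum", percentile17],
    "distance": ["sum", "mean"],
})

# Hypothetical mapping: (source column, function label) -> final column name.
rename_map = {
    ("energy", "sum"): "total_energy",
    ("energy", "percentile17"): "energy_p17",
    ("distance", "sum"): "total_distance",
    ("distance", "mean"): "average_distance",
}

# Iterating over MultiIndex columns yields tuples, which are the map's keys.
mydf_agg.columns = [rename_map[col] for col in mydf_agg.columns]
```

The names are still defined away from the aggregations, and this does not help with anonymous lambdas or partials, whose labels are the very thing in question.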