Deprecation of relabeling dicts in groupby.agg brings many issues (original) (raw)

This issue is created based on the discussion from #15931 following the deprecation of relabeling dicts in groupby.agg. A lot of what is summarized below was already discussed in the previous discussion. I would recommend in particular #15931 (comment) where the problems are also clearly stated.

The motivation behind the deprecation of #15931 was mostly related to bringing a consistent interface for agg() between Series and Dataframe (see also #14668 for context).

The relabeling functionality with a nested dict has been described by some as being too complex and/or inconsistent and thus deprecated.

However, this comes at a price: the impossibility to aggregate and rename at the same time leads to very annoying issues and some backward incompatibility where no sensible workaround is available:

[annoying] no more control over the names of the resulting columns
[annoying] you need to find a way to rename the MultiIndex after performing the aggregation, requiring to keep track of the order of columns at two places in the code.... not practical at all and sometimes downright impossible (cases below).
⚠️ [breaking] cannot apply more than one callable with the same internal name on the same input column. This results in two sub-cases:
- [breaking] you can't apply anymore two or more lambda aggregators on the same column
- [breaking] you can't apply anymore two or more aggregators from partial functions unless you alter their hidden __name__ attribute

Example

(please note, this is a crafted example for the purpose of demonstrating the problem in as short a code as possible, but all of the demonstrated issues here did bite me in real life since the change, and in situations not as simple as here)

Input Dataframe

mydf = pd.DataFrame( { 'cat': ['A', 'A', 'A', 'B', 'B', 'C'], 'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01], 'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6] }, index=range(6) )

  cat  distance  energy
0   A      1.20    1.80
1   A      1.50    1.95
2   A      1.74    2.04
3   B      0.82    1.25
4   B      1.01    1.60
5   C      0.60    1.01

Before:

easy to write and read, and works as expected

import numpy as np import statsmodels.robust as smrb from functools import partial

median absolute deviation as a partial function

in order to demonstrate the issue with partial functions as aggregators

mad_c1 = partial(smrb.mad, c=1)

renaming and specifying the aggregators at the same time

note that I want to choose the resulting column names myself

for example "total_xxxx" instead of just "sum"

mydf_agg = mydf.groupby('cat').agg({ 'energy': { 'total_energy': 'sum', 'energy_p98': lambda x: np.percentile(x, 98), # lambda 'energy_p17': lambda x: np.percentile(x, 17), # lambda }, 'distance': { 'total_distance': 'sum', 'average_distance': 'mean', 'distance_mad': smrb.mad, # original function 'distance_mad_c1': mad_c1, # partial function wrapping the original function }, })

results in

          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000

and all is left is:

get rid of the first MultiIndex level in a pretty straightforward way

mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

Happy dance praising pandas 💃 🕺 !

After

import numpy as np import statsmodels.robust as smrb from functools import partial

median absolute deviation as a partial function

in order to demonstrate the issue with partial functions as aggregators

mad_c1 = partial(smrb.mad, c=1)

no way of choosing the destination's column names...

mydf_agg = mydf.groupby('cat').agg({ 'energy': [ 'sum', lambda x: np.percentile(x, 98), # lambda lambda x: np.percentile(x, 17), # lambda ], 'distance': [ 'sum', 'mean', smrb.mad, # original function mad_c1, # partial function wrapping the original function ], })

The above breaks because the lambda functions will all result in columns named <lambda> which results in

SpecificationError: Function names must be unique, found multiple named <lambda>

Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.

If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function which inherits the function name from the original function:

SpecificationError: Function names must be unique, found multiple named mad

Finally, after overwriting the __name__ attribute of the partial (for example with mad_c1.__name__ = 'mad_c1') we get:

    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000

with still

one column missing (98th percentile)
the handling of the MultiIndex columns
and the renaming of the columns

to deal with in separate step.

There is no control possible for the column names after aggregation, the best we can get in an automated way is some combination of original column name and the aggregate function's name like this:

mydf_agg.columns = ['_'.join(col) for col in mydf_agg.columns]

which results in:

     energy_sum  energy_<lambda>  distance_sum  distance_mean  distance_mad distance_mad_c1
cat
A          5.79           1.8510          4.44          1.480      0.355825           0.240
B          2.85           1.3095          1.83          0.915      0.140847           0.095
C          1.01           1.0100          0.60          0.600      0.000000           0.000

and if you really need to have different names, you can do it like this:

mydf_agg.rename({ "energy_sum": "total_energy", "energy_": "energy_p17", "distance_sum": "total_distance", "distance_mean": "average_distance" }, inplace=True)

but that means that you need to be careful to keep the renaming code (which must now be located at another place in the code) in sync with the code where the aggregation is defined...

Sad pandas user 😢 (which still loves pandas of course)

I am all in for consistency, and at the same time I deeply regret the deprecation of the aggregate and rename functionality. I hope the examples above make the pain points clear.

Possible solutions

Un-deprecate the dict-of-dict relabeling functionality
Provide another API to be able to do it (but why should there be two methods for the same main purpose, namely aggregation?)
??? (open to suggestions)

Optional read:

With respect to the aforementioned discussion in the pull request which has been going on already for a few months, I only recently realized one of the reasons why I am so bothered by this deprecation: "aggregate and rename" is a natural thing to do with GROUP BY aggregations in SQL since in SQL you usually provide the destination column name directly next to the aggregation expression, e.g. SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1.

I'm not saying that Pandas should necessarily provide the same functionalities as SQL of course. But the examples provided above demonstrate why the dict-of-dict API was in my opinion a clean and simple solution to many use-cases.

(* I don't personally agree that the dict-of-dict approach is complex.)