REF: Decouple Series.apply from Series.agg by topper-123 · Pull Request #53400 · pandas-dev/pandas

I may not be experienced enough with the groupby and window methods to know, but I think it's worth looking into whether the Series and DataFrame methods can have their undesired behavior deprecated using a normal process, i.e. whether we can avoid a parallel implementation for them.

TBH, this has always been a very complex area, and it was only after doing #53362 (i.e. very recently) that I started thinking it could be possible to do this with a normal deprecation process. I may still have missed something and be proven wrong, of course, but that's part of the discussion...

The way I see it, after this PR and #53325, SeriesApply.agg will for example be (shortened a bit):

def agg(self):
    result = super().agg()
    if result is None:
        obj = self.obj
        f = self.f

        try:
            result = obj.apply(f)
        except (ValueError, AttributeError, TypeError):
            result = f(self.obj)
        else:
            msg = (
                f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
                f"has been deprecated. Use {type(obj).__name__}.transform to "
                f"keep behavior unchanged."
            )
            warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
        return result

The above just deprecates calling obj.apply(f), and we are actually not guaranteed that f(obj) returns an aggregated value. For example, with f = lambda x: np.sqrt(x), agg will return transformed values, not an aggregate.
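To make the sqrt example concrete, here is a small sketch that calls the functions directly on a Series rather than going through agg (whose exact behavior depends on the pandas version). The names `transforming` and `reducing` are just illustrative:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 4.0, 9.0])


def transforming(x):
    # Element-wise: returns an object of the same length as the input.
    return np.sqrt(x)


def reducing(x):
    # Reduces the whole Series to a single value.
    return x.sum()


transformed = transforming(ser)  # a Series with three elements
reduced = reducing(ser)          # a single scalar

print(type(transformed).__name__, len(transformed))
print(reduced)
```

Both functions are accepted as f, but only the second one gives something that looks like an aggregate.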

But if we change the above to:

def agg(self):
    result = super().agg()
    # defined up front so they are available for the warning message below
    obj = self.obj
    f = self.f

    if result is None:
        try:
            result = obj.apply(f)
        except (ValueError, AttributeError, TypeError):
            result = f(obj)

    if not self._is_aggregate_value(result):  # aside: how do we know something is an aggregate value?
        msg = (
            f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
            f"has been deprecated. Use {type(obj).__name__}.transform to "
            f"keep behavior unchanged."
        )
        warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())

    return result

then returning a non-aggregate value will emit a warning. So the question is whether the above is backward compatible and whether, if users fix their code so that it no longer emits warnings, that code will be compatible with v3.0.
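On the aside about how to know something is an aggregate value: one possible heuristic, purely as a sketch, is to treat anything that is not list-like as an aggregate. The function `is_aggregate_value` below is hypothetical (it is not an existing pandas method); it only relies on `pandas.api.types.is_list_like`:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like


def is_aggregate_value(result) -> bool:
    # Hypothetical heuristic, not pandas API: scalars, strings and other
    # non-list-like objects count as aggregates; Series, DataFrames,
    # arrays and lists do not.
    return not is_list_like(result)


ser = pd.Series([1.0, 4.0, 9.0])

print(is_aggregate_value(ser.sum()))     # a scalar reduction
print(is_aggregate_value(np.sqrt(ser)))  # an element-wise transform
```

This would misclassify a function that deliberately reduces to a list or tuple, so a real check would likely need to be more careful.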

In pandas v3.0 the method will become:

def agg(self):
    result = super().agg()
    # defined up front so they are available for the warning message below
    obj = self.obj
    f = self.f

    if result is None:
        result = f(obj)

    if not self._is_aggregate_value(result):
        msg = (
            f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
            f"has been deprecated. Use {type(obj).__name__}.transform to "
            f"keep behavior unchanged."
        )
        warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())

    return result

The same can be done with Series.transform, where we also need to check that the result is a transformed result and not an aggregation or something else. Doing it with Series.apply requires an array_ops_only parameter, however.
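For the transform side, a rough sketch of such a check could be to require that the result is a Series or DataFrame with the same index as the input. The helper `is_transform_result` is hypothetical and only meant to illustrate the idea:

```python
import numpy as np
import pandas as pd


def is_transform_result(result, obj) -> bool:
    # Hypothetical helper, not pandas API: a transformed result is
    # assumed to be a Series/DataFrame aligned with the input's index.
    return isinstance(result, (pd.Series, pd.DataFrame)) and result.index.equals(
        obj.index
    )


ser = pd.Series([1.0, 4.0, 9.0])

print(is_transform_result(np.sqrt(ser), ser))  # same index as the input
print(is_transform_result(ser.sum(), ser))     # scalar, so not a transform
print(is_transform_result(ser.head(2), ser))   # shorter index, not a transform
```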

My suspicion is that if the above can be done on Series.(agg|transform|apply), then DataFrame.(agg|transform|apply) will also fall into place easily. I haven't looked into (Groupby|Resample|Window).(agg|transform|apply) enough to have a clear opinion, but maybe everything will fall into place like a puzzle there also, if the Series/DataFrame methods can be deprecated correctly. And if not, the new Series/DataFrame methods could be used in new versions of the (Groupby|Resample|Window) methods, easing their implementations?