BUG: Issues with groupby ewm and times · Issue #40951 · pandas-dev/pandas


Code Sample, a copy-pastable example

This refers to the code currently on master at 84d9c5e (2021-04-14). The latest version of pandas is also affected, but the issues show up differently there.

import pandas as pd

halflife = "23 days"
baseline_df = pd.DataFrame(
    {
        "A": ["a", "b", "a", "b", "a", "b"],
        "B": [0, 0, 1, 1, 2, 2],
        "C": pd.to_datetime(
            [
                "2020-01-01",
                "2020-01-01",
                "2020-01-10",
                "2020-01-02",
                "2020-01-23",
                "2020-01-03",
            ]
        ),
    }
)

cython_result = baseline_df.groupby("A").ewm(halflife=halflife, times="C").mean()
print("cython")
print(cython_result)
print("numba")
numba_result = baseline_df.groupby("A").ewm(halflife=halflife, times="C").mean(engine="numba")
print(numba_result)

# Reference values: run ewm on each group separately with that group's own times
expected_result_a = pd.DataFrame([0, 1, 2]).ewm(
    halflife=halflife,
    times=pd.to_datetime(["2020-01-01", "2020-01-10", "2020-01-23"]),
).mean()
expected_result_b = pd.DataFrame([0, 1, 2]).ewm(
    halflife=halflife,
    times=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
).mean()
print("expected")
print("  group a")
print(expected_result_a)
print("  group b")
print(expected_result_b)

Output:

cython
            B
A
a 0  0.000000
  2  0.500000
  4  1.094088
b 1  0.000000
  3  0.500000
  5  1.094088
numba
            B
A
a 0  0.000000
  2  0.666667
  4  1.428571
b 1  0.000000
  3  0.666667
  5  1.428571
expected
  group a
          0
0  0.000000
1  0.567395
2  1.221209
  group b
          0
0  0.000000
1  0.507534
2  1.020088
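
The expected values can be cross-checked without pandas. The following is a minimal sketch, assuming the adjusted weighting (the pandas default, adjust=True) in which an observation that is Δt older than the current row gets weight 0.5 ** (Δt / halflife). The function name time_weighted_ewm is hypothetical and not part of pandas.

```python
from datetime import datetime, timedelta


def time_weighted_ewm(values, times, halflife):
    """Adjusted EWM mean: a weight halves for every `halflife` of elapsed time."""
    result = []
    for i, now in enumerate(times):
        # Weight of every observation up to and including row i, relative to `now`.
        weights = [0.5 ** ((now - t) / halflife) for t in times[: i + 1]]
        weighted_sum = sum(w * v for w, v in zip(weights, values))
        result.append(weighted_sum / sum(weights))
    return result


halflife = timedelta(days=23)
times_a = [datetime(2020, 1, 1), datetime(2020, 1, 10), datetime(2020, 1, 23)]
times_b = [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 3)]

ewm_a = time_weighted_ewm([0, 1, 2], times_a, halflife)
ewm_b = time_weighted_ewm([0, 1, 2], times_b, halflife)
print([round(x, 6) for x in ewm_a])
print([round(x, 6) for x in ewm_b])
```

Under that assumption, the two printed lists should agree with the "group a" and "group b" values in the expected section of the output, which is what a correct groupby ewm would return for each group.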

Problem description

There are three problems with the current groupby ewm implementation in the case of non-None times.

  1. numba implementation: ignores the times entirely
  2. cython implementation: does not use the correct times/deltas in aggregations.pyx when there are multiple groups
  3. if the groups are non-trivial, the time vector and the values get out of sync
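
The numbers in the output above are consistent with the first two points. The sketch below is a diagnostic reading of that output, not the pandas source. It assumes that the numba path effectively decays by one halflife per row, and that the cython path applies the first deltas of the ungrouped C column (0 days, then 9 days) to every group. The helper ewm_from_deltas is hypothetical.

```python
def ewm_from_deltas(values, deltas_in_halflives):
    """Adjusted EWM mean, computed recursively.

    `deltas_in_halflives[i]` is the gap between row i and row i + 1,
    expressed in units of the halflife.
    """
    weighted_avg = float(values[0])
    old_wt = 1.0
    out = [weighted_avg]
    for cur, delta in zip(values[1:], deltas_in_halflives):
        old_wt *= 0.5 ** delta  # decay the accumulated weight by the elapsed time
        weighted_avg = (old_wt * weighted_avg + cur) / (old_wt + 1.0)
        old_wt += 1.0  # the new observation enters with weight 1
        out.append(weighted_avg)
    return out


values = [0, 1, 2]

# Assumption: times ignored, so each row counts as exactly one halflife.
numba_like = ewm_from_deltas(values, [1.0, 1.0])

# Assumption: deltas taken from the ungrouped column
# (2020-01-01 -> 2020-01-01 -> 2020-01-10), with a 23-day halflife.
cython_like = ewm_from_deltas(values, [0 / 23, 9 / 23])

# Correct per-group deltas for group a (9 days, then 13 days).
group_a = ewm_from_deltas(values, [9 / 23, 13 / 23])

print([round(x, 6) for x in numba_like])
print([round(x, 6) for x in cython_like])
print([round(x, 6) for x in group_a])
```

If both assumptions hold, the first two lists reproduce the numba and cython columns reported above for either group, and the third reproduces the expected values for group a.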

I have a branch that fixes these issues and will link to it shortly.

Expected Output

cython
            B
A
a 0  0.000000
  2  0.567395
  4  1.221209
b 1  0.000000
  3  0.507534
  5  1.020088
numba
            B
A
a 0  0.000000
  2  0.567395
  4  1.221209
b 1  0.000000
  3  0.507534
  5  1.020088
expected
  group a
          0
0  0.000000
1  0.567395
2  1.221209
  group b
          0
0  0.000000
1  0.507534
2  1.020088

Output of pd.show_versions()