Time-based .rolling() fails with .groupby() · Issue #13966 · pandas-dev/pandas (original) (raw)
Starting with this example:
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
'B': np.arange(40)})
I can easily compute the rolling mean by identifier:
In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
...
2 30 28.5
31 29.5
3 32 NaN
33 NaN
34 NaN
35 33.5
36 34.5
37 35.5
38 36.5
39 37.5
Name: B, dtype: float64
Now I want to add a timestamp column:
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
'B': np.concatenate((dates, dates)),
'C': np.arange(40)})
The timestamps are ordered within each identifier, but pandas complains:
In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic
Re-sorting leads to a different error:
In [26]: df.sort_values('B', inplace=True)
In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None
But we know that these column names are valid:
n [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0 0.000000
20 10.000000
1 7.000000
21 10.500000
2 8.800000
22 11.000000
3 9.857143
23 11.500000
4 10.857143
24 12.500000
...
35 24.714286
15 23.500000
36 25.714286
16 24.500000
37 26.714286
17 25.500000
38 27.714286
18 26.500000
19 25.857143
39 27.500000
Name: C, dtype: float64
It seems like a bug that time-based .rolling()
does not work with .groupby()
.