BUG: Unable to aggregate TimeGrouper · Issue #7453 · pandas-dev/pandas (original) (raw)
Derived from #7373. There seems to be 3 issues related to TimeGrouper
aggregation.
1. var, std, mean
var/std/mean raises ValueError
when group key contains NaT
.
import pandas as pd
import numpy as np
data = np.random.randn(20, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
df['dt'] = [datetime.datetime(2013, 1, 1), datetime.datetime(2013, 1, 2),
datetime.datetime(2013, 1, 3), datetime.datetime(2013, 1, 4),
datetime.datetime(2013, 1, 5)] * 4
df['dt_nat'] = [datetime.datetime(2013, 1, 1), datetime.datetime(2013, 1, 2),
pd.NaT, datetime.datetime(2013, 1, 4),
datetime.datetime(2013, 1, 5)] * 4
df.groupby(pd.TimeGrouper(key='dt', freq='D')).mean()
# OK
df.groupby(pd.TimeGrouper(key='dt_nat', freq='D')).mean()
# ValueError: month must be in 1..12
2. size (#7600)
size
raises AttributeError
regardless of NaT
existence.
df.groupby(pd.TimeGrouper(key='dt', freq='D')).size()
# AttributeError: 'BinGrouper' object has no attribute 'groupings'
3. first, last, nth
It looks work, but TimeGrouper
outputs different result from normal groupby
.
df.groupby('dt').first()
# A B C D key dt_nat
# dt
#2013-01-01 -1.868691 -0.554116 -0.094949 0.009740 1 2013-01-01
#2013-01-02 0.272139 -0.106543 1.319331 -0.532377 2 2013-01-02
#2013-01-03 -1.637544 2.699557 -0.164414 -1.451295 3 NaT
#2013-01-04 1.642609 -0.313832 0.494468 -0.698104 4 2013-01-04
#2013-01-05 -1.554106 1.230299 -1.408515 -0.000722 5 2013-01-05
df.groupby(pd.TimeGrouper(key='dt', freq='D')).first()
# A B C D key dt_nat
# dt
#2013-01-01 -1.868691 -0.554116 -0.094949 0.009740 1 2013-01-01
#2013-01-02 0.272139 -0.106543 1.319331 -0.532377 2 2013-01-02
#2013-01-03 -1.637544 2.699557 -0.164414 -1.451295 3 NaT
#2013-01-04 1.642609 -0.313832 0.494468 -0.698104 4 2013-01-04
#2013-01-05 -0.024332 1.668172 -0.328200 1.731480 5 2013-01-05
# Compare 5th row
I assume the difference derived from BinGrouper
sorts rows differently from normal groupby. Thus, result of normal groupby and TimeGrouper
can differ.
df.groupby('dt').get_group(datetime.datetime(2013, 1, 5))
# A B C D dt dt_nat
#4 0.632937 0.224670 -0.201186 -0.340428 2013-01-05 2013-01-05
#9 -1.238944 -0.031075 -1.173326 -0.314716 2013-01-05 2013-01-05
#14 2.108985 0.993430 1.300605 1.452049 2013-01-05 2013-01-05
#19 0.315452 -0.817634 -0.526728 0.201415 2013-01-05 2013-01-05
df.groupby(pd.TimeGrouper(key='dt', freq='D')).get_group(datetime.datetime(2013, 1, 5))
# A B C D dt dt_nat
#9 -1.238944 -0.031075 -1.173326 -0.314716 2013-01-05 2013-01-05
#4 0.632937 0.224670 -0.201186 -0.340428 2013-01-05 2013-01-05
#14 2.108985 0.993430 1.300605 1.452049 2013-01-05 2013-01-05
#19 0.315452 -0.817634 -0.526728 0.201415 2013-01-05 2013-01-05