ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' by hasB4K · Pull Request #31809 · pandas-dev/pandas
I would rather suggest the following:
I always thought that the base argument has an ambiguous name, and its current behavior is quite confusing. It needs to be an integer (or a floating-point number) expressed in the unit of the frequency:
- So base=1 with a frequency of 5D is equal to 1D that we add as an offset.
- So base=2 with a frequency of 5min is equal to 2min that we add as an offset.
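To make the two cases above concrete, here is a minimal standard-library sketch of how base could be read as a time shift. The helper base_as_timedelta and its small unit table are hypothetical names made up for this illustration; they are not part of pandas and only cover the two units used above.

```python
import re
from datetime import timedelta

# Hypothetical unit table for this sketch; pandas supports many more aliases.
_UNITS = {"D": timedelta(days=1), "min": timedelta(minutes=1)}


def base_as_timedelta(base, freq):
    """Return the shift that `base` implies for a frequency string such as "5D"."""
    match = re.fullmatch(r"(\d*)([A-Za-z]+)", freq)
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported frequency: {freq!r}")
    # The multiple (the 5 in "5D") is ignored: base counts single units.
    return base * _UNITS[match.group(2)]


print(base_as_timedelta(1, "5D"))    # 1 day, 0:00:00
print(base_as_timedelta(2, "5min"))  # 0:02:00
```

The point of the sketch is that base only makes sense once you know the unit of the frequency, which is exactly what makes the argument easy to misuse.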
This behavior is very confusing for the users (myself included), but it also creates bugs: see #25161, #25226
Instead of relying on base I would rather deprecate this argument. The argument loffset (currently broken for pd.Grouper as shown in #28302, but fixable in the current PR) is kind of equivalent to what base is doing (especially since it is a Timedelta).
Example of the current use of loffset with resample:
import numpy as np
import pandas as pd
start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
rng = pd.date_range(start, end, freq="1231min")
ts = pd.Series(np.arange(len(rng)), index=rng)
ts.resample("1min", loffset=-pd.Timedelta("1min")).count()
1999-12-31 23:59:00 1
2000-01-01 00:00:00 0
2000-01-01 00:01:00 0
2000-01-01 00:02:00 0
2000-01-01 00:03:00 0
..
2000-01-30 22:00:00 0
2000-01-30 22:01:00 0
2000-01-30 22:02:00 0
2000-01-30 22:03:00 0
2000-01-30 22:04:00 1
Freq: T, Length: 43086, dtype: int64
Example of the current broken loffset argument:
ts.groupby(pd.Grouper(freq="1min", loffset=-pd.Timedelta("1min"))).count()
2000-01-01 00:00:00 1
2000-01-01 00:01:00 0
2000-01-01 00:02:00 0
2000-01-01 00:03:00 0
2000-01-01 00:04:00 0
..
2000-01-30 22:01:00 0
2000-01-30 22:02:00 0
2000-01-30 22:03:00 0
2000-01-30 22:04:00 0
2000-01-30 22:05:00 1
Freq: T, Length: 43086, dtype: int64
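The difference between the two outputs is only in the labels: in the resample output the first label is moved back by one minute, while in the pd.Grouper output it is not. As a rough standard-library sketch of that "shift the label, not the bin" idea (the function count_per_bin is a hypothetical helper for illustration, anchored at midnight by assumption, and not how pandas implements it):

```python
from collections import Counter
from datetime import datetime, timedelta


def count_per_bin(timestamps, freq, loffset=timedelta(0)):
    """Count timestamps per midnight-anchored bin, then shift each label by loffset."""
    counts = Counter()
    for ts in timestamps:
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        left_edge = midnight + ((ts - midnight) // freq) * freq
        counts[left_edge + loffset] += 1  # only the label moves, not the bin
    return dict(counts)


stamps = [datetime(2000, 1, 1, 0, 0), datetime(2000, 1, 1, 0, 0, 30)]
print(count_per_bin(stamps, timedelta(minutes=1), loffset=-timedelta(minutes=1)))
# {datetime.datetime(1999, 12, 31, 23, 59): 2}
```

Both timestamps still fall into the same one-minute bin; only the label under which the count is reported changes.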
That being said, I agree that the name adjust_timestamp is not ideal. I would rename it to origin or base_timestamp.
The line https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728 would be replaced by something roughly equivalent to:
origin = start_of_day if origin is None else origin
origin = origin.value + loffset.value
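One possible reading of this proposal, sketched with the standard library only: the bins are anchored at origin plus loffset, and a timestamp belongs to the bin whose left edge is the last anchor-aligned edge at or before it. The function bin_start is a hypothetical name, and defaulting the origin to midnight of the timestamp's own day is an assumption made to keep the sketch short.

```python
from datetime import datetime, timedelta


def bin_start(ts, freq, origin=None, loffset=timedelta(0)):
    """Left edge of the bin containing `ts`, with bins anchored at origin + loffset."""
    # Assumption: with no explicit origin, anchor at midnight of the timestamp's day.
    start_of_day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    anchor = (start_of_day if origin is None else origin) + loffset
    # Floor division of timedeltas counts whole bins between the anchor and ts.
    n_bins = (ts - anchor) // freq
    return anchor + n_bins * freq


ts = datetime(2000, 1, 1, 0, 7)
print(bin_start(ts, timedelta(minutes=5)))                                # 2000-01-01 00:05:00
print(bin_start(ts, timedelta(minutes=5), loffset=timedelta(minutes=2)))  # 2000-01-01 00:07:00
```

With a 2-minute shift the 5-minute edges move from 00:00, 00:05, ... to 00:02, 00:07, ..., which is the same effect base=2 has today for a 5min frequency.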
TL;DR:
- In this PR, I would fix loffset for pd.Grouper and deprecate the confusing base argument
- I would rename the added argument adjust_timestamp to origin
- We would have origin = origin.value + loffset.value
- I would add more tests to check the behavior of loffset and origin
What do you think?