ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' by hasB4K · Pull Request #31809 · pandas-dev/pandas

I would rather suggest the following:

I have always found the name of the base argument ambiguous, and its current behavior is quite confusing. It needs to be an integer (or a float) expressed in the unit of the frequency.
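A minimal pure-Python sketch of that unit dependence. It is a simplified model, not pandas code: `base_shift` is a hypothetical helper, and it assumes that `base` is read in the unit of the frequency and wraps around at the frequency multiple `n`.

```python
from datetime import timedelta


def base_shift(base, n, unit):
    """Shift of the bin origin implied by `base` for a frequency of n * unit.

    Hypothetical helper: `base` is read in the unit of the frequency and
    wraps around at `n` (an assumption of this sketch).
    """
    return (base % n) * unit


minute = timedelta(minutes=1)
second = timedelta(seconds=1)

# The same base=2 means 2 minutes for "5min" but only 2 seconds for "5s".
shift_5min = base_shift(2, 5, minute)
shift_5s = base_shift(2, 5, second)

# A base of n or more wraps around: base=7 with "5min" acts like base=2.
shift_wrapped = base_shift(7, 5, minute)

print(shift_5min, shift_5s, shift_wrapped)
```

Under these assumptions the meaning of a given `base` value changes whenever the frequency changes, which is the ambiguity described above. A Timedelta-like argument carries its own unit and avoids it.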

This behavior is very confusing for users (myself included), and it also causes bugs: see #25161 and #25226.

Rather than relying on base, I would deprecate it. The loffset argument (currently broken for pd.Grouper, as shown in #28302, but fixable in this PR) is roughly equivalent to what base does, especially since it is a Timedelta.

Example of the current use of loffset with resample:

import numpy as np
import pandas as pd

start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
rng = pd.date_range(start, end, freq="1231min")
ts = pd.Series(np.arange(len(rng)), index=rng)
# Shift the resulting labels back by one minute
ts.resample("1min", loffset=-pd.Timedelta("1min")).count()

1999-12-31 23:59:00    1
2000-01-01 00:00:00    0
2000-01-01 00:01:00    0
2000-01-01 00:02:00    0
2000-01-01 00:03:00    0
                      ..
2000-01-30 22:00:00    0
2000-01-30 22:01:00    0
2000-01-30 22:02:00    0
2000-01-30 22:03:00    0
2000-01-30 22:04:00    1
Freq: T, Length: 43086, dtype: int64

Example of the currently broken loffset argument with pd.Grouper (the labels are not shifted):

ts.groupby(pd.Grouper(freq="1min", loffset=-pd.Timedelta("1min"))).count()

2000-01-01 00:00:00    1
2000-01-01 00:01:00    0
2000-01-01 00:02:00    0
2000-01-01 00:03:00    0
2000-01-01 00:04:00    0
                      ..
2000-01-30 22:01:00    0
2000-01-30 22:02:00    0
2000-01-30 22:03:00    0
2000-01-30 22:04:00    0
2000-01-30 22:05:00    1
Freq: T, Length: 43086, dtype: int64

That being said, I agree that the name adjust_timestamp is not ideal. I would rename it to origin or base_timestamp.

The line https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728 would be replaced by something roughly equivalent to:

origin = start_of_day if origin is None else origin
origin = origin.value + loffset.value
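The two lines above can be modelled without pandas. This is a minimal sketch, assuming that bins are laid out at whole multiples of the frequency counted from the origin. `effective_origin` and `bin_start` are hypothetical helpers, and `datetime`/`timedelta` stand in for the nanosecond integers (`.value`) used in the snippet.

```python
from datetime import datetime, timedelta


def effective_origin(first, origin=None, loffset=timedelta(0)):
    """Default to the start of the first timestamp's day, then add loffset."""
    start_of_day = first.replace(hour=0, minute=0, second=0, microsecond=0)
    chosen = start_of_day if origin is None else origin
    return chosen + loffset


def bin_start(ts, freq, origin):
    """Left edge of the bin containing ts, for bins of width freq from origin."""
    # timedelta // timedelta floors, so timestamps before the origin work too.
    n_bins = (ts - origin) // freq
    return origin + n_bins * freq


first = datetime(2000, 1, 1, 0, 3)
freq = timedelta(minutes=5)

origin_default = effective_origin(first)
origin_shifted = effective_origin(first, loffset=timedelta(minutes=2))

print(bin_start(first, freq, origin_default))
print(bin_start(first, freq, origin_shifted))
```

With the default origin the timestamp falls in the bin starting at midnight. With a two-minute loffset every bin edge moves by two minutes, so the same timestamp falls in the bin starting at 00:02.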

TL;DR:

  1. In this PR, I would fix loffset for pd.Grouper and deprecate the confusing base argument
  2. I would rename the added argument adjust_timestamp to origin
  3. We would have origin = origin.value + loffset.value
  4. I would add more tests to check the behavior of loffset and origin

What do you think?