Incorrect resampling from hourly to monthly values · Issue #2665 · pandas-dev/pandas (original) (raw)

EDIT: Modern reproducible example:

In [1]: import pandas as pd

In [2]: dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')

In [4]: ts = pd.Series(range(len(dates)), index=dates)

In [5]: def start(x):
   ...:     return x.index[0]
   ...:
   ...: def end(x):
   ...:     return x.index[-1]
   ...:

In [6]: ts.resample('1M', closed='right', label='right').apply(start)
Out[6]:
2000-03-31   2000-03-20
2000-04-30   2000-04-01
2000-05-31   2000-05-01
Freq: M, dtype: datetime64[ns]

In [7]: ts.resample('1M', closed='right', label='right').apply(end)
Out[7]:
2000-03-31   2000-03-31 23:00:00
2000-04-30   2000-04-30 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

In [8]: ts.resample('1M', closed='left', label='right').apply(start)
Out[8]:
2000-03-31   2000-03-20
2000-04-30   2000-03-31
2000-05-31   2000-04-30
Freq: M, dtype: datetime64[ns]

In [9]: ts.resample('1M', closed='left', label='right').apply(end)
Out[9]:
2000-03-31   2000-03-30 23:00:00
2000-04-30   2000-04-29 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

Expected behavior: #2665 (comment)

There seems to be something wrong when resampling hourly values to monthly values. The 'closed=' argument does not do what it should. I am using pandas 0.10.0.

import pandas as pd
from pandas import TimeSeries
import numpy as np

# Timeseries with hourly values
dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = TimeSeries(np.arange(len(dates)), index=dates)

# I want monthly values equal to the summed values of the hours in that month
# With "closed=left" I want the hourly value (1 value) at midnight to be 
# included in the month after that very midnight
data_monthly_1 = ts.resample('1M', how='sum', closed='left', label='right') 
sum_april1 = data_monthly_1.ix[1]

# If we check this, the summed values per month do not match : 
data_hourly_april2 = ts[ts.index.month == 4]
sum_april2 = sum(data_hourly_april2)

# Instead, "closed=left" above resulted in an inclusion of the entire last day of the
# previous month (25 values !) (and exclusion of the last day of the month)
# This is also strange as I am nowhere concerned with using 'days'.
data_hourly_april3 = ts[   ((ts.index.month == 3) & (ts.index.day == 31)) | 
                           ((ts.index.month == 4) & (ts.index.day < 30))      ]
sum_april3 = sum(data_hourly_april3)

# PS: "closed=right" results in the answer I would expect for "closed = left":
data_monthly_4 = ts.resample('1M', how='sum', closed='right', label='right') 
sum_april4 = data_monthly_4.ix[1]

When resampling these hourly to daily values, there was no problem.