AmbiguousTimeError on groupby when including a DST change · Issue #14682 · pandas-dev/pandas

A small, complete example of the issue

#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([1477786980, 1477790580], columns=['ts'])
df['date'] = pd.to_datetime(df.ts, unit='s').dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df.set_index('date', inplace=True)

dfo = df.groupby(pd.TimeGrouper('5min'))

print dfo.count()

Expected Output

                           ts
date                         
2016-10-30 02:20:00+02:00   1
2016-10-30 02:25:00+02:00   0
2016-10-30 02:30:00+02:00   0
2016-10-30 02:35:00+02:00   0
2016-10-30 02:40:00+02:00   0
2016-10-30 02:45:00+02:00   0
2016-10-30 02:50:00+02:00   0
2016-10-30 02:55:00+02:00   0
2016-10-30 02:00:00+01:00   0
2016-10-30 02:05:00+01:00   0
2016-10-30 02:10:00+01:00   0
2016-10-30 02:15:00+01:00   0
2016-10-30 02:20:00+01:00   1

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.6.1
Cython: 0.25.1
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: 1.4.8
patsy: None
dateutil: 2.4.2
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: None
openpyxl: 2.2.6
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
httplib2: 0.9.1
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

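The above code raises an AmbiguousTimeError when grouping on a time-zone-aware datetime index that spans a DST change. In the example, the two Unix timestamps fall on either side of the end of DST in Europe on 30 October 2016: both convert to 02:23 local time in Europe/Madrid, the first at +02:00 and the second at +01:00.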
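
To make the ambiguity concrete, here is a minimal sketch that uses only the Python 3 standard library. It assumes Python 3.9+ for `zoneinfo` and an available tz database (the report itself ran on Python 2.7, so this is an illustration, not the reporter's code). It converts the two Unix timestamps from the example to Europe/Madrid and shows that they share a wall-clock time but differ in UTC offset.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs the system tz database or the tzdata package

madrid = ZoneInfo("Europe/Madrid")

# The two Unix timestamps from the report, one hour apart in UTC.
stamps = [1477786980, 1477790580]
local = [datetime.fromtimestamp(s, tz=timezone.utc).astimezone(madrid) for s in stamps]

for dt in local:
    print(dt.isoformat())
# 2016-10-30T02:23:00+02:00
# 2016-10-30T02:23:00+01:00

# Dropping the offset leaves the same wall-clock string twice,
# which is why a naive 02:2x timestamp on this date is ambiguous.
wall_clock = [dt.replace(tzinfo=None).isoformat() for dt in local]
offsets_in_hours = [dt.utcoffset().total_seconds() / 3600 for dt in local]
print(wall_clock, offsets_in_hours)
```

Any code path that strips the time zone from such a value and then re-attaches it has to pick one of the two offsets, or fail.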

The stack trace is:

Traceback (most recent call last):
  File "./t.py", line 7, in <module>
    dfo = df.groupby(pd.TimeGrouper('5min'))
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3984, in groupby
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1501, in groupby
    return klass(obj, by, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 370, in __init__
    mutated=self.mutated)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 2382, in _get_grouper
    binner, grouper, obj = key._get_grouper(obj)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1062, in _get_grouper
    r._set_binner()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 237, in _set_binner
    self.binner, self.grouper = self._get_binner()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 245, in _get_binner
    binner, bins, binlabels = self._get_binner_for_time()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 660, in _get_binner_for_time
    return self.groupby._get_time_bins(self.ax)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1118, in _get_time_bins
    base=self.base)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1262, in _get_range_edges
    closed=closed, base=base)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.py", line 1326, in _adjust_dates_anchored
    return (Timestamp(fresult).tz_localize(first_tzinfo),
  File "pandas/tslib.pyx", line 621, in pandas.tslib.Timestamp.tz_localize (pandas/tslib.c:13694)
  File "pandas/tslib.pyx", line 4308, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:74816)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from Timestamp('2016-10-30 02:20:00'), try using the 'ambiguous' argument
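
The last pandas frame in the trace, `_adjust_dates_anchored`, appears to rebuild the first bin edge as a naive `Timestamp` and then call `tz_localize` on it. The sketch below reproduces just that last step in isolation, outside of `groupby`, and shows how the `ambiguous` argument mentioned in the error message selects one of the two readings. It is an illustration of the failing call, not a patch; the exact exception class may differ between pandas versions, so the sketch catches `Exception` and only reports the class name.

```python
import pandas as pd

# The naive value named in the error message.
naive = pd.Timestamp("2016-10-30 02:20:00")

# Default behaviour: localizing an ambiguous wall-clock time raises.
try:
    naive.tz_localize("Europe/Madrid")
    raised = None
except Exception as exc:
    # pytz.exceptions.AmbiguousTimeError in the traceback above;
    # other pandas versions may raise a different class.
    raised = type(exc).__name__

# ambiguous=True selects the DST (summer time) reading,
# ambiguous=False the standard (winter time) reading.
summer = naive.tz_localize("Europe/Madrid", ambiguous=True)
winter = naive.tz_localize("Europe/Madrid", ambiguous=False)

print(raised)
print(summer, winter)
```

The two localized values are one hour apart in absolute time, which matches the two `02:20:00` rows in the expected output.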

The code works if the series does not include a DST change (e.g. one day earlier):

#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([1477700580, 1477704180], columns=['ts'])
df['date'] = pd.to_datetime(df.ts, unit='s').dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df.set_index('date', inplace=True)

dfo = df.groupby(pd.TimeGrouper('5min'))

print dfo.count()

This prints:

                           ts
date                         
2016-10-29 02:20:00+02:00   1
2016-10-29 02:25:00+02:00   0
2016-10-29 02:30:00+02:00   0
2016-10-29 02:35:00+02:00   0
2016-10-29 02:40:00+02:00   0
2016-10-29 02:45:00+02:00   0
2016-10-29 02:50:00+02:00   0
2016-10-29 02:55:00+02:00   0
2016-10-29 03:00:00+02:00   0
2016-10-29 03:05:00+02:00   0
2016-10-29 03:10:00+02:00   0
2016-10-29 03:15:00+02:00   0
2016-10-29 03:20:00+02:00   1
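
A possible workaround, which is my own assumption and not part of the report: since UTC has no DST transitions, the grouping can be done on a UTC index and the bin labels converted to Europe/Madrid afterwards. The sketch uses `pd.Grouper(freq=...)` in place of `pd.TimeGrouper`, which was removed in later pandas releases, and Python 3 `print`. Note that for frequencies that do not divide an hour evenly, UTC-aligned bins will not necessarily line up with local-time bins.

```python
import pandas as pd

df = pd.DataFrame([1477786980, 1477790580], columns=["ts"])

# Build the index directly in UTC and keep it there while grouping.
df.index = pd.to_datetime(df["ts"].values, unit="s", utc=True).rename("date")

# Group into 5-minute bins in UTC, where no wall-clock time repeats.
counts = df.groupby(pd.Grouper(freq="5min")).count()

# Convert only the resulting bin labels to local time.
counts.index = counts.index.tz_convert("Europe/Madrid")

print(counts)
```

For the DST example this yields 13 bins running from `02:20:00+02:00` to `02:20:00+01:00`, the same labels as in the expected output above.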