BUG: timezone lost in groupby-agg with cython functions (original) (raw)

xref #10668 (for more examples)

Hello!

I'm running into some odd behavior trying to group rows of a pandas dataframe by ID and then selecting out max/min datetimes (w/ timezones). This is with python 2.7, pandas 0.18.1 and numpy 1.11.1 (I saw in earlier posts a similar problem was apparently fixed w/ pandas 0.15).

Specifically, if I try:
print orders.groupby('OrderID')['start_time'].agg(np.min).iloc[:5]

I get:

OrderID
O161101XVS100000044   2016-11-01 12:03:12.920000-04:00
O161101XVS100000047   2016-11-01 12:03:36.693000-04:00
O161101XVS100000098   2016-11-01 12:09:08.330000-04:00
O161101XVS100000122   2016-11-01 12:09:59.950000-04:00
O161101XVS100000152   2016-11-01 12:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]

Where the raw data had times closer to 8 am (US/Eastern). In other words, it reverted back to UTC times, even though it says it's eastern times, and has UTC-4 offset.

But if I instead try:
print orders.groupby('OrderID')['start_time'].agg(lambda x: np.min(x)).iloc[:5]

I now get:

OrderID
O161101XVS100000044   2016-11-01 08:03:12.920000-04:00
O161101XVS100000047   2016-11-01 08:03:36.693000-04:00
O161101XVS100000098   2016-11-01 08:09:08.330000-04:00
O161101XVS100000122   2016-11-01 08:09:59.950000-04:00
O161101XVS100000152   2016-11-01 08:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]

Which is the behavior I intended. This second method is vastly slower, and I would have assumed the two approaches would yield identical results ...