Taking first row from each group in groupby sometimes strips tzinfo

xref #12898 (same fix)

(cf. http://stackoverflow.com/questions/31617084/how-to-have-groupby-first-not-remove-timezone-info-from-datetime-columns)
Take a DataFrame with a column of tz-aware datetime.datetime objects, group it by a different column, and return the first row from each group. Some ways of doing this leave the datetime as it is; at least two others convert it to a tz-naive pandas Timestamp.

In [1]: import pandas as pd

In [2]: import datetime

In [3]: import pytz

In [4]: dates = [datetime.datetime(2015,1,i,tzinfo=pytz.timezone('US/Pacific')) for i in range(1,5)]

In [5]: df = pd.DataFrame({'A': ['a','b']*2,'B': dates})

In [6]: df
Out[6]: 
   A                          B
0  a  2015-01-01 00:00:00-08:00
1  b  2015-01-02 00:00:00-08:00
2  a  2015-01-03 00:00:00-08:00
3  b  2015-01-04 00:00:00-08:00

In [7]: grouped = df.groupby('A') 

In [8]: grouped.nth(0) #B stays a datetime.datetime with timezone info
Out[8]: 
                           B
A                           
a  2015-01-01 00:00:00-08:00
b  2015-01-02 00:00:00-08:00

In [9]: grouped.head(1) #B stays a datetime.datetime with timezone info
Out[9]: 
                           B
0  2015-01-01 00:00:00-08:00
1  2015-01-02 00:00:00-08:00

In [10]: grouped.first() #B is a naive pd.Timestamp in UTC
Out[10]: 
                    B
A                    
a 2015-01-01 08:00:00
b 2015-01-02 08:00:00

grouped.apply(lambda x: x.iloc[0]) apparently behaves the same as .first(): B comes back tz-naive.
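
For anyone who wants to check their own pandas version, here is a minimal sketch (not taken from the session above) that reports which approaches keep the timezone, alongside one possible workaround. It assumes a fixed UTC-08:00 offset from the standard library in place of pytz's US/Pacific so that it only needs pandas; `has_tz`, `first_rows` and `report` are names made up for illustration. The workaround uses `drop_duplicates`, which selects whole rows instead of aggregating; note that, unlike `.first()`, it does not skip missing values within a column.

```python
import datetime

import pandas as pd

# Assumption: a fixed UTC-08:00 offset stands in for pytz's US/Pacific,
# so the sketch has no dependency beyond pandas.
tz = datetime.timezone(datetime.timedelta(hours=-8))
dates = [datetime.datetime(2015, 1, i, tzinfo=tz) for i in range(1, 5)]
df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})


def has_tz(value):
    """Return True if a datetime-like scalar carries timezone info."""
    return getattr(value, 'tzinfo', None) is not None


# Possible workaround: keep the first row for each key by dropping later
# duplicates of 'A', then index by the key like .first() does.
first_rows = df.drop_duplicates(subset='A', keep='first').set_index('A')

# Report, for the installed pandas version, which approaches keep tzinfo.
grouped = df.groupby('A')
report = {
    'drop_duplicates': all(has_tz(v) for v in first_rows['B']),
    'head(1)': all(has_tz(v) for v in grouped.head(1)['B']),
    'first()': all(has_tz(v) for v in grouped.first()['B']),
}
print(report)
```

The `head(1)` and `first()` entries depend on the pandas version in use, so no expected output is shown for them.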