DataFrame.iterrows() breaks timezone on index · Issue #8951 · pandas-dev/pandas
Duplicate of #8890.
As far as I can tell, the Timestamps for the index generated by iterrows() are 5 hours behind where they should be in this example:
In [33]: idx = pd.date_range("2010-01-01 00:00:00-0500", freq='D', periods=3)
In [34]: df = pd.DataFrame([1,2,3], index=[idx])
In [35]: df  # this looks correct
Out[35]:
                           0
2010-01-01 00:00:00-05:00  1
2010-01-02 00:00:00-05:00  2
2010-01-03 00:00:00-05:00  3

In [36]: [index for index, row in df.iterrows()]  # but this looks wrong:
Out[36]:
[Timestamp('2009-12-31 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-01 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-02 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D')]
I would have expected iterrows() to produce the same indices as this code:
In [38]: [df.index[i] for i in range(len(df))]
Out[38]:
[Timestamp('2010-01-01 00:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-02 00:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-03 00:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D')]
The row.name is also incorrect:
In [37]: [row.name for index, row in df.iterrows()]
Out[37]:
[Timestamp('2009-12-31 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-01 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D'),
 Timestamp('2010-01-02 19:00:00-0500', tz='pytz.FixedOffset(-300)', offset='D')]
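The expectation above can be written as a small self-contained check. This is only a sketch, not part of the original session: it passes the index as `index=idx` rather than `index=[idx]` (an assumption, since newer pandas versions may treat a list of arrays as MultiIndex levels), and the variable names are made up for illustration. On a pandas version without this bug, all three lists should agree.

```python
import pandas as pd

# Fixed-offset (UTC-05:00) daily index, as in the report above.
idx = pd.date_range("2010-01-01 00:00:00-0500", freq="D", periods=3)

# Assumption: pass the index directly instead of wrapping it in a list.
df = pd.DataFrame([1, 2, 3], index=idx)

# Index labels as yielded by iterrows().
from_iterrows = [index for index, row in df.iterrows()]

# Index labels read positionally from df.index (the expected values).
from_positions = [df.index[i] for i in range(len(df))]

# The name attached to each row Series by iterrows().
names = [row.name for index, row in df.iterrows()]

# True only if iterrows() preserves the tz-aware timestamps unchanged.
matches = from_iterrows == from_positions == names
print("iterrows() preserves the index timestamps:", matches)
```

If `matches` comes out False, comparing `from_iterrows[0]` with `from_positions[0]` shows the size of the shift (5 hours in the session above).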
But all is fine if we use a geographical timezone instead of a pytz.FixedOffset:
In [47]: idx = pd.date_range("2010-01-01 00:00:00", freq='D', periods=3, tz="America/New_York")
In [48]: df = pd.DataFrame([1,2,3], index=[idx])
In [49]: [index for index, row in df.iterrows()]
Out[49]:
[Timestamp('2010-01-01 00:00:00-0456', tz='America/New_York', offset='D'),
 Timestamp('2010-01-02 00:00:00-0456', tz='America/New_York', offset='D'),
 Timestamp('2010-01-03 00:00:00-0456', tz='America/New_York', offset='D')]

In [50]: df
Out[50]:
                           0
2010-01-01 00:00:00-04:56  1
2010-01-02 00:00:00-04:56  2
2010-01-03 00:00:00-04:56  3
Forgive me if I am using Pandas incorrectly!
Versions:
In [51]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-25-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.15.1
nose: 1.3.4
Cython: 0.21.1
numpy: 1.8.2
scipy: 0.14.0
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.3
patsy: None
dateutil: 2.2
pytz: 2014.10
bottleneck: 0.6.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: 1.8.6
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.6
bs4: None
html5lib: 0.999
httplib2: 0.9
apiclient: None
rpy2: 2.3.8
sqlalchemy: None
pymysql: None
psycopg2: None
(it goes without saying that I'm a huge fan of Pandas!)