BUG or DOC: pd.read_csv with parse_dates does not recognize timezone · Issue #22256 · pandas-dev/pandas

When parsing timezone-aware datetimes from a CSV file with pd.read_csv and parse_dates, pandas returns naive timestamps converted to UTC, which surprised me.

Example

Consider reading the following data from a file named pandas_read_csv_bug.csv.
It is a simple time series with the timezone offset (UTC+09:00) specified.

dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
2018-01-04 09:03:00+09:00,23400
2018-01-04 09:04:00+09:00,23400
2018-01-04 09:05:00+09:00,23400

I want to read it with pd.read_csv, passing the parse_dates keyword argument.
If this worked as I expected, it would be the most elegant solution.

import pandas as pd
df = pd.read_csv('pandas_read_csv_bug.csv', parse_dates=['dt'])

However, the result is a data frame df with strange timestamps.

| | dt | val |
| ----- | ------------------- | ----- |
| 0 | 2018-01-04 00:01:00 | 23350 |
| 1 | 2018-01-04 00:02:00 | 23400 |
| 2 | 2018-01-04 00:03:00 | 23400 |
| 3 | 2018-01-04 00:04:00 | 23400 |
| 4 | 2018-01-04 00:05:00 | 23400 |
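The values above are consistent with each input being shifted by its +09:00 offset to UTC and then stripped of its timezone. A minimal check of that arithmetic, using only pd.Timestamp (it deliberately does not call read_csv, whose result may differ between pandas versions):

```python
import pandas as pd

# First timestamp from the CSV, parsed with its +09:00 offset preserved.
aware = pd.Timestamp("2018-01-04 09:01:00+09:00")

# Convert the same instant to UTC, then drop the timezone information.
as_utc = aware.tz_convert("UTC")
naive_utc = as_utc.tz_localize(None)

print(naive_utc)  # 2018-01-04 00:01:00
```

This matches row 0 of the table, so the data are not corrupted; only the timezone information is lost.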

Problem description

Two things surprised me:

  1. The parsed datetimes are timezone-naive: df['dt'].iloc[0].tz is None evaluates to True.
  2. The timestamps are automatically converted to UTC.
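A possible workaround, sketched here rather than proposed as the intended API usage, is to read the column as plain strings and convert it explicitly. The sketch assumes the data were recorded in Asia/Tokyo; the file itself only states +09:00, so that zone name is my assumption. The CSV is inlined through io.StringIO to keep the example self-contained.

```python
import io

import pandas as pd

# Inlined copy of the first rows of pandas_read_csv_bug.csv.
CSV_TEXT = """dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
"""

# Read without parse_dates so the column stays as strings.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# utc=True makes the result timezone-aware (UTC) instead of naive.
df["dt"] = pd.to_datetime(df["dt"], utc=True)

# Assumption: the +09:00 offset means Asia/Tokyo.
df["dt"] = df["dt"].dt.tz_convert("Asia/Tokyo")

print(df["dt"].iloc[0])
```

With this approach the conversion to UTC is explicit, and the timezone is chosen by the caller instead of being guessed from the offset.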

My first impression was that this could not be the best possible behavior.
However, since a UTC offset does not uniquely correspond to a single timezone, this may be the safest and most reasonable behavior.
In that case, the documentation should mention it.
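To illustrate why an offset alone does not identify a timezone, here is a small sketch comparing two zones that both used +09:00 on the date in question. Asia/Tokyo and Asia/Seoul are my own choice of example; neither name appears in the CSV.

```python
import pandas as pd

wall_time = "2018-01-04 09:01:00"

# Two different named timezones applied to the same wall-clock time.
tokyo = pd.Timestamp(wall_time, tz="Asia/Tokyo")
seoul = pd.Timestamp(wall_time, tz="Asia/Seoul")

# Both have the same UTC offset and denote the same instant,
# so the string "+09:00" cannot tell them apart.
same_offset = tokyo.utcoffset() == seoul.utcoffset()
same_instant = tokyo == seoul

print(same_offset, same_instant)  # True True
```

A parser that sees only +09:00 therefore cannot pick a named zone; it can at best keep a fixed offset or normalize to UTC.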

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.3.1
pip: 9.0.3
setuptools: 38.5.1
Cython: None
numpy: 1.15.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None