BUG or DOC: pd.read_csv with parse_dates does not recognize timezone · Issue #22256 · pandas-dev/pandas (original) (raw)
When parsing a timezone-aware datetime in a csv file with pd.read_csv
+ parse_dates
, it returns naive timestampes converted to UTC, and it was a surprise for me.
Example
Consider we are reading the following data. Let's say its name is pandas_read_csv_bug.csv
.
It is a simple timeseries data with timezone (UTC+09:00) specified.
dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
2018-01-04 09:03:00+09:00,23400
2018-01-04 09:04:00+09:00,23400
2018-01-04 09:05:00+09:00,23400
I want to read it with pd.read_csv
using parse_dates
keyword argument activated.
If working properly, this seems to be the most elegant solution.
import pandas as pd df = pd.read_csv('pandas_read_csv_bug.csv', parse_dates=['dt'])
However, the result is a data frame df
with strange timestamps.
| | dt | val | | | ----- | ------------------- | ----- | | 0 | 2018-01-04 00:01:00 | 23350 | | 1 | 2018-01-04 00:02:00 | 23400 | | 2 | 2018-01-04 00:03:00 | 23400 | | 3 | 2018-01-04 00:04:00 | 23400 | | 4 | 2018-01-04 00:05:00 | 23400 |
Problem description
My surprise was,
- The parsed datetimes are timezone-naive.
df['dt'].iloc[0].tz is None == True
- The timestampe is automatically converted to UTC.
My first impression was that it shouldn't be the best possible behavior.
However, as an UTC offset does not uniquely corresponds to a single timezone, this could be the safest/most reasonable behavior.
In that case, the documentation should mention this behavior.
Output of pd.show_versions()
INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.15.0-29-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.3.1
pip: 9.0.3
setuptools: 38.5.1
Cython: None
numpy: 1.15.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None