to_datetime() throws ValueError: Cannot pass a tz argument when parsing strings with timezone information.

I think pandas should support passing %z in the format together with utc=True. In my opinion these are two separate concerns: the format tells pandas how to parse the datetime string, while utc=True just asks for the dates to be returned in UTC, no matter which timezone they were in originally.
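To illustrate the two-step behaviour I have in mind, here is a minimal sketch using only the standard library (Python 3.7+, where %z accepts offsets written with a colon). The helper name parse_to_utc is hypothetical, and this only shows the intended semantics, not how pandas works internally.

```python
from datetime import datetime, timezone

DATETIME_FORMAT = "%m/%d/%Y %H:%M:%S.%f%z"


def parse_to_utc(text, fmt=DATETIME_FORMAT):
    """Hypothetical helper: parse with an explicit format, then convert to UTC."""
    # Step 1: the format (including %z) says how to read the string.
    aware = datetime.strptime(text, fmt)
    # Step 2: independently of the format, normalise the result to UTC.
    return aware.astimezone(timezone.utc)


print(parse_to_utc("10/11/2018 00:00:00.045-07:00"))  # 2018-10-11 07:00:00.045000+00:00
print(parse_to_utc("10/11/2018 01:00:00.045-08:00"))  # 2018-10-11 09:00:00.045000+00:00
```

Strings with different offsets end up on the same UTC timeline, which is what I would expect format plus utc=True to do for a whole column.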

Here's a repl that shows the issue: https://repl.it/@eparizzi/Pandas-todatetime-in-UTC-with-format

If you replace that simple CSV with a big 50K-row time-series CSV, the call to to_datetime without the format takes more than 20 seconds. By contrast, passing the format without utc=True takes less than 2 seconds. Unfortunately, that doesn't work properly when the column contains multiple timezones: pandas simply can't set a proper dtype in that case.

So, why can't we have a way to specify the format including timezone but also specify that we want everything in datetime64(UTC)?
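In the meantime, one possible workaround is to parse the offset separately. This is only a sketch under a strong assumption: every string ends with a fixed-width "+HH:MM" or "-HH:MM" offset. The variable names are illustrative, and I have not benchmarked it, so whether it keeps the speed benefit of an explicit format is untested.

```python
import pandas as pd

data = ['10/11/2018 00:00:00.045-07:00', '10/11/2018 01:00:00.045-07:00',
        '10/11/2018 01:00:00.045-08:00', '10/11/2018 02:00:00.045-08:00',
        '10/11/2018 04:00:00.045-07:00', '10/11/2018 05:00:00.045-07:00']
stamps = pd.Series(data, name="Timestamp")

# Assumption: the last 6 characters are always the offset, e.g. "-07:00".
# Parse the timezone-free part with an explicit format (no %z, no utc=True).
naive = pd.to_datetime(stamps.str[:-6], format="%m/%d/%Y %H:%M:%S.%f")

# Turn the offset text into a signed timedelta, e.g. "-07:00" -> -7 hours.
sign = stamps.str[-6].map({"+": 1, "-": -1})
offset = pd.to_timedelta(stamps.str[-5:] + ":00") * sign

# Local time = UTC + offset, so UTC = local time - offset.
utc_stamps = (naive - offset).dt.tz_localize("UTC")
print(utc_stamps)
print(utc_stamps.dtype)
```

This yields a single timezone-aware UTC column even though the input mixes -07:00 and -08:00, but it is clearly more fragile than having to_datetime accept both arguments.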

I've already read through issue #25571, but I still think this deserves a discussion.

import pandas as pd

# I know the format, I want to use it so that Pandas to_datetime() runs faster.
DATETIME_FORMAT = '%m/%d/%Y %H:%M:%S.%f%z'

try:
    data = ['10/11/2018 00:00:00.045-07:00',
            '10/11/2018 01:00:00.045-07:00',
            '10/11/2018 01:00:00.045-08:00',
            '10/11/2018 02:00:00.045-08:00',
            '10/11/2018 04:00:00.045-07:00',
            '10/11/2018 05:00:00.045-07:00']

    df = pd.DataFrame(data, columns=["Timestamp"])

    # This raises "ValueError: Cannot pass a tz argument when parsing strings with timezone information."
    df.Timestamp = pd.to_datetime(df.Timestamp, format=DATETIME_FORMAT, utc=True)

except ValueError as valueError:
    # I don't know why a %z in the format is not compatible with utc=True.
    # The %z is telling pandas that it needs to deal with timezones.
    # Then, utc=True should just convert all to UTC.
    # It shouldn't be more complicated than that I think.
    print(f"ERROR: {str(valueError)}")
    print("...why not?")

    # This works, but it's A LOT slower when parsing a lot of rows.
    df.Timestamp = pd.to_datetime(df.Timestamp, infer_datetime_format=True, utc=True)

finally:
    print(df)
    print(df.Timestamp.dtype)

Expected output:

                         Timestamp
0 2018-10-11 07:00:00.045000+00:00
1 2018-10-11 08:00:00.045000+00:00
2 2018-10-11 09:00:00.045000+00:00
3 2018-10-11 10:00:00.045000+00:00
4 2018-10-11 11:00:00.045000+00:00
5 2018-10-11 12:00:00.045000+00:00