BUG: Fix to_datetime to properly deal with tz offsets #3944 by danbirken · Pull Request #5958 · pandas-dev/pandas
Currently for certain formats of datetime strings, the tz offset will
just be ignored.
#3944
I'm going to stop pretending I haven't seen this PR for just long enough to give you a glimpse
into the psyche of a maintainer.
Convince me that this can't possibly break anything for anyone and that the change is what
the code should have been doing all along.
Some of those variations seem like they stretch ISO 8601 to the breaking point; I may be wrong.
Perhaps I should have written the test in a more readable way to highlight the bug. I wrote it to capture all the test cases in a very concise form for the future, and to exercise the various ways the datetime strings are parsed.
The current situation is just wrong:
In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1357027200000000000
In [4]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[4]: 1356998400000000000
The timestamp strings 01-01-2013 08:00:00+08:00 and 2013-01-01T08:00:00+0800 represent the same thing. However, depending on which internal parsing method is triggered, the timezone offset is either used or ignored. It should be consistent, and the most logical behavior is to use the timezone offset when one is given. In fact, every other method that parses strings into datetime64s does use the offset, so this just fixes the one case where it didn't (which I assume was an oversight). This is the same code after the fix:
In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1356998400000000000
In [2]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[2]: 1356998400000000000
This is certainly what the code should be doing.
This could break somebody's code that depends on this bug, but on the plus side:
a) All of the existing tests pass
b) If it did break somebody's code it would be really easy to fix (either strip the timezones out of your datetime strings or just accept the fact that the code is now properly converting them to UTC)
c) As this code gets more complicated (like with the potential changes to speed it up), keeping this inconsistency is just going to make everything more confusing because the bug will be triggered in less predictable ways
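If anyone does hit option (b), stripping the offsets is a one-liner; here is a minimal sketch (the helper name and regex are mine, purely illustrative, not part of pandas):

```python
import re

# Strip a trailing UTC offset like "+08:00", "+0800" or "-01:00" from a
# datetime string, restoring the old ignore-the-offset behavior for
# callers who depended on it.  Illustrative only, not pandas API.
_OFFSET_RE = re.compile(r'\s*[+-]\d{2}:?\d{2}$')

def strip_tz_offset(s):
    return _OFFSET_RE.sub('', s)

print(strip_tz_offset('01-01-2013 08:00:00+08:00'))  # 01-01-2013 08:00:00
print(strip_tz_offset('2013-01-01T08:00:00+0800'))   # 2013-01-01T08:00:00
```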
The timestring is not a valid ISO 8601 string, so the behavior might be said to be undefined. But I see that dateutil.parse does produce the same result for both, and that's fairly convincing that this is a bug.
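For what it's worth, the stdlib parser agrees with dateutil here; a quick check (assuming Python >= 3.7, where strptime's %z accepts a colon in the offset):

```python
from datetime import datetime

# The two spellings of the same instant, parsed with explicit formats.
a = datetime.strptime('01-01-2013 08:00:00+08:00', '%m-%d-%Y %H:%M:%S%z')
b = datetime.strptime('2013-01-01T08:00:00+0800', '%Y-%m-%dT%H:%M:%S%z')

# Aware datetimes compare by instant, so these are equal, and both map
# to 2013-01-01 00:00:00 UTC.
print(a == b)              # True
print(int(a.timestamp()))  # 1356998400
```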
Can you show an example where another code path in pandas parses this differently from to_datetime()?
I'm half convinced this can go into 0.14.0, @jreback?
Can you do a test_perf.sh run to check for possible perf impact?
I agree that this is a bug (the very fact that we have different results for this one case). For the most part this path is not hit very often AFAICT (e.g. you need to have no specified format and it's not parsed by the ISO8601 c-parser), so other than in our tests I don't see how this is even hit.
@danbirken, can you contrive a case where it's actually hit, either in read_csv or in to_datetime? (and not by directly calling array_to_datetime)
I guess this proves the point (but this is way esoteric; who specifies dtypes when constructing a Series!)
In [11]: Series(['2013-01-01T08:00:00.000000000+0800','12-31-2012 23:00:00-01:00','01-01-2013 08:00:00+08:00','01-01-2013 08:00:00'],dtype='M8[ns]')
Out[11]:
0 2013-01-01 00:00:00
1 2012-12-31 23:00:00
2 2013-01-01 08:00:00
3 2013-01-01 08:00:00
dtype: datetime64[ns]
So the original reason I saw this bug was that the code path in Timestamp() uses parse_date properly:
In [13]: pd.Timestamp('01-01-2013 08:00:00+08:00').value
Out[13]: 1356998400000000000
In [14]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[14]: 1357027200000000000
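As a sanity check on which .value is the right one: converting 2013-01-01 08:00:00 at UTC+8 to epoch nanoseconds with the stdlib (illustrative arithmetic only, not pandas code) gives the Timestamp() result, not the old to_datetime() one:

```python
from datetime import datetime, timedelta, timezone

# 2013-01-01 08:00:00 at UTC+8 is 2013-01-01 00:00:00 UTC.
dt = datetime(2013, 1, 1, 8, 0, 0, tzinfo=timezone(timedelta(hours=8)))
ns = int(dt.timestamp()) * 10**9  # epoch nanoseconds, like Timestamp.value

print(ns)  # 1356998400000000000, matching pd.Timestamp(...).value above
```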
Here is a case with read_csv that is wrong:
timestamp,data
01-01-2013 08:00:00+08:00,1
01-01-2013 09:00:00+08:00,2
01-01-2013 10:00:00+08:00,3
Old (timezone offset is ignored):
In [12]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[12]:
data
timestamp
2013-01-01 08:00:00 1
2013-01-01 09:00:00 2
2013-01-01 10:00:00 3
Fixed (timezone offset used):
In [2]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[2]:
data
timestamp
2013-01-01 00:00:00 1
2013-01-01 01:00:00 2
2013-01-01 02:00:00 3
I'd actually imagine this gets triggered quite a bit, since people use pandas to import log data a lot (which will have timestamps in weird formats, probably including a tz offset).
Running the ./test_perf.sh stuff now...
I'm sold on the first example. The rest are just cases where a strange timestamp doesn't get picked up correctly; that's not a guarantee we're bound by (although we'd like to match the dateutil parser in general).
start of 0.14, if no objections @jreback ?
Technically this is a bug fix, but I agree that people may actually be relying on this.
0.14 it is then
Well, I can't get ./test_perf.sh to cooperate, but I ran some ad-hoc tests and performance appears to be within 1% of what it was before. This changes from one fully cython-ized function to another, so it makes sense that it would still be quite fast.
I admit I don't fully understand what is going on in that particular issue (#3844), so I really have no idea if this is related.
FYI: I just made a small update because I saw that convert_to_tsobject already calls check_dts_bounds, so I got rid of the double call.
Oh I see, you probably mean #3944.
I think that issue should be closed as "will not fix", and this is really a different bug (though it is related). I can make another GH issue for this change if you would like for tracking purposes.
@danbirken, please open an issue on test_perf if there is one and cc me.
This pull request was closed.