BUG: Fix to_datetime to properly deal with tz offsets #3944 by danbirken · Pull Request #5958 · pandas-dev/pandas


Conversation


danbirken

Currently, for certain formats of datetime strings, the tz offset is just ignored.
#3944

@ghost

I'm going to stop pretending I haven't seen this PR for just long enough to give you a glimpse
into the psyche of a maintainer.

Convince me that this can't possibly break anything for anyone and that the change is what
the code should have been doing all along.

Some of those variations seem to stretch ISO 8601 to the breaking point, but I may be wrong.

@danbirken

Perhaps I should have written the test in a more readable way to highlight the bug. I wrote it for the future, to cover all the test cases in a very concise form and to exercise the various ways the datetime strings are parsed.

The current situation is just wrong:

In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1357027200000000000

In [4]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[4]: 1356998400000000000

These timestamp strings, 01-01-2013 08:00:00+08:00 and 2013-01-01T08:00:00+0800, represent the same thing. However, depending on which internal parsing method is triggered, the timezone offset is either used or ignored. It should be consistent, and the most sensible behavior is to use the timezone offset whenever one is given. In fact, every other method that parses strings into datetime64s does use the offset, so this just fixes it for the one case where it didn't (which I assume was an oversight). This is the same code after the fix:

In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1356998400000000000

In [2]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[2]: 1356998400000000000

This is certainly what the code should be doing.

This could break somebody's code that depends on this bug, but on the plus side:

a) All of the existing tests pass
b) If it did break somebody's code, it would be really easy to fix: either strip the timezone offsets out of your datetime strings (a sketch follows this list) or just accept the fact that the code now properly converts them to UTC
c) As this code gets more complicated (like with the potential changes to speed it up), keeping this inconsistency is just going to make everything more confusing because the bug will be triggered in less predictable ways
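For anyone relying on the old behavior, here is a minimal sketch of option (a), stripping the offsets before parsing. The helper and regex are illustrative, not part of pandas:

import re

import pandas as pd

# Hypothetical helper: drop a trailing UTC offset such as "+08:00" or
# "+0800" so the string parses as a naive timestamp, matching the old
# (buggy) behavior.
OFFSET_RE = re.compile(r'[+-]\d{2}:?\d{2}$')

def strip_tz_offset(s):
    return OFFSET_RE.sub('', s).rstrip()

print(pd.to_datetime(strip_tz_offset('01-01-2013 08:00:00+08:00')))
# 2013-01-01 08:00:00  (naive; the offset is discarded)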

@ghost

The timestring is not a valid ISO 8601 string, so the behavior might be said to be undefined.
But I see that dateutil.parser.parse does produce the same result for both, and that's fairly
convincing evidence that this is a bug.
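For reference, a quick check with dateutil reproduces the comparison described (this snippet is not in the original comment; dateutil is the parser pandas falls back to for non-ISO strings):

from dateutil import parser

d1 = parser.parse('01-01-2013 08:00:00+08:00')
d2 = parser.parse('2013-01-01T08:00:00+0800')

print(d1)        # 2013-01-01 08:00:00+08:00
print(d2)        # 2013-01-01 08:00:00+08:00
print(d1 == d2)  # True: dateutil applies the offset in both cases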

Can you show an example where another code path in pandas parses this differently from
to_datetime()?

I'm half convinced this can go into 0.14.0, @jreback?

@ghost

Can you do a test_perf.sh run to check for possible perf impact?

@jreback

I agree that this is a bug (the very fact that we get different results for this one case). For the most part this path is not hit very often AFAICT (e.g. you need to have no specified format and the string is not parsed by the ISO8601 c-parser), so other than our tests I don't see how this is even hit.

@danbirken can you contrive a case where it's actually hit, either in read_csv or in to_datetime?

(and not by directly calling array_to_datetime)

@jreback

I guess this proves the point (but this is way esoteric; who specifies dtypes when constructing a Series!)

In [11]: Series(['2013-01-01T08:00:00.000000000+0800','12-31-2012 23:00:00-01:00','01-01-2013 08:00:00+08:00','01-01-2013 08:00:00'],dtype='M8[ns]')
Out[11]: 
0   2013-01-01 00:00:00
1   2012-12-31 23:00:00
2   2013-01-01 08:00:00
3   2013-01-01 08:00:00
dtype: datetime64[ns]

@danbirken

So the original reason I saw this bug is that the code path in Timestamp() uses parse_date properly, while to_datetime() does not:

In [13]: pd.Timestamp('01-01-2013 08:00:00+08:00').value
Out[13]: 1356998400000000000

In [14]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[14]: 1357027200000000000

Here is a case with read_csv that is wrong:

timestamp,data
01-01-2013 08:00:00+08:00,1
01-01-2013 09:00:00+08:00,2
01-01-2013 10:00:00+08:00,3

Old (timezone offset is ignored):

In [12]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[12]:
                     data
timestamp
2013-01-01 08:00:00     1
2013-01-01 09:00:00     2
2013-01-01 10:00:00     3

Fixed (timezone offset used):

In [2]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[2]:
                     data
timestamp
2013-01-01 00:00:00     1
2013-01-01 01:00:00     2
2013-01-01 02:00:00     3
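For reference, a self-contained way to reproduce the above without a file on disk (StringIO stands in for /tmp/a.csv; this snippet is illustrative, not from the original comment):

import pandas as pd
from io import StringIO

# Same rows as the csv file above.
csv_data = (
    "timestamp,data\n"
    "01-01-2013 08:00:00+08:00,1\n"
    "01-01-2013 09:00:00+08:00,2\n"
    "01-01-2013 10:00:00+08:00,3\n"
)

df = pd.read_csv(StringIO(csv_data), index_col=0, parse_dates=True)
print(df)  # with the fix, the index is converted to UTC (00:00, 01:00, 02:00)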

I'd actually imagine this gets triggered quite a bit, since people use pandas to import log data a lot (which will have timestamps in weird formats, probably including a tz offset).

Running the ./test_perf.sh stuff now...

@ghost

I'm sold on the first example. The rest are just cases where a strange timestamp doesn't
get picked up correctly; that's not a guarantee we're bound by (although we'd like to match
the dateutil parser in general).

Start of 0.14, if no objections, @jreback?

@jreback

Technically this is a bug fix, but I agree that people may actually be relying on this behavior.
0.14 it is, then.


@danbirken

Well, I can't get ./test_perf.sh to cooperate, but I ran some ad-hoc tests and performance appears to be within 1% of what it was before. This change swaps one fully cython-ed function for another, so it makes sense that it would still be quite fast.
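A rough illustration of the kind of ad-hoc timing check described (the strings and sizes here are made up for illustration):

import timeit

import pandas as pd

# A batch of non-ISO strings with offsets, forcing the slower parsing path.
strings = ['01-01-2013 08:00:00+08:00'] * 10000

elapsed = timeit.timeit(lambda: pd.to_datetime(strings), number=10)
print('%.2fs for 10 runs of to_datetime on %d strings' % (elapsed, len(strings)))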

I admit I don't fully understand what is going on in that particular issue (#3844), so I really have no idea if this is related.

FYI: I just made a small update because I saw that convert_to_tsobject already calls check_dts_bounds, so I got rid of the duplicate call.

@danbirken

Oh I see, you probably mean #3944.

I think that issue should be closed as "will not fix", and this is really a different bug (though it is related). I can make another GH issue for this change if you would like for tracking purposes.


@ghost

@danbirken, please open an issue on test_perf if there is one and cc me.

@ghost mentioned this pull request on Jan 20, 2014.

@ghost pushed a commit that referenced this pull request on Feb 4, 2014:

BUG: Fix to_datetime to properly deal with tz offsets #3944

This pull request was closed.
