Infer datetime format by danbirken · Pull Request #6021 · pandas-dev/pandas
- If `_timelex` or `_timelex.split` doesn't exist, pandas will work fine except for this feature, which will simply do nothing.
- I added `%Y%m%d` support to `_guess_datetime_format`. Unlike the ISO 8601 fast-path, this particular fast-path is actually hard to opt into (I don't think you can do it from `read_csv`), but now `infer_datetime_format` will do it for you if enabled. This is roughly a 20-30x speedup for those cases.
- As for overloading `parse_dates`, I think adding a new field is better for 2 reasons.
  - `parse_dates` already supports a wide variety of input formats, which makes squeezing in something else more complicated.
  - Since this is theoretically going from not-enabled to enabled by default, having it be a separate field is really nice because we can flip the one boolean value and that is that. If it were in `parse_dates`, then to flip it we would probably have to add in something like `parse_dates='not_infer'` (and keep supporting `parse_dates='infer'`) for the people who explicitly want to opt out for whatever reason, which would be really confusing.
- As for the sentinel values, I thought a lot about this. We can use `dateutil.parser.DEFAULTPARSER._parse`, which gives the raw values without any defaults (I tried a bunch of ways to trick `dateutil.parser.parse` into doing it for me, and couldn't find a way). However, it is another private method, and using it requires essentially repeating existing/tested/good code in `dateutil.parser.parse` to get the result into a proper datetime. So I thought about doing this, but as it turns out, the fact that `datetime.datetime` puts 0 as default values is fine. 0 is a nice sentinel value, because 0 is invalid for both month and day, the only two fields it could possibly be confused with (as of right now, this function only supports 4-digit years). So the only potential failure mode is that a datetime like `"2011/01/01 00:00"` will default to `"%Y/%m/%d %H:%M"`. This is ambiguous, as the time-string could be referring to `%H:%M` or `%M:%S` (but it seems incredibly likely it is `%H:%M`, which is the default guess). It is impossible for something like `"2011/01/01"` to be mis-guessed as `"%Y/%M/%S"`, because 0 is an invalid value for either month or day, so they will be immediately ignored.
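As a quick sanity check of the "0 is invalid for month and day" claim, here is a minimal standard-library sketch. The helper name `is_valid` is hypothetical and not part of this PR; it only shows which fields `datetime.datetime` accepts a 0 for.

```python
from datetime import datetime


def is_valid(**fields):
    """Return True if datetime.datetime accepts the given field values.

    Hypothetical helper for illustration only; unspecified fields fall
    back to 2011-01-01 with the usual 0 defaults for the time part.
    """
    base = dict(year=2011, month=1, day=1)
    base.update(fields)
    try:
        datetime(**base)
        return True
    except ValueError:
        return False


# 0 can never be a real month or day, so it is safe as a sentinel there.
print(is_valid(month=0))
print(is_valid(day=0))
# 0 is a legitimate hour, minute, and second, which is where the
# "%H:%M" versus "%M:%S" ambiguity comes from.
print(is_valid(hour=0, minute=0, second=0))
```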
However, the situation isn't perfect. It will still mess up cases a human wouldn't:
In [4]: tools._guess_datetime_format('01:01 2011/01/01')
Out[4]: '%m:%d %Y/%H/%M' # wrong!
In [6]: tools._guess_datetime_format('00:00 2011/01/01')
Out[6]: '%H:%M %Y/%m/%d' # right!
But sentinel values don't actually improve this case; this is just a limitation of the current guessing method. It is a pretty rare edge case, though, as pretty much every standard datetime format puts the Y-m-d information first, which is what the guesser expects.
So in conclusion, I think the sentinel values of 0 are actually perfectly good and I can't think of any case where they cause the guesser to do the wrong thing.
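To make the reasoning above concrete, here is a toy sketch of the matching idea. It is not the code in this PR: the function name `toy_guess_format`, the regex tokenizer, and the fixed field order are all assumptions for illustration, and the parsed datetime is passed in by hand instead of coming from `dateutil`. Each numeric token is matched against the first still-unused field whose formatted value equals it, trying year, month, day first.

```python
import re
from datetime import datetime

# Candidate fields in the order they are tried (an assumed order that
# mirrors "the guesser expects Y-m-d information first").
FIELDS = [("year", "%Y"), ("month", "%m"), ("day", "%d"),
          ("hour", "%H"), ("minute", "%M"), ("second", "%S")]


def toy_guess_format(text, parsed):
    """Guess a strftime format for `text`, given its parsed datetime."""
    tokens = re.findall(r"\d+|\D+", text)  # digit runs and separators
    used = set()
    pieces = []
    for tok in tokens:
        if not tok.isdigit():
            pieces.append(tok)  # separators are copied through verbatim
            continue
        for attr, code in FIELDS:
            if attr not in used and parsed.strftime(code) == tok:
                used.add(attr)
                pieces.append(code)
                break
        else:
            return None  # a numeric token matched no remaining field
    return "".join(pieces)


midnight = datetime(2011, 1, 1)         # time fields default to 0
one_oh_one = datetime(2011, 1, 1, 1, 1)

print(toy_guess_format("2011/01/01", midnight))
print(toy_guess_format("2011/01/01 00:00", midnight))
print(toy_guess_format("00:00 2011/01/01", midnight))
print(toy_guess_format("01:01 2011/01/01", one_oh_one))
```

In this sketch the defaulted 0 fields never match an `01` token, so `"2011/01/01"` cannot come out as `%Y/%M/%S`. The last call reproduces the `'%m:%d %Y/%H/%M'` mis-guess, which comes from the matching order and not from the sentinel values.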
New questions:
Assuming everybody is content with adding the `infer_datetime_format` keyword to `read_csv`, should I also add this to `Series.from_csv` and `DataFrame.from_csv`?