Infer datetime format by danbirken · Pull Request #6021 · pandas-dev/pandas

If _timelex or _timelex.split doesn't exist, pandas will work fine except for this feature, which will simply do nothing.
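As a rough illustration of that kind of guard (a sketch, not the code in this PR): the helper names get_timelex_split and tokenize_or_none are hypothetical, and the only dateutil names assumed are the private _timelex and its split method mentioned above.

```python
def get_timelex_split():
    """Return dateutil's private tokenizer function, or None if unavailable."""
    try:
        import dateutil.parser as du_parser
    except ImportError:
        return None
    # _timelex is private, so it may be missing in some dateutil versions.
    timelex = getattr(du_parser, "_timelex", None)
    split = getattr(timelex, "split", None)
    return split if callable(split) else None


def tokenize_or_none(dt_str):
    """Tokenize dt_str if the private tokenizer exists; otherwise do nothing."""
    split = get_timelex_split()
    if split is None:
        return None  # feature quietly disabled, everything else keeps working
    try:
        return split(dt_str)
    except Exception:
        # Private behavior can change between versions; treat failure as "no guess".
        return None


tokens = tokenize_or_none("2011/01/01")
```

Callers would treat a None result as "no format guessed" and fall back to the regular parsing path.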
I added %Y%m%d support to _guess_datetime_format. Unlike the ISO 8601 fast-path, this particular fast-path is actually hard to opt into (I don't think you can do it from read_csv), but now infer_datetime_format will do it for you if enabled. This is roughly a 20-30x speedup for those cases.
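To illustrate why a known %Y%m%d format can be handled cheaply, here is a minimal stdlib-only sketch. It is not the pandas fast-path itself; parse_yyyymmdd is a hypothetical helper that slices the fixed-width string directly and gives the same result as the general strptime call.

```python
from datetime import datetime


def parse_yyyymmdd(s):
    """Fixed-width parse of an 8-digit YYYYMMDD string (hypothetical helper)."""
    if len(s) != 8 or not (s.isascii() and s.isdigit()):
        raise ValueError("expected 8 ASCII digits, got %r" % (s,))
    # Every field sits at a known offset, so no tokenizing or guessing is needed.
    return datetime(int(s[:4]), int(s[4:6]), int(s[6:8]))


general = datetime.strptime("20110101", "%Y%m%d")  # general-purpose path
fixed = parse_yyyymmdd("20110101")                 # fixed-width path
```

Both paths produce the same datetime; the second one only works because the format is known in advance, which is the information the guesser supplies.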
As for overloading parse_dates, I think adding a new field is better for two reasons:

1. parse_dates already supports a wide variety of input formats, which makes squeezing in something else more complicated.
2. Since this is theoretically going from not-enabled to enabled by default, having it be a separate field is really nice because we can flip the one boolean value and be done. If it were in parse_dates, then to flip it we would probably have to add in something like parse_dates='not_infer' (and keep supporting parse_dates='infer') for the people who explicitly want to opt out for whatever reason, which would be really confusing.

As for the sentinel values, I thought a lot about this. We can use dateutil.parser.DEFAULTPARSER._parse, which gives the raw values without any defaults (I tried a bunch of ways to trick dateutil.parser.parse into doing it for me, and couldn't find a way). However, it is using another private method and requires essentially repeating existing/tested/good code in dateutil.parser.parse to get it into a proper datetime.

So I thought about doing this, but as it turns out, the fact that datetime.datetime puts 0 as default values is fine. 0 is a nice sentinel value, because 0 is invalid for both month and day, the only two fields it could possibly be confused with (as of right now, this function only supports 4-digit years). So the only potential failure mode is that a string like "2011/01/01 00:00" will be guessed as "%Y/%m/%d %H:%M". This is ambiguous, as the time part could be referring to %H:%M or %M:%S (but it seems incredibly likely it is %H:%M, which is the default guess). It is impossible for something like "2011/01/01" to be mis-guessed as "%Y/%M/%S", because 0 is an invalid value for either month or day, so those candidates are ruled out immediately.
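The properties this argument relies on can be checked with the standard library alone. This is only a small demonstration of the reasoning, not part of the PR; is_valid_date is a hypothetical helper.

```python
from datetime import datetime

# Time fields that are not supplied default to 0.
d = datetime(2011, 1, 1)
defaults = (d.hour, d.minute, d.second)


def is_valid_date(year, month, day):
    """Return True if datetime accepts the given year/month/day."""
    try:
        datetime(year, month, day)
    except ValueError:
        return False
    return True


# 0 is never a valid month or day, so a defaulted 0 cannot be mistaken for one.
month_zero_ok = is_valid_date(2011, 0, 1)
day_zero_ok = is_valid_date(2011, 1, 0)

# The one ambiguous case: "00:00" parses under either reading of the time part,
# and both readings give the same datetime.
as_hm = datetime.strptime("2011/01/01 00:00", "%Y/%m/%d %H:%M")
as_ms = datetime.strptime("2011/01/01 00:00", "%Y/%m/%d %M:%S")
```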

However, the situation isn't perfect. It will still mess up cases a human wouldn't:

In [4]: tools._guess_datetime_format('01:01 2011/01/01')
Out[4]: '%m:%d %Y/%H/%M'  # wrong!

In [6]: tools._guess_datetime_format('00:00 2011/01/01')
Out[6]: '%H:%M %Y/%m/%d'  # right!

But sentinel values don't actually improve this case; this is just a limitation of the current guessing method. However, this is a pretty rare edge case, as pretty much every standard datetime format puts the Y-m-d information first, which is what the guesser expects.
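The order dependence can be reproduced with a simplified, stdlib-only stand-in for the guesser. This is a sketch under assumptions, not the actual _guess_datetime_format: sketch_guess_format is a hypothetical name, tokens are split with a regular expression instead of dateutil's tokenizer, and the already-parsed datetime is passed in by hand rather than obtained from dateutil.

```python
import re
from datetime import datetime

# Candidate directives, tried in this fixed order for every numeric token.
DIRECTIVES = ["%Y", "%m", "%d", "%H", "%M", "%S"]


def sketch_guess_format(dt_str, parsed):
    """Greedily map each numeric token to the first unused matching directive."""
    used = set()
    parts = []
    for token in re.split(r"(\d+)", dt_str):
        if not token.isdigit():
            parts.append(token)  # separator (or empty string): copy through
            continue
        for directive in DIRECTIVES:
            if directive not in used and parsed.strftime(directive) == token:
                used.add(directive)
                parts.append(directive)
                break
        else:
            return None  # a token matched no remaining directive
    return "".join(parts)


# Time first, with 01:01: the leading "01" tokens are claimed by month and day.
wrong = sketch_guess_format("01:01 2011/01/01", datetime(2011, 1, 1, 1, 1))
# -> '%m:%d %Y/%H/%M'

# Time first, with 00:00: "00" cannot be a month or day, so it falls through
# to hour and minute.
right = sketch_guess_format("00:00 2011/01/01", datetime(2011, 1, 1, 0, 0))
# -> '%H:%M %Y/%m/%d'
```

In this sketch the wrong guess comes purely from trying the date directives first, which is why a different sentinel would not change the outcome.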

So in conclusion, I think the sentinel values of 0 are actually perfectly good and I can't think of any case where they cause the guesser to do the wrong thing.

New questions:

Assuming everybody is content with adding the infer_datetime_format keyword to read_csv, should I also add this to Series.from_csv and DataFrame.from_csv?