PDEP-4: consistent parsing of datetimes by MarcoGorelli · Pull Request #48621 · pandas-dev/pandas

Thanks both for taking a look!

@Dr-Irv this proposal wouldn't change how dayfirst and yearfirst operate. The format would still be guessed in accordance with these parameters, just as it is on main. The difference is that, with this proposal, the format guessed from the first non-NaN row would be used to parse the rest of the Series.

@attack68 in the rare case that it's not possible to guess the format from the first element, a UserWarning would be emitted; see lines 49-55 of this PR.

You're very right to bring up mm/dd/yy 👍 - indeed the vast majority of the world doesn't use that format. That's why the current behaviour is so dangerous. For example, suppose your data is in %d-%m-%Y %H:%M format.

On main, the first row's date would be parsed as mm-dd-yyyy, whilst the second would be parsed as dd-mm-yyyy. There's no error and no warning, so this is very easy to miss (and I almost did once in a prod setting 😳 ):

In [1]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'])
Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

With this PDEP, you could just check the format of your first row, and you'd know the rest of the Series was parsed in accordance with it. If it can't be, then with errors='raise' (the default), you'd get an error:

ValueError: time data '13-01-2000 00:00' does not match format '%m-%d-%Y %H:%M' (match)

and you'd see that the guessed format wasn't right. You could get around that either by explicitly passing format, or with dayfirst=True:

In [2]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'], dayfirst=True)
Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)
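
The other route, explicitly passing format, could look something like the sketch below. It reuses the two strings from above; the variable names are just for illustration, and the exact error text may differ between pandas versions.

```python
import pandas as pd

data = ['12-01-2000 00:00', '13-01-2000 00:00']

# Day-first format stated explicitly, so nothing has to be guessed.
parsed = pd.to_datetime(data, format='%d-%m-%Y %H:%M')

# A month-first format cannot match the second row (there is no month 13),
# so with errors='raise' (the default) parsing fails loudly.
try:
    pd.to_datetime(data, format='%m-%d-%Y %H:%M')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

Here parsed holds 12 and 13 January 2000, and mismatch_raised ends up True, since the whole input is held to one format.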

Totally agree on better documenting this, and that inference could be improved by using multiple samples to guess the format. First, though, I just wanted to get agreement that we want to_datetime to parse consistently.
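
To make the multiple-samples idea a bit more concrete, here is a rough sketch using only the standard library. It is not how pandas guesses formats and isn't part of this PR: guess_format_from_samples and the candidate list are hypothetical, purely to show why looking at more than one row helps.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order (month-first, then day-first).
CANDIDATES = ['%m-%d-%Y %H:%M', '%d-%m-%Y %H:%M']


def guess_format_from_samples(samples, candidates=CANDIDATES):
    """Return the first candidate format that parses every sample, else None."""
    for fmt in candidates:
        try:
            for value in samples:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one sample doesn't fit this format
        return fmt
    return None


data = ['12-01-2000 00:00', '13-01-2000 00:00']

# One sample is ambiguous, so the month-first candidate wins.
from_first_row = guess_format_from_samples(data[:1])

# Two samples rule out month-first, because 13 is not a valid month.
from_both_rows = guess_format_from_samples(data)
```

With only the first row, the guess is '%m-%d-%Y %H:%M'; with both rows, it becomes '%d-%m-%Y %H:%M'.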