to_datetime %Y%m%d does not coerce correctly · Issue #7930 · pandas-dev/pandas
import numpy as np
import pandas as pd

# imagine a dataframe with a from/to date range stored as YYYYMMDD integers
x_df = pd.DataFrame([[20120101, 20121231], [20130101, 20131231], [20140101, 20141231], [20150101, 99991231]])
x_df.columns = ['date_from', 'date_to']
date_def = '%Y%m%d'
# so everything is ok & peachy for the from dates
x_df['date_from_2'] = pd.to_datetime(x_df['date_from'], format=date_def, coerce=True)
x_df['date_from_2'].dtype
list(x_df['date_from_2'])
# but with out-of-bounds dates it goes horribly wrong
x_df['date_to_2'] = pd.to_datetime(x_df['date_to'], format=date_def, coerce=True)
x_df['date_to_2'].dtype
list(x_df['date_to_2']) # note the lack of NaT values and the conversion to datetime.datetime instead of np.datetime64
# now we can do
x_df['date_to_3'] = [np.datetime64(date_val, unit='s') for date_val in x_df['date_to_2']] # works great, but unfortunately pandas uses nanoseconds as its standard datetime resolution...
x_df['date_to_3'] = [pd.Timestamp(date_val) for date_val in x_df['date_to_2']] # which breaks on the 99991231 example
This is a bug, but it is also related to the discussion in #7307. It probably has to do with:
"Note: Specifying a format argument will potentially speed up the conversion considerably and on versions later than 0.13.0 explicitly specifying a format string of ‘%Y%m%d’ takes a faster path still."
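As a possible workaround while out-of-bounds values such as 99991231 cannot be held at nanosecond resolution, the sketch below stores the dates as daily `pd.Period` values instead of timestamps. It is not part of the original report: the helper name `to_day_period` is hypothetical, and the approach assumes the column only ever contains valid YYYYMMDD integers.

```python
import pandas as pd

# Same "date_to" values as in the report, including the far-future sentinel.
date_to = pd.Series([20121231, 20131231, 20141231, 99991231])


def to_day_period(yyyymmdd):
    """Split a YYYYMMDD integer into its parts and build a daily Period.

    Hypothetical helper for illustration; invalid dates (e.g. month 13)
    raise an error rather than being coerced.
    """
    year, rest = divmod(int(yyyymmdd), 10000)
    month, day = divmod(rest, 100)
    return pd.Period(year=year, month=month, day=day, freq="D")


# Periods are not limited to the nanosecond Timestamp range,
# so the year-9999 sentinel survives the conversion.
periods = date_to.map(to_day_period)

print(periods.tolist())
print("latest nanosecond Timestamp:", pd.Timestamp.max)
```

The trade-off is that the result holds periods rather than timestamps, so code that expects `datetime64` values would need to convert the in-bounds entries back explicitly.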