PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse · Issue #5490 · pandas-dev/pandas
I had a series containing strings like these:
"November 1, 2013"
Series length was about 500,000
A)
running pd.to_datetime(s) takes just over a minute.
B)
running pd.to_datetime(s, format="%B %d, %Y") takes about 7 seconds!
My suggestion is a way to make case A (where user doesn't specify the format type) take about as long as case B (user does specify).
Basically it looks like the code always uses the dateutil parser for case A.
My suggestion is based upon the idea that it's highly likely that the date strings are all in a consistent format (it's highly unlikely in this case that they would be in 500K separate formats!).
In a nutshell:
- figure out the date format of the first entry.
- try to use that against the entire series, using the speedy code in tslib.array_strptime
- if that works, we've saved heaps of time; if not, fall back to the current slower behaviour of calling dateutil's parse on each entry.
Here's some pseudo-code::
datestr1 = s[0]
# I'm assuming dateutil has something like this, that can tell you what the format is for a given date string.
date_format = figure_out_datetime_format(datestr1)
try:
    # use the super speed code that pandas uses when you tell it what the format is.
    dt_series = tslib.array_strptime(s, format=date_format, ...)
except ValueError:
    # date strings aren't consistent after all. Let's do it the old slow way.
    dt_series = tslib.array_to_datetime(s, format=None)
return dt_series
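To make the idea concrete, here is a rough runnable sketch of the same strategy using only the standard library. It is not how pandas or dateutil work internally. The names `CANDIDATE_FORMATS`, `guess_format`, `parse_slow` and `parse_dates` are made up for illustration. A fixed list of candidate formats stands in for real format inference, and a per-entry retry over that list stands in for the slow dateutil path.

```python
from datetime import datetime

# Hypothetical candidate list. A real implementation would infer the
# format from the string itself rather than try a fixed set.
CANDIDATE_FORMATS = ("%B %d, %Y", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")


def guess_format(sample):
    """Return the first candidate format that parses `sample`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_slow(value):
    """Per-entry fallback: try every candidate format for this one string."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"no known format matches {value!r}")


def parse_dates(values):
    """Guess the format once from the first entry, then reuse it for all."""
    if not values:
        return []
    fmt = guess_format(values[0])
    if fmt is not None:
        try:
            # Fast path: one fixed format for the whole sequence.
            return [datetime.strptime(v, fmt) for v in values]
        except ValueError:
            pass  # the strings are not in a consistent format after all
    # Slow path: work out each entry independently.
    return [parse_slow(v) for v in values]
```

Calling `parse_dates(["November 1, 2013", "December 25, 2013"])` stays on the fast path, while a mixed list such as `["November 1, 2013", "2013-12-25"]` fails the fast path partway through and is redone entry by entry. The worst case therefore costs one wasted partial pass on top of the current behaviour.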