PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse · Issue #5490 · pandas-dev/pandas (original) (raw)

I had a series containing strings like these:

"November 1, 2013"

Series length was about 500,000

A)
running pd.to_datetime(s) takes just over a minute.
B)
running pd.to_datetime(s, format="%B %d, %Y") takes about 7 seconds!

My suggestion is a way to make case A (where user doesn't specify the format type) take about as long as case B (user does specify).

Basically it looks like the code is always using date_util parser for case A.

My suggestion is based upon the idea that it's highly likely that the date strings are all in a consistent format (it's highly unlikely in this case that they would be in 500K separate formats!).

In a nutshell:

figure out the date format of the first entry.
try to use that against the entire series, using the speedy code in tslib.array_strptime
if that works, we've saved heaps of time, if not fall back to the current slower behaviour of using dateutil parse each time.

Here's some pseudo-code::

datestr1 = s[0]
# I'm assuming dateutil has something like this, that can tell you what the format is for a given date string.
date_format = figure_out_datetime_format(datestr1)

try:
    # use the super speed code that pandas uses when you tell it what the format is.
    dt_series = tslib.array_strptime(s, format=datestr1, *, ...)
except:
    # date strings aren't consistent after all. Let's do it the old slow way.
    dt_series = tslib.array_to_datetime(s, format=None)

return dt_series