PERF: add datetime caching kw in to_datetime · Issue #11665 · pandas-dev/pandas

I'll propose a cache_datetime=False keyword as an addition to read_csv and pd.to_datetime.

This would use a lookup cache (a dict will probably work) to map datetime strings to Timestamp objects. For repeated dates this will lead to some dramatic speedups.
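As a rough sketch of the mechanics only (not the actual implementation; the helper name to_datetime_cached and the module-level _cache dict are made up for illustration), the cache could simply be a dict keyed on the raw string (and format) that memoizes the parsed Timestamp:

import pandas as pd

# hypothetical illustration of the proposed cache: parse each distinct
# (string, format) pair once and reuse the Timestamp on repeats
_cache = {}

def to_datetime_cached(values, format=None):
    out = []
    for v in values:
        key = (v, format)
        if key not in _cache:
            _cache[key] = pd.to_datetime(v, format=format)
        out.append(_cache[key])
    return pd.DatetimeIndex(out)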

Care must be taken if a format kw is provided (in to_datetime, as the cache will have to be exposed). This would be optional (and default to False), as I think mostly-unique dates could make things modestly slower (but this can be revisited if needed).

This might also need to accept a list of column names (like parse_dates) to enable per-column caching (e.g. you might want to apply it to a column, but not to the index).

Possibly we could overload parse_dates='cache' to mean this as well.

A trivial example:

In [1]: pd.DataFrame({'A' : ['20130101 00:00:00']*10000}).to_csv('test.csv',index=True)

In [14]: def parser(x):
   ....:         # parse each unique string once, then map the Timestamps back onto the full column
   ....:         uniques = pd.Series(pd.unique(x))
   ....:         d = pd.to_datetime(uniques)
   ....:         d.index = uniques
   ....:         return pd.Series(x).map(d).values
   ....: 
In [3]: df1 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'])

In [4]: df2 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)

In [17]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'])
1 loops, best of 3: 969 ms per loop

In [18]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)
100 loops, best of 3: 5.31 ms per loop

In [7]: df1.equals(df2)
Out[7]: True
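For completeness, the same unique-then-map trick can be applied directly to pd.to_datetime on a Series with heavily repeated values; this is a user-level sketch of what the proposed caching would do internally (to_datetime_unique is an illustrative name, not a real API):

import pandas as pd

def to_datetime_unique(x):
    # parse only the distinct strings, then broadcast the parsed
    # Timestamps back onto the full Series via map
    uniques = pd.Series(pd.unique(x))
    mapping = pd.Series(pd.to_datetime(uniques).values, index=uniques)
    return x.map(mapping)

s = pd.Series(['20130101 00:00:00'] * 10000)
assert to_datetime_unique(s).equals(pd.to_datetime(s))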