PERF: add datetime caching kw in to_datetime · Issue #11665 · pandas-dev/pandas
I'll propose a cache_datetime=False keyword as an addition to read_csv and pd.to_datetime. This would use a lookup cache (a dict will probably work) to map datetime strings to Timestamp objects. For repeated dates this will lead to some dramatic speedups.
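A minimal sketch of the dict-based cache (the helper name to_datetime_dict_cache is made up here for illustration, not an existing or proposed pandas API):

import pandas as pd

def to_datetime_dict_cache(values):
    # parse each distinct string exactly once; repeats hit the dict
    cache = {}
    result = []
    for val in values:
        ts = cache.get(val)
        if ts is None:
            ts = pd.to_datetime(val)  # scalar parse -> Timestamp
            cache[val] = ts
        result.append(ts)
    return pd.DatetimeIndex(result)

to_datetime_dict_cache(['20130101 00:00:00'] * 5)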
Care must be taken if a format kw is provided (in to_datetime, as the cache will have to be exposed). This would be optional (and default to False), as I think if you have unique dates this could modestly slow things down (but this can be revisited if needed).
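For instance, a cache built from the unique values would only need to forward the user-supplied format to the single vectorized parse. Again just a sketch with a hypothetical helper name:

import pandas as pd

def cached_to_datetime(values, format=None):
    # only the unique strings are parsed; the user-supplied format is
    # forwarded to that one pd.to_datetime call
    ser = pd.Series(values)
    uniques = pd.Series(ser.unique())
    d = pd.to_datetime(uniques, format=format)
    d.index = uniques
    return pd.DatetimeIndex(ser.map(d))

cached_to_datetime(['02/01/2013 00:00:00'] * 3, format='%d/%m/%Y %H:%M:%S')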
This might also need to accept a list of column names (like parse_dates) to enable per-column caching (e.g. you might want to apply it to a column, but not the index, for example). Possibly we could overload parse_dates='cache' to mean this as well.
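A per-column variant could look roughly like this (a sketch only; read_csv_cached_dates and cache_columns are made-up names, not proposed API):

import pandas as pd

def read_csv_cached_dates(path, cache_columns, **kwargs):
    # convert only the named columns through the cache, leaving the
    # index (and any other column) untouched
    df = pd.read_csv(path, **kwargs)
    for col in cache_columns:
        ser = df[col]
        uniques = pd.Series(ser.unique())
        d = pd.to_datetime(uniques)
        d.index = uniques
        df[col] = ser.map(d)
    return df

# e.g. read_csv_cached_dates('test.csv', cache_columns=['A'], index_col=0)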
Trivial example:
In [1]: pd.DataFrame({'A' : ['20130101 00:00:00']*10000}).to_csv('test.csv',index=True)
In [14]: def parser(x):
   ....:     # parse each unique string once, then map the parsed values
   ....:     # back onto the full (repeated) array
   ....:     uniques = pd.Series(pd.unique(x))
   ....:     d = pd.to_datetime(uniques)
   ....:     d.index = uniques
   ....:     return pd.Series(x).map(d).values
   ....:
In [3]: df1 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'])
In [4]: df2 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)
In [17]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'])
1 loops, best of 3: 969 ms per loop
In [18]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)
100 loops, best of 3: 5.31 ms per loop
In [7]: df1.equals(df2)
Out[7]: True
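(The timing difference here is the extreme case: all 10,000 rows hold the same datetime string, so the caching parser only has to do a single real parse.)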