PERF: improve DTI string parse by sinhrks · Pull Request #13692 · pandas-dev/pandas
- closes #11169 (test_constructor_compound_dtypes and test_invalid_index_types fail in parse_datetime_string), closes #11287 (TEST: failing test pandas.tseries.tests.test_frequencies.TestFrequencyInference)
- tests added / passed
- passes git diff upstream/master | flake8 --diff
- whatsnew entry
Cleaned up the DatetimeIndex constructor by removing the slower string-parsing path.
Performance Improvement
Related to #7599. The constructor now always uses to_datetime internally, because it tries some fast paths first.
import numpy as np
import pandas as pd

inp = np.array(['2011-01-01 09:00' for i in range(10000)])

# on current master
%timeit pd.DatetimeIndex(inp)
# 1 loops, best of 3: 3.41 s per loop
%timeit pd.to_datetime(inp)
# 100 loops, best of 3: 4.77 ms per loop

# after the PR
%timeit pd.DatetimeIndex(inp)
# 100 loops, best of 3: 4.23 ms per loop
%timeit pd.to_datetime(inp)
# 100 loops, best of 3: 4.25 ms per loop
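The `%timeit` lines above only work inside IPython. For running the same comparison from a plain script, a minimal sketch along the following lines may help. `best_of` is a hypothetical helper name, not part of pandas, and the `datetime.strptime` workload is only a stdlib stand-in so that the block runs without pandas installed.

```python
import timeit
from datetime import datetime


def best_of(fn, repeat=3, number=1):
    """Return the best per-call time in seconds.

    Runs `fn` `number` times per trial, for `repeat` trials, and keeps
    the fastest trial (similar in spirit to "best of 3" in %timeit).
    """
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number


# Stand-in workload (an assumption for illustration): parse the same
# string many times with the standard library instead of pandas.
inp = ['2011-01-01 09:00'] * 1000


def parse_all():
    return [datetime.strptime(s, '%Y-%m-%d %H:%M') for s in inp]


elapsed = best_of(parse_all, repeat=3, number=1)
print(f"best of 3: {elapsed * 1e3:.2f} ms per loop")
```

To reproduce the numbers reported here, the same helper could be called with `lambda: pd.DatetimeIndex(inp)` and `lambda: pd.to_datetime(inp)` on the NumPy array from the benchmark.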
Bug Fixes
The cleanup fixed these two kinds of issues:
1. #11169 and #11287: invalid string parsing may raise TypeError
I hit the same issue on Travis and fixed it with a try-except clause (I can't reproduce it on my local Mac).
2. Index may incorrectly coerce a mismatched tz
On current master, DatetimeIndex and a normal Index behave differently.
# OK
pd.DatetimeIndex([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
# TypeError: Already tz-aware, use tz_convert to convert.
# NG: it ignores the mismatch and coerces to the passed tz
pd.Index([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
# DatetimeIndex(['2010-12-31 21:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)
After the PR both behave the same and raise an understandable error.
pd.Index([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
# TypeError: data is already tz-aware US/Eastern, unable to set specified tz: US/Pacific
pd.DatetimeIndex([pd.Timestamp('2011-01-01', tz='US/Eastern')], tz='US/Pacific')
# TypeError: data is already tz-aware US/Eastern, unable to set specified tz: US/Pacific
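The check behind this error can be illustrated with a small standard-library sketch. This is not the pandas implementation: `check_tz` is a hypothetical helper, and the fixed-offset `timezone` objects merely stand in for the real US/Eastern and US/Pacific zones.

```python
from datetime import datetime, timedelta, timezone


def check_tz(values, tz):
    """Hypothetical helper: refuse to override the tz of tz-aware data.

    Naive values (tzinfo is None) pass through; tz-aware values must
    already match the requested tz, otherwise a TypeError is raised.
    """
    for value in values:
        existing = value.tzinfo
        if existing is not None and tz is not None and existing != tz:
            raise TypeError(
                f"data is already tz-aware {existing}, "
                f"unable to set specified tz: {tz}"
            )
    return values


# Fixed offsets used as stand-ins for the named zones (an assumption).
eastern = timezone(timedelta(hours=-5), "US/Eastern")
pacific = timezone(timedelta(hours=-8), "US/Pacific")
stamp = datetime(2011, 1, 1, tzinfo=eastern)

check_tz([stamp], eastern)  # same tz: accepted
try:
    check_tz([stamp], pacific)  # mismatched tz: rejected
except TypeError as exc:
    print(exc)
```

Raising instead of silently converting keeps the caller's intent explicit: converting an already tz-aware value to another zone is left to tz_convert, as the old DatetimeIndex message suggested.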