PERF/BUG: improve factorize for datetimetz by sinhrks · Pull Request #13750 · pandas-dev/pandas

Because factorize internally localizes datetimetz data, it raises when the data contains a DST boundary.

dti = pd.date_range('2016-11-06', freq='H', periods=5, tz='US/Eastern')
dti.factorize()
# AmbiguousTimeError: Cannot infer dst time from Timestamp('2016-11-06 01:00:00'), try using the 'ambiguous' argument

This PR skips that localization, which fixes the error and also improves perf.
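To illustrate why skipping the localization is safe, here is a minimal user-level sketch of the same idea. It is not the code changed in this PR: the helper name factorize_via_utc is hypothetical, and the sketch only uses public pandas methods (tz_convert, tz_localize, factorize). The values are factorized as naive UTC timestamps, where no wall-clock time is ambiguous, and the original timezone is reattached afterwards by converting from UTC rather than re-localizing local times.

```python
import pandas as pd


def factorize_via_utc(dti):
    """Hypothetical helper: factorize a tz-aware DatetimeIndex via naive UTC values."""
    # tz_convert(None) converts to UTC and drops the tz, so no local
    # wall-clock time ever has to be interpreted.
    naive_utc = dti.tz_convert(None)
    codes, uniques = naive_utc.factorize()
    # Localizing to UTC is never ambiguous; converting back restores the
    # original timezone without re-localizing local times.
    uniques = uniques.tz_localize("UTC").tz_convert(dti.tz)
    return codes, uniques


# Five hourly instants spanning the 2016-11-06 US/Eastern DST transition,
# built in UTC so that the construction itself cannot be ambiguous.
dti = pd.date_range(
    "2016-11-06 04:00", periods=5, freq=pd.Timedelta(hours=1), tz="UTC"
).tz_convert("US/Eastern")

codes, uniques = factorize_via_utc(dti)
print(codes)
print(uniques)
```

The local time 01:00 appears twice in uniques (once in EDT, once in EST), yet the two entries remain distinct because they are different UTC instants.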

dti = pd.date_range('2011-01-01', freq='H', periods=1000000, tz='Asia/Tokyo')
%timeit dti.factorize()
# on current master
#1 loop, best of 3: 475 ms per loop

# after this PR
#1 loop, best of 3: 262 ms per loop

asv:

   before     after       ratio
  [bb6b5e54] [a2c3370a]
-   22.46ms     9.97ms      0.44  timeseries.datetime_algorithm.time_dti_tz_factorize
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.