cache DateOffset attrs now that they are immutable by jbrockmendel · Pull Request #21582 · pandas-dev/pandas

TL;DR ~6x speedup in set_index for PeriodIndex-like column.

Alright! Now that DateOffset objects are immutable (#21341), we can start caching stuff. This was pretty much the original motivation that brought me here, so I'm pretty psyched to finally make this happen.

The motivating super-slow operation is df.set_index. Profiling before/after with:

import pandas as pd

idx = pd.period_range('May 1973', freq='M', periods=10**5)
df = pd.DataFrame({"A": 1, "B": idx})
out = df.set_index("B", append=True)

Total Runtime Before: 32.708 seconds
Total Runtime After: 5.340 seconds
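
For anyone who wants to reproduce listings like the pstats output in this description, here is a minimal sketch using only the standard-library profiler. The helper name `profile_call` and its `top` parameter are made up for illustration; this is not necessarily how the numbers here were collected.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=20, **kwargs):
    """Run func under cProfile and return (result, text report).

    Hypothetical helper: the report is sorted by cumulative time and
    truncated to the `top` most expensive entries.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always stop profiling, even if func raises.
        profiler.disable()

    # Render the stats into a string instead of printing to stdout.
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()
```

With the `df` from the snippet above, something like `out, report = profile_call(df.set_index, "B", append=True)` followed by `print(report)` should give a comparable listing.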

pstats output (truncated) before:

        1    0.000    0.000   31.903   31.903 pandas/core/frame.py:3807(set_index)
        1    0.000    0.000   31.897   31.897 pandas/core/indexes/base.py:4823(_ensure_index_from_sequences)
        1    0.000    0.000   31.896   31.896 pandas/core/indexes/multi.py:1246(from_arrays)
        1    0.000    0.000   31.896   31.896 pandas/core/arrays/categorical.py:2590(_factorize_from_iterables)
        2    0.001    0.000   31.896   15.948 pandas/core/arrays/categorical.py:2553(_factorize_from_iterable)
        2    0.000    0.000   31.895   15.948 pandas/core/arrays/categorical.py:318(__init__)
        2    0.000    0.000   31.512   15.756 pandas/util/_decorators.py:136(wrapper)
        2    0.002    0.001   31.512   15.756 pandas/core/algorithms.py:576(factorize)
  1600011    1.168    0.000   28.211    0.000 pandas/tseries/offsets.py:338(__ne__)
        4    1.820    0.455   28.042    7.010 {method 'argsort' of 'numpy.ndarray' objects}
  1600011    4.016    0.000   27.042    0.000 pandas/tseries/offsets.py:324(__eq__)
  3200022   16.987    0.000   21.856    0.000 pandas/tseries/offsets.py:291(_params)
        2    0.000    0.000    3.460    1.730 pandas/core/algorithms.py:449(_factorize_array)
        1    0.617    0.617    3.445    3.445 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}
  3200023    3.200    0.000    3.200    0.000 {sorted}
3400729/3400727    1.235    0.000    1.235    0.000 {isinstance}
   400004    0.984    0.000    1.060    0.000 pandas/tseries/offsets.py:400(freqstr)
  3200022    0.840    0.000    0.840    0.000 {method 'copy' of 'dict' objects}
  3200023    0.829    0.000    0.829    0.000 {method 'items' of 'dict' objects}

pstats output (truncated) after:

        1    0.000    0.000    4.571    4.571 pandas/core/frame.py:3807(set_index)
        1    0.000    0.000    4.561    4.561 pandas/core/indexes/base.py:4823(_ensure_index_from_sequences)
        1    0.000    0.000    4.561    4.561 pandas/core/indexes/multi.py:1246(from_arrays)
        1    0.000    0.000    4.561    4.561 pandas/core/arrays/categorical.py:2590(_factorize_from_iterables)
        2    0.001    0.000    4.561    2.280 pandas/core/arrays/categorical.py:2553(_factorize_from_iterable)
        2    0.000    0.000    4.560    2.280 pandas/core/arrays/categorical.py:318(__init__)
        2    0.000    0.000    4.506    2.253 pandas/util/_decorators.py:136(wrapper)
        2    0.003    0.001    4.506    2.253 pandas/core/algorithms.py:576(factorize)
        4    1.170    0.292    4.090    1.022 {method 'argsort' of 'numpy.ndarray' objects}
  1600011    0.870    0.000    3.138    0.000 pandas/tseries/offsets.py:337(__ne__)
  1600011    1.475    0.000    2.267    0.000 pandas/tseries/offsets.py:325(__eq__)
3400729/3400727    0.845    0.000    0.846    0.000 {isinstance}

The _params calls that make up half of the runtime in the before version don't even make the cut for the pstats output in the after version.
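
To illustrate the idea (not the actual diff), here is a rough sketch of why immutability makes this kind of caching safe. `ToyOffset`, `_params_cache`, and `compute_count` are hypothetical names invented for the example, and the hand-rolled cache slot is just one way to do it; the real DateOffset code differs.

```python
class ToyOffset:
    """Hypothetical immutable offset-like object; NOT pandas' DateOffset."""

    __slots__ = ("n", "kwds", "_params_cache", "compute_count")

    def __init__(self, n=1, **kwds):
        # Bypass our own __setattr__ guard during construction.
        object.__setattr__(self, "n", n)
        object.__setattr__(self, "kwds", tuple(sorted(kwds.items())))
        object.__setattr__(self, "_params_cache", None)
        object.__setattr__(self, "compute_count", 0)

    def __setattr__(self, name, value):
        # Public state can never change, so a cached summary cannot go stale.
        raise AttributeError("ToyOffset instances are immutable")

    @property
    def _params(self):
        if self._params_cache is None:
            # Build the hashable summary once, then reuse it on every
            # later comparison instead of rebuilding it each time.
            params = (type(self).__name__, self.n) + self.kwds
            object.__setattr__(self, "_params_cache", params)
            object.__setattr__(self, "compute_count", self.compute_count + 1)
        return self._params_cache

    def __eq__(self, other):
        if not isinstance(other, ToyOffset):
            return NotImplemented
        return self._params == other._params

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return hash(self._params)
```

However many times two such objects are compared (as argsort does on an object array), each one builds its summary only once. If attributes could be reassigned after construction, the cached value could silently go out of date, which is why this had to wait for immutability.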

There is some more tweaking around the edges we can do for perf, but this is the big one. (Another big one will come when columns can have PeriodDtype.)

passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry