PERF: Index.sort_values for already sorted index by lukemanley · Pull Request #56128 · pandas-dev/pandas

lukemanley

Take advantage of the cached is_monotonic_increasing / is_monotonic_decreasing attributes to skip the sort when the index is already sorted.
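The idea can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `sort_values_fast_path` is a hypothetical helper name, and it only uses the public `Index.is_monotonic_increasing` and `Index.is_monotonic_decreasing` properties. If the index is already in the requested order, a copy is returned. If it is sorted in the opposite order, it is reversed. Otherwise the regular sort runs.

```python
import pandas as pd


def sort_values_fast_path(idx: pd.Index, ascending: bool = True) -> pd.Index:
    """Hypothetical helper (not pandas API): skip sorting when idx is already sorted."""
    increasing = idx.is_monotonic_increasing
    decreasing = idx.is_monotonic_decreasing

    # Already in the requested order: nothing to sort, return a copy.
    if (ascending and increasing) or (not ascending and decreasing):
        return idx.copy()

    # Sorted in the opposite order: reversing avoids a comparison sort.
    if increasing or decreasing:
        return idx[::-1]

    # Not sorted at all: fall back to the regular sort.
    return idx.sort_values(ascending=ascending)


idx = pd.date_range("2000-01-01", freq="s", periods=5)
print(sort_values_fast_path(idx, ascending=False))
```

The monotonic checks are cached per Index object, so calling the helper repeatedly on the same index does not repeat the scan.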

import pandas as pd

N = 1_000_000

idx = pd._testing.makeStringIndex(N).sort_values()
%timeit idx.sort_values()

# 2.35 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
# 14.6 µs ± 3.48 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

idx = pd.date_range("2000-01-01", freq="s", periods=N)
%timeit idx.sort_values(ascending=False)

# 90.3 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)      <- main
# 11.6 µs ± 384 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  <- PR

Existing ASV:


from asv_bench.benchmarks.categoricals import Indexing

b = Indexing()
b.setup()
%timeit b.time_sort_values()

# 4.85 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)     <- main
# 21.4 µs ± 319 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  <- PR

@phofl

Can you try this when this isn't cached? E.g. recreating the series for every pass

@lukemanley

> Can you try this when this isn't cached? E.g. recreating the series for every pass

Without having been pre-cached:

import pandas as pd

N = 1_000_000

values = pd._testing.makeStringIndex(N).sort_values().values
%timeit pd.Index(values).sort_values()

# 2.45 s ± 48.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
# 259 ms ± 9.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- PR

values = pd.date_range("2000-01-01", freq="s", periods=N).values
%timeit pd.Index(values).sort_values(ascending=False)

# 91.7 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- main
# 2.89 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
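The gap between the cached and uncached timings comes from the monotonic check itself: the first access on an Index object scans the values, and later accesses on the same object are expected to reuse the stored result. A small sketch of that behaviour, runnable as a plain script, uses `time.perf_counter` instead of `%timeit`. The variable names and the integer data are illustrative assumptions, not taken from the benchmarks above.

```python
import time

import numpy as np
import pandas as pd

N = 1_000_000
values = np.arange(N)  # already sorted; illustrative data only

idx = pd.Index(values)

start = time.perf_counter()
first = idx.is_monotonic_increasing  # first access: scans the values
first_elapsed = time.perf_counter() - start

start = time.perf_counter()
second = idx.is_monotonic_increasing  # same object: expected to reuse the cached result
second_elapsed = time.perf_counter() - start

# A newly constructed Index carries no cached result, so the scan runs again.
fresh = pd.Index(values)
start = time.perf_counter()
fresh_result = fresh.is_monotonic_increasing
fresh_elapsed = time.perf_counter() - start

print(f"first access:  {first_elapsed:.6f} s")
print(f"second access: {second_elapsed:.6f} s")
print(f"fresh Index:   {fresh_elapsed:.6f} s")
```

Exact timings depend on the machine and the pandas version. The point is only that rebuilding the Index on every pass, as in `pd.Index(values).sort_values()`, pays for the scan each time but still avoids the full sort.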
