PERF: pd.concat EA-backed indexes and sort=True by lukemanley · Pull Request #49178 · pandas-dev/pandas

algos.safe_sort currently converts EA-backed (ExtensionArray-backed) indexes to ndarrays, which can cause a perf hit. The hit can be significant if the index contains pd.NA, because np.argsort raises and the exception is then caught in a try/except. This PR avoids the numpy conversion for EA-backed indexes.

Note: One test that relied on the numpy conversion was updated.
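
To make the failure mode concrete, here is a small illustrative sketch. It is not the code changed in this PR, and the variable names are made up for the example. It sorts a tiny Int64 array two ways: once by converting to an object ndarray and calling np.argsort, and once through the ExtensionArray's own argsort.

```python
import numpy as np
import pandas as pd

# Small nullable-integer array containing a missing value.
arr = pd.array([3, pd.NA, 1], dtype="Int64")

# Path 1: convert to an object ndarray first, then sort with NumPy.
# Comparing pd.NA with an integer yields pd.NA, whose truth value is
# ambiguous, so NumPy's comparison-based sort raises TypeError.
as_object = arr.to_numpy(dtype=object)
try:
    np.argsort(as_object)
    numpy_raised = False
except TypeError:
    numpy_raised = True

# Path 2: sort through the ExtensionArray itself, which handles
# missing values explicitly (they are placed last by default).
order = arr.argsort()
sorted_arr = arr.take(order)

print("np.argsort on object ndarray raised:", numpy_raised)
print("EA argsort order:", order.tolist())
print(sorted_arr)
```

On large indexes the first path pays for the object conversion, the failed sort, and whatever fallback runs after the exception, which is where the slowdown comes from.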

import numpy as np
import pandas as pd
from pandas.core.indexes.api import safe_sort_index

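# Int64 (EA-backed) index: a leading pd.NA followed by 100,000 integers
vals = [pd.NA] + list(np.arange(100_000))
idx = pd.Index(vals, dtype="Int64")

# %timeit is an IPython magic, so run this in IPython or Jupyter
%timeit safe_sort_index(idx)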

# 81.6 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)   <- main
# 2.73 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

I also updated an existing ASV benchmark, added just yesterday, to cover this case:

       before           after         ratio
     [8b503a8c]       [2e2a02fa]
                      <safe-sort-index>
-        27.1±1ms       22.2±0.4ms     0.82  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'non_monotonic', 1, True)
-      15.9±0.2ms       12.3±0.1ms     0.78  join_merge.ConcatIndexDtype.time_concat_series('string[python]', 'has_na', 1, True)
-      31.4±0.7ms       24.1±0.7ms     0.77  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'has_na', 1, True)
-      22.8±0.2ms       16.4±0.3ms     0.72  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'has_na', 1, True)
-      13.8±0.4ms       9.26±0.1ms     0.67  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 1, True)
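
For readers unfamiliar with the benchmark, below is a minimal sketch of the kind of call it times, assuming the trailing benchmark parameters 1 and True are axis and sort. The data is invented and far smaller than anything a real benchmark would use, and this is not the ASV setup code.

```python
import pandas as pd

# Two Series with different, non-monotonic Int64 (EA-backed) indexes.
left = pd.Series([10, 20, 30], index=pd.Index([3, 1, 2], dtype="Int64"))
right = pd.Series([40, 50], index=pd.Index([2, 1], dtype="Int64"))

# axis=1 aligns the Series on the union of their indexes; sort=True
# asks pandas to sort that combined index.
result = pd.concat([left, right], axis=1, sort=True)

print(result.index.tolist())
print(result)
```

Sorting the combined index is the step this PR targets, so calls shaped like this one are where the improvement should show up.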