PERF: pd.concat EA-backed indexes and sort=True by lukemanley · Pull Request #49178 · pandas-dev/pandas
algos.safe_sort currently converts EA-backed indexes to ndarrays, which can cause a perf hit. The hit can be significant if the index contains pd.NA, because np.argsort raises and the exception is then caught in the try/except. This PR avoids the numpy conversion for EA-backed indexes.
Note: One test which relied on the numpy conversion was updated.
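To illustrate the failure mode described above, here is a minimal sketch using only public pandas and numpy calls rather than the internal safe_sort path. It assumes the numpy conversion yields an object ndarray, as Index.to_numpy(dtype=object) does; the variable names are illustrative and not taken from the PR.

```python
import numpy as np
import pandas as pd

# Small EA-backed (masked Int64) index containing a missing value.
idx = pd.Index([3, pd.NA, 1, 2], dtype="Int64")

# Assumption: the ndarray conversion produces an object array holding pd.NA.
as_object = idx.to_numpy(dtype=object)

# Comparing against pd.NA returns pd.NA, whose truth value is ambiguous,
# so sorting the object array raises and has to be handled by the caller.
try:
    np.argsort(as_object)
    numpy_sort_failed = False
except TypeError:
    numpy_sort_failed = True

# Sorting the EA-backed index directly never goes through that comparison.
sorted_idx = idx.sort_values()  # missing values are placed last by default

print(numpy_sort_failed)
print(sorted_idx)
```

Staying on the extension array lets the sort use the array's own mask handling instead of element-wise Python comparisons.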
import numpy as np
import pandas as pd
from pandas.core.indexes.api import safe_sort_index
vals = [pd.NA] + list(np.arange(100_000))
idx = pd.Index(vals, dtype="Int64")
%timeit safe_sort_index(idx)
# 81.6 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <- main
# 2.73 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
I updated an existing ASV benchmark, added just yesterday, to cover this case as well:
       before           after         ratio
     [8b503a8c]       [2e2a02fa]
                      <safe-sort-index>
-        27.1±1ms       22.2±0.4ms     0.82  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'non_monotonic', 1, True)
-      15.9±0.2ms       12.3±0.1ms     0.78  join_merge.ConcatIndexDtype.time_concat_series('string[python]', 'has_na', 1, True)
-      31.4±0.7ms       24.1±0.7ms     0.77  join_merge.ConcatIndexDtype.time_concat_series('string[pyarrow]', 'has_na', 1, True)
-      22.8±0.2ms       16.4±0.3ms     0.72  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'has_na', 1, True)
-      13.8±0.4ms      9.26±0.1ms     0.67  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 1, True)
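For readers who want to exercise a similar code path without running ASV, the following is a rough sketch of the kind of call the benchmark rows above time: pd.concat along axis=1 with sort=True over Series whose Int64 indexes differ and contain pd.NA. The size N and the way the Series are built are assumptions for illustration and are not copied from the benchmark source.

```python
import pandas as pd

N = 1_000  # hypothetical size, chosen only to keep the example quick

# "has_na"-style index: one missing value followed by unique integers.
idx = pd.Index([pd.NA] + list(range(N - 1)), dtype="Int64")

# Series with overlapping but unequal indexes, so concat must union them.
series_list = [pd.Series(i, index=idx[:-i]) for i in range(1, 6)]

# sort=True asks pandas to sort the unioned index of the result.
result = pd.concat(series_list, axis=1, sort=True)

print(result.shape)
print(result.index.dtype)
```

Wrapping the pd.concat call in timeit on both branches should give a comparison in the same spirit as the table, though absolute numbers will differ from the ASV run.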