PERF: StringEngine for string dtype indexing ops by lukemanley · Pull Request #56997 · pandas-dev/pandas

StringEngine seems to be about 2x faster than ObjectEngine for indexing ops that go through the hashmap, e.g. `get_indexer_for`:

import pandas as pd

N = 100_000
dtype = "string[pyarrow_numpy]"

strings = [f"i-{i}" for i in range(N)]

idx1 = pd.Index(strings[10:], dtype=dtype)
idx2 = pd.Index(strings[:-10], dtype=dtype)

%timeit idx1.get_indexer_for(idx2)

# 52.6 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> main
# 25 ms ± 790 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    -> PR
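
The snippet above relies on IPython's `%timeit`. For reference, here is a minimal sketch of the same benchmark as a plain Python script, using the standard-library `timeit` module and also checking that the indexer result is what the construction implies. The helper names `make_indexes` and `check_and_time` are hypothetical and not part of this PR. The `dtype` parameter defaults to `object` only so that the sketch runs without pyarrow; passing `"string[pyarrow_numpy]"` (assuming a pandas version and pyarrow install that support it) should exercise the string-dtype path being compared here.

```python
import timeit

import numpy as np
import pandas as pd


def make_indexes(n, dtype):
    """Build the two overlapping indexes used in the benchmark above."""
    strings = [f"i-{i}" for i in range(n)]
    idx1 = pd.Index(strings[10:], dtype=dtype)   # i-10 ... i-(n-1)
    idx2 = pd.Index(strings[:-10], dtype=dtype)  # i-0  ... i-(n-11)
    return idx1, idx2


def check_and_time(n=100_000, dtype=object, repeat=5):
    """Return (indexer, best wall-clock seconds) for idx1.get_indexer_for(idx2).

    Assumes n >= 20 so that the two indexes overlap.
    """
    idx1, idx2 = make_indexes(n, dtype)
    result = idx1.get_indexer_for(idx2)

    # The first 10 labels of idx2 are missing from idx1 (-1); the remaining
    # labels map to positions 0 .. n-21 in idx1.
    expected = np.concatenate([np.full(10, -1), np.arange(n - 20)])
    assert np.array_equal(result, expected)

    # Best of `repeat` single runs; not directly comparable to %timeit's
    # mean ± std. dev. output.
    timings = timeit.repeat(lambda: idx1.get_indexer_for(idx2), number=1, repeat=repeat)
    return result, min(timings)


if __name__ == "__main__":
    # Swap in dtype="string[pyarrow_numpy]" to mirror the PR benchmark.
    _, best = check_and_time(dtype=object)
    print(f"best of 5: {best * 1e3:.1f} ms")
```

Running the script once on main and once on this branch with the string dtype should give numbers roughly comparable to the ones quoted above, though absolute timings will vary by machine.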