gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) by dg-pb · Pull Request #120025 · python/cpython
1. Work
I managed to combine all the good tricks I found in the library into one dynamic solution, which seems to perform well and eliminates hard-coded boundaries for algorithm selection.
- Instead of 3 different implementations, only one (`horspool_find`) is now called (for both forward and reverse search). It dynamically defaults to the linear-complexity-assured solution (`two_way_find`) if it predicts that will perform better.
- Direction-agnostic logic allowed `rfind` to use the exact same code as `find`.
- Added a special case: `n == m` now uses `memcmp`.
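The scan described above can be sketched in pure Python. This is a simplified illustration of the Horspool bad-character skip and the `n == m` special case, not the PR's C code (the real `horspool_find`, including the dynamic fallback to `two_way_find`, lives in CPython's `Objects/stringlib/fastsearch.h`); the fallback prediction is omitted here.

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool scan: illustrative sketch only."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    if n == m:
        # Special case: one full comparison (memcmp in the C version).
        return 0 if haystack == needle else -1
    # Bad-character table: how far to shift when the character aligned
    # with the end of the needle is c (characters absent from
    # needle[:-1] allow a full shift of m).
    shift = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    i = m - 1  # index of the last character of the current window
    while i < n:
        if haystack[i] == needle[-1] and haystack[i - m + 1:i + 1] == needle:
            return i - m + 1
        i += shift.get(haystack[i], m)
    return -1
```

A reversed search can reuse the same structure by mirroring the indexing, which is the direction-agnostic idea the PR applies.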
2. Results
The aggregate impact of this change is net positive. It yields a non-trivial average performance increase, adapts more advanced search algorithms for reverse search, smooths out the performance surface, and improves general code structure and documentation.
Benefits:
- The performance surface is much smoother now. Only one piece of logic can cause a step change, and it is dynamic, as opposed to the many hard-coded step changes of the current logic.
- Direction-agnostic logic works well and removes the strain of the alternative: keeping 2 implementations in sync.
- Benchmarks:
  - Average 75% performance increase of `find` for an artificial benchmark of shuffled alphabet.
  - Average 34% performance increase of `find` for real-file search of different slice lengths.
  - Average 247% performance increase of `rfind` for an artificial benchmark of shuffled alphabet.
Worth noting:
- Splitting the 2 directions (forward and reverse) into 2 implementations would give 10-30% better performance on the tested benchmarks. However, I think the unified approach is a good trade-off, given its advantages.
- There are areas and cases where the new algorithm performs worse (see benchmarks). However, they are either not clustered, or where they are, the performance decrease is not substantial.
3. Benchmarks:
Benchmark result value:
current_runtime = run time of the current Python version
new_runtime = run time of this PR
result = (new_runtime - current_runtime) / min(new_runtime, current_runtime)
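Restated as a small helper (the function name is mine; semantics come from the formula above): positive values mean this PR is slower, negative values mean it is faster, and dividing by the smaller runtime makes a 2x speedup and a 2x slowdown symmetric (-1.0 vs +1.0).

```python
def benchmark_result(new_runtime, current_runtime):
    """Signed relative difference between the PR and the current version.

    Positive => PR slower, negative => PR faster.  Normalizing by the
    smaller of the two runtimes keeps speedups and slowdowns symmetric.
    """
    return (new_runtime - current_runtime) / min(new_runtime, current_runtime)
```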
3.1. Artificial dataset via randomized alphabet.
Case Generation Code
shuffled alphabet
```python
import random

# PATH is defined elsewhere in the benchmark harness.
alphabet = 'DHUXYEZQCLFKISBVRGNAMWPTOJ'
zipf = [1/x for x in range(1, 1 + len(alphabet))]

def zipf_string(length, seed):
    letters = random.Random(seed).choices(alphabet, weights=zipf, k=length)
    return ''.join(letters)

NLS = [2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128,
       192, 256, 384, 500, 1000, 10000, 100_000]
HSS = [500, 750, 1000, 1500, 2_000, 3_000, 4_000, 6_000, 8_000, 12_000,
       16_000, 24_000, 32_000, 48_000, 64_000, 96_000, 1_000_000]

def generate_benchmarks():
    output = []
    for m in NLS:
        for n in HSS:
            if n < m:
                continue
            for s in (1, 2, 3):
                seed = (s*n + m) % 1_000_003
                needle = zipf_string(m, seed)
                haystack = zipf_string(n, seed ** 2)
                name = f"needle={m}, haystack={n}, seed={s}"
                output.append((name, needle, haystack))
    with open(f"{PATH}/_generated.py", 'w') as f:
        print("benches = [", file=f)
        for name, needle, haystack in output:
            print(f"    {(name, needle, haystack)!r},", file=f)
        print("]", file=f)
```
1.a. Results. Current vs new `str.find/str.count`.
Comparison for `len(haystack) == 1000` for `str.find`. The x-axis is `"{needle_len}:{seed}"`. The upper chart is run time; the lower chart is percentage difference. It depicts the issue this PR addresses: big sub-optimal step changes in performance for small input changes.
1.b. Results. Current vs new `str.rfind/str.rcount`.
3.2. Search for arbitrary chunks in real files.
Case Generation Code
```python
# CPYTHON_PATH (a pathlib.Path to a CPython checkout) is defined elsewhere
# in the benchmark harness.
FILES = {
    "c": (CPYTHON_PATH / "Objects" / "unicodeobject.c").read_text(),
    "py": (CPYTHON_PATH / "Lib" / "_pydecimal.py").read_text(),
    "en": (CPYTHON_PATH / "Doc" / "library" / "stdtypes.rst").read_text(),
    "bin": (CPYTHON_PATH / "python.exe").read_bytes(),
}

MS = [10, 15, 20, 30, 40, 60, 80, 120, 160, 240, 320, 640, 1280]
MR = range(12)

def generate_benchmarks():
    results = dict()
    for file_label, haystack in FILES.items():
        n = len(haystack)
        for m in MS:
            for i in MR:
                stt = (1_000_003 * i) % (n - m)
                needle = haystack[stt:stt + m]
                results[(m, file_label, i)] = haystack, needle
    return results
```
2.a. Results. Current vs new `str.find/str.count`.