gh-119702: New dynamic algorithm selection for string search (+ rfind alignment) by dg-pb · Pull Request #120025 · python/cpython
1. Work
I managed to combine all the good tricks I found in the library into one dynamic solution, which seems to perform well and eliminates hard-coded boundaries for algorithm selection.
- Instead of 3 different implementations, only one (`horspool_find`) is now called (for both forward and reverse search). It dynamically defaults to the linear-complexity-assured solution (`two_way_find`) if it predicts that will perform better.
- Direction-agnostic logic allowed `rfind` to use the exact same code as `find`.
- Added a special case: `n == m` now uses `memcmp`.
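The scan described above can be sketched in pure Python. This is a simplified illustration of the Horspool bad-character skip and the `n == m` special case, not the PR's C code (the real `horspool_find`, including the dynamic fallback to `two_way_find`, lives in CPython's `Objects/stringlib/fastsearch.h`); the fallback prediction is omitted here.

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool scan: illustrative sketch only."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    if n == m:
        # Special case: one full comparison (memcmp in the C version).
        return 0 if haystack == needle else -1
    # Bad-character table: how far to shift when the character aligned
    # with the end of the needle is c (characters absent from
    # needle[:-1] allow a full shift of m).
    shift = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    i = m - 1  # index of the last character of the current window
    while i < n:
        if haystack[i] == needle[-1] and haystack[i - m + 1:i + 1] == needle:
            return i - m + 1
        i += shift.get(haystack[i], m)
    return -1
```

A reversed search can reuse the same structure by mirroring the indexing, which is the direction-agnostic idea the PR applies.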
2. Results
The aggregate impact of this change is net positive. It yields a non-trivial average performance increase, adapts more advanced search algorithms for reverse search, smooths out the performance surface, and improves general code structure and documentation.
Benefits:
- The performance surface is much smoother now. Only one piece of logic can cause a step change, and it is dynamic, as opposed to the many hard-coded step changes of the current logic.
- Direction-agnostic logic works well and removes the strain of the alternative: keeping 2 implementations in sync.
- Benchmarks:
  - Average 75% performance increase of `find` for an artificial benchmark of shuffled alphabet.
  - Average 34% performance increase of `find` for real-file search of different slice lengths.
  - Average 247% performance increase of `rfind` for an artificial benchmark of shuffled alphabet.
Worth noting:
- Splitting the 2 directions (forward and reverse) into 2 implementations would give 10-30% better performance on the tested benchmarks. However, I think the unified approach is a good trade-off, given its advantages.
- There are areas and cases where the new algorithm performs worse (see benchmarks). However, they are either not clustered, or where they are, the performance decrease is not substantial.
3. Benchmarks:
Benchmark result value:
current_runtime = run time of the current Python version
new_runtime = run time of this PR
result = (new_runtime - current_runtime) / min(new_runtime, current_runtime)
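Restated as a small helper (the function name is mine; semantics come from the formula above): positive values mean this PR is slower, negative values mean it is faster, and dividing by the smaller runtime makes a 2x speedup and a 2x slowdown symmetric (-1.0 vs +1.0).

```python
def benchmark_result(new_runtime, current_runtime):
    """Signed relative difference between the PR and the current version.

    Positive => PR slower, negative => PR faster.  Normalizing by the
    smaller of the two runtimes keeps speedups and slowdowns symmetric.
    """
    return (new_runtime - current_runtime) / min(new_runtime, current_runtime)
```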
3.1. Artificial dataset via randomized alphabet.
Case Generation Code
shuffled alphabet
```python
import random

# PATH is defined elsewhere in the benchmark harness.
alphabet = 'DHUXYEZQCLFKISBVRGNAMWPTOJ'
zipf = [1/x for x in range(1, 1 + len(alphabet))]

def zipf_string(length, seed):
    letters = random.Random(seed).choices(alphabet, weights=zipf, k=length)
    return ''.join(letters)

NLS = [2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128,
       192, 256, 384, 500, 1000, 10000, 100_000]
HSS = [500, 750, 1000, 1500, 2_000, 3_000, 4_000, 6_000, 8_000, 12_000,
       16_000, 24_000, 32_000, 48_000, 64_000, 96_000, 1_000_000]

def generate_benchmarks():
    output = []
    for m in NLS:
        for n in HSS:
            if n < m:
                continue
            for s in (1, 2, 3):
                seed = (s*n + m) % 1_000_003
                needle = zipf_string(m, seed)
                haystack = zipf_string(n, seed ** 2)
                name = f"needle={m}, haystack={n}, seed={s}"
                output.append((name, needle, haystack))
    with open(f"{PATH}/_generated.py", 'w') as f:
        print("benches = [", file=f)
        for name, needle, haystack in output:
            print(f"    {(name, needle, haystack)!r},", file=f)
        print("]", file=f)
```
1.a. Results. Current vs new `str.find/str.count`.
Comparison for `len(haystack) == 1000` for `str.find`. The x-axis is `"{needle_len}:{seed}"`. The upper chart is run time; the lower chart is percentage difference. It depicts the issue this PR addresses: big sub-optimal step changes in performance for small input changes.
1.b. Results. Current vs new `str.rfind/str.rcount`.
3.2. Search for arbitrary chunks in real files.
Case Generation Code
```python
# CPYTHON_PATH (a pathlib.Path to a CPython checkout) is defined elsewhere
# in the benchmark harness.
FILES = {
    "c": (CPYTHON_PATH / "Objects" / "unicodeobject.c").read_text(),
    "py": (CPYTHON_PATH / "Lib" / "_pydecimal.py").read_text(),
    "en": (CPYTHON_PATH / "Doc" / "library" / "stdtypes.rst").read_text(),
    "bin": (CPYTHON_PATH / "python.exe").read_bytes(),
}

MS = [10, 15, 20, 30, 40, 60, 80, 120, 160, 240, 320, 640, 1280]
MR = range(12)

def generate_benchmarks():
    results = dict()
    for file_label, haystack in FILES.items():
        n = len(haystack)
        for m in MS:
            for i in MR:
                stt = (1_000_003 * i) % (n - m)
                needle = haystack[stt:stt + m]
                results[(m, file_label, i)] = haystack, needle
    return results
```
2.a. Results. Current vs new `str.find/str.count`.