Use find for search_n when n=1 by AlexGuteniev · Pull Request #5346 · microsoft/STL

📜 The optimization

There are two implementations of search_n: one in std and one in std::ranges. For random-access iterators, both take advantage of the fact that the run being searched for is contiguous. They jump forward by n and try to match backward from the end of that window. This allows skipping some comparisons. When mismatches are more common than matches, the result is a fast pass over the range with few comparisons.
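The jump-and-match-backward idea can be sketched as follows. This is a simplified illustration, not the STL's actual code: `search_n_skip` is a hypothetical name, it works on a `std::vector` with indices rather than on iterators, and it re-checks the already matched tail of a window instead of remembering it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the STL implementation): returns the index of the
// first run of `n` consecutive elements equal to `value`, or data.size() if
// there is no such run.
template <class T>
std::size_t search_n_skip(const std::vector<T>& data, std::size_t n, const T& value) {
    assert(n > 0);
    const std::size_t size = data.size();
    std::size_t start = 0; // candidate start of the run

    while (size - start >= n) {
        // Check the window [start, start + n) from its last element backward.
        std::size_t i = start + n;
        while (i > start && data[i - 1] == value) {
            --i;
        }
        if (i == start) {
            return start; // the whole window matched
        }
        // data[i - 1] mismatched, so no run can contain index i - 1.
        // Every candidate start in (start, i - 1] is ruled out at once.
        start = i;
    }
    return size;
}
```

For `{1, 2, 2, 2, 3}` with n=3 and value 2, the first window fails at index 0, the next candidate is index 1, and the function returns 1. With n=1 every window holds a single element, so the loop degenerates into a plain linear scan.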

This means that for large values of n and non-pathological input, the algorithm is unlikely to benefit from vectorization at all.

For small values of n, however, the algorithm performs worse.

The worst case is n=1, where the algorithm is just find with extra steps. The PR forwards this case directly to find, where it may pick the vectorized implementation or memchr. Even if it doesn't, it still avoids the extra steps.
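The equivalence behind this forwarding can be checked with standard calls only. The helper name `count_one_matches_find` is made up for this illustration and is not part of the PR.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical check: for a count of 1, std::search_n and std::find are
// expected to return the same position for any value.
bool count_one_matches_find(const std::vector<int>& data, int value) {
    const auto via_search_n = std::search_n(data.begin(), data.end(), 1, value);
    const auto via_find = std::find(data.begin(), data.end(), value);
    return via_search_n == via_find;
}
```

Both calls return the end iterator when the value is absent, so the check also holds for the not-found case.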

⚖️ Predicate check

Unlike many other algorithms, such as find, the search_n algorithm takes both a value and a predicate. We want to forward to the predicate-less find, since the goal is to engage vectorization, so we can only do this when we see the default equal_to predicate. Binding the value and the predicate into a bigger predicate and passing that to find_if would work for more cases, but would not be (manually) vectorized.

Since the value type and the iterator type are unrelated, the comparison is potentially heterogeneous, so it is hard to verify whether a non-void specialization of std::equal_to<T> does the same as the default comparison. We'll skip that, and check just for std::equal_to<void> and ranges::equal_to.
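A minimal sketch of such a predicate check, under stated assumptions: `is_default_equality` and `search_n_dispatch` are hypothetical names rather than the STL's internals, and the ranges::equal_to case is guarded by a feature-test macro so the sketch also compiles in C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <vector>

// Hypothetical trait: true only for predicates known to perform a plain ==.
// std::equal_to<T> for non-void T is deliberately not accepted.
template <class Pred>
struct is_default_equality : std::is_same<Pred, std::equal_to<>> {};

#ifdef __cpp_lib_ranges
template <>
struct is_default_equality<std::ranges::equal_to> : std::true_type {};
#endif

// Hypothetical dispatcher: forwards to std::find only when n == 1 and the
// predicate is a recognized default equality; otherwise uses std::search_n.
template <class It, class T, class Pred>
It search_n_dispatch(It first, It last, std::ptrdiff_t n, const T& value, Pred pred) {
    if constexpr (is_default_equality<Pred>::value) {
        if (n == 1) {
            return std::find(first, last, value);
        }
    }
    return std::search_n(first, last, n, value, pred);
}
```

Calling `search_n_dispatch(v.begin(), v.end(), 1, 7, std::equal_to<>{})` takes the find path, while passing `std::equal_to<int>{}` or a lambda falls through to the general search_n path. Both paths return the same position; only the chance of hitting a vectorized implementation differs.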

✅ Test coverage

There is no comprehensive test coverage of std::search_n 🙀, just some ad-hoc tests, mostly negative. Creating a comprehensive test seems out of scope for this PR. The n=1 case seems to be covered indirectly via the P0024R2_parallel_algorithms_search_n test, along with many other cases.

For ranges::search_n there's a pre-existing test that provides at least minimal coverage; I expanded it with the n=1 case.

⏱️ Benchmark results

| Benchmark | Before | After |
| --- | --- | --- |
| `bm<uint8_t, AlgType::Std>/3000` | 525 ns | 17.5 ns |
| `bm<uint8_t, AlgType::Rng>/3000` | 995 ns | 17.5 ns |
| `bm<uint16_t, AlgType::Std>/3000` | 587 ns | 40.0 ns |
| `bm<uint16_t, AlgType::Rng>/3000` | 1506 ns | 38.8 ns |
| `bm<uint32_t, AlgType::Std>/3000` | 582 ns | 67.8 ns |
| `bm<uint32_t, AlgType::Rng>/3000` | 1500 ns | 68.5 ns |
| `bm<uint64_t, AlgType::Std>/3000` | 571 ns | 146 ns |
| `bm<uint64_t, AlgType::Rng>/3000` | 1466 ns | 147 ns |