Use `find` for `search_n` when n=1 by AlexGuteniev · Pull Request #5346 · microsoft/STL
📜 The optimization
There are two implementations of `search_n`: one in `std` and one in `std::ranges`. For random-access iterators, both implementations take advantage of the fact that the sequence being searched for is a run of n consecutive equal elements. They jump forward by n and try to match from the end of the candidate window. This allows skipping some comparisons, because a single mismatch rules out every window that contains it. When there are more mismatches than matches, the search becomes a fast pass over the range with few comparisons.
This means that for large values of n and non-pathological input, the algorithm is not even likely to benefit from vectorization.
For small values of n, however, the algorithm performs worse.
The worst case is n=1, where the algorithm is just `find` with extra steps. The PR forwards this case directly to `find`, which may pick vectorization or `memchr`; even if it uses neither, it at least avoids the extra steps.
⚖️ Predicate check
Unlike many other algorithms, such as `find`, the `search_n` algorithm takes both a value and a predicate. We want to forward to the predicate-less `find`, since the goal is to engage vectorization, so we can do this only when we see the default `equal_to` predicate. Binding the value and the predicate into a bigger predicate and passing that to `find_if` would work for more cases, but would not be (manually) vectorized.
Since the value type and the iterator's value type are unrelated, the comparison is potentially heterogeneous, so it is hard to verify whether a non-`void` specialization `std::equal_to<T>` does the same as the default comparison. We skip that case and check only for `std::equal_to<void>` and `ranges::equal_to`.
✅ Test coverage
There is no comprehensive test coverage of `std::search_n` 🙀, just some ad-hoc tests, mostly negative. Creating such coverage seems out of scope for this PR. The n=1 case seems to be covered indirectly by the `P0024R2_parallel_algorithms_search_n` test, along with many other cases.
For `ranges::search_n` there is a pre-existing test that provides at least minimal coverage; this PR expands it with an n=1 case.
⏱️ Benchmark results
Benchmark | Before | After |
---|---|---|
bm<uint8_t, AlgType::Std>/3000 | 525 ns | 17.5 ns |
bm<uint8_t, AlgType::Rng>/3000 | 995 ns | 17.5 ns |
bm<uint16_t, AlgType::Std>/3000 | 587 ns | 40.0 ns |
bm<uint16_t, AlgType::Rng>/3000 | 1506 ns | 38.8 ns |
bm<uint32_t, AlgType::Std>/3000 | 582 ns | 67.8 ns |
bm<uint32_t, AlgType::Rng>/3000 | 1500 ns | 68.5 ns |
bm<uint64_t, AlgType::Std>/3000 | 571 ns | 146 ns |
bm<uint64_t, AlgType::Rng>/3000 | 1466 ns | 147 ns |