Help the compiler vectorize adjacent_difference by AlexGuteniev · Pull Request #4958 · microsoft/STL

📜 The approach

The following things prevented the original algorithm from being vectorized:

🛑 Correctness concern

The standard defines exact steps for this algorithm. The optimization alters the steps.
In particular, the standard wants the subtracted value to be saved from the previous iteration, rather than read again from the source.
The two sections below explain the precautions taken to make the change unobservable, so I hope the change is correct.

✅ Checks for eligibility

The following checks were added:

There's no need to check for integral types or the like, since the compiler makes the final decision anyway, and it may be able to optimize even something that wouldn't pass a strict check.

⚠️ No Aliasing

Apparently there's no rule that the source and the destination ranges may not overlap.
We should handle aliasing.

Unlike the #4431 precedent, we can't yield to the compiler here. The compiler is able to insert an overlap check that guards the vectorized path and falls back to the scalar loop when the check fails, but:

So we do our own checks.

Then we tell the compiler with __restrict that we have already checked, so it should not bother. This is done in a separate function, because __restrict promises that the pointer is not aliased within its scope, so saying __restrict within the original algorithm would apparently be a lie.

The extra check by the compiler, if not prevented, would slightly increase run time and add dead code.

😾 Compiler warnings

We have a great feature called integral promotion. Types smaller than int are promoted to int in arithmetic, and there is a warning about converting the result back. A local pragma suppresses these warnings in the benchmark, but not in the test.

@StephanTLavavej used a function object with static_cast to avoid warnings in the test.
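A function object of this kind might look like the sketch below. The name `narrowing_minus` and the wrapper `diff_u8` are hypothetical and the actual test code may differ. The point is that the explicit `static_cast` makes the conversion from the promoted `int` back to the element type intentional, so the compiler has nothing to warn about.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical function object: subtracts, then casts the promoted result
// back to T explicitly instead of relying on an implicit narrowing conversion.
template <class T>
struct narrowing_minus {
    constexpr T operator()(const T& left, const T& right) const noexcept {
        return static_cast<T>(left - right);
    }
};

// Hypothetical wrapper showing the function object passed as the binary operation.
std::vector<std::uint8_t> diff_u8(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out(in.size());
    std::adjacent_difference(in.begin(), in.end(), out.begin(),
                             narrowing_minus<std::uint8_t>{});
    return out;
}
```

For the input `{10, 13, 11}` this gives `{10, 3, 254}`: the last difference is `-2` as an `int`, which the cast wraps to `254` in `std::uint8_t`.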

⏱️ Benchmark results

| Benchmark | main | this | this + AVX2 |
|---|---|---|---|
| bm<uint8_t>/2255 | 745 ns | 563 ns | 562 ns |
| bm<uint16_t>/2255 | 799 ns | 83.3 ns | 75.1 ns |
| bm<uint32_t>/2255 | 731 ns | 154 ns | 141 ns |
| bm<uint64_t>/2255 | 805 ns | 293 ns | 272 ns |
| bm/2255 | 751 ns | 154 ns | 123 ns |
| bm/2255 | 753 ns | 304 ns | 233 ns |

🥇 Results interpretation