Vectorize remove_copy for 4 and 8 byte elements by AlexGuteniev · Pull Request #5062 · microsoft/STL

Follow up on #4987

For now, only 4- and 8-byte elements, and only AVX2, so that AVX2 masked stores can be used.

This may be doable for 1- and 2-byte elements, but storing a partial vector would require a different approach. It may also turn out to be not worthwhile, if every such approach is slower than scalar. Either way, I'm not attempting it here, to avoid making this PR too big.

AVX2 masked stores are slower than the usual stores, so this approach is not used uniformly.

⏱️ Benchmark results

| Benchmark | main | this |
|---|---|---|
| `r<alg_type::std_fn, std::uint8_t>` | 301 ns | 270 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 276 ns | 275 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 336 ns | 333 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 778 ns | 761 ns |
| `r<alg_type::rng, std::uint8_t>` | 278 ns | 284 ns |
| `r<alg_type::rng, std::uint16_t>` | 301 ns | 281 ns |
| `r<alg_type::rng, std::uint32_t>` | 338 ns | 331 ns |
| `r<alg_type::rng, std::uint64_t>` | 779 ns | 768 ns |
| `rc<alg_type::std_fn, std::uint32_t>` | 1445 ns | 475 ns |
| `rc<alg_type::std_fn, std::uint64_t>` | 2187 ns | 1101 ns |
| `rc<alg_type::rng, std::uint32_t>` | 897 ns | 472 ns |
| `rc<alg_type::rng, std::uint64_t>` | 1918 ns | 1110 ns |

As expected, vectorized `remove_copy` is better than non-vectorized.

Also as expected, vectorized `remove_copy` does not reach the performance of vectorized `remove`.

As usual, there are some minor variations in the unchanged vectorized `remove`.

⚠️ AMD benchmark wanted ⚠️

I'm worried about the `vmaskmov*` timings.
They seem to be bad enough on AMD to turn this change into a pessimization there.