Vectorize remove_copy for 4 and 8 byte elements by AlexGuteniev · Pull Request #5062 · microsoft/STL
Follow up on #4987
For now, only 4 and 8 byte elements and only AVX2, so that AVX2 masked stores can be used.
This may be doable for 1 and 2 byte elements, but it would require a different approach for storing the partial vector. Or it may not be doable for 1 and 2 byte elements at all, if every such approach turns out to be slower than scalar. In any case, not attempting that right now, to avoid making this PR too big.
AVX2 masked stores are slower than the usual stores, so this approach is not used uniformly. A sketch of the technique follows below.
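For reference, here is a minimal sketch of the compaction idea for 32-bit elements (not the actual STL code; the function name and the scalar index computation are illustrative, and the real implementation uses precomputed permutation tables): compare against the value, pack the kept lanes to the front with `vpermd`, and write only the leading lanes with an AVX2 masked store, since `remove_copy` must not write destination elements past the returned end.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Illustrative sketch only (not the STL implementation): process one 8-lane
// block of uint32_t for remove_copy, returning how many elements were kept.
size_t remove_copy_block_avx2(const uint32_t* src, uint32_t* dst, uint32_t val) {
    const __m256i data  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
    const __m256i match = _mm256_cmpeq_epi32(data, _mm256_set1_epi32(static_cast<int>(val)));
    // One bit per 32-bit lane; set bits mark lanes equal to val (to be removed).
    const unsigned remove_bits = static_cast<unsigned>(_mm256_movemask_ps(_mm256_castsi256_ps(match)));
    const unsigned keep_bits   = ~remove_bits & 0xFFu;
    const int keep_count       = _mm_popcnt_u32(keep_bits);

    // Build a permutation that packs the kept lanes to the front
    // (a lookup table indexed by keep_bits would normally replace this loop).
    alignas(32) uint32_t perm_idx[8] = {};
    int out = 0;
    for (uint32_t lane = 0; lane < 8; ++lane) {
        if (keep_bits & (1u << lane)) {
            perm_idx[out++] = lane;
        }
    }
    const __m256i perm   = _mm256_load_si256(reinterpret_cast<const __m256i*>(perm_idx));
    const __m256i packed = _mm256_permutevar8x32_epi32(data, perm); // vpermd

    // Enable only the first keep_count lanes; remove_copy must not write past them.
    const __m256i lane_ids   = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i store_mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(keep_count), lane_ids);
    _mm256_maskstore_epi32(reinterpret_cast<int*>(dst), store_mask, packed); // vpmaskmovd

    return static_cast<size_t>(keep_count);
}
```

This also illustrates the 4/8 byte limitation: AVX2 masked stores exist only at dword/qword granularity, so storing a partial vector of 1 or 2 byte elements would need some other mechanism (for example, a scalar tail or AVX-512 byte/word masking).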
⏱️ Benchmark results
| Benchmark | main | this |
| --- | --- | --- |
| `r<alg_type::std_fn, std::uint8_t>` | 301 ns | 270 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 276 ns | 275 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 336 ns | 333 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 778 ns | 761 ns |
| `r<alg_type::rng, std::uint8_t>` | 278 ns | 284 ns |
| `r<alg_type::rng, std::uint16_t>` | 301 ns | 281 ns |
| `r<alg_type::rng, std::uint32_t>` | 338 ns | 331 ns |
| `r<alg_type::rng, std::uint64_t>` | 779 ns | 768 ns |
| `rc<alg_type::std_fn, std::uint32_t>` | 1445 ns | 475 ns |
| `rc<alg_type::std_fn, std::uint64_t>` | 2187 ns | 1101 ns |
| `rc<alg_type::rng, std::uint32_t>` | 897 ns | 472 ns |
| `rc<alg_type::rng, std::uint64_t>` | 1918 ns | 1110 ns |
Expectedly, vectorized `remove_copy` is better than non-vectorized.
Expectedly, vectorized `remove_copy` does not reach the vectorized `remove` performance.
As usual, there are some minor variations in the unchanged vectorized `remove`.
⚠️ AMD benchmark wanted ⚠️
I'm worried about the `vmaskmov*` timings.
They seem to be bad enough on AMD to turn this into a pessimization.