Vectorize remove_copy for 4 and 8 byte elements by AlexGuteniev · Pull Request #5062 · microsoft/STL

Follow up on #4987

For now, only 4- and 8-byte elements, and only AVX2, so that AVX2 masked stores can be used.

This may be doable for 1- and 2-byte elements, but storing a partial vector would require a different approach. It may also turn out to be not worthwhile, if every such approach is slower than scalar. Either way, I'm not attempting it here, to avoid making this PR too big.

AVX2 masked stores are slower than the usual stores, so this approach is not used uniformly.

⏱️ Benchmark results

| Benchmark | main | this |
|---|---|---|
| `r<alg_type::std_fn, std::uint8_t>` | 301 ns | 270 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 276 ns | 275 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 336 ns | 333 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 778 ns | 761 ns |
| `r<alg_type::rng, std::uint8_t>` | 278 ns | 284 ns |
| `r<alg_type::rng, std::uint16_t>` | 301 ns | 281 ns |
| `r<alg_type::rng, std::uint32_t>` | 338 ns | 331 ns |
| `r<alg_type::rng, std::uint64_t>` | 779 ns | 768 ns |
| `rc<alg_type::std_fn, std::uint32_t>` | 1445 ns | 475 ns |
| `rc<alg_type::std_fn, std::uint64_t>` | 2187 ns | 1101 ns |
| `rc<alg_type::rng, std::uint32_t>` | 897 ns | 472 ns |
| `rc<alg_type::rng, std::uint64_t>` | 1918 ns | 1110 ns |

As expected, vectorized `remove_copy` is better than non-vectorized.

Also as expected, vectorized `remove_copy` does not reach the performance of vectorized `remove`.

As usual, there are some minor variations in the unchanged vectorized `remove`.

⚠️ AMD benchmark wanted ⚠️

I'm worried about the `vmaskmov*` timings.
They seem to be bad enough on AMD to turn this change into a pessimization there.