Optimize std::transform for vector<bool> by AlexGuteniev · Pull Request #5769 · microsoft/STL

Towards #625, specifically #625 (comment) items 1 and 2.

馃 Optimization

When a standard functor, either transparent or integer-specialized, is passed to transform and all of the iterators are vector<bool> iterators, map that functor to its bitwise counterpart so that it operates on the underlying type.
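To illustrate why such a mapping is valid, here is a minimal sketch, not the PR's code. It assumes that 32 bool elements are packed into one 32-bit word with element i in bit i; the alias `word` and the function names are hypothetical. Applying `std::logical_and<>` to every bit separately gives the same word as a single `std::bit_and<>` on the whole word.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// One 32-bit word stands in for 32 packed bool elements (assumed layout).
using word = std::uint32_t;

// Reference form: apply std::logical_and<> to each of the 32 bits separately.
inline word and_bit_by_bit(word a, word b) {
    word result = 0;
    for (int i = 0; i < 32; ++i) {
        const bool x = ((a >> i) & 1u) != 0;
        const bool y = ((b >> i) & 1u) != 0;
        if (std::logical_and<>{}(x, y)) {
            result |= word{1} << i;
        }
    }
    return result;
}

// Mapped form: std::bit_and<> handles all 32 elements in one operation.
inline word and_whole_word(word a, word b) {
    return std::bit_and<>{}(a, b);
}

// Debug check that both forms agree for a given pair of words.
inline void check_equivalence(word a, word b) {
    assert(and_bit_by_bit(a, b) == and_whole_word(a, b));
}
```

The same reasoning pairs `logical_or` with `bit_or` and `logical_not` with `bit_not`, with the caveat that a negated partial word also flips bits past the end of the sequence, so the last word needs masking.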

The mapping is done via template specialization rather than if constexpr, so that the dispatch works even when <functional> is not included and the functors are not defined.
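One way such a specialization-based dispatch could be laid out is sketched below. Everything in namespace `sketch` is hypothetical (`bitwise_map`, `try_apply_word`, and the stand-in `logical_and` / `logical_or` templates); these are not the names used in the STL. The point is that the algorithm side only sees the primary template, while the mappings are added separately and need nothing more than a declaration of each functor.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {
    // Part 1: what an algorithm header could hold. No functor is named here.

    // Primary template: an unknown functor has no bitwise counterpart.
    template <class Fn>
    struct bitwise_map {
        static bool apply(std::uint32_t, std::uint32_t, std::uint32_t&) {
            return false; // caller falls back to the element-by-element loop
        }
    };

    // Tries the word-wise path; returns false when Fn has no mapping.
    template <class Fn>
    bool try_apply_word(std::uint32_t a, std::uint32_t b, std::uint32_t& out) {
        return bitwise_map<Fn>::apply(a, b, out);
    }

    // Part 2: what a functor header could add later, next to the functors.

    // Declarations only: naming the type is enough to specialize the trait.
    template <class T = void>
    struct logical_and;
    template <class T = void>
    struct logical_or;

    template <>
    struct bitwise_map<logical_and<>> {
        static bool apply(std::uint32_t a, std::uint32_t b, std::uint32_t& out) {
            out = a & b;
            return true;
        }
    };

    template <>
    struct bitwise_map<logical_or<>> {
        static bool apply(std::uint32_t a, std::uint32_t b, std::uint32_t& out) {
            out = a | b;
            return true;
        }
    };

    // Small usage check: a mapped functor takes the word path, `int` does not.
    inline void self_check() {
        std::uint32_t out = 0;
        assert(try_apply_word<logical_and<>>(0xCu, 0xAu, out) && out == 0x8u);
        assert(!try_apply_word<int>(1u, 2u, out));
    }
}
```

With this layout, the first part compiles on its own, and a translation unit that never sees the second part simply gets the fallback for every functor.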

Only do this for zero offset. Supporting every possible combination of offsets would add a lot of complexity for little gain. Remember copy.

Extract pointers from the iterators to help the compiler auto-vectorize. Yes, it does not auto-vectorize when the loop uses the whole iterators. Auto-vectorization needs loops written in the simplest possible way.
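The loop shape this aims for can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `std::vector<std::uint32_t>` stands in for the packed storage of vector<bool>, and `and_words` / `and_packed` are hypothetical names. The containers are unwrapped to raw pointers once, and the loop body is a plain indexed operation on words. Whether a given compiler actually vectorizes it depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain counted loop over raw pointers: nothing but loads, one AND, and a store.
inline void and_words(const std::uint32_t* a, const std::uint32_t* b,
                      std::uint32_t* out, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = a[i] & b[i];
    }
}

// The caller unwraps the containers to pointers once, outside the loop.
inline void and_packed(const std::vector<std::uint32_t>& a,
                       const std::vector<std::uint32_t>& b,
                       std::vector<std::uint32_t>& out) {
    assert(a.size() == b.size());
    out.resize(a.size());
    and_words(a.data(), b.data(), out.data(), a.size());
}
```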

Don't call transform again: the operation is simple, so there is no need for the extra recursion.

Don't process tails explicitly; yield to the existing loop for now.
Actually, let's go for it, it is not that hard. Process tails by applying a bit mask.
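Tail handling with a bit mask could look like the sketch below. It assumes zero offset, 32-bit words, and element i stored in bit i of its word; `or_bits` is a hypothetical name and `std::vector<std::uint32_t>` again stands in for the packed storage. Full words are written directly, and the final partial word is merged through a mask so that destination bits past the end are left unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ORs the first bit_count bits of a and b into out.
// Bits of out beyond bit_count keep their previous values.
inline void or_bits(const std::vector<std::uint32_t>& a,
                    const std::vector<std::uint32_t>& b,
                    std::vector<std::uint32_t>& out, std::size_t bit_count) {
    const std::size_t words_needed = (bit_count + 31) / 32;
    assert(a.size() >= words_needed && b.size() >= words_needed
           && out.size() >= words_needed);

    // Whole words: no masking needed.
    const std::size_t full_words = bit_count / 32;
    for (std::size_t i = 0; i < full_words; ++i) {
        out[i] = a[i] | b[i];
    }

    // Tail: only the low tail_bits bits of the last word belong to the range.
    const std::size_t tail_bits = bit_count % 32;
    if (tail_bits != 0) {
        const std::uint32_t mask = (std::uint32_t{1} << tail_bits) - 1;
        const std::uint32_t computed = a[full_words] | b[full_words];
        out[full_words] = (out[full_words] & ~mask) | (computed & mask);
    }
}
```

For a 38-bit range, for example, the first word is written whole and only the low 6 bits of the second word are replaced.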

Don't do ranges yet. Other vector<bool> optimizations don't do them either. It is getting complicated, so instead of doing ranges separately, #1754 needs to be looked into at last.

🏁 Benchmark

Feed the randomizer with some seed to make the inputs different 🐦

Since (auto-)vectorization is (expected to be) engaged, use an alignment-controlling allocator.
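For readers unfamiliar with the idea, a minimal alignment-controlling allocator might look like the sketch below. It is not the allocator used by the benchmark; `aligned_allocator` is a hypothetical name, and the sketch assumes C++17 for the aligned forms of `operator new` and `operator delete`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator whose allocations start on an Alignment-byte boundary.
template <class T, std::size_t Alignment>
struct aligned_allocator {
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    static_assert(Alignment >= alignof(T), "Alignment must not be weaker than alignof(T)");

    using value_type = T;

    // Needed explicitly because of the non-type template parameter.
    template <class U>
    struct rebind {
        using other = aligned_allocator<U, Alignment>;
    };

    aligned_allocator() = default;
    template <class U>
    aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        assert(n <= static_cast<std::size_t>(-1) / sizeof(T)); // no size overflow
        return static_cast<T*>(
            ::operator new(n * sizeof(T), static_cast<std::align_val_t>(Alignment)));
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, static_cast<std::align_val_t>(Alignment));
    }
};

// Stateless: any two instances are interchangeable.
template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept {
    return true;
}
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept {
    return false;
}
```

Fixing the alignment of the buffers keeps the measurements from depending on wherever the default allocator happens to place them, which matters once the loop under test is vectorized.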

⏱️ Benchmark results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `transform_two_inputs_aligned<logical_and<>>/64` | 108 ns | 2.55 ns | 42.4 |
| `transform_two_inputs_aligned<logical_and<>>/4096` | 13869 ns | 9.44 ns | 1470 |
| `transform_two_inputs_aligned<logical_and<>>/65536` | 416424 ns | 115 ns | 3620 |
| `transform_two_inputs_aligned<logical_or<>>/64` | 123 ns | 2.59 ns | 47.40 |
| `transform_two_inputs_aligned<logical_or<>>/4096` | 14377 ns | 9.07 ns | 1590 |
| `transform_two_inputs_aligned<logical_or<>>/65536` | 409012 ns | 112 ns | 3650 |
| `transform_one_input_aligned<logical_not<>>/64` | 83.7 ns | 2.14 ns | 39.10 |
| `transform_one_input_aligned<logical_not<>>/4096` | 6891 ns | 7.28 ns | 947 |
| `transform_one_input_aligned<logical_not<>>/65536` | 264957 ns | 82.7 ns | 3200 |