Optimize std::transform for vector<bool> by AlexGuteniev · Pull Request #5769 · microsoft/STL
Towards #625, specifically #625 (comment) items 1 and 2.
Optimization
When a standard functor, either transparent or integer-specialized, is passed to transform and all of the iterators are vector<bool> iterators, map that functor to its bitwise counterpart and apply it to the underlying storage type.
The mapping is done via template specialization rather than if constexpr, so that the dispatch works even when <functional> is not included and the functors are not defined.
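A minimal sketch of such a specialization-based mapping is below. The trait name `bitwise_equivalent` and its members are hypothetical and are not the names used in the PR; only the three functors that appear in the benchmarks are mapped. The sketch includes `<functional>` because user code cannot declare standard library templates itself, whereas a library header presumably only needs the functor declarations to be visible.

```cpp
#include <functional>
#include <type_traits>

// Hypothetical trait (not the STL's internal name). The primary template means
// "no known bitwise equivalent", so unknown functors keep the ordinary path.
template <class Fn>
struct bitwise_equivalent {
    static constexpr bool exists = false;
};

// Each supported functor gets its own explicit specialization.
template <>
struct bitwise_equivalent<std::logical_and<>> {
    static constexpr bool exists = true;
    using type = std::bit_and<>;
};

template <>
struct bitwise_equivalent<std::logical_or<>> {
    static constexpr bool exists = true;
    using type = std::bit_or<>;
};

template <>
struct bitwise_equivalent<std::logical_not<>> {
    static constexpr bool exists = true;
    using type = std::bit_not<>;
};

static_assert(std::is_same_v<bitwise_equivalent<std::logical_and<>>::type, std::bit_and<>>);
static_assert(!bitwise_equivalent<std::plus<>>::exists);
```

With this shape, supporting another functor means adding one more specialization, and the dispatch never has to name a functor type inside a function body.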
Only do this for zero offset. Supporting every possible combination of offsets would add a lot of complexity for little gain. Remember copy.
Extract pointers from the iterators to help the compiler auto-vectorize. Yes, it does not auto-vectorize when the loop uses the iterators themselves. Auto-vectorization needs loops written in the simplest possible way.
Don't call transform again, to avoid unnecessary recursion; the operation is simple.
Don't process tails explicitly, yield to the existing loop for now.
Actually, let's go for it; it is not that hard. Process tails by applying a bit mask.
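The loop shape described above might look roughly like the following sketch. It works on plain arrays of 32-bit words rather than on real vector<bool> internals, which are implementation-specific; `word_t`, `bits_per_word` and `transform_words` are hypothetical names, and the sketch assumes all three ranges start at bit offset zero of their first word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

using word_t = std::uint32_t;                 // assumed storage word type
constexpr std::size_t bits_per_word = 32;

// Applies a bitwise functor to two packed bit ranges of bit_count bits each.
template <class BitFn>
void transform_words(const word_t* in1, const word_t* in2, word_t* out,
                     std::size_t bit_count, BitFn fn) {
    assert(bit_count == 0 || (in1 != nullptr && in2 != nullptr && out != nullptr));

    // Whole words: a plain pointer-indexed loop, which is the simple form
    // a compiler is most likely to auto-vectorize.
    const std::size_t full_words = bit_count / bits_per_word;
    for (std::size_t i = 0; i != full_words; ++i) {
        out[i] = static_cast<word_t>(fn(in1[i], in2[i]));
    }

    // Tail: only the low tail_bits bits of the last word belong to the range,
    // so merge the result under a mask and keep the remaining output bits.
    const std::size_t tail_bits = bit_count % bits_per_word;
    if (tail_bits != 0) {
        const word_t mask   = static_cast<word_t>((word_t{1} << tail_bits) - 1);
        const word_t result = static_cast<word_t>(fn(in1[full_words], in2[full_words]));
        out[full_words] = static_cast<word_t>((out[full_words] & ~mask) | (result & mask));
    }
}
```

Calling `transform_words(a, b, out, 36, std::bit_and<>{})` processes one full word and then four tail bits; the upper 28 bits of the second output word keep whatever value they had before the call.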
Don't do ranges yet. Other vector<bool> optimizations don't do them either. It is getting complicated, so instead of doing ranges separately, we need to look into #1754 at last.
🏁 Benchmark
Feed the randomizer with some seed to make the inputs different 🐦
Since (auto-)vectorization is (expected to be) engaged, use an alignment-controlling allocator.
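For illustration, the setup described above could be sketched as follows. This is not the benchmark code from the PR: `aligned_allocator`, `aligned_bits` and `make_random_bits` are hypothetical names, the 64-byte alignment is an arbitrary assumption, and C++17 aligned `operator new` is used to control where the storage starts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <random>
#include <vector>

// Hypothetical allocator: every allocation starts on an Alignment-byte boundary.
template <class T, std::size_t Alignment>
struct aligned_allocator {
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");
    using value_type = T;

    // Must be spelled out because Alignment is a non-type template parameter,
    // so allocator_traits cannot rebind this allocator automatically.
    template <class U>
    struct rebind {
        using other = aligned_allocator<U, Alignment>;
    };

    aligned_allocator() = default;
    template <class U>
    aligned_allocator(const aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = ::operator new(n * sizeof(T), std::align_val_t{Alignment});
        assert(reinterpret_cast<std::uintptr_t>(p) % Alignment == 0);
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept {
    return true;
}
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) noexcept {
    return false;
}

using aligned_bits = std::vector<bool, aligned_allocator<bool, 64>>;

// Fills a vector<bool> from a seeded generator, so runs are reproducible and
// different seeds can be used for the two inputs.
aligned_bits make_random_bits(std::size_t size, std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::bernoulli_distribution dist(0.5);
    aligned_bits bits(size);
    for (std::size_t i = 0; i != size; ++i) {
        bits[i] = dist(gen);
    }
    return bits;
}
```

Passing a distinct seed for each input, for example `make_random_bits(4096, 1)` and `make_random_bits(4096, 2)`, keeps the two operands from being identical while the results stay repeatable between runs.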
⏱️ Benchmark results
| Benchmark | Before | After | Speedup |
|---|---|---|---|
| transform_two_inputs_aligned<logical_and<>>/64 | 108 ns | 2.55 ns | 42.4 |
| transform_two_inputs_aligned<logical_and<>>/4096 | 13869 ns | 9.44 ns | 1470 |
| transform_two_inputs_aligned<logical_and<>>/65536 | 416424 ns | 115 ns | 3620 |
| transform_two_inputs_aligned<logical_or<>>/64 | 123 ns | 2.59 ns | 47.4 |
| transform_two_inputs_aligned<logical_or<>>/4096 | 14377 ns | 9.07 ns | 1590 |
| transform_two_inputs_aligned<logical_or<>>/65536 | 409012 ns | 112 ns | 3650 |
| transform_one_input_aligned<logical_not<>>/64 | 83.7 ns | 2.14 ns | 39.1 |
| transform_one_input_aligned<logical_not<>>/4096 | 6891 ns | 7.28 ns | 947 |
| transform_one_input_aligned<logical_not<>>/65536 | 264957 ns | 82.7 ns | 3200 |