Vectorize `rotate` even better by AlexGuteniev · Pull Request #5525 · microsoft/STL
Follow-up for #5502.
Reasons to consider a follow-up:
- Some cases have no improvement
- After thinking about it, I discovered that there are even cases with slight degradation
Weakness of the original approach
It deals well with extremes:
- Small rotation
- Rotation close to the middle
The worst case is when the rotation is small, but still large enough to not engage the small rotation branch.
Mitigation approaches
Generally, we need a multi-range rotating swap to make fewer element assignments. From the original PR:
Hypothetical functions like `swap_3_ranges`, `swap_4_ranges`, etc. could reduce the number of assignments for more cases. But going further in optimization will result in less and less improvement for more and more code added, and at some point will cause the complex decisions to take a noticeable amount of time, resulting in negative improvement, so we need to stop somewhere. Probably stopping at just the small rotation and two-range swap strategy would be a good idea.
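To illustrate why a rotating swap of three ranges saves assignments, here is a minimal scalar sketch. The name `swap_3_ranges_sketch` and its signature are hypothetical and are not the vectorized function from this PR; the sketch assumes three non-overlapping ranges of equal length.

```cpp
#include <cstddef>

// Hypothetical scalar sketch, for illustration only.
// Rotates the contents of three equally sized, non-overlapping ranges:
// a receives b, b receives c, c receives the old contents of a.
template <class T>
void swap_3_ranges_sketch(T* a, T* b, T* c, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i) {
        T tmp = a[i]; // assignment 1
        a[i]  = b[i]; // assignment 2
        b[i]  = c[i]; // assignment 3
        c[i]  = tmp;  // assignment 4
    }
}
```

Getting the same result by swapping `a` with `b` and then `b` with `c` takes two ordinary swaps per index, which is 6 assignments instead of 4.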
So how can we get some improvement while avoiding unnecessary complication:
- Implement a few of `swap_3_ranges`, `swap_4_ranges`, etc., but no more than two of them, as separate functions.
  - ❌ Will not squeeze away unnecessary assignments too hard
  - ✅ This should be the easiest thing to do
- Spawn many `swap_N_ranges` using a single source and metaprogramming, and pick the best one at runtime
  - ✅ Will squeeze away unnecessary assignments harder
  - ❌ Complex metaprogramming to do that using template fold expressions or macros
  - ❌ In case of a template implementation, will heavily rely on compiler optimization
  - 🤮 Macro implementation is just not a good thing
  - ❌ Will add a lot of machine code, which will make the binary bigger, and there will be a lot of "cold" code at runtime
- Implement a single `swap_N_ranges` that would work with a number of ranges that is variable at runtime
  - ✅ Will squeeze away unnecessary assignments to the maximum
  - ❓ Swapping too many ranges at the same time is likely to break prefetch, need to see the impact of that
  - ❓ Will have additional runtime cost for iterating over the range pointers
  - ⚠️ A large power-of-two stride will cause cache conflict eviction if N exceeds the CPU cache associativity, which would result in dramatic performance degradation
  - ❌ Will have complex flow at runtime
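The power-of-two stride concern can be illustrated with a simplified model of a set-associative cache. The geometry below (32 KiB, 8-way, 64-byte lines) and all names are assumptions chosen for illustration, not a description of any particular CPU; real caches may index addresses differently.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache geometry, for illustration only.
constexpr std::size_t line_size  = 64;
constexpr std::size_t cache_ways = 8;
constexpr std::size_t cache_size = 32 * 1024;
constexpr std::size_t cache_sets = cache_size / (line_size * cache_ways); // 64 sets

// Simplified model: the set is selected by the line number modulo the set count.
constexpr std::size_t set_index(std::uintptr_t addr) {
    return (addr / line_size) % cache_sets;
}

// Counts how many of n range starts, spaced `stride` bytes apart,
// land in the same cache set as the first range.
constexpr std::size_t ranges_in_first_set(std::size_t n, std::size_t stride) {
    std::size_t count = 0;
    for (std::size_t k = 0; k != n; ++k) {
        if (set_index(k * stride) == set_index(0)) {
            ++count;
        }
    }
    return count;
}
```

In this model, 12 ranges spaced 4096 bytes apart all map to one set, which exceeds the 8 ways and forces evictions, while a stride of 4096 + 64 bytes spreads them over different sets.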
This makes me think that it would be good to:
- First try the simple approach of having one or two additional swap functions
- If there's a strong indication of success in this direction, try the runtime-variable `swap_N_ranges`
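A runtime-variable version could look roughly like the following scalar sketch. The name `swap_n_ranges_sketch` and the choice of a `std::vector` of pointers are assumptions made for illustration; nothing like this is implemented in this PR. It assumes equally sized, non-overlapping ranges.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar sketch of a rotating swap over a runtime number of ranges.
// ranges[k] receives the contents of ranges[k + 1];
// the last range receives the old contents of ranges[0].
template <class T>
void swap_n_ranges_sketch(const std::vector<T*>& ranges, std::size_t n) {
    if (ranges.size() < 2) {
        return; // nothing to rotate
    }
    for (std::size_t i = 0; i != n; ++i) {
        T tmp = ranges[0][i];
        // This inner loop over the pointers is the extra runtime cost noted above.
        for (std::size_t k = 0; k + 1 != ranges.size(); ++k) {
            ranges[k][i] = ranges[k + 1][i];
        }
        ranges.back()[i] = tmp;
    }
}
```

For N ranges this makes N + 1 assignments per index, compared with 3(N - 1) when the same result is produced by chaining pairwise swaps.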
The code changes
So I've tried `_Swap_3_ranges`. It resulted in a speedup of at most 1.40, and that fixed the slightly regressed cases.
I think this indicates both that the approach is good enough to use and that the gain is not large enough to justify trying something more complex.
I've moved `_Rotating` closer to `__std_swap_ranges_trivially_swappable_noalias` to make the similarity between that and `_Swap_3_ranges` more obvious.
Coverage
The tests lacked arrays long enough to properly exercise the range swapping. I've expanded the test to have more elements; to save some run time, I did this for only one of the 8-bit element types. The algorithm does not distinguish element sizes internally anyway.
The same goes for the benchmark: I've added just two examples of the case that became worse.
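The kind of coverage check described above can be sketched as a comparison of `std::rotate` against a simple reference on a long 8-bit array. This is not the repository's actual test; the helper names `reference_rotate` and `rotate_matches_reference` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Reference result built by plain copying, with no rotate call involved.
std::vector<std::uint8_t> reference_rotate(const std::vector<std::uint8_t>& v, std::size_t mid) {
    const auto m = v.begin() + static_cast<std::ptrdiff_t>(mid);
    std::vector<std::uint8_t> out;
    out.reserve(v.size());
    out.insert(out.end(), m, v.end());   // elements from mid onward come first
    out.insert(out.end(), v.begin(), m); // then the elements before mid
    return out;
}

// Returns true if std::rotate agrees with the reference for this size and midpoint.
bool rotate_matches_reference(std::size_t size, std::size_t mid) {
    std::vector<std::uint8_t> v(size);
    std::iota(v.begin(), v.end(), std::uint8_t{0}); // values wrap around modulo 256
    const auto expected = reference_rotate(v, mid);
    std::rotate(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(mid), v.end());
    return v == expected;
}
```

Calling `rotate_matches_reference(35000, 520)` or `rotate_matches_reference(35000, 3000)` uses the same sizes as the two benchmark cases added at the end of the table.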
Benchmark results
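The Before #5502 / After #5502 numbers may vary slightly from the previous PR description, as I've run the benchmarks again. The ⬆️ columns are speedup ratios (old time divided by new time), so values above 1.00 are improvements.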
Benchmark | Before #5502 | After #5502 | After this | #5502 ⬆️ | This ⬆️ | Total ⬆️ |
---|---|---|---|---|---|---|
u8//Std/3333/2242 | 93.8 ns | 67.0 ns | 50.6 ns | 1.40 | 1.32 | 1.85 |
u8//Std/3332/1666 | 94.6 ns | 40.0 ns | 40.5 ns | 2.37 | 0.99 | 2.34 |
u8//Std/3333/1111 | 91.4 ns | 60.4 ns | 44.0 ns | 1.51 | 1.37 | 2.08 |
u8//Std/3333/501 | 89.9 ns | 32.1 ns | 32.1 ns | 2.80 | 1.00 | 2.80 |
u8//Std/3333/3300 | 91.3 ns | 32.3 ns | 32.1 ns | 2.83 | 1.01 | 2.84 |
u8//Std/3333/12 | 87.8 ns | 25.9 ns | 25.8 ns | 3.39 | 1.00 | 3.40 |
u8//Std/3333/5 | 90.8 ns | 29.0 ns | 28.8 ns | 3.13 | 1.01 | 3.15 |
u8//Std/3333/1 | 82.2 ns | 28.8 ns | 29.8 ns | 2.85 | 0.97 | 2.76 |
u8//Std/333/101 | 19.0 ns | 12.1 ns | 10.1 ns | 1.57 | 1.20 | 1.88 |
u8//Std/123/32 | 22.7 ns | 6.57 ns | 6.36 ns | 3.46 | 1.03 | 3.57 |
u8//Std/23/7 | 18.3 ns | 5.24 ns | 5.51 ns | 3.49 | 0.95 | 3.32 |
u8//Std/12/5 | 12.9 ns | 5.26 ns | 5.03 ns | 2.45 | 1.05 | 2.56 |
u8//Std/3/2 | 3.42 ns | 4.77 ns | 4.71 ns | 0.72 | 1.01 | 0.73 |
u8//Rng/3333/2242 | 94.3 ns | 67.4 ns | 52.8 ns | 1.40 | 1.28 | 1.79 |
u8//Rng/3332/1666 | 95.9 ns | 39.9 ns | 41.7 ns | 2.40 | 0.96 | 2.30 |
u8//Rng/3333/1111 | 93.2 ns | 58.4 ns | 45.9 ns | 1.60 | 1.27 | 2.03 |
u8//Rng/3333/501 | 89.8 ns | 31.9 ns | 32.3 ns | 2.82 | 0.99 | 2.78 |
u8//Rng/3333/3300 | 93.5 ns | 32.5 ns | 33.3 ns | 2.88 | 0.98 | 2.81 |
u8//Rng/3333/12 | 89.3 ns | 25.9 ns | 26.0 ns | 3.45 | 1.00 | 3.43 |
u8//Rng/3333/5 | 87.4 ns | 29.0 ns | 29.2 ns | 3.01 | 0.99 | 2.99 |
u8//Rng/3333/1 | 83.1 ns | 29.0 ns | 28.9 ns | 2.87 | 1.00 | 2.88 |
u8//Rng/333/101 | 18.4 ns | 12.1 ns | 11.3 ns | 1.52 | 1.07 | 1.63 |
u8//Rng/123/32 | 26.1 ns | 6.56 ns | 6.44 ns | 3.98 | 1.02 | 4.05 |
u8//Rng/23/7 | 18.5 ns | 5.20 ns | 5.22 ns | 3.56 | 1.00 | 3.54 |
u8//Rng/12/5 | 13.2 ns | 5.28 ns | 4.93 ns | 2.50 | 1.07 | 2.68 |
u8//Rng/3/2 | 3.33 ns | 4.77 ns | 4.73 ns | 0.70 | 1.01 | 0.70 |
u16//Std/3333/2242 | 180 ns | 131 ns | 106 ns | 1.37 | 1.24 | 1.70 |
u16//Std/3332/1666 | 184 ns | 84.0 ns | 83.6 ns | 2.19 | 1.00 | 2.20 |
u16//Std/3333/1111 | 185 ns | 132 ns | 86.5 ns | 1.40 | 1.53 | 2.14 |
u16//Std/3333/501 | 184 ns | 170 ns | 143 ns | 1.08 | 1.19 | 1.29 |
u16//Std/3333/3300 | 179 ns | 61.9 ns | 61.7 ns | 2.89 | 1.00 | 2.90 |
u16//Std/3333/12 | 166 ns | 46.8 ns | 46.3 ns | 3.55 | 1.01 | 3.59 |
u16//Std/3333/5 | 176 ns | 54.3 ns | 53.6 ns | 3.24 | 1.01 | 3.28 |
u16//Std/3333/1 | 176 ns | 53.4 ns | 53.8 ns | 3.30 | 0.99 | 3.27 |
u16//Std/333/101 | 27.4 ns | 13.0 ns | 11.9 ns | 2.11 | 1.09 | 2.30 |
u16//Std/123/32 | 16.5 ns | 11.8 ns | 10.5 ns | 1.40 | 1.12 | 1.57 |
u16//Std/23/7 | 11.5 ns | 4.93 ns | 5.14 ns | 2.33 | 0.96 | 2.24 |
u16//Std/12/5 | 11.9 ns | 5.15 ns | 4.94 ns | 2.31 | 1.04 | 2.41 |
u16//Std/3/2 | 3.33 ns | 4.73 ns | 4.68 ns | 0.70 | 1.01 | 0.71 |
u16//Rng/3333/2242 | 180 ns | 129 ns | 104 ns | 1.40 | 1.24 | 1.73 |
u16//Rng/3332/1666 | 185 ns | 82.5 ns | 84.0 ns | 2.24 | 0.98 | 2.20 |
u16//Rng/3333/1111 | 183 ns | 112 ns | 87.9 ns | 1.63 | 1.27 | 2.08 |
u16//Rng/3333/501 | 182 ns | 167 ns | 146 ns | 1.09 | 1.14 | 1.25 |
u16//Rng/3333/3300 | 181 ns | 61.2 ns | 63.9 ns | 2.96 | 0.96 | 2.83 |
u16//Rng/3333/12 | 167 ns | 46.4 ns | 47.5 ns | 3.60 | 0.98 | 3.52 |
u16//Rng/3333/5 | 176 ns | 53.3 ns | 53.4 ns | 3.30 | 1.00 | 3.30 |
u16//Rng/3333/1 | 175 ns | 53.6 ns | 54.8 ns | 3.26 | 0.98 | 3.19 |
u16//Rng/333/101 | 27.0 ns | 13.3 ns | 11.8 ns | 2.03 | 1.13 | 2.29 |
u16//Rng/123/32 | 16.5 ns | 11.8 ns | 10.9 ns | 1.40 | 1.08 | 1.51 |
u16//Rng/23/7 | 11.9 ns | 4.92 ns | 5.04 ns | 2.42 | 0.98 | 2.36 |
u16//Rng/12/5 | 12.4 ns | 5.15 ns | 5.35 ns | 2.41 | 0.96 | 2.32 |
u16//Rng/3/2 | 3.34 ns | 4.73 ns | 4.85 ns | 0.71 | 0.98 | 0.69 |
u32//Std/3333/2242 | 337 ns | 258 ns | 206 ns | 1.31 | 1.25 | 1.64 |
u32//Std/3332/1666 | 343 ns | 169 ns | 169 ns | 2.03 | 1.00 | 2.03 |
u32//Std/3333/1111 | 339 ns | 206 ns | 152 ns | 1.65 | 1.36 | 2.23 |
u32//Std/3333/501 | 336 ns | 310 ns | 265 ns | 1.08 | 1.17 | 1.27 |
u32//Std/3333/3300 | 340 ns | 106 ns | 110 ns | 3.21 | 0.96 | 3.09 |
u32//Std/3333/12 | 337 ns | 90.5 ns | 93.1 ns | 3.72 | 0.97 | 3.62 |
u32//Std/3333/5 | 333 ns | 89.7 ns | 92.4 ns | 3.71 | 0.97 | 3.60 |
u32//Std/3333/1 | 331 ns | 90.8 ns | 92.7 ns | 3.65 | 0.98 | 3.57 |
u32//Std/333/101 | 35.3 ns | 16.3 ns | 16.9 ns | 2.17 | 0.96 | 2.09 |
u32//Std/123/32 | 14.5 ns | 12.1 ns | 11.2 ns | 1.20 | 1.08 | 1.29 |
u32//Std/23/7 | 11.4 ns | 6.89 ns | 7.11 ns | 1.65 | 0.97 | 1.60 |
u32//Std/12/5 | 8.91 ns | 6.77 ns | 7.04 ns | 1.32 | 0.96 | 1.27 |
u32//Std/3/2 | 3.12 ns | 4.68 ns | 4.76 ns | 0.67 | 0.98 | 0.66 |
u32//Rng/3333/2242 | 331 ns | 252 ns | 204 ns | 1.31 | 1.24 | 1.62 |
u32//Rng/3332/1666 | 341 ns | 164 ns | 167 ns | 2.08 | 0.98 | 2.04 |
u32//Rng/3333/1111 | 335 ns | 202 ns | 148 ns | 1.66 | 1.36 | 2.26 |
u32//Rng/3333/501 | 341 ns | 306 ns | 266 ns | 1.11 | 1.15 | 1.28 |
u32//Rng/3333/3300 | 336 ns | 106 ns | 109 ns | 3.17 | 0.97 | 3.08 |
u32//Rng/3333/12 | 332 ns | 90.8 ns | 96.3 ns | 3.66 | 0.94 | 3.45 |
u32//Rng/3333/5 | 335 ns | 88.8 ns | 99.1 ns | 3.77 | 0.90 | 3.38 |
u32//Rng/3333/1 | 332 ns | 89.3 ns | 92.8 ns | 3.72 | 0.96 | 3.58 |
u32//Rng/333/101 | 35.5 ns | 16.3 ns | 17.1 ns | 2.18 | 0.95 | 2.08 |
u32//Rng/123/32 | 14.5 ns | 12.5 ns | 10.9 ns | 1.16 | 1.15 | 1.33 |
u32//Rng/23/7 | 11.3 ns | 7.03 ns | 7.21 ns | 1.61 | 0.98 | 1.57 |
u32//Rng/12/5 | 9.03 ns | 7.37 ns | 7.19 ns | 1.23 | 1.03 | 1.26 |
u32//Rng/3/2 | 3.08 ns | 4.68 ns | 4.74 ns | 0.66 | 0.99 | 0.65 |
u64//Std/3333/2242 | 661 ns | 436 ns | 333 ns | 1.52 | 1.31 | 1.98 |
u64//Std/3332/1666 | 670 ns | 325 ns | 332 ns | 2.06 | 0.98 | 2.02 |
u64//Std/3333/1111 | 596 ns | 392 ns | 281 ns | 1.52 | 1.40 | 2.12 |
u64//Std/3333/501 | 659 ns | 581 ns | 506 ns | 1.13 | 1.15 | 1.30 |
u64//Std/3333/3300 | 668 ns | 207 ns | 227 ns | 3.23 | 0.91 | 2.94 |
u64//Std/3333/12 | 655 ns | 134 ns | 134 ns | 4.89 | 1.00 | 4.89 |
u64//Std/3333/5 | 661 ns | 175 ns | 186 ns | 3.78 | 0.94 | 3.55 |
u64//Std/3333/1 | 661 ns | 182 ns | 183 ns | 3.63 | 0.99 | 3.61 |
u64//Std/333/101 | 63.2 ns | 48.7 ns | 39.4 ns | 1.30 | 1.24 | 1.60 |
u64//Std/123/32 | 22.0 ns | 13.5 ns | 11.9 ns | 1.63 | 1.13 | 1.85 |
u64//Std/23/7 | 11.3 ns | 11.2 ns | 9.53 ns | 1.01 | 1.18 | 1.19 |
u64//Std/12/5 | 11.9 ns | 10.6 ns | 9.53 ns | 1.12 | 1.11 | 1.25 |
u64//Std/3/2 | 3.11 ns | 4.68 ns | 4.78 ns | 0.66 | 0.98 | 0.65 |
u64//Rng/3333/2242 | 659 ns | 435 ns | 328 ns | 1.51 | 1.33 | 2.01 |
u64//Rng/3332/1666 | 671 ns | 325 ns | 326 ns | 2.06 | 1.00 | 2.06 |
u64//Rng/3333/1111 | 596 ns | 391 ns | 286 ns | 1.52 | 1.37 | 2.08 |
u64//Rng/3333/501 | 668 ns | 583 ns | 506 ns | 1.15 | 1.15 | 1.32 |
u64//Rng/3333/3300 | 665 ns | 206 ns | 233 ns | 3.23 | 0.88 | 2.85 |
u64//Rng/3333/12 | 668 ns | 133 ns | 135 ns | 5.02 | 0.99 | 4.95 |
u64//Rng/3333/5 | 661 ns | 175 ns | 178 ns | 3.78 | 0.98 | 3.71 |
u64//Rng/3333/1 | 659 ns | 182 ns | 184 ns | 3.62 | 0.99 | 3.58 |
u64//Rng/333/101 | 62.3 ns | 48.4 ns | 39.8 ns | 1.29 | 1.22 | 1.57 |
u64//Rng/123/32 | 22.2 ns | 13.6 ns | 12.4 ns | 1.63 | 1.10 | 1.79 |
u64//Rng/23/7 | 11.2 ns | 11.4 ns | 9.91 ns | 0.98 | 1.15 | 1.13 |
u64//Rng/12/5 | 11.7 ns | 10.7 ns | 9.97 ns | 1.09 | 1.07 | 1.17 |
u64//Rng/3/2 | 3.04 ns | 4.66 ns | 4.66 ns | 0.65 | 1.00 | 0.65 |
c6//Std/3333/2242 | 1742 ns | 363 ns | 290 ns | 4.80 | 1.25 | 6.01 |
c6//Std/3332/1666 | 1733 ns | 244 ns | 246 ns | 7.10 | 0.99 | 7.04 |
c6//Std/3333/1111 | 1756 ns | 323 ns | 250 ns | 5.44 | 1.29 | 7.02 |
c6//Std/3333/501 | 1750 ns | 477 ns | 411 ns | 3.67 | 1.16 | 4.26 |
c6//Std/3333/3300 | 1740 ns | 162 ns | 164 ns | 10.74 | 0.99 | 10.61 |
c6//Std/3333/12 | 1734 ns | 132 ns | 133 ns | 13.14 | 0.99 | 13.04 |
c6//Std/3333/5 | 1826 ns | 152 ns | 155 ns | 12.01 | 0.98 | 11.78 |
c6//Std/3333/1 | 1733 ns | 154 ns | 154 ns | 11.25 | 1.00 | 11.25 |
c6//Std/333/101 | 180 ns | 46.6 ns | 38.4 ns | 3.86 | 1.21 | 4.69 |
c6//Std/123/32 | 66.6 ns | 13.6 ns | 12.1 ns | 4.90 | 1.12 | 5.50 |
c6//Std/23/7 | 12.2 ns | 11.0 ns | 9.32 ns | 1.11 | 1.18 | 1.31 |
c6//Std/12/5 | 7.16 ns | 7.26 ns | 7.33 ns | 0.99 | 0.99 | 0.98 |
c6//Std/3/2 | 2.10 ns | 5.02 ns | 5.10 ns | 0.42 | 0.98 | 0.41 |
c6//Rng/3333/2242 | 1747 ns | 363 ns | 291 ns | 4.81 | 1.25 | 6.00 |
c6//Rng/3332/1666 | 1736 ns | 243 ns | 247 ns | 7.14 | 0.98 | 7.03 |
c6//Rng/3333/1111 | 1726 ns | 323 ns | 247 ns | 5.34 | 1.31 | 6.99 |
c6//Rng/3333/501 | 1746 ns | 476 ns | 409 ns | 3.67 | 1.16 | 4.27 |
c6//Rng/3333/3300 | 1741 ns | 163 ns | 164 ns | 10.68 | 0.99 | 10.62 |
c6//Rng/3333/12 | 1728 ns | 133 ns | 135 ns | 12.99 | 0.99 | 12.80 |
c6//Rng/3333/5 | 1829 ns | 155 ns | 157 ns | 11.80 | 0.99 | 11.65 |
c6//Rng/3333/1 | 1724 ns | 154 ns | 157 ns | 11.19 | 0.98 | 10.98 |
c6//Rng/333/101 | 178 ns | 46.7 ns | 38.5 ns | 3.81 | 1.21 | 4.62 |
c6//Rng/123/32 | 66.0 ns | 14.1 ns | 12.7 ns | 4.68 | 1.11 | 5.20 |
c6//Rng/23/7 | 12.3 ns | 11.4 ns | 10.0 ns | 1.08 | 1.14 | 1.23 |
c6//Rng/12/5 | 7.05 ns | 7.44 ns | 7.78 ns | 0.95 | 0.96 | 0.91 |
c6//Rng/3/2 | 2.10 ns | 5.14 ns | 5.54 ns | 0.41 | 0.93 | 0.38 |
u8//Std/35000/520 | 785 ns | 797 ns | 598 ns | 0.98 | 1.33 | 1.31 |
u8//Std/35000/3000 | 725 ns | 759 ns | 583 ns | 0.96 | 1.30 | 1.24 |