Vectorize rotate even better by AlexGuteniev · Pull Request #5525 · microsoft/STL

Follow-up to #5502.

Reasons to consider a follow-up:

Weakness of the original approach

It deals well with extremes:

The worst case is when the rotation is small, but still large enough to not engage the small rotation branch.

Mitigation approaches

Generally, we need a multi-range rotating swap to make fewer element assignments. From the original PR:

Hypothetical functions like swap_3_ranges, swap_4_ranges, etc. could reduce the number of assignments for more cases. But going further in optimization will result in less and less improvement for more and more code added, and at some point the complex decisions will take a noticeable amount of time, resulting in negative improvement, so we need to stop somewhere. Probably stopping at just the small rotation and two-range swap strategy would be a good idea.
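To illustrate why a three-range rotating swap saves assignments, here is a minimal scalar sketch. The names `swap_3_ranges_sketch`, `two_pairwise_swaps_sketch`, and `sketch_matches_pairwise` are hypothetical and are not the STL's internal `_Swap_3_ranges`, which is vectorized. The sketch only shows that cycling three ranges through one temporary takes four moves per index, where two pairwise swaps take six.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical scalar helper, not the STL's internal _Swap_3_ranges.
// Cycles three equally sized, non-overlapping ranges:
// a receives b, b receives c, c receives the old a.
template <class T>
void swap_3_ranges_sketch(T* a, T* b, T* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        T tmp = std::move(a[i]); // 4 moves per index in total
        a[i]  = std::move(b[i]);
        b[i]  = std::move(c[i]);
        c[i]  = std::move(tmp);
    }
}

// Same result from two pairwise swaps: 6 moves per index.
template <class T>
void two_pairwise_swaps_sketch(T* a, T* b, T* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::swap(a[i], b[i]);
        std::swap(b[i], c[i]);
    }
}

// Returns true if both helpers leave the buffer in the same state.
inline bool sketch_matches_pairwise() {
    std::vector<int> x{1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::vector<int> y = x;
    swap_3_ranges_sketch(x.data(), x.data() + 3, x.data() + 6, 3);
    two_pairwise_swaps_sketch(y.data(), y.data() + 3, y.data() + 6, 3);
    // Cycling three adjacent blocks of 3 is a left rotation by 3.
    assert(x == (std::vector<int>{4, 5, 6, 7, 8, 9, 1, 2, 3}));
    return x == y;
}
```

In this toy case the three ranges are adjacent and equal in length, so the cycle alone completes the rotation. In general the range lengths do not line up this neatly, which is where the decision logic mentioned in the quote comes in.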

So how can we make some improvement while avoiding unnecessary complication:

This makes me think that it would be good to:

The code changes

So I've tried _Swap_3_ranges. It resulted in a speedup of at most 1.40, and it fixed the slightly regressed cases.
I think this indicates both that the approach is good enough to use and that the gain is not large enough to justify trying something more complex.

I've moved _Rotating closer to __std_swap_ranges_trivially_swappable_noalias to make the similarity between that and _Swap_3_ranges more obvious.

Coverage

The tests lacked arrays long enough to exercise the range swapping properly. I've expanded the test to have more elements; to save some run time, I did this for 8-bit elements only. The algorithm does not distinguish element sizes internally anyway.

The same goes for the benchmark: I've added just two examples of the case that became worse.
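As a rough illustration of this kind of coverage check (not the actual test in the STL repository), the sketch below compares `std::rotate` against a straightforward reference for every split point of a long 8-bit array. The helper names and the array length of 1000 in the usage note are assumptions chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference rotation: the element at (i + mid) % n moves to position i.
inline std::vector<std::uint8_t> rotate_reference(
    const std::vector<std::uint8_t>& in, std::size_t mid) {
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[(i + mid) % in.size()];
    }
    return out;
}

// Checks std::rotate against the reference for every split point
// of an array with `size` 8-bit elements.
inline bool check_rotate_all_positions(std::size_t size) {
    std::vector<std::uint8_t> original(size);
    for (std::size_t i = 0; i < size; ++i) {
        // Arbitrary fill that does not simply repeat every 256 elements.
        original[i] = static_cast<std::uint8_t>(i + i / 256);
    }
    for (std::size_t mid = 0; mid <= size; ++mid) {
        std::vector<std::uint8_t> actual = original;
        std::rotate(actual.begin(),
                    actual.begin() + static_cast<std::ptrdiff_t>(mid),
                    actual.end());
        if (actual != rotate_reference(original, mid)) {
            return false;
        }
    }
    return true;
}
```

Calling `check_rotate_all_positions(1000)` covers both very small and near-half split points in one pass, so every strategy branch that depends on the rotation distance has a chance to run.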

Benchmark results

The Before #5502 / After #5502 numbers may vary slightly from the previous PR description, because I've run the benchmarks again.

| Benchmark | Before #5502 | After #5502 | After this | #5502 ⬆️ | This ⬆️ | Total ⬆️ |
|---|---|---|---|---|---|---|
| u8//Std/3333/2242 | 93.8 ns | 67.0 ns | 50.6 ns | 1.40 | 1.32 | 1.85 |
| u8//Std/3332/1666 | 94.6 ns | 40.0 ns | 40.5 ns | 2.37 | 0.99 | 2.34 |
| u8//Std/3333/1111 | 91.4 ns | 60.4 ns | 44.0 ns | 1.51 | 1.37 | 2.08 |
| u8//Std/3333/501 | 89.9 ns | 32.1 ns | 32.1 ns | 2.80 | 1.00 | 2.80 |
| u8//Std/3333/3300 | 91.3 ns | 32.3 ns | 32.1 ns | 2.83 | 1.01 | 2.84 |
| u8//Std/3333/12 | 87.8 ns | 25.9 ns | 25.8 ns | 3.39 | 1.00 | 3.40 |
| u8//Std/3333/5 | 90.8 ns | 29.0 ns | 28.8 ns | 3.13 | 1.01 | 3.15 |
| u8//Std/3333/1 | 82.2 ns | 28.8 ns | 29.8 ns | 2.85 | 0.97 | 2.76 |
| u8//Std/333/101 | 19.0 ns | 12.1 ns | 10.1 ns | 1.57 | 1.20 | 1.88 |
| u8//Std/123/32 | 22.7 ns | 6.57 ns | 6.36 ns | 3.46 | 1.03 | 3.57 |
| u8//Std/23/7 | 18.3 ns | 5.24 ns | 5.51 ns | 3.49 | 0.95 | 3.32 |
| u8//Std/12/5 | 12.9 ns | 5.26 ns | 5.03 ns | 2.45 | 1.05 | 2.56 |
| u8//Std/3/2 | 3.42 ns | 4.77 ns | 4.71 ns | 0.72 | 1.01 | 0.73 |
| u8//Rng/3333/2242 | 94.3 ns | 67.4 ns | 52.8 ns | 1.40 | 1.28 | 1.79 |
| u8//Rng/3332/1666 | 95.9 ns | 39.9 ns | 41.7 ns | 2.40 | 0.96 | 2.30 |
| u8//Rng/3333/1111 | 93.2 ns | 58.4 ns | 45.9 ns | 1.60 | 1.27 | 2.03 |
| u8//Rng/3333/501 | 89.8 ns | 31.9 ns | 32.3 ns | 2.82 | 0.99 | 2.78 |
| u8//Rng/3333/3300 | 93.5 ns | 32.5 ns | 33.3 ns | 2.88 | 0.98 | 2.81 |
| u8//Rng/3333/12 | 89.3 ns | 25.9 ns | 26.0 ns | 3.45 | 1.00 | 3.43 |
| u8//Rng/3333/5 | 87.4 ns | 29.0 ns | 29.2 ns | 3.01 | 0.99 | 2.99 |
| u8//Rng/3333/1 | 83.1 ns | 29.0 ns | 28.9 ns | 2.87 | 1.00 | 2.88 |
| u8//Rng/333/101 | 18.4 ns | 12.1 ns | 11.3 ns | 1.52 | 1.07 | 1.63 |
| u8//Rng/123/32 | 26.1 ns | 6.56 ns | 6.44 ns | 3.98 | 1.02 | 4.05 |
| u8//Rng/23/7 | 18.5 ns | 5.20 ns | 5.22 ns | 3.56 | 1.00 | 3.54 |
| u8//Rng/12/5 | 13.2 ns | 5.28 ns | 4.93 ns | 2.50 | 1.07 | 2.68 |
| u8//Rng/3/2 | 3.33 ns | 4.77 ns | 4.73 ns | 0.70 | 1.01 | 0.70 |
| u16//Std/3333/2242 | 180 ns | 131 ns | 106 ns | 1.37 | 1.24 | 1.70 |
| u16//Std/3332/1666 | 184 ns | 84.0 ns | 83.6 ns | 2.19 | 1.00 | 2.20 |
| u16//Std/3333/1111 | 185 ns | 132 ns | 86.5 ns | 1.40 | 1.53 | 2.14 |
| u16//Std/3333/501 | 184 ns | 170 ns | 143 ns | 1.08 | 1.19 | 1.29 |
| u16//Std/3333/3300 | 179 ns | 61.9 ns | 61.7 ns | 2.89 | 1.00 | 2.90 |
| u16//Std/3333/12 | 166 ns | 46.8 ns | 46.3 ns | 3.55 | 1.01 | 3.59 |
| u16//Std/3333/5 | 176 ns | 54.3 ns | 53.6 ns | 3.24 | 1.01 | 3.28 |
| u16//Std/3333/1 | 176 ns | 53.4 ns | 53.8 ns | 3.30 | 0.99 | 3.27 |
| u16//Std/333/101 | 27.4 ns | 13.0 ns | 11.9 ns | 2.11 | 1.09 | 2.30 |
| u16//Std/123/32 | 16.5 ns | 11.8 ns | 10.5 ns | 1.40 | 1.12 | 1.57 |
| u16//Std/23/7 | 11.5 ns | 4.93 ns | 5.14 ns | 2.33 | 0.96 | 2.24 |
| u16//Std/12/5 | 11.9 ns | 5.15 ns | 4.94 ns | 2.31 | 1.04 | 2.41 |
| u16//Std/3/2 | 3.33 ns | 4.73 ns | 4.68 ns | 0.70 | 1.01 | 0.71 |
| u16//Rng/3333/2242 | 180 ns | 129 ns | 104 ns | 1.40 | 1.24 | 1.73 |
| u16//Rng/3332/1666 | 185 ns | 82.5 ns | 84.0 ns | 2.24 | 0.98 | 2.20 |
| u16//Rng/3333/1111 | 183 ns | 112 ns | 87.9 ns | 1.63 | 1.27 | 2.08 |
| u16//Rng/3333/501 | 182 ns | 167 ns | 146 ns | 1.09 | 1.14 | 1.25 |
| u16//Rng/3333/3300 | 181 ns | 61.2 ns | 63.9 ns | 2.96 | 0.96 | 2.83 |
| u16//Rng/3333/12 | 167 ns | 46.4 ns | 47.5 ns | 3.60 | 0.98 | 3.52 |
| u16//Rng/3333/5 | 176 ns | 53.3 ns | 53.4 ns | 3.30 | 1.00 | 3.30 |
| u16//Rng/3333/1 | 175 ns | 53.6 ns | 54.8 ns | 3.26 | 0.98 | 3.19 |
| u16//Rng/333/101 | 27.0 ns | 13.3 ns | 11.8 ns | 2.03 | 1.13 | 2.29 |
| u16//Rng/123/32 | 16.5 ns | 11.8 ns | 10.9 ns | 1.40 | 1.08 | 1.51 |
| u16//Rng/23/7 | 11.9 ns | 4.92 ns | 5.04 ns | 2.42 | 0.98 | 2.36 |
| u16//Rng/12/5 | 12.4 ns | 5.15 ns | 5.35 ns | 2.41 | 0.96 | 2.32 |
| u16//Rng/3/2 | 3.34 ns | 4.73 ns | 4.85 ns | 0.71 | 0.98 | 0.69 |
| u32//Std/3333/2242 | 337 ns | 258 ns | 206 ns | 1.31 | 1.25 | 1.64 |
| u32//Std/3332/1666 | 343 ns | 169 ns | 169 ns | 2.03 | 1.00 | 2.03 |
| u32//Std/3333/1111 | 339 ns | 206 ns | 152 ns | 1.65 | 1.36 | 2.23 |
| u32//Std/3333/501 | 336 ns | 310 ns | 265 ns | 1.08 | 1.17 | 1.27 |
| u32//Std/3333/3300 | 340 ns | 106 ns | 110 ns | 3.21 | 0.96 | 3.09 |
| u32//Std/3333/12 | 337 ns | 90.5 ns | 93.1 ns | 3.72 | 0.97 | 3.62 |
| u32//Std/3333/5 | 333 ns | 89.7 ns | 92.4 ns | 3.71 | 0.97 | 3.60 |
| u32//Std/3333/1 | 331 ns | 90.8 ns | 92.7 ns | 3.65 | 0.98 | 3.57 |
| u32//Std/333/101 | 35.3 ns | 16.3 ns | 16.9 ns | 2.17 | 0.96 | 2.09 |
| u32//Std/123/32 | 14.5 ns | 12.1 ns | 11.2 ns | 1.20 | 1.08 | 1.29 |
| u32//Std/23/7 | 11.4 ns | 6.89 ns | 7.11 ns | 1.65 | 0.97 | 1.60 |
| u32//Std/12/5 | 8.91 ns | 6.77 ns | 7.04 ns | 1.32 | 0.96 | 1.27 |
| u32//Std/3/2 | 3.12 ns | 4.68 ns | 4.76 ns | 0.67 | 0.98 | 0.66 |
| u32//Rng/3333/2242 | 331 ns | 252 ns | 204 ns | 1.31 | 1.24 | 1.62 |
| u32//Rng/3332/1666 | 341 ns | 164 ns | 167 ns | 2.08 | 0.98 | 2.04 |
| u32//Rng/3333/1111 | 335 ns | 202 ns | 148 ns | 1.66 | 1.36 | 2.26 |
| u32//Rng/3333/501 | 341 ns | 306 ns | 266 ns | 1.11 | 1.15 | 1.28 |
| u32//Rng/3333/3300 | 336 ns | 106 ns | 109 ns | 3.17 | 0.97 | 3.08 |
| u32//Rng/3333/12 | 332 ns | 90.8 ns | 96.3 ns | 3.66 | 0.94 | 3.45 |
| u32//Rng/3333/5 | 335 ns | 88.8 ns | 99.1 ns | 3.77 | 0.90 | 3.38 |
| u32//Rng/3333/1 | 332 ns | 89.3 ns | 92.8 ns | 3.72 | 0.96 | 3.58 |
| u32//Rng/333/101 | 35.5 ns | 16.3 ns | 17.1 ns | 2.18 | 0.95 | 2.08 |
| u32//Rng/123/32 | 14.5 ns | 12.5 ns | 10.9 ns | 1.16 | 1.15 | 1.33 |
| u32//Rng/23/7 | 11.3 ns | 7.03 ns | 7.21 ns | 1.61 | 0.98 | 1.57 |
| u32//Rng/12/5 | 9.03 ns | 7.37 ns | 7.19 ns | 1.23 | 1.03 | 1.26 |
| u32//Rng/3/2 | 3.08 ns | 4.68 ns | 4.74 ns | 0.66 | 0.99 | 0.65 |
| u64//Std/3333/2242 | 661 ns | 436 ns | 333 ns | 1.52 | 1.31 | 1.98 |
| u64//Std/3332/1666 | 670 ns | 325 ns | 332 ns | 2.06 | 0.98 | 2.02 |
| u64//Std/3333/1111 | 596 ns | 392 ns | 281 ns | 1.52 | 1.40 | 2.12 |
| u64//Std/3333/501 | 659 ns | 581 ns | 506 ns | 1.13 | 1.15 | 1.30 |
| u64//Std/3333/3300 | 668 ns | 207 ns | 227 ns | 3.23 | 0.91 | 2.94 |
| u64//Std/3333/12 | 655 ns | 134 ns | 134 ns | 4.89 | 1.00 | 4.89 |
| u64//Std/3333/5 | 661 ns | 175 ns | 186 ns | 3.78 | 0.94 | 3.55 |
| u64//Std/3333/1 | 661 ns | 182 ns | 183 ns | 3.63 | 0.99 | 3.61 |
| u64//Std/333/101 | 63.2 ns | 48.7 ns | 39.4 ns | 1.30 | 1.24 | 1.60 |
| u64//Std/123/32 | 22.0 ns | 13.5 ns | 11.9 ns | 1.63 | 1.13 | 1.85 |
| u64//Std/23/7 | 11.3 ns | 11.2 ns | 9.53 ns | 1.01 | 1.18 | 1.19 |
| u64//Std/12/5 | 11.9 ns | 10.6 ns | 9.53 ns | 1.12 | 1.11 | 1.25 |
| u64//Std/3/2 | 3.11 ns | 4.68 ns | 4.78 ns | 0.66 | 0.98 | 0.65 |
| u64//Rng/3333/2242 | 659 ns | 435 ns | 328 ns | 1.51 | 1.33 | 2.01 |
| u64//Rng/3332/1666 | 671 ns | 325 ns | 326 ns | 2.06 | 1.00 | 2.06 |
| u64//Rng/3333/1111 | 596 ns | 391 ns | 286 ns | 1.52 | 1.37 | 2.08 |
| u64//Rng/3333/501 | 668 ns | 583 ns | 506 ns | 1.15 | 1.15 | 1.32 |
| u64//Rng/3333/3300 | 665 ns | 206 ns | 233 ns | 3.23 | 0.88 | 2.85 |
| u64//Rng/3333/12 | 668 ns | 133 ns | 135 ns | 5.02 | 0.99 | 4.95 |
| u64//Rng/3333/5 | 661 ns | 175 ns | 178 ns | 3.78 | 0.98 | 3.71 |
| u64//Rng/3333/1 | 659 ns | 182 ns | 184 ns | 3.62 | 0.99 | 3.58 |
| u64//Rng/333/101 | 62.3 ns | 48.4 ns | 39.8 ns | 1.29 | 1.22 | 1.57 |
| u64//Rng/123/32 | 22.2 ns | 13.6 ns | 12.4 ns | 1.63 | 1.10 | 1.79 |
| u64//Rng/23/7 | 11.2 ns | 11.4 ns | 9.91 ns | 0.98 | 1.15 | 1.13 |
| u64//Rng/12/5 | 11.7 ns | 10.7 ns | 9.97 ns | 1.09 | 1.07 | 1.17 |
| u64//Rng/3/2 | 3.04 ns | 4.66 ns | 4.66 ns | 0.65 | 1.00 | 0.65 |
| c6//Std/3333/2242 | 1742 ns | 363 ns | 290 ns | 4.80 | 1.25 | 6.01 |
| c6//Std/3332/1666 | 1733 ns | 244 ns | 246 ns | 7.10 | 0.99 | 7.04 |
| c6//Std/3333/1111 | 1756 ns | 323 ns | 250 ns | 5.44 | 1.29 | 7.02 |
| c6//Std/3333/501 | 1750 ns | 477 ns | 411 ns | 3.67 | 1.16 | 4.26 |
| c6//Std/3333/3300 | 1740 ns | 162 ns | 164 ns | 10.74 | 0.99 | 10.61 |
| c6//Std/3333/12 | 1734 ns | 132 ns | 133 ns | 13.14 | 0.99 | 13.04 |
| c6//Std/3333/5 | 1826 ns | 152 ns | 155 ns | 12.01 | 0.98 | 11.78 |
| c6//Std/3333/1 | 1733 ns | 154 ns | 154 ns | 11.25 | 1.00 | 11.25 |
| c6//Std/333/101 | 180 ns | 46.6 ns | 38.4 ns | 3.86 | 1.21 | 4.69 |
| c6//Std/123/32 | 66.6 ns | 13.6 ns | 12.1 ns | 4.90 | 1.12 | 5.50 |
| c6//Std/23/7 | 12.2 ns | 11.0 ns | 9.32 ns | 1.11 | 1.18 | 1.31 |
| c6//Std/12/5 | 7.16 ns | 7.26 ns | 7.33 ns | 0.99 | 0.99 | 0.98 |
| c6//Std/3/2 | 2.10 ns | 5.02 ns | 5.10 ns | 0.42 | 0.98 | 0.41 |
| c6//Rng/3333/2242 | 1747 ns | 363 ns | 291 ns | 4.81 | 1.25 | 6.00 |
| c6//Rng/3332/1666 | 1736 ns | 243 ns | 247 ns | 7.14 | 0.98 | 7.03 |
| c6//Rng/3333/1111 | 1726 ns | 323 ns | 247 ns | 5.34 | 1.31 | 6.99 |
| c6//Rng/3333/501 | 1746 ns | 476 ns | 409 ns | 3.67 | 1.16 | 4.27 |
| c6//Rng/3333/3300 | 1741 ns | 163 ns | 164 ns | 10.68 | 0.99 | 10.62 |
| c6//Rng/3333/12 | 1728 ns | 133 ns | 135 ns | 12.99 | 0.99 | 12.80 |
| c6//Rng/3333/5 | 1829 ns | 155 ns | 157 ns | 11.80 | 0.99 | 11.65 |
| c6//Rng/3333/1 | 1724 ns | 154 ns | 157 ns | 11.19 | 0.98 | 10.98 |
| c6//Rng/333/101 | 178 ns | 46.7 ns | 38.5 ns | 3.81 | 1.21 | 4.62 |
| c6//Rng/123/32 | 66.0 ns | 14.1 ns | 12.7 ns | 4.68 | 1.11 | 5.20 |
| c6//Rng/23/7 | 12.3 ns | 11.4 ns | 10.0 ns | 1.08 | 1.14 | 1.23 |
| c6//Rng/12/5 | 7.05 ns | 7.44 ns | 7.78 ns | 0.95 | 0.96 | 0.91 |
| c6//Rng/3/2 | 2.10 ns | 5.14 ns | 5.54 ns | 0.41 | 0.93 | 0.38 |
| u8//Std/35000/520 | 785 ns | 797 ns | 598 ns | 0.98 | 1.33 | 1.31 |
| u8//Std/35000/3000 | 725 ns | 759 ns | 583 ns | 0.96 | 1.30 | 1.24 |
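
For reading the table, the three ⬆️ columns appear to be ratios of the measured times: Before / After #5502, After #5502 / After this, and Before / After this. Values above 1 are speedups and values below 1 are slowdowns. The sketch below reproduces the first row under that assumption; the `Speedups` struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Hypothetical helper type for the three ratio columns.
struct Speedups {
    double from_5502; // Before #5502 / After #5502
    double from_this; // After #5502 / After this
    double total;     // Before #5502 / After this
};

inline Speedups compute_speedups(double before_ns, double after_5502_ns,
                                 double after_this_ns) {
    return {before_ns / after_5502_ns,
            after_5502_ns / after_this_ns,
            before_ns / after_this_ns};
}

// Prints the ratios for the u8//Std/3333/2242 row.
inline void print_first_row() {
    const Speedups s = compute_speedups(93.8, 67.0, 50.6);
    std::printf("%.2f %.2f %.2f\n", s.from_5502, s.from_this, s.total);
    // prints "1.40 1.32 1.85"
}
```

Because the times in the table are themselves rounded, recomputing other rows this way can differ from the listed ratios in the last digit.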