Use AVX2 in minmax_element
vectorization by AlexGuteniev · Pull Request #4659 · microsoft/STL (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation13 Commits24 Checks39 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Resolves #2803
This is not final optimization. At least, we should use AVX masks here too.
But this one is complex enough already, so the rest would be follow-up PR(s).
I also notice that Both_val
8-bit and 16-bit cases are slow.
The vectorization for them is not engaged, it is a separate issue from the AVX.
Benchmark resultsBefore:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min> 248 ns 96.3 ns 7466667
bm<uint8_t, 8021, Op::Max> 238 ns 96.9 ns 6291661
bm<uint8_t, 8021, Op::Both> 371 ns 124 ns 8432941
bm<uint8_t, 8021, Op::Min_val> 131 ns 56.6 ns 16000000
bm<uint8_t, 8021, Op::Max_val> 128 ns 67.0 ns 7466667
bm<uint8_t, 8021, Op::Both_val> 4197 ns 1611 ns 407273
bm<uint16_t, 8021, Op::Min> 459 ns 226 ns 4072727
bm<uint16_t, 8021, Op::Max> 445 ns 200 ns 3521123
bm<uint16_t, 8021, Op::Both> 685 ns 331 ns 2357895
bm<uint16_t, 8021, Op::Min_val> 247 ns 122 ns 4480000
bm<uint16_t, 8021, Op::Max_val> 252 ns 105 ns 7466667
bm<uint16_t, 8021, Op::Both_val> 4239 ns 3299 ns 203636
bm<uint32_t, 8021, Op::Min> 979 ns 364 ns 1544828
bm<uint32_t, 8021, Op::Max> 932 ns 586 ns 1120000
bm<uint32_t, 8021, Op::Both> 1439 ns 921 ns 746667
bm<uint32_t, 8021, Op::Min_val> 501 ns 321 ns 1947826
bm<uint32_t, 8021, Op::Max_val> 494 ns 414 ns 2036364
bm<uint32_t, 8021, Op::Both_val> 673 ns 500 ns 1000000
bm<uint64_t, 8021, Op::Min> 4252 ns 3128 ns 344615
bm<uint64_t, 8021, Op::Max> 4360 ns 2567 ns 280000
bm<uint64_t, 8021, Op::Both> 4397 ns 2783 ns 213333
bm<uint64_t, 8021, Op::Min_val> 3844 ns 2441 ns 320000
bm<uint64_t, 8021, Op::Max_val> 3857 ns 2518 ns 235789
bm<uint64_t, 8021, Op::Both_val> 3862 ns 2319 ns 235789
bm<int8_t, 8021, Op::Min> 246 ns 176 ns 3733333
bm<int8_t, 8021, Op::Max> 235 ns 179 ns 4977778
bm<int8_t, 8021, Op::Both> 361 ns 239 ns 3200000
bm<int8_t, 8021, Op::Min_val> 126 ns 89.3 ns 11200000
bm<int8_t, 8021, Op::Max_val> 128 ns 94.2 ns 7466667
bm<int8_t, 8021, Op::Both_val> 3842 ns 2441 ns 224000
bm<int16_t, 8021, Op::Min> 460 ns 321 ns 2968994
bm<int16_t, 8021, Op::Max> 445 ns 308 ns 2133333
bm<int16_t, 8021, Op::Both> 683 ns 517 ns 1723077
bm<int16_t, 8021, Op::Min_val> 251 ns 176 ns 6400000
bm<int16_t, 8021, Op::Max_val> 249 ns 174 ns 5575111
bm<int16_t, 8021, Op::Both_val> 3318 ns 1709 ns 320000
bm<int32_t, 8021, Op::Min> 965 ns 719 ns 1000000
bm<int32_t, 8021, Op::Max> 903 ns 628 ns 1120000
bm<int32_t, 8021, Op::Both> 1405 ns 893 ns 1120000
bm<int32_t, 8021, Op::Min_val> 497 ns 307 ns 2036364
bm<int32_t, 8021, Op::Max_val> 505 ns 377 ns 1947826
bm<int32_t, 8021, Op::Both_val> 690 ns 401 ns 1792000
bm<int64_t, 8021, Op::Min> 4466 ns 3024 ns 263529
bm<int64_t, 8021, Op::Max> 4385 ns 2860 ns 224000
bm<int64_t, 8021, Op::Both> 4845 ns 2567 ns 280000
bm<int64_t, 8021, Op::Min_val> 5156 ns 1883 ns 1120000
bm<int64_t, 8021, Op::Max_val> 4003 ns 1664 ns 497778
bm<int64_t, 8021, Op::Both_val> 3847 ns 2052 ns 464593
bm<float, 8021, Op::Min> 1965 ns 928 ns 640000
bm<float, 8021, Op::Max> 2014 ns 949 ns 560000
bm<float, 8021, Op::Both> 2254 ns 1067 ns 746667
bm<float, 8021, Op::Min_val> 1870 ns 767 ns 1120000
bm<float, 8021, Op::Max_val> 1838 ns 984 ns 746667
bm<float, 8021, Op::Both_val> 1886 ns 894 ns 768981
bm<double, 8021, Op::Min> 3931 ns 2009 ns 373333
bm<double, 8021, Op::Max> 4004 ns 1646 ns 560000
bm<double, 8021, Op::Both> 4809 ns 1842 ns 280000
bm<double, 8021, Op::Min_val> 3776 ns 1855 ns 320000
bm<double, 8021, Op::Max_val> 3769 ns 1807 ns 320000
bm<double, 8021, Op::Both_val> 3850 ns 1674 ns 448000
After:
---------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min> 170 ns 80.2 ns 8960000
bm<uint8_t, 8021, Op::Max> 170 ns 82.3 ns 11200000
bm<uint8_t, 8021, Op::Both> 295 ns 85.0 ns 7719385
bm<uint8_t, 8021, Op::Min_val> 76.7 ns 26.2 ns 29866667
bm<uint8_t, 8021, Op::Max_val> 75.4 ns 28.0 ns 40727273
bm<uint8_t, 8021, Op::Both_val> 4312 ns 1283 ns 560000
bm<uint16_t, 8021, Op::Min> 336 ns 157 ns 8960000
bm<uint16_t, 8021, Op::Max> 318 ns 146 ns 7466667
bm<uint16_t, 8021, Op::Both> 553 ns 231 ns 3446154
bm<uint16_t, 8021, Op::Min_val> 142 ns 69.8 ns 22400000
bm<uint16_t, 8021, Op::Max_val> 139 ns 68.8 ns 10000000
bm<uint16_t, 8021, Op::Both_val> 4306 ns 2574 ns 248889
bm<uint32_t, 8021, Op::Min> 615 ns 363 ns 3009055
bm<uint32_t, 8021, Op::Max> 623 ns 325 ns 2357895
bm<uint32_t, 8021, Op::Both> 1071 ns 609 ns 1000000
bm<uint32_t, 8021, Op::Min_val> 258 ns 64.3 ns 12389136
bm<uint32_t, 8021, Op::Max_val> 258 ns 58.6 ns 11200000
bm<uint32_t, 8021, Op::Both_val> 374 ns 100 ns 10000000
bm<uint64_t, 8021, Op::Min> 3540 ns 663 ns 896000
bm<uint64_t, 8021, Op::Max> 3468 ns 922 ns 1000000
bm<uint64_t, 8021, Op::Both> 4271 ns 907 ns 1120000
bm<uint64_t, 8021, Op::Min_val> 2917 ns 680 ns 896000
bm<uint64_t, 8021, Op::Max_val> 2974 ns 797 ns 1000000
bm<uint64_t, 8021, Op::Both_val> 3090 ns 893 ns 1120000
bm<int8_t, 8021, Op::Min> 177 ns 38.1 ns 16000000
bm<int8_t, 8021, Op::Max> 177 ns 48.1 ns 14933333
bm<int8_t, 8021, Op::Both> 288 ns 85.4 ns 6400000
bm<int8_t, 8021, Op::Min_val> 77.4 ns 29.9 ns 20363636
bm<int8_t, 8021, Op::Max_val> 74.1 ns 22.6 ns 37333333
bm<int8_t, 8021, Op::Both_val> 3843 ns 1086 ns 719369
bm<int16_t, 8021, Op::Min> 339 ns 69.8 ns 8960000
bm<int16_t, 8021, Op::Max> 321 ns 73.4 ns 10000000
bm<int16_t, 8021, Op::Both> 553 ns 141 ns 11200000
bm<int16_t, 8021, Op::Min_val> 140 ns 49.8 ns 16000000
bm<int16_t, 8021, Op::Max_val> 139 ns 45.6 ns 20906668
bm<int16_t, 8021, Op::Both_val> 3377 ns 628 ns 746667
bm<int32_t, 8021, Op::Min> 620 ns 178 ns 7466667
bm<int32_t, 8021, Op::Max> 625 ns 159 ns 5120000
bm<int32_t, 8021, Op::Both> 1059 ns 261 ns 2635294
bm<int32_t, 8021, Op::Min_val> 254 ns 62.5 ns 10000000
bm<int32_t, 8021, Op::Max_val> 254 ns 64.5 ns 8960000
bm<int32_t, 8021, Op::Both_val> 379 ns 100 ns 11200000
bm<int64_t, 8021, Op::Min> 3532 ns 1287 ns 497778
bm<int64_t, 8021, Op::Max> 3462 ns 1221 ns 448000
bm<int64_t, 8021, Op::Both> 4130 ns 1953 ns 640000
bm<int64_t, 8021, Op::Min_val> 3011 ns 1726 ns 298667
bm<int64_t, 8021, Op::Max_val> 2945 ns 1632 ns 497778
bm<int64_t, 8021, Op::Both_val> 3264 ns 1378 ns 669014
bm<float, 8021, Op::Min> 1176 ns 401 ns 1947826
bm<float, 8021, Op::Max> 1208 ns 750 ns 1000000
bm<float, 8021, Op::Both> 1358 ns 1151 ns 746667
bm<float, 8021, Op::Min_val> 894 ns 802 ns 896000
bm<float, 8021, Op::Max_val> 891 ns 310 ns 4683093
bm<float, 8021, Op::Both_val> 949 ns 844 ns 1000000
bm<double, 8021, Op::Min> 2230 ns 1918 ns 448000
bm<double, 8021, Op::Max> 2421 ns 2134 ns 373333
bm<double, 8021, Op::Both> 2765 ns 2532 ns 407273
bm<double, 8021, Op::Min_val> 1860 ns 1674 ns 448000
bm<double, 8021, Op::Max_val> 1869 ns 1500 ns 448000
bm<double, 8021, Op::Both_val> 1939 ns 1381 ns 407273
I also dropped the bool _Unused
.
With extra dispatcher, the inlining decisions are different. Now, the dispatcher is inlined into the exported functions, along with the scalar implementation. The vector implementations are tail called, and signature variations are not likely to prevent that.
Results as a table
Benchmark | Before | After |
---|---|---|
bm<uint8_t, 8021, Op::Min> | 248 ns | 170 ns |
bm<uint8_t, 8021, Op::Max> | 238 ns | 170 ns |
bm<uint8_t, 8021, Op::Both> | 371 ns | 295 ns |
bm<uint8_t, 8021, Op::Min_val> | 131 ns | 76.7 ns |
bm<uint8_t, 8021, Op::Max_val> | 128 ns | 75.4 ns |
bm<uint8_t, 8021, Op::Both_val> | 4197 ns | 4312 ns |
bm<uint16_t, 8021, Op::Min> | 459 ns | 336 ns |
bm<uint16_t, 8021, Op::Max> | 445 ns | 318 ns |
bm<uint16_t, 8021, Op::Both> | 685 ns | 553 ns |
bm<uint16_t, 8021, Op::Min_val> | 247 ns | 142 ns |
bm<uint16_t, 8021, Op::Max_val> | 252 ns | 139 ns |
bm<uint16_t, 8021, Op::Both_val> | 4239 ns | 4306 ns |
bm<uint32_t, 8021, Op::Min> | 979 ns | 615 ns |
bm<uint32_t, 8021, Op::Max> | 932 ns | 623 ns |
bm<uint32_t, 8021, Op::Both> | 1439 ns | 1071 ns |
bm<uint32_t, 8021, Op::Min_val> | 501 ns | 258 ns |
bm<uint32_t, 8021, Op::Max_val> | 494 ns | 258 ns |
bm<uint32_t, 8021, Op::Both_val> | 673 ns | 374 ns |
bm<uint64_t, 8021, Op::Min> | 4252 ns | 3540 ns |
bm<uint64_t, 8021, Op::Max> | 4360 ns | 3468 ns |
bm<uint64_t, 8021, Op::Both> | 4397 ns | 4271 ns |
bm<uint64_t, 8021, Op::Min_val> | 3844 ns | 2917 ns |
bm<uint64_t, 8021, Op::Max_val> | 3857 ns | 2974 ns |
bm<uint64_t, 8021, Op::Both_val> | 3862 ns | 3090 ns |
bm<int8_t, 8021, Op::Min> | 246 ns | 177 ns |
bm<int8_t, 8021, Op::Max> | 235 ns | 177 ns |
bm<int8_t, 8021, Op::Both> | 361 ns | 288 ns |
bm<int8_t, 8021, Op::Min_val> | 126 ns | 77.4 ns |
bm<int8_t, 8021, Op::Max_val> | 128 ns | 74.1 ns |
bm<int8_t, 8021, Op::Both_val> | 3842 ns | 3843 ns |
bm<int16_t, 8021, Op::Min> | 460 ns | 339 ns |
bm<int16_t, 8021, Op::Max> | 445 ns | 321 ns |
bm<int16_t, 8021, Op::Both> | 683 ns | 553 ns |
bm<int16_t, 8021, Op::Min_val> | 251 ns | 140 ns |
bm<int16_t, 8021, Op::Max_val> | 249 ns | 139 ns |
bm<int16_t, 8021, Op::Both_val> | 3318 ns | 3377 ns |
bm<int32_t, 8021, Op::Min> | 965 ns | 620 ns |
bm<int32_t, 8021, Op::Max> | 903 ns | 625 ns |
bm<int32_t, 8021, Op::Both> | 1405 ns | 1059 ns |
bm<int32_t, 8021, Op::Min_val> | 497 ns | 254 ns |
bm<int32_t, 8021, Op::Max_val> | 505 ns | 254 ns |
bm<int32_t, 8021, Op::Both_val> | 690 ns | 379 ns |
bm<int64_t, 8021, Op::Min> | 4466 ns | 3532 ns |
bm<int64_t, 8021, Op::Max> | 4385 ns | 3462 ns |
bm<int64_t, 8021, Op::Both> | 4845 ns | 4130 ns |
bm<int64_t, 8021, Op::Min_val> | 5156 ns | 3011 ns |
bm<int64_t, 8021, Op::Max_val> | 4003 ns | 2945 ns |
bm<int64_t, 8021, Op::Both_val> | 3847 ns | 3264 ns |
bm<float, 8021, Op::Min> | 1965 ns | 1176 ns |
bm<float, 8021, Op::Max> | 2014 ns | 1208 ns |
bm<float, 8021, Op::Both> | 2254 ns | 1358 ns |
bm<float, 8021, Op::Min_val> | 1870 ns | 894 ns |
bm<float, 8021, Op::Max_val> | 1838 ns | 891 ns |
bm<float, 8021, Op::Both_val> | 1886 ns | 949 ns |
bm<double, 8021, Op::Min> | 3931 ns | 2230 ns |
bm<double, 8021, Op::Max> | 4004 ns | 2421 ns |
bm<double, 8021, Op::Both> | 4809 ns | 2765 ns |
bm<double, 8021, Op::Min_val> | 3776 ns | 1860 ns |
bm<double, 8021, Op::Max_val> | 3769 ns | 1869 ns |
bm<double, 8021, Op::Both_val> | 3850 ns | 1939 ns |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
AVX2 fast, AVX2 furious! 🚗 🚙 🏎️
2 participants