Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestri` by AlexGuteniev · Pull Request #4745 · microsoft/STL
A different approach for both the search and the inner comparison (SSE4.2 instead of AVX2). This time the results are better.
For now, only 1- and 2-byte elements are handled. The same approach, slightly modified, can be used for 4- and 8-byte elements, but it needs testing to see whether there would still be a performance gain.
In the benchmark results, 0 is the small needle and 1 is the large needle.
Benchmark | main | this |
---|---|---|
`c_strstr/0` | 186 ns | 184 ns |
`c_strstr/1` | 213 ns | 213 ns |
`classic_search<std::uint8_t>/0` | 2045 ns | 270 ns |
`classic_search<std::uint8_t>/1` | 2221 ns | 302 ns |
`classic_search<std::uint16_t>/0` | 1588 ns | 531 ns |
`classic_search<std::uint16_t>/1` | 1766 ns | 586 ns |
`ranges_search<std::uint8_t>/0` | 1748 ns | 268 ns |
`ranges_search<std::uint8_t>/1` | 1989 ns | 306 ns |
`ranges_search<std::uint16_t>/0` | 1673 ns | 585 ns |
`ranges_search<std::uint16_t>/1` | 1843 ns | 600 ns |
`search_default_searcher<std::uint8_t>/0` | 1494 ns | 269 ns |
`search_default_searcher<std::uint8_t>/1` | 1626 ns | 309 ns |
`search_default_searcher<std::uint16_t>/0` | 2002 ns | 528 ns |
`search_default_searcher<std::uint16_t>/1` | 2286 ns | 599 ns |
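To make the SSE4.2 idea concrete, here is a minimal sketch of a byte search built on `_mm_cmpestri` in "equal ordered" mode. It is not the code from this PR: the names `search_bytes` and `search_bytes_sse42` are hypothetical, the vector path is limited to needles of at most 16 bytes, it assumes the CPU supports SSE4.2 (a real implementation would check at run time), and it falls back to `std::search` in every other case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <nmmintrin.h> // SSE4.2 intrinsics, including _mm_cmpestri
#define SKETCH_HAS_SSE42 1
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE42 __attribute__((target("sse4.2"))) // enable SSE4.2 for this function only
#else
#define SKETCH_SSE42
#endif
#endif

#ifdef SKETCH_HAS_SSE42
// Hypothetical helper: needle of 1..16 bytes; the CPU is ASSUMED to support SSE4.2.
SKETCH_SSE42 inline const std::uint8_t* search_bytes_sse42(
    const std::uint8_t* hay, std::size_t hay_len,
    const std::uint8_t* needle, std::size_t needle_len) {
    assert(needle_len >= 1 && needle_len <= 16);
    constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT;

    alignas(16) std::uint8_t buf[16] = {};
    std::memcpy(buf, needle, needle_len);
    const __m128i needle_vec = _mm_load_si128(reinterpret_cast<const __m128i*>(buf));
    const int n = static_cast<int>(needle_len);

    std::size_t pos = 0;
    while (pos + needle_len <= hay_len) {
        const std::size_t avail = std::min<std::size_t>(16, hay_len - pos);
        __m128i chunk;
        if (avail == 16) {
            chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hay + pos));
        } else {
            // Tail: copy into a local buffer so we never read past the haystack.
            std::memset(buf, 0, sizeof(buf));
            std::memcpy(buf, hay + pos, avail);
            chunk = _mm_load_si128(reinterpret_cast<const __m128i*>(buf));
        }
        // Index of the first position where the needle matches; 16 means no match.
        // With a full 16-byte chunk, a match may be partial (needle cut off by the chunk end).
        const int idx = _mm_cmpestri(needle_vec, n, chunk, static_cast<int>(avail), mode);
        if (idx == 16) {
            pos += 16; // nothing starts in this chunk, not even a partial match
            continue;
        }
        if (static_cast<std::size_t>(idx) + needle_len <= avail) {
            return hay + pos + idx; // the whole needle fits inside the chunk: full match
        }
        pos += static_cast<std::size_t>(idx); // partial match: reload starting at the candidate
    }
    return nullptr;
}
#endif

// Hypothetical entry point: returns a pointer to the first match, or nullptr if there is none.
inline const std::uint8_t* search_bytes(
    const std::uint8_t* hay, std::size_t hay_len,
    const std::uint8_t* needle, std::size_t needle_len) {
    if (needle_len == 0) {
        return hay; // an empty needle matches at the start
    }
#ifdef SKETCH_HAS_SSE42
    if (needle_len <= 16) {
        return search_bytes_sse42(hay, hay_len, needle, needle_len);
    }
#endif
    const std::uint8_t* const last = hay + hay_len;
    const std::uint8_t* const it = std::search(hay, last, needle, needle + needle_len);
    return it == last ? nullptr : it;
}
```

The point of the sketch is that a single instruction both locates a candidate position and compares up to 16 needle bytes against it, which is why the search and the inner comparison can share the same approach. A match reported near the end of a full chunk may only cover a prefix of the needle, so the loop reloads from the candidate position before accepting it.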
Who's a good search? You are! Yes you!
…pred`.
`_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`.
`_Equal_rev_pred` is called by ranges `search`/`find_end`.
This doesn't affect `equal` etc.
might restore one or both later
Resolved conflicts in xutility.
Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:
Benchmark | main | PR | Speedup (Old/New) |
---|---|---|---|
`c_strstr/0` | 142 ns | 143 ns | 0.99 |
`c_strstr/1` | 157 ns | 162 ns | 0.97 |
`classic_search<std::uint8_t>/0` | 1976 ns | 160 ns | 12.35 |
`classic_search<std::uint8_t>/1` | 2153 ns | 175 ns | 12.30 |
`classic_search<std::uint16_t>/0` | 1432 ns | 310 ns | 4.62 |
`classic_search<std::uint16_t>/1` | 1557 ns | 344 ns | 4.53 |
`ranges_search<std::uint8_t>/0` | 1561 ns | 160 ns | 9.76 |
`ranges_search<std::uint8_t>/1` | 1689 ns | 176 ns | 9.60 |
`ranges_search<std::uint16_t>/0` | 1594 ns | 311 ns | 5.13 |
`ranges_search<std::uint16_t>/1` | 1747 ns | 345 ns | 5.06 |
`search_default_searcher<std::uint8_t>/0` | 1660 ns | 160 ns | 10.38 |
`search_default_searcher<std::uint8_t>/1` | 1796 ns | 174 ns | 10.32 |
`search_default_searcher<std::uint16_t>/0` | 2222 ns | 309 ns | 7.19 |
`search_default_searcher<std::uint16_t>/1` | 2421 ns | 345 ns | 7.02 |
Benchmark | main | PR | Speedup (Old/New) |
---|---|---|---|
`classic_search<std::uint32_t>/0` | 1970 ns | 1979 ns | 1.00 |
`classic_search<std::uint32_t>/1` | 2151 ns | 2148 ns | 1.00 |
`classic_search<std::uint64_t>/0` | 1423 ns | 1387 ns | 1.03 |
`classic_search<std::uint64_t>/1` | 1566 ns | 1527 ns | 1.03 |
`ranges_search<std::uint32_t>/0` | 1591 ns | 1611 ns | 0.99 |
`ranges_search<std::uint32_t>/1` | 1729 ns | 1760 ns | 0.98 |
`ranges_search<std::uint64_t>/0` | 1605 ns | 1543 ns | 1.04 |
`ranges_search<std::uint64_t>/1` | 1761 ns | 1691 ns | 1.04 |
`search_default_searcher<std::uint32_t>/0` | 2234 ns | 1609 ns | 1.39 |
`search_default_searcher<std::uint32_t>/1` | 2408 ns | 1752 ns | 1.37 |
`search_default_searcher<std::uint64_t>/0` | 1620 ns | 2193 ns | 0.74 |
`search_default_searcher<std::uint64_t>/1` | 1761 ns | 2366 ns | 0.74 |
Aside from `c_strstr`, which is of course unchanged, I'm also seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.
I am mildly confused as to why performance for `search_default_searcher` seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the `if constexpr` should be completely vanishing. Codegen gremlins? I don't think it should block merging though.
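As an illustration of why the `if constexpr` should vanish for 4- and 8-byte elements, here is a minimal sketch of size-based dispatch. The names `uses_vector_path`, `vectorized_search_stub`, and `search_dispatch` are hypothetical and are not the STL's internals; the stub simply forwards to `std::search` so that the example stays self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical trait: only 1- and 2-byte elements take the vectorized path.
template <class T>
constexpr bool uses_vector_path = sizeof(T) == 1 || sizeof(T) == 2;

// Hypothetical stand-in for a vectorized routine; it forwards to std::search here.
template <class T>
const T* vectorized_search_stub(const T* first1, const T* last1, const T* first2, const T* last2) {
    return std::search(first1, last1, first2, last2);
}

template <class T>
const T* search_dispatch(const T* first1, const T* last1, const T* first2, const T* last2) {
    if constexpr (uses_vector_path<T>) {
        // Instantiated only for 1- and 2-byte T.
        return vectorized_search_stub(first1, last1, first2, last2);
    } else {
        // For 4- and 8-byte T the branch above is discarded at compile time,
        // so this instantiation contains only the generic code path.
        return std::search(first1, last1, first2, last2);
    }
}
```

Because the discarded branch is never instantiated for `std::uint32_t` or `std::uint64_t`, any timing difference for those types would have to come from something other than the source-level logic, such as code placement.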
> I am mildly confused as to why performance for `search_default_searcher` seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the `if constexpr` should be completely vanishing. Codegen gremlins? I don't think it should block merging though.
I guess the biggest codegen gremlin is exact loop alignment. The compiler only aligns functions to a 16-byte boundary, whereas apparently a 32- or 64-byte boundary is what matters. You may try `/QIntel-jcc-erratum` (yes, even though you run on AMD!) for both `main` and the changed code, build the whole import lib and the benchmark executable with it, and see if this variation disappears.
I've seen this happen even when changing unrelated functions. That's why it isn't worth hunting for: eventually we will add or change even more unrelated functions, and the alignment will change again.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.