Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestri` by AlexGuteniev · Pull Request #4745 · microsoft/STL

This takes a different approach to both the search and the inner comparison, using SSE4.2 instead of AVX2. This time the results are better.

For now, this covers 1-byte and 2-byte elements only. A slightly modified version of the same approach could be used for 4-byte and 8-byte elements, but it still needs testing to see whether there would be a performance gain.
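To make the `pcmpestri` idea concrete, here is a minimal sketch of a byte search built on the `_mm_cmpestri` intrinsic in "equal ordered" mode. It is not the code from this PR: the function name `search_bytes_sketch` is hypothetical, it only handles needles of 1 to 16 bytes, and it assumes the file is compiled with SSE4.2 enabled (for example `-msse4.2` on GCC/Clang). Without that, or for other needle lengths, it simply falls back to `std::search`. A production version would also need runtime CPU detection, which is omitted here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__SSE4_2__)
#include <nmmintrin.h> // SSE4.2 intrinsics, including _mm_cmpestri
#endif

// Hypothetical helper, not the STL's internal routine.
// Returns a pointer to the first match, or hay + hay_len if there is none.
const unsigned char* search_bytes_sketch(const unsigned char* hay, std::size_t hay_len,
                                         const unsigned char* needle, std::size_t needle_len) {
    assert(hay != nullptr && needle != nullptr);
    const unsigned char* const hay_end = hay + hay_len;
#if defined(__SSE4_2__)
    if (needle_len >= 1 && needle_len <= 16) {
        // Zero-padded copy so the 16-byte load never reads past the needle.
        unsigned char needle_buf[16] = {};
        std::memcpy(needle_buf, needle, needle_len);
        const __m128i needle_vec = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle_buf));

        std::size_t pos = 0;
        while (pos + needle_len <= hay_len) {
            const std::size_t chunk_len = std::min<std::size_t>(16, hay_len - pos);
            __m128i chunk;
            if (chunk_len == 16) {
                chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hay + pos));
            } else {
                // Tail: copy into a padded buffer instead of reading out of bounds.
                unsigned char tail[16] = {};
                std::memcpy(tail, hay + pos, chunk_len);
                chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tail));
            }

            // Index of the first position in the chunk where the needle (or the
            // part of it that fits in the register) matches; 16 if there is none.
            const std::size_t idx = static_cast<std::size_t>(
                _mm_cmpestri(needle_vec, static_cast<int>(needle_len), chunk, static_cast<int>(chunk_len),
                    _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT));

            if (idx == 16) {
                pos += 16; // no candidate in this chunk
            } else if (idx + needle_len <= chunk_len) {
                return hay + pos + idx; // the whole needle was compared
            } else {
                pos += idx; // partial match at the chunk end: re-check from there
            }
        }
        return hay_end;
    }
#endif
    // Fallback for other needle lengths or builds without SSE4.2.
    return std::search(hay, hay_end, needle, needle + needle_len);
}
```

The partial-match branch is the part that differs from a plain "compare 16 bytes at a time" loop: a candidate that starts near the end of the register is only confirmed after the window is moved so that it begins at the candidate.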

In the benchmark results, 0 is the small needle and 1 is the large needle.

| Benchmark | main | this |
| --- | --- | --- |
| `c_strstr/0` | 186 ns | 184 ns |
| `c_strstr/1` | 213 ns | 213 ns |
| `classic_search<std::uint8_t>/0` | 2045 ns | 270 ns |
| `classic_search<std::uint8_t>/1` | 2221 ns | 302 ns |
| `classic_search<std::uint16_t>/0` | 1588 ns | 531 ns |
| `classic_search<std::uint16_t>/1` | 1766 ns | 586 ns |
| `ranges_search<std::uint8_t>/0` | 1748 ns | 268 ns |
| `ranges_search<std::uint8_t>/1` | 1989 ns | 306 ns |
| `ranges_search<std::uint16_t>/0` | 1673 ns | 585 ns |
| `ranges_search<std::uint16_t>/1` | 1843 ns | 600 ns |
| `search_default_searcher<std::uint8_t>/0` | 1494 ns | 269 ns |
| `search_default_searcher<std::uint8_t>/1` | 1626 ns | 309 ns |
| `search_default_searcher<std::uint16_t>/0` | 2002 ns | 528 ns |
| `search_default_searcher<std::uint16_t>/1` | 2286 ns | 599 ns |
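For readers unfamiliar with the benchmark names, the three non-`strstr` families presumably correspond to the three standard ways of calling the algorithm. The sketch below shows those call forms; the wrapper names (`classic_offset`, `ranges_offset`, `searcher_offset`) are hypothetical and are not taken from the benchmark source. The ranges form needs C++20, so the sketch falls back to the classic call when the library does not advertise ranges support.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using bytes = std::vector<std::uint8_t>;

// Classic iterator-pair overload of std::search.
std::size_t classic_offset(const bytes& hay, const bytes& needle) {
    return static_cast<std::size_t>(
        std::search(hay.begin(), hay.end(), needle.begin(), needle.end()) - hay.begin());
}

// C++20 std::ranges::search, which returns a subrange covering the match.
std::size_t ranges_offset(const bytes& hay, const bytes& needle) {
#if defined(__cpp_lib_ranges)
    return static_cast<std::size_t>(std::ranges::search(hay, needle).begin() - hay.begin());
#else
    return classic_offset(hay, needle); // pre-C++20 fallback, only for this sketch
#endif
}

// C++17 searcher overload with std::default_searcher.
std::size_t searcher_offset(const bytes& hay, const bytes& needle) {
    return static_cast<std::size_t>(
        std::search(hay.begin(), hay.end(), std::default_searcher(needle.begin(), needle.end()))
        - hay.begin());
}
```

All three return the offset of the first match, or `hay.size()` when the needle does not occur.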

Who's a good search? You are! Yes you!

…pred`.

`_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`.

`_Equal_rev_pred` is called by ranges `search`/`find_end`.

This doesn't affect `equal` etc.

might restore one or both later

Resolved conflicts in xutility.

Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:

| Benchmark | main | PR | Speedup (Old/New) |
| --- | --- | --- | --- |
| `c_strstr/0` | 142 ns | 143 ns | 0.99 |
| `c_strstr/1` | 157 ns | 162 ns | 0.97 |
| `classic_search<std::uint8_t>/0` | 1976 ns | 160 ns | 12.35 |
| `classic_search<std::uint8_t>/1` | 2153 ns | 175 ns | 12.30 |
| `classic_search<std::uint16_t>/0` | 1432 ns | 310 ns | 4.62 |
| `classic_search<std::uint16_t>/1` | 1557 ns | 344 ns | 4.53 |
| `ranges_search<std::uint8_t>/0` | 1561 ns | 160 ns | 9.76 |
| `ranges_search<std::uint8_t>/1` | 1689 ns | 176 ns | 9.60 |
| `ranges_search<std::uint16_t>/0` | 1594 ns | 311 ns | 5.13 |
| `ranges_search<std::uint16_t>/1` | 1747 ns | 345 ns | 5.06 |
| `search_default_searcher<std::uint8_t>/0` | 1660 ns | 160 ns | 10.38 |
| `search_default_searcher<std::uint8_t>/1` | 1796 ns | 174 ns | 10.32 |
| `search_default_searcher<std::uint16_t>/0` | 2222 ns | 309 ns | 7.19 |
| `search_default_searcher<std::uint16_t>/1` | 2421 ns | 345 ns | 7.02 |

| Benchmark | main | PR | Speedup (Old/New) |
| --- | --- | --- | --- |
| `classic_search<std::uint32_t>/0` | 1970 ns | 1979 ns | 1.00 |
| `classic_search<std::uint32_t>/1` | 2151 ns | 2148 ns | 1.00 |
| `classic_search<std::uint64_t>/0` | 1423 ns | 1387 ns | 1.03 |
| `classic_search<std::uint64_t>/1` | 1566 ns | 1527 ns | 1.03 |
| `ranges_search<std::uint32_t>/0` | 1591 ns | 1611 ns | 0.99 |
| `ranges_search<std::uint32_t>/1` | 1729 ns | 1760 ns | 0.98 |
| `ranges_search<std::uint64_t>/0` | 1605 ns | 1543 ns | 1.04 |
| `ranges_search<std::uint64_t>/1` | 1761 ns | 1691 ns | 1.04 |
| `search_default_searcher<std::uint32_t>/0` | 2234 ns | 1609 ns | 1.39 |
| `search_default_searcher<std::uint32_t>/1` | 2408 ns | 1752 ns | 1.37 |
| `search_default_searcher<std::uint64_t>/0` | 1620 ns | 2193 ns | 0.74 |
| `search_default_searcher<std::uint64_t>/1` | 1761 ns | 2366 ns | 0.74 |

Aside from c_strstr which is of course unchanged, I'm also seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.

I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.
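As an illustration of why the 4-byte and 8-byte paths should be untouched, here is a hypothetical sketch of size-based dispatch with `if constexpr`. The names `search_path` and `chosen_path` are invented for this example and do not appear in the STL; the only assumption carried over from the PR is that the vectorized branch is limited to 1-byte and 2-byte elements.

```cpp
#include <cstdint>

// Hypothetical labels for the two code paths.
enum class search_path { vectorized, scalar };

// For element types wider than 2 bytes, the first branch is discarded at
// compile time, so the instantiation contains only the scalar path.
template <class T>
constexpr search_path chosen_path() {
    if constexpr (sizeof(T) <= 2) {
        return search_path::vectorized;
    } else {
        return search_path::scalar;
    }
}

static_assert(chosen_path<std::uint8_t>() == search_path::vectorized);
static_assert(chosen_path<std::uint16_t>() == search_path::vectorized);
static_assert(chosen_path<std::uint32_t>() == search_path::scalar);
static_assert(chosen_path<std::uint64_t>() == search_path::scalar);
```

Because the discarded branch is never instantiated, any difference measured for 4-byte and 8-byte elements has to come from something other than the new logic itself.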

> I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

I guess the biggest codegen gremlin is exact loop alignment. The compiler only aligns functions to a 16-byte boundary, whereas apparently a 32-byte or 64-byte boundary is what matters. You could try `/QIntel-jcc-erratum` (yes, even though you run on AMD!) for both main and the changed code: build the whole import lib and the benchmark executable with it, and see whether this variation disappears.

I've seen this happen even when changing unrelated functions. That's why it isn't worth hunting for: eventually we will add or change even more unrelated functions, and the alignment will change again.

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

🔍 🕵️ 🔎
