Improve `basic_string::find_first_of` and `basic_string::find_last_of` vectorization for large needles or very large haystacks by AlexGuteniev · Pull Request #5029 · microsoft/STL
Follow up on #4934 (comment):
> The case `bm<AlgType::str_member_last, char>/400/50` is changing from 113 ns to 195 ns, a speedup of 0.58.

Looked closer into that case, and made it even faster than it was.
🗺️ Summary of changes
This PR consists of the following changes:
- Introduced the `__std_find_first_of_trivial_pos_N` family that is used by strings and string views. The existing `__std_find_first_of_trivial_N` is still used by the standalone algorithm
- Moved most of the vectorization decision making into the separately compiled code (further simplifying control flow in the header code as a side effect)
- Added a vectorized bitmap algorithm, in addition to the existing vectorized nested loop (two of them for different element sizes), scalar bitmap, and scalar nested loop algorithms
- Reimplemented a copy of the scalar bitmap algorithm in the separately compiled code
- Implemented a threshold system that better corresponds to the expected run time
- Restored using the scalar bitmap algorithm in the header in `constexpr` context, because why not
⚙️ Vector bitmap algorithm
It is an AVX2-only algorithm. It processes 8 values at once.
Like the existing scalar bitmap algorithm, it can be used when no needle character value exceeds 255. Instead of an array of 256 `bool` values, it uses an actual bitmap. The whole bitmap fits into a `__m256i` variable, that is, an AVX2 register.
If another AVX2 register contains 8 32-bit values that are indices of 32-bit bitmap parts, `_mm256_permutevar8x32_epi32` (`vpermd`) can look up 8 parts at once. The indices of the parts are the high 3 bits of the 8-bit values. The low 5 bits can then be used to obtain the exact bit within the 32-bit part by a shift. AVX2 has variable 32-bit shifts that take a vector of shift counts instead of a single count for all elements: `_mm256_srlv_epi32`, `_mm256_sllv_epi32`. The resulting mask can be obtained with `_mm256_movemask_ps`.
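The lookup described above can be modeled in portable scalar C++, with one loop iteration per vector lane. This is only a sketch for following the bit arithmetic, not the AVX2 code from the PR: `Bitmap256`, `make_bitmap` and `match_mask8` are hypothetical names, and the comments note which instruction each step stands in for.

```cpp
#include <cstddef>
#include <cstdint>

// 8 x 32 bits = 256 bits, the amount of data one __m256i register holds.
struct Bitmap256 {
    std::uint32_t part[8];
};

// Plain scalar bitmap building: high 3 bits pick the part, low 5 bits pick the bit.
constexpr Bitmap256 make_bitmap(const unsigned char* needle, std::size_t n) {
    Bitmap256 bm = {};
    for (std::size_t i = 0; i < n; ++i) {
        bm.part[needle[i] >> 5] |= std::uint32_t{1} << (needle[i] & 31);
    }
    return bm;
}

// Models one vector step: 8 haystack values in, 8-bit match mask out.
constexpr unsigned match_mask8(const Bitmap256& bm, const unsigned char* hay8) {
    unsigned mask = 0;
    for (int lane = 0; lane < 8; ++lane) {
        const unsigned v = hay8[lane];
        // Stand-in for vpermd: each lane selects a 32-bit part by the high 3 bits.
        const std::uint32_t part = bm.part[v >> 5];
        // Stand-in for a per-lane variable shift by the low 5 bits.
        const std::uint32_t bit = (part >> (v & 31)) & 1;
        // Stand-in for movemask: collect one bit per lane.
        mask |= bit << lane;
    }
    return mask;
}
```

The position of the first match within the 8 values is the index of the lowest set bit of the returned mask; a zero mask means none of the 8 values is in the needle.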
Bitmap building
Small needles
Unfortunately, there's no instruction in AVX2 that can combine bits from different values of the same vector into a single element. This means that the bitmap building has to be fully scalar, or at least partially scalar (doing some processing in parallel, but the final steps in scalar).
The scalar bitmap building loop performs rather poorly, worse than a loop that builds a `bool` array. So I implemented a loop that uses vector instructions for that. It uses vector registers and no stack, and it seems faster than creating a stack array and loading it afterwards. The key thing in this approach is that a value from one of the shifts is expanded via `_mm256_cvtepu8_epi64`, so a 32-bit shift becomes a 256-bit shift of a lower granularity; the granularity is added back by another shift.
I've managed to get only a slight improvement when trying to partially parallelize it, and the complexity of bitmap building grew significantly, so let's probably not do it.
Different variations of non-parallel bitmap building have about the same performance, so I kept almost the same one that I tried at first, except that I adjusted it to work fine on 32-bit x86 too.
Large needles
The vector instructions loop that builds the bitmap performs poorly relative to the `bool` array building loop. At some point it makes sense to build a `bool` array and compress it to a bitmap. As the size of the array/bitmap is constant, the compression is a constant instruction sequence, without a loop, and it takes constant time.
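The two-phase idea for large needles can be sketched in portable C++ as follows. This is an assumption-laden illustration: `PackedBitmap`, `compress_table` and `build_bitmap_large_needle` are hypothetical names, and the fixed 256-iteration loop merely stands in for the constant, loop-free vector sequence the text describes.

```cpp
#include <cstddef>
#include <cstdint>

struct PackedBitmap {
    std::uint32_t part[8]; // 256 bits in total
};

// Fixed amount of work regardless of needle length: always visits 256 flags.
constexpr PackedBitmap compress_table(const bool (&table)[256]) {
    PackedBitmap bm = {};
    for (unsigned v = 0; v < 256; ++v) {
        if (table[v]) {
            bm.part[v >> 5] |= std::uint32_t{1} << (v & 31);
        }
    }
    return bm;
}

constexpr PackedBitmap build_bitmap_large_needle(const unsigned char* needle, std::size_t n) {
    // Phase 1: cheap per-element work, a single store per needle element.
    bool table[256] = {};
    for (std::size_t i = 0; i < n; ++i) {
        table[needle[i]] = true;
    }
    // Phase 2: constant-cost compression of the table into the bitmap.
    return compress_table(table);
}
```

The per-element cost is paid only in phase 1, which is why this variant is attractive once the needle is long enough to amortize the fixed cost of phase 2.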
Test for suitable element values
This is done separately, before creating the bitmap. This separate check is vectorized, and allows bailing out quickly, without building the bitmap, if the values aren't suitable. There isn't a specific benchmark for that currently, but I think this would work.
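A scalar sketch of such a pre-check is shown below. The OR-reduction is my assumption about one vectorization-friendly way to write it (no branch inside the loop), not a description of the PR's actual code, and `all_fit_in_byte` is a hypothetical name.

```cpp
#include <cstddef>
#include <type_traits>

// Returns true when every needle value is in 0..255, so a 256-bit bitmap can represent it.
template <class T>
constexpr bool all_fit_in_byte(const T* needle, std::size_t n) {
    using U = std::make_unsigned_t<T>;
    unsigned long long combined = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // OR all values together: any bit above the low 8 survives in the result.
        combined |= static_cast<U>(needle[i]);
    }
    return combined <= 255;
}
```

If the check fails, the caller can skip bitmap building entirely and go straight to a nested loop algorithm.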
Advantage over existing
The cases where the product of the needle and haystack lengths is big enough to make the existing vector algorithms bad, but the haystack is still way bigger than the needle, so the scalar bitmap lookup is also bad. Added some of them to the benchmark.
Surprisingly, this extends to the case with very small needles. With more than about 1000 haystack elements, vector bitmap wins over SSE4.2 even for just a few needle elements.
Can we have this in SSE?
No. There's `_mm256_shuffle_epi8` to do the bitmap parts extraction, but there's no variable vector shift. There isn't even a variable vector shift in AVX2 with a vector element width smaller than 32. So there's probably nothing better than using an 8-element AVX2 vector.
⚖️ Selecting algorithm
⚠️ Actual vs run time vs full haystack length
The problem with estimating run time in advance is that we don't know how long it will run. The algorithm doesn't scan the full haystack if the position is found earlier.
But when selecting the algorithm we know only the full length. Knowing the full length, we can at least estimate the worst case.
Let's still start with the worst case; we'll get back to the early return possibility later on.
Run time evaluation
The nested loop algorithms, both the scalar one and both vector ones, are O(m*n), and the vector algorithms are definitely preferred for any noticeably high values of m and n.
Also, any bitmap algorithm is faster than the scalar nested loop, unless the element is found in the very first position. So we can safely exclude the scalar nested loop from consideration.
Both scalar and vector bitmap algorithms are some sort of O(n + m), and they have quite different weights of m and n. Specifically, the vector bitmap algorithm treats needle length way worse than haystack length, because this part is not parallel, and the scalar bitmap algorithm treats them almost equally (surprisingly, the needle has slightly less weight). Due to the large needle mode, the difference in needle impact on run time between vector and scalar bitmap is constant, in favor of scalar bitmap. This justifies a constant threshold, which benchmarking put at about 48.
The vector nested loop algorithm clearly outperforms the others when both n and m are small, so their product is also small. In specific cases the vector algorithm is linear, namely when either n or m fits within a single vectorization unit. In this case it doesn't even have a nested loop (for a short needle it is a deliberate optimization, for a small haystack it is the result of the separate haystack tail processing).
After benchmarking these edge cases, it can be seen that the vector nested loop outperforms everything for long needle / small haystack, but it doesn't always outperform vector bitmap for short needle / large haystack. The former allows excluding the scalar bitmap algorithm from consideration: with any not very small haystack, the vector bitmap algorithm's advantage is noticeable. The very small set of cases where scalar bitmap can win (small but not very small haystack and long needle) still doesn't give it a solid win; these cases are ultimately bound by the same scalar bitmap building loop for both algorithms. The benchmark here may still show a noticeable difference, but only because these are different instances of that loop, and some codegen or other random factors might affect it.
So we need to pick:
- Between AVX bitmap and scalar bitmap for AVX2, which we'll do using a threshold
- Between AVX bitmap and vector nested loop for AVX2 and enough haystack length for AVX bitmap
- Between scalar bitmap and vector nested loop for SSE4.2 or not enough haystack length for AVX bitmap
It is hard to reason about the threshold functions, so the thresholds were obtained by aggressive benchmarking.
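The shape of the three-way choice listed above might look like the sketch below. Everything here is hypothetical: the names, the rule based on the length product, and the number 1024 are invented for illustration, and treating the "about 48" threshold as a haystack length is my reading of the text. The PR's real thresholds come from benchmarking and are not reproduced here.

```cpp
#include <cstddef>

enum class Algo { vector_nested_loop, vector_bitmap, scalar_bitmap };

// Placeholder thresholds for illustration only.
struct Thresholds {
    std::size_t min_haystack_for_vector_bitmap = 48; // "about 48" from the text, assumed to be a haystack length
    std::size_t max_product_for_nested_loop = 1024;  // invented number, not from the PR
};

constexpr Algo pick_algorithm(bool has_avx2, bool values_fit_in_byte,
                              std::size_t hay_len, std::size_t needle_len,
                              Thresholds t = Thresholds{}) {
    // Bitmaps only work when every needle value is in 0..255.
    if (!values_fit_in_byte) {
        return Algo::vector_nested_loop;
    }
    // hay_len * needle_len <= limit, written as a division to avoid overflow.
    const bool small_product =
        needle_len == 0 || hay_len <= t.max_product_for_nested_loop / needle_len;
    if (small_product) {
        return Algo::vector_nested_loop;
    }
    // Among the bitmaps, the vector one needs AVX2 and a long enough haystack.
    const bool use_vector_bitmap =
        has_avx2 && hay_len >= t.min_haystack_for_vector_bitmap;
    return use_vector_bitmap ? Algo::vector_bitmap : Algo::scalar_bitmap;
}
```

With these placeholder numbers, a 2203-element haystack with a 54-element needle selects the vector bitmap on AVX2 and the scalar bitmap otherwise, while a 7-element haystack with a 4-element needle stays on the vector nested loop.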
Considering early return
There is a possibility of early return.
If we don't consider it, we may pick a bitmap algorithm where the vector nested loop is better.
If we expect it, but it does not happen, we may pick the vector nested loop when a bitmap algorithm is better.
It looks like the latter gives the worse error.
Generally the price of an error is small for short needles. Long needles are gambling cases. But even for long needles the price of not picking the vector nested loop when it is better is no more than 2x.
Why is this dispatch not in headers?
No big reason.
There's an overflow multiply intrinsic used from `<intrin.h>`, but that one is not essential.
Maybe this will also make maintenance easier, by having fewer functions exposed from `vector_algorithm.cpp`.
Otherwise I guess I'm just hiding the complexity under a carpet.
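To illustrate why the intrinsic is "not essential": an overflow-aware product of the two lengths can be computed portably, without `<intrin.h>`. The sketch below uses hypothetical helper names and is not taken from the PR.

```cpp
#include <cstddef>
#include <limits>

// True when a * b does not fit in size_t.
constexpr bool mul_overflows(std::size_t a, std::size_t b) {
    return a != 0 && b > std::numeric_limits<std::size_t>::max() / a;
}

// Product clamped to the maximum size_t value, so that a threshold comparison
// against the product still gives a sensible answer for huge inputs.
constexpr std::size_t saturating_mul(std::size_t a, std::size_t b) {
    return mul_overflows(a, b) ? std::numeric_limits<std::size_t>::max() : a * b;
}
```

A dispatcher could then compare `saturating_mul(haystack_length, needle_length)` against a threshold without worrying about wraparound.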
🛑 Risks
This time I don't see anything that seems incorrect, but it is a complex change with some risks to consider:
- Regressing some performance for some cases due to spending some time deciding/dispatching. I know, but it is a small one.
- Regressing some performance due to a potentially sometimes worse choice of algorithm. The current thresholds give a better big picture, but in some border cases they might give a slightly worse answer
- In particular, they might give a worse choice for the best case, where the element is found immediately (discussed above)
- Different performance behavior on different CPUs might break the fine tuning. Older AMDs that do AVX2 in two takes are the main concern.
- Complexity of the vector tricks as usual
- Changed `__std_find_last_of_trivial_pos_N` usage, see below
Changed `__std_find_last_of_trivial_pos_N` usage
`__std_find_last_of_trivial_pos_N` has been shipped in #4934. Now it does the bitmap, which is not what old code expects. The worst that would happen, though, is that when the header implementation fails the scalar bitmap due to bad values, this would unnecessarily try the bitmap again. This time the attempt would be even faster due to the vectorization of checking, unless the user does not have SSE4.2.
I just don't want to add more functions with more names just for this reason.
Not wanting to have this situation for another function is the reason I made this PR before the `_not_` vectorization (remaining for the find 🐱 family).
⏱️ Benchmark results
Benchmark | main | this |
---|---|---|
bm<AlgType::str_member_first, char>/2/3 | 5.39 ns | 5.43 ns |
bm<AlgType::str_member_first, char>/6/81 | 35.0 ns | 23.2 ns |
bm<AlgType::str_member_first, char>/7/4 | 12.8 ns | 15.7 ns |
bm<AlgType::str_member_first, char>/9/3 | 11.1 ns | 13.8 ns |
bm<AlgType::str_member_first, char>/22/5 | 11.2 ns | 14.6 ns |
bm<AlgType::str_member_first, char>/58/2 | 12.7 ns | 14.7 ns |
bm<AlgType::str_member_first, char>/75/85 | 55.8 ns | 46.1 ns |
bm<AlgType::str_member_first, char>/102/4 | 16.2 ns | 17.5 ns |
bm<AlgType::str_member_first, char>/200/46 | 73.7 ns | 38.4 ns |
bm<AlgType::str_member_first, char>/325/1 | 34.0 ns | 36.8 ns |
bm<AlgType::str_member_first, char>/400/50 | 129 ns | 53.4 ns |
bm<AlgType::str_member_first, char>/1011/11 | 91.3 ns | 106 ns |
bm<AlgType::str_member_first, char>/1280/46 | 436 ns | 126 ns |
bm<AlgType::str_member_first, char>/1502/23 | 356 ns | 138 ns |
bm<AlgType::str_member_first, char>/2203/54 | 554 ns | 206 ns |
bm<AlgType::str_member_first, char>/3056/7 | 264 ns | 232 ns |
bm<AlgType::str_member_first, wchar_t>/2/3 | 14.3 ns | 13.3 ns |
bm<AlgType::str_member_first, wchar_t>/6/81 | 41.1 ns | 44.9 ns |
bm<AlgType::str_member_first, wchar_t>/7/4 | 17.3 ns | 18.3 ns |
bm<AlgType::str_member_first, wchar_t>/9/3 | 13.7 ns | 18.4 ns |
bm<AlgType::str_member_first, wchar_t>/22/5 | 14.4 ns | 19.2 ns |
bm<AlgType::str_member_first, wchar_t>/58/2 | 18.5 ns | 23.2 ns |
bm<AlgType::str_member_first, wchar_t>/75/85 | 76.0 ns | 60.6 ns |
bm<AlgType::str_member_first, wchar_t>/102/4 | 25.6 ns | 29.7 ns |
bm<AlgType::str_member_first, wchar_t>/200/46 | 110 ns | 54.5 ns |
bm<AlgType::str_member_first, wchar_t>/325/1 | 64.5 ns | 46.8 ns |
bm<AlgType::str_member_first, wchar_t>/400/50 | 184 ns | 65.1 ns |
bm<AlgType::str_member_first, wchar_t>/1011/11 | 479 ns | 117 ns |
bm<AlgType::str_member_first, wchar_t>/1280/46 | 487 ns | 154 ns |
bm<AlgType::str_member_first, wchar_t>/1502/23 | 692 ns | 163 ns |
bm<AlgType::str_member_first, wchar_t>/2203/54 | 809 ns | 269 ns |
bm<AlgType::str_member_first, wchar_t>/3056/7 | 557 ns | 327 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2/3 | 16.1 ns | 17.2 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/6/81 | 195 ns | 29.3 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/7/4 | 26.0 ns | 18.1 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/9/3 | 13.4 ns | 18.5 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/22/5 | 14.1 ns | 19.4 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/58/2 | 18.5 ns | 23.2 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/75/85 | 189 ns | 170 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/102/4 | 25.9 ns | 29.9 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/200/46 | 277 ns | 247 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/325/1 | 64.3 ns | 69.0 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/400/50 | 613 ns | 532 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1011/11 | 513 ns | 394 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1280/46 | 1631 ns | 1414 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1502/23 | 995 ns | 838 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2203/54 | 3135 ns | 2828 ns |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/3056/7 | 559 ns | 564 ns |
bm<AlgType::str_member_first, char32_t>/2/3 | 13.0 ns | 11.9 ns |
bm<AlgType::str_member_first, char32_t>/6/81 | 40.0 ns | 25.3 ns |
bm<AlgType::str_member_first, char32_t>/7/4 | 15.6 ns | 16.7 ns |
bm<AlgType::str_member_first, char32_t>/9/3 | 13.9 ns | 17.6 ns |
bm<AlgType::str_member_first, char32_t>/22/5 | 14.3 ns | 20.9 ns |
bm<AlgType::str_member_first, char32_t>/58/2 | 14.3 ns | 22.2 ns |
bm<AlgType::str_member_first, char32_t>/75/85 | 61.3 ns | 55.2 ns |
bm<AlgType::str_member_first, char32_t>/102/4 | 16.4 ns | 27.2 ns |
bm<AlgType::str_member_first, char32_t>/200/46 | 110 ns | 46.5 ns |
bm<AlgType::str_member_first, char32_t>/325/1 | 27.3 ns | 39.1 ns |
bm<AlgType::str_member_first, char32_t>/400/50 | 183 ns | 60.6 ns |
bm<AlgType::str_member_first, char32_t>/1011/11 | 333 ns | 127 ns |
bm<AlgType::str_member_first, char32_t>/1280/46 | 489 ns | 142 ns |
bm<AlgType::str_member_first, char32_t>/1502/23 | 555 ns | 164 ns |
bm<AlgType::str_member_first, char32_t>/2203/54 | 818 ns | 250 ns |
bm<AlgType::str_member_first, char32_t>/3056/7 | 539 ns | 281 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2/3 | 17.0 ns | 13.9 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/6/81 | 189 ns | 25.7 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/7/4 | 27.9 ns | 16.7 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/9/3 | 14.2 ns | 16.9 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/22/5 | 14.9 ns | 20.1 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/58/2 | 15.2 ns | 18.8 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/75/85 | 202 ns | 203 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/102/4 | 16.8 ns | 22.4 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/200/46 | 284 ns | 283 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/325/1 | 25.1 ns | 29.9 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/400/50 | 597 ns | 601 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1011/11 | 333 ns | 330 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1280/46 | 1731 ns | 1739 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1502/23 | 1011 ns | 1002 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2203/54 | 3445 ns | 3492 ns |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/3056/7 | 541 ns | 541 ns |
bm<AlgType::str_member_last, char>/2/3 | 5.15 ns | 5.19 ns |
bm<AlgType::str_member_last, char>/6/81 | 31.2 ns | 21.0 ns |
bm<AlgType::str_member_last, char>/7/4 | 11.8 ns | 16.2 ns |
bm<AlgType::str_member_last, char>/9/3 | 10.6 ns | 13.2 ns |
bm<AlgType::str_member_last, char>/22/5 | 11.2 ns | 13.7 ns |
bm<AlgType::str_member_last, char>/58/2 | 12.3 ns | 14.9 ns |
bm<AlgType::str_member_last, char>/75/85 | 58.2 ns | 43.1 ns |
bm<AlgType::str_member_last, char>/102/4 | 15.2 ns | 17.7 ns |
bm<AlgType::str_member_last, char>/200/46 | 60.6 ns | 34.9 ns |
bm<AlgType::str_member_last, char>/325/1 | 34.7 ns | 36.7 ns |
bm<AlgType::str_member_last, char>/400/50 | 138 ns | 50.3 ns |
bm<AlgType::str_member_last, char>/1011/11 | 94.9 ns | 91.4 ns |
bm<AlgType::str_member_last, char>/1280/46 | 363 ns | 113 ns |
bm<AlgType::str_member_last, char>/1502/23 | 290 ns | 128 ns |
bm<AlgType::str_member_last, char>/2203/54 | 606 ns | 204 ns |
bm<AlgType::str_member_last, char>/3056/7 | 270 ns | 251 ns |
bm<AlgType::str_member_last, wchar_t>/2/3 | 13.3 ns | 10.8 ns |
bm<AlgType::str_member_last, wchar_t>/6/81 | 42.0 ns | 49.9 ns |
bm<AlgType::str_member_last, wchar_t>/7/4 | 15.7 ns | 16.2 ns |
bm<AlgType::str_member_last, wchar_t>/9/3 | 13.6 ns | 17.0 ns |
bm<AlgType::str_member_last, wchar_t>/22/5 | 14.6 ns | 18.2 ns |
bm<AlgType::str_member_last, wchar_t>/58/2 | 18.0 ns | 20.8 ns |
bm<AlgType::str_member_last, wchar_t>/75/85 | 82.8 ns | 58.4 ns |
bm<AlgType::str_member_last, wchar_t>/102/4 | 24.7 ns | 29.9 ns |
bm<AlgType::str_member_last, wchar_t>/200/46 | 118 ns | 49.7 ns |
bm<AlgType::str_member_last, wchar_t>/325/1 | 61.5 ns | 43.5 ns |
bm<AlgType::str_member_last, wchar_t>/400/50 | 191 ns | 62.6 ns |
bm<AlgType::str_member_last, wchar_t>/1011/11 | 404 ns | 115 ns |
bm<AlgType::str_member_last, wchar_t>/1280/46 | 493 ns | 153 ns |
bm<AlgType::str_member_last, wchar_t>/1502/23 | 587 ns | 162 ns |
bm<AlgType::str_member_last, wchar_t>/2203/54 | 830 ns | 259 ns |
bm<AlgType::str_member_last, wchar_t>/3056/7 | 529 ns | 326 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2/3 | 15.7 ns | 13.5 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/6/81 | 159 ns | 28.9 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/7/4 | 25.4 ns | 17.3 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/9/3 | 14.3 ns | 18.1 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/22/5 | 15.3 ns | 18.5 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/58/2 | 18.2 ns | 21.6 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/75/85 | 189 ns | 166 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/102/4 | 24.7 ns | 29.1 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/200/46 | 265 ns | 255 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/325/1 | 62.0 ns | 67.4 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/400/50 | 568 ns | 525 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1011/11 | 507 ns | 400 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1280/46 | 1617 ns | 1391 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1502/23 | 1030 ns | 854 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2203/54 | 3165 ns | 2720 ns |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/3056/7 | 525 ns | 563 ns |
AlexGuteniev changed the title from "Improve `basic_string::find_first_of` and `basic_string::find_last_of` vectorization for large needles" to "Improve `basic_string::find_first_of` and `basic_string::find_last_of` vectorization for large needles or very large haystacks"
5950X benchmark results:
Benchmark | Before | After | Speedup |
---|---|---|---|
bm<AlgType::std_func, uint8_t>/2/3 | 5.74 ns | 6.08 ns | 0.94 |
bm<AlgType::std_func, uint8_t>/6/81 | 117 ns | 115 ns | 1.02 |
bm<AlgType::std_func, uint8_t>/7/4 | 11.1 ns | 12.6 ns | 0.88 |
bm<AlgType::std_func, uint8_t>/9/3 | 13.8 ns | 13.8 ns | 1.00 |
bm<AlgType::std_func, uint8_t>/22/5 | 14.4 ns | 13.8 ns | 1.04 |
bm<AlgType::std_func, uint8_t>/58/2 | 15.7 ns | 15.4 ns | 1.02 |
bm<AlgType::std_func, uint8_t>/75/85 | 35.2 ns | 35.9 ns | 0.98 |
bm<AlgType::std_func, uint8_t>/102/4 | 16.6 ns | 15.8 ns | 1.05 |
bm<AlgType::std_func, uint8_t>/200/46 | 44.5 ns | 44.6 ns | 1.00 |
bm<AlgType::std_func, uint8_t>/325/1 | 8.67 ns | 10.4 ns | 0.83 |
bm<AlgType::std_func, uint8_t>/400/50 | 111 ns | 114 ns | 0.97 |
bm<AlgType::std_func, uint8_t>/1011/11 | 59.5 ns | 53.1 ns | 1.12 |
bm<AlgType::std_func, uint8_t>/1280/46 | 241 ns | 251 ns | 0.96 |
bm<AlgType::std_func, uint8_t>/1502/23 | 203 ns | 204 ns | 1.00 |
bm<AlgType::std_func, uint8_t>/2203/54 | 559 ns | 525 ns | 1.06 |
bm<AlgType::std_func, uint8_t>/3056/7 | 141 ns | 149 ns | 0.95 |
bm<AlgType::std_func, uint16_t>/2/3 | 5.63 ns | 5.85 ns | 0.96 |
bm<AlgType::std_func, uint16_t>/6/81 | 117 ns | 118 ns | 0.99 |
bm<AlgType::std_func, uint16_t>/7/4 | 13.1 ns | 12.2 ns | 1.07 |
bm<AlgType::std_func, uint16_t>/9/3 | 15.4 ns | 15.2 ns | 1.01 |
bm<AlgType::std_func, uint16_t>/22/5 | 15.4 ns | 16.2 ns | 0.95 |
bm<AlgType::std_func, uint16_t>/58/2 | 17.2 ns | 17.1 ns | 1.01 |
bm<AlgType::std_func, uint16_t>/75/85 | 123 ns | 129 ns | 0.95 |
bm<AlgType::std_func, uint16_t>/102/4 | 20.4 ns | 21.2 ns | 0.96 |
bm<AlgType::std_func, uint16_t>/200/46 | 154 ns | 187 ns | 0.82 |
bm<AlgType::std_func, uint16_t>/325/1 | 12.4 ns | 14.1 ns | 0.88 |
bm<AlgType::std_func, uint16_t>/400/50 | 328 ns | 502 ns | 0.65 |
bm<AlgType::std_func, uint16_t>/1011/11 | 257 ns | 300 ns | 0.86 |
bm<AlgType::std_func, uint16_t>/1280/46 | 938 ns | 1120 ns | 0.84 |
bm<AlgType::std_func, uint16_t>/1502/23 | 533 ns | 650 ns | 0.82 |
bm<AlgType::std_func, uint16_t>/2203/54 | 1696 ns | 2178 ns | 0.78 |
bm<AlgType::std_func, uint16_t>/3056/7 | 272 ns | 292 ns | 0.93 |
bm<AlgType::std_func, uint32_t>/2/3 | 5.73 ns | 6.70 ns | 0.86 |
bm<AlgType::std_func, uint32_t>/6/81 | 117 ns | 139 ns | 0.84 |
bm<AlgType::std_func, uint32_t>/7/4 | 11.5 ns | 16.2 ns | 0.71 |
bm<AlgType::std_func, uint32_t>/9/3 | 9.94 ns | 11.4 ns | 0.87 |
bm<AlgType::std_func, uint32_t>/22/5 | 13.6 ns | 18.8 ns | 0.72 |
bm<AlgType::std_func, uint32_t>/58/2 | 9.74 ns | 11.8 ns | 0.83 |
bm<AlgType::std_func, uint32_t>/75/85 | 166 ns | 191 ns | 0.87 |
bm<AlgType::std_func, uint32_t>/102/4 | 15.5 ns | 16.1 ns | 0.96 |
bm<AlgType::std_func, uint32_t>/200/46 | 248 ns | 251 ns | 0.99 |
bm<AlgType::std_func, uint32_t>/325/1 | 18.2 ns | 19.2 ns | 0.95 |
bm<AlgType::std_func, uint32_t>/400/50 | 522 ns | 502 ns | 1.04 |
bm<AlgType::std_func, uint32_t>/1011/11 | 282 ns | 271 ns | 1.04 |
bm<AlgType::std_func, uint32_t>/1280/46 | 1459 ns | 1440 ns | 1.01 |
bm<AlgType::std_func, uint32_t>/1502/23 | 846 ns | 823 ns | 1.03 |
bm<AlgType::std_func, uint32_t>/2203/54 | 2967 ns | 2832 ns | 1.05 |
bm<AlgType::std_func, uint32_t>/3056/7 | 340 ns | 359 ns | 0.95 |
bm<AlgType::std_func, uint64_t>/2/3 | 6.22 ns | 5.72 ns | 1.09 |
bm<AlgType::std_func, uint64_t>/6/81 | 121 ns | 122 ns | 0.99 |
bm<AlgType::std_func, uint64_t>/7/4 | 13.0 ns | 11.8 ns | 1.10 |
bm<AlgType::std_func, uint64_t>/9/3 | 9.82 ns | 9.82 ns | 1.00 |
bm<AlgType::std_func, uint64_t>/22/5 | 15.4 ns | 12.4 ns | 1.24 |
bm<AlgType::std_func, uint64_t>/58/2 | 18.7 ns | 12.4 ns | 1.51 |
bm<AlgType::std_func, uint64_t>/75/85 | 352 ns | 362 ns | 0.97 |
bm<AlgType::std_func, uint64_t>/102/4 | 29.0 ns | 32.0 ns | 0.91 |
bm<AlgType::std_func, uint64_t>/200/46 | 500 ns | 524 ns | 0.95 |
bm<AlgType::std_func, uint64_t>/325/1 | 44.8 ns | 50.6 ns | 0.89 |
bm<AlgType::std_func, uint64_t>/400/50 | 1049 ns | 1075 ns | 0.98 |
bm<AlgType::std_func, uint64_t>/1011/11 | 581 ns | 580 ns | 1.00 |
bm<AlgType::std_func, uint64_t>/1280/46 | 3032 ns | 3120 ns | 0.97 |
bm<AlgType::std_func, uint64_t>/1502/23 | 1790 ns | 1865 ns | 0.96 |
bm<AlgType::std_func, uint64_t>/2203/54 | 6070 ns | 6541 ns | 0.93 |
bm<AlgType::std_func, uint64_t>/3056/7 | 1029 ns | 1135 ns | 0.91 |
bm<AlgType::str_member_first, char>/2/3 | 8.73 ns | 9.09 ns | 0.96 |
bm<AlgType::str_member_first, char>/6/81 | 27.7 ns | 27.7 ns | 1.00 |
bm<AlgType::str_member_first, char>/7/4 | 10.1 ns | 22.8 ns | 0.44 |
bm<AlgType::str_member_first, char>/9/3 | 13.8 ns | 19.8 ns | 0.70 |
bm<AlgType::str_member_first, char>/22/5 | 14.0 ns | 19.0 ns | 0.74 |
bm<AlgType::str_member_first, char>/58/2 | 15.8 ns | 19.8 ns | 0.80 |
bm<AlgType::str_member_first, char>/75/85 | 50.8 ns | 49.2 ns | 1.03 |
bm<AlgType::str_member_first, char>/102/4 | 22.8 ns | 20.6 ns | 1.11 |
bm<AlgType::str_member_first, char>/200/46 | 44.9 ns | 49.1 ns | 0.91 |
bm<AlgType::str_member_first, char>/325/1 | 27.9 ns | 32.8 ns | 0.85 |
bm<AlgType::str_member_first, char>/400/50 | 114 ns | 68.5 ns | 1.66 |
bm<AlgType::str_member_first, char>/1011/11 | 60.0 ns | 124 ns | 0.48 |
bm<AlgType::str_member_first, char>/1280/46 | 243 ns | 141 ns | 1.72 |
bm<AlgType::str_member_first, char>/1502/23 | 196 ns | 132 ns | 1.48 |
bm<AlgType::str_member_first, char>/2203/54 | 503 ns | 182 ns | 2.76 |
bm<AlgType::str_member_first, char>/3056/7 | 141 ns | 222 ns | 0.64 |
bm<AlgType::str_member_first, wchar_t>/2/3 | 11.8 ns | 12.7 ns | 0.93 |
bm<AlgType::str_member_first, wchar_t>/6/81 | 35.4 ns | 66.5 ns | 0.53 |
bm<AlgType::str_member_first, wchar_t>/7/4 | 14.4 ns | 30.1 ns | 0.48 |
bm<AlgType::str_member_first, wchar_t>/9/3 | 17.2 ns | 27.7 ns | 0.62 |
bm<AlgType::str_member_first, wchar_t>/22/5 | 17.9 ns | 27.9 ns | 0.64 |
bm<AlgType::str_member_first, wchar_t>/58/2 | 20.3 ns | 26.8 ns | 0.76 |
bm<AlgType::str_member_first, wchar_t>/75/85 | 70.6 ns | 56.0 ns | 1.26 |
bm<AlgType::str_member_first, wchar_t>/102/4 | 29.2 ns | 31.7 ns | 0.92 |
bm<AlgType::str_member_first, wchar_t>/200/46 | 156 ns | 64.4 ns | 2.42 |
bm<AlgType::str_member_first, wchar_t>/325/1 | 42.2 ns | 50.1 ns | 0.84 |
bm<AlgType::str_member_first, wchar_t>/400/50 | 252 ns | 74.7 ns | 3.37 |
bm<AlgType::str_member_first, wchar_t>/1011/11 | 261 ns | 128 ns | 2.04 |
bm<AlgType::str_member_first, wchar_t>/1280/46 | 590 ns | 166 ns | 3.55 |
bm<AlgType::str_member_first, wchar_t>/1502/23 | 675 ns | 194 ns | 3.48 |
bm<AlgType::str_member_first, wchar_t>/2203/54 | 969 ns | 242 ns | 4.00 |
bm<AlgType::str_member_first, wchar_t>/3056/7 | 264 ns | 305 ns | 0.87 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2/3 | 13.3 ns | 13.1 ns | 1.02 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/6/81 | 125 ns | 32.7 ns | 3.82 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/7/4 | 21.3 ns | 21.6 ns | 0.99 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/9/3 | 17.5 ns | 22.2 ns | 0.79 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/22/5 | 17.9 ns | 23.6 ns | 0.76 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/58/2 | 19.6 ns | 25.2 ns | 0.78 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/75/85 | 117 ns | 142 ns | 0.82 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/102/4 | 22.8 ns | 28.7 ns | 0.79 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/200/46 | 160 ns | 188 ns | 0.85 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/325/1 | 42.1 ns | 55.8 ns | 0.75 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/400/50 | 335 ns | 512 ns | 0.65 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1011/11 | 303 ns | 405 ns | 0.75 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1280/46 | 1062 ns | 1413 ns | 0.75 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1502/23 | 821 ns | 798 ns | 1.03 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2203/54 | 1979 ns | 2779 ns | 0.71 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/3056/7 | 269 ns | 362 ns | 0.74 |
bm<AlgType::str_member_first, char32_t>/2/3 | 11.9 ns | 14.9 ns | 0.80 |
bm<AlgType::str_member_first, char32_t>/6/81 | 35.7 ns | 38.2 ns | 0.93 |
bm<AlgType::str_member_first, char32_t>/7/4 | 13.5 ns | 24.0 ns | 0.56 |
bm<AlgType::str_member_first, char32_t>/9/3 | 13.0 ns | 23.9 ns | 0.54 |
bm<AlgType::str_member_first, char32_t>/22/5 | 14.8 ns | 22.1 ns | 0.67 |
bm<AlgType::str_member_first, char32_t>/58/2 | 12.5 ns | 22.9 ns | 0.55 |
bm<AlgType::str_member_first, char32_t>/75/85 | 71.2 ns | 50.4 ns | 1.41 |
bm<AlgType::str_member_first, char32_t>/102/4 | 18.5 ns | 28.4 ns | 0.65 |
bm<AlgType::str_member_first, char32_t>/200/46 | 112 ns | 54.7 ns | 2.05 |
bm<AlgType::str_member_first, char32_t>/325/1 | 24.1 ns | 42.1 ns | 0.57 |
bm<AlgType::str_member_first, char32_t>/400/50 | 198 ns | 75.9 ns | 2.61 |
bm<AlgType::str_member_first, char32_t>/1011/11 | 268 ns | 116 ns | 2.31 |
bm<AlgType::str_member_first, char32_t>/1280/46 | 566 ns | 141 ns | 4.01 |
bm<AlgType::str_member_first, char32_t>/1502/23 | 697 ns | 156 ns | 4.47 |
bm<AlgType::str_member_first, char32_t>/2203/54 | 960 ns | 203 ns | 4.73 |
bm<AlgType::str_member_first, char32_t>/3056/7 | 407 ns | 243 ns | 1.67 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2/3 | 14.9 ns | 11.7 ns | 1.27 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/6/81 | 123 ns | 30.3 ns | 4.06 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/7/4 | 19.8 ns | 18.2 ns | 1.09 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/9/3 | 12.9 ns | 18.6 ns | 0.69 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/22/5 | 15.0 ns | 21.5 ns | 0.70 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/58/2 | 12.4 ns | 18.4 ns | 0.67 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/75/85 | 174 ns | 181 ns | 0.96 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/102/4 | 18.4 ns | 24.9 ns | 0.74 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/200/46 | 247 ns | 272 ns | 0.91 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/325/1 | 23.6 ns | 29.5 ns | 0.80 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/400/50 | 489 ns | 495 ns | 0.99 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1011/11 | 265 ns | 273 ns | 0.97 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1280/46 | 1387 ns | 1409 ns | 0.98 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1502/23 | 806 ns | 805 ns | 1.00 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2203/54 | 2773 ns | 2885 ns | 0.96 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/3056/7 | 341 ns | 352 ns | 0.97 |
bm<AlgType::str_member_last, char>/2/3 | 8.54 ns | 7.87 ns | 1.09 |
bm<AlgType::str_member_last, char>/6/81 | 26.5 ns | 25.4 ns | 1.04 |
bm<AlgType::str_member_last, char>/7/4 | 9.84 ns | 21.1 ns | 0.47 |
bm<AlgType::str_member_last, char>/9/3 | 13.9 ns | 19.2 ns | 0.72 |
bm<AlgType::str_member_last, char>/22/5 | 14.6 ns | 19.1 ns | 0.76 |
bm<AlgType::str_member_last, char>/58/2 | 16.0 ns | 19.4 ns | 0.82 |
bm<AlgType::str_member_last, char>/75/85 | 46.3 ns | 42.8 ns | 1.08 |
bm<AlgType::str_member_last, char>/102/4 | 16.3 ns | 20.0 ns | 0.82 |
bm<AlgType::str_member_last, char>/200/46 | 44.9 ns | 42.2 ns | 1.06 |
bm<AlgType::str_member_last, char>/325/1 | 34.3 ns | 30.4 ns | 1.13 |
bm<AlgType::str_member_last, char>/400/50 | 156 ns | 54.4 ns | 2.87 |
bm<AlgType::str_member_last, char>/1011/11 | 77.0 ns | 93.1 ns | 0.83 |
bm<AlgType::str_member_last, char>/1280/46 | 237 ns | 119 ns | 1.99 |
bm<AlgType::str_member_last, char>/1502/23 | 196 ns | 128 ns | 1.53 |
bm<AlgType::str_member_last, char>/2203/54 | 497 ns | 177 ns | 2.81 |
bm<AlgType::str_member_last, char>/3056/7 | 142 ns | 219 ns | 0.65 |
bm<AlgType::str_member_last, wchar_t>/2/3 | 12.2 ns | 11.3 ns | 1.08 |
bm<AlgType::str_member_last, wchar_t>/6/81 | 50.5 ns | 52.9 ns | 0.95 |
bm<AlgType::str_member_last, wchar_t>/7/4 | 13.8 ns | 17.9 ns | 0.77 |
bm<AlgType::str_member_last, wchar_t>/9/3 | 16.9 ns | 20.1 ns | 0.84 |
bm<AlgType::str_member_last, wchar_t>/22/5 | 17.5 ns | 20.4 ns | 0.86 |
bm<AlgType::str_member_last, wchar_t>/58/2 | 19.7 ns | 22.3 ns | 0.88 |
bm<AlgType::str_member_last, wchar_t>/75/85 | 75.9 ns | 49.1 ns | 1.55 |
bm<AlgType::str_member_last, wchar_t>/102/4 | 22.6 ns | 27.3 ns | 0.83 |
bm<AlgType::str_member_last, wchar_t>/200/46 | 120 ns | 55.3 ns | 2.17 |
bm<AlgType::str_member_last, wchar_t>/325/1 | 48.3 ns | 45.6 ns | 1.06 |
bm<AlgType::str_member_last, wchar_t>/400/50 | 207 ns | 68.9 ns | 3.00 |
bm<AlgType::str_member_last, wchar_t>/1011/11 | 446 ns | 125 ns | 3.57 |
bm<AlgType::str_member_last, wchar_t>/1280/46 | 576 ns | 153 ns | 3.76 |
bm<AlgType::str_member_last, wchar_t>/1502/23 | 724 ns | 171 ns | 4.23 |
bm<AlgType::str_member_last, wchar_t>/2203/54 | 993 ns | 224 ns | 4.43 |
bm<AlgType::str_member_last, wchar_t>/3056/7 | 266 ns | 276 ns | 0.96 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2/3 | 12.6 ns | 12.1 ns | 1.04 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/6/81 | 109 ns | 27.2 ns | 4.01 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/7/4 | 23.2 ns | 17.9 ns | 1.30 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/9/3 | 17.0 ns | 20.0 ns | 0.85 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/22/5 | 17.6 ns | 21.4 ns | 0.82 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/58/2 | 20.2 ns | 24.1 ns | 0.84 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/75/85 | 122 ns | 119 ns | 1.03 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/102/4 | 23.5 ns | 26.2 ns | 0.90 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/200/46 | 262 ns | 156 ns | 1.68 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/325/1 | 77.5 ns | 51.6 ns | 1.50 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/400/50 | 336 ns | 332 ns | 1.01 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1011/11 | 388 ns | 273 ns | 1.42 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1280/46 | 893 ns | 881 ns | 1.01 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1502/23 | 782 ns | 545 ns | 1.43 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2203/54 | 2100 ns | 1817 ns | 1.16 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/3056/7 | 267 ns | 275 ns | 0.97 |
I am indeed observing a nice speedup for the originally motivating bm<AlgType::str_member_last, char>/400/50 case (2.87 speedup), but also various regressions. bm<AlgType::str_member_first, char>/1011/11 shows a speedup of 0.48 (60.0 ns => 124 ns), i.e. it now takes about twice as long, and that looks like a big haystack where we shouldn't be suffering from small-haystack effects.
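For reference, the Speedup column in these tables is the "before" time divided by the "after" time, so values below 1.00 are regressions. A minimal sketch of that arithmetic follows; the helper names `speedup` and `is_regression` are made up for illustration and are not part of the benchmark code.

```cpp
// Hypothetical helpers illustrating how the Speedup column is derived.
// Speedup = time before the change / time after the change.
constexpr double speedup(double before_ns, double after_ns) {
    return before_ns / after_ns;
}

// A ratio below 1.0 means the "after" build is slower than the "before" build.
constexpr bool is_regression(double before_ns, double after_ns) {
    return speedup(before_ns, after_ns) < 1.0;
}
```

With the numbers quoted above, `speedup(156.0, 54.4)` is roughly 2.87 and `speedup(60.0, 124.0)` is roughly 0.48.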
BillyONeal commented (edited by StephanTLavavej):
Benchmark results from 14900HX:
Benchmark | Before (ns) | After (ns) | Speedup |
---|---|---|---|
bm<AlgType::std_func, uint8_t>/2/3 | 3.17 | 2.63 | 1.21 |
bm<AlgType::std_func, uint8_t>/6/81 | 135 | 143 | 0.94 |
bm<AlgType::std_func, uint8_t>/7/4 | 13.4 | 14 | 0.96 |
bm<AlgType::std_func, uint8_t>/9/3 | 8.22 | 8.61 | 0.95 |
bm<AlgType::std_func, uint8_t>/22/5 | 8.87 | 8.92 | 0.99 |
bm<AlgType::std_func, uint8_t>/58/2 | 9.68 | 9.97 | 0.97 |
bm<AlgType::std_func, uint8_t>/75/85 | 44.1 | 45 | 0.98 |
bm<AlgType::std_func, uint8_t>/102/4 | 11.2 | 11.7 | 0.96 |
bm<AlgType::std_func, uint8_t>/200/46 | 56.9 | 58.2 | 0.98 |
bm<AlgType::std_func, uint8_t>/325/1 | 3.91 | 3.96 | 0.99 |
bm<AlgType::std_func, uint8_t>/400/50 | 144 | 147 | 0.98 |
bm<AlgType::std_func, uint8_t>/1011/11 | 79.5 | 76.3 | 1.04 |
bm<AlgType::std_func, uint8_t>/1280/46 | 341 | 351 | 0.97 |
bm<AlgType::std_func, uint8_t>/1502/23 | 279 | 292 | 0.96 |
bm<AlgType::std_func, uint8_t>/2203/54 | 752 | 755 | 1.00 |
bm<AlgType::std_func, uint8_t>/3056/7 | 201 | 221 | 0.91 |
bm<AlgType::std_func, uint16_t>/2/3 | 3.05 | 3.34 | 0.91 |
bm<AlgType::std_func, uint16_t>/6/81 | 137 | 139 | 0.99 |
bm<AlgType::std_func, uint16_t>/7/4 | 13.4 | 13.9 | 0.96 |
bm<AlgType::std_func, uint16_t>/9/3 | 9.83 | 9.13 | 1.08 |
bm<AlgType::std_func, uint16_t>/22/5 | 10.2 | 10.2 | 1.00 |
bm<AlgType::std_func, uint16_t>/58/2 | 12.8 | 13.4 | 0.96 |
bm<AlgType::std_func, uint16_t>/75/85 | 149 | 130 | 1.15 |
bm<AlgType::std_func, uint16_t>/102/4 | 18.2 | 18.9 | 0.96 |
bm<AlgType::std_func, uint16_t>/200/46 | 219 | 191 | 1.15 |
bm<AlgType::std_func, uint16_t>/325/1 | 10.2 | 10.3 | 0.99 |
bm<AlgType::std_func, uint16_t>/400/50 | 462 | 413 | 1.12 |
bm<AlgType::std_func, uint16_t>/1011/11 | 370 | 314 | 1.18 |
bm<AlgType::std_func, uint16_t>/1280/46 | 1285 | 1117 | 1.15 |
bm<AlgType::std_func, uint16_t>/1502/23 | 816 | 659 | 1.24 |
bm<AlgType::std_func, uint16_t>/2203/54 | 2444 | 2182 | 1.12 |
bm<AlgType::std_func, uint16_t>/3056/7 | 430 | 433 | 0.99 |
bm<AlgType::std_func, uint32_t>/2/3 | 2.9 | 2.59 | 1.12 |
bm<AlgType::std_func, uint32_t>/6/81 | 138 | 135 | 1.02 |
bm<AlgType::std_func, uint32_t>/7/4 | 13.5 | 14.7 | 0.92 |
bm<AlgType::std_func, uint32_t>/9/3 | 4.57 | 4.62 | 0.99 |
bm<AlgType::std_func, uint32_t>/22/5 | 11 | 12.1 | 0.91 |
bm<AlgType::std_func, uint32_t>/58/2 | 10.6 | 10.4 | 1.02 |
bm<AlgType::std_func, uint32_t>/75/85 | 156 | 154 | 1.01 |
bm<AlgType::std_func, uint32_t>/102/4 | 12.4 | 12.7 | 0.98 |
bm<AlgType::std_func, uint32_t>/200/46 | 221 | 220 | 1.00 |
bm<AlgType::std_func, uint32_t>/325/1 | 15.5 | 14.2 | 1.09 |
bm<AlgType::std_func, uint32_t>/400/50 | 476 | 462 | 1.03 |
bm<AlgType::std_func, uint32_t>/1011/11 | 259 | 242 | 1.07 |
bm<AlgType::std_func, uint32_t>/1280/46 | 1344 | 1351 | 0.99 |
bm<AlgType::std_func, uint32_t>/1502/23 | 770 | 773 | 1.00 |
bm<AlgType::std_func, uint32_t>/2203/54 | 2663 | 2679 | 0.99 |
bm<AlgType::std_func, uint32_t>/3056/7 | 413 | 421 | 0.98 |
bm<AlgType::std_func, uint64_t>/2/3 | 2.51 | 2.87 | 0.87 |
bm<AlgType::std_func, uint64_t>/6/81 | 131 | 138 | 0.95 |
bm<AlgType::std_func, uint64_t>/7/4 | 15.4 | 13.6 | 1.13 |
bm<AlgType::std_func, uint64_t>/9/3 | 5.09 | 8.89 | 0.57 |
bm<AlgType::std_func, uint64_t>/22/5 | 10 | 10.4 | 0.96 |
bm<AlgType::std_func, uint64_t>/58/2 | 12.2 | 12 | 1.02 |
bm<AlgType::std_func, uint64_t>/75/85 | 290 | 288 | 1.01 |
bm<AlgType::std_func, uint64_t>/102/4 | 23.7 | 23.4 | 1.01 |
bm<AlgType::std_func, uint64_t>/200/46 | 406 | 407 | 1.00 |
bm<AlgType::std_func, uint64_t>/325/1 | 34.7 | 36.6 | 0.95 |
bm<AlgType::std_func, uint64_t>/400/50 | 862 | 878 | 0.98 |
bm<AlgType::std_func, uint64_t>/1011/11 | 476 | 479 | 0.99 |
bm<AlgType::std_func, uint64_t>/1280/46 | 2527 | 2476 | 1.02 |
bm<AlgType::std_func, uint64_t>/1502/23 | 1557 | 1477 | 1.05 |
bm<AlgType::std_func, uint64_t>/2203/54 | 5112 | 4959 | 1.03 |
bm<AlgType::std_func, uint64_t>/3056/7 | 902 | 897 | 1.01 |
bm<AlgType::str_member_first, char>/2/3 | 4.28 | 4.35 | 0.98 |
bm<AlgType::str_member_first, char>/6/81 | 26.7 | 17 | 1.57 |
bm<AlgType::str_member_first, char>/7/4 | 10.2 | 14.1 | 0.72 |
bm<AlgType::str_member_first, char>/9/3 | 8.59 | 12.2 | 0.70 |
bm<AlgType::str_member_first, char>/22/5 | 9.02 | 12.4 | 0.73 |
bm<AlgType::str_member_first, char>/58/2 | 10.1 | 12.8 | 0.79 |
bm<AlgType::str_member_first, char>/75/85 | 43.7 | 37.9 | 1.15 |
bm<AlgType::str_member_first, char>/102/4 | 11.6 | 14.1 | 0.82 |
bm<AlgType::str_member_first, char>/200/46 | 57.1 | 30.5 | 1.87 |
bm<AlgType::str_member_first, char>/325/1 | 26 | 28.8 | 0.90 |
bm<AlgType::str_member_first, char>/400/50 | 100 | 40.6 | 2.46 |
bm<AlgType::str_member_first, char>/1011/11 | 70.8 | 83.7 | 0.85 |
bm<AlgType::str_member_first, char>/1280/46 | 342 | 101 | 3.39 |
bm<AlgType::str_member_first, char>/1502/23 | 273 | 108 | 2.53 |
bm<AlgType::str_member_first, char>/2203/54 | 428 | 151 | 2.83 |
bm<AlgType::str_member_first, char>/3056/7 | 203 | 185 | 1.10 |
bm<AlgType::str_member_first, wchar_t>/2/3 | 11.9 | 11.1 | 1.07 |
bm<AlgType::str_member_first, wchar_t>/6/81 | 33.7 | 38.2 | 0.88 |
bm<AlgType::str_member_first, wchar_t>/7/4 | 13.9 | 15.8 | 0.88 |
bm<AlgType::str_member_first, wchar_t>/9/3 | 11.7 | 16.3 | 0.72 |
bm<AlgType::str_member_first, wchar_t>/22/5 | 12.3 | 17.2 | 0.72 |
bm<AlgType::str_member_first, wchar_t>/58/2 | 14.8 | 19.5 | 0.76 |
bm<AlgType::str_member_first, wchar_t>/75/85 | 60.8 | 48.3 | 1.26 |
bm<AlgType::str_member_first, wchar_t>/102/4 | 19.9 | 26 | 0.77 |
bm<AlgType::str_member_first, wchar_t>/200/46 | 87.4 | 44.4 | 1.97 |
bm<AlgType::str_member_first, wchar_t>/325/1 | 50 | 33.9 | 1.47 |
bm<AlgType::str_member_first, wchar_t>/400/50 | 144 | 52.7 | 2.73 |
bm<AlgType::str_member_first, wchar_t>/1011/11 | 381 | 94.3 | 4.04 |
bm<AlgType::str_member_first, wchar_t>/1280/46 | 388 | 122 | 3.18 |
bm<AlgType::str_member_first, wchar_t>/1502/23 | 559 | 132 | 4.23 |
bm<AlgType::str_member_first, wchar_t>/2203/54 | 647 | 207 | 3.13 |
bm<AlgType::str_member_first, wchar_t>/3056/7 | 432 | 259 | 1.67 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2/3 | 13.9 | 14.1 | 0.99 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/6/81 | 152 | 25.9 | 5.87 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/7/4 | 21.2 | 16.5 | 1.28 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/9/3 | 11.8 | 17 | 0.69 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/22/5 | 12.2 | 17.4 | 0.70 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/58/2 | 14.8 | 20.3 | 0.73 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/75/85 | 158 | 135 | 1.17 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/102/4 | 20 | 24.9 | 0.80 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/200/46 | 223 | 195 | 1.14 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/325/1 | 50.1 | 55.9 | 0.90 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/400/50 | 470 | 411 | 1.14 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1011/11 | 378 | 309 | 1.22 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1280/46 | 1287 | 1116 | 1.15 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1502/23 | 816 | 653 | 1.25 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2203/54 | 2489 | 2155 | 1.15 |
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/3056/7 | 433 | 436 | 0.99 |
bm<AlgType::str_member_first, char32_t>/2/3 | 11.3 | 10.2 | 1.11 |
bm<AlgType::str_member_first, char32_t>/6/81 | 31.5 | 20.9 | 1.51 |
bm<AlgType::str_member_first, char32_t>/7/4 | 12.8 | 15.4 | 0.83 |
bm<AlgType::str_member_first, char32_t>/9/3 | 11.4 | 16.2 | 0.70 |
bm<AlgType::str_member_first, char32_t>/22/5 | 12.6 | 18.2 | 0.69 |
bm<AlgType::str_member_first, char32_t>/58/2 | 12 | 20.9 | 0.57 |
bm<AlgType::str_member_first, char32_t>/75/85 | 49 | 50 | 0.98 |
bm<AlgType::str_member_first, char32_t>/102/4 | 13.8 | 25.5 | 0.54 |
bm<AlgType::str_member_first, char32_t>/200/46 | 86.5 | 39.5 | 2.19 |
bm<AlgType::str_member_first, char32_t>/325/1 | 21.5 | 30.2 | 0.71 |
bm<AlgType::str_member_first, char32_t>/400/50 | 147 | 48.5 | 3.03 |
bm<AlgType::str_member_first, char32_t>/1011/11 | 266 | 96.5 | 2.76 |
bm<AlgType::str_member_first, char32_t>/1280/46 | 385 | 105 | 3.67 |
bm<AlgType::str_member_first, char32_t>/1502/23 | 449 | 113 | 3.97 |
bm<AlgType::str_member_first, char32_t>/2203/54 | 640 | 181 | 3.54 |
bm<AlgType::str_member_first, char32_t>/3056/7 | 414 | 226 | 1.83 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2/3 | 13.7 | 12.2 | 1.12 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/6/81 | 150 | 20.4 | 7.35 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/7/4 | 21.7 | 15.1 | 1.44 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/9/3 | 11.9 | 15.7 | 0.76 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/22/5 | 13.3 | 18.4 | 0.72 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/58/2 | 12.8 | 16.8 | 0.76 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/75/85 | 163 | 164 | 0.99 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/102/4 | 14.4 | 19.5 | 0.74 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/200/46 | 221 | 222 | 1.00 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/325/1 | 21.6 | 27.7 | 0.78 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/400/50 | 472 | 470 | 1.00 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1011/11 | 266 | 263 | 1.01 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1280/46 | 1356 | 1368 | 0.99 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1502/23 | 777 | 784 | 0.99 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2203/54 | 2688 | 2716 | 0.99 |
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/3056/7 | 424 | 423 | 1.00 |
bm<AlgType::str_member_last, char>/2/3 | 4.3 | 4.27 | 1.01 |
bm<AlgType::str_member_last, char>/6/81 | 24.9 | 18.8 | 1.32 |
bm<AlgType::str_member_last, char>/7/4 | 9.93 | 14.4 | 0.69 |
bm<AlgType::str_member_last, char>/9/3 | 8.57 | 12.7 | 0.67 |
bm<AlgType::str_member_last, char>/22/5 | 9.33 | 13.5 | 0.69 |
bm<AlgType::str_member_last, char>/58/2 | 10.2 | 14 | 0.73 |
bm<AlgType::str_member_last, char>/75/85 | 43.9 | 35.9 | 1.22 |
bm<AlgType::str_member_last, char>/102/4 | 12 | 14.8 | 0.81 |
bm<AlgType::str_member_last, char>/200/46 | 47.4 | 31 | 1.53 |
bm<AlgType::str_member_last, char>/325/1 | 27.1 | 30 | 0.90 |
bm<AlgType::str_member_last, char>/400/50 | 107 | 41.2 | 2.60 |
bm<AlgType::str_member_last, char>/1011/11 | 76.1 | 71.1 | 1.07 |
bm<AlgType::str_member_last, char>/1280/46 | 287 | 88.6 | 3.24 |
bm<AlgType::str_member_last, char>/1502/23 | 228 | 95.4 | 2.39 |
bm<AlgType::str_member_last, char>/2203/54 | 472 | 153 | 3.08 |
bm<AlgType::str_member_last, char>/3056/7 | 211 | 190 | 1.11 |
bm<AlgType::str_member_last, wchar_t>/2/3 | 10.9 | 10.6 | 1.03 |
bm<AlgType::str_member_last, wchar_t>/6/81 | 33.1 | 42.3 | 0.78 |
bm<AlgType::str_member_last, wchar_t>/7/4 | 13.4 | 15.7 | 0.85 |
bm<AlgType::str_member_last, wchar_t>/9/3 | 12.2 | 16.1 | 0.76 |
bm<AlgType::str_member_last, wchar_t>/22/5 | 12.7 | 16.4 | 0.77 |
bm<AlgType::str_member_last, wchar_t>/58/2 | 14.7 | 18.4 | 0.80 |
bm<AlgType::str_member_last, wchar_t>/75/85 | 66 | 48.9 | 1.35 |
bm<AlgType::str_member_last, wchar_t>/102/4 | 19.3 | 27.1 | 0.71 |
bm<AlgType::str_member_last, wchar_t>/200/46 | 94.4 | 44.6 | 2.12 |
bm<AlgType::str_member_last, wchar_t>/325/1 | 48.4 | 33 | 1.47 |
bm<AlgType::str_member_last, wchar_t>/400/50 | 150 | 51.4 | 2.92 |
bm<AlgType::str_member_last, wchar_t>/1011/11 | 322 | 94.4 | 3.41 |
bm<AlgType::str_member_last, wchar_t>/1280/46 | 395 | 126 | 3.13 |
bm<AlgType::str_member_last, wchar_t>/1502/23 | 463 | 134 | 3.46 |
bm<AlgType::str_member_last, wchar_t>/2203/54 | 648 | 222 | 2.92 |
bm<AlgType::str_member_last, wchar_t>/3056/7 | 404 | 271 | 1.49 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2/3 | 12.9 | 12.1 | 1.07 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/6/81 | 126 | 25 | 5.04 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/7/4 | 20.3 | 15.9 | 1.28 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/9/3 | 12.2 | 16.5 | 0.74 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/22/5 | 12.9 | 16.7 | 0.77 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/58/2 | 14.8 | 18.8 | 0.79 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/75/85 | 154 | 133 | 1.16 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/102/4 | 19.8 | 25.3 | 0.78 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/200/46 | 214 | 195 | 1.10 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/325/1 | 47.8 | 54.3 | 0.88 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/400/50 | 484 | 421 | 1.15 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1011/11 | 400 | 317 | 1.26 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1280/46 | 1266 | 1192 | 1.06 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1502/23 | 806 | 652 | 1.24 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2203/54 | 2478 | 2166 | 1.14 |
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/3056/7 | 401 | 431 | 0.93 |
We talked about potential mix-and-match issues between 17.13 and 17.14 at the weekly maintainer meeting. (We believe it can affect both the changes to __std_find_first_of_trivial_N as well as __std_find_last_of_trivial_pos_N, but that doesn't affect the following analysis.) We agree that the potential effects are performance-only, trying a bitmap approach twice, which won't wreak havoc. (We expect that separately compiled libs with 17.13 won't become entrenched because it's not a long-term support release, and it'll be quite new - anyone who upgraded to 17.13 is pretty clearly staying current with the latest release and can be expected to upgrade to 17.14 in short order.)
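For readers unfamiliar with the term, a "bitmap approach" marks every needle element in a lookup table and then tests each haystack element against that table. The sketch below shows the general scalar idea for `char` only. The function name `find_first_of_bitmap` is hypothetical, and this is an illustration under those assumptions, not the STL's header or separately compiled implementation.

```cpp
#include <bitset>
#include <cstddef>
#include <string_view>

// Hypothetical illustration of a scalar bitmap search; not the STL's code.
// Returns the index of the first haystack element that occurs in needle,
// or std::string_view::npos if there is none.
std::size_t find_first_of_bitmap(std::string_view haystack, std::string_view needle) {
    std::bitset<256> table; // one bit per possible unsigned char value

    // Pass 1: mark every needle element in the table.
    for (const char ch : needle) {
        table.set(static_cast<unsigned char>(ch));
    }

    // Pass 2: scan the haystack with one table lookup per element.
    for (std::size_t i = 0; i < haystack.size(); ++i) {
        if (table.test(static_cast<unsigned char>(haystack[i]))) {
            return i;
        }
    }
    return std::string_view::npos;
}
```

Building the table costs time proportional to the needle length, so attempting it twice would repeat that setup work without changing the result, which is consistent with the effect being performance-only.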
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.