Vectorize bitset
from string construction by AlexGuteniev · Pull Request #4839 · microsoft/STL (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation27 Commits50 Checks39 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
📜 The approach
We need to:
- Compare against '0' and '1' (or other characters) and report if there's neither of them
- Reverse the order of characters
- Convert byte mask or
wchar_t
to bit mask - Handle bitset and string odd length tails
- Handle too short and too long string properly
It is implemented as follows:
- We compare against '1' vector to get byte or
wchar_t
mask. We xor the input with '0' vector, and check if anything not masked in the resulting byte mask is zero. This way we both validate, and have the mask. - If we have longer string than bitset, we don't do further steps than byte or
wchar_t
mask, as we only need to validate, - To both reverse the order, and convert
wchar_t
mask to byte mask we use(v)pshufb
(..._shuffle_epi8
). The AVX2 version is not cross-lane, and additional cross-lane instruction is somewhat expensive, so we fix up that with bits, rather than now. (v)pmovmskb
is the instruction that does byte mask to bit mask conversion, it is emitted by..._movemask_epi8
intrinsic.- Now we do rotations to fix up the missing AVX2 cross-lane reordering
- Then put that to bitset. We assume padding, even if bitset length is smaller.
- Repeat until string ends
- Write zeros to remaining portion of bitset, if there's something left
⚠️ Padding assumption
To avoid unnecessary complication, and have a bit better performance, an assumption was made that bitset has at least a multiple of 2 padding on SSE4.2 code path, and at lest a multiple of 4 padding on AVX2 code path.
For large bitset we have units of uint64_t
, and it is not reasonable to have unit less than native integer size, and AVX2 code path has the threshold of 256, since it is not efficient on smaller sizes.
For small bitset we have units of uint32_t
, but there's a vNext comment to consider smaller units. But we have a threshold on the whole vectorization as 16 bits, so 8 or less bits bitset that could potentially be less that 2 bytes, would not vectorize.
⚖️ Threshold selection
According to my benchmarks, the vectorized approach benefits for 16 bits, but doesn't for 8 bits. I tried 12, and it seems in favor of non-vectorization. Let it be round up to 16 then.
📏 String length
There are overloads with implicit and explicit length.
For implicit length the non-vectorized implementation calls strlen
/wcslen
(via character traits).
Hypothetically, vectorized implementation could determine length during conversion. We don't do that here, because:
- This cannot be done ASAN-friendly, see Remove attempt on unsized find vectorization... #4486
- Even if it was possible, this adds complication to already not very simple algorithm
Unfortunately, this causes implicit length overload to benefit much less from vectorization.
⏱️ Benchmark results
Benchmark | main | this |
---|---|---|
explicit length, 1-byte chars | ||
BM_bitset_from_string<length_type::char_count, 15, char> | 1813 ns | 1778 ns |
BM_bitset_from_string<length_type::char_count, 16, char> | 1794 ns | 921 ns |
BM_bitset_from_string<length_type::char_count, 36, char> | 2230 ns | 651 ns |
BM_bitset_from_string<length_type::char_count, 64, char> | 2075 ns | 274 ns |
BM_bitset_from_string<length_type::char_count, 512, char> | 2026 ns | 78.1 ns |
BM_bitset_from_string<length_type::char_count, 2048, char> | 2072 ns | 65.7 ns |
explicit length, 2-byte chars | ||
BM_bitset_from_string<length_type::char_count, 15, wchar_t> | 1777 ns | 1774 ns |
BM_bitset_from_string<length_type::char_count, 16, wchar_t> | 1786 ns | 979 ns |
BM_bitset_from_string<length_type::char_count, 36, wchar_t> | 2230 ns | 741 ns |
BM_bitset_from_string<length_type::char_count, 64, wchar_t> | 1988 ns | 349 ns |
BM_bitset_from_string<length_type::char_count, 512, wchar_t> | 1970 ns | 189 ns |
BM_bitset_from_string<length_type::char_count, 2048, wchar_t> | 1912 ns | 143 ns |
implicit length, 1-byte chars | ||
BM_bitset_from_string<length_type::null_term, 15, char> | 2825 ns | 2777 ns |
BM_bitset_from_string<length_type::null_term, 16, char> | 2118 ns | 1439 ns |
BM_bitset_from_string<length_type::null_term, 36, char> | 2571 ns | 1173 ns |
BM_bitset_from_string<length_type::null_term, 64, char> | 2710 ns | 1018 ns |
BM_bitset_from_string<length_type::null_term, 512, char> | 2482 ns | 606 ns |
BM_bitset_from_string<length_type::null_term, 2048, char> | 2407 ns | 566 ns |
implicit length, 2-byte chars | ||
BM_bitset_from_string<length_type::null_term, 15, wchar_t> | 2098 ns | 2102 ns |
BM_bitset_from_string<length_type::null_term, 16, wchar_t> | 2515 ns | 1613 ns |
BM_bitset_from_string<length_type::null_term, 36, wchar_t> | 2128 ns | 1246 ns |
BM_bitset_from_string<length_type::null_term, 64, wchar_t> | 2739 ns | 1055 ns |
BM_bitset_from_string<length_type::null_term, 512, wchar_t> | 2491 ns | 719 ns |
BM_bitset_from_string<length_type::null_term, 2048, wchar_t> | 2407 ns | 639 ns |
This comment was marked as resolved.
This comment was marked as resolved.
Final results on my 5950X:
Benchmark | Before | After | Speedup |
---|---|---|---|
BM_bitset_from_string<length_type::char_count, 15, char> | 1587 ns | 1518 ns | 1.05 |
BM_bitset_from_string<length_type::char_count, 16, char> | 1592 ns | 1046 ns | 1.52 |
BM_bitset_from_string<length_type::char_count, 36, char> | 1396 ns | 675 ns | 2.07 |
BM_bitset_from_string<length_type::char_count, 64, char> | 1492 ns | 301 ns | 4.96 |
BM_bitset_from_string<length_type::char_count, 512, char> | 1457 ns | 69.9 ns | 20.84 |
BM_bitset_from_string<length_type::char_count, 2048, char> | 1438 ns | 50.1 ns | 28.70 |
BM_bitset_from_string<length_type::char_count, 15, wchar_t> | 1493 ns | 1635 ns | 0.91 |
BM_bitset_from_string<length_type::char_count, 16, wchar_t> | 1563 ns | 1122 ns | 1.39 |
BM_bitset_from_string<length_type::char_count, 36, wchar_t> | 1433 ns | 738 ns | 1.94 |
BM_bitset_from_string<length_type::char_count, 64, wchar_t> | 1495 ns | 393 ns | 3.80 |
BM_bitset_from_string<length_type::char_count, 512, wchar_t> | 1436 ns | 141 ns | 10.18 |
BM_bitset_from_string<length_type::char_count, 2048, wchar_t> | 1420 ns | 124 ns | 11.45 |
BM_bitset_from_string<length_type::null_term, 15, char> | 2070 ns | 1993 ns | 1.04 |
BM_bitset_from_string<length_type::null_term, 16, char> | 1974 ns | 1512 ns | 1.31 |
BM_bitset_from_string<length_type::null_term, 36, char> | 1864 ns | 1156 ns | 1.61 |
BM_bitset_from_string<length_type::null_term, 64, char> | 1981 ns | 755 ns | 2.62 |
BM_bitset_from_string<length_type::null_term, 512, char> | 1917 ns | 531 ns | 3.61 |
BM_bitset_from_string<length_type::null_term, 2048, char> | 1895 ns | 483 ns | 3.92 |
BM_bitset_from_string<length_type::null_term, 15, wchar_t> | 2071 ns | 2078 ns | 1.00 |
BM_bitset_from_string<length_type::null_term, 16, wchar_t> | 2041 ns | 1589 ns | 1.28 |
BM_bitset_from_string<length_type::null_term, 36, wchar_t> | 1927 ns | 1222 ns | 1.58 |
BM_bitset_from_string<length_type::null_term, 64, wchar_t> | 1958 ns | 864 ns | 2.27 |
BM_bitset_from_string<length_type::null_term, 512, wchar_t> | 1932 ns | 594 ns | 3.25 |
BM_bitset_from_string<length_type::null_term, 2048, wchar_t> | 1907 ns | 555 ns | 3.44 |
This is great and a single 0.91 scenario is nothing to worry about. 😻
and a single 0.91 scenario is nothing to worry about.
This is one of the four 15 cases which are just below vectorization threshold. Just one of them has exactly 1.00 as expected
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
Thanks a bit a lot!
😹 0️⃣ 1️⃣
2 participants