Vectorize find_end
by AlexGuteniev 路 Pull Request #4943 路 microsoft/STL (original) (raw)
馃摐 Overview
- Similar to Vectorize std::search of 1 and 2 bytes elements with pcmpestri #4745 in the opposite direction
- The algorithm is not very similar though. I have not attempted the one-index
pcmpestri
instruction and variable step, but it looked like multiple-indicespcmpestrm
and fixed step would work better for this direction, because the former would return more partial false matches than in forward direction, as the match with higher index is more likely to be partial.- If that's not convincing enough, I can try the other approach
pcmpestrm
approach could have been made more efficient if DevCom-10689455 is fixed, but honestly I don't hope for that much
- Also restructured control flow a vit in
__std_search_impl
, the place where_Match_1st_16
is introduced. I basically implemented the same thing from scratch, and the new attempt gave clearer control flow implementation. They are not shared though, as_First1 > _Stop1
check is needed for variable step, but not needed for fixed step, so here's still some variation. - The types larger than 2 bytes were excluded from test and benchmark -- same way as we don't test
replace
with smaller elements. - The benchmark patterns list is expanded to test small and large needle closer to the other end. I deliberately left coverage for
search
+ closer to start andfind_end
+ closer to end, It is useful to test the case where startup cost contributes more. - Also added couple of "evil" patterns in benchmark. The goal for them is not to make them too much worse. I leave up to maintainers to decide if this goal is achieved
馃懣 The Evil case
Unlike many other algorithms, search
algorithm run time highly depends on the data. It is possible to craft data which takes way more time than typical data of the same length.
The worst case could be haystack of a single repeating value, and the needle with the same repeating values and another value in the end. This is bad for plain search
, although Boyer-Moore or similar algorithms are expected to handle that better.
鈿狅笍 The results for find_end
are much worse for that case 鈿狅笍
鈿狅笍 The results for search
are also worse after vectorization for such case, but not that much as find_end
鈿狅笍
It look possible to detect the evil case (by the amount of matched beginnings per some amount of data) and switch to another algorithm, or to modify the whole algorithm to be less affected. It can be done in this or subsequent PR.
I doubt how much the Evil case is important.
馃弫 Benchmark results
Legend for the benchmark parameter:
/* 0. Small, closer to end */ {common_src_data, "aliquet"sv}, |
---|
/* 1. Large, closer to end */ {common_src_data, "aliquet malesuada"sv}, |
/* 2. Small, closer to begin */ {common_src_data, "pulvinar"sv}, |
/* 3. Large, closer to begin */ {common_src_data, "dapibus elit interdum"sv}, |
/* 4. Small, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<7, true>}, |
/* 5. Large, evli */ {fill_pattern_view<3000, false>, fill_pattern_view<20, true>}, |
Benchmark | main | this |
---|---|---|
classic_find_endstd::uint8_t/0 | 102 ns | 22.4 ns |
classic_find_endstd::uint8_t/1 | 107 ns | 23.8 ns |
classic_find_endstd::uint8_t/2 | 1447 ns | 298 ns |
classic_find_endstd::uint8_t/3 | 1734 ns | 384 ns |
classic_find_endstd::uint8_t/4 | 5489 ns | 4998 ns |
classic_find_endstd::uint8_t/5 | 17246 ns | 15661 ns |
classic_find_endstd::uint16_t/0 | 103 ns | 39.2 ns |
classic_find_endstd::uint16_t/1 | 103 ns | 44.7 ns |
classic_find_endstd::uint16_t/2 | 1444 ns | 630 ns |
classic_find_endstd::uint16_t/3 | 1740 ns | 822 ns |
classic_find_endstd::uint16_t/4 | 6487 ns | 9816 ns |
classic_find_endstd::uint16_t/5 | 14878 ns | 20222 ns |
ranges_find_endstd::uint8_t/0 | 101 ns | 22.9 ns |
ranges_find_endstd::uint8_t/1 | 109 ns | 24.9 ns |
ranges_find_endstd::uint8_t/2 | 1451 ns | 303 ns |
ranges_find_endstd::uint8_t/3 | 1767 ns | 390 ns |
ranges_find_endstd::uint8_t/4 | 5363 ns | 5087 ns |
ranges_find_endstd::uint8_t/5 | 16254 ns | 15942 ns |
ranges_find_endstd::uint16_t/0 | 103 ns | 39.4 ns |
ranges_find_endstd::uint16_t/1 | 104 ns | 46.1 ns |
ranges_find_endstd::uint16_t/2 | 1457 ns | 633 ns |
ranges_find_endstd::uint16_t/3 | 1728 ns | 811 ns |
ranges_find_endstd::uint16_t/4 | 5343 ns | 9832 ns |
ranges_find_endstd::uint16_t/5 | 14883 ns | 20222 ns |
Re-testong of the search PR against then-main as there are more cases in the benchmark:
Benchmark | main | this |
---|---|---|
c_strstr/0 | 190 ns | 188 ns |
c_strstr/1 | 219 ns | 212 ns |
c_strstr/2 | 14.2 ns | 13.8 ns |
c_strstr/3 | 10.7 ns | 9.92 ns |
c_strstr/4 | 1478 ns | 1416 ns |
c_strstr/5 | 17965 ns | 16897 ns |
classic_searchstd::uint8_t/0 | 2193 ns | 275 ns |
classic_searchstd::uint8_t/1 | 2455 ns | 305 ns |
classic_searchstd::uint8_t/2 | 144 ns | 30.6 ns |
classic_searchstd::uint8_t/3 | 66.9 ns | 17.0 ns |
classic_searchstd::uint8_t/4 | 5315 ns | 1439 ns |
classic_searchstd::uint8_t/5 | 15060 ns | 11813 ns |
classic_searchstd::uint16_t/0 | 1460 ns | 519 ns |
classic_searchstd::uint16_t/1 | 1606 ns | 571 ns |
classic_searchstd::uint16_t/2 | 130 ns | 56.1 ns |
classic_searchstd::uint16_t/3 | 56.9 ns | 27.1 ns |
classic_searchstd::uint16_t/4 | 5342 ns | 7312 ns |
classic_searchstd::uint16_t/5 | 28964 ns | 20472 ns |
ranges_searchstd::uint8_t/0 | 2102 ns | 275 ns |
ranges_searchstd::uint8_t/1 | 2384 ns | 301 ns |
ranges_searchstd::uint8_t/2 | 147 ns | 30.5 ns |
ranges_searchstd::uint8_t/3 | 76.2 ns | 17.0 ns |
ranges_searchstd::uint8_t/4 | 5325 ns | 1438 ns |
ranges_searchstd::uint8_t/5 | 15573 ns | 11803 ns |
ranges_searchstd::uint16_t/0 | 1482 ns | 519 ns |
ranges_searchstd::uint16_t/1 | 1634 ns | 571 ns |
ranges_searchstd::uint16_t/2 | 155 ns | 55.4 ns |
ranges_searchstd::uint16_t/3 | 70.0 ns | 28.1 ns |
ranges_searchstd::uint16_t/4 | 5339 ns | 7338 ns |
ranges_searchstd::uint16_t/5 | 22691 ns | 20377 ns |
search_default_searcherstd::uint8_t/0 | 1963 ns | 273 ns |
search_default_searcherstd::uint8_t/1 | 2182 ns | 301 ns |
search_default_searcherstd::uint8_t/2 | 147 ns | 30.4 ns |
search_default_searcherstd::uint8_t/3 | 60.4 ns | 16.6 ns |
search_default_searcherstd::uint8_t/4 | 5816 ns | 1441 ns |
search_default_searcherstd::uint8_t/5 | 20702 ns | 11753 ns |
search_default_searcherstd::uint16_t/0 | 2443 ns | 519 ns |
search_default_searcherstd::uint16_t/1 | 2671 ns | 607 ns |
search_default_searcherstd::uint16_t/2 | 204 ns | 55.6 ns |
search_default_searcherstd::uint16_t/3 | 92.1 ns | 27.4 ns |
search_default_searcherstd::uint16_t/4 | 5676 ns | 7294 ns |
search_default_searcherstd::uint16_t/5 | 30609 ns | 20342 ns |