Vectorize find_end by AlexGuteniev · Pull Request #4943 · microsoft/STL

📜 Overview

👿 The Evil case

Unlike many other algorithms, the run time of search algorithms depends heavily on the data. It is possible to craft data that takes far more time than typical data of the same length.

The worst case would be a haystack consisting of a single repeated value, and a needle consisting of the same repeated value followed by a different value at the end. This is bad for plain search, although Boyer-Moore or similar algorithms are expected to handle it better.
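To make the shape of that input concrete, here is a minimal sketch. The helpers `make_pattern` and `evil_needle_found` are hypothetical names introduced for illustration; they are not the benchmark's `fill_pattern_view`, and the fill values `'a'` and `'b'` are arbitrary assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): `size` copies of `fill`,
// with the final element optionally replaced by `last`.
std::vector<std::uint8_t> make_pattern(std::size_t size, bool change_last,
                                       std::uint8_t fill = 'a', std::uint8_t last = 'b') {
    std::vector<std::uint8_t> v(size, fill);
    if (change_last && !v.empty()) {
        v.back() = last;
    }
    return v;
}

// Runs std::find_end on an evil input. The needle never occurs in the
// haystack, yet every candidate position matches all but the needle's
// final element, so a plain search cannot reject candidates early.
bool evil_needle_found(std::size_t haystack_size, std::size_t needle_size) {
    const auto haystack = make_pattern(haystack_size, false);
    const auto needle   = make_pattern(needle_size, true);
    const auto it = std::find_end(haystack.begin(), haystack.end(),
                                  needle.begin(), needle.end());
    return it != haystack.end();
}
```

Calling `evil_needle_found(3000, 7)` mirrors the sizes of the "Small, evil" benchmark case. How costly the call is depends on the library implementation, so the sketch only shows the data shape, not timings.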

⚠️ The results for find_end are much worse for that case: in the benchmark below, the uint8_t evil cases barely improve and the uint16_t evil cases regress ⚠️

⚠️ The results for search are also worse after vectorization for such a case, but not as much as for find_end ⚠️

It looks possible to detect the evil case (by the number of matched beginnings per some amount of data) and switch to another algorithm, or to modify the whole algorithm to be less affected. This can be done in this PR or in a subsequent one.
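One way such a detection could look is sketched below. It is only an illustration of the idea, not what the PR implements: the function `looks_evil`, the sampling window, and the 25% threshold are all assumptions, and "matched beginning" is simplified here to "element equal to the needle's first element". Which part of the haystack to sample is also an arbitrary choice in this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical heuristic: sample up to `window` leading elements of the
// haystack and count how many of them equal the needle's first element.
// If more than 1/`denominator` of the sample matches, candidate positions
// are dense and the input is flagged as likely evil.
bool looks_evil(const std::vector<std::uint8_t>& haystack,
                const std::vector<std::uint8_t>& needle,
                std::size_t window = 256, std::size_t denominator = 4) {
    if (haystack.empty() || needle.empty()) {
        return false;
    }
    const std::size_t n = std::min(window, haystack.size());
    const auto sample_end = haystack.begin() + static_cast<std::ptrdiff_t>(n);
    const auto matches = std::count(haystack.begin(), sample_end, needle.front());
    return static_cast<std::size_t>(matches) * denominator > n;
}
```

A caller could check `looks_evil(haystack, needle)` once up front and fall back to a different algorithm when it returns true. A real implementation would more likely track the match density while searching rather than in a separate pass.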

I am not sure how important the Evil case is.

🏁 Benchmark results

Legend for the benchmark parameter:

/* 0. Small, closer to end */ {common_src_data, "aliquet"sv},
/* 1. Large, closer to end */ {common_src_data, "aliquet malesuada"sv},
/* 2. Small, closer to begin */ {common_src_data, "pulvinar"sv},
/* 3. Large, closer to begin */ {common_src_data, "dapibus elit interdum"sv},
/* 4. Small, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<7, true>},
/* 5. Large, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<20, true>},

| Benchmark | main | this |
|---|---|---|
| `classic_find_end<std::uint8_t>/0` | 102 ns | 22.4 ns |
| `classic_find_end<std::uint8_t>/1` | 107 ns | 23.8 ns |
| `classic_find_end<std::uint8_t>/2` | 1447 ns | 298 ns |
| `classic_find_end<std::uint8_t>/3` | 1734 ns | 384 ns |
| `classic_find_end<std::uint8_t>/4` | 5489 ns | 4998 ns |
| `classic_find_end<std::uint8_t>/5` | 17246 ns | 15661 ns |
| `classic_find_end<std::uint16_t>/0` | 103 ns | 39.2 ns |
| `classic_find_end<std::uint16_t>/1` | 103 ns | 44.7 ns |
| `classic_find_end<std::uint16_t>/2` | 1444 ns | 630 ns |
| `classic_find_end<std::uint16_t>/3` | 1740 ns | 822 ns |
| `classic_find_end<std::uint16_t>/4` | 6487 ns | 9816 ns |
| `classic_find_end<std::uint16_t>/5` | 14878 ns | 20222 ns |
| `ranges_find_end<std::uint8_t>/0` | 101 ns | 22.9 ns |
| `ranges_find_end<std::uint8_t>/1` | 109 ns | 24.9 ns |
| `ranges_find_end<std::uint8_t>/2` | 1451 ns | 303 ns |
| `ranges_find_end<std::uint8_t>/3` | 1767 ns | 390 ns |
| `ranges_find_end<std::uint8_t>/4` | 5363 ns | 5087 ns |
| `ranges_find_end<std::uint8_t>/5` | 16254 ns | 15942 ns |
| `ranges_find_end<std::uint16_t>/0` | 103 ns | 39.4 ns |
| `ranges_find_end<std::uint16_t>/1` | 104 ns | 46.1 ns |
| `ranges_find_end<std::uint16_t>/2` | 1457 ns | 633 ns |
| `ranges_find_end<std::uint16_t>/3` | 1728 ns | 811 ns |
| `ranges_find_end<std::uint16_t>/4` | 5343 ns | 9832 ns |
| `ranges_find_end<std::uint16_t>/5` | 14883 ns | 20222 ns |
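
As a reading aid for the numbers above, the ratio of the "main" time to the "this" time gives the speedup: values above 1 are improvements, values below 1 are regressions. The sketch below computes it for three rows copied from the table. The function names are hypothetical and introduced only for this illustration.

```cpp
#include <cstdio>

// Speedup of "this" relative to "main"; > 1 is faster, < 1 is slower.
constexpr double speedup(double main_ns, double this_ns) {
    return main_ns / this_ns;
}

// Prints the speedup for a few rows taken from the find_end table.
void print_selected_speedups() {
    // classic_find_end<std::uint8_t>/0: small needle, typical data
    std::printf("uint8_t/0:  %.2fx\n", speedup(102.0, 22.4));
    // classic_find_end<std::uint8_t>/4: small needle, evil data
    std::printf("uint8_t/4:  %.2fx\n", speedup(5489.0, 4998.0));
    // classic_find_end<std::uint16_t>/4: small needle, evil data
    std::printf("uint16_t/4: %.2fx\n", speedup(6487.0, 9816.0));
}
```

The typical case comes out at a speedup of more than 4x, the uint8_t evil case is nearly flat, and the uint16_t evil case falls below 1, which is the regression the warnings above refer to.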

Re-testing the search PR against then-main, as there are now more cases in the benchmark:

| Benchmark | main | this |
|---|---|---|
| `c_strstr/0` | 190 ns | 188 ns |
| `c_strstr/1` | 219 ns | 212 ns |
| `c_strstr/2` | 14.2 ns | 13.8 ns |
| `c_strstr/3` | 10.7 ns | 9.92 ns |
| `c_strstr/4` | 1478 ns | 1416 ns |
| `c_strstr/5` | 17965 ns | 16897 ns |
| `classic_search<std::uint8_t>/0` | 2193 ns | 275 ns |
| `classic_search<std::uint8_t>/1` | 2455 ns | 305 ns |
| `classic_search<std::uint8_t>/2` | 144 ns | 30.6 ns |
| `classic_search<std::uint8_t>/3` | 66.9 ns | 17.0 ns |
| `classic_search<std::uint8_t>/4` | 5315 ns | 1439 ns |
| `classic_search<std::uint8_t>/5` | 15060 ns | 11813 ns |
| `classic_search<std::uint16_t>/0` | 1460 ns | 519 ns |
| `classic_search<std::uint16_t>/1` | 1606 ns | 571 ns |
| `classic_search<std::uint16_t>/2` | 130 ns | 56.1 ns |
| `classic_search<std::uint16_t>/3` | 56.9 ns | 27.1 ns |
| `classic_search<std::uint16_t>/4` | 5342 ns | 7312 ns |
| `classic_search<std::uint16_t>/5` | 28964 ns | 20472 ns |
| `ranges_search<std::uint8_t>/0` | 2102 ns | 275 ns |
| `ranges_search<std::uint8_t>/1` | 2384 ns | 301 ns |
| `ranges_search<std::uint8_t>/2` | 147 ns | 30.5 ns |
| `ranges_search<std::uint8_t>/3` | 76.2 ns | 17.0 ns |
| `ranges_search<std::uint8_t>/4` | 5325 ns | 1438 ns |
| `ranges_search<std::uint8_t>/5` | 15573 ns | 11803 ns |
| `ranges_search<std::uint16_t>/0` | 1482 ns | 519 ns |
| `ranges_search<std::uint16_t>/1` | 1634 ns | 571 ns |
| `ranges_search<std::uint16_t>/2` | 155 ns | 55.4 ns |
| `ranges_search<std::uint16_t>/3` | 70.0 ns | 28.1 ns |
| `ranges_search<std::uint16_t>/4` | 5339 ns | 7338 ns |
| `ranges_search<std::uint16_t>/5` | 22691 ns | 20377 ns |
| `search_default_searcher<std::uint8_t>/0` | 1963 ns | 273 ns |
| `search_default_searcher<std::uint8_t>/1` | 2182 ns | 301 ns |
| `search_default_searcher<std::uint8_t>/2` | 147 ns | 30.4 ns |
| `search_default_searcher<std::uint8_t>/3` | 60.4 ns | 16.6 ns |
| `search_default_searcher<std::uint8_t>/4` | 5816 ns | 1441 ns |
| `search_default_searcher<std::uint8_t>/5` | 20702 ns | 11753 ns |
| `search_default_searcher<std::uint16_t>/0` | 2443 ns | 519 ns |
| `search_default_searcher<std::uint16_t>/1` | 2671 ns | 607 ns |
| `search_default_searcher<std::uint16_t>/2` | 204 ns | 55.6 ns |
| `search_default_searcher<std::uint16_t>/3` | 92.1 ns | 27.4 ns |
| `search_default_searcher<std::uint16_t>/4` | 5676 ns | 7294 ns |
| `search_default_searcher<std::uint16_t>/5` | 30609 ns | 20342 ns |