<algorithm>: Implement worst-case linear-time nth_element by muellerj2 · Pull Request #5100 · microsoft/STL
Fixes #856 for std::nth_element and std::ranges::nth_element. This implements a fallback to the median-of-medians-of-five algorithm when the quickselect algorithm seems to be making too little progress.
The median-of-medians algorithm is mostly the textbook version, with two minor tweaks:
- If the processed sequence doesn't divide cleanly into groups of five elements, the remainder group with fewer than five elements isn't considered for the median computation. (This reduces the amount of code and doesn't change the asymptotics. I couldn't observe any practical difference in running time either.)
- When the pivot (= median of medians) has been computed, all (greater) medians located after the pivot are moved to the very end of the processed sequence, and the pivot is swapped into the middle of the sequence. All of these elements are guaranteed to be moved by the pivot partitioning algorithm anyway, so this step immediately puts them into an appropriate position (and probably moves the pivot closer to its final one). This way, the medians can also be excluded from the sequence on which the partitioning algorithm is applied, avoiding some unnecessary comparisons. (In practice, the benchmarks suggest that this makes the algorithm a few percent faster, but the difference is minor.) A minimal sketch of both tweaks follows below.
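To make the two tweaks concrete, here is a minimal sketch of the pivot selection, assuming random-access iterators. All names are illustrative and do not match the actual STL implementation, and the recursive median computations are expressed with std::nth_element itself for brevity (the real implementation recurses into its own fallback):

```cpp
#include <algorithm>

// Sketch only: selects the median-of-medians pivot with the two tweaks
// described above.
template <class It, class Pred>
It select_pivot(It first, It last, Pred pred) {
    // Tweak 1: compute the median of each *full* group of five and collect
    // these medians at the front; a trailing group with fewer than five
    // elements is simply ignored.
    It medians_end = first;
    for (It group = first; last - group >= 5; group += 5) {
        std::nth_element(group, group + 2, group + 5, pred); // median of five
        std::iter_swap(medians_end, group + 2);
        ++medians_end;
    }
    const It middle = first + (last - first) / 2;
    if (medians_end == first) {
        return middle; // fewer than five elements: no full group
    }

    // Recurse on the medians; the median of medians ends up at mid.
    const It mid = first + (medians_end - first) / 2;
    std::nth_element(first, mid, medians_end, pred);

    // Tweak 2: the medians after mid are all >= the pivot, so move them to
    // the very end of the sequence (the partitioning step would move them
    // there anyway and can now exclude them), then swap the pivot into the
    // middle.
    const auto greater_count = medians_end - (mid + 1);
    std::swap_ranges(mid + 1, medians_end, last - greater_count);
    std::iter_swap(mid, middle);
    // A real implementation would also report greater_count so that the
    // partitioning step only runs on [first, last - greater_count).
    return middle;
}
```

The saved comparisons come from that last step: the partitioning pass can run on the shortened range [first, last - greater_count) instead of the whole sequence.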
Benchmark results
bm_uniform just applies nth_element to an integer array of the given length. The integer array is uniformly sampled from a fixed seed. This is to check that the worst-case fallback does not noticeably worsen the processing time on such a sequence.
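For reference, a hedged sketch of what bm_uniform might look like with Google Benchmark, ignoring the alg_type template parameter that distinguishes std::nth_element from std::ranges::nth_element; the seed value, the distribution, and the selected rank (the middle) are illustrative assumptions, not taken from the actual benchmark code:

```cpp
#include <benchmark/benchmark.h>

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of bm_uniform: nth_element on an integer array uniformly sampled
// from a fixed seed.
void bm_uniform(benchmark::State& state) {
    std::mt19937 gen(12345); // fixed seed (illustrative value)
    std::vector<int> data(static_cast<std::size_t>(state.range(0)));
    std::uniform_int_distribution<int> dist;
    std::generate(data.begin(), data.end(), [&] { return dist(gen); });
    for (auto _ : state) {
        std::vector<int> copy = data; // nth_element permutes, so work on a copy
        std::nth_element(copy.begin(), copy.begin() + copy.size() / 2, copy.end());
        benchmark::DoNotOptimize(copy);
    }
}
BENCHMARK(bm_uniform)->Arg(1024)->Arg(2048)->Arg(4096)->Arg(8192);

BENCHMARK_MAIN();
```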
bm_tunkey_adversary applies nth_element to a sequence on which the implemented quickselect algorithm performs terribly.
Before:
```
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
bm_uniform<alg_type::std_fn>/1024           1845 ns         1803 ns       407273
bm_uniform<alg_type::std_fn>/2048           3966 ns         3990 ns       172308
bm_uniform<alg_type::std_fn>/4096           7702 ns         7673 ns        89600
bm_uniform<alg_type::std_fn>/8192          18090 ns        18032 ns        40727
bm_uniform<alg_type::rng>/1024              1759 ns         1758 ns       373333
bm_uniform<alg_type::rng>/2048              3985 ns         4011 ns       179200
bm_uniform<alg_type::rng>/4096              7694 ns         7847 ns        89600
bm_uniform<alg_type::rng>/8192             18015 ns        17997 ns        37333
bm_tunkey_adversary<alg_type::std_fn>      12995 ns        13393 ns        56000
bm_tunkey_adversary<alg_type::rng>         12714 ns        12835 ns        56000
```
After:
```
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
bm_uniform<alg_type::std_fn>/1024           1599 ns         1604 ns       448000
bm_uniform<alg_type::std_fn>/2048           3626 ns         3610 ns       194783
bm_uniform<alg_type::std_fn>/4096           7068 ns         7150 ns        89600
bm_uniform<alg_type::std_fn>/8192          16469 ns        16044 ns        44800
bm_uniform<alg_type::rng>/1024              1701 ns         1709 ns       448000
bm_uniform<alg_type::rng>/2048              3841 ns         3931 ns       194783
bm_uniform<alg_type::rng>/4096              7447 ns         7324 ns        74667
bm_uniform<alg_type::rng>/8192             17024 ns        16741 ns        37333
bm_tunkey_adversary<alg_type::std_fn>       6075 ns         5929 ns        89600
bm_tunkey_adversary<alg_type::rng>          6270 ns         6278 ns       112000
```
As expected, the fallback greatly improves the running time for bm_tunkey_adversary. The timings for bm_uniform are about on par; on my machine, they are in fact even a bit better with this PR.
The fallback heuristic
std::sort switches to its fallback when the recursion depth exceeds some logarithmic threshold. We could use the same heuristic here as well; however, it would not guarantee linear time in the worst case, but "only" an $O(n \log n)$ bound. Alternatively, we could limit the recursion depth to some constant, but that's likely a pessimization for large sequences.
So I opted for an adaptive depth limit: Like the heuristic for std::sort, it assumes that each iteration should reduce the range of inspected elements by 25 %. But while std::sort derives a maximum recursion depth from this assumption, this heuristic falls back to the median-of-medians algorithm when the actual size of the processed sequence exceeds the desired size by some constant tolerance factor (currently about 2) during some iteration. Thus, the total number of processed elements over all quickselect iterations is bounded by a multiple of the sequence length times a geometric sum, ensuring worst-case linear time overall. At the same time, the tolerance factor introduces some leeway so that one or two bad iterations (especially at the beginning) don't trigger the fallback immediately.
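Concretely: if the fallback hasn't triggered through iteration $k$, the sequence processed in that iteration has at most $2 \cdot (3/4)^k \cdot n$ elements, so all quickselect iterations together process at most $\sum_{k \ge 0} 2 \cdot (3/4)^k \cdot n = 8n = O(n)$ elements, and the median-of-medians fallback itself is also linear. A minimal sketch of the loop follows; the small-range cutoff, the median-of-three pivot guess, and the stand-in calls for the two fallbacks are illustrative, and only the heuristic itself (ideal size shrinking by 25 % per iteration, tolerance factor 2) mirrors the description above:

```cpp
#include <algorithm>

// Sketch of the adaptive depth limit, assuming random-access iterators.
template <class It, class Pred>
void nth_element_sketch(It first, It nth, It last, Pred pred) {
    auto ideal = last - first; // size we would like to see at this iteration
    while (last - first > 32) { // cutoff for the insertion sort fallback (illustrative)
        if (last - first > 2 * ideal) {
            // Quickselect is making too little progress: switch to the
            // worst-case linear median-of-medians algorithm. (Stand-in call;
            // see the pivot selection sketch above.)
            std::nth_element(first, nth, last, pred);
            return;
        }
        // Guess a pivot (median of three here) and partition into
        // [less, equal, greater] with two std::partition passes.
        It mid = first + (last - first) / 2;
        if (pred(*mid, *first)) std::iter_swap(mid, first);
        if (pred(*(last - 1), *first)) std::iter_swap(last - 1, first);
        if (pred(*(last - 1), *mid)) std::iter_swap(last - 1, mid);
        auto pivot = *mid; // copy: partitioning moves elements around
        It eq_first = std::partition(first, last, [&](const auto& e) { return pred(e, pivot); });
        It eq_last  = std::partition(eq_first, last, [&](const auto& e) { return !pred(pivot, e); });
        if (nth < eq_first) {
            last = eq_first; // continue in the part left of the pivot run
        } else if (eq_last <= nth) {
            first = eq_last; // continue in the part right of the pivot run
        } else {
            return; // nth lies within the pivot run
        }
        ideal -= ideal / 4; // expect at least a 25 % reduction per iteration
    }
    std::sort(first, last, pred); // stand-in for the insertion sort fallback
}
```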
Obviously, there are many possible choices for the desired percentage reduction per iteration and the tolerance factor. But the benchmarks suggest that the chosen values aren't too bad: a smaller percentage reduction or a larger tolerance factor noticeably worsens the bm_tunkey_adversary benchmark, but makes little difference for bm_uniform. Besides, the implementation of std::sort already sets a precedent for a desired reduction of 25 % per iteration.
Test
The newly added test applies nth_element to the same worst-case sequence as the benchmark. This makes sure that the fallback is actually exercised by the test. (I think it's also the first test that exercises the quickselect algorithm and not just the insertion sort fallback.)