Vectorize is_sorted_until by AlexGuteniev · Pull Request #5420 · microsoft/STL

🗺️ Overview

Another vectorization, apparently one of the last relatively low-hanging fruits 🍎.
Uses offset loads, as in remove, unique, or adjacent_find, and the comparisons from the minmax family.
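To illustrate the offset-load idea, here is a minimal SSE2 sketch for `int32_t` and ascending order only. It is not the code from this PR: the name `sorted_until_sse2` is hypothetical, and the real implementation covers more element types and instruction sets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics (x86/x64 only)

// Hypothetical helper, not the STL's internal function.
// Returns a pointer to the first element that is less than its predecessor,
// or `last` if the whole range is ascending.
const std::int32_t* sorted_until_sse2(const std::int32_t* first, const std::int32_t* last) {
    assert(first <= last); // precondition: valid range
    if (first == last) {
        return last;
    }

    const std::int32_t* cur = first + 1; // first element that has a predecessor

    // Vector part: 4 x int32 per iteration.
    while (last - cur >= 4) {
        // Offset load: the same window is read twice, shifted by one element.
        const __m128i prev = _mm_loadu_si128(reinterpret_cast<const __m128i*>(cur - 1));
        const __m128i next = _mm_loadu_si128(reinterpret_cast<const __m128i*>(cur));

        // A lane is all-ones where prev > next, that is, where the order breaks.
        const __m128i broken = _mm_cmpgt_epi32(prev, next);
        const unsigned mask  = static_cast<unsigned>(_mm_movemask_epi8(broken));

        if (mask != 0) {
            // movemask yields 4 bits per int32 lane; find the lowest set lane.
            std::size_t lane = 0;
            while (((mask >> (lane * 4)) & 1u) == 0) {
                ++lane;
            }
            return cur + lane;
        }
        cur += 4;
    }

    // Scalar tail for the remaining elements.
    for (; cur != last; ++cur) {
        if (*cur < *(cur - 1)) {
            return cur;
        }
    }
    return last;
}
```

Equal neighbours do not stop the scan, which matches the strict `less` semantics of `is_sorted_until`.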

⚖️ Less and greater

Sorting in both ascending and descending order makes equal sense, although the standard picks less as the default predicate. We support both less and greater here. This is done for the first time. It wasn't done for minmax family in the past, because for minmax it is very squirrelly to reverse the predicate 🐿️.
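For reference, these are the two call shapes in question, shown through the public `std::is_sorted_until` interface. The helper names are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Length of the ascending prefix (default predicate, std::less).
std::size_t ascending_prefix(const std::vector<int>& v) {
    return static_cast<std::size_t>(std::is_sorted_until(v.begin(), v.end()) - v.begin());
}

// Length of the descending prefix, expressed with std::greater<>.
std::size_t descending_prefix(const std::vector<int>& v) {
    return static_cast<std::size_t>(
        std::is_sorted_until(v.begin(), v.end(), std::greater<>{}) - v.begin());
}

void prefix_demo() {
    const std::vector<int> up{1, 2, 2, 5, 4};
    assert(ascending_prefix(up) == 4);  // 4 < 5 breaks ascending order
    assert(descending_prefix(up) == 1); // 2 > 1 breaks descending order
}
```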

To reduce the number of functions, less and greater are distinguished by a bool parameter. It is turned into a pair of offsets for the two loads, and so decides which of the loads actually uses a negative offset.

The runtime cost of applying extra offset on each iteration is nonzero, but still I expect it to be small.
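A scalar sketch of that idea, assuming a single "left > right" comparison is all that is available. The function name and the exact offset layout are illustrative and not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: one function body serves both std::less and std::greater.
// The bool decides which of the two loads carries the -1 offset.
const int* sorted_until_impl(const int* first, const int* last, const bool greater) {
    assert(first <= last); // precondition: valid range
    if (first == last) {
        return last;
    }

    // The comparison is always "left > right":
    //   less:    left = cur[-1], right = cur[0]  (breaks when previous > current)
    //   greater: left = cur[0],  right = cur[-1] (breaks when current > previous)
    const std::ptrdiff_t left_off  = greater ? 0 : -1;
    const std::ptrdiff_t right_off = greater ? -1 : 0;

    for (const int* cur = first + 1; cur != last; ++cur) {
        if (cur[left_off] > cur[right_off]) {
            return cur;
        }
    }
    return last;
}
```

The extra cost per iteration is the offset added to each load address, which is the small overhead mentioned above.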

Fun observation: one could attempt to pass less_equal or greater_equal too, expecting the algorithm to check for strictly ascending/descending values. But that is UB according to the requirements on the sorting-related algorithms, and there's _DEBUG_LT_PRED to alert about it 😈. The vectorization does not affect this case anyway, as it looks for the specific predicates.

🐞 Floating bugs

We support floating point types here too.

I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.

One caveat I see is signaling values 🚨. As this is an early-return algorithm, for some data it may be expected not to signal when handling elements one by one, yet it would start signaling when working with vectors. But we already do not enable floating vector algorithms for /fp:except, so it should be fine:

#if _USE_STD_VECTOR_ALGORITHMS && !defined(_M_FP_EXCEPT)
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1
#else // ^^^ use vector algorithms and fast math / not use vector algorithms or not use fast math vvv
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
#endif // ^^^ not use vector algorithms or not use fast math ^^^

If anything goes wrong, the escape hatch is still there to help.
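A sketch of using the escape hatch, on the assumption that it refers to defining `_USE_STD_VECTOR_ALGORITHMS` to 0 before the first standard header (the macro tested in the snippet above). The helper names are hypothetical.

```cpp
// Assumption: defining this macro to 0 before any standard header
// (or passing /D_USE_STD_VECTOR_ALGORITHMS=0) selects the scalar code paths
// in MSVC's STL. Other standard libraries ignore the macro.
#define _USE_STD_VECTOR_ALGORITHMS 0

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Length of the ascending prefix of a range of doubles.
std::size_t sorted_prefix_length(const std::vector<double>& v) {
    return static_cast<std::size_t>(std::is_sorted_until(v.begin(), v.end()) - v.begin());
}

void escape_hatch_demo() {
    const std::vector<double> v{1.0, 2.5, 2.5, 1.5};
    // 1.5 < 2.5 is the first out-of-order pair, so the sorted prefix has 3 elements.
    assert(sorted_prefix_length(v) == 3);
}
```

The observable result is the same with or without the macro; only the code path taken inside the library changes.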

➕ Unsigned values

Like in minmax_element, we need the *_gt intrinsics that only exist for signed values, so we have to do sign correction.

Unlike in minmax_element, we apply the correction at compile-time, having different signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.
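The sign correction itself can be sketched in scalar form for 8-bit elements. Flipping the top bit maps the unsigned range onto the signed range while preserving order, so a signed-only "greater than" gives the unsigned answer. The template below is a hypothetical illustration of selecting the correction at compile time, not the PR's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of sign correction for 8-bit elements.
// XOR with 0x80 maps unsigned 0..255 onto signed -128..127, keeping relative order.
// IsUnsigned is a template parameter, so the correction is fixed at compile time
// and the signed instantiation carries no extra XOR.
template <bool IsUnsigned>
bool greater_via_signed_compare(const std::uint8_t a_bits, const std::uint8_t b_bits) {
    constexpr std::uint8_t correction = IsUnsigned ? 0x80 : 0x00;
    const auto a = static_cast<std::int8_t>(a_bits ^ correction);
    const auto b = static_cast<std::int8_t>(b_bits ^ correction);
    return a > b; // stands in for a signed-only *_gt intrinsic
}

void sign_correction_demo() {
    // Interpreted as unsigned, 200 > 100.
    assert(greater_via_signed_compare<true>(200, 100));
    // Interpreted as signed, the bit pattern 200 is -56, which is not > 100.
    assert(!greater_via_signed_compare<false>(200, 100));
}
```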

This can be changed if the benchmark results do not show a persuasive enough signed/unsigned difference. Currently we have ~33.5 ns for int8_t and ~39.5 ns for uint8_t. With the correction applied at run time I expect ~39.5 ns for both, but with less machine code generated and a simpler dispatch in the header.

⏱️ Benchmark results

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| bm<std::int8_t, AlgType::Std>/3000/1800 | 434 ns | 33.7 ns | 12.88 |
| bm<std::int8_t, AlgType::Rng>/3000/1800 | 434 ns | 33.4 ns | 12.99 |
| bm<std::int16_t, AlgType::Std>/3000/1800 | 440 ns | 67.8 ns | 6.49 |
| bm<std::int16_t, AlgType::Rng>/3000/1800 | 432 ns | 66.5 ns | 6.50 |
| bm<std::int32_t, AlgType::Std>/3000/1800 | 432 ns | 129 ns | 3.35 |
| bm<std::int32_t, AlgType::Rng>/3000/1800 | 431 ns | 130 ns | 3.32 |
| bm<std::int64_t, AlgType::Std>/3000/1800 | 437 ns | 238 ns | 1.84 |
| bm<std::int64_t, AlgType::Rng>/3000/1800 | 440 ns | 241 ns | 1.83 |
| bm<std::uint8_t, AlgType::Std>/3000/1800 | 437 ns | 39.4 ns | 11.09 |
| bm<std::uint8_t, AlgType::Rng>/3000/1800 | 440 ns | 39.5 ns | 11.14 |
| bm<std::uint16_t, AlgType::Std>/3000/1800 | 438 ns | 79.0 ns | 5.54 |
| bm<std::uint16_t, AlgType::Rng>/3000/1800 | 433 ns | 80.9 ns | 5.35 |
| bm<std::uint32_t, AlgType::Std>/3000/1800 | 432 ns | 146 ns | 2.96 |
| bm<std::uint32_t, AlgType::Rng>/3000/1800 | 429 ns | 147 ns | 2.92 |
| bm<std::uint64_t, AlgType::Std>/3000/1800 | 500 ns | 279 ns | 1.79 |
| bm<std::uint64_t, AlgType::Rng>/3000/1800 | 446 ns | 280 ns | 1.59 |
| bm<float, AlgType::Std>/3000/1800 | 658 ns | 112 ns | 5.88 |
| bm<float, AlgType::Rng>/3000/1800 | 670 ns | 100 ns | 6.70 |
| bm<double, AlgType::Std>/3000/1800 | 653 ns | 179 ns | 3.65 |
| bm<double, AlgType::Rng>/3000/1800 | 657 ns | 177 ns | 3.71 |