Vectorize is_sorted_until
by AlexGuteniev · Pull Request #5420 · microsoft/STL
🗺️ Overview
Another vectorization, apparently one of the last relatively low-hanging fruits 🍎.
It uses offset loads like in `remove`, `unique`, or `adjacent_find`, and comparisons from the `minmax` family.
⚖️ Less and greater
Sorting in both ascending and descending order makes equal sense, although the standard picks `less` as the default predicate. We support both `less` and `greater` here. This is done for the first time. It wasn't done for the `minmax` family in the past, because for `minmax` it is very squirrelly to reverse the predicate 🐿️.
To reduce the number of functions, `less` and `greater` are distinguished by a `bool` parameter. It turns into a pair of offsets for the loads, which decides which of the two loads actually uses a negative offset.
The runtime cost of applying an extra offset on each iteration is nonzero, but I still expect it to be small.
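To illustrate the idea of turning the predicate into load offsets, here is a minimal SSE2 sketch for 32-bit signed integers. It is not the code from this PR: the function name `sorted_until_i32` and its index-based interface are hypothetical, and it uses non-negative offsets from the current position for simplicity, whereas the PR describes one of the loads as having a negative offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics, baseline on x86-64

// Hypothetical helper, not the PR's code: returns the index of the first
// element that breaks the order (or n if the whole range is sorted).
// `greater == false` checks ascending order, `true` checks descending order.
std::size_t sorted_until_i32(const std::int32_t* data, std::size_t n, bool greater) {
    assert(data != nullptr || n == 0);

    // The predicate only decides which of the two overlapping loads is shifted.
    // The element at index k + 1 is out of order when
    //   less:    data[k]     > data[k + 1]
    //   greater: data[k + 1] > data[k]
    const std::size_t left_off  = greater ? 1 : 0;
    const std::size_t right_off = greater ? 0 : 1;

    std::size_t i = 0;
    // Each step checks the pairs (i, i+1) .. (i+3, i+4), so index i+4 must exist.
    for (; i + 4 < n; i += 4) {
        const __m128i left  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i + left_off));
        const __m128i right = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i + right_off));
        // Four mask bits per 32-bit lane; a set lane means left > right there.
        const int mask = _mm_movemask_epi8(_mm_cmpgt_epi32(left, right));
        for (int lane = 0; mask != 0 && lane < 4; ++lane) {
            if (mask & (0xF << (lane * 4))) {
                return i + static_cast<std::size_t>(lane) + 1;
            }
        }
    }

    // Scalar tail for the remaining pairs.
    for (; i + 1 < n; ++i) {
        const bool out_of_order = greater ? data[i + 1] > data[i] : data[i] > data[i + 1];
        if (out_of_order) {
            return i + 1;
        }
    }
    return n;
}
```

Both directions share one loop body; only the two offsets differ, which is where the small per-iteration cost mentioned above would come from.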
Fun observation: one could attempt to pass `less_equal` or `greater_equal` too, and expect the algorithm to check for strictly ascending/descending values. But that is UB according to the requirements of the sorting-related algorithms, and there's `_DEBUG_LT_PRED` to alert about it 😈. The vectorization does not affect this case in any way, as it looks for the specific predicates.
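For readers who do want a strictly ascending check, one possible way to get it without violating the predicate requirements is `std::adjacent_find`, which accepts any binary predicate. The sketch below is only an illustration and not part of this PR; the helper names `sorted_prefix` and `strictly_ascending_prefix` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: length of the longest prefix sorted under `pred`.
// `pred` must be a strict weak ordering, such as std::less or std::greater.
template <class Pred = std::less<>>
std::size_t sorted_prefix(const std::vector<int>& v, Pred pred = Pred{}) {
    return static_cast<std::size_t>(std::is_sorted_until(v.begin(), v.end(), pred) - v.begin());
}

// Hypothetical helper: length of the longest strictly ascending prefix.
// std::adjacent_find accepts any binary predicate, so greater_equal is fine here.
std::size_t strictly_ascending_prefix(const std::vector<int>& v) {
    const auto it = std::adjacent_find(v.begin(), v.end(), std::greater_equal<>{});
    return it == v.end() ? v.size() : static_cast<std::size_t>(it - v.begin()) + 1;
}
```

For `{1, 2, 2, 5, 4, 6}`, `sorted_prefix` with the default `less` accepts the repeated `2` and stops at the `4`, while `strictly_ascending_prefix` already stops at the second `2`.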
🐞 Floating bugs
We support floating point types here too.
I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.
One caveat I see is signaling values 🚨. As this is an early-return algorithm, some data may be expected not to signal when elements are handled one by one, yet would start signaling when processed in vectors. But we already do not enable floating-point vector algorithms for `/fp:except`, so it should be fine:
```cpp
#if _USE_STD_VECTOR_ALGORITHMS && !defined(_M_FP_EXCEPT)
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1
#else // ^^^ use vector algorithms and fast math / not use vector algorithms or not use fast math vvv
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
#endif // ^^^ not use vector algorithms or not use fast math ^^^
```
If anything goes wrong, the escape hatch is still there to help.
➕ Unsigned values
Like in `minmax_element`, we need the `*_gt` intrinsics, which only exist for signed values, so we have to do sign correction.
Unlike in `minmax_element`, we apply the correction at compile time, with separate signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.
This can be changed if the benchmark results do not show a persuasive enough signed/unsigned difference. For example, we have ~33.5 ns for `int8_t` and ~39.5 ns for `uint8_t`; with the correction applied at runtime I expect ~39.5 ns for both, but with less machine code generated and simpler dispatch in the header.
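The sign correction itself can be sketched as follows, assuming SSE2 and 8-bit elements. Flipping the top bit of every byte maps the unsigned range 0..255 onto the signed range -128..127 while preserving order, so the signed `_mm_cmpgt_epi8` then gives the unsigned result. The template `gt_mask_8` is a hypothetical illustration rather than a function from this PR; the `bool` template parameter stands in for the compile-time signed/unsigned split.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics, baseline on x86-64

// Hypothetical helper: compares 16 bytes at `a` with 16 bytes at `b` and
// returns a 16-bit mask whose bit k is set when a[k] > b[k].
// `Unsigned` selects at compile time whether sign correction is applied.
template <bool Unsigned>
int gt_mask_8(const void* a, const void* b) {
    __m128i va = _mm_loadu_si128(static_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(static_cast<const __m128i*>(b));
    if (Unsigned) { // constant per instantiation, so the branch folds away
        // XOR with 0x80 flips the top bit of each byte: 0..255 -> -128..127.
        const __m128i sign = _mm_set1_epi8(static_cast<char>(-128));
        va = _mm_xor_si128(va, sign);
        vb = _mm_xor_si128(vb, sign);
    }
    // Only a signed "greater than" exists for 8-bit lanes in SSE2.
    return _mm_movemask_epi8(_mm_cmpgt_epi8(va, vb));
}
```

In the unsigned instantiation the two extra XORs sit on the hot path, which is consistent with the unsigned variants being somewhat slower in the benchmark table. A runtime variant would instead pass the correction vector (zero for signed types) as a parameter and always pay for the XORs.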
⏱️ Benchmark results
Benchmark | Before | After | Speedup |
---|---|---|---|
bm<std::int8_t, AlgType::Std>/3000/1800 | 434 ns | 33.7 ns | 12.88 |
bm<std::int8_t, AlgType::Rng>/3000/1800 | 434 ns | 33.4 ns | 12.99 |
bm<std::int16_t, AlgType::Std>/3000/1800 | 440 ns | 67.8 ns | 6.49 |
bm<std::int16_t, AlgType::Rng>/3000/1800 | 432 ns | 66.5 ns | 6.50 |
bm<std::int32_t, AlgType::Std>/3000/1800 | 432 ns | 129 ns | 3.35 |
bm<std::int32_t, AlgType::Rng>/3000/1800 | 431 ns | 130 ns | 3.32 |
bm<std::int64_t, AlgType::Std>/3000/1800 | 437 ns | 238 ns | 1.84 |
bm<std::int64_t, AlgType::Rng>/3000/1800 | 440 ns | 241 ns | 1.83 |
bm<std::uint8_t, AlgType::Std>/3000/1800 | 437 ns | 39.4 ns | 11.09 |
bm<std::uint8_t, AlgType::Rng>/3000/1800 | 440 ns | 39.5 ns | 11.14 |
bm<std::uint16_t, AlgType::Std>/3000/1800 | 438 ns | 79.0 ns | 5.54 |
bm<std::uint16_t, AlgType::Rng>/3000/1800 | 433 ns | 80.9 ns | 5.35 |
bm<std::uint32_t, AlgType::Std>/3000/1800 | 432 ns | 146 ns | 2.96 |
bm<std::uint32_t, AlgType::Rng>/3000/1800 | 429 ns | 147 ns | 2.92 |
bm<std::uint64_t, AlgType::Std>/3000/1800 | 500 ns | 279 ns | 1.79 |
bm<std::uint64_t, AlgType::Rng>/3000/1800 | 446 ns | 280 ns | 1.59 |
bm<float, AlgType::Std>/3000/1800 | 658 ns | 112 ns | 5.88 |
bm<float, AlgType::Rng>/3000/1800 | 670 ns | 100 ns | 6.70 |
bm<double, AlgType::Std>/3000/1800 | 653 ns | 179 ns | 3.65 |
bm<double, AlgType::Rng>/3000/1800 | 657 ns | 177 ns | 3.71 |