Vectorize `is_sorted_until` by AlexGuteniev · Pull Request #5420 · microsoft/STL (original) (raw)

🗺️ Overview

Another vectorziation, apparently one of the last relatively low hanging fruits 🍎.
Uses offset load like in remove, unique or adjacent_find, and comparisons from minmax family.

⚖️ Less and greater

Sorting in both ascending and descending order makes equal sense, although the standard picks less as the default predicate. We support both less and greater here. This is done for the first time. It wasn't done for minmax family in the past, because for minmax it is very squirrelly to reverse the predicate 🐿️.

To reduce the number of functions, less and greater distinguished by a bool parameter, which turns to a couple of offsets for the loads, making difference which of the loads are actually with a negative offset.

The runtime cost of applying extra offset on each iteration is nonzero, but still I expect it to be small.

Fun observation: one could attempt to pass less_equal or greater_equal too, and expect the algorithm to be checking for strictly asceding/desceding values. But it is UB according to the sorted algorithms requirements, and there's _DEBUG_LT_PRED to alert about that 😈. The vectorization will not affect this case anyhow, as it looks for the specific predicates.

🐞 Floating bugs

We support floating point types here too.

I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.

One caveat I see is signaling values🚨. As this is early return algorithm, for some data it may be expected not to signal when handling elements one-by-one, and it would start signaling when working with vectors. But we already do not enable floating vector algorithms for /fp:except, so it should be fine:

#if _USE_STD_VECTOR_ALGORITHMS && !defined(_M_FP_EXCEPT)
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1
#else // ^^^ use vector algorithms and fast math / not use vector algorithms or not use fast math vvv
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
#endif // ^^^ not use vector algorithms or not use fast math ^^^

If anything goes wrong, the escape hatch is still there to help.

➕ Unsigned values

Like in minmax_element, we need the *_gt intrinsics that only exist for signed values, so we have to do sign correction.

Unlike in minmax_element, we apply the correction at compile-time, having different signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.

Can be changed if the benchmark results do not show the signed/unsigned difference persuasive enough. Like we have ~33.5 for int8_t and ~39.5 for iint8_t, and in runtime correction applying I expect ~39.5 for both, but with less machine code generated, and with simpler dispatch in the header.

⏱️ Benchmark results