Floating minmax: fix negative zero handling and dedicated test coverage for arrays of +0.0 and -0.0 only by AlexGuteniev · Pull Request #4734 · microsoft/STL (original) (raw)

Initially I thought that it could be fixed by using careful minmax implementation, that selects correctly either the first or the last value when the comparands are equivalent.

I've learned the behavior of [v]{min|max}{s|p}{s|d} instructions (thanks @statementreply and @Alcaro for enlightening me on that), figured out that it was possible to control which of the equivalent values is the result, also I've reported the compiler bug DevCom-10686775, and found a reliable workaround for it.

Unfortunately, the control over a single minmax instruction result is not enough. The whole value-based vectorization appoach does not work well with order requirements for equivalent elements Efficient vectorization requires vertical comparisons (same elements on different vector values) to be performed first, and horiziontal comparisons (different elements on the same vector value) to be performed last.

With index-based approach, as in minmax_element, changed order is fine, as we're looking for smallest/greatest index among equal elements.

As a result, we have to resort to using minmax_element approach for floating minmax, unless /fp:fast is specified. Should be not a big loss though -- the benchmark results in #4659 shows that smaller types benefit from minmax approach a lot, but floats not a lot. Definitely still way faster than scalar.

/fp:fast is still fine, as the compiler takes advantage of not distinguishing +0.0 and -0,0 and is able to emit vectorized minmax itself (see related issue #4453)

I decided to keep comparisons reordering for floats in -- this seems to improve the handing of NAN values, which is decided to be unsupported, but why won't keep something that accidentally does things better.


⏱️ Benchmark results

C:\Project\STL>out\bench\benchmark-minmax_element.exe --benchmark_min_time=1s --benchmark_filter=(float^|double)
2024-06-22T13:46:09+03:00
Running out\bench\benchmark-minmax_element.exe
Run on (12 X 2496 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Benchmark main fix fix + /fp:fast
bm<float, 8021, Op::Min> 1184 ns 1207 ns 1221 ns
bm<float, 8021, Op::Max> 1210 ns 1212 ns 1208 ns
bm<float, 8021, Op::Both> 1357 ns 1379 ns 1362 ns
bm<float, 8021, Op::Min_val> 891 ns 1211 ns 891 ns
bm<float, 8021, Op::Max_val> 915 ns 1222 ns 883 ns
bm<float, 8021, Op::Both_val> 955 ns 1378 ns 940 ns
bm<double, 8021, Op::Min> 2246 ns 2393 ns 2352 ns
bm<double, 8021, Op::Max> 2365 ns 2393 ns 2361 ns
bm<double, 8021, Op::Both> 2719 ns 2753 ns 2727 ns
bm<double, 8021, Op::Min_val> 1880 ns 2365 ns 1849 ns
bm<double, 8021, Op::Max_val> 1877 ns 2358 ns 1868 ns
bm<double, 8021, Op::Both_val> 1933 ns 2688 ns 1913 ns

The reodreding of _mm[256]_{min|max}_p{s|d} args seems a bit unfavorable for performance, but not very much, at least the results difference is within variation.