MSVC Backend Updates since Visual Studio 2022 version 17.3
Since Visual Studio 2022 version 17.3, we have continued to improve the C++ backend with new features as well as new and improved optimizations. Here are some of our exciting improvements.
- 17.9 improvements for x86 and x64, thanks to our friends at Intel.
- Support for scalar FP intrinsics with double/float arguments.
- Improve code generation by replacing VINSERTPS with VBLENDPS, for x64 only.
- Support for round scalar functions (see the sketch after this list).
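As a concrete illustration of what a round-scalar operation does, here is a minimal sketch of mine (not from the post) using the standard SSE4.1 intrinsic _mm_round_sd, which rounds only the low double-precision element:

```cpp
// Minimal sketch: _mm_round_sd rounds the low double of b and copies the
// upper lane from a, so only the scalar element is affected.
#include <smmintrin.h>
#include <cstdio>

int main() {
    __m128d a = _mm_setzero_pd();
    __m128d b = _mm_set_sd(2.5);
    // Round to nearest (ties-to-even), suppressing precision exceptions.
    __m128d r = _mm_round_sd(a, b, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    printf("%f\n", _mm_cvtsd_f64(r)); // prints 2.000000
}
```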
- 17.8 improvements
- The new /ARM64XFUNCTIONPADMINX64:# flag allows specifying the number of bytes of padding for x64 functions in arm64x images.
- The new /NOFUNCTIONPADSECTION:sec flag allows disabling function padding for functions in a particular section.
- LTCG build takes better advantage of threads, improving throughput.
- Support for RAO-INT, thanks to our friends at Intel.
- Address sanitizer improvements:
* The Address Sanitizer flag is now compatible with C++ modules.
* The compiler will now report an error when /fsanitize=address is combined with an incompatible flag, instead of silently disabling ASAN checks.
* ASAN checks are now emitted for loads and stores in memchr, memcmp, and the various string functions.
- Performance improvements that will help every architecture:
* Improve hoisting of loads and stores outside of loops.
- Performance improvements for arm64:
* Improve memcmp performance on both arm64 and arm64ec.
* When calling memcpy, memset, memchr, or memcmp from emulated x64 code, remove the performance overhead of switching to arm64ec versions of these functions.
* Optimize scalar immediate loads (from our friends at ARM)
* Combine CSET and ADD instructions into a single CINC instruction (from our friends at ARM; see the sketch after this list).
- Performance improvements for x86 and x64, many thanks to our friends at Intel:
* Improve code generation for _mm_fmadd_sd.
* Improve code generation for UMWAIT and TPAUSE, preserving implicit input registers.
* Improve code generation for vector shift intrinsics by improving the auto-vectorizer.
* Tune internal vectorization thresholds to improve auto-vectorization.
* Implement optimization for FP classification beyond std::isnan.
* Performance improvements for x64:
* Generate a single PSHUFLW instruction for _mm_set1_epi16 when only the lower 64 bits of the result are used.
* Improve code generation for abs(). (Thanks to our friends at AMD)
* No longer generate redundant loads and stores when LDDQU is combined with VBROADCAST128.
* Generate PMADDWD instead of PMULLD where possible.
* Combine two contiguous stores into a single unaligned store.
* Use 32 vector registers in functions that use AVX512 intrinsics even when not compiling with /arch:AVX512.
* Don’t emit unnecessary register to register moves.
* Performance improvements for x86:
* Improve code generation for expf().
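To make the CSET/ADD combine concrete, here is a minimal sketch of mine (not from the post) of the source pattern it targets; the assembly in the comments is illustrative:

```cpp
// Incrementing a value by the result of a comparison: the 0-or-1
// comparison result used to be materialized with CSET and then added
// with ADD; a single conditional increment (CINC) now does both.
unsigned count_if_greater(unsigned count, int a, int b) {
    // Before: cmp w1, w2 ; cset w8, gt ; add w0, w0, w8
    // After:  cmp w1, w2 ; cinc w0, w0, gt
    return count + (a > b);
}
```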
- 17.7 improvements
- New /jumptablerdata flag places jump tables for switch statements in the .rdata section instead of the .text section.
- Link time with a cold file system cache is now faster.
- Improve compilation time of POGO-instrumented builds.
- Speed up LTCG compilation in a variety of ways.
- OpenMP improvements with /openmp:llvm, thanks to our friends at Intel:
* #pragma omp atomic update and #pragma omp atomic capture no longer need to call into the runtime, improving performance (see the first sketch after this list).
* Better code generation for OpenMP floating point atomics.
* The clause schedule(static) is now respected for ordered loops.
- Performance improvements for all architectures:
* Copy propagation optimizations are now more effective, thanks to our friends from AMD.
* Improve optimization for DeBruijn table.
* Fully unroll loops of fixed size even if they contain function calls.
* Improve bit optimizations.
* Deeply nested loops are now optimized.
- Performance improvements and additional functionality for x86 and x64, many thanks to our friends at Intel:
* Support Intel Sierra Forest instruction set (AVX-IFMA, AVX-NE-CONVERT, AVX-VNNI-INT8, CMPCCXADD, Additional MSR support).
* Support Intel Granite Rapids instruction set (AMX-COMPLEX).
* Support LOCK_SUB.
* Add overflow detection functions for addition, subtraction, and multiplication.
* Implement intrinsic functions for isunordered, isnan, isnormal, isfinite, isinf, issubnormal, fmax, and fmin.
* Reduce code size of bitwise vector operations.
* Improve code generation for AVX2 instructions during tail call optimization.
* Improve code generation for floating point instructions without an SSE version.
* Remove unneeded PAND instructions.
* Improve assembler output for FP16 truncating conversions to use suppress-all-exceptions instead of embedded rounding.
* Eliminate unnecessary hoisting of conversions from FP to unsigned long long.
* Performance improvements for x64:
* No longer emit unnecessary MOVSX/MOVZX instructions.
* Do a better job of devirtualizing calls to class functions.
* Improve performance of memmove.
* Improve code generation for the XOR-EXTRACT combination pattern.
- Performance improvements for arm64:
* Improve register coloring for destinations of NEON BIT, BIF, and BSL instructions, thanks to our friends at ARM.
* Convert cross-binary indirect calls that use the import address table into direct calls.
* Add the _CountTrailingZeros and _CountTrailingZeros64 intrinsics for counting trailing zeros in integers (see the second sketch after this list).
* Generate BFI instructions in more places.
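Here is a minimal sketch of mine (not from the post) showing the atomic update and capture forms that now compile to inline atomic instructions under /openmp:llvm:

```cpp
// Compile with: cl /openmp:llvm /O2 atomics.cpp
#include <omp.h>

double sum(const double* data, int n) {
    double total = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Atomic update: lowered to an inline atomic read-modify-write
        // instead of a call into the OpenMP runtime.
        #pragma omp atomic update
        total += data[i];
    }
    return total;
}

int next_ticket(int& counter) {
    int mine;
    // Atomic capture: atomically increments and captures the old value.
    #pragma omp atomic capture
    mine = counter++;
    return mine;
}
```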
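And a sketch of the new trailing-zero intrinsics; note that the parameter types are my assumption, mirroring the existing _CountLeadingZeros/_CountLeadingZeros64 ARM64 intrinsics:

```cpp
// ARM64-only sketch; the signatures below are assumed, not confirmed
// by the post.
#include <intrin.h>

unsigned lowest_set_bit(unsigned long mask) {
    // Index of the least significant set bit of mask.
    return _CountTrailingZeros(mask);
}

unsigned lowest_set_bit64(unsigned __int64 mask) {
    return _CountTrailingZeros64(mask);
}
```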
- 17.6 improvements
- The /openmp:llvm flag now supports the collapse clause on #pragma omp loop (Full Details.)
- The new /d2AsanInstrumentationPerFunctionThreshold:# flag allows turning off ASAN instrumentation on functions that would add more than a certain number of extra ASAN calls.
- New /OTHERARCHEXPORTS option for dumpbin /EXPORTS will dump the x64 exports of an arm64x DLL.
- Build time improvements:
* Improved LTCG build throughput.
* Reduced LTCG build memory usage.
* Reduced link time during incremental linking.
- Performance improvements that will help every architecture:
* Vectorize loops that use min, max, and absolute value, thanks to our friends at ARM (see the sketch after this list).
* Turn loops with a[i] = ((a[i]>>15)&0x10001)*0xffff into vector compares.
* Hoist calculation of array bases of the form (a + constant)[i] out of the loop.
- Performance improvements on arm64:
* Load floats directly into floating point registers instead of using integer load and FMOV instructions.
* Improve code generation for abs(), thanks to our friends at ARM.
* Improve code generation for vectors when NEON instructions are available.
* Generate CSINC instructions when the ? operator has the constant 1 as a possible result of the expression, thanks to our friends at ARM.
* Improve code generation for loops that sum an array by using vector add instructions.
* Combine vector extend and arithmetic instructions into a single instruction.
* Remove extraneous adds, subtractions, and ors with 0.
* Auxiliary delayload IAT: a new import address table for calls into delayloaded DLLs in arm64x. At runtime, Windows will patch this table to speed up program execution.
- Performance improvements and additional features on x86 and x64, many thanks to our friends at Intel:
* Support for the Intel Granite Rapids x64 instruction set, specifically TDPFP16PS (AMX-FP16) and PREFETCHIT0/PREFETCHIT1.
* Support for ties-to-away rounding for round and roundf intrinsic functions.
* Reduce small loops to vectors.
* No longer generate redundant MOVD/MOVQ instructions.
* Use VBLEND instructions instead of the slower VINSERTF128 and VBLENDPS instructions on AVX512 where possible.
* Promote PCLMULQDQ instructions to VPCLMULQDQ where possible with /arch:AVX or later.
* Replace VEXTRACTI128 instructions that extract the lower half of a vector with VMOVDQU instructions, thanks to our friends at AMD.
* Support for missing AVX512-FP16 intrinsics.
* Better code generation with correct VEX/EVEX encoding for VCMPXX pseudo-ops in MASM.
* Improve conversions from 64-bit integer to floating-point.
* Improve code generation on x64 with correct instruction scheduling for STMXCSR.
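As an illustration of the min/max/absolute-value vectorization, here is a minimal sketch of mine (not from the post) of a loop shape that the auto-vectorizer can now turn into vector instructions:

```cpp
// A scalar clamp loop built from abs and min idioms; loops of this
// shape can now be auto-vectorized (e.g. when compiled with /O2).
void clamp_magnitude(int* a, int n, int limit) {
    for (int i = 0; i < n; ++i) {
        int v = a[i] < 0 ? -a[i] : a[i]; // absolute value
        a[i] = v > limit ? limit : v;    // min(v, limit)
    }
}
```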
- 17.5 improvements
- The new /Zc:checkGwOdr flag allows for enforcing C++ standards for ODR violations even when compiling with /Gw.
- Combine a MOV and a CSEL instruction into a CSINV instruction on arm64 (see the sketch after this list).
- Performance and code quality improvements for x86 and x64, thanks to our friends at Intel:
* Improve code generation for returns of structs consisting of 2 64-bit values on x64.
* Type conversions no longer generate unnecessary FSTP/FLD instructions.
* Improve checking floating-point values for Not-a-Number.
* Emit a smaller sequence in the auto-vectorizer with bit masking and reduction.
* Correct expansion of round to use ROUND instruction only under /fp:fast.
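Here is a minimal sketch of mine (not from the post) of a pattern behind the MOV+CSEL combine; the assembly in the comments is illustrative:

```cpp
// Selecting between a value and all-ones: the all-ones constant used to
// need its own MOV before the CSEL; CSINV with the zero register does
// both at once, since ~xzr is all ones.
unsigned long long value_or_all_ones(bool c, unsigned long long a) {
    // Before: mov x8, #-1 ; tst w0, #1 ; csel x0, x1, x8, ne
    // After:  tst w0, #1 ; csinv x0, x1, xzr, ne
    return c ? a : ~0ULL;
}
```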
- 17.4 improvements
- Performance improvements that will help every architecture:
* Improve bswap for signed integers.
* Improve stack packing for functions with memset calls.
- Improve the debugging support and performance for Arm64:
* Edit and Continue is now possible for programs targeting Arm64.
* Added support for armv8 int8 matrix multiplication instructions.
* Use BIC instructions in place of an MVN and AND (see the first sketch after this list).
* Use the BIC_SHIFT instruction where appropriate.
- Performance and code quality improvements on x64 and x86, thanks to our friends at Intel:
* std::memchr now meets the additional C++17 requirement of stopping as soon as a matching byte is found.
* Improve code generation for 16-bit interlocked add.
* Coalesce register initialization on AVX/AVX2.
* Improve code generation for returns of structs consisting of 2 64-bit values.
* Improve codegen for _mm_ucomieq_ss.
* Use VROUNDXX instructions for ceil, floor, trunc, and round.
* Improve checking floating-point values for Not-a-Number.
- Support for OpenMP Standard 3.1 under the experimental -openmp:llvm switch expanded to include the min and max operators on the reduction clause (see the second sketch after this list).
- Improve copy and move elision.
- The new /Qspectre-jmp flag adds an int3 after unconditional jump instructions.
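Here is a minimal sketch of mine (not from the post) of the AND-NOT pattern behind the BIC improvement; the assembly in the comments is illustrative:

```cpp
// AND-NOT: previously a bitwise NOT (MVN) followed by an AND;
// BIC (bit clear) computes a & ~mask in a single instruction.
unsigned clear_bits(unsigned a, unsigned mask) {
    // Before: mvn w8, w1 ; and w0, w0, w8
    // After:  bic w0, w0, w1
    return a & ~mask;
}
```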
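And a minimal sketch of mine (not from the post) of the newly supported max operator on the reduction clause:

```cpp
// Compile with: cl -openmp:llvm /O2 reduce.cpp
// Each thread keeps a private maximum; OpenMP combines them at the end.
#include <climits>

int max_value(const int* a, int n) {
    int m = INT_MIN;
    #pragma omp parallel for reduction(max : m)
    for (int i = 0; i < n; ++i) {
        if (a[i] > m) m = a[i];
    }
    return m;
}
```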
Do you want to experience the new improvements in the C++ backend? Please download the latest Visual Studio 2022 and give it a try! Any feedback is welcome. We can be reached via the comments below, Developer Community, Twitter (@VisualC), or email at visualcpp@microsoft.com.
Stay tuned for more information on updates to the latest Visual Studio.
Author
Bran Hagger is a software developer on the C++ Machine Independent codegen team. His focus is OpenMP and the SSA optimizer.