[compiler-rt][ARM] Optimized f32 add/subtract for Armv6-M. by statham-arm · Pull Request #154093 · llvm/llvm-project

This commit replaces the contents of the existing arm/addsf3.S with a much faster implementation that Arm has recently open-sourced in the Arm Optimized Routines git repository.

The new implementation is approximately 1.6× as fast as the old one on average. Some sample cycle timings from a Cortex-M0, with test cases covering both magnitude addition and subtraction and various cases of renormalization:

New code: 73, 63, 53, 81, 81
Old code: 83, 92, 88, 153, 168
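
As a quick sanity check on the sample figures above (not the benchmark behind the 1.6× average, which is not shown here), the five timings can be totalled and compared. This is a minimal sketch; the array and function names are made up for illustration.

```c
#include <stddef.h>

/* Sample Cortex-M0 cycle counts quoted in this description. */
static const int new_cycles[] = {73, 63, 53, 81, 81};
static const int old_cycles[] = {83, 92, 88, 153, 168};

/* Sum n cycle counts. */
static int total(const int *v, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Ratio of total old cycles to total new cycles for the five samples. */
static double sample_speedup(void) {
    size_t n = sizeof new_cycles / sizeof new_cycles[0];
    return (double)total(old_cycles, n) / (double)total(new_cycles, n);
}
```

For these five samples the totals are 584 cycles (old) and 351 cycles (new), a ratio of about 1.66, which is consistent with the quoted average.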

This commit also contains a more thorough test suite for single-precision addition and subtraction. Using that test suite, I also found that the previous arm/addsf3.S had at least one bug, which the new code fixes: adding the largest denormal (0x007fffff) to itself returned 0x007ffffe, a slightly smaller number, instead of the correct 0x00fffffe.
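
The denormal case above can be reproduced with a small bit-pattern check. This is only a sketch, not the test suite added by this commit: the helper names are hypothetical, and it assumes that `float` is IEEE 754 binary32 and that denormals are not flushed to zero. On a soft-float Armv6-M build the `+` would normally be lowered to a call into the library's addition routine; on a host with hardware FP it only confirms the expected answer.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a single-precision float. */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterpret a single-precision float as its 32-bit pattern. */
static uint32_t bits_from_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Add two floats given as bit patterns and return the result's bit pattern.
 * volatile keeps the compiler from folding the addition at compile time. */
static uint32_t add_bits(uint32_t a, uint32_t b) {
    volatile float x = float_from_bits(a);
    volatile float y = float_from_bits(b);
    return bits_from_float(x + y);
}
```

Calling `add_bits(0x007fffff, 0x007fffff)` should give 0x00fffffe: doubling the largest denormal carries out of the 23-bit fraction into the exponent field, producing a normal number with exponent field 1.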

The test suite also includes thorough tests for the NaN handling policy implemented by the new code. That policy is in line with Arm's hardware FP implementations (so that switching between software and hardware FP makes as little difference as possible to the answers), but doesn't match what compiler-rt does in all other situations, so I've enabled it only under an #ifdef that should match when this implementation is selected.

The new code contains entry points for both addition and subtraction, with cross-branching between them after correcting signs. This avoids the overhead of treating subtraction as a sign-flipping wrapper on addition, but it also means I had to add an extra mechanism to the build scripts so that the wrapper version of subsf3.c can be excluded from the build when the new addsf3.S is present. You can indicate that a platform-specific source file replaces an additional platform-independent one by setting its crt_supersedes property in CMake.
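
For context, the "sign-flipping wrapper" approach that the new assembly avoids can be sketched as follows. This is an illustration rather than the contents of subsf3.c: `sketch_addsf3` and `sketch_subsf3` are hypothetical names, the addition is a stand-in that uses the native `+`, and `float` is assumed to be IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the library's addition routine. */
static float sketch_addsf3(float a, float b) {
    return a + b;
}

/* Subtraction as a wrapper: flip the sign bit of b, then add.
 * Every subtraction pays for the extra call and the bit manipulation. */
static float sketch_subsf3(float a, float b) {
    uint32_t bits;
    memcpy(&bits, &b, sizeof bits);
    bits ^= UINT32_C(0x80000000);   /* toggle the sign bit (bit 31) */
    memcpy(&b, &bits, sizeof b);
    return sketch_addsf3(a, b);
}
```

With separate entry points that branch into shared code after the signs have been corrected, as described above, subtraction no longer goes through this extra layer, which is why the wrapper source has to be left out of the build.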