[llvm-dev] [ARM] Should Use Load and Store with Register Offset

Daniel Way via llvm-dev llvm-dev at lists.llvm.org
Tue Jul 21 00:12:04 PDT 2020


Hello Sjoerd,

Thank you for your response! I was not aware that -Oz is the closer equivalent to GCC's -Os. I tried -Oz with Clang and confirmed that the generated assembly is equivalent to GCC's for the code snippet I posted above.

clang --target=armv6m-none-eabi -Oz -fomit-frame-pointer

memcpy_alt1:
        push    {r4, lr}
        movs    r3, #0
.LBB0_1:
        cmp     r2, r3
        beq     .LBB0_3
        ldrb    r4, [r1, r3]
        strb    r4, [r0, r3]
        adds    r3, r3, #1
        b       .LBB0_1
.LBB0_3:
        pop     {r4, pc}

On the other hand, GCC at -O2 still uses the register-offset load and store instructions, while Clang at -O2 generates the same assembly as at -Os: immediate-offset (offset 0) loads and stores followed by instructions incrementing the base registers. I have not benchmarked the Clang-generated code; it is possible that execution time is bounded by the loads, stores, and memory access latency. Intuitively, however, both GCC and Clang generate one load and one store per byte, so when Clang inserts two additional adds instructions, the binary is larger, execution could be slower, and there is no improvement in register utilization over GCC.

I also tried a couple of other variants of memcpy-like functions. The https://godbolt.org/z/d7P6rG link includes memcpy_alt2, which copies data from src to dst starting at the high address, and memcpy_silly, which copies src to dst<0-4>. Here is the behavior I noticed from GCC and Clang.

memcpy_alt2

memcpy_silly

I really think that, when limited to the Thumb-1 ISA, register-offset load and store instructions should be used at the -Oz, -Os, and -O2 optimization levels. Explicitly incrementing a register holding the base address seems wasteful when the same value is already available as an offset, and I cannot see how it improves execution time in the examples I'm investigating. I'd like to know if I'm wrong in assuming that LDR Rd, [Rn, Rm] and LDR Rd, [Rn, #imm] have the same execution time, but based on the Cortex-M0+ TRM they should both require 2 clock cycles.

Best regards,

Daniel Way

On Mon, Jul 20, 2020 at 6:15 PM Sjoerd Meijer <Sjoerd.Meijer at arm.com> wrote:

Hello Daniel,

LLVM and GCC's optimisation levels are not really equivalent. In Clang, -Os makes a performance and code-size trade-off. In GCC, -Os minimises code size, which is equivalent to -Oz with Clang. I haven't looked into the details yet, but changing -Os to -Oz in the godbolt link gives the codegen you're looking for?

Cheers,
Sjoerd.

------------------------------
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Daniel Way via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 20 July 2020 06:54
To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] [ARM] Should Use Load and Store with Register Offset

Hello LLVM Community (specifically anyone working with ARM Cortex-M),

While trying to compile the Newlib C library, I found that Clang 10 was generating slightly larger binaries than the libc from the prebuilt gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy, strcpy, etc.) and noticed that LLVM does not tend to generate load/store instructions with a register offset (e.g. the ldr Rd, [Rn, Rm] form) and instead prefers the immediate-offset form. When copying a contiguous sequence of bytes, this results in additional instructions to modify the base address.
https://godbolt.org/z/T1xhae

void* memcpy_alt1(void* dst, const void* src, size_t len)
{
    char* save = (char*)dst;
    for (size_t i = 0; i < len; ++i)
        *((char*)(dst + i)) = *((char*)(src + i));
    return save;
}

clang --target=armv6m-none-eabi -Os -fomit-frame-pointer

memcpy_alt1:
        push    {r4, lr}
        cmp     r2, #0
        beq     .LBB0_3
        mov     r3, r0
.LBB0_2:
        ldrb    r4, [r1]
        strb    r4, [r3]
        adds    r1, r1, #1
        adds    r3, r3, #1
        subs    r2, r2, #1
        bne     .LBB0_2
.LBB0_3:
        pop     {r4, pc}

arm-none-eabi-gcc -march=armv6-m -Os

memcpy_alt1:
        movs    r3, #0
        push    {r4, lr}
.L2:
        cmp     r3, r2
        bne     .L3
        pop     {r4, pc}
.L3:
        ldrb    r4, [r1, r3]
        strb    r4, [r0, r3]
        adds    r3, r3, #1
        b       .L2

Because this code appears in a loop that could be copying hundreds of bytes, I want to add an optimization that will prioritize load/store instructions with register offsets when the offset is used multiple times. I have not worked on LLVM before, so I'd like advice about where to start.

- The generated code is correct, just sub-optimal, so is it appropriate to submit a bug report?
- Is anyone already tackling this change, or is there someone with more experience interested in collaborating?
- Is this optimization better performed early, during instruction selection, or late, in C++ (i.e. ARMLoadStoreOptimizer.cpp)?
- What is the potential to cause harm to other parts of the code gen, specifically for other ARM targets? I'm working with armv6m, but armv7m offers base-register updating in a single instruction. I don't want to break other useful optimizations.

So far, I am reading through the LLVM documentation to see where a change could be applied. I have also:

- Compiled with -S -emit-llvm (see Godbolt link). There is an identifiable pattern where a getelementptr instruction is followed by a load or store. When multiple getelementptr instructions appear with the same virtual register offset, maybe this should match a tLDRr or tSTRr.
- Ran llc with --print-machineinstrs. It appears that tLDRBi and tSTRBi are selected very early and never replaced by the equivalent t<LDRB|STRB>r instructions.

Thank you,
Daniel Way
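For concreteness, the getelementptr-plus-memory-access pattern described in the quoted message looks roughly like this (a simplified sketch of what -S -emit-llvm might produce for the copy-loop body, written in the typed-pointer IR syntax of the LLVM 10 era; the value names are illustrative, not taken from actual output):

```llvm
; one shared induction variable %i indexes both pointers,
; so both accesses could lower to the register-offset form
%src.addr = getelementptr inbounds i8, i8* %src, i32 %i
%byte     = load i8, i8* %src.addr, align 1
%dst.addr = getelementptr inbounds i8, i8* %dst, i32 %i
store i8 %byte, i8* %dst.addr, align 1
%i.next   = add nuw i32 %i, 1
```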


