[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores

Nirav Davé via llvm-dev llvm-dev at lists.llvm.org
Tue Sep 11 12:06:10 PDT 2018

Hmm. This looks like the backend conservatively giving up early on merging. It looks like you're running clang 5.02. There have been some improvements to the backend's memory aliasing and store merging that have landed since. Can you check if this is fixed in a newer version?

-Nirav

On Tue, Sep 11, 2018 at 2:21 PM, Andres Freund <andres at anarazel.de> wrote:

Hi,

On 2018-09-11 11:16:25 -0400, Nirav Davé wrote:
> Andres:
>
> FWIW, codegen will do the merge if you turn on global alias analysis for it
> "-combiner-global-alias-analysis". That said, we should be able to do this
> merging earlier.

Interesting. That does something for my real case, but certainly not as much as I'd expected, or what I can get dse-partial-store-merging to do if I emit some "superfluous" earlier store (which encompasses all the previous stores) that allows it to do its job.

In the case at hand, with a manual 64bit store (this is on a 64bit target), llvm then combines 8 byte-wide stores into one.

Without -combiner-global-alias-analysis it generates:

    movb $0, 1(%rdx)
    movl 4(%rsi,%rdi), %ebx
    movq %rbx, 8(%rcx)
    movb $0, 2(%rdx)
    movl 8(%rsi,%rdi), %ebx
    movq %rbx, 16(%rcx)
    movb $0, 3(%rdx)
    movl 12(%rsi,%rdi), %ebx
    movq %rbx, 24(%rcx)
    movb $0, 4(%rdx)
    movq 16(%rsi,%rdi), %rbx
    movq %rbx, 32(%rcx)
    movb $0, 5(%rdx)
    movq 24(%rsi,%rdi), %rbx
    movq %rbx, 40(%rcx)
    movb $0, 6(%rdx)
    movq 32(%rsi,%rdi), %rbx
    movq %rbx, 48(%rcx)
    movb $0, 7(%rdx)
    movq 40(%rsi,%rdi), %rsi

where (%rdx) is the array of 1-byte values, where I hope to get stores combined, which is guaranteed to be 8-byte aligned.

With -combiner-global-alias-analysis it generates:

    movw $0, (%rsi)
    movl (%rcx,%rdi), %ebx
    movq %rbx, (%rdx)
    movl 4(%rcx,%rdi), %ebx
    movl 8(%rcx,%rdi), %r8d
    movq %rbx, 8(%rdx)
    movl $0, 2(%rsi)
    movq %r8, 16(%rdx)
    movl 12(%rcx,%rdi), %ebx
    movq %rbx, 24(%rdx)
    movq 16(%rcx,%rdi), %rbx
    movq %rbx, 32(%rdx)
    movq 24(%rcx,%rdi), %rbx
    movq %rbx, 40(%rdx)
    movb $0, 6(%rsi)
    movq 32(%rcx,%rdi), %rbx
    movq %rbx, 48(%rdx)
    movb $0, 7(%rsi)

where (%rsi) is the array of 1-byte values. So it's a 2, 4, 1, 1 byte store. Huh?

Whereas, if I emit a superfluous 8-byte store beforehand it becomes:

    movq $0, (%rsi)
    movl (%rcx,%rdi), %ebx
    movq %rbx, (%rdx)
    movl 4(%rcx,%rdi), %ebx
    movq %rbx, 8(%rdx)
    movl 8(%rcx,%rdi), %ebx
    movq %rbx, 16(%rdx)
    movl 12(%rcx,%rdi), %ebx
    movq %rbx, 24(%rdx)
    movq 16(%rcx,%rdi), %rbx
    movq %rbx, 32(%rdx)
    movq 24(%rcx,%rdi), %rbx
    movq %rbx, 40(%rdx)
    movq 32(%rcx,%rdi), %rbx
    movq %rbx, 48(%rdx)
    movq 40(%rcx,%rdi), %rcx

so just a single 8-byte store.
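To make the store widths above concrete, here is a minimal C sketch, assuming a plain 8-byte buffer that stands in for the array of 1-byte values. The helper names are hypothetical and not taken from the attached test files. It only checks that eight 1-byte stores, the 2, 4, 1, 1 byte sequence, and a single 8-byte store all leave the same bytes in memory, which is why merging them is legal when nothing in between aliases the buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: each zeroes the same 8 bytes using a
 * different mix of store widths. memcpy is used for the wide
 * stores to stay clear of alignment and aliasing issues in C. */

static void zero_bytewise(uint8_t *p) {
    for (int i = 0; i < 8; i++)
        p[i] = 0;                    /* eight 1-byte stores */
}

static void zero_2_4_1_1(uint8_t *p) {
    uint16_t w = 0;
    uint32_t l = 0;
    memcpy(p, &w, sizeof w);         /* 2 bytes at offset 0 */
    memcpy(p + 2, &l, sizeof l);     /* 4 bytes at offset 2 */
    p[6] = 0;                        /* 1 byte at offset 6  */
    p[7] = 0;                        /* 1 byte at offset 7  */
}

static void zero_quad(uint8_t *p) {
    uint64_t q = 0;
    memcpy(p, &q, sizeof q);         /* one 8-byte store */
}

/* Returns 1 if all three variants leave identical buffers. */
int store_mixes_agree(void) {
    uint8_t a[8], b[8], c[8];
    memset(a, 0xAA, sizeof a);
    memset(b, 0xAA, sizeof b);
    memset(c, 0xAA, sizeof c);
    zero_bytewise(a);
    zero_2_4_1_1(b);
    zero_quad(c);
    return memcmp(a, b, sizeof a) == 0 && memcmp(b, c, sizeof b) == 0;
}
```

Calling store_mixes_agree() from a small main is enough to try it. The sketch says nothing about which form a given compiler version will actually emit.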
I've attached the two testfiles (which unfortunately are somewhat messy):

24703.1.bc - file without "superfluous" store
25256.0.bc - file with "superfluous" store

the workflow I have, emulating the current pipeline, is:

opt -O3 -disable-slp-vectorization -S < /srv/dev/pgdev-dev/25256.0.bc | llc -O3 [-combiner-global-alias-analysis]

Note that the problem can also occur when -disable-slp-vectorization, it
just requires a larger example.

Greetings,

Andres Freund

> -Nirav
>
>
> On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> > Hi,
> >
> > On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > > I have, in postgres, a piece of IR that, after inlining and constant
> > > propagation boils (when cooked on really high heat) down to (also
> > > attached for your convenience):
> > >
> > > source_filename = "pg"
> > > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > > target triple = "x86_64-pc-linux-gnu"
> > >
> > > define void @evalexpr00(i8* align 8 noalias, i32* align 8 noalias) {
> > > entry:
> > >   %a01 = getelementptr i8, i8* %0, i16 0
> > >   store i8 0, i8* %a01
> > >
> > >   ; in the real case this also loads data
> > >   %b01 = getelementptr i32, i32* %1, i16 0
> > >   store i32 0, i32* %b01
> > >
> > >   %a02 = getelementptr i8, i8* %0, i16 1
> > >   store i8 0, i8* %a02
> > >
> > >   ; in the real case this also loads data
> > >   %b02 = getelementptr i32, i32* %1, i16 1
> > >   store i32 0, i32* %b02
> > >
> > >   ; in the real case this also loads data
> > >   %a03 = getelementptr i8, i8* %0, i16 2
> > >   store i8 0, i8* %a03
> > >
> > >   ; in the real case this also loads data
> > >   %b03 = getelementptr i32, i32* %1, i16 2
> > >   store i32 0, i32* %b03
> > >
> > >   %a04 = getelementptr i8, i8* %0, i16 3
> > >   store i8 0, i8* %a04
> > >
> > >   ; in the real case this also loads data
> > >   %b04 = getelementptr i32, i32* %1, i16 3
> > >   store i32 0, i32* %b04
> > >
> > >   ret void
> > > }
> > >
> > > So, here we finally come to my question: Is it really expected that,
> > > unless largely independent optimizations (SLP in this case) happen to
> > > move instructions within the same basic block out of the way, these
> > > stores don't get coalesced? And then only if either the
> > > optimization pipeline is run again, or if instruction selection can do
> > > so?
> > >
> > >
> > > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> > > might address this indirectly. But I'm somewhat doubtful that that's
> > > the most straightforward way to optimize this kind of code?
> >
> > That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can
> > kinda somewhat help by adding a redundant
> >   %i32ptr = bitcast i8* %0 to i32*
> >   store i32 0, i32* %i32ptr
> >
> > at the start. Then dse-partial-store-merging does its magic and
> > optimizes the sub-stores away. But it's fairly ugly to manually have to
> > add superfluous stores in the right granularity (a larger llvm.memset
> > doesn't work).
> >
> > gcc, since 7, detects such cases in its "new" -fstore-merging pass.
> >
> > - Andres
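For readers who prefer source-level code, the quoted IR corresponds roughly to the C sketch below. It is an assumption-laden illustration: the function names are hypothetical, `restrict` stands in for the `noalias` parameter attributes, and the `memcpy` of a zero word plays the role of the redundant `store i32 0` through the bitcast pointer. Whether a particular clang or gcc version merges the byte stores in either variant is not something this sketch demonstrates; it only shows the two store patterns and that both leave the same memory contents.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical counterpart of the IR: byte stores to `a`
 * interleaved with 32-bit stores to `b`. */
void eval_interleaved(uint8_t *restrict a, uint32_t *restrict b) {
    a[0] = 0; b[0] = 0;
    a[1] = 0; b[1] = 0;
    a[2] = 0; b[2] = 0;
    a[3] = 0; b[3] = 0;
}

/* Same stores, preceded by a redundant 4-byte store that covers
 * a[0..3], mirroring the workaround described in the thread. */
void eval_with_cover_store(uint8_t *restrict a, uint32_t *restrict b) {
    uint32_t zero = 0;
    memcpy(a, &zero, sizeof zero);   /* redundant wide store */
    a[0] = 0; b[0] = 0;
    a[1] = 0; b[1] = 0;
    a[2] = 0; b[2] = 0;
    a[3] = 0; b[3] = 0;
}
```

Compiling both functions with optimization and comparing the generated assembly is one way to see whether the compiler at hand coalesces the byte stores on its own or only with the covering store in place.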
