[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores (original) (raw)
Andres Freund via llvm-dev llvm-dev at lists.llvm.org
Tue Sep 11 11:21:16 PDT 2018
- Previous message: [llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
- Next message: [llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi,
On 2018-09-11 11:16:25 -0400, Nirav Davé wrote:
Andres:
FWIW, codegen will do the merge if you turn on global alias analysis for it "-combiner-global-alias-analysis". That said, we should be able to do this merging earlier.
Interesting. That does something for my real case, but certainly not as much as I'd expected, or what I can get dse-partial-store-merging to do if I emit some "superflous" earlier store (which encompass all the previous stores) that allow it to its job.
In the case at hand, with a manual 64bit store (this is on a 64bit target), llvm then combines 8 byte-wide stores into one.
Without -combiner-global-alias-analysis it generates:
movb $0, 1(%rdx)
movl 4(%rsi,%rdi), %ebx
movq %rbx, 8(%rcx)
movb $0, 2(%rdx)
movl 8(%rsi,%rdi), %ebx
movq %rbx, 16(%rcx)
movb $0, 3(%rdx)
movl 12(%rsi,%rdi), %ebx
movq %rbx, 24(%rcx)
movb $0, 4(%rdx)
movq 16(%rsi,%rdi), %rbx
movq %rbx, 32(%rcx)
movb $0, 5(%rdx)
movq 24(%rsi,%rdi), %rbx
movq %rbx, 40(%rcx)
movb $0, 6(%rdx)
movq 32(%rsi,%rdi), %rbx
movq %rbx, 48(%rcx)
movb $0, 7(%rdx)
movq 40(%rsi,%rdi), %rsi
were (%rdi) is the array of 1 byte values, where I hope to get stores combined, which is guaranteed to be 8byte aligned.
With out -combiner-global-alias-analysis it generates:
movw $0, (%rsi)
movl (%rcx,%rdi), %ebx
movq %rbx, (%rdx)
movl 4(%rcx,%rdi), %ebx
movl 8(%rcx,%rdi), %r8d
movq %rbx, 8(%rdx)
movl $0, 2(%rsi)
movq %r8, 16(%rdx)
movl 12(%rcx,%rdi), %ebx
movq %rbx, 24(%rdx)
movq 16(%rcx,%rdi), %rbx
movq %rbx, 32(%rdx)
movq 24(%rcx,%rdi), %rbx
movq %rbx, 40(%rdx)
movb $0, 6(%rsi)
movq 32(%rcx,%rdi), %rbx
movq %rbx, 48(%rdx)
movb $0, 7(%rsi)
where (%rsi) is the array of 1-byte values. So it's a 2, 4, 1, 1 byte store. Huh?
Whereas, if I emit a superflous 8-byte store beforehand it becomes: movq $0, (%rsi) movl (%rcx,%rdi), %ebx movq %rbx, (%rdx) movl 4(%rcx,%rdi), %ebx movq %rbx, 8(%rdx) movl 8(%rcx,%rdi), %ebx movq %rbx, 16(%rdx) movl 12(%rcx,%rdi), %ebx movq %rbx, 24(%rdx) movq 16(%rcx,%rdi), %rbx movq %rbx, 32(%rdx) movq 24(%rcx,%rdi), %rbx movq %rbx, 40(%rdx) movq 32(%rcx,%rdi), %rbx movq %rbx, 48(%rdx) movq 40(%rcx,%rdi), %rcx
so just a single 8-byte store.
I've attached the two testfiles (which unfortunately are somewhat messy): 24703.1.bc - file without "superflous" store 25256.0.bc - file with "superflous" store
the workflow I have, emulating the current pipeline, is:
opt -O3 -disable-slp-vectorization -S < /srv/dev/pgdev-dev/25256.0.bc |llc -O3 [-combiner-global-alias-analysis]
Note that the problem can also occur when -disable-slp-vectorization, it just requires a larger example.
Greetings,
Andres Freund
-Nirav
On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <_ _llvm-dev at lists.llvm.org> wrote: > Hi, > > On 2018-09-10 13:42:21 -0700, Andres Freund wrote: > > I have, in postres, a piece of IR that, after inlining and constant > > propagation boils (when cooked on really high heat) down to (also > > attached for your convenience): > > > > sourcefilename = "pg" > > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128" > > target triple = "x8664-pc-linux-gnu" > > > > define void @evalexpr00(i8* align 8 noalias, i32* align 8 noalias) { > > entry: > > %a01 = getelementptr i8, i8* %0, i16 0 > > store i8 0, i8* %a01 > > > > ; in the real case this also loads data > > %b01 = getelementptr i32, i32* %1, i16 0 > > store i32 0, i32* %b01 > > > > %a02 = getelementptr i8, i8* %0, i16 1 > > store i8 0, i8* %a02 > > > > ; in the real case this also loads data > > %b02 = getelementptr i32, i32* %1, i16 1 > > store i32 0, i32* %b02 > > > > ; in the real case this also loads data > > %a03 = getelementptr i8, i8* %0, i16 2 > > store i8 0, i8* %a03 > > > > ; in the real case this also loads data > > %b03 = getelementptr i32, i32* %1, i16 2 > > store i32 0, i32* %b03 > > > > %a04 = getelementptr i8, i8* %0, i16 3 > > store i8 0, i8* %a04 > > > > ; in the real case this also loads data > > %b04 = getelementptr i32, i32* %1, i16 3 > > store i32 0, i32* %b04 > > > > ret void > > } > > > So, here we finally come to my question: Is it really expected that, > > unless largely independent optimizations (SLP in this case) happen to > > move instructions within the same basic block out of the way, these > > stores don't get coalesced? And then only if the either the > > optimization pipeline is run again, or if instruction selection can do > > so? > > > > > > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which > > might address this indirectly. But I'm somewhat doubtful that that's > > the most straightforward way to optimize this kind of code? > > That doesn't help, but it turns out that //reviews.llvm.org/D30703 can > kinda somwhat help by adding a redundant > %i32ptr = bitcast i8* %0 to i32* > store i32 0, i32* %i32ptr > > at the start. Then dse-partial-store-merging does its magic and > optimizes the sub-stores away. But it's fairly ugly to manually have to > add superflous stores in the right granularity (a larger llvm.memset > doesn't work). > > gcc, since 7, detects such cases in its "new" -fstore-merging pass. > > - Andres _> ________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > -------------- next part -------------- A non-text attachment was scrubbed... Name: 24703.1.bc Type: application/octet-stream Size: 12852 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0002.obj> -------------- next part -------------- A non-text attachment was scrubbed... Name: 25256.0.bc Type: application/octet-stream Size: 12324 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0003.obj>
- Previous message: [llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
- Next message: [llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]