[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores

Andres Freund via llvm-dev llvm-dev at lists.llvm.org
Tue Sep 11 11:21:16 PDT 2018


Hi,

On 2018-09-11 11:16:25 -0400, Nirav Davé wrote:

Andres:

FWIW, codegen will do the merge if you turn on global alias analysis for it "-combiner-global-alias-analysis". That said, we should be able to do this merging earlier.

Interesting. That does something for my real case, but certainly not as much as I'd expected, or as much as I can get dse-partial-store-merging to do if I emit some "superfluous" earlier store (which encompasses all the previous stores) that allows it to do its job.

In the case at hand, with a manual 64-bit store (this is on a 64-bit target), llvm then combines 8 byte-wide stores into one.

Without -combiner-global-alias-analysis it generates:

    movb    $0, 1(%rdx)
    movl    4(%rsi,%rdi), %ebx
    movq    %rbx, 8(%rcx)
    movb    $0, 2(%rdx)
    movl    8(%rsi,%rdi), %ebx
    movq    %rbx, 16(%rcx)
    movb    $0, 3(%rdx)
    movl    12(%rsi,%rdi), %ebx
    movq    %rbx, 24(%rcx)
    movb    $0, 4(%rdx)
    movq    16(%rsi,%rdi), %rbx
    movq    %rbx, 32(%rcx)
    movb    $0, 5(%rdx)
    movq    24(%rsi,%rdi), %rbx
    movq    %rbx, 40(%rcx)
    movb    $0, 6(%rdx)
    movq    32(%rsi,%rdi), %rbx
    movq    %rbx, 48(%rcx)
    movb    $0, 7(%rdx)
    movq    40(%rsi,%rdi), %rsi

where (%rdx) is the array of 1-byte values whose stores I hope to get combined; it is guaranteed to be 8-byte aligned.

With -combiner-global-alias-analysis it generates:

    movw    $0, (%rsi)
    movl    (%rcx,%rdi), %ebx
    movq    %rbx, (%rdx)
    movl    4(%rcx,%rdi), %ebx
    movl    8(%rcx,%rdi), %r8d
    movq    %rbx, 8(%rdx)
    movl    $0, 2(%rsi)
    movq    %r8, 16(%rdx)
    movl    12(%rcx,%rdi), %ebx
    movq    %rbx, 24(%rdx)
    movq    16(%rcx,%rdi), %rbx
    movq    %rbx, 32(%rdx)
    movq    24(%rcx,%rdi), %rbx
    movq    %rbx, 40(%rdx)
    movb    $0, 6(%rsi)
    movq    32(%rcx,%rdi), %rbx
    movq    %rbx, 48(%rdx)
    movb    $0, 7(%rsi)

where (%rsi) is the array of 1-byte values. So it's a 2, 4, 1, 1 byte store. Huh?
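To spell out the 2, 4, 1, 1 arithmetic: the four stores do cover all eight bytes, they are just split oddly. A minimal C sketch that replays them on a poisoned buffer; the helper name is hypothetical and only the offsets and widths are taken from the listing above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: replays the four zero-stores from the listing
 * (2 bytes at offset 0, 4 bytes at offset 2, 1 byte each at offsets
 * 6 and 7) and returns how many of the 8 bytes ended up zeroed. */
static int zeroed_by_split_stores(void)
{
    uint8_t buf[8];
    const uint16_t z16 = 0;
    const uint32_t z32 = 0;
    int zeroed = 0;

    memset(buf, 0xFF, sizeof buf);  /* poison, so untouched bytes show up */
    memcpy(buf + 0, &z16, 2);       /* movw $0, (%rsi)  */
    memcpy(buf + 2, &z32, 4);       /* movl $0, 2(%rsi) */
    buf[6] = 0;                     /* movb $0, 6(%rsi) */
    buf[7] = 0;                     /* movb $0, 7(%rsi) */

    for (size_t i = 0; i < sizeof buf; i++)
        zeroed += (buf[i] == 0);
    return zeroed;
}
```

The widths sum to 2 + 4 + 1 + 1 = 8, so the result is equivalent to one 8-byte store, just emitted as four instructions.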

If I instead emit a superfluous 8-byte store beforehand, it becomes:

    movq    $0, (%rsi)
    movl    (%rcx,%rdi), %ebx
    movq    %rbx, (%rdx)
    movl    4(%rcx,%rdi), %ebx
    movq    %rbx, 8(%rdx)
    movl    8(%rcx,%rdi), %ebx
    movq    %rbx, 16(%rdx)
    movl    12(%rcx,%rdi), %ebx
    movq    %rbx, 24(%rdx)
    movq    16(%rcx,%rdi), %rbx
    movq    %rbx, 32(%rdx)
    movq    24(%rcx,%rdi), %rbx
    movq    %rbx, 40(%rdx)
    movq    32(%rcx,%rdi), %rbx
    movq    %rbx, 48(%rdx)
    movq    40(%rcx,%rdi), %rcx

so just a single 8-byte store.
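For readers who prefer C to the generated code, here is a rough sketch of the store pattern and of the covering-store workaround. The function names, the loop, and the column count of 8 are assumptions for illustration and do not correspond to the attached bitcode; whether a given compiler version merges the byte stores in either variant is not something this sketch claims.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the generated code: per column, one flag
 * byte is cleared and one widened value is stored, so the byte stores
 * are interleaved with unrelated stores. */
static void fill_interleaved(uint8_t *restrict flags,
                             uint64_t *restrict values,
                             const uint32_t *restrict src)
{
    for (int i = 0; i < 8; i++) {
        flags[i] = 0;
        values[i] = src[i];
    }
}

/* Same result, but a "superfluous" 8-byte store covering all flag bytes
 * comes first, which makes each later byte store redundant. */
static void fill_with_covering_store(uint8_t *restrict flags,
                                     uint64_t *restrict values,
                                     const uint32_t *restrict src)
{
    const uint64_t zero = 0;

    memcpy(flags, &zero, sizeof zero);  /* the covering store */
    for (int i = 0; i < 8; i++) {
        flags[i] = 0;                   /* now redundant */
        values[i] = src[i];
    }
}
```

Both functions leave memory in the same state; the only difference is that the second gives a dead-store-elimination pass an earlier, wider store to fold the byte stores into.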

I've attached the two test files (which unfortunately are somewhat messy):
24703.1.bc - file without "superfluous" store
25256.0.bc - file with "superfluous" store

the workflow I have, emulating the current pipeline, is:

opt -O3 -disable-slp-vectorization -S < /srv/dev/pgdev-dev/25256.0.bc |llc -O3 [-combiner-global-alias-analysis]

Note that the problem can also occur without -disable-slp-vectorization, it just requires a larger example.

Greetings,

Andres Freund

-Nirav

On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > I have, in postgres, a piece of IR that, after inlining and constant
> > propagation boils (when cooked on really high heat) down to (also
> > attached for your convenience):
> >
> > source_filename = "pg"
> > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > target triple = "x86_64-pc-linux-gnu"
> >
> > define void @evalexpr00(i8* align 8 noalias, i32* align 8 noalias) {
> > entry:
> >   %a01 = getelementptr i8, i8* %0, i16 0
> >   store i8 0, i8* %a01
> >
> >   ; in the real case this also loads data
> >   %b01 = getelementptr i32, i32* %1, i16 0
> >   store i32 0, i32* %b01
> >
> >   %a02 = getelementptr i8, i8* %0, i16 1
> >   store i8 0, i8* %a02
> >
> >   ; in the real case this also loads data
> >   %b02 = getelementptr i32, i32* %1, i16 1
> >   store i32 0, i32* %b02
> >
> >   ; in the real case this also loads data
> >   %a03 = getelementptr i8, i8* %0, i16 2
> >   store i8 0, i8* %a03
> >
> >   ; in the real case this also loads data
> >   %b03 = getelementptr i32, i32* %1, i16 2
> >   store i32 0, i32* %b03
> >
> >   %a04 = getelementptr i8, i8* %0, i16 3
> >   store i8 0, i8* %a04
> >
> >   ; in the real case this also loads data
> >   %b04 = getelementptr i32, i32* %1, i16 3
> >   store i32 0, i32* %b04
> >
> >   ret void
> > }
> >
> > So, here we finally come to my question: Is it really expected that,
> > unless largely independent optimizations (SLP in this case) happen to
> > move instructions within the same basic block out of the way, these
> > stores don't get coalesced? And then only if either the
> > optimization pipeline is run again, or if instruction selection can do
> > so?
> >
> > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> > might address this indirectly. But I'm somewhat doubtful that that's
> > the most straightforward way to optimize this kind of code?
>
> That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can
> kinda somewhat help by adding a redundant
>   %i32ptr = bitcast i8* %0 to i32*
>   store i32 0, i32* %i32ptr
>
> at the start. Then dse-partial-store-merging does its magic and
> optimizes the sub-stores away. But it's fairly ugly to manually have to
> add superfluous stores in the right granularity (a larger llvm.memset
> doesn't work).
>
> gcc, since 7, detects such cases in its "new" -fstore-merging pass.
>
> - Andres
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 24703.1.bc
Type: application/octet-stream
Size: 12852 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 25256.0.bc
Type: application/octet-stream
Size: 12324 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0003.obj>
