[cfe-dev] [OpenMP] Redundant store inside reduction loop body (original) (raw)

Scaramouch via cfe-dev cfe-dev at lists.llvm.org
Wed Dec 9 06:02:24 PST 2020

Previous message: [cfe-dev] [OpenMP] Redundant store inside reduction loop body
Next message: [cfe-dev] [OpenMP] Redundant store inside reduction loop body
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I think 3 is the best option in general, no?

12/8/2020 7:23 PM, Johannes Doerfert via cfe-dev пишет:

Hi Itay,

the problem is the "escape" of local later in the function (into the OpenMP reduction runtime call). There are various solutions to this, on the compiler side, which I'll try to describe briefly below. If you need a shortstop solution for this issue, you can introduce a secondary privatization variable yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use the user provided variable for accumulation and the OpenMP reduction facility is used for the final reduction across threads. Obviously, this is what the compiler should generate, so here we go: 1) Use PointerMayBeCapturedBefore instead of PointerMayBeCaptured in BasicAliasAnalysis.cpp (the call is done via isNonEscapingLocalObject). The reason we don't do this, I assume, is compile time. Maybe someone will try it out and measure the compile and runtime implications but for not I assume this to be a solution that might not be enacted (though simple and generic). 2) Modify the OpenMP reduction output to introduce the second privatization location. This should be relatively easy but it will only address the problem at hand. Similar problems pop up more often and I would prefer 1) or 3). That said, we can for now enact 2) if someone writes the code ;) 3) Introduce an attribute/annotation that indicates this is actually not an escaping use. Such uses happen all the time (at least in the OpenMP runtime) and it would be ideal if we could indicate the memory store is not causing the pointer to escape. I'm thinking about this a bit more now, any input is welcome :) I hope this helps, feel free to reach out if you have more questions or comments. ~ Johannes

On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote: Hey,

I've encountered a peculiar code generation issue in a parallel-for-reduction. Inside the per-thread reduction loop, I see a store to a stack slot that happens every iteration, clobbering the value written by the previous one. The address of that stack slot is later taken to pass to the inter-thread reduction, but the store is not hoisted outside the loop, while I'd expect it to happen just once outside it. Attached below are the code example that triggers this, and an annotated x86-64 assembly snippet from the outlined OMP function. The code was compiled using clang++-12 Ubuntu focal unstable branch, using the command-line: clang++-12 -fopenmp -O3 main.cpp -o main I'm wondering whether this is some sort of bug, and in which component. Regards, ~Itay // main.cpp #include #include double computedotproduct(sizet n, double *xv, double *yv) { double local = 0.0; #pragma omp parallel for reduction (+:local) for (sizet i = 0; i < n; i++) local += xv[i] * yv[i];_ _return local;_ _}_ _int main(int argc, char **argv)_ _{_ _constexpr sizet n = 0x1000;_ _auto xv = std::makeunique<double[]>(n); auto yv = std::makeunique<double[]>(n); double result = computedotproduct(n, xv.get(), yv.get()); printf("result = %e\n", result); return 0; } // Disassembly excerpt from objdump -d --no-show-raw-insn main, function <...>ompoutlined<...> 4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head_ _4012f5: mulsd (%rsi,%rcx,8),%xmm1_ _4012fa: addsd %xmm1,%xmm0_ _4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration_ _401303: add $0x1,%rcx_ _401307: add $0xffffffffffffffff,%rax_ _40130b: jne 4012f0 <.ompoutlined.+0xc0> ; <--- Non-unrolled loop latch_ _40130d: cmp $0x3,%rbp_ _401311: jb 40138b <.ompoutlined.+0x15b> ; <--- x4-unrolled loop guard_ _401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader_ _401316: lea (%rsi,%rcx,8),%rsi_ _40131a: add $0x18,%rsi_ _40131e: lea (%rdx,%rcx,8),%rcx_ _401322: add $0x18,%rcx_ _401326: mov $0xffffffffffffffff,%rdx_ _40132d: nopl (%rax)_ _401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head_ _401336: mulsd -0x10(%rsi,%rdx,8),%xmm1_ _40133c: addsd %xmm0,%xmm1_ _401340: movsd %xmm1,(%rsp) ; <--- Weird store #1_ _401345: movsd -0x8(%rcx,%rdx,8),%xmm0_ _40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0_ _401351: addsd %xmm1,%xmm0_ _401355: movsd %xmm0,(%rsp) ; <--- Weird store #2_ _40135a: movsd (%rcx,%rdx,8),%xmm1_ _40135f: mulsd (%rsi,%rdx,8),%xmm1_ _401364: addsd %xmm0,%xmm1_ _401368: movsd %xmm1,(%rsp) ; <--- Weird store #3_ _40136d: movsd 0x8(%rcx,%rdx,8),%xmm0_ _401373: mulsd 0x8(%rsi,%rdx,8),%xmm0_ _401379: addsd %xmm1,%xmm0_ _40137d: movsd %xmm0,(%rsp) ; <--- Weird store #4_ _401382: add $0x4,%rdx_ _401386: cmp %rdx,%rdi_ _401389: jne 401330 <.ompoutlined.+0x100> ; <--- x4-unrolled loop latch_ _40138b: mov $0x402028,%edi_ _401390: mov %r14d,%esi_ _401393: callq 401040 <_kmpcforstaticfini at plt> 401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to reduction logic 40139b: mov %rax,0x20(%rsp)

cfe-dev mailing list cfe-dev at lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20201209/2729e147/attachment.html>

Previous message: [cfe-dev] [OpenMP] Redundant store inside reduction loop body
Next message: [cfe-dev] [OpenMP] Redundant store inside reduction loop body
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the cfe-dev mailing list