[LLVMdev] X86 FMA4
Demikhovsky, Elena elena.demikhovsky at intel.com
Sat Jul 28 23:57:49 PDT 2012
Our specialists at Intel say that “vmovaps” and “vmovsd” have the same throughput and latency, but “vmovsd” reduces the chance of 4K aliasing, so it is preferable.
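To illustrate the 4K-aliasing point, here is a minimal sketch in C. It assumes a simplified model in which a load is flagged as a possible false alias of an earlier store whenever their byte ranges overlap once only the low 12 address bits are kept. The function name `may_4k_alias` and the addresses in `demo_4k_alias` are hypothetical and chosen only for illustration; real hardware behavior is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (an assumption, not a hardware specification):
 * two accesses "may 4K-alias" when their byte ranges overlap after
 * reducing both start addresses to their offset within a 4 KiB page.
 * Sizes are assumed to be far smaller than 4096 bytes. */
bool may_4k_alias(uintptr_t store_addr, unsigned store_size,
                  uintptr_t load_addr, unsigned load_size)
{
    unsigned s = (unsigned)(store_addr & 0xFFFu); /* store page offset */
    unsigned l = (unsigned)(load_addr & 0xFFFu);  /* load page offset  */

    /* Circular distances modulo 4096 between the two start offsets. */
    unsigned load_after_store = (l - s) & 0xFFFu;
    unsigned store_after_load = (s - l) & 0xFFFu;

    /* Overlap if either range starts inside the other. */
    return load_after_store < store_size || store_after_load < load_size;
}

/* Hypothetical addresses: a spill at page offset 0x888 and a later
 * 8-byte load at page offset 0x890 in a different page. */
void demo_4k_alias(void)
{
    /* A 16-byte store covers offsets 0x888..0x897 and overlaps the load. */
    assert(may_4k_alias(0x1888, 16, 0x5890, 8));
    /* An 8-byte store covers only 0x888..0x88F and does not. */
    assert(!may_4k_alias(0x1888, 8, 0x5890, 8));
}
```

Under this model, a narrower store covers fewer page offsets, so fewer later loads can collide with it, which is consistent with the preference for the 8-byte “vmovsd” store.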
- Elena

From: llvmdev-bounces at cs.uiuc.edu On Behalf Of Cameron McInally
Sent: Thursday, July 26, 2012 17:50
To: Jan Sjodin
Cc: dag at cray.com; llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] X86 FMA4

Hey Jan and Dave,
It's not obvious, but there is a significant scalar performance issue when following the GCC intrinsics.
Let's look at the VFMADDSD pattern. We're operating on scalars, with undefined values in the remaining vector elements of the operands. This sounds okay, but when one looks closer...
vmovsd fp4_+1088(%rip), %xmm3 # fpppp.f:647
vmovaps %xmm3, 18560(%rsp) # fpppp.f:647 <= 16-byte spill
vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647
The spill here is 16 bytes, but we're only using the low 8 bytes of xmm3. Changing the intrinsics and patterns to accept scalar operands, we end up with...
vmovsd fp4_+1056(%rip), %xmm0 # fpppp.f:666
vmovsd %xmm0, 10088(%rsp) # fpppp.f:666 <= 8-byte spill
vfmaddsd %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666
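The difference between the two forms can be sketched in C. This is only a model of the data layout, under stated assumptions: `vec2d` is a hypothetical stand-in for a 128-bit XMM value, the function names are invented for illustration, and `a * b + c` stands in for the fused operation without modeling its single rounding. Zeroing the unused upper lane is a modeling choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a 128-bit XMM value holding two doubles. */
typedef struct { double lane[2]; } vec2d;

static_assert(sizeof(vec2d) == 2 * sizeof(double),
              "model assumes two packed doubles");

/* Vector-typed form, in the style of the GCC intrinsic: the operands
 * are whole vectors even though only lane 0 takes part in the math. */
vec2d fmadd_sd_vector(vec2d a, vec2d b, vec2d c)
{
    vec2d r;
    r.lane[0] = a.lane[0] * b.lane[0] + c.lane[0];
    r.lane[1] = 0.0; /* unused upper lane; zeroed as a modeling choice */
    return r;
}

/* Scalar-typed form: the type itself says only 8 bytes are live. */
double fmadd_sd_scalar(double a, double b, double c)
{
    return a * b + c;
}

/* Bytes that would have to be saved to preserve one operand. */
size_t spill_bytes_vector(void) { return sizeof(vec2d); }
size_t spill_bytes_scalar(void) { return sizeof(double); }
```

With the vector-typed operand, anything that saves the value has to preserve all 16 bytes, while the scalar-typed operand needs only 8. That matches the 16-byte and 8-byte spills in the two listings above.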
I do not know the actual number of cycles offhand, but I believe that on Interlagos and Sandy Bridge, a vmovaps takes roughly 3x as many micro-ops as a vmovsd if it involves memory.
-Cameron
On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:
Because the intrinsics use vector types (same as gcc).
- Jan
----- Original Message -----
> From: "dag at cray.com" <dag at cray.com>
> To: llvmdev at cs.uiuc.edu
> Cc:
> Sent: Wednesday, July 25, 2012 3:26 PM
> Subject: [LLVMdev] X86 FMA4
>
> We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
>
> Why is VFMADDSD4 defined with vector types? Is this simply because the
> gcc intrinsic uses vector types? It's quite unnatural if you have a
> compiler that generates FMAs as opposed to requiring user intrinsics.