[LLVMdev] Modifications to SLP (original) (raw)

Michael Zolotukhin mzolotukhin at apple.com
Tue Jul 7 14:40:42 PDT 2015


On Jul 7, 2015, at 1:30 PM, Sanjay Patel <spatel at rotateright.com> wrote:

I forgot to update the SLP thread from last week. I have a patch up for review that would allow creating wider vectors as requested, but may increase SLP compile time: http://reviews.llvm.org/D10950 <http://reviews.llvm.org/D10950> An audit of the trunk backends shows that only PowerPC + QPX and x86 + AVX / AVX512 would potentially get an extra round of store merging from the use of 'getRegisterBitWidth()'. As reported in the phab comments, I didn't see any compile time hit on test-suite for an AVX machine. I'm very curious to know if that patch causes further blowup in this example. Thanks for the heads up, Sanjay! I think that effect from your patch wouldn’t be that big even in this case, because I don’t think that vectorizeStoreChain is a bottleneck here (disclaimer: that’s just my plain speculations). But it’s always good to check:)

Michael

Frank, what causes a 10^6 instruction function to be generated? Can this be rolled into a loop?

On Tue, Jul 7, 2015 at 12:55 PM, Michael Zolotukhin <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote: Hi Frank, The most time consuming part of SLP vectorizer (especially in cases like yours) is finding sets of consecutive stores. It's currently performed by a quadratic search (see routine SLPVectorizer::vectorizeStores) - we do pairwise comparisons between all pointers (but we do limit ourselves to look at at most 16 stores). I think it should be possible to group pointers with a common base, compute constant relative offset and just sort all of them - this way we’ll save a lot of expensive computations. However, I haven’t tried implementing this, and I guess there might be some hard corner cases too. Patches would be welcome here:) Thanks, Michael > On Jul 7, 2015, at 11:31 AM, Frank Winter <fwinter at jlab.org <mailto:fwinter at jlab.org>> wrote: > > Hi all! > > It takes the current SLP vectorizer too long to vectorize my scalar code. I am talking here about functions that have a single, huge basic block with O(10^6) instructions. Here's an example: > > %0 = getelementptr float* %arg1, i32 49 > %1 = load float* %0 > %2 = getelementptr float* %arg1, i32 4145 > %3 = load float* %2 > %4 = getelementptr float* %arg2, i32 49 > %5 = load float* %4 > %6 = getelementptr float* %arg2, i32 4145 > %7 = load float* %6 > %8 = fmul float %7, %1 > %9 = fmul float %5, %3 > %10 = fadd float %9, %8 > %11 = fmul float %7, %3 > %12 = fmul float %5, %1 > %13 = fsub float %12, %11 > %14 = getelementptr float* %arg3, i32 16 > %15 = load float* %14 > %16 = getelementptr float* %arg3, i32 4112 > %17 = load float* %16 > %18 = getelementptr float* %arg4, i32 0 > %19 = load float* %18 > %20 = getelementptr float* %arg4, i32 4096 > %21 = load float* %20 > %22 = fmul float %21, %15 > %23 = fmul float %19, %17 > %24 = fadd float %23, %22 > %25 = fmul float %21, %17 > %26 = fmul float %19, %15 > %27 = fsub float %26, %25 > %28 = fadd float %24, %10 > %29 = fadd float %27, %13 > %30 = getelementptr float* %arg0, i32 0 > store float %29, float* %30 > %31 = getelementptr float* %arg0, i32 4096 > store float %28, float* %31 > ... and so on ... > > The SLP vectorizer would create some code like this: > > %219 = insertelement <4 x float> %218, float %185, i32 2 > %220 = insertelement <4 x float> %219, float %197, i32 3 > %221 = fmul <4 x float> %216, %220 > %222 = fadd <4 x float> %221, %212 > %223 = fmul <4 x float> %207, %220 > .. > %234 = bitcast float* %165 to <4 x float>* > store <4 x float> %233, <4 x float>* %234, align 4 > > > With the current SLP implementation 99.5% of the time is spent in the SLP vectorizer and I have the impression that this can be improved for my case. I believe that the SLP vectorizer has far more capabilities than I would need for these simple (but huge) functions. And I was hoping that any of you have an idea how to remove functionality of the SLP vectorizer such that it still can vectorize those simple functions...? > > Thanks, > Frank > _> ________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>


LLVM Developers mailing list LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>

-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150707/8644ca80/attachment.html>



More information about the llvm-dev mailing list