[llvm-dev] Generalizing load/store promotion in LICM (original) (raw)

Chandler Carruth via llvm-dev llvm-dev at lists.llvm.org
Thu Sep 13 01:51:29 PDT 2018


Haven't had time to dig into this, but wanted to add +Alina Sbirlea <asbirlea at google.com> to the thread as she has been working on promotion and other aspects of LICM for a long time here.

On Wed, Sep 12, 2018 at 11:41 PM Philip Reames <listmail at philipreames.com> wrote:

I'm thinking about making some semi radical changes to load store promotion works in LICM, and figured it would be a good idea to get buy in before I actually started writing code. :)

TLDR: legality of sinking stores to exits is hard, can we separate load handling into a super aggressive form of PRE, and use predicated stores to avoid solving legality question?

Background We've been seeing an interesting class of problems recently that looks roughly like this: _for (int = 0; i < N; i++)_ _if (a[i] == 0) // some data dependent check_ _gcount++; // conditional load and store to shared location_ _The critical detail in this example is that gcount is a global location_ _which may be accessed concurrently* by multiple threads. The challenge_ _here is that we don't know whether the store ever executes, and we're not_ _allowed to insert a store along any path that didn't originally contain_ _them. Said differently, if all elements in "a" are non-zero, we're not_ _allowed to store to gcount. We do know that the gcount location is_ _dereferenceable though._ _(*Please, let's avoid the memory model derailment here. I'm simplifying_ _and C++ language rules aren't real useful for my Java language frontend_ _anyways. In practice, all the access are atomic, but unordered, but we'll_ _leave that out of discussion otherwise.)_ _I have two approaches I'm considering here. These are orthogonal, but I_ _suspect we'll want to implement both._ _Proposal A - Use predicated stores in loop exits_ _The basic idea is that we don't worry about solving the legality question_ _above, and just insert a store which is predicated on a condition which is_ _true exactly when the original store ran. In pseudo code, this looks_ _something like:_ _bool StoreLegal = false;_ _int LocalCount = gcount;_ _for (int = 0; i < N; i++)_ _if (a[i] == 0) {_ _LocalCount++;_ _StoreLegal = true;_ _}_ _if (StoreLegal) gcount = LocalCount;_ _There are two obvious concerns here:_ _1. The predicated store might be expensive in practice - true for most_ _current architectures._ _2. We''re introducing an extra boolean phi cycle around the loop._ _Talking to a couple of folks offline at the socials over the last few_ _months, the former seems to be the main objection. I think we can control_ _this by restricting this transform to when the loop is believed to have a_ _high trip count and the conditional path is taken with some frequency._ _Getting feedback on this particular point is one of my main reasons for_ _writing this mail._ _The second objection can frequently be resolved by finding other_ _expressions which evaluate to the same boolean. (In this case, if_ _LocalCount != LocalCountOrig assuming i doesn't overflow.) We already have_ _a framework with SCEV to do these replacements. Though, from some quick_ _testing, it would definitely need strengthening. However, SCEV can't_ _remove the extra phi in all cases, so we have to be okay with the extra phi_ _cycle in the general case. This seems worthwhile to me, but thoughts?_ _Proposal B - Separate load and store handling into distinct transforms_ _(For folks I've discussed this with before, this part is all new.)_ _Thinking about the problem, it finally occurred to me that we can_ _decompose the original example into two steps: getting the loads out of the_ _loop, and sinking the stores out of the loop. If we can accomplish the_ _former, but not the later, we've still made meaningful progress._ _So, what'd we'd essentially have is a load only transformation which_ _produces this:_ _int LocalCount = gcount;_ _for (int = 0; i < N; i++)_ _if (a[i] == 0) {_ _LocalCount++;_ _gcount = LocalCount;_ _}_ _At this point, we've reduced memory traffic by half, and opened up the_ _possibility that other parts of the optimizer can exploit the simplified_ _form. The last point is particular interesting since we generally try to_ _canonicalize loads out of loops, and much of the optimizer is tuned for a_ _form with as much as possible being loop invariant. As one example,_ _completely by accident, there's some work going on in the LoopVectorizer_ _right now to handle such stores to loop invariant addresses during_ _vectorization. Putting the two pieces together would let us fully_ _vectorize this loop without needing to sink the stores at all._ _In practice, the existing implementation in LICM would cleanly split along_ _these lines with little problem._ _One way of looking at this load specific transform is as an extreme form_ _of PRE (partial redundancy elimination). Our current PRE implementation_ _doesn't handle cases this complicated._ _Thoughts?_ _Philip_ -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180913/111e4291/attachment.html>



More information about the llvm-dev mailing list