[llvm-dev] [RFC] Adding range metadata to array subscripts. (original) (raw)

Clement Courbet via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 24 02:14:59 PDT 2021


Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM can do better alias analysis. I have a proposal as D99248 <https://reviews.llvm.org/D99248> (clang part) and D99247 <https://reviews.llvm.org/D99247> (llvm part). But there are other possible options that I'm detailing below.

Consider the following code, adapted from brotli <https://en.wikipedia.org/wiki/Brotli>:


struct Histogram {

  int values[256];

  int total;

};

Histogram DoIt(const int* image, int size) {

  Histogram histogram;

  for (int i = 0; i < size; ++i) {

    ++histogram.values[image[i]];  // (A)

    ++histogram.total;             // (B)

  }

  return histogram;

}

In this code, the compiler does not know anything about the values of images[i], so it assumes that 256 is a possible value for it. In that case, (A) would change the value of histogram.total, so (B) has to load, add one and store [godbolt <https://godbolt.org/z/KxE343>].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use values to form a pointer to total and dereference it. What valid C/C++ code is allowed to do with values is:

Note that the LLVM memory model is much laxer than that of C/C++. It has no notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep reference] representing indexing into the [42 x i32] array to load the i32 member at index 2. It is also valid for %i to be 43 (though not 44 if an inbound GEP is used). So clang has to give LLVM more information about the C/C++ rules.

IR representation: LLVM has several ways of representing ranges of values:

Alternatives: *(1) *Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri =  call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
%ri
...
!0 = !{i64 0, i64 42}

*(2) *(variant of 1) relax the constraint that !range metadata can only be set on call and load instructions, and set the !range metadata on the index expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
%i
...
!0 = !{i64 0, i64 42}

This is slightly more compact.

(3) Same as (1), with llvm.assume. This feels inferior to annotations. (4) Extend inrange to non-constantexprs GEPs. It is unclear how this will interfere with optimizations.

On the clang side: The clang part is quite trivial as the infrastructure is already in place to emit dynamic ubsan guards: D99248 <https://reviews.llvm.org/D99248>

On the LLVM Side: Alternatives: (A) - (annotation or assume options) Simply enable scev-aa which knows how to handle value ranges in general. IIUC it's not enabled in clang because it has issues with invalidation when code changes, and is therefore not cacheable. This makes it too slow to be practical. *(B) *- (annotation or assume options) Teach BasicAA to honor !range metadata (D99247 <https://reviews.llvm.org/D99247>) (C) - (inrange option) Teach BasicAA to honor inrange attributes of GEP.

I was leaning towards (1) and (B) because:

Opinions ? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210324/9c67983c/attachment-0001.html>



More information about the llvm-dev mailing list