Some thoughts about ValueBoundsConstraintSet

January 14, 2025, 6:13am 1

Hi @matthias-springer

I am impressed by the ValueBoundsConstraintSet tools, cool thing :slight_smile:

I used the tool to infer the maximum size of dynamic allocations in my IR and came up with some ideas / findings.
In the following IR:

func.func private @test_solve_max_alloc_with_subview(%arg0: memref<10x10xi8, 1>, %arg1: memref<10x10xi8, 1>) { 
  %false = arith.constant false
  %c0_i32 = arith.constant 0 : i32
  %c0_i8 = arith.constant 0 : i8
  %c1 = arith.constant 1 : index
  %c100 = arith.constant 100 : index
  %c0 = arith.constant 0 : index
  %alloc = memref.alloc() {alignment = 64 : i64} : memref<10x10xi1, 1>
  linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%arg0 : memref<10x10xi8, 1>) outs(%alloc : memref<10x10xi1, 1>) { 
  ^bb0(%in: i8, %out: i1):
    %1 = arith.cmpi ne, %in, %c0_i8 : i8
    linalg.yield %1 : i1
  } 
  %collapse_shape = memref.collapse_shape %arg1 [[0, 1]] : memref<10x10xi8, 1> into memref<100xi8, 1>
  %collapse_shape_0 = memref.collapse_shape %alloc [[0, 1]] : memref<10x10xi1, 1> into memref<100xi1, 1>
  %alloc_1 = memref.alloc() {alignment = 64 : i64} : memref<100xi8, 1>
  %0 = scf.for %arg2 = %c0 to %c100 step %c1 iter_args(%arg3 = %c0) -> (index) { 
    %1 = memref.load %collapse_shape_0[%arg2] : memref<100xi1, 1>
    %2 = arith.cmpi ne, %1, %false : i1
    %3 = scf.if %2 -> (index) { 
      %4 = memref.load %collapse_shape[%arg2] : memref<100xi8, 1>
      memref.store %4, %alloc_1[%arg3] : memref<100xi8, 1>
      %5 = arith.addi %arg3, %c1 : index
      scf.yield %5 : index
    } else { 
      scf.yield %arg3 : index
    } 
    scf.yield %3 : index
  } 
  memref.dealloc %alloc : memref<10x10xi1, 1>
  %subview = memref.subview %alloc_1[0] [%0] [1] : memref<100xi8, 1> to memref<?xi8, strided<[1]>, 1>
  %alloc_2 = memref.alloc(%0) {alignment = 64 : i64} : memref<?xi32, 1>
  linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%subview : memref<?xi8, strided<[1]>, 1>) outs(%alloc_2 : memref<?xi32, 1>) { 
  ^bb0(%in: i8, %out: i32):
    %1 = arith.extui %in : i8 to i32
    linalg.yield %1 : i32
  } 
  memref.dealloc %alloc_1 : memref<100xi8, 1>
  memref.dealloc %alloc_2 : memref<?xi32, 1>
  return
}

The maximum size of the allocation:

%alloc_2 = memref.alloc(%0) {alignment = 64 : i64} : memref<?xi32, 1>

can be inferred to be 100 in the following ways:

  1. From the result of the scf.for. Unfortunately this is not supported, because the iter_args change between iterations and the updates are data dependent. If we unrolled the loops, or supported iter_args per stage (where {value, stage} is the value at a specific iteration), we might get a big table of constraints (its size depends on the trip count of the loop), but we could solve it. What do you think about that? Is there any other way to handle simple flows like the one above?
  2. From the size limit imposed by the subview:
%subview = memref.subview %alloc_1[0] [%0] [1] : memref<100xi8, 1> to memref<?xi8, strided<[1]>, 1>

This works, but since I called ValueBoundsConstraintSet::computeConstantBound on %0, the algorithm only searches through defining operations, and the subview above is actually a user of %0. When I called ValueBoundsConstraintSet::computeConstantBound on the subview result instead, I managed to get an upper bound. My idea is to also check the users of the given value/dim, since that may lead to more constraints, as in this case. What do you think?
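
For reference, the two queries look roughly like this (a minimal sketch, assuming the Variable-based overload of computeConstantBound from recent MLIR; scfForResult and subviewResult are placeholder Values standing for %0 and %subview):

#include "mlir/Interfaces/ValueBoundsOpInterface.h"

using namespace mlir;

// scfForResult stands for %0, subviewResult for %subview (placeholders).
FailureOr<int64_t> queryAllocBound(Value scfForResult, Value subviewResult) {
  // Querying %0 directly only walks its reverse use-def chain, so the
  // memref.subview user is never visited and no bound is found.
  FailureOr<int64_t> directUB = ValueBoundsConstraintSet::computeConstantBound(
      presburger::BoundType::UB, scfForResult,
      /*stopCondition=*/nullptr, /*closedUB=*/true);
  if (succeeded(directUB))
    return directUB;

  // Querying dim 0 of the subview result succeeds instead: the subview's
  // interface implementation relates its dynamic size to %0 and to the
  // static source size, which yields the inclusive upper bound 100.
  return ValueBoundsConstraintSet::computeConstantBound(
      presburger::BoundType::UB, {subviewResult, /*dim=*/0},
      /*stopCondition=*/nullptr, /*closedUB=*/true);
}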

Another odd thing I saw and wasn’t sure about regarding ValueBoundsConstraintSet::computeConstantBound: there is a parameter called closedUB. When passing false I get 101, and when passing true I get 100. Shouldn’t it be the other way around?

Unrolling can work if the loop is small. I think we have an unrolling pass in MLIR. You can try running that pass in your pipeline before the pass that computes value bounds. If the loop is small enough to be unrolled for analysis purposes, it may also be small enough to be unrolled in the generated code without a performance penalty.

I wouldn’t do the unrolling in the ValueBoundsConstraintSet infrastructure itself, because it will make compilation time quite unpredictable.
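
For example, assuming the loop is an scf.for with a known constant trip count, a pre-processing pass could use the SCF unrolling utility (a sketch; the factor of 100 is the trip count from the example above):

#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/SCF/Utils/Utils.h"

using namespace mlir;

// Fully unroll a small constant-trip-count loop so that a later
// value-bounds query sees straight-line IR instead of iter_args.
void unrollForAnalysis(scf::ForOp forOp) {
  // lb = 0, ub = 100, step = 1 gives a trip count of 100; unrolling by
  // that factor should eliminate the loop entirely.
  if (failed(loopUnrollByFactor(forOp, /*unrollFactor=*/100)))
    forOp.emitWarning("could not unroll loop for bounds analysis");
}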

This is problematic because %subview could be guarded by an scf.if, and then you’re analyzing IR that may be dead. In fact, there could even be two different memref.subview ops with contradicting information in two separate scf.if branches.

Maybe we can somehow write a few heuristics for frequent cases like the one where an iter_arg is increased by a constant value in each iteration.

E.g.:

%0 = scf.for %arg2 = %c0 to %c100 step %c1 iter_args(%arg3 = %c0) -> (index) {
    %2 = "foo"() : () -> (i1)
    %3 = scf.if %2 -> (index) {
        %5 = arith.addi %arg3, %c1 : index
        scf.yield %5 : index
    } else {
        scf.yield %arg3 : index
    }
    scf.yield %3 : index
}

I think we can already prove that %3 - %arg3 <= 1. Therefore, %0 <= lb + (ub - lb) / step * 1 = 0 + ((100 - 0) / 1) * 1 = 100, i.e. %0 <= 100.
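
A rough sketch of what such a heuristic could add to the constraint set, assuming constant loop bounds and an already-proven per-iteration increment bound maxDelta (all names here are placeholders, not the upstream implementation):

#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Interfaces/ValueBoundsOpInterface.h"
#include "llvm/Support/MathExtras.h"

using namespace mlir;

// Hypothetical helper: bound an scf.for result by
//   result <= init + tripCount * maxDelta,
// given constant non-negative lb/ub/step and a proven per-iteration
// increment bound maxDelta (1 in the example above).
void addIterArgUpperBound(ValueBoundsConstraintSet &cstr, scf::ForOp forOp,
                          int64_t resultIdx, int64_t lbConst, int64_t ubConst,
                          int64_t stepConst, int64_t maxDelta) {
  int64_t tripCount = llvm::divideCeil(ubConst - lbConst, stepConst);
  Value result = forOp.getResult(resultIdx);
  // For the example: 0 + 100 * 1 = 100.
  cstr.bound(result) <=
      cstr.getExpr(forOp.getInitArgs()[resultIdx]) + tripCount * maxDelta;
}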

“Closed bound” means “inclusive”; closedUB = false means exclusive. An exclusive upper bound of 101 (%0 < 101) and an inclusive upper bound of 100 (%0 <= 100) describe the same range, so this looks correct to me.

Another question on the topic of ValueBoundsConstraintSet: I haven’t used this tool yet, but my understanding is that if we want to calculate bounds for several different dynamic values, we need to run the analysis for each one separately. Am I getting this right?
If that is the case, then calculating bounds for many dynamic values in a large IR could take a long time. Do you think it would be possible to extend this tool to get bounds for multiple values at once, or to cache computations already made for previous bound calculations?

Do you think it would be possible to extend this tool to get bounds for multiple values at once, or to cache computations already made for previous bound calculations?

Yes, that’s possible. We should turn this into an MLIR Analysis (to be used with getAnalysis). The API for the entry point (to query a specific bound) would change slightly. I don’t have time to implement this myself, but I can help with reviews.

Could we then rename it to inclusive? :slight_smile:

Clearly documenting that by default bounds are left-inclusive and right-exclusive, [x, y) (or, as we write it “more correctly” :slight_smile: in the French system, [x, y[ ), would also be good.

Then inclusive is always [x, y] in both the US and FR notations.

aviadc January 16, 2025, 1:41pm 6

Thanks for the suggestion!
I made a patch that solves my issue, could you please review:

Please note the slight change: I am using %init instead of %lb.
We can infer more constraints similarly by applying the trip-count idea to the induction variable (I didn’t add that for now).

Would it make sense to you to add constraints from users, as long as they are in the same block as the value whose bounds we want to find?
I have a use case where there are many ops in the same block, and a lot of these ops have only operands and no results (a downstream dialect), but they can define constraints between their operands. Without being able to collect constraints from users, the bounds might not be computable in this case.
I noticed that there are many asserts in the code of ops that implement ValueBoundsOpInterface checking that the value passed is always a result, so I guess those would need to be changed if my suggestion were accepted.

Can you show an example? Are you talking about ops like memref.assume_alignment, which until recently did not have a result value, but still carries information that could potentially be included as a constraint?

I wasn’t referring to memref.assume_alignment; the ops I’m talking about are downstream ops that mostly don’t have results but do hold a lot of information that could be used for building constraints.
I’ll use an example that is similar to the cases I encounter, but written with upstream ops only:

func.func @example(%arg0: memref<16x32x64xf32>, %arg1: memref<?x64xf32>) {
  linalg.reduce ins(%arg0 : memref<16x32x64xf32>)
                outs(%arg1 : memref<?x64xf32>)
                dimensions = [1]
    (%in: f32, %init: f32) {
      %0 = arith.addf %in, %init : f32
      linalg.yield %0 : f32
    }
  return
}

In this example the first dim of %arg1 is not known statically, but from the semantics of linalg.reduce we know that it must be equal to the first dim of %arg0, which is known.
So, if we allowed adding constraints from users in the same block and implemented ValueBoundsOpInterface for linalg.reduce in a way that expresses the constraint between its two arguments, then we could use the ValueBoundsConstraintSet utility to find the bound for the first dim of %arg1.
This is a bit of a simplified example, but I think it captures the essence of the issue.
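
For illustration, an interface implementation along these lines could express that constraint (a hypothetical sketch: value here would be the outs operand rather than a result, which the current infrastructure asserts against, so this assumes that restriction were lifted):

#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Interfaces/ValueBoundsOpInterface.h"

using namespace mlir;

// Hypothetical external model for linalg.reduce; not part of upstream MLIR.
struct ReduceOpValueBounds
    : public ValueBoundsOpInterface::ExternalModel<ReduceOpValueBounds,
                                                   linalg::ReduceOp> {
  void populateBoundsForShapedValueDim(Operation *op, Value value, int64_t dim,
                                       ValueBoundsConstraintSet &cstr) const {
    auto reduceOp = cast<linalg::ReduceOp>(op);
    // Map output dim `dim` back to the matching input dim by re-inserting
    // the reduced dimensions (assumed sorted ascending). In the example,
    // output dim 0 maps to input dim 0 because only dim 1 is reduced.
    int64_t srcDim = dim;
    for (int64_t redDim : reduceOp.getDimensions())
      if (redDim <= srcDim)
        ++srcDim;
    // dim(outs, dim) == dim(ins, srcDim), i.e. ? == 16 in the example.
    cstr.bound(value)[dim] == cstr.getExpr(reduceOp.getInputs()[0], srcDim);
  }
};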

That is dangerous, because you don’t know whether the code is actually executed. I think it would require branching (copying) the constraint set along control-flow edges. Any information that you “attach” to an SSA value only holds within the same basic block.

Example:

scf.if ... {
  %10 = tensor.empty(%c5) : tensor<?xf32>
  linalg.reduce ins(%arg0 : tensor<?xf32>) outs(%10 : tensor<?xf32>)
} else {
  %20 = tensor.empty(%c6) : tensor<?xf32>
  linalg.reduce ins(%arg0 : tensor<?xf32>) outs(%20 : tensor<?xf32>)
}

We could do this, but it needs a redesign of the traversal logic. In the case of your memref example, we wouldn’t even look at the linalg.reduce because we only follow the reverse use-def chain. Redesigning the traversal may actually overlap with another long-standing TODO: turning the ValueBoundsConstraintSet infrastructure into a proper MLIR analysis. At the moment, two subsequent value-bounds queries do not share any information.

I would like to get to a state where different value-bounds queries share information. But beyond the constraints that would be collected by traversing the def-use chain, I would like to inject additional constraints that arise from various hints in the IR (attributes with additional constraint information, or ops that don’t have results but can provide additional constraints).
These additional constraints would not be collected from ops that implement ValueBoundsOpInterface, so they could not be obtained through the way ValueBoundsConstraintSet normally collects constraints; they would need to be provided externally.
If ValueBoundsConstraintSet is turned into an analysis, then it would need to run on the IR as-is without being provided any additional information, since the getAnalysis<MyOperationAnalysis>() interface does not support additional parameters.
Could you advise how, in that case, I could augment the analysis with the additional constraints that should be added?

getAnalysis<AnalysisT> is just an API to cache analyses between pass boundaries. That API does not support arguments, as you said. But you don’t have to use the API. You can construct the analysis directly. (It won’t be cached, but you can query the same analysis object multiple times within a pass without having to traverse/analyze the IR multiple times.)

If I understand correctly, AnalysisT is just a C++ class. It could have two constructors: one with extra arguments (for you to inject information) and one without (so that it can be used with getAnalysis).

In the case of external constraints, you probably want to populate these into the constraint set first and then start analyzing the IR. The traversal mechanism would probably have to change, as you now have to analyze the entire function/module/… that is passed to the analysis.
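
A hypothetical shape for such a class (none of these names exist upstream; this is only a sketch of the two-constructor idea):

#include "mlir/Interfaces/ValueBoundsOpInterface.h"

#include <functional>
#include <optional>

using namespace mlir;

// Hypothetical analysis wrapper around ValueBoundsConstraintSet.
class ValueBoundsAnalysis {
public:
  using PopulateExternalFn = std::function<void(ValueBoundsConstraintSet &)>;

  // Constructor used by getAnalysis<ValueBoundsAnalysis>(): no extra args.
  explicit ValueBoundsAnalysis(Operation *root)
      : ValueBoundsAnalysis(root, /*populateExternal=*/nullptr) {}

  // Constructed directly in a pass to inject external constraints.
  ValueBoundsAnalysis(Operation *root, PopulateExternalFn populateExternal)
      : root(root), populateExternal(std::move(populateExternal)) {}

  // Queries share one constraint set, so constraints collected for earlier
  // queries (and the injected external ones) are reused.
  FailureOr<int64_t>
  getConstantBound(presburger::BoundType type, Value value,
                   std::optional<int64_t> dim = std::nullopt);

private:
  Operation *root;
  PopulateExternalFn populateExternal;
};

A pass could then call getAnalysis<ValueBoundsAnalysis>() for the cached, argument-free variant, or construct the analysis object directly with a populate callback when external constraints need to be injected.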