[RFC] Changing the loop pipeliner prologue/epilogue generation (original) (raw)

Context

The loop pipeliner in scf currently allows a user to provide a schedule for a scf::for loop and will automatically generate the pipelined loop by emitting the prologue + new loop + epilogue.
This greatly simplify writing a software pipeline transformation as it allows separating finding the schedule from transforming the loop.

Current implementation

The current interface allows user to decide if the epilogue should be peeled out or not. When the epilogue is not peeled out the pipeliner will mask out operations so that not all the operations are executed in the last iterations of the loop. The masking is done by a lambda that needs to be provided by users since MLIR doesn’t have a standard way to mask.

Problem

We cannot control how much we want to peel the epilogue and we cannot control whether we want to peel the prologue. It can be very useful to control those parameters with finer grain control. For instance we want to peel the epilogue only for 1 iteration if there are very few ops in the last stage.
The pipeliner logic is more or less duplicated in the prologue and epilogue emission make it more likely to have bugs.

Proposal

Instead of emitting prologue then kernel then epilogue, we could emit the pipelined loop directly then peel out loop iterations for the prologue and epilogue. Of course we still want to filter out dead operations in the prologue and epilogue so to make it easy we add a new transient piplening.mask operation that controls when this op should or shouldn’t be executed. Then when we peel the prologue/epilogue we can just “fold” the mask operation that either becomes dead or becomes the original operation.

ex:

    scf.for %arg2 = %c0 to %c8 step %c1 {
      %0 = memref.load %arg0[%arg2]  : memref<?xf32>
      %1 = arith.addf %0, %cst : f32
      memref.store %1, %arg1[%arg2] : memref<?xf32>
    }

currently generates directly:

    %0 = memref.load %arg0[%c0] : memref<?xf32>
    %1 = memref.load %arg0[%c1] : memref<?xf32>
    %2 = arith.addf %0, %cst : f32
    %3:2 = scf.for %arg2 = %c0 to %c6 step %c1 iter_args(%arg3 = %1, %arg4 = %2) -> (f32, f32) {
      %5 = arith.addi %arg2, %c2 : index
      %6 = memref.load %arg0[%5] : memref<?xf32>
      %7 = arith.addf %arg3, %cst : f32
      memref.store %arg4, %arg1[%arg2] : memref<?xf32>
      scf.yield %6, %7 : f32, f32
    }
    %4 = arith.addf %3#0, %cst : f32
    memref.store %3#1, %arg1[%c6] : memref<?xf32>
    memref.store %4, %arg1[%c7] : memref<?xf32>

instead the first step would be to generate:

    %ini0 = ub.poison : f32
    %ini1 = ub.poison : f32
    %3:2 = scf.for %arg2 = %c-2 to %c8 step %c1 iter_args(%arg3 = %ini0, %arg4 = %ini1) -> (f32, f32) {
      %5 = arith.addi %arg2, %c2 : index
      %p0 = pipeline.predicate 0, %arg2, %c0, %c2, %c1
      %p1 = pipeline.predicate 1, %arg2, %c0, %c2, %c1
      %p2 = pipeline.predicate 2, %arg2, %c0, %c2, %c1
      %m6 = pipeline.predicate.mask %p0 { %6 = memref.load %arg0[%5] : memref<?xf32> }
      %m7 = pipeline.predicate.mask %p1 { %7 = arith.addf %arg3, %cst : f32 }
      pipeline.predicate.mask %p2 { memref.store %arg4, %arg1[%arg2] : memref<?xf32> }
      scf.yield %m6, %m7 : f32, f32
    }

at this stage we can either lower pipeline.predicate and pipeline.predicate.mask directly and we would get the loop with both epilogue and prologue not peeled or we can apply loop peeling then fold the pipeline ops. For instance if we peel one iteration at the end:

    %ini0 = ub.poison : f32
    %ini1 = ub.poison : f32
    %3:2 = scf.for %arg2 = %c-2 to %c7 step %c1 iter_args(%arg3 = %ini0, %arg4 = %ini1) -> (f32, f32) {
      %5 = arith.addi %arg2, %c2 : index
      %p0 = pipeline.predicate 0, %arg2, %c0, %c2, %c1
      %p1 = pipeline.predicate 1, %arg2, %c0, %c2, %c1
      %p2 = pipeline.predicate 2, %arg2, %c0, %c2, %c1
      %m6 = pipeline.predicate.mask %p0 { %6 = memref.load %arg0[%5] : memref<?xf32> }
      %m7 = pipeline.predicate.mask %p1 { %7 = arith.addf %arg3, %cst : f32 }
      pipeline.predicate.mask %p2 { memref.store %arg4, %arg1[%arg2] : memref<?xf32> }
      scf.yield %m6, %m7 : f32, f32
    }
    %5 = arith.addi %c7, %c2 : index
    %p0 = %false
    %p1 = %false
    %p2 = %true
    %m6 = pipeline.predicate.mask %p0 { %6 = memref.load %arg0[%5] : memref<?xf32> } // dead
    %m7 = pipeline.predicate.mask %p1 { %7 = arith.addf %3#0, %cst : f32 } // dead
    // pipeline.predicate.mask %p2 { memref.store %3#1, %arg1[%c7] : memref<?xf32> }
    memref.store %3#1, %arg1[%c7] : memref<?xf32>

This requires adding two transient ops. Those would only be used within the pipelining transformation.
One question I have is whether we have a precedent for transient ops and if adding those into SCF dialect would make sense?

implementation

This change wouldn’t require breaking the current interface we could just add extra parameters to control how much we want to peel the prologue/epilogue. We can then change the kernel lowering and implement a peeling helper. There is no probably no peeling helpers that can be used out of the box so I we might need a specific one but this is very little code.

I’m not sure if there are many users of the loop pipeliner downstream, it was originally written for IREE but not sure if it is still used there. It is used heavily in Triton right now. Having fine grain control on the peeling has become quite important but I wonder if other users have found other missing features.

Also I’m not sure when I’ll get to implementing this, if anybody is interested in contributing I’m very open to collaboration here. If there is low interest in the community I may start this downstream first.

cc: @pawel.szczerbuk @Mogball @mehdi_amini @antiagainst