Affine-Parallelize not parallelizing some loops
February 21, 2024, 11:43am 1
Hello, we tried to tile and parallelize a linalg.matmul to let it run on a GPU.
While lowering, we encountered a loop that doesn't get parallelized by the affine-parallelize pass.
We created the following minimal example, in which the loop over 2048 also doesn't get parallelized by mlir-opt --affine-parallelize (same loop nest as a tiled matmul, but with a simplified body):
#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 4)>
#map2 = affine_map<(d0) -> (d0 + 32, 1000)>
#map3 = affine_map<(d0) -> (d0 + 32)>
func.func @main() {
  %alloc = memref.alloc() : memref<16x2048xf32>
  %alloc_0 = memref.alloc() : memref<2048x1000xf32>
  %alloc_1 = memref.alloc() : memref<16x1000xf32>
  %cst = arith.constant 1.000000e+00 : f32
  affine.for %arg0 = 0 to 16 step 4 {
    affine.for %arg1 = 0 to 2048 step 32 {
      affine.for %arg2 = 0 to 1000 step 32 {
        affine.for %arg3 = #map(%arg0) to #map1(%arg0) {
          affine.for %arg4 = #map(%arg2) to min #map2(%arg2) {
            affine.for %arg5 = #map(%arg1) to #map3(%arg1) {
              affine.store %cst, %alloc_1[%arg3, %arg4] : memref<16x1000xf32>
            }
          }
        }
      }
    }
  }
  return
}
It results in the following IR:
#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> (d0 + 32)>
module {
  func.func @main() {
    %alloc = memref.alloc() : memref<16x2048xf32>
    %alloc_0 = memref.alloc() : memref<2048x1000xf32>
    %alloc_1 = memref.alloc() : memref<16x1000xf32>
    %cst = arith.constant 1.000000e+00 : f32
    affine.parallel (%arg0) = (0) to (16) step (4) {
      affine.for %arg1 = 0 to 2048 step 32 {
        affine.parallel (%arg2) = (0) to (1000) step (32) {
          affine.parallel (%arg3) = (%arg0) to (%arg0 + 4) {
            affine.parallel (%arg4) = (%arg2) to (min(%arg2 + 32, 1000)) {
              affine.for %arg5 = #map(%arg1) to #map1(%arg1) {
                affine.store %cst, %alloc_1[%arg3, %arg4] : memref<16x1000xf32>
              }
            }
          }
        }
      }
    }
    return
  }
}
We don't understand why this particular loop isn't parallelized by the pass while the surrounding ones are.
For GEMM, we ideally want to map these loops to blocks and threads for GPU execution. Does anybody have an idea how we could achieve that?
But %arg1 and %arg5 aren’t parallel loops. Their iterations are writing to the same memory location!
IIRC, a linalg.matmul op can also be represented by a linalg.generic. It has iterator_types = ["parallel", "parallel", "reduction"], which means only two of the loops can be mapped to affine.parallel. You can use scf.forall instead, and the transform dialect provides ops to map scf.forall to blocks and threads (see the sketch after this list):
- transform.gpu.map_forall_to_blocks (transform::MapForallToBlocks)
- transform.gpu.map_nested_forall_to_threads (transform::MapNestedForallToThreads)
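A rough, untested sketch of such a transform script is below. Every spelling here (tile_using_forall, map_forall_to_blocks, the named-sequence wrapper, the generate_gpu_launch attribute) is an assumption that shifts between MLIR releases, so check it against your build:
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %root
      : (!transform.any_op) -> !transform.any_op
    // Tile the two parallel dims (M, N) into an scf.forall carrying a GPU
    // block mapping; the K dim is a reduction and stays sequential inside.
    %tiled:2 = transform.structured.tile_using_forall %matmul tile_sizes [4, 32]
        (mapping = [#gpu.block<y>, #gpu.block<x>])
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    %func = transform.structured.match ops{["func.func"]} in %root
      : (!transform.any_op) -> !transform.any_op
    // Wrap the block-mapped forall in a gpu.launch. A second tiling step with
    // a #gpu.thread mapping, followed by
    // transform.gpu.map_nested_forall_to_threads, would distribute the
    // intra-block work over threads.
    %launch = transform.gpu.map_forall_to_blocks %func {generate_gpu_launch}
      : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}
Tiling only the M and N dimensions is deliberate: it keeps the K reduction as an ordinary loop inside each block.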
jungpark February 21, 2024, 1:23pm 4
Yeah, as @bondhugula mentioned, the affine.for loops for %arg1 and %arg5 don't need to be parallelized in the IR above.
My assumption, since you call it GEMM, is that 2048 is the K dimension of a 16x2048x1000 GEMM and you're trying a 32x32 tile.
In that case, iterating over the 2048 axis is a reduction, which you may want to keep sequential unless you're planning to use a special instruction such as atomic_add.
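To make that concrete, the untiled matmul lowered with --convert-linalg-to-affine-loops looks roughly like the sketch below (%A, %B and %C stand in for the actual allocs). Every iteration of the %k loop reads and writes the same element of C, which is exactly the loop-carried dependence that keeps it out of affine.parallel:
affine.for %i = 0 to 16 {
  affine.for %j = 0 to 1000 {
    affine.for %k = 0 to 2048 {
      %a = affine.load %A[%i, %k] : memref<16x2048xf32>
      %b = affine.load %B[%k, %j] : memref<2048x1000xf32>
      %c = affine.load %C[%i, %j] : memref<16x1000xf32>
      %p = arith.mulf %a, %b : f32
      %s = arith.addf %c, %p : f32
      // every %k iteration stores to the same C[%i, %j]: a reduction
      affine.store %s, %C[%i, %j] : memref<16x1000xf32>
    }
  }
}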
Yeah, of course, thank you, we totally overlooked that.
We tried to tile a matrix-matrix multiplication and get block and thread mappings in the pipeline, ideally just by using passes (not the transform dialect - at least not until we are more familiar with MLIR). With
mlir-opt \
--convert-linalg-to-affine-loops \
--affine-loop-tile="tile-sizes=4,32,32" \
--affine-loop-unroll="unroll-factor=4" \
--canonicalize \
--affine-loop-invariant-code-motion \
--affine-loop-normalize \
--affine-parallelize \
--lower-affine \
--canonicalize \
--gpu-map-parallel-loops \
--convert-parallel-loops-to-gpu \
matmul.mlir
we just get one non-unit block dimension and one non-unit thread dimension, and we're still experimenting…
...
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%c0 = arith.constant 0 : index
...
%c1_5 = arith.constant 1 : index
%0 = affine.apply #map(%c4)[%c0, %c1]
%1 = affine.apply #map(%c32)[%c0, %c1]
gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %0, %arg7 = %c1_5, %arg8 = %c1_5) threads(%arg3, %arg4, %arg5) in (%arg9 = %1, %arg10 = %c1_5, %arg11 = %c1_5) {
  %2 = affine.apply #map1(%arg0)[%c1, %c0]
  %3 = arith.muli %2, %c4 : index
  %4 = affine.apply #map1(%arg3)[%c1, %c0]
  %5 = arith.muli %4, %c32 : index
  scf.for %arg12 = %c0 to %c64 step %c1 {
    %6 = arith.muli %arg12, %c32 : index
    scf.parallel (%arg13) = (%c0) to (%c4) step (%c1) {
      %7 = arith.addi %3, %arg13 : index
      %8 = arith.muli %4, %c-32 : index
      %9 = arith.addi %8, %c1000 : index
      %10 = arith.cmpi sgt, %9, %c32 : index
      %11 = arith.select %10, %c32, %9 : index
      scf.parallel (%arg14) = (%c0) to (%11) step (%c1) {
        %12 = arith.addi %5, %arg14 : index
        scf.for %arg15 = %c0 to %c8 step %c1 {
          ...
It's a bit more complicated than what we were looking for, but we will try to solve it with the transform dialect. Thank you for the tip!
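For reference, a transform script like the sketch earlier in the thread is driven by the transform interpreter rather than by a normal pass pipeline. Assuming a recent MLIR build (older releases only shipped the --test-transform-dialect-interpreter test pass) and the named sequence embedded in the same module as the payload (or loaded as a transform library), the invocation would look roughly like:
mlir-opt matmul.mlir \
--transform-interpreter \
--canonicalize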