[Affine] Is it reasonable to extend Affine to support non-index arithmetic?

I would like to get some guidance on using the Affine dialect, analyses, and transforms in the context of the Flang Fortran compiler.

There have been attempts to convert Flang’s FIR to Affine (e.g. [RFC] Add FIR->Affine(optimization)->FIR pass pipeline), which might be a good experimental project. At the same time, I see some limitations in the Affine dialect and in AffineExpr that make me question this direction.

I want to start with just one problem and try to figure out whether it is feasible, and whether it makes sense, to extend the Affine representation to support such a case.

Here is a simple Fortran example with 32-bit arithmetic used to address array a:

subroutine test(a,n,m,k1,k2,k3,k4)
  real :: a(*)
  do i=n,m
     a(i - k1 + k2 - k3 + k4) = 1.0
  end do
end subroutine test

This code may be expressed with the usual MLIR dialects like this:

      scf.for %arg7 = %25 to %27 step %c1 {
        %37 = arith.index_cast %arg7 : index to i32
        %38 = arith.subi %37, %29 overflow<nsw> : i32
        %39 = arith.addi %38, %31 overflow<nsw> : i32
        %40 = arith.subi %39, %33 overflow<nsw> : i32
        %41 = arith.addi %40, %35 overflow<nsw> : i32
        %42 = arith.index_cast %41 : i32 to index
        %43 = arith.subi %42, %c1 : index
        memref.store %cst, %array[%43] : memref<?xf32, strided<[1]>>
      }

One may want to convert this loop to Affine, for example to parallelize it via the AffineParallelize analysis/pass (or, more generally, to apply any other Affine transformation). Currently, the same loop cannot be represented in the Affine dialect, because there is no way to create AffineExprs that represent the i32 array index computations.

One approach might be to promote all arithmetic to index, which gives a clean Affine representation:

        affine.for %arg7 = %25 to %27 {
          affine.store %cst, %array[%arg7 - symbol(%30) + symbol(%33) - symbol(%36) + symbol(%39) - 1] : memref<?xf32, strided<[1]>>
        }

Unfortunately, this means we end up with i64 arithmetic on most targets, and on some of them (e.g. GPUs) i64 arithmetic is slower than i32. I do not know of a good general way to demote i64 back to i32. So, in my opinion, it would be best if these computations could be represented in AffineExprs, so that after the Affine transformations and the expansion of the AffineMaps we get the original i32 arithmetic operations back.
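
As a side note on the promotion itself: the `overflow<nsw>` flags in the original loop assert that the i32 operations do not overflow, and under that assumption evaluating the same expression in a wider type gives the same index. The following is only a small Python model of the arithmetic, not MLIR code; `wrap_i32`, `index_i32`, and `index_wide` are hypothetical helper names made up for this illustration. It also hints at why going the other way is harder: narrowing a wide computation is only safe when the values are known to fit in i32.

```python
def wrap_i32(x: int) -> int:
    """Reduce x to a signed 32-bit value (two's-complement wraparound)."""
    return (x + 2**31) % 2**32 - 2**31


def index_i32(i: int, k1: int, k2: int, k3: int, k4: int) -> int:
    """Model the scf.for body: an i32 chain followed by the index-typed `- 1`."""
    t = wrap_i32(i)  # index -> i32 cast
    for sign, k in ((-1, k1), (1, k2), (-1, k3), (1, k4)):
        t = wrap_i32(t + sign * k)  # each i32 add/sub, modeled with wraparound
    return t - 1  # i32 -> index cast, then the index subtraction


def index_wide(i: int, k1: int, k2: int, k3: int, k4: int) -> int:
    """Model the affine map: the whole expression evaluated without wrapping."""
    return i - k1 + k2 - k3 + k4 - 1


# No i32 overflow anywhere: both forms compute the same index.
small = (10, 3, 4, 5, 6)
# Here i + k2 exceeds the i32 range, so the wrapped chain diverges.
# With nsw this case is excluded by assumption in the original loop.
big = (2**31 - 1, 0, 1, 0, 0)

print(index_i32(*small), index_wide(*small))
print(index_i32(*big), index_wide(*big))
```

In this model the two forms agree exactly when the final i32 value is representable, which is the guarantee `nsw` provides in one direction but which a generic i64-to-i32 demotion would have to prove for itself.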

Does it make sense to add support for SCEV-like SExt and Trunc AffineExprs, or are there other ways to preserve the original bit width of the computations through the Affine transformation pipeline?