omp.target: accessing allocated memory raises errors
January 2, 2024, 10:30am 1
The MLIR and LLVM toolchain was built at commit [AMDGPU][True16] Don't use the VGPR_LO/HI16 register classes. (#76440) · llvm/llvm-project@8c6172b · GitHub. I use mlir-translate --mlir-to-llvmir | clang++ -c -x ir -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1103
to generate a static library from the following module:
module attributes {llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.target = #omp.target<target_cpu = "gfx1103", target_features = "">} {
  llvm.func @llvm_omp_target_alloc_device(i64, i32) -> !llvm.ptr
  llvm.func @omp_get_default_device() -> i32
  llvm.func @_QQmain_omp_outline_1() attributes {omp.declare_target = #omp.declaretarget<device_type = (nohost)>} {
    %0 = llvm.mlir.zero : !llvm.ptr
    %1 = llvm.call @omp_get_default_device() : () -> i32
    %2 = llvm.getelementptr %0[67108864] : (!llvm.ptr) -> !llvm.ptr, f64
    %3 = llvm.ptrtoint %2 : !llvm.ptr to i64
    %4 = llvm.call @llvm_omp_target_alloc_device(%3, %1) : (i64, i32) -> !llvm.ptr
    %5 = llvm.call @llvm_omp_target_alloc_device(%3, %1) : (i64, i32) -> !llvm.ptr
    %6 = omp.map_info var_ptr(%4 : !llvm.ptr, f64) map_clauses(tofrom) capture(ByCopy) -> !llvm.ptr
    %7 = omp.map_info var_ptr(%5 : !llvm.ptr, f64) map_clauses(tofrom) capture(ByCopy) -> !llvm.ptr
    omp.target map_entries(%6 -> %arg0, %7 -> %arg1 : !llvm.ptr, !llvm.ptr) {
    ^bb0(%arg0: !llvm.ptr, %arg1: !llvm.ptr):
      %8 = llvm.mlir.constant(0 : index) : i64
      %9 = llvm.mlir.constant(1 : index) : i64
      %10 = llvm.mlir.constant(8192 : index) : i64
      omp.teams {
        omp.parallel {
          omp.wsloop for (%arg2, %arg3, %arg4, %arg5) : i64 = (%8, %8, %8, %8) to (%10, %10, %10, %10) step (%9, %9, %9, %9) {
            %11 = llvm.mul %arg2, %10 : i64
            %12 = llvm.add %11, %arg3 : i64
            %13 = llvm.load %arg0 {alignment = 8 : i64} : !llvm.ptr -> vector<16xf64>
            %14 = llvm.load %arg1 {alignment = 8 : i64} : !llvm.ptr -> vector<16xf64>
            %15 = llvm.mul %arg3, %10 : i64
            %16 = llvm.add %15, %arg2 : i64
            %17 = llvm.getelementptr %arg1[%16] : (!llvm.ptr, i64) -> !llvm.ptr, f64
            %18 = llvm.load %17 {alignment = 8 : i64} : !llvm.ptr -> vector<16xf64>
            %19 = llvm.fmul %13, %14 : vector<16xf64>
            %20 = llvm.fdiv %14, %18 : vector<16xf64>
            %21 = llvm.fadd %19, %20 : vector<16xf64>
            %22 = llvm.getelementptr %arg1[%12] : (!llvm.ptr, i64) -> !llvm.ptr, f64
            llvm.store %21, %22 {alignment = 8 : i64} : vector<16xf64>, !llvm.ptr
            omp.terminator
          }
          omp.terminator
        }
        omp.terminator
      }
      omp.terminator
    }
    omp.barrier
    llvm.return
  }
  llvm.func @_mlir_ciface__QQmain_omp_outline_1() attributes {llvm.emit_c_interface} {
    llvm.call @_QQmain_omp_outline_1() : () -> ()
    llvm.return
  }
}
I use a C program to call the function, built with the clang args -fopenmp-targets=amdgcn-amd-amdhsa -fopenmp -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1103.
It runs fine on the CPU host, but on the AMDGPU it raises errors, even after replacing llvm_omp_target_alloc_device
with llvm.alloc
and using a smaller size:
Libomptarget error: Host ptr 0x0000560ed81500a1 does not have a matching target pointer.
Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
A C program with a similar function works normally on the AMDGPU:
int main() {
  double* a = (double*)llvm_omp_target_alloc_device(4096*4096*sizeof(double), omp_get_default_device());
  double* b = (double*)llvm_omp_target_alloc_device(4096*4096*sizeof(double), omp_get_default_device());
  #pragma omp target teams map(tofrom: a, b)
  #pragma omp parallel for
  for (int i = 0; i < 4096; i++) {
    for (int j = 0; j < 4096; j++) {
      for (int k = 0; k < 4096; k++) {
        a[i*4096+j] = i * j;
        b[j*4096+k] = j / i;
      }
    }
  }
}
shiltian January 2, 2024, 11:13pm 2
Thanks for the report. Could you please open an issue on GitHub? This forum is better suited for discussion.
I would look at the (mapping-related) MLIR of a small Flang OpenMP offloading snippet.
The mapping code seems different, and we can track it on GitHub if this is an LLVM issue.
That said, I think the code itself is somewhat curious: llvm_omp_target_alloc_device returns device memory. Using the function is totally fine, but you need to keep in mind that it is device memory.
#pragma omp target teams map(tofrom: a, b)
Usually, a tofrom would go with an array range, e.g., tofrom: a[:100], to allocate and copy 100 elements from the host to the device. Without the array range it is likely treated as a scalar, for which a copy is created.
In combination with the alloc_device, I was expecting is_device_ptr(a, b) to tell the compiler that those are device pointers and should be used as is (effectively copy/firstprivate semantics).
With OpenMP 5.1/5.2 semantics (which is what the runtime implements) you can also just do #pragma omp target teams and the pointers will be mapped firstprivate, IIRC.
Hi @EllisLambda, thank you for your report. Could you also attach the source code and the compilation command used to generate the MLIR file?
Thanks for your guidance. I have tried using @malloc/@omp_alloc and added bounds to the omp.map in MLIR, but I get the same issue. The is_device_ptr clause for the omp dialect still seems to be on the TODO list.
skatrak January 4, 2024, 10:34am 8
I can see one thing that may be contributing to this code not working, though I don’t know whether that could be the source of the specific error you’re experiencing. It might also be that I don’t quite understand your use case.
Is it your intention to compile this snippet for the GPU to run as the OpenMP host device? Generally, I would expect it to act as the target device instead, so two compilations (host and device) would be necessary. For that, you would need two MLIR files with about the same contents, but a module attribute omp.is_target_device = true on the one with llvm.target_triple = "amdgcn-amd-amdhsa", and omp.is_target_device = false together with llvm.target_triple = "x86/arm/..." on the other. That way you could link your C program with the host version of your MLIR program, which sets up memory maps and launches the GPU kernel.
However, that might still not work, because AFAIK omp.teams lowering to LLVM IR for the device is not working upstream yet.
Thanks for the great suggestion! That’s my use case. I will try it.
agozillon January 4, 2024, 12:55pm 10
Another thing: I am not entirely sure the MLIR → LLVM-IR map lowering code handles this case very well at the moment (and the runtime error makes it seem like that is the case), but I can't say for certain, as I've not tried to do something like this yet, and I've not been able to emit the host LLVM-IR for the C example to contrast the LLVM-IR emitted from MLIR with Clang's.
There's a chance you may need an OpenMP dialect BoundsOp to specify the range of data you wish mapped per map argument, but I am not entirely sure that's necessary for this case, as I am rather unfamiliar with llvm_omp_target_alloc_device and its usage, unfortunately. Perhaps the OpenMP runtime only needs the pointer.
Thanks for the advice. I have already tried adding bounds, but it didn't help; I'm still unable to access the memory.