omp.target: accessing allocated memory raises errors
January 2, 2024, 10:30am 1
The MLIR and LLVM toolchain was built at commit [AMDGPU][True16] Don't use the VGPR_LO/HI16 register classes. (#76440) · llvm/llvm-project@8c6172b · GitHub. I use mlir-translate --mlir-to-llvmir | clang++ -c -x ir -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1103
to generate a static library from the following module:
module attributes {llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.target = #omp.target<target_cpu = "gfx1103", target_features = "">} {
  llvm.func @llvm_omp_target_alloc_device(i64, i32) -> !llvm.ptr
  llvm.func @omp_get_default_device() -> i32
  llvm.func @_QQmain_omp_outline_1() attributes {omp.declare_target = #omp.declaretarget<device_type = (nohost)>} {
    %0 = llvm.mlir.zero : !llvm.ptr
    %1 = llvm.call @omp_get_default_device() : () -> i32
    %2 = llvm.getelementptr %0[67108864] : (!llvm.ptr) -> !llvm.ptr, f64
    %3 = llvm.ptrtoint %2 : !llvm.ptr to i64
    %4 = llvm.call @llvm_omp_target_alloc_device(%3, %1) : (i64, i32) -> !llvm.ptr
    %5 = llvm.call @llvm_omp_target_alloc_device(%3, %1) : (i64, i32) -> !llvm.ptr
    %6 = omp.map_info var_ptr(%4 : !llvm.ptr, f64) map_clauses(tofrom) capture(ByCopy) -> !llvm.ptr
    %7 = omp.map_info var_ptr(%5 : !llvm.ptr, f64) map_clauses(tofrom) capture(ByCopy) -> !llvm.ptr
    omp.target map_entries(%6 -> %arg0, %7 -> %arg1 : !llvm.ptr, !llvm.ptr) {
    ^bb0(%arg0: !llvm.ptr, %arg1: !llvm.ptr):
      %8 = llvm.mlir.constant(0 : index) : i64
      %9 = llvm.mlir.constant(1 : index) : i64
      %10 = llvm.mlir.constant(8192 : index) : i64
      omp.teams {
        omp.parallel {
          omp.wsloop for (%arg2, %arg3, %arg4, %arg5) : i64 = (%8, %8, %8, %8) to (%10, %10, %10, %10) step (%9, %9, %9, %9) {
            %11 = llvm.mul %arg2, %10 : i64
            %12 = llvm.add %11, %arg3 : i64
            %13 = llvm.load %arg0 {alignment = 8 : i64} : !llvm.ptr -> vector<16xf64>
            %14 = llvm.load %arg1 {alignment = 8 : i64} : !llvm.ptr -> vector<16xf64>
            %15 = llvm.mul %arg3, %10 : i64
            %16 = llvm.add %15, %arg2 : i64
            %17 = llvm.getelementptr %arg1[%16] : (!llvm.ptr, i64) -> !llvm.ptr, f64
            %18 = llvm.load %17 {alignment = 8 : i64} : !llvm.ptr -> vector<16xf64>
            %19 = llvm.fmul %13, %14 : vector<16xf64>
            %20 = llvm.fdiv %14, %18 : vector<16xf64>
            %21 = llvm.fadd %19, %20 : vector<16xf64>
            %22 = llvm.getelementptr %arg1[%12] : (!llvm.ptr, i64) -> !llvm.ptr, f64
            llvm.store %21, %22 {alignment = 8 : i64} : vector<16xf64>, !llvm.ptr
            omp.terminator
          }
          omp.terminator
        }
        omp.terminator
      }
      omp.terminator
    }
    omp.barrier
    llvm.return
  }
  llvm.func @_mlir_ciface__QQmain_omp_outline_1() attributes {llvm.emit_c_interface} {
    llvm.call @_QQmain_omp_outline_1() : () -> ()
    llvm.return
  }
}
I use a C program to call the function, built with the clang args -fopenmp-targets=amdgcn-amd-amdhsa -fopenmp -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx1103.
It runs fine on the CPU host, but on the AMDGPU it raises errors, even after replacing llvm_omp_target_alloc_device
with llvm.alloc
and using a smaller size:
Libomptarget error: Host ptr 0x0000560ed81500a1 does not have a matching target pointer.
Libomptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
A C program with a similar function works normally on the AMDGPU:
int main() {
  double* a = (double*)llvm_omp_target_alloc_device(4096*4096*sizeof(double), omp_get_default_device());
  double* b = (double*)llvm_omp_target_alloc_device(4096*4096*sizeof(double), omp_get_default_device());
  #pragma omp target teams map(tofrom: a, b)
  #pragma omp parallel for
  for (int i = 0; i < 4096; i++) {
    for (int j = 0; j < 4096; j++) {
      for (int k = 0; k < 4096; k++) {
        a[i*4096+j] = i * j;
        b[j*4096+k] = j / i;
      }
    }
  }
}
shiltian January 2, 2024, 11:13pm 2
Thanks for the report. Could you please open an issue on GitHub? This forum is better suited for discussion.
I would look at the (mapping-related) MLIR of a small Flang OpenMP offloading snippet.
The mapping code seems different, and we can track it on GitHub if this is an LLVM issue.
That said, I think the code itself is somewhat curious: llvm_omp_target_alloc_device returns device memory. Using the function is totally fine, but you need to keep in mind that it is device memory.
#pragma omp target teams map(tofrom: a, b)
Usually, a tofrom would go with an array range, e.g., tofrom: a[:100], to allocate and copy 100 elements from the host to the device. Without the array range it is likely treated as a scalar, for which a copy is created.
In combination with the alloc_device, I was expecting is_device_ptr(a, b) to tell the compiler that those are device pointers and should be used as is (effectively copy/firstprivate semantics).
With OpenMP 5.1/5.2 semantics (which is what the runtime implements) you can also just do #pragma omp target teams and the pointers will be mapped firstprivate, IIRC.
Hi @EllisLambda, thank you for your report. Could you also attach the source code and the compilation command used to generate the MLIR file?
Thanks for your guidance. I have tried using @malloc/@omp_alloc and added bounds to the omp.map in MLIR, but I get the same issue. The is_device_ptr clause for the omp dialect still seems to be on the TODO list.
skatrak January 4, 2024, 10:34am 8
I can see one thing that may be contributing to this code not working, though I don’t know whether that could be the source of the specific error you’re experiencing. It might also be that I don’t quite understand your use case.
Is it your intention to compile this snippet for the GPU to run as the OpenMP host device? Generally, I would expect it to act as the target device instead, so two compilations (host and device) would be necessary. For that, you would need two MLIR files with about the same contents, but a module attribute omp.is_target_device = true on the one with llvm.target_triple = "amdgcn-amd-amdhsa", and omp.is_target_device = false together with llvm.target_triple = "x86/arm/..." on the other. That way you could link your C program with the host version of your MLIR program, which sets up memory maps and launches the GPU kernel.
However, that might still not work, because AFAIK omp.teams lowering to LLVM IR for the device is not working upstream yet.
Thanks for the great suggestion! That’s my use case. I will try it.
agozillon January 4, 2024, 12:55pm 10
Another thing: I am not entirely sure the MLIR → LLVM-IR map lowering code handles this case very well at the moment (and the runtime error makes it seem like that is the case), but I can't say for certain, as I've not tried to do something like this yet, and I've not been able to emit the host LLVM-IR for the C example to contrast the LLVM-IR emitted from MLIR with Clang's.
There's a chance you may need an OpenMP dialect BoundsOp to specify the range of data you wish mapped per map argument, but I am not entirely sure that's necessary for this case, as I am rather unfamiliar with llvm_omp_target_alloc_device and its usage, unfortunately. Perhaps the OpenMP runtime only needs the pointer.
Thanks for the advice. I have already tried adding bounds, but it didn't help; I'm still unable to access the memory.