Showcasing LLVM/Offload
While LLVM/Offload is not yet a thing (I hope to change that next week), I wanted to showcase to the wider community what we’ll be able to do once it is:
TL;DR: This shows how to run a CUDA kernel on an AMD GPU and, if no suitable GPU is found, simply fall back to the host CPU.
Let’s do this with the most basic example.
(Note that I didn’t add CUDA API support yet, though we could; see Breaking the Vendor Lock - Performance Portable Programming Through OpenMP as Target Independent Runtime Layer (Conference) | OSTI.GOV.)
The main function will allocate some “shared” memory (accessible by the host and device) and then write 7 to it.
The kernel is then launched and writes 42 into that memory.
We will verify both values and use launch debug information to check where execution happened.
/p/v/d/s/llvm-project ❯❯❯ cat test.cu
#include <cstddef> // for size_t
#include <cstdio>  // for printf

extern "C" {
// Shared-memory allocation entry points provided by the offload runtime.
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}

// The kernel simply overwrites the value with 42.
__global__ void square(int *A) { *A = 42; }

int main(int argc, char **argv) {
  int DevNo = 0;
  // Allocate 4 bytes of "shared" memory, accessible by host and device.
  int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  square<<<1, 1>>>(Ptr);
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  llvm_omp_target_free_shared(Ptr, DevNo);
}
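Since we compile via -fopenmp-targets (see below), the CUDA launch is effectively mapped onto the OpenMP offload runtime. For intuition, here is a rough plain-OpenMP equivalent of the same program, a sketch of mine rather than part of the branch; it assumes the llvm_omp_target_* entry points are linkable from a plain -fopenmp compile:

#include <cstddef>
#include <cstdio>

extern "C" {
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}

int main() {
  int DevNo = 0;
  int *Ptr = static_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  // is_device_ptr: Ptr is already device-accessible, no mapping needed.
#pragma omp target device(DevNo) is_device_ptr(Ptr)
  { *Ptr = 42; }
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  llvm_omp_target_free_shared(Ptr, DevNo);
}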
Compiling this is a little more cumbersome than it should be. The problem is that --offload-arch
only works with NVIDIA and AMD GPUs right now, and --offload=
is not properly hooked up at all. We use -fopenmp-targets
instead, but the other two options will work in the future. (You can already replace all the OpenMP options with --offload-arch=gfx90a,
but that won’t work for X86. -D__STRICT_ANSI__ is some X86 fluke that I did not investigate further right now; it is triggered by the auto-inclusion of gpu-libc, which I build too.)
/p/v/d/s/llvm-project ❯❯❯ clang++ test.cu -O3 -o test -foffload-via-llvm \
-fopenmp-targets=x86_64,amdgcn-amd-amdhsa -D__STRICT_ANSI__ \
-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a
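For reference, the GPU-only variant mentioned in the parenthetical above should reduce to the following (untested as written here; it drops the X86 fallback image, and I would expect -D__STRICT_ANSI__ to be unnecessary since that only affects the X86 path):

/p/v/d/s/llvm-project ❯❯❯ clang++ test.cu -O3 -o test -foffload-via-llvm \
                                  --offload-arch=gfx90a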
Let’s inspect the binary we just created to verify we got two offloading images.
Whenever you read “openmp” assume this will later say “llvm_offload” or similar.
/p/v/d/s/llvm-project ❯❯❯ llvm-objdump --offloading test
test: file format elf64-x86-64
OFFLOADING IMAGE [0]:
kind            elf
arch            gfx90a
triple          amdgcn-amd-amdhsa
producer        openmp

OFFLOADING IMAGE [1]:
kind            elf
arch
triple          x86_64
producer        openmp
Next, we run it with all GPUs visible.
/p/v/d/s/llvm-project ❯❯❯ LIBOMPTARGET_INFO=16 ./test
Ptr 0x155448ac8000, *Ptr 7
"PluginInterface" device 0 info: Launching kernel _Z6squarePi with 1 blocks and 1 threads in SPMD mode
AMDGPU device 0 info: #Args: 1 Teams x Thrds: 1x 1 (MaxFlatWorkGroupSize: 1024) LDS Usage: 16711931B #SGPRs/VGPRs: 10/2 #SGPR/VGPR Spills: 0/0 Tripcount: 0
Ptr 0x155448ac8000, *Ptr 42
The launch information shows we actually went to the GPU (I also confirmed it with GPU tools).
Now we hide the GPUs and run it again:
/p/v/d/s/llvm-project ❯❯❯ ROCR_VISIBLE_DEVICES= LIBOMPTARGET_INFO=16 ./test
Ptr 0x555555655bf0, *Ptr 7
"PluginInterface" device 0 info: Launching kernel _Z6squarePi with 1 blocks and 1 threads in Generic mode
Ptr 0x555555655bf0, *Ptr 42
This time we see a different launch message (Generic instead of SPMD mode): since no GPU devices were found, the X86 plugin executes the code on the host.
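If you would rather check device availability programmatically than read the launch logs, the OpenMP runtime API underneath should answer that too; a small sketch, assuming omp.h and the offload runtime are available, which -fopenmp-targets implies:

#include <cstdio>
#include <omp.h>

int main() {
  // 0 visible offload devices means work will fall back to the host device.
  printf("Visible offload devices: %i\n", omp_get_num_devices());
}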
The branch on which this was done has 26 files changed, 421 insertions(+), 46 deletions(-),
but not everything is necessary.
I also just “made it work” in places, so cleanups are certainly needed.
We plan to upstream this and the CUDA API implementation once LLVM/Offload has been created.
We also have ongoing work on wrappers for high-level things like Thrust. Feel free to reach out.
~ J
Code: