Showcasing LLVM/Offload

While LLVM/Offload is not yet a thing (I hope to change that next week), I wanted to showcase to the wider community what we’ll be able to do once it is:

TL;DR: This shows how to run a CUDA kernel on an AMD GPU and, if no suitable GPU is found, simply fall back to the host CPU.

Let’s do this with the most basic example.
(Note that I didn’t add CUDA API support, which we could: Breaking the Vendor Lock - Performance Portable Programming Through OpenMP as Target Independent Runtime Layer (Conference) | OSTI.GOV)
The main function will allocate some “shared” memory (accessible by the host and device) and then write 7 to it.
The kernel is launched and it will write 42 into the memory.
We will verify both values and use launch debug information to check where execution happened.

/p/v/d/s/llvm-project ❯❯❯ cat test.cu

#include <cstddef>
#include <cstdio>

// Allocation helpers exposed by the offload runtime; "shared" memory is
// accessible from both the host and the device.
extern "C" {
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}

// The kernel unconditionally writes 42 into the shared allocation.
__global__ void square(int *A) { *A = 42; }

int main(int argc, char **argv) {
  int DevNo = 0;
  // Allocate 4 bytes of host/device-shared memory and initialize it on the host.
  int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  // Launch one block with one thread; the second printf verifies the write.
  square<<<1, 1>>>(Ptr);
  printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
  llvm_omp_target_free_shared(Ptr, DevNo);
}
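
For reference, this is roughly what the same program looks like against the plain CUDA API, which is what the llvm_omp_target_* calls stand in for until CUDA API support lands. This version is stock CUDA and has nothing to do with this branch; cudaMallocManaged is the closest analogue to the "shared" allocation above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(int *A) { *A = 42; }

int main() {
  int *Ptr;
  // Managed memory is accessible from both the host and the device.
  cudaMallocManaged(&Ptr, 4);
  *Ptr = 7;
  printf("Ptr %p, *Ptr %i\n", (void *)Ptr, *Ptr);
  square<<<1, 1>>>(Ptr);
  // Plain CUDA launches are asynchronous, so synchronize before reading.
  cudaDeviceSynchronize();
  printf("Ptr %p, *Ptr %i\n", (void *)Ptr, *Ptr);
  cudaFree(Ptr);
}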

Compiling this is a little more cumbersome than it should be. The problem is that --offload-arch only works with NVIDIA and AMD GPUs right now, and --offload= is not properly hooked up at all. We use -fopenmp-targets for now, but the other two options will work in the future. (You can already replace all the OpenMP options with --offload-arch=gfx90a, but that won't work for X86. __STRICT_ANSI__ is some X86 fluke that I did not investigate further right now; it's triggered by the auto-inclusion of gpu-libc, which I build too.)

/p/v/d/s/llvm-project ❯❯❯ clang++ test.cu -O3 -o test -foffload-via-llvm \
  -fopenmp-targets=x86_64,amdgcn-amd-amdhsa -D__STRICT_ANSI__ \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a
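
Based on the note above, a GPU-only build should already work with the shorter spelling, something like the following (dropping the X86 target and, presumably, the -D__STRICT_ANSI__ workaround along with it):

/p/v/d/s/llvm-project ❯❯❯ clang++ test.cu -O3 -o test -foffload-via-llvm \
  --offload-arch=gfx90a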

Let’s inspect the binary we just created to verify we got two offloading images.
Whenever you read “openmp” assume this will later say “llvm_offload” or similar.

/p/v/d/s/llvm-project ❯❯❯ llvm-objdump --offloading test

test:        file format elf64-x86-64

OFFLOADING IMAGE [0]:
kind            elf
arch            gfx90a
triple          amdgcn-amd-amdhsa
producer        openmp

OFFLOADING IMAGE [1]:
kind            elf
arch
triple          x86_64
producer        openmp

Next, we run it with all GPUs visible.

/p/v/d/s/llvm-project ❯❯❯ LIBOMPTARGET_INFO=16 ./test
Ptr 0x155448ac8000, *Ptr 7
"PluginInterface" device 0 info: Launching kernel _Z6squarePi with 1 blocks and 1 threads in SPMD mode
AMDGPU device 0 info: #Args: 1 Teams x Thrds:    1x   1 (MaxFlatWorkGroupSize: 1024) LDS Usage: 16711931B #SGPRs/VGPRs: 10/2 #SGPR/VGPR Spills: 0/0 Tripcount: 0
Ptr 0x155448ac8000, *Ptr 42

The launch information shows we actually went to the GPU (I also confirmed it with GPU tools).
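
As an aside, the mangled kernel name from the log is easy to decode:

/p/v/d/s/llvm-project ❯❯❯ llvm-cxxfilt _Z6squarePi
square(int*)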

Now we hide the GPUs and run it again:

/p/v/d/s/llvm-project ❯❯❯ ROCR_VISIBLE_DEVICES= LIBOMPTARGET_INFO=16 ./test
Ptr 0x555555655bf0, *Ptr 7
"PluginInterface" device 0 info: Launching kernel _Z6squarePi with 1 blocks and 1 threads in Generic mode
Ptr 0x555555655bf0, *Ptr 42

This time we see a different launch message: the X86 plugin executes the code since no GPU devices were found.
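
If you'd rather check the fallback programmatically than read the log, the usual OpenMP device query should work here too, since it is the same runtime underneath. A minimal sketch, assuming omp_get_num_devices reflects the visible GPUs; it needs to be built with OpenMP offloading enabled (e.g. -fopenmp plus the same -fopenmp-targets setup as above):

#include <cstdio>
#include <omp.h>

int main() {
  // 0 means no offload device was found and the host fallback will run.
  printf("visible offload devices: %d\n", omp_get_num_devices());
}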

The branch on which this was done has
26 files changed, 421 insertions(+), 46 deletions(-)
but not everything is necessary.

I also just "made it work" in places, so cleanups are certainly needed.


We plan to upstream this and the CUDA API implementation once LLVM/Offload has been created.
We also have ongoing work on wrappers for high-level libraries like Thrust. Feel free to reach out.
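
To give an idea of the surface such wrappers have to cover, here is a stock Thrust snippet; nothing in it comes from this branch, it just illustrates the kind of code we would like to handle the same way as test.cu above:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

int main() {
  // Four device-side elements, all initialized to 7.
  thrust::device_vector<int> V(4, 7);
  // Negate each element on the device.
  thrust::transform(V.begin(), V.end(), V.begin(), thrust::negate<int>());
  return V[0] == -7 ? 0 : 1;
}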

~ J


Code: