Building Device IR and Host IR into an Executable
eytodo February 3, 2025, 8:46am 1
I am trying to apply a transformation to OpenMP device IR and then use the transformed device IR, together with the host IR, to produce an executable.
I use the clang++ --save-temps option to produce device.bc and host.bc as input files. After running opt to transform device.bc, its content is modified. I then attempt to combine the new device.bc with host.bc using the following steps:
$ llc -filetype=obj host.bc                               # produces host.o
$ llc device.bc                                           # produces device.s
$ ptxas -arch=sm_89 --compile-only -o device.o device.s   # produces device.o
Finally, I run:
$ clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 host.o device.o -o test
However, I get the following error:
/usr/bin/ld: /tmp/device-nvptx64-sm_89-893c41.o: Relocations in generic ELF (EM: 190)
(the line above is repeated twelve times)
/usr/bin/ld: /tmp/device-nvptx64-sm_89-893c41.o: error adding symbols: file in wrong format
collect2: error: ld returned 1 exit status
clang: error: linker (via gcc) command failed with exit code 1 (use -v to see invocation)
/home/hhfeng/llvm-project-llvmorg-19.1.6/build/bin/clang-linker-wrapper: error: 'clang' failed
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
Can anyone provide guidance on how to build an executable starting from the IR level? I’m also not sure whether these are the correct steps to follow.
I’ve also tried building a single object file using clang-offload-packager first:
$ clang-offload-packager -o test.o \
--image=file=device.o,arch=sm_89,target=nvptx64 \
--image=file=host.bc,target=x86_64
Then linking:
$ clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 test.o device.o -o test
But I still get the same error message.
jdoerfert 2
Follow the steps taken by clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 test.c -o test, or clang++ -O3 -fopenmp --offload-arch=sm_89 test.c -o test.
So, first run clang++ -O3 -fopenmp --offload-arch=sm_89 test.c -o test -save-temps -### to see the steps.
Then you can intercept the device code and use the rest of the command to package it back together.
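For example, the interception might look roughly like this (a sketch; the temporary file name is illustrative and varies by clang version and target):
# Produce the intermediate files, and print the individual compilation steps:
$ clang++ -O3 -fopenmp --offload-arch=sm_89 test.c -o test -save-temps
$ clang++ -O3 -fopenmp --offload-arch=sm_89 test.c -o test -save-temps -###
# Among the temporaries, locate the device-side bitcode, e.g.
# test-openmp-nvptx64-nvidia-cuda-sm_89.bc (name is illustrative), and transform it in place:
$ opt -passes='<your_pass>' test-openmp-nvptx64-nvidia-cuda-sm_89.bc -o test-openmp-nvptx64-nvidia-cuda-sm_89.bc
# Then re-run the steps printed by -### from that point onward.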
@jhuber6
jhuber6 February 3, 2025, 4:57pm 3
You are confusing fat binaries, device binaries, and offload binaries (I don’t blame you, it’s complicated).
The clang-linker-wrapper works by extracting device code from fat binaries. A fat binary is a host object that contains an OffloadBinary in a special ELF section. The output of clang-offload-packager is an OffloadBinary, i.e., a blob before it’s embedded into the host object to make a fat binary. If you want to make a ‘device-only’ fat binary, you can do something like this:
$ clang-offload-packager -o test.o \
--image=file=device.o,arch=sm_89,target=nvptx64 \
--image=file=host.bc,target=x86_64
$ cat /dev/null | clang -x c - -c -Xclang -fembed-offload-object=test.o -o fatbin.o
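The resulting fatbin.o can then go through a normal offload link step, where the clang-linker-wrapper extracts the embedded device code. A sketch (main.o stands in for your host objects):
$ clang -fopenmp --offload-arch=sm_89 main.o fatbin.o -o test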
If you simply want to pass an extra file to the device compilation step, you can also just do this,
$ clang foo.c -fopenmp --offload-arch=sm_89 -Xoffload-linker device.bc
But this won’t work for cases where you need to modify the generated OpenMP device code. It’s mostly useful for passing GPU implementations of some functions.
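As a minimal sketch of that use case (gpu_impl.c and fast_sqrt are made-up names, and the exact flags may differ across toolchains):
$ cat gpu_impl.c
// Device-side definition of a function the target region calls.
float fast_sqrt(float x) { return __builtin_sqrtf(x); }
$ clang gpu_impl.c -c -emit-llvm --target=nvptx64-nvidia-cuda -march=sm_89 -o gpu_impl.bc
$ clang foo.c -fopenmp --offload-arch=sm_89 -Xoffload-linker gpu_impl.bc -o foo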
eytodo February 6, 2025, 6:15am 4
Hi @jdoerfert and @jhuber6,
Thank you both very much for your advice and knowledge. It has really helped me, as a rookie, to better understand the compilation flow and the usage of tools. I truly appreciate it.
Based on what I’ve studied, here is the overall process I successfully used to build an executable.
Steps to Build an Executable
Assume you have two input IR files: host.bc and device.bc.
1. Apply Transformation to Device IR
$ opt -load-pass-plugin <your_pass_library> -passes='<your_pass>' device.bc -o device.bc
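To sanity-check the transformed module before going further, you can disassemble it with llvm-dis (part of the standard LLVM tools):
$ llvm-dis device.bc -o device.ll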
2. Compile device.bc into a Device Object File (optional)
$ clang -cc1 -triple nvptx64-nvidia-cuda -S -o device.s device.bc
$ ptxas -m64 --gpu-name sm_89 -o device.o device.s -c
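If you want to inspect the resulting cubin, the CUDA toolkit’s cuobjdump can disassemble it (assuming a standard CUDA installation):
$ cuobjdump -sass device.o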
3. Build the Device Object into an Offload Object
(You can also skip Step 2 and use device.bc as the image file.)
$ clang-offload-packager -o offload.o \
--image=file=<device.o|device.bc>,triple=nvptx64-nvidia-cuda,arch=sm_89,kind=openmp
4. Embed Offload Object into Host IR
$ clang -cc1 -triple x86_64-unknown-linux-gnu -emit-obj \
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
-fembed-offload-object=offload.o -o test.o -x ir host.bc
At this stage, test.o is a fat binary, i.e., a host binary containing an offload entry that stores the offload object.
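You can confirm the embedding by looking at the object: the offload data lives in the .llvm.offloading ELF section, and recent llvm-objdump builds can dump it directly (assuming --offloading support is available):
$ llvm-readelf -S test.o | grep llvm.offloading
$ llvm-objdump --offloading test.o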
5. Final Linking
$ clang -fopenmp -fopenmp-targets=nvptx64 -O3 test.o -o exe [link other libraries and objects]
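To verify at run time that the kernels actually execute on the device, libomptarget’s informational output can help (setting LIBOMPTARGET_INFO=-1 enables all info messages):
$ LIBOMPTARGET_INFO=-1 ./exe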
I’ve also tried creating GPU functions to allow the device IR to invoke them.
1. Create the Device IR for Your GPU Function
$ clang -S -emit-llvm --cuda-gpu-arch=sm_89 --cuda-device-only \
-I /opt/cuda/include -O3 -x cuda deviceLib.cpp -o deviceLib.ll
2. Link the GPU Function with Other Device Files
You can add the -Xoffload-linker deviceLib.ll
option to Step 5 to link your GPU function with other device files.
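Concretely, Step 5 then becomes something like (a sketch combining the commands above):
$ clang -fopenmp -fopenmp-targets=nvptx64 -O3 test.o -Xoffload-linker deviceLib.ll -o exe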
jhuber6 February 6, 2025, 8:27pm 5
Glad I could help. I really need to add some more examples like that to Offloading Design & Internals — Clang 21.0.0git documentation. It’s been an iterative process and I’ve just forgotten to keep up with some of the changes. Especially now that CUDA uses the new driver by default.