What exactly is stored in __tgt_device_image struct?

January 12, 2024, 2:09pm 1

Hi folks,

I’ve been exploring how OpenMP offloading is implemented and in particular I’ve been looking into data structures containing device code which are emitted by the compiler for the runtime.

I haven’t been able to find answers in the documentation, so I went to explore the source code. However, there is a lot of it, and without hands-on experience I’m not sure whether I’ve understood it correctly, hence the post here.

clang-linker-wrapper emits device code registration functions, which accept a pointer to a __tgt_bin_desc. That structure contains a pointer to a list of __tgt_device_image objects, which are pretty small containers intended to hold a binary blob with device code.
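
For reference, here is my rough mental model of those structs, based on what I saw in openmp/libomptarget/include/omptarget.h (field names are from memory and may not match the current source exactly):

#include <cstdint>

struct __tgt_offload_entry; // declared elsewhere in omptarget.h

// A single device image: the [ImageStart, ImageEnd) byte range plus the
// offload entries associated with it.
struct __tgt_device_image {
  void *ImageStart;
  void *ImageEnd;
  __tgt_offload_entry *EntriesBegin;
  __tgt_offload_entry *EntriesEnd;
};

// The descriptor passed to the registration function: an array of device
// images plus the host-side offload entries.
struct __tgt_bin_desc {
  int32_t NumDeviceImages;
  __tgt_device_image *DeviceImages;
  __tgt_offload_entry *HostEntriesBegin;
  __tgt_offload_entry *HostEntriesEnd;
};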

However, the tool internally operates on a more complicated data structure, OffloadBinary, which is capable of holding extra metadata about the device image: the image kind, the offload kind, and a string-to-string map for arbitrary auxiliary info (for example, the target device/arch).

From what I understood, that information is only used during the device link step. For example, it could be used to decide which input device images should be disregarded if the input object files were compiled for several targets but the final app is linked for only a single target.
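
To make my understanding concrete, here is a sketch of the kind of filtering I have in mind, written against the llvm::object::OffloadBinary API (this is just the idea, not what clang-linker-wrapper actually does):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Object/OffloadBinary.h"
#include <vector>

using namespace llvm;
using namespace llvm::object;

// Keep only the device binaries whose metadata matches the target we are
// actually linking for; everything else would be disregarded.
std::vector<const OffloadBinary *>
selectImagesForTarget(ArrayRef<const OffloadBinary *> Inputs,
                      StringRef TargetTriple, StringRef TargetArch) {
  std::vector<const OffloadBinary *> Selected;
  for (const OffloadBinary *OB : Inputs)
    if (OB->getTriple() == TargetTriple && OB->getArch() == TargetArch)
      Selected.push_back(OB);
  return Selected;
}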

Nevertheless, it seems that clang-linker-wrapper still serializes the whole OffloadBinary object and embeds it into the __tgt_device_image struct. Later, at runtime, the device image struct is re-created (in the DeviceImageTy constructor) to effectively drop all that extra metadata, so that the pointers within the struct point to the actual device code blob.

Is that understanding correct?

I’m also curious about how a device image is checked for compatibility with a device. It seems like the OpenMP offloading plugins decode the binary blob that __tgt_device_image points to in order to extract some information from it and do the check. For example, if the device image is an ELF, then for CUDA we look at the EF_CUDA_SM flag to get the compute capability used/required by the device image and compare it with the device we have. This means that the binary blob stored by __tgt_device_image is not treated as a black box; instead, the compiler and the OpenMP offloading plugins have a certain contract about its possible formats and content - right?
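
For the CUDA case, my mental model of that check is roughly the following (EM_CUDA = 190 and the use of the low e_flags bits as the SM version are my assumptions from reading the plugin, not necessarily the exact logic):

#include <cstdint>
#include <cstring>
#include <elf.h>

// Sketch only: can this image run on a device with compute capability
// DeviceSM (e.g. 90 for sm_90)?
bool isImageCompatible(const void *ImageStart, uint32_t DeviceSM) {
  Elf64_Ehdr Header;
  std::memcpy(&Header, ImageStart, sizeof(Header));

  // Must be an ELF in the first place.
  if (std::memcmp(Header.e_ident, ELFMAG, SELFMAG) != 0)
    return false;

  // Must target the CUDA machine type (EM_CUDA = 190).
  if (Header.e_machine != 190)
    return false;

  // Assumption: the EF_CUDA_SM bits in e_flags carry the SM version the
  // image was built for.
  uint32_t ImageSM = Header.e_flags & 0xFF;
  return ImageSM == DeviceSM;
}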

Apologies if the category is wrong, but this one seems closest to the area I’m curious about. Thanks in advance!

jhuber6 January 12, 2024, 2:35pm 2

Hello,

I’ve been changing a lot of this code recently as well, so apologies for the confusion there.

This was a very recent change after I got around to adding all the necessary ELF flags for NVIDIA machines after reverse engineering the binaries. We used to use that metadata internally as well. Now, it’s just a pointer to the ELF which contains the information we’re interested in, namely e_machine and e_flags.

The reason we still embed the bundled OffloadBinary format is that this information is put into a dedicated .llvm.offloading section, which tools like llvm-objdump --offloading can use to print that metadata for the user.

More or less, it’s just an ELF, which is the native format for executables on most targets.

The __tgt_device_image struct contains two things. First, as you’ve said, there’s the binary blob. Right now it either points to an ELF that can be loaded onto the respective target (e.g. AMDHSA or CUDA) and executed, or to an LLVM bitcode file to be compiled to an ELF in JIT fashion. These files have header information which lets us know what they’re targeting. GPUs have very little backwards compatibility, so it’s important to know which architecture an image is intended for.
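
(Hand-wavy illustration, not the actual runtime code: the two formats can be told apart just by looking at the leading magic bytes.)

#include <cstddef>

enum class ImageFormat { ELF, Bitcode, Unknown };

// Classify the image blob by its magic bytes.
ImageFormat classifyImage(const void *ImageStart, size_t Size) {
  const unsigned char *B = static_cast<const unsigned char *>(ImageStart);
  if (Size >= 4 && B[0] == 0x7F && B[1] == 'E' && B[2] == 'L' && B[3] == 'F')
    return ImageFormat::ELF;                  // "\x7FELF"
  if (Size >= 4 && B[0] == 'B' && B[1] == 'C' && B[2] == 0xC0 && B[3] == 0xDE)
    return ImageFormat::Bitcode;              // LLVM bitcode magic "BC\xC0\xDE"
  return ImageFormat::Unknown;
}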

The second is a pointer to a list of something we call __tgt_offload_entry. These are just structs that represent global variables or functions that need to be registered with the respective runtimes. Each one pretty much just contains the name of the symbol, its size, and a pointer to the host version so we can map the two. I’m currently working on burning this out of the device portion, because it’s realistically only a construct used to make it easier for the compiler to emit these variables. Right now clang just emits a global in a special section, and the linker lets us get a pointer to that section so we can iterate over all of the entries. There’s some more information at Offloading Design & Internals — Clang 18.0.0git documentation.
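
If it helps, the entry struct and the way we walk it look roughly like this (simplified; the real definition lives in omptarget.h, and the section name here is just what the ELF toolchain currently uses, so treat the details as approximate):

#include <cstddef>
#include <cstdint>

// Simplified view of an offload entry.
struct __tgt_offload_entry {
  void *addr;   // Host address of the global variable or kernel function.
  char *name;   // Symbol name used to look it up in the device image.
  size_t size;  // Size in bytes for globals, 0 for functions.
  int32_t flags;
  int32_t reserved;
};

// Clang emits each entry into a dedicated section; because the section name
// is a valid C identifier, the linker provides __start_/__stop_ symbols that
// bound the array so the runtime can simply iterate over it.
extern __tgt_offload_entry __start_omp_offloading_entries[];
extern __tgt_offload_entry __stop_omp_offloading_entries[];

// Hand-written stand-in for what the compiler would emit for a declare-target
// global, just so the example links.
int Global;
__attribute__((section("omp_offloading_entries"), used))
__tgt_offload_entry GlobalEntry = {&Global, (char *)"Global", sizeof(int), 0, 0};

void forEachEntry(void (*Callback)(const __tgt_offload_entry &)) {
  for (__tgt_offload_entry *E = __start_omp_offloading_entries;
       E != __stop_omp_offloading_entries; ++E)
    Callback(*E);
}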

Hopefully this answers your questions. If anything else is confusing I can probably explain it since I’ve practically rewritten the entire toolchain at this point.

Thanks, @jhuber6

This was a very recent change after I got around to adding all the necessary ELF flags for NVIDIA machines after reverse engineering the binaries. We used to use that metadata internally as well. Now, it’s just a pointer to the ELF which contains the information we’re interested in, namely e_machine and e_flags.

Oh, so I wasn’t the only one who couldn’t find documentation about those details :slight_smile:

Is it correct to say that, strictly speaking, there is a risk that NVIDIA updates the format somehow, breaking everything and requiring another round of reverse engineering?

The second is a pointer to a list of something we call __tgt_offload_entry .

Yeah, I’m aware of this one, but I didn’t mention it because I was specifically interested in the binary with device code.

jhuber6 January 12, 2024, 4:13pm 4

Theoretically, but it’s highly unlikely, as it would most likely break their own code as well. Most everything relating to CUDA or NVIDIA in LLVM/Clang is reverse engineered somehow, so it’s not out of the ordinary. I’ve never heard of a target ever changing its ELF flags, however.

Realistically, you can just think of the binary image as an executable. I’ve done as much with my libc project, which allows you to do stuff like this. You can just compile C/C++ code directly with clang for these targets and get an ELF out of it, if you’re interested in looking at them.

# AMD GPU: produces an ELF (HSA code object) for gfx90a
$ clang main.c --target=amdgcn-amd-amdhsa -mcpu=gfx90a -nogpulib -flto -o image
# NVIDIA GPU: produces an ELF (CUBIN) for sm_89
$ clang main.c --target=nvptx64-nvidia-cuda -march=sm_89 -nogpulib -o image
# Inspect the ELF header, sections, and symbols
$ llvm-readelf -hSs image

This is more or less what I do for my libc project (libc for GPUs — The LLVM C Library), which includes building the source and running the unit test suite.