Proposing llvm.gpu intrinsics
December 1, 2023, 9:48am 1
There are fundamental differences between the GPU targets and there are incidental ones. To the extent we can abstract over the differences, we get to reuse optimisations and testing across the targets.
GPUs are vector machines modelled as SIMT in IR. This spawns a bunch of architecture-specific intrinsics for things like thread id in warp or warp-level shuffles: at least one each for amdgpu and nvptx.
OpenMP has largely dealt with this by emitting calls into a runtime library whose implementations dispatch to the architecture-specific intrinsic, so the OpenMP optimisations act in part on those runtime functions.
I believe HIP uses header files that dispatch to the amdgpu intrinsics. I'm unclear what the story is for running HIP on nvptx. Fortran won't be using C++ header files and probably has its own dispatch layer, hopefully in MLIR somewhere.
Libc has its own header abstracting over these, with link-time selection of the target architecture.
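Concretely, the pattern each of those layers reimplements looks something like this (a sketch; the builtins shown are real, and the mbcnt idiom is the usual amdgpu lane-id one):

#include <stdint.h>

// Per-target dispatch to the vendor intrinsic behind a common name.
static inline uint32_t gpu_lane_id(void) {
#if defined(__AMDGCN__)
  // Count the set bits below this lane in an all-ones mask: the lane id.
  return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
#else
  return __nvvm_read_ptx_sreg_laneid();
#endif
}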
I wish to collapse this divergence into the following and am seeking to uncover support or opposition:
1/ add a llvm.gpu.name intrinsic for each of these things
2/ add a codegen IR pass that lowers those intrinsics to the target specific ones, doing some limited impedance matching as required
3/ call that pass (early) from amdgpu and nvptx codegen
4/ add trivial clang builtins that expand to the llvm intrinsics
5/ patches to fold the existing divergence onto these as one goes along
This is architecture independent until the back end and provides a common substrate for the GPU programming languages to rely on. The work is mechanical, most of the mental energy probably goes on choosing names for the intrinsics that annoy all parties equally.
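As a sketch of the intended end state (builtin and intrinsic names are placeholders, not a settled scheme):

#include <stdint.h>

// One generic builtin expands to one generic intrinsic, regardless of target.
static inline uint32_t gpu_thread_id_x(void) {
  return __builtin_gpu_thread_id_x();  // hypothetical; emits llvm.gpu.thread.id.x
}
// The codegen IR pass then lowers the intrinsic per target:
//   amdgpu: llvm.gpu.thread.id.x -> llvm.amdgcn.workitem.id.x
//   nvptx:  llvm.gpu.thread.id.x -> llvm.nvvm.read.ptx.sreg.tid.x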
bader December 1, 2023, 6:04pm 2
I fully support this idea.
GPUs are vector machines modelled as SIMT in IR. This spawns a bunch of architecture-specific intrinsics for things like thread id in warp or warp-level shuffles: at least one each for amdgpu and nvptx.
1/ add a llvm.gpu.name intrinsic for each of these things
Are you proposing intrinsics only for “thread id in warp or warp level shuffles”?
vchuravy December 1, 2023, 6:32pm 3
From the JuliaGPU side, a hearty endorsement, but there are two additional areas of concern. NVPTX doesn't fully support LLVM's atomics and has its own atomics intrinsics. Furthermore, a lot of the NVIDIA stack is implemented not with compiler intrinsics but with inline assembly.
The AMDGPU backend has, IMO, the right approach w.r.t. atomics, e.g. a legalization pass. I also think I noticed a while back that the GPU intrinsics were not modelled in the unroll cost model.
It's of the order of a couple of dozen intrinsics: probably an initial batch to cover the existing in-tree runtimes, then slowly accumulating more over time. Libc has a relatively coherent list; extracted from its utils with the noise removed, it looks like:
uint32_t get_num_blocks_x();
uint32_t get_num_blocks_y();
uint32_t get_num_blocks_z();
uint64_t get_num_blocks();
uint32_t get_block_id_x();
uint32_t get_block_id_y();
uint32_t get_block_id_z();
uint64_t get_block_id();
uint32_t get_num_threads_x();
uint32_t get_num_threads_y();
uint32_t get_num_threads_z();
uint64_t get_num_threads();
uint32_t get_thread_id_x();
uint32_t get_thread_id_y();
uint32_t get_thread_id_z();
uint64_t get_thread_id();
uint32_t get_lane_size();
uint32_t get_lane_id();
uint64_t get_lane_mask();
uint32_t broadcast_value(uint64_t, uint32_t);
uint64_t ballot(uint64_t, bool);
void sync_threads();
void sync_lane(uint64_t);
void end_program();
That looks incomplete to me - the OpenMP runtime has some shuffles in it which libc doesn't seem to be using - and also redundant, e.g. get_num_blocks is just multiplying the x/y/z ones together. But it's a fair approximation to the order of magnitude.
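For instance (a sketch in terms of the list above):

#include <stdint.h>

// The combined query is derivable from the per-dimension ones, so it need
// not be a separate intrinsic.
static inline uint64_t get_num_blocks_combined(void) {
  return (uint64_t)get_num_blocks_x() * get_num_blocks_y() * get_num_blocks_z();
}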
nvptx is missing some intrinsics - mem.bar maybe? I haven’t looked for a while. I’d be inclined to expand the proposed llvm intrinsics to the same inline asm it uses at the moment.
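Expanding to inline asm would look like the existing wrappers, e.g. (a sketch, assuming the gap really is a memory barrier):

// Emit the PTX directly rather than calling an intrinsic that may not
// exist; this is what the current header/runtime wrappers do.
static inline void gpu_membar_system(void) {
  __asm__ __volatile__("membar.sys;" ::: "memory");
}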
The atomic story on nvptx is a mess - some functions only work when compiled as CUDA, others work as freestanding C++, some require OpenCL. @jhuber6 is vaguely signed up to add GCC-style intrinsics that take a scope argument, at which point those would be a sensible way to implement concurrency in GPU-agnostic fashion (they would emit the IR instructions, never calls into OpenCL runtime functions that don't exist).
I like canonical representations of information. Compilers are basically pattern matching systems and it sucks to have to match multiple different representations of the same information.
jhuber6 December 1, 2023, 8:59pm 6
That's pretty much done in [Clang] Introduce scoped variants of GNU atomic functions by jhuber6 · Pull Request #72280 · llvm/llvm-project · GitHub; I'm mainly waiting for @efriedma-quic to give the go-ahead and some extra feedback on naming conventions. Unfortunately, the NVPTX backend doesn't support any atomic scopes right now, even though PTX does. I believe @Artem-B said it's on his ever-expanding todo list.
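For reference, a sketch of the scoped variants in use (naming per that PR, still under review at the time of writing):

// GCC-style atomic builtin plus an explicit scope argument: emits an IR
// atomic with the matching syncscope, never a runtime call.
static inline int fetch_add_device_scope(int *p, int v) {
  return __scoped_atomic_fetch_add(p, v, __ATOMIC_RELAXED,
                                   __MEMORY_SCOPE_DEVICE);
}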
I'm a bit confused by the goals right now: I can see how having some common intrinsics could be convenient for DSLs / LLVM frontends emitting GPU kernels in LLVM IR without having to select the right set of intrinsics. But that's not part of your stated motivations; instead you're alluding to "reuse optimisations and testing across the targets". Do you have examples of what you mean here? Like pointers to passes and code that would be deduplicated / simplified in LLVM? Some pointers to redundant testing that would be unified?
It’s hard for me to evaluate the impact of this on these aspects right now.
nhaehnle December 21, 2023, 3:40pm 8
One example of a potential optimization that came up recently is strengthening uniformity analysis by looking at branch conditions.
It is fairly common to have the occasional block of code guarded by something like if (thread_id == 0). All values whose definitions are guarded by such a constraint are uniform by definition.
We can teach uniformity analysis about the different per-backend intrinsics, or we could just have the relevant generic intrinsics proposed here and use those.
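A sketch of the shape in question, with a hypothetical generic builtin standing in for the per-target ones:

#include <stdint.h>

// Only thread 0 executes the guarded block, so values defined inside it
// are trivially uniform; a generic intrinsic lets uniformity analysis see
// that on every target with one pattern.
static inline void leader_writes(uint32_t *out) {
  if (__builtin_gpu_thread_id_x() == 0) {  // hypothetical generic builtin
    uint32_t v = *out + 1;  // defined under the guard: uniform, since only
    *out = v;               // one thread is active here
  }
}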
ssahasra December 23, 2023, 7:41am 9
Another would be teaching InstCombine about uniformity. If these were generic intrinsics, an expression such as __any(predicate) or __all(predicate) could be replaced by the input predicate when it is known to be uniform.
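Sketching that fold with hypothetical generic spellings of __any/__all:

#include <stdbool.h>

// If 'pred' is known uniform, every active lane holds the same value, so
// InstCombine could rewrite both calls to just 'pred':
//   __builtin_gpu_all(pred) -> pred
//   __builtin_gpu_any(pred) -> pred
static inline bool vote_all(bool pred) {
  return __builtin_gpu_all(pred);  // hypothetical generic builtin
}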
Starting to pick the threads of work back up in the new year.
The divergence and uniformity handling is a good example, thank you.
An alternative or complementary approach is magic library functions backed by a runtime, e.g. [RFC] GPU builtins runtime
Intrinsics and magic library functions (and instructions, for that matter) are the same thing in a deep sort of sense. My heuristic favours intrinsics when the semantics they represent are interesting and the expansion is simple, and compiler-rt builtins when the expansion is complicated.
With either implementation path, it would be really nice for clang to emit the same intrinsic-or-function-call symbol for different GPU targets where possible, both for lit testing and for diffing IR that works on one target and not on another.
jhuber6 January 7, 2024, 2:49pm 11
I think what we really need is a portable way to express attribute-dependent code for GPU targets that isn't hostile to optimizations. Solutions like ifuncs and the target attribute work with clang code, but they delay resolution of the functions until load time.
There are a few existing solutions to this problem, used by the NVIDIA and AMD device libraries respectively. NVIDIA uses an intrinsic called __nvvm_reflect(const char *), which a special pass resolves to the value of a macro definition such as __CUDA_ARCH__. The AMDGPU runtime instead uses a list of external variables like __oclc_ISA_version which are expected to be defined via -mlink-builtin-bitcode or similar means.
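In use that reads roughly as follows; the "__CUDA_FTZ" query string is the real one libdevice uses, while the helper functions are invented for illustration:

extern int __nvvm_reflect(const char *);
extern float rsqrt_flush_to_zero(float);  // hypothetical helper
extern float rsqrt_ieee(float);           // hypothetical helper

float my_rsqrt(float x) {
  // The NVVMReflect pass replaces the call with a constant; the dead branch
  // then folds away, leaving one architecture-appropriate implementation.
  if (__nvvm_reflect("__CUDA_FTZ"))
    return rsqrt_flush_to_zero(x);
  return rsqrt_ieee(x);
}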
I'm wondering if we could use something similar to __nvvm_reflect that allows the code to branch off of function-level IR attributes, e.g. check whether the target-cpu attribute is sm_89 or whether the ftz attribute is present. The implementation would then compile without an -mcpu or -march option, thus not emitting any target-specific attributes; when it's later linked with an IR module that has said attributes, we can propagate them and then resolve the builtin.
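Concretely, something like this hypothetical interface (names invented for illustration, nothing like this exists today):

// Resolved by a reflect-style pass once linking has supplied concrete
// function attributes; until then the call survives in the bitcode.
extern int __llvm_reflect_attr(const char *attr, const char *value);

static inline int is_sm_89(void) {
  return __llvm_reflect_attr("target-cpu", "sm_89");
}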
Would something like that be feasible?
Ah yes, well remembered. At the moderate risk of derailing this specific proposal, which is to abstract over amdgpu and nvptx, I’ll attempt to summarise the RFC I seem to have posted internally instead of externally. The internal response was muted enthusiasm provided someone else wrote it.
The proper solution for amdgpu microarch-dependent intrinsics, like u64 wavefront_size(), goes thus:
1/ Define an intrinsic for the feature (or batch them nvvm_reflect style, semantically equivalent)
2/ Library code can be written in terms of that intrinsic, e.g. the libc runtime
3/ Compiler backend instruction selection implements that intrinsic
4/ Oh look, now only the backend needs to know the target architecture, libc can be a single bitcode file
Opinion is divided on phase ordering and whether it's OK for clang to burn a microarch into the IR up front as opposed to llc finding it out later, so let us write the lowering as an IR pass which the backend always runs and which expands the arch-specific stuff where it is known and useful to do so. Application code probably knows the arch it's running on; library code would like a zero-overhead way of not shipping 40 near-identical copies of the same code. Both requirements are satisfied by this.
As an example, i32 llvm.gpu.wavesize() would exist and be available to clang as __llvm_gpu_wavesize(), but not as a constexpr thing. The lowering pass expands it to 32 on cuda, or to whatever Intel uses, and on amdgpu gets to do an exciting thing where sometimes it's known to be 32, sometimes it's known to be 64, and sometimes it's actually supposed to be undefined [1].
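In library code, that would look like (a sketch using the spelling above):

#include <stdint.h>

// Portable bitcode can branch on the wave size; the backend lowering pass
// folds the call to a constant once the target is known and the dead
// branch disappears.
static inline uint64_t full_lane_mask(void) {
  if (__llvm_gpu_wavesize() == 64)
    return ~0ull;          // wave64: all 64 bits set
  return 0xffffffffull;    // wave32: low 32 bits set
}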
Abstracting over the GPUs being unnecessarily different so language front ends and library authors get to somewhat ignore it is literally the point of this sort of proposal - it is intrinsic complexity induced by the vendors, I want to solve it once for the benefit of all.
[1] Noting the undefined thing here, with the caveat that I'm working from secondhand information which might be out of date. Compute kernels think the wavefront size is 32 or 64 based on architecture, so one IR module can know the answer to "what is the wavefront size" as a (compiler middle-end time) constant. Graphics sometimes wants 32 and sometimes wants 64 on the same hardware (which the hardware is indeed up for), and sometimes puts both in the same IR module, with "don't get it wrong" as the user experience in terms of callgraph reachability. Preserving existing behaviour means graphics continues to burn that llvm.gpu.wavesize intrinsic out at just-after-clang time (as if it was the macro constant), though this proposal also opens the door to a more comprehensive fix involving callgraph walks and cloning functions.
However, even if I am unable to change that corner of the world for the better, we can definitely change the names of uniformity/divergence-related intrinsics to a llvm.gpu. prefix to make the passes dealing with those less annoying, and likewise eliminate some of the need for users to paper over the different function names themselves.
nhaehnle January 9, 2024, 4:47am 13
At the risk of further derailing, some context here: This happens in LLPC because it can compile an entire graphics pipeline as a single IR module and (for example) the vertex shader may use a different wave size than the pixel shader.
There is no issue with call graphs in practice because the way that graphics APIs are currently still defined, different shader stages are different worlds unto themselves at the API level, and also everything is inlined anyway. And even if it wasn’t, the call graphs would be disjoint by definition.
So yeah, we’d use such an intrinsic and would just resolve it in a custom way before handing the module off to the backend (but not “just-after-clang”, because clang is uninvolved).