[SPIRV] GPU intrinsics by JonChesterfield · Pull Request #131190 · llvm/llvm-project (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation51 Commits1 Checks12 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Introduce __builtin_gpu builtins to clang and corresponding llvm.gpu intrinsics in llvm for abstracting over minor differences between GPU architectures, and use those to implement a gpuintrin.h instantiation to support compiling llvm libc to spirv64--.

Motivated by discussion at https://discourse.llvm.org/t/rfc-spirv-ir-as-a-vendor-agnostic-gpu-representation/85115 and RFC for this specifically at https://discourse.llvm.org/t/rfc-llvm-gpu-builtins-for-target-agnostic-code-representation/85181.

These are not named llvm.spirv because there are no spirv specific semantics involved. They're deliberately the same small abstraction over targets used by llvm libc already. Essentially this patch allows us to postpone choosing a target architecture for libc until JIT time.

There is some refactoring to be done if this lands - moving some work out of CGBuiltin.cpp, simplifying gpuintrin.h, adjusting openmp's codegen to use these instead of devicertl magic functions.

@llvm/pr-subscribers-backend-nvptx
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-x86

Author: Jon Chesterfield (JonChesterfield)

Changes

Motivated by discussion at https://discourse.llvm.org/t/rfc-spirv-ir-as-a-vendor-agnostic-gpu-representation/85115 and RFC for this specifically as .

There is some refactoring to be done if this lands - moving some work out of CGBuiltin.cpp, simplifying gpuintrin.h, adjusting openmp's codegen to use these instead of devicertl magic functions.

Patch is 100.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/131190.diff

18 Files Affected:

(modified) clang/include/clang/Basic/Builtins.td (+33)
(modified) clang/lib/Headers/amdgpuintrin.h (+1-1)
(modified) clang/lib/Headers/gpuintrin.h (+2)
(added) clang/lib/Headers/spirvintrin.h (+177)
(added) clang/test/CodeGen/amdgpu-grid-builtins.c (+158)
(added) clang/test/CodeGen/gpu_builtins.c (+647)
(modified) clang/test/Headers/gpuintrin.c (+223)
(modified) llvm/include/llvm/IR/Intrinsics.td (+63)
(modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+2)
(modified) llvm/include/llvm/InitializePasses.h (+1)
(modified) llvm/include/llvm/Transforms/Scalar.h (+6)
(added) llvm/include/llvm/Transforms/Scalar/LowerGPUIntrinsic.h (+26)
(modified) llvm/lib/Passes/PassBuilder.cpp (+1)
(modified) llvm/lib/Passes/PassRegistry.def (+1)
(modified) llvm/lib/Transforms/Scalar/CMakeLists.txt (+1)
(added) llvm/lib/Transforms/Scalar/LowerGPUIntrinsic.cpp (+501)
(modified) llvm/lib/Transforms/Scalar/Scalar.cpp (+1)
(added) llvm/test/CodeGen/SPIRV/gpu_intrinsics.ll (+427)

diff --git a/clang/include/clang/Basic/Builtins.td b/clang/include/clang/Basic/Builtins.td index 2fbdfaea57ccd..fb87b5c74c0f8 100644 --- a/clang/include/clang/Basic/Builtins.td +++ b/clang/include/clang/Basic/Builtins.td @@ -4770,6 +4770,39 @@ def GetDeviceSideMangledName : LangBuiltin<"CUDA_LANG"> { let Prototype = "char const*(...)"; }

+// GPU intrinsics +class GPUBuiltin : Builtin {

let Spellings = ["_builtin" # NAME];
let Prototype = prototype;
let Attributes = [NoThrow]; +}
+multiclass GPUGridBuiltin {
def _x : GPUBuiltin;
def _y : GPUBuiltin;
def _z : GPUBuiltin; +}
+defm gpu_num_blocks : GPUGridBuiltin<"uint32_t()">; +defm gpu_block_id : GPUGridBuiltin<"uint32_t()">; +defm gpu_num_threads : GPUGridBuiltin<"uint32_t()">; +defm gpu_thread_id : GPUGridBuiltin<"uint32_t()">;
+def gpu_num_lanes : GPUBuiltin<"uint32_t()">; +def gpu_lane_id : GPUBuiltin<"uint32_t()">; +def gpu_lane_mask : GPUBuiltin<"uint64_t()">;
+def gpu_read_first_lane_u32 : GPUBuiltin<"uint32_t(uint64_t, uint32_t)">; +def gpu_shuffle_idx_u32 : GPUBuiltin<"uint32_t(uint64_t, uint32_t, uint32_t, uint32_t)">;
+def gpu_ballot : GPUBuiltin<"uint64_t(uint64_t, bool)">;
+// Todo, why can't I mark these convergent? +def gpu_sync_threads : GPUBuiltin<"void()">; +def gpu_sync_lane : GPUBuiltin<"void(uint64_t)">; +def gpu_exit : GPUBuiltin<"void()">; +def gpu_thread_suspend : GPUBuiltin<"void()">;
// HLSL def HLSLAddUint64: LangBuiltin<"HLSL_LANG"> { let Spellings = ["__builtin_hlsl_adduint64"]; diff --git a/clang/lib/Headers/amdgpuintrin.h b/clang/lib/Headers/amdgpuintrin.h index f7fb8e2814180..817cfeec896c4 100644 --- a/clang/lib/Headers/amdgpuintrin.h +++ b/clang/lib/Headers/amdgpuintrin.h @@ -1,4 +1,4 @@ -//===-- amdgpuintrin.h - AMDPGU intrinsic functions -----------------------===// +//===-- amdgpuintrin.h - AMDGPU intrinsic functions -----------------------===// // // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. // See https://llvm.org/LICENSE.txt for license information. diff --git a/clang/lib/Headers/gpuintrin.h b/clang/lib/Headers/gpuintrin.h index 0fb3916acac61..a3ce535188a48 100644 --- a/clang/lib/Headers/gpuintrin.h +++ b/clang/lib/Headers/gpuintrin.h @@ -60,6 +60,8 @@ _Pragma("omp end declare target"); #include <nvptxintrin.h> #elif defined(AMDGPU) #include <amdgpuintrin.h>

+#elif defined(SPIRV64) +#include <spirvintrin.h> #elif !defined(_OPENMP) #error "This header is only meant to be used on GPU architectures." #endif diff --git a/clang/lib/Headers/spirvintrin.h b/clang/lib/Headers/spirvintrin.h new file mode 100644 index 0000000000000..a5129d3577edf --- /dev/null +++ b/clang/lib/Headers/spirvintrin.h @@ -0,0 +1,177 @@ +//===-- spirvintrin.h - SPIRV intrinsic functions ------------------------===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#ifndef __SPIRVINTRIN_H +#define __SPIRVINTRIN_H + +#ifndef SPIRV64 +// 32 bit SPIRV is currently a stretch goal +#error "This file is intended for SPIRV64 targets or offloading to SPIRV64" +#endif + +#ifndef __GPUINTRIN_H +#error "Never use <spirvintrin.h> directly; include <gpuintrin.h> instead" +#endif + +// This is the skeleton of the spirv implementation for gpuintrin +// Address spaces and kernel attribute are not yet implemented +// The target-specific functions are declarations waiting for clang support + +#if defined(_OPENMP) +#error "Openmp is not yet available on spirv though gpuintrin header" +#endif + +// Type aliases to the address spaces used by the SPIRV backend. +#define __gpu_private +#define __gpu_constant +#define __gpu_local +#define __gpu_global +#define __gpu_generic + +// Attribute to declare a function as a kernel. +#define __gpu_kernel + +// Returns the number of workgroups in the 'x' dimension of the grid. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_blocks_x(void) {

return __builtin_gpu_num_blocks_x(); +}
+// Returns the number of workgroups in the 'y' dimension of the grid. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_blocks_y(void) {
return __builtin_gpu_num_blocks_y(); +}
+// Returns the number of workgroups in the 'z' dimension of the grid. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_blocks_z(void) {
return __builtin_gpu_num_blocks_z(); +}
+// Returns the 'x' dimension of the current AMD workgroup's id. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_block_id_x(void) {
return __builtin_gpu_block_id_x(); +}
+// Returns the 'y' dimension of the current AMD workgroup's id. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_block_id_y(void) {
return __builtin_gpu_block_id_y(); +}
+// Returns the 'z' dimension of the current AMD workgroup's id. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_block_id_z(void) {
return __builtin_gpu_block_id_z(); +}
+// Returns the number of workitems in the 'x' dimension. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_threads_x(void) {
return __builtin_gpu_num_threads_x(); +}
+// Returns the number of workitems in the 'y' dimension. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_threads_y(void) {
return __builtin_gpu_num_threads_y(); +}
+// Returns the number of workitems in the 'z' dimension. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_threads_z(void) {
return __builtin_gpu_num_threads_z(); +}
+// Returns the 'x' dimension id of the workitem in the current workgroup. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_thread_id_x(void) {
return __builtin_gpu_thread_id_x(); +}
+// Returns the 'y' dimension id of the workitem in the current workgroup. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_thread_id_y(void) {
return __builtin_gpu_thread_id_y(); +}
+// Returns the 'z' dimension id of the workitem in the current workgroup. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_thread_id_z(void) {
return __builtin_gpu_thread_id_z(); +}
+// Returns the size of the wave. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_lanes(void) {
return __builtin_gpu_num_lanes(); +}
+// Returns the id of the thread inside of a wave executing together. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_lane_id(void) {
return __builtin_gpu_lane_id(); +}
+// Returns the bit-mask of active threads in the current wave. +_DEFAULT_FN_ATTRS static inline uint64_t __gpu_lane_mask(void) {
return __builtin_gpu_lane_mask(); +}
+// Copies the value from the first active thread in the wave to the rest. +_DEFAULT_FN_ATTRS static inline uint32_t +__gpu_read_first_lane_u32(uint64_t __lane_mask, uint32_t __x) {
return __builtin_gpu_read_first_lane_u32(__lane_mask, __x); +}
+// Returns a bitmask of threads in the current lane for which \p x is true. +_DEFAULT_FN_ATTRS static inline uint64_t __gpu_ballot(uint64_t __lane_mask,

                                                     bool __x) {

return __builtin_gpu_ballot(__lane_mask, __x); +}
+// Waits for all the threads in the block to converge and issues a fence. +_DEFAULT_FN_ATTRS static inline void __gpu_sync_threads(void) {
return __builtin_gpu_sync_threads(); +}
+// Wait for all threads in the wave to converge +_DEFAULT_FN_ATTRS static inline void __gpu_sync_lane(uint64_t __lane_mask) {
return __builtin_gpu_sync_lane(__lane_mask); +}
+// Shuffles the the lanes inside the wave according to the given index. +_DEFAULT_FN_ATTRS static inline uint32_t +__gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x,
```
                 uint32_t __width) {
```
return __builtin_gpu_shuffle_idx_u32(__lane_mask, __idx, __x, __width); +}
+// Returns a bitmask marking all lanes that have the same value of __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_any_u32(uint64_t __lane_mask, uint32_t __x) {
return __gpu_match_any_u32_impl(__lane_mask, __x); +}
+// Returns a bitmask marking all lanes that have the same value of __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_any_u64(uint64_t __lane_mask, uint64_t __x) {
return __gpu_match_any_u64_impl(__lane_mask, __x); +}
+// Returns the current lane mask if every lane contains __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_all_u32(uint64_t __lane_mask, uint32_t __x) {
return __gpu_match_all_u32_impl(__lane_mask, __x); +}
+// Returns the current lane mask if every lane contains __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_all_u64(uint64_t __lane_mask, uint64_t __x) {
return __gpu_match_all_u64_impl(__lane_mask, __x); +}
+// Terminates execution of the associated wave. +_DEFAULT_FN_ATTRS [[noreturn]] static inline void __gpu_exit(void) {
return __builtin_gpu_exit(); +}
+// Suspend the thread briefly to assist the scheduler during busy loops. +_DEFAULT_FN_ATTRS static inline void __gpu_thread_suspend(void) {
return __builtin_gpu_thread_suspend(); +}
+#endif // __SPIRVINTRIN_H

diff --git a/clang/test/CodeGen/amdgpu-grid-builtins.c b/clang/test/CodeGen/amdgpu-grid-builtins.c new file mode 100644 index 0000000000000..2104da2dc3cbc --- /dev/null +++ b/clang/test/CodeGen/amdgpu-grid-builtins.c @@ -0,0 +1,158 @@ +// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 5 +// REQUIRES: amdgpu-registered-target +// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -emit-llvm -O1 %s -o - | FileCheck %s + +#include <stdint.h> + +// CHECK-LABEL: define dso_local noundef i32 @workgroup_id_x( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call i32 @llvm.amdgcn.workgroup.id.x() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workgroup_id_x(void) +{

return __builtin_amdgcn_workgroup_id_x(); +}
+// CHECK-LABEL: define dso_local noundef i32 @workgroup_id_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR2:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call i32 @llvm.amdgcn.workgroup.id.y() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workgroup_id_y(void) +{
return __builtin_amdgcn_workgroup_id_y(); +}
+// CHECK-LABEL: define dso_local noundef i32 @workgroup_id_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR3:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call i32 @llvm.amdgcn.workgroup.id.z() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workgroup_id_z(void) +{
return __builtin_amdgcn_workgroup_id_z(); +}
+// CHECK-LABEL: define dso_local noundef range(i32 0, 1024) i32 @workitem_id_x( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR4:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call noundef range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workitem_id_x(void) +{
return __builtin_amdgcn_workitem_id_x(); +}
+// CHECK-LABEL: define dso_local noundef range(i32 0, 1024) i32 @workitem_id_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR5:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call noundef range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.y() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workitem_id_y(void) +{
return __builtin_amdgcn_workitem_id_y(); +}
+// CHECK-LABEL: define dso_local noundef range(i32 0, 1024) i32 @workitem_id_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR6:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call noundef range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.z() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workitem_id_z(void) +{
return __builtin_amdgcn_workitem_id_z(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 1025) i32 @workgroup_size_x( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR7:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 12 +// CHECK-NEXT: [[TMP2:%.]] = load i16, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG3:![0-9]+]], !invariant.load [[META4:![0-9]+]], !noundef [[META4]] +// CHECK-NEXT: [[CONV:%.*]] = zext nneg i16 [[TMP2]] to i32 +// CHECK-NEXT: ret i32 [[CONV]] +// +uint32_t workgroup_size_x(void) +{
return __builtin_amdgcn_workgroup_size_x(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 1025) i32 @workgroup_size_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR7]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 14 +// CHECK-NEXT: [[TMP2:%.]] = load i16, ptr addrspace(4) [[TMP1]], align 2, !range [[RNG3]], !invariant.load [[META4]], !noundef [[META4]] +// CHECK-NEXT: [[CONV:%.*]] = zext nneg i16 [[TMP2]] to i32 +// CHECK-NEXT: ret i32 [[CONV]] +// +uint32_t workgroup_size_y(void) +{
return __builtin_amdgcn_workgroup_size_y(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 1025) i32 @workgroup_size_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR7]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 16 +// CHECK-NEXT: [[TMP2:%.]] = load i16, ptr addrspace(4) [[TMP1]], align 8, !range [[RNG3]], !invariant.load [[META4]], !noundef [[META4]] +// CHECK-NEXT: [[CONV:%.*]] = zext nneg i16 [[TMP2]] to i32 +// CHECK-NEXT: ret i32 [[CONV]] +// +uint32_t workgroup_size_z(void) +{
return __builtin_amdgcn_workgroup_size_z(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 0) i32 @grid_size_x( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR8:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 12 +// CHECK-NEXT: [[TMP2:%.]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG5:![0-9]+]], !invariant.load [[META4]] +// CHECK-NEXT: ret i32 [[TMP2]] +// +uint32_t grid_size_x(void) +{
return __builtin_amdgcn_grid_size_x(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 0) i32 @grid_size_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR8]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 16 +// CHECK-NEXT: [[TMP2:%.]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG5]], !invariant.load [[META4]] +// CHECK-NEXT: ret i32 [[TMP2]] +// +uint32_t grid_size_y(void) +{
return __builtin_amdgcn_grid_size_y(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 0) i32 @grid_size_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR8]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 20 +// CHECK-NEXT: [[TMP2:%.]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG5]], !invariant.load [[META4]] +// CHECK-NEXT: ret i32 [[TMP2]] +// +uint32_t grid_size_z(void) +{
return __builtin_amdgcn_grid_size_z(); +}
+//. +// CHECK: [[RNG3]] = !{i16 1, i16 1025} +// CHECK: [[META4]] = !{} +// CHECK: [[RNG5]] = !{i32 1, i32 0} +//. diff --git a/clang/test/CodeGen/gpu_builtins.c b/clang/test/CodeGen/gpu_builtins.c new file mode 100644 index 0000000000000..8231b0952fa5e --- /dev/null +++ b/clang/test/CodeGen/gpu_builtins.c @@ -0,0 +1,647 @@ +// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-attributes +// RUN: %clang_cc1 -O1 -triple spirv64 %s -emit-llvm -o - | FileCheck %s --check-prefix=SPIRV64 +// RUN: %clang_cc1 -O1 -triple spirv64-amd-amdhsa %s -emit-llvm -o - | FileCheck %s --check-prefix=AMDHSA +// RUN: %clang_cc1 -O1 -triple nvptx64 -emit-llvm %s -o - | FileCheck %s --check-prefix=NVPTX +// RUN: %clang_cc1 -O1 -triple amdgcn -emit-llvm %s -o - | FileCheck %s --check-prefix=AMDGCN
+#include <stdint.h>
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// SPIRV64-LABEL: @num_blocks_x( +// SPIRV64-NEXT: entry: +// SPIRV64-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.x() +// SPIRV64-NEXT: ret i32 [[TMP0]] +// +// AMDHSA: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDHSA-LABEL: @num_blocks_x( +// AMDHSA-NEXT: entry: +// AMDHSA-NEXT: [[TMP0:%.]] = tail call addrspace(4) i32 @llvm.gpu.num.blocks.x() +// AMDHSA-NEXT: ret i32 [[TMP0]] +// +// NVPTX: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// NVPTX-LABEL: @num_blocks_x( +// NVPTX-NEXT: entry: +// NVPTX-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.x() +// NVPTX-NEXT: ret i32 [[TMP0]] +// +// AMDGCN: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDGCN-LABEL: @num_blocks_x( +// AMDGCN-NEXT: entry: +// AMDGCN-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.x() +// AMDGCN-NEXT: ret i32 [[TMP0]] +// +uint32_t num_blocks_x(void) {
return __builtin_gpu_num_blocks_x(); +}
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// SPIRV64-LABEL: @num_blocks_y( +// SPIRV64-NEXT: entry: +// SPIRV64-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.y() +// SPIRV64-NEXT: ret i32 [[TMP0]] +// +// AMDHSA: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDHSA-LABEL: @num_blocks_y( +// AMDHSA-NEXT: entry: +// AMDHSA-NEXT: [[TMP0:%.]] = tail call addrspace(4) i32 @llvm.gpu.num.blocks.y() +// AMDHSA-NEXT: ret i32 [[TMP0]] +// +// NVPTX: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// NVPTX-LABEL: @num_blocks_y( +// NVPTX-NEXT: entry: +// NVPTX-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.y() +// NVPTX-NEXT: ret i32 [[TMP0]] +// +// AMDGCN: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDGCN-LABEL: @num_blocks_y( +// AMDGCN-NEXT: entry: +// AMDGCN-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.y() +// AMDGCN-NEXT: ret i32 [[TMP0]] +// +uint32_t num_blocks_y(void) {
return __builtin_gpu_num_blocks_y(); +}
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// SPIRV64-LABEL: @num_blocks_z( +// SPIRV64-NEXT: entry: +// SPIRV64-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.z() +// SPIRV64-NEXT: ret i32 [[TMP0]] +// +// AMDHSA: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDHSA-LABEL: @num_blocks_z( +// AMDHSA-NEXT: entry: +// AMDHSA-NEXT: [[TMP0:%.]] = tail call addrspace(4) i32 @llvm.gpu.num.blocks.z() +// AMDHSA-NEXT: ret i32 [[TMP0]] +// +// NVPTX: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// NVPTX-LABEL: @num_blocks_z( +// NVPTX-NEXT: entry: +// NVPTX-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.z() +// NVPTX-NEXT: ret i32 [[TMP0]] +// +// AMDGCN: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDGCN-LABEL: @num_blocks_z( +// AMDGCN-NEXT: entry: +// AMDGCN-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.z() +// AMDGCN-NEXT: ret i32 [[TMP0]] +// +uint32_t num_blocks_z(void) {
return __builtin_gpu_num_blocks_z(); +}
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwi... [truncated]

@llvm/pr-subscribers-backend-spir-v

Author: Jon Chesterfield (JonChesterfield)

Changes

Motivated by discussion at https://discourse.llvm.org/t/rfc-spirv-ir-as-a-vendor-agnostic-gpu-representation/85115 and RFC for this specifically as .

There is some refactoring to be done if this lands - moving some work out of CGBuiltin.cpp, simplifying gpuintrin.h, adjusting openmp's codegen to use these instead of devicertl magic functions.

Patch is 100.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/131190.diff

18 Files Affected:

(modified) clang/include/clang/Basic/Builtins.td (+33)
(modified) clang/lib/Headers/amdgpuintrin.h (+1-1)
(modified) clang/lib/Headers/gpuintrin.h (+2)
(added) clang/lib/Headers/spirvintrin.h (+177)
(added) clang/test/CodeGen/amdgpu-grid-builtins.c (+158)
(added) clang/test/CodeGen/gpu_builtins.c (+647)
(modified) clang/test/Headers/gpuintrin.c (+223)
(modified) llvm/include/llvm/IR/Intrinsics.td (+63)
(modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+2)
(modified) llvm/include/llvm/InitializePasses.h (+1)
(modified) llvm/include/llvm/Transforms/Scalar.h (+6)
(added) llvm/include/llvm/Transforms/Scalar/LowerGPUIntrinsic.h (+26)
(modified) llvm/lib/Passes/PassBuilder.cpp (+1)
(modified) llvm/lib/Passes/PassRegistry.def (+1)
(modified) llvm/lib/Transforms/Scalar/CMakeLists.txt (+1)
(added) llvm/lib/Transforms/Scalar/LowerGPUIntrinsic.cpp (+501)
(modified) llvm/lib/Transforms/Scalar/Scalar.cpp (+1)
(added) llvm/test/CodeGen/SPIRV/gpu_intrinsics.ll (+427)

+// GPU intrinsics +class GPUBuiltin : Builtin {

let Spellings = ["_builtin" # NAME];
let Prototype = prototype;
let Attributes = [NoThrow]; +}
+multiclass GPUGridBuiltin {
def _x : GPUBuiltin;
def _y : GPUBuiltin;
def _z : GPUBuiltin; +}
+defm gpu_num_blocks : GPUGridBuiltin<"uint32_t()">; +defm gpu_block_id : GPUGridBuiltin<"uint32_t()">; +defm gpu_num_threads : GPUGridBuiltin<"uint32_t()">; +defm gpu_thread_id : GPUGridBuiltin<"uint32_t()">;
+def gpu_num_lanes : GPUBuiltin<"uint32_t()">; +def gpu_lane_id : GPUBuiltin<"uint32_t()">; +def gpu_lane_mask : GPUBuiltin<"uint64_t()">;
+def gpu_read_first_lane_u32 : GPUBuiltin<"uint32_t(uint64_t, uint32_t)">; +def gpu_shuffle_idx_u32 : GPUBuiltin<"uint32_t(uint64_t, uint32_t, uint32_t, uint32_t)">;
+def gpu_ballot : GPUBuiltin<"uint64_t(uint64_t, bool)">;
+// Todo, why can't I mark these convergent? +def gpu_sync_threads : GPUBuiltin<"void()">; +def gpu_sync_lane : GPUBuiltin<"void(uint64_t)">; +def gpu_exit : GPUBuiltin<"void()">; +def gpu_thread_suspend : GPUBuiltin<"void()">;
// HLSL def HLSLAddUint64: LangBuiltin<"HLSL_LANG"> { let Spellings = ["__builtin_hlsl_adduint64"]; diff --git a/clang/lib/Headers/amdgpuintrin.h b/clang/lib/Headers/amdgpuintrin.h index f7fb8e2814180..817cfeec896c4 100644 --- a/clang/lib/Headers/amdgpuintrin.h +++ b/clang/lib/Headers/amdgpuintrin.h @@ -1,4 +1,4 @@ -//===-- amdgpuintrin.h - AMDPGU intrinsic functions -----------------------===// +//===-- amdgpuintrin.h - AMDGPU intrinsic functions -----------------------===// // // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. // See https://llvm.org/LICENSE.txt for license information. diff --git a/clang/lib/Headers/gpuintrin.h b/clang/lib/Headers/gpuintrin.h index 0fb3916acac61..a3ce535188a48 100644 --- a/clang/lib/Headers/gpuintrin.h +++ b/clang/lib/Headers/gpuintrin.h @@ -60,6 +60,8 @@ _Pragma("omp end declare target"); #include <nvptxintrin.h> #elif defined(AMDGPU) #include <amdgpuintrin.h>

return __builtin_gpu_num_blocks_x(); +}
+// Returns the number of workgroups in the 'y' dimension of the grid. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_blocks_y(void) {
return __builtin_gpu_num_blocks_y(); +}
+// Returns the number of workgroups in the 'z' dimension of the grid. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_blocks_z(void) {
return __builtin_gpu_num_blocks_z(); +}
+// Returns the 'x' dimension of the current AMD workgroup's id. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_block_id_x(void) {
return __builtin_gpu_block_id_x(); +}
+// Returns the 'y' dimension of the current AMD workgroup's id. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_block_id_y(void) {
return __builtin_gpu_block_id_y(); +}
+// Returns the 'z' dimension of the current AMD workgroup's id. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_block_id_z(void) {
return __builtin_gpu_block_id_z(); +}
+// Returns the number of workitems in the 'x' dimension. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_threads_x(void) {
return __builtin_gpu_num_threads_x(); +}
+// Returns the number of workitems in the 'y' dimension. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_threads_y(void) {
return __builtin_gpu_num_threads_y(); +}
+// Returns the number of workitems in the 'z' dimension. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_threads_z(void) {
return __builtin_gpu_num_threads_z(); +}
+// Returns the 'x' dimension id of the workitem in the current workgroup. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_thread_id_x(void) {
return __builtin_gpu_thread_id_x(); +}
+// Returns the 'y' dimension id of the workitem in the current workgroup. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_thread_id_y(void) {
return __builtin_gpu_thread_id_y(); +}
+// Returns the 'z' dimension id of the workitem in the current workgroup. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_thread_id_z(void) {
return __builtin_gpu_thread_id_z(); +}
+// Returns the size of the wave. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_num_lanes(void) {
return __builtin_gpu_num_lanes(); +}
+// Returns the id of the thread inside of a wave executing together. +_DEFAULT_FN_ATTRS static inline uint32_t __gpu_lane_id(void) {
return __builtin_gpu_lane_id(); +}
+// Returns the bit-mask of active threads in the current wave. +_DEFAULT_FN_ATTRS static inline uint64_t __gpu_lane_mask(void) {
return __builtin_gpu_lane_mask(); +}
+// Copies the value from the first active thread in the wave to the rest. +_DEFAULT_FN_ATTRS static inline uint32_t +__gpu_read_first_lane_u32(uint64_t __lane_mask, uint32_t __x) {
return __builtin_gpu_read_first_lane_u32(__lane_mask, __x); +}
+// Returns a bitmask of threads in the current lane for which \p x is true. +_DEFAULT_FN_ATTRS static inline uint64_t __gpu_ballot(uint64_t __lane_mask,

                                                     bool __x) {

return __builtin_gpu_ballot(__lane_mask, __x); +}
+// Waits for all the threads in the block to converge and issues a fence. +_DEFAULT_FN_ATTRS static inline void __gpu_sync_threads(void) {
return __builtin_gpu_sync_threads(); +}
+// Wait for all threads in the wave to converge +_DEFAULT_FN_ATTRS static inline void __gpu_sync_lane(uint64_t __lane_mask) {
return __builtin_gpu_sync_lane(__lane_mask); +}
+// Shuffles the the lanes inside the wave according to the given index. +_DEFAULT_FN_ATTRS static inline uint32_t +__gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x,
```
                 uint32_t __width) {
```
return __builtin_gpu_shuffle_idx_u32(__lane_mask, __idx, __x, __width); +}
+// Returns a bitmask marking all lanes that have the same value of __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_any_u32(uint64_t __lane_mask, uint32_t __x) {
return __gpu_match_any_u32_impl(__lane_mask, __x); +}
+// Returns a bitmask marking all lanes that have the same value of __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_any_u64(uint64_t __lane_mask, uint64_t __x) {
return __gpu_match_any_u64_impl(__lane_mask, __x); +}
+// Returns the current lane mask if every lane contains __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_all_u32(uint64_t __lane_mask, uint32_t __x) {
return __gpu_match_all_u32_impl(__lane_mask, __x); +}
+// Returns the current lane mask if every lane contains __x. +_DEFAULT_FN_ATTRS static inline uint64_t +__gpu_match_all_u64(uint64_t __lane_mask, uint64_t __x) {
return __gpu_match_all_u64_impl(__lane_mask, __x); +}
+// Terminates execution of the associated wave. +_DEFAULT_FN_ATTRS [[noreturn]] static inline void __gpu_exit(void) {
return __builtin_gpu_exit(); +}
+// Suspend the thread briefly to assist the scheduler during busy loops. +_DEFAULT_FN_ATTRS static inline void __gpu_thread_suspend(void) {
return __builtin_gpu_thread_suspend(); +}
+#endif // __SPIRVINTRIN_H

return __builtin_amdgcn_workgroup_id_x(); +}
+// CHECK-LABEL: define dso_local noundef i32 @workgroup_id_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR2:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call i32 @llvm.amdgcn.workgroup.id.y() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workgroup_id_y(void) +{
return __builtin_amdgcn_workgroup_id_y(); +}
+// CHECK-LABEL: define dso_local noundef i32 @workgroup_id_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR3:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call i32 @llvm.amdgcn.workgroup.id.z() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workgroup_id_z(void) +{
return __builtin_amdgcn_workgroup_id_z(); +}
+// CHECK-LABEL: define dso_local noundef range(i32 0, 1024) i32 @workitem_id_x( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR4:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call noundef range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.x() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workitem_id_x(void) +{
return __builtin_amdgcn_workitem_id_x(); +}
+// CHECK-LABEL: define dso_local noundef range(i32 0, 1024) i32 @workitem_id_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR5:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call noundef range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.y() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workitem_id_y(void) +{
return __builtin_amdgcn_workitem_id_y(); +}
+// CHECK-LABEL: define dso_local noundef range(i32 0, 1024) i32 @workitem_id_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR6:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call noundef range(i32 0, 1024) i32 @llvm.amdgcn.workitem.id.z() +// CHECK-NEXT: ret i32 [[TMP0]] +// +uint32_t workitem_id_z(void) +{
return __builtin_amdgcn_workitem_id_z(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 1025) i32 @workgroup_size_x( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR7:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 12 +// CHECK-NEXT: [[TMP2:%.]] = load i16, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG3:![0-9]+]], !invariant.load [[META4:![0-9]+]], !noundef [[META4]] +// CHECK-NEXT: [[CONV:%.*]] = zext nneg i16 [[TMP2]] to i32 +// CHECK-NEXT: ret i32 [[CONV]] +// +uint32_t workgroup_size_x(void) +{
return __builtin_amdgcn_workgroup_size_x(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 1025) i32 @workgroup_size_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR7]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 14 +// CHECK-NEXT: [[TMP2:%.]] = load i16, ptr addrspace(4) [[TMP1]], align 2, !range [[RNG3]], !invariant.load [[META4]], !noundef [[META4]] +// CHECK-NEXT: [[CONV:%.*]] = zext nneg i16 [[TMP2]] to i32 +// CHECK-NEXT: ret i32 [[CONV]] +// +uint32_t workgroup_size_y(void) +{
return __builtin_amdgcn_workgroup_size_y(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 1025) i32 @workgroup_size_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR7]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 8 dereferenceable(256) ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 16 +// CHECK-NEXT: [[TMP2:%.]] = load i16, ptr addrspace(4) [[TMP1]], align 8, !range [[RNG3]], !invariant.load [[META4]], !noundef [[META4]] +// CHECK-NEXT: [[CONV:%.*]] = zext nneg i16 [[TMP2]] to i32 +// CHECK-NEXT: ret i32 [[CONV]] +// +uint32_t workgroup_size_z(void) +{
return __builtin_amdgcn_workgroup_size_z(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 0) i32 @grid_size_x( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR8:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 12 +// CHECK-NEXT: [[TMP2:%.]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG5:![0-9]+]], !invariant.load [[META4]] +// CHECK-NEXT: ret i32 [[TMP2]] +// +uint32_t grid_size_x(void) +{
return __builtin_amdgcn_grid_size_x(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 0) i32 @grid_size_y( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR8]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 16 +// CHECK-NEXT: [[TMP2:%.]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG5]], !invariant.load [[META4]] +// CHECK-NEXT: ret i32 [[TMP2]] +// +uint32_t grid_size_y(void) +{
return __builtin_amdgcn_grid_size_y(); +}
+// CHECK-LABEL: define dso_local range(i32 1, 0) i32 @grid_size_z( +// CHECK-SAME: ) local_unnamed_addr #[[ATTR8]] { +// CHECK-NEXT: [[ENTRY:.:]] +// CHECK-NEXT: [[TMP0:%.]] = tail call align 4 dereferenceable(64) ptr addrspace(4) @llvm.amdgcn.dispatch.ptr() +// CHECK-NEXT: [[TMP1:%.]] = getelementptr inbounds nuw i8, ptr addrspace(4) [[TMP0]], i64 20 +// CHECK-NEXT: [[TMP2:%.]] = load i32, ptr addrspace(4) [[TMP1]], align 4, !range [[RNG5]], !invariant.load [[META4]] +// CHECK-NEXT: ret i32 [[TMP2]] +// +uint32_t grid_size_z(void) +{
return __builtin_amdgcn_grid_size_z(); +}
+//. +// CHECK: [[RNG3]] = !{i16 1, i16 1025} +// CHECK: [[META4]] = !{} +// CHECK: [[RNG5]] = !{i32 1, i32 0} +//. diff --git a/clang/test/CodeGen/gpu_builtins.c b/clang/test/CodeGen/gpu_builtins.c new file mode 100644 index 0000000000000..8231b0952fa5e --- /dev/null +++ b/clang/test/CodeGen/gpu_builtins.c @@ -0,0 +1,647 @@ +// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-attributes +// RUN: %clang_cc1 -O1 -triple spirv64 %s -emit-llvm -o - | FileCheck %s --check-prefix=SPIRV64 +// RUN: %clang_cc1 -O1 -triple spirv64-amd-amdhsa %s -emit-llvm -o - | FileCheck %s --check-prefix=AMDHSA +// RUN: %clang_cc1 -O1 -triple nvptx64 -emit-llvm %s -o - | FileCheck %s --check-prefix=NVPTX +// RUN: %clang_cc1 -O1 -triple amdgcn -emit-llvm %s -o - | FileCheck %s --check-prefix=AMDGCN
+#include <stdint.h>
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// SPIRV64-LABEL: @num_blocks_x( +// SPIRV64-NEXT: entry: +// SPIRV64-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.x() +// SPIRV64-NEXT: ret i32 [[TMP0]] +// +// AMDHSA: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDHSA-LABEL: @num_blocks_x( +// AMDHSA-NEXT: entry: +// AMDHSA-NEXT: [[TMP0:%.]] = tail call addrspace(4) i32 @llvm.gpu.num.blocks.x() +// AMDHSA-NEXT: ret i32 [[TMP0]] +// +// NVPTX: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// NVPTX-LABEL: @num_blocks_x( +// NVPTX-NEXT: entry: +// NVPTX-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.x() +// NVPTX-NEXT: ret i32 [[TMP0]] +// +// AMDGCN: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDGCN-LABEL: @num_blocks_x( +// AMDGCN-NEXT: entry: +// AMDGCN-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.x() +// AMDGCN-NEXT: ret i32 [[TMP0]] +// +uint32_t num_blocks_x(void) {
return __builtin_gpu_num_blocks_x(); +}
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// SPIRV64-LABEL: @num_blocks_y( +// SPIRV64-NEXT: entry: +// SPIRV64-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.y() +// SPIRV64-NEXT: ret i32 [[TMP0]] +// +// AMDHSA: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDHSA-LABEL: @num_blocks_y( +// AMDHSA-NEXT: entry: +// AMDHSA-NEXT: [[TMP0:%.]] = tail call addrspace(4) i32 @llvm.gpu.num.blocks.y() +// AMDHSA-NEXT: ret i32 [[TMP0]] +// +// NVPTX: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// NVPTX-LABEL: @num_blocks_y( +// NVPTX-NEXT: entry: +// NVPTX-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.y() +// NVPTX-NEXT: ret i32 [[TMP0]] +// +// AMDGCN: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDGCN-LABEL: @num_blocks_y( +// AMDGCN-NEXT: entry: +// AMDGCN-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.y() +// AMDGCN-NEXT: ret i32 [[TMP0]] +// +uint32_t num_blocks_y(void) {
return __builtin_gpu_num_blocks_y(); +}
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// SPIRV64-LABEL: @num_blocks_z( +// SPIRV64-NEXT: entry: +// SPIRV64-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.z() +// SPIRV64-NEXT: ret i32 [[TMP0]] +// +// AMDHSA: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDHSA-LABEL: @num_blocks_z( +// AMDHSA-NEXT: entry: +// AMDHSA-NEXT: [[TMP0:%.]] = tail call addrspace(4) i32 @llvm.gpu.num.blocks.z() +// AMDHSA-NEXT: ret i32 [[TMP0]] +// +// NVPTX: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// NVPTX-LABEL: @num_blocks_z( +// NVPTX-NEXT: entry: +// NVPTX-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.z() +// NVPTX-NEXT: ret i32 [[TMP0]] +// +// AMDGCN: Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) +// AMDGCN-LABEL: @num_blocks_z( +// AMDGCN-NEXT: entry: +// AMDGCN-NEXT: [[TMP0:%.]] = tail call i32 @llvm.gpu.num.blocks.z() +// AMDGCN-NEXT: ret i32 [[TMP0]] +// +uint32_t num_blocks_z(void) {
return __builtin_gpu_num_blocks_z(); +}
+// SPIRV64: Function Attrs: mustprogress nofree norecurse nosync nounwi... [truncated]

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff 8aa835c2b5a9fd7d643257a4acd84e8fa99e391c e3d1c0d0f430a96e26c68e22ab53dc2fa4a14e47 --extensions h,cpp,c -- clang/test/CodeGen/amdgpu-grid-builtins.c clang/test/CodeGen/gpu_builtins.c llvm/include/llvm/Transforms/Scalar/LowerGPUIntrinsic.h llvm/lib/Transforms/Scalar/LowerGPUIntrinsic.cpp llvm/include/llvm/InitializePasses.h llvm/include/llvm/Transforms/Scalar.h llvm/lib/Passes/PassBuilder.cpp llvm/lib/Transforms/Scalar/Scalar.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Transforms/Scalar/LowerGPUIntrinsic.cpp b/llvm/lib/Transforms/Scalar/LowerGPUIntrinsic.cpp index 3c670340fa..b913d8c04e 100644 --- a/llvm/lib/Transforms/Scalar/LowerGPUIntrinsic.cpp +++ b/llvm/lib/Transforms/Scalar/LowerGPUIntrinsic.cpp @@ -221,8 +221,7 @@ Value grid_size(Module &M, IRBuilder<> &Builder, Intrinsic::ID, CallBase ) { } template -Value WGSize(Module &M, IRBuilder<> &Builder, Intrinsic::ID , - CallBase ) { +Value WGSize(Module &M, IRBuilder<> &Builder, Intrinsic::ID, CallBase ) { // Note: "__oclc_ABI_version" is supposed to be emitted and initialized by // clang during compilation of user code. @@ -262,20 +261,20 @@ Value *WGSize(Module &M, IRBuilder<> &Builder, Intrinsic::ID , return Builder.CreateZExt(LD, Type::getInt32Ty(Ctx)); }

template Value NumBlocks(Module &M, IRBuilder<> &Builder, Intrinsic::ID from, - CallBase CI) - { - // This is __builtin_amdgcn_grid_size / gpu_num_threads - // However we don't have a grid size intrinsic so can't expand to use that - // Open code it directly instead as the equivalent to - // Thus amdgpu::grid_size / amdgpu::WGSize - Value Numerator = grid_size(M, Builder, Intrinsic::not_intrinsic, nullptr); - Value Denominator = WGSize(M, Builder, Intrinsic::not_intrinsic, nullptr); - return Builder.CreateUDiv(Numerator, Denominator); - }

```
            CallBase *CI) {
```
// This is __builtin_amdgcn_grid_size / gpu_num_threads
// However we don't have a grid size intrinsic so can't expand to use that
// Open code it directly instead as the equivalent to
// Thus amdgpu::grid_size / amdgpu::WGSize
Value *Numerator =

 grid_size<Index>(M, Builder, Intrinsic::not_intrinsic, nullptr);

Value *Denominator =

 WGSize<Index>(M, Builder, Intrinsic::not_intrinsic, nullptr);

return Builder.CreateUDiv(Numerator, Denominator); +}
} // namespace amdgpu

namespace nvptx {

@@ -424,7 +423,7 @@ static const IntrinsicMap ls[] = { {gpu_thread_id_x, S, S}, {gpu_thread_id_y, S, S}, {gpu_thread_id_z, S, S},

// CGBuiltin maps builtin_amdgcn_workgroup_size onto gpu_num_threads
{gpu_num_threads_x, B<amdgpu::WGSize<0>>, S<nvvm_read_ptx_sreg_ntid_x>},
{gpu_num_threads_y, B<amdgpu::WGSize<1>>, S<nvvm_read_ptx_sreg_ntid_y>},

@@ -481,7 +480,7 @@ bool LowerGPUIntrinsic::runOnModule(Module &M) {

 for (auto *U : make_early_inc_range(Intr->users())) {
   if (auto *CI = dyn_cast<CallBase>(U)) {

   assert (CI->getCalledFunction() == Intr);

   assert(CI->getCalledFunction() == Intr);
   Changed |= maybeLowering(M, Builder, I.Generic, CI);
 }

}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is basically a serialization of the headers in gpuintrin.h as IR? I'll need to defer to the people more familiar with the backend / IR layer but I have a few comments.

@@ -4770,6 +4770,35 @@ def GetDeviceSideMangledName : LangBuiltin<"CUDA_LANG"> {
let Prototype = "char const*(...)";
}

// GPU intrinsics
class GPUBuiltin : Builtin {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a way to make these only emitted on 'gpu' compilations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not particularly. Intrinsics aren't required to lower to non-error states on all targets anyway. If someone wants to use these to write x64 code with implied simt semantics and deal with that later, cool - e.g. ipsc might like it.

@@ -0,0 +1,182 @@
//===-- spirvintrin.h - SPIRV intrinsic functions ------------------------===//

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header stuff is separate from introducing these builtins, they should be tested in isolation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would look a lot like #131164. The gpuintrin.h header is checked by the existing test scaffolding with spirv added, I added more for the builtins/intrinsics in this patch.


// Note: "__oclc_ABI_version" is supposed to be emitted and initialized by
// clang during compilation of user code.
StringRef Name = "__oclc_ABI_version";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate this thing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely specific to what we do, so it adds to my issues with claiming "genericity".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the lowering of the intrinsic to amdgpu, it's gonna be amdgpu specific. Prior to the lowering which can be at JIT time you've got llvm.gpu.num_threads_x or __builtin_gpu_num_threads_x.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really should be part of the target OS version in the triple.

But checking this here is a sign this isn't fundamentally addressing the issue

The intrinsic set matching the gpuintrin.h set is important context for why I chose these ones. That's why the header is included here. The patch would read better with the spirv stub in place first but it wouldn't change the semantics.

The idea is to translate the core gpuintrin functions into IR, then later expand them out to what would have been done in the front end if the target was assumed up front. A side effect is the dispatch to different architectures is done by an IR pass instead of the preprocessor so the gpuintrin.h headers could become significantly shorter. Other side effects are we have some target independent intrinsics that we could use elsewhere in the compiler.

I'm not convinced we should do this, as I have a bunch of concerns:

it's intrusive and duplicates work already done by libclc;
SPIRV already has intrinsics
it squats on some prime naming real estate
it is extremely chunky
not everything should end up in Clang, even though it is individually "convenient"

I am equally uneasy with the cavalier attitude towards differences between various GPU and GPU-like targets.

not everything should end up in Clang, even though it is individually "convenient"

We should at least have some builtins for those SPIR-V intrinsics you listed. I think right now we only have reflect.

not everything should end up in Clang, even though it is individually "convenient"

We should at least have some builtins for those SPIR-V intrinsics you listed. I think right now we only have reflect.

Probably, I think there's a sense that in general it's desirable to mirror all target intrinsics to Clang builtins by default, except extremely exotic ones.

not everything should end up in Clang, even though it is individually "convenient"

We should at least have some builtins for those SPIR-V intrinsics you listed. I think right now we only have reflect.

Probably, I think there's a sense that in general it's desirable to mirror all target intrinsics to Clang builtins by default, except extremely exotic ones.

At a high level, if those work, we could just use them in spirvintrin.h. Do you happen to know what the backend does with them? They seem to be like what's going on here but I don't know when the target specific magic gets baked in.

I'm not convinced we should do this, as I have a bunch of concerns:
* it's intrusive and duplicates work already done by [libclc](https://github.com/llvm/llvm-project/tree/main/libclc);

Also compiler-rt. Also gpuintrin.h. Also openmp's devicertl. All the language runtimes duplicate this work because it's necessary to program the GPUs. Or we put it in LLVM and they all get to use the common layer.

* [SPIRV already has intrinsics](https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/IR/IntrinsicsSPIRV.td)

Quoting from that file:
// The following intrinsic(s) are mirrored from IntrinsicsDirectX.td for HLSL support.

Also amdgpu has intrinsics. And so does nvptx. And so does opencl. Do we really want one intrinsic per language for a concept like thread_id?

* it squats on some prime naming real estate

I don't care about names. If llvm.gpu wants to be reserved for something else, llvm.simt or llvm.gpuintrin or whatever are absolutely fine by me.

The lowering is 500 lines written with no effort to be concise to cover both architectures. Twenty of so of tablegen. Almost all this patch is test code and the proof of concept in gpuintrin.

* not everything should end up in Clang, even though it is individually "convenient"

It's the LLVM IR I'm after here. The auto-generated clang builtin to expose the IR directly to language runtimes seems fine, but I can get by with a handwritten IR file and a C header if that turns out to be a blocker.

I am equally uneasy with the cavalier attitude towards differences between various GPU and GPU-like targets.

This abstraction is pre-existing. I've deliberately chosen the pre-existing one which is known to work well enough to drive libc, which is also just about enough to cover openmp. Where people don't want to abstract over the architecture they have a wide variety of existing choices.

If the name llvm.gpu is a stumbling block, how about llvm.offload? Will add JD to the reviewers

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I menti

@@ -150,6 +150,8 @@ defm int_amdgcn_workitem_id : AMDGPUReadPreloadRegisterIntrinsic_xyz;
defm int_amdgcn_workgroup_id : AMDGPUReadPreloadRegisterIntrinsic_xyz_named
<"__builtin_amdgcn_workgroup_id">;

defm int_amdgcn_grid_size : AMDGPUReadPreloadRegisterIntrinsic_xyz;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead code

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not dead, but that's unimportant - the patch could be reworked to not include it and I'll do that in the spirit of increasing consensus

Comment on lines +2867 to +2870

: DefaultAttrsIntrinsic<[ret_type],
[],
[NoUndef, IntrNoMem, IntrSpeculatable]>,
ClangBuiltin;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned on the thread I do not think this should be the first solution to reach for.

For example as-is this immediately breaks running AMDGPUAttributor, which will now incorrectly not detect uses of the underlying intrinsics hidden behind these wrappers. These wrappers need to have the same semantics as an unknown external call (as in, this should not use DefaultAddrsIntrinsic)

And need a test showing that AMDGPUAttributor infers none of the implicit argument attributes if there is a call to any of these

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't break amdgpu attributor. The attributor will correctly completely ignore these intrinsics, as it should do, unless at some point we decide the amdgpu attributor should start working in terms of target agnostic intrinsics at which point it'll work fine. We don't need an extra test showing that the amdgpu attributor works correctly when shown unknown intrinsics, because if it doesn't have that property already it's already severely broken.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will break the AMDGPU attributor.

We don't need an extra test showing that the amdgpu attributor works correctly when shown unknown intrinsics, because if it doesn't have that property already it's already severely broken.

Yes we do. You're creating something that isn't an unknown intrinsic. It becomes a known intrinsic, which is assumed does not call other intrinsics. e.g. https://godbolt.org/z/zjPdEbz8j. The current reasonable assumption is intrinsics are leaves that do not call other intrinsics. It may work as hoped if you strip out some set of attributes (probably need to lose nocallback at least), but we need this test to show it.

unless at some point we decide the amdgpu attributor should start working in terms of target agnostic intrinsics at which point it'll work fine.

We don't want this, particularly for the nontrivial cases that aren't just read argument register.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +231 to +235

if (!ABIVersionC) {
// In CGBuiltin, we'd have to create an extern variable to emit the load for
// Here, we can leave the intrinsic in place and it'll get lowered later
return nullptr;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be mandatory

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't. That's what the return branch means. No magic variable saying where the value is to be found, leave the intrinsic in place until that variable shows up later. If it never shows up, well, we haven't chosen a code object format and can't codegen.


// Note: "__oclc_ABI_version" is supposed to be emitted and initialized by
// clang during compilation of user code.
StringRef Name = "__oclc_ABI_version";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really should be part of the target OS version in the triple.

But checking this here is a sign this isn't fundamentally addressing the issue


for (auto *U : make_early_inc_range(Intr->users())) {
if (auto *CI = dyn_cast(U)) {
if (CI->getCalledFunction() == Intr)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be an assert

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have issues with __oclc_ABI_version. However I'm not trying to fix that inline here, only to provide the same lowering we use already. I do like the new feature that IR without that present would be meaningful without adding an extern reference into it (because the load would remain hidden in this intrinsic, which is why this doesn't inject an extern reference in CGBuiltin style).

Comment on lines +77 to +78

typedef Value (builder)(Module &M, IRBuilder<> &Builder, Intrinsic::ID from,
CallBase *CI);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using function_ref

Comment on lines +461 to +468

if (!TT.isAMDGPU() && !TT.isNVPTX()) {
return Changed;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual convention for ad-hoc IR lowering passes like this is to have some hook in TargetLowering to process the intrinsic. I'm not sure where you're planning on running this expansion, or if it's permissible to pull in codegen there

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backend should run it so that the JIT model expands it to meaningful code. However if the only place run is the backend, all the programming languages will refuse to use it for amdgpu or nvptx targets. So I want to also run it early in the clang pipeline where targetting a specific language, and have the backend run handle it for whatever spirv64-- was converted into and for the O0 or llvm passes disabled case.

@@ -0,0 +1,427 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-attributes
; RUN: opt -S -mtriple=amdgcn-- -passes=lower-gpu-intrinsic < %s \| FileCheck %s --check-prefix=AMDGCN

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not be testing amdgpu handling in test/CodeGen/SPIRV. Duplicate the test.

Also this is just an IR->IR test, should go in test/Transforms/LowerGPUIntrinsics

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love copy&paste the test since the input IR is identical between the two and they'll definitely drift out of sync but sure, I'll copy&paste the thing as requested. Dumping them in a new directory to slightly improve the chances that people editing one remember to edit the other.


define void @thread_suspend() {
; AMDGCN-LABEL: @thread_suspend(
; AMDGCN-NEXT: call void @llvm.amdgcn.s.sleep(i32 2)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 2

lowerFunction NVPTX;
};

using namespace Intrinsic;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using namespace Intrinsic;

The unqualified intrinsic names below are a bit confusing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to read through the RFC and the changes in the PR yet but I think llvm/test/CodeGen/SPIRV/ is not really the best place for this test as it is not testing the SPIR-V backend.

//
// Lower the llvm.gpu intrinsics to target specific code sequences.
// Can be called from clang if building for a specific GPU or from the backend
// as part of a SPIRV lowering pipeline. Initial pass can lower to amdgcn or

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this pass fits into the SPIR-V lowering pipeline. Is this confusing SPIR and SPIR-V? SPIR being the format based on LLVM IR (although on an old version of LLVM 3.4)? SPIR-V being a custom Khronos-defined format.

While SPIR-V easily maps to LLVM IR, it is far from being a standardized mapping and the actual translation is usually handled by either SPIRV-LLVM-Translator or the in-tree SPIR-V backend.

Could you please comment on how this pass and the changes would fit into the current SPIR-V code generation flow through the SPIR-V backend?

While I see the value in target-independent intrinsics for commonly-used operations (like thread local id/lane id/etc) and many would support the idea, I would not necessarily associate them with the SPIR-V format in any way, especially since the intrinsics would not have any meaning outside of the LLVM world.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all written in terms of the spirv64-unknown-unknown triple. The in tree backend writes that out as SPV which sort of punts the problem onward to someone else, or the spirv-translator converts the spirv--unknown-unknown LLVM IR into SPV. Then the spirv-translator or something else turns the SPV into something executable.

The spirv backend should stash these intrinsics in the SPV file like any other ones. It might need a patch to do so or we might get default handling for llvm prefixed intrinsics. That's the point really - we pass information about the SIMT computation through the SPV onward to whatever is dealing with that later.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to better understand your goal here with this patch and potential future patches. How is this written in terms of the spirv64 triple? What is specific about the intrinsics that they only "work" for SPIR-V triple or the other way around?

The SPIR-V triple in a module does describe characteristics about the module and obviously the architecture -- including a calling convention, vendor, address spaces, environment, etc. and most importantly indicates the target used -- in this case the in-tree SPIR-V backend. However, I do not see how the SPIR-V triple in case of this patch is relevant to the SPIR-V backend. If the goal is to define new intrinsics which could be handled by AMDGPU and NVPTX, then I don't think those should be associated with the SPIR-V triple in LLVM.

Is your goal to have SPIR-V backend implement these intrinsics (lower into proper spec defined SPIR-V builtin calls or primitive instructions)? Or would you see the backend just keep these as function calls? If the latter, then I am not sure this is the right use of llvm. intrinicis according to the guidelines? In theory, I would expect all llvm. intrinsics to be handled by the backends.

Having SPIR-V backend emit abstract function calls for intrinsics which could be lowered into the SPIR-V spec defined instructions or builtins is a very dangerous path and opposes the idea of the SPIR-V spec altogether.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the commit message, I want spirv to be usable as a target agnostic representation for gpu compute. https://discourse.llvm.org/t/rfc-spirv-ir-as-a-vendor-agnostic-gpu-representation/85115. Pretty much all it needs for that is a way to talk about simt intrinsics and a lowering to the vendor.

I think it is deeply silly to have lots of intrinsics named after languages which mean the same thing and want to fold them together into the llvm core. So far I'm not getting a lot of love for that but I concede that sweeping cleanup isn't strictly necessary, being technically much cleaner doesn't always means politically tractable.

The builtins and intrinsics introduced here are target agnostic. I think people would be much happier if I'd named them llvm.spirv or llvm.openmp but given they have nothing to do with spirv or openmp, aside from being useful for both, I'm arguing against that.

Spir64-unknown-unknown doesn't say anything about amdgpu or nvptx or intel. Nor should it. It really doesn't say much about the GPU architecture. That's why it's viable as a target to JIT into amdgpu or nvptx later. That's the goal of spirv64, right? Single shader IR that runs on arbitrary GPUs?

I don't particularly care how these intrinsics are represented in SPV, provided they can be raised back onto these llvm intrinsics in the translator. The name llvm is appropriate as that's where the semantics for these intrinsics is defined. I don't know what distinction you're drawing between abstract function calls and intrinsics, terminology is not consistent across compilers here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all I agree, that having a common interface between various frontends is a good idea. Yet as from what I see now - the intrinsics added in the PR are not common, but tuned only for AMDGPU and NVPTX (also intrinsics are not necessarily the only possible common interface). And just like @michalpaszkowski I don't understand, what place SPIR-V has for those intrinsics.

The spirv backend should stash these intrinsics in the SPV file like any other ones. It might need a patch to do so or we might get default handling for llvm prefixed intrinsics. That's the point really - we pass information about the SIMT computation through the SPV onward to whatever is dealing with that later.
I don't know what distinction you're drawing between abstract function calls and intrinsics, terminology is not consistent across compilers here.

Please correct me if I understood this part wrong. Is your proposal to store llvm.gpu intrinsic call in SPIR-V module as is if there is no appropriate SPIR-V instruction for it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're optionally lowered to amdgpu or to nvptx as they're the only two architectures in tree, or they come out the other end in IR to be handled later in the pipeline. I'd like lowering for intel too but that will need to be out of tree until intel comes in tree. The semantics are hopefully fairly general purpose but I expect they'll need iterative improvement as one learns more about other targets - the first draft is the basis set of llvm libc.

I don't really understand the spirv pipeline. The one I've been working with is the amdgpu / hip one, which passes llvm intrinsics through unchanged, and ends up creating amdgcn-- LLVM IR to feed back into the backend. My hope is that in the first instance the SPIR-V module will contain the llvm.gpu intrinsic call, and then sometime later it'll pick up a proper SPIRV name of some sort, but where conversion back to llvm with that intrinsic in it works losslessly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPIR-V in itself is an architecture (although a synthetic one) and the SPIR-V backend is what lowers LLVM IR to this architecture. There are proprietary SPIR-V consumers which are not LLVM-based. For this reason it might be unreasonable to expect those to accept llvm. calls and it would be good to plan early how the new llvm.gpu intrinsics would be handled in the SPIR-V backend case by case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like lowering for intel too but that will need to be out of tree until intel comes in tree.

To clarify: while me and Michal are working for Intel, our concerns are not about lowering to some Intel's internal stuff, but about lowering to SPIR-V, which is supported as intermediate representation for Vulkan, OpenCL etc by multiple vendors. I can't speak for others, but (judging by amount of PRs I've been looking in LLVM's SPIR-V backend) apart of Intel at least Google has interest in developing SPIR-V backend as a target in LLVM. So let me actually invite @sudonatalie and @Keenuts to the discussion.

which passes llvm intrinsics through unchanged, and ends up creating amdgcn-- LLVM IR to feed back into the backend

I assume it only works, when the frontend and the backend are built on top of the same LLVM version, right? Otherwise you would face issues like: a. intrinsics semantics can change depending on LLVM version; b. rules of intrinsics' name can change from version to version (for example LLVM's switch from typed to untyped pointers has affected the names).

So talking about SPIR-V toolchains, as it's hard to know ahead of time, who will consume the generated SPIR-V - we must not have compiler generated external symbols (like intrinsics) in the module as the backend might not be able to resolve them.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello!

First of all I agree, that having a common interface between various frontends is a good idea. Yet as from what I see now - the intrinsics added in the PR are not common, but tuned only for AMDGPU and NVPTX (also intrinsics are not necessarily the only possible common interface).

Agree with this. I think having shared intrinsic at the IR level llvm.gpu.num_threads() would be a nice thing.

Frontends would have to do some adjusting around it (maybe a language exposes a num_threads function which returns an array instead of a vector)
backends would need to know how to lower this to the respective target.

or they come out the other end in IR to be handled later in the pipeline.

That's the part I don't quite follow. I don't think those should be visible in the emitted SPIR-V. I believe those shared intrinsics should be lowered by the respective backends into the target specific output: for SPIR-V, that would be a builtin load/store or something like this. And if we don't support it yet, fail.

The benefit I'd see in shared intrinsics is not to have 1 frontend emits a opencl.thread_id, another hlsl.thread_id, and a last random.thread_id, and have each one implemented by a single backend.

This patch probably does too many things and it's implemented in a non-conventional fashion. I think that does a decent job of having something to upset most people that have looked at it. I note that reviewers are primarily complaining about disjoint aspects of the patch so a more optimistic reading could be that most of you love some of it and hate part of it, and that the links to the RFCs may have been too subtle.

The functionality I need here, independent of the implementation, is:

Represent GPU SIMT constructs in LLVM IR without up front specifying a target ISA

What I want beyond that is:

Be able to lower that to target specific just out of clang, so they're free to use on a specific target
Easy, clean ways to test the lowering (i.e. please let me do this in IR)

If I can't have cheap lowering the openmp crowd will insist on amdgpu and nvptx specific paths that duplicate the lowering that llvm already knows how to do, and so will the hip crowd, because expanding the intrinsics in the backend will induce overhead. Which will primarily stop me removing the target specific dispatch from the language implementations. Not on the critical path, just stops me simplifying things.

This patch is conceptually a few things.

My first step towards making spirv64-unknown-unknown usable across vendors
Target independent intrinsics for the read_first_lane, ballot etc primitives of the SIMT model
Intrinsics for the various grid model accessors
A few misc things (exit, suspend) which are kind of niche
Doing target specific lowering in an IR pass
Doing lowering for two different targets in the same IR pass

I want the first point. Compile code once then JIT it onto some arbitrary GPU architecture? Love it. Noone else seems keen but that's fine, I think some end users will like that functionality. It seems to have been some of the motivation behind spirv too.

Intrinsics for ballot are necessary in the same way that convergent needs to be represented in the IR. Presently the different targets have their own intrinsics for these, because they really should be intrinsics. That I think is absolutely uncontentious. It's absurd that we have multiple different intrinsics that mean the same thing for a common concept that we want to manipulate in IR.

The grid builtins are convenient for the opencl/cuda programming model. They don't matter much to IR manipulation. I want them here because the cost is tiny, most are directly mapped to a register read and the others to some trivial code sequence, and it frees one from the endless squabble over which special compiler-rt is the bestest.

The exit/suspend stuff is niche. Maybe I can't get those through review. They're here because I've gone with "what would it take to run libc on Intel GPUs, assuming Intel won't put target specific code in upstream". I can make them both nops under spirv for the time being.

I've done the lowering in IR because that's the right and proper thing to do, fully aware that most people disagree with me on that. We get trivial testing, composability, not writing the thing twice for SDag and GlobalISel, and the ability to lower right out of clang where the target is known early. So OpenMP that uses these intrinsics and targets nvptx64-- directly pays zero cost. Likewise amdgpu-- as the target will immediately turn into the target intrinsics and interact with the amdgpu attributor as if this was never here. The pass should be scheduled early after clang as an optimisation and run unconditionally in the backend. Some people might recognise this strategy from my variadic lowering pass.

I've lowered amdgpu and nvptx in the same IR pass, which is heresy, because I want to emphasise that these are not semantically clever or novel things, that can be rewritten to target specific things whenever is convenient. That seems to have been misread above. I also want it to be very obvious to Intel exactly where they need to add a column of function pointers to have this work on their backend. An extra target gets a column, an extra intrinsic gets a row, fill in all the grid spaces. I like dumb code that walks over tables.

However. This has not sailed cleanly through review, and I'm totally willing to compromise the pretty implementation to get the functionality of build once, JIT to any GPU. If I have to copy&paste files across directories and delete parts so be it. Likewise I'll cheerfully leave a load of conceptually redundant target dispatch cruft in libc and openmp if you guys won't let me lower these in an IR pass shortly after clang.

I've dropped the clang headers part from this patch, rewritten to elide the grid intrinsic and moved the test. Matt suggested llvm.simt as the prefix which I like very much more than llvm.gpu but the rename is probably going to take a moment.

Turns out I do need to update the amdgpu attributor pass to know about the new names which is upsetting but fine.

There appears to be reasonable consensus offline that actually we do really like having lots of intrinsics with different names and the same meaning. Beats me why but so be it. I think I'm going to change this patch to drop the "target agnostic" premise and name all the things llvm.spv instead of llvm.gpu, then either someone from spirv will like it and it'll ship or they'll hate it and I'll do something else.

Patch opposed internally. Will see if I can think of an alternative.

For posterity, the point of this line of attack was to remove the #ifdef nvptx else... or equivalent openmp variant dispatch which exists in roughly all gpu programming projects to paper over the spurious variation in names of intrinsics.

If the compiler does the dispatch, the application code gets to write builtin_thread_id_x instead of the preprocessor link time dispatch. Instead, we're going to be adding ifdef (SPIRV64) builtin_spirv_thread_id to any of those dispatch blocks that care to tolerate the extra target. See clang's gpuintrin.h for an example of what could be mostly deleted if this patch were acceptable.

Reviewers

arsenm arsenm left review comments