Reading Between The Threads: Shader Intrinsics (original) (raw)

When writing compute shaders, it’s often necessary to communicate values between threads. This is typically done through shared memory. Kepler GPUs introduced shuffle intrinsics, which enable threads of a warp to directly read each other’s registers, avoiding memory access and synchronization. Shared memory is relatively fast but instructions that operate without using memory of any kind are significantly faster still.

This post discusses those warp shuffle and warp vote intrinsics and how you can take advantage of them in your DirectX, OpenGL, and Vulkan applications, in addition to CUDA. We also provide the ShuffleIntrinsicsVk sample, which illustrates basic use cases of those intrinsics.

The intrinsics can be further divided into the following categories:

The vote and shuffle intrinsics (but not fragment quadswizzle) are available not just in compute shaders but to all graphics shaders!

How does shuffle help?

There are advantages to using warp shuffle and warp vote intrinsics instead of shared memory:

Where can shuffle help?

“During the port of the EGO engine to the next-gen console, we discovered that warp/wave-level operations enabled substantial optimizations to our light culling system. We were excited to learn that NVIDIA offered production-quality, ready-to-use HLSL extensions to access the same functionality on GeForce GPUs. We were able to exploit the same warp vote and lane access functionality as we had done on the console, yielding wins of up to 1ms at 1080p on an NVIDIA GTX 980. We continue to find new optimizations to exploit these intrinsics,” said Tom Hammersley, principal graphics programmer at Codemasters Birmingham.

There are quite a few algorithms (or building blocks) that use shared memory and could benefit from using shuffle intrinsics:

As always, it’s advisable to profile and measure as you are optimizing your shaders!

Threads, warps, and streaming multiprocessors

NVIDIA GPUS, such as those from the NVIDIA Pascal generation, are composed of different configurations of graphics processing clusters (GPCs), streaming multiprocessors (SMs), and memory controllers. Threads from compute and graphics shaders are organized in groups called warps that execute in lock-step.

On current hardware, a warp has a width of 32 threads. In case of future changes, it’s a good idea to use the following intrinsics to determine the number of threads within a warp as well as the thread index (or lane) within the current warp:

HLSL: NV_WARP_SIZE NvGetLaneId() GLSL: gl_WarpSizeNV [or gl_SubGroupSizeARB] gl_ThreadInWarpNV [or gl_SubGroupInvocationARB]

The ShuffleIntrinsicsVk sample contains a rendermode that maps the lane ID within a warp to a color (Figure 1).

Figure 1. Lane ID warp threads mapped to colors

Warp vote: Cross-warp predicates

The warp vote functions enable a shader to evaluate a predicate for each thread and then broadcast the result to all threads within a warp. The all/any variants combine the results of all threads into a single Boolean value that then gets broadcast to all threads within a thread group. The ballot variant provides the individual predicates of each thread to all threads within the warp in the form of a 32-bit mask, where each bit corresponds to the predicate of the respective thread.

HLSL: uint NvAny(int predicate) uint NvAll(int predicate) uint NvBallot(int predicate) GLSL: bool anyThreadNV(bool predicate) [or bool anyInvocationARB(bool predicate)] bool allThreadsNV(bool predicate) [or bool allInvocationsARB(bool predicate)] bool allThreadsEqualNV(bool predicate) [or bool allInvocationsEqualARB(bool predicate)] uint ballotThreadNV(bool predicate) [or uint64_t ballotARB(bool predicate)]

The ShuffleIntrinsicsVk code sample demonstrates how the ballot intrinsic can be used to color code the first and last thread of a warp:

uint activeThreads = ballotThreadNV(true);

uint firstLaneId = findLSB(activeThreads); uint lastLaneId = findMSB(activeThreads);

if (firstLaneId == gl_ThreadInWarpNV) { oFrag = vec4(1, 1, 0, 0); } else if (lastLaneId == gl_ThreadInWarpNV) { oFrag = vec4(1, 0, 1, 0); } else { oFrag = color; }

Image of the lane ID mapping triangle with RGB points.

Figure 2. Lane ID warp threads with cross-warp predicates

Warp shuffle: Cross-warp data exchange

Warp shuffle functions enable active threads within a thread group to exchange data using four different modes (indexed, up, down, xor).

Diagram shows a row of threads in each mode: shuffle (indexed any-to-any); shuffle up (shift right to nth neighbor); shuffle down (shift left to nth neighbor); and shuffle xor (butterfly XOR exchange).

Figure 3. Warp shuffle function modes

They all load a value from the current thread, which can be different per thread. They return a value read from another thread, whose index can be specified using various methods, depending on the flavor of the shuffle.

The subsequent discussion of the individual shuffle functions makes use of the following terms:

When a shuffle function attempts to access a value from an out-of-range thread, it returns the value of the current thread and sets threadIdValid to false (if provided as an argument). This is typically not a problem, for example when computing a min/max reduction across the warp but it might need special consideration when computing a sum across the warp.

Indexed any-to-any shuffle

This intrinsic enables a thread to read the value from any thread within the thread group.

HLSL: int NvShfl(int data, uint index, int width = NV_WARP_SIZE) GLSL: int shuffleNV(int data, uint index, uint width, [out bool threadIdValid])

Here, index directly specifies the source thread or lane and must be in the range of 0 to 31.

Shuffle up and shuffle down

The shuffle up and shuffle down intrinsics differ from the indexed shuffle, as they determine the index of the source thread by adding or subtracting an offset to the current thread index within the warp.

HLSL: int NvShflUp(int data, uint delta, int width = NV_WARP_SIZE) int NvShflDown(int data, uint delta, int width = NV_WARP_SIZE)

GLSL: int shuffleUpNV(int data, uint delta, uint width, [out bool threadIdValid]) int shuffleDownNV(int data, uint delta, uint width, [out bool threadIdValid])

Here, the delta argument is the offset that gets subtracted (shuffleUp) or added (shuffleDown) to the current thread ID to get the source thread ID. This has the effect of shifting the segment up or down by delta threads.

Butterfly or XOR shuffle

The butterfly (or XOR) shuffle does a bitwise xor between the lane mask and the current thread ID to get the source thread ID.

HLSL: int NvShflXor(int data, uint laneMask, int width = NV_WARP_SIZE)

GLSL: int shuffleXorNV(int data, uint laneMask, uint width, [out bool threadIdValid])

The laneMask parameter specifies a value that gets “xor”-ed with the current thread ID to get the source thread ID.

This can be used to perform a reduction across the warp as shown in the ShuffleIntrinsicsVK sample which computes the maximum of a color value across the warp.

vec4 warpMax = color; warpMax = max(warpMax, shuffleXorNV(warpMax, 16,32)); warpMax = max(warpMax, shuffleXorNV(warpMax, 8, 32)); warpMax = max(warpMax, shuffleXorNV(warpMax, 4, 32)); warpMax = max(warpMax, shuffleXorNV(warpMax, 2, 32)); warpMax = max(warpMax, shuffleXorNV(warpMax, 1, 32)); oFrag = warpMax;

Fragment QuadSwizzle: Data exchange and arithmetic

There are a few commands in the HLSL and GLSL standard libraries that rely on data from neighboring fragments, for example, computing screen space derivatives or computing the LOD for texture lookups. For those, the hardware groups fragment shader invocations into 2×2 grids (quads) and then use specialized variants of cross-warp data exchange as building blocks for the standard library functions.

The quadSwizzle intrinsics (currently GLSL only) expose those building blocks to application developers. Six of those functions enable fragments within a quad to exchange data.

Replicate the value of a single thread across the quad:

float quadSwizzle0NV(float swizzledValue, [float unswizzledValue]) float quadSwizzle1NV(float swizzledValue, [float unswizzledValue]) float quadSwizzle2NV(float swizzledValue, [float unswizzledValue]) float quadSwizzle3NV(float swizzledValue, [float unswizzledValue])

Data exchange across threads corresponding to pixel neighbors:

float quadSwizzleXNV(float swizzledValue, [float unswizzledValue]) float quadSwizzleYNV(float swizzledValue, [float unswizzledValue])

All those functions read a floating point operand swizzledValue, which can come from any fragment in the quad. Another optional floating point operand unswizzledValue, which comes from the current fragment, can be added to swizzledValue. The only difference between all these quadSwizzle functions is the location where they get the swizzledValue operand within the 2×2 pixel quad.

If any thread in a 2×2 pixel quad is inactive (that is, gl_HelperThreadNV or gl_HelperInvocation are false), then the quad is divergent. In this case, quadSwizzle*NV returns 0 for all fragments in the quad.

Hardware and API support

All intrinsics discussed in this post are available on the NVIDIA Kepler, NVIDIA Maxwell, and NVIDIA Pascal GPUs, across NVIDIA Quadro and NVIDIA GeForce graphics cards as well as on NVIDIA Tegra K1 and NVIDIA Tegra X1 mobile GPUs.

HLSL for DirectX 11 and DirectX 12

NVIDIA provides a mechanism for using the intrinsics from HLSL in DirectX 11 and DirectX 12. For more information about how to access the intrinsics using the NVIDIA NVAPI library, see Unlocking GPU Intrinsics in HLSL.

GLSL for OpenGL and Vulkan

NVIDIA drivers expose the warp shuffle and warp vote intrinsics as a series of OpenGL GLSL extensions:

This is in addition to the cross-vendor Khronos OpenGL ARB extensions:

Most differences between those are minor (subgroup instead of warp, invocation instead of thread). What is notable, however, is that the ARB extensions support implementations with maximum warp widths of 64, whereas the NVIDIA extensions assume a maximum warp width of 32 threads. This is mostly transparent to a shader developer, except that ballotARB returns the bit mask as a 64-bit integer, unlike ballotThreadNV, which returns the bitmask as a 32-bit integer.

For reference, here is how to roughly implement the cross-vendor intrinsics in terms of the native hardware functionality:

uint64_t ballotARB(bool value) { return uint64_t(ballotThreadNV(value)); }

float readInvocationARB(float data, uint index) { return shuffleNV(data, index, 32); }

float readFirstInvocationARB(data) { // Do ballotThreadNV(true) to get bitmask for active threads // Then find lowest set bit indicating first active thread // Use then as index for shuffleNV return shuffleNV(data, findLSB(ballotThreadNV(true)), 32); }

The GLSL extensions discussed above are already supported by NVIDIA Vulkan drivers through VK_NV_glsl_shader, which makes them available for experimentation today!

In parallel, we are also working with the respective Khronos working groups to find the best way to bring cross-vendor shader intrinsics already standardized in OpenGL over to Vulkan and SPIR-V. We are working on Vulkan and SPIR-V extensions to expose our native intrinsics, but we prioritized the cross-vendor functionality higher, especially since there is a notable overlap in functionality.

Intrinsics cheat sheet

GLSL provides additional overloads for the shuffle functions that operate on scalar and vector flavors of float, int, unsigned int, and bool. In HLSL, those can be implemented easily using asuint and asfloat and multiple calls to the shuffle functions.

Description HLSL GLSL
indexed shuffle NvShfl shuffleNVreadInvocationARBreadFirstInvocationARB
shuffle up NvShflUp shuffleUpNV
shuffle down NvShflDown shuffleDownNV
butterfly shuffle NvShflXor shuffleXorNV

Table 1. Warp shuffle functions

Description HLSL GLSL
predicate true for all threads NvAll anyThreadNVanyInvocationARB
predicates equal for all threads allThreadsEqualNVallInvocationsEqualARB
broadcast predicates to thread group NvBallot ballotThreadNVballotARB

Table 2. Warp vote functions

Description HLSL GLSL
horizontal quadSwizzleXNV
vertical quadSwizzleYNV
arbitrary quadSwizzle0NVquadSwizzle1NVquadSwizzle2NVquadSwizzle3NV
helper thread(fragment shader) gl_HelperThreadNVgl_HelperInvocation

Table 3. Quadsizzle functions

Description HLSL GLSL
warp size NV_WARP_SIZE gl_WarpSizeNVgl_SubGroupSizeARB
thread index(lane ID) NvGetLaneId gl_ThreadInWarpNVgl_SubGroupInvocationARB
Bit masks derived from lane ID gl_Thread[OP]EqMaskNVgl_SubGroup[OP]MaskARBWhere OP = {Eq, Ge, Gt, Le, Lt }

Table 4. Thread index functions

Description HLSL GLSL
Index of current warp gl_WarpIDNV
Number of warps per SM gl_WarpsPerSMNV
Index of current SM gl_SMIDNV
Number of SMs gl_SMCountNV

Table 5. Miscellaneous functions

Sample code and references

The ShuffleIntrinsicsVK sample is available in the /NVIDIAGameWorks/GraphicsSamples GitHub repo. It renders a triangle and uses various intrinsics in the fragment shader to show various use cases. For more information, see Sample Setup Guide.

For more information, see the following resources: