AMD RDNA™ Performance Guide

Optimizing a modern real-time renderer can be a somewhat daunting task. Explicit APIs hand you more control over how you craft your frame than ever before, allowing you to achieve higher frame rates, prettier pixels, and ultimately, better use of the hardware.

Our AMD RDNA™ Performance Guide, with updates for RDNA 3, will help guide you through the optimization process with a collection of tidbits, tips, and tricks which aim to support you in your performance quest.

Short on time?

Many of our performance suggestions on this page are available through Microsoft® PIX and Vulkan®’s Best Practice validation layer.

DirectX® 12

If you’re a DirectX® 12 developer, you’ll find many AMD-specific checks are already incorporated into the Microsoft® PIX performance tuning and debugging tool.

Vulkan®

If you’re a Vulkan® developer, from version 1.2.189 of the SDK onwards, you’ll find AMD-specific checks incorporated into the Best Practice validation layer.

Command buffers

Command buffers are the heart of the low-level graphics APIs. Most of the CPU time spent in DirectX® 12 (DX12) and Vulkan® will be spent recording draws into the command buffers. One of the biggest optimizations is that an application can now multi-thread command buffer recording.

With the previous generation of APIs, the amount of multi-threading that could be done was severely limited by what the driver could manage.

DirectX® 12
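As a minimal sketch (assuming a created device and direct queue, and a hypothetical RecordDrawsForRange() function that records one slice of the frame's draws), each worker thread records into its own command allocator and command list, and the results are submitted together in a single ExecuteCommandLists call:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>

using Microsoft::WRL::ComPtr;

void RecordFrame(ID3D12Device* device, ID3D12CommandQueue* queue, UINT threadCount)
{
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocators(threadCount);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(threadCount);
    std::vector<std::thread>                       workers;

    for (UINT i = 0; i < threadCount; ++i)
    {
        // One allocator/list pair per thread: command lists are single-threaded objects.
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&allocators[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocators[i].Get(), nullptr, IID_PPV_ARGS(&lists[i]));

        workers.emplace_back([i, &lists]
        {
            // RecordDrawsForRange(lists[i].Get(), i);  // hypothetical: record this thread's draws
            lists[i]->Close();
        });
    }

    for (auto& t : workers) { t.join(); }

    // Submit all recorded lists at once to keep submission overhead low.
    std::vector<ID3D12CommandList*> raw(threadCount);
    for (UINT i = 0; i < threadCount; ++i) { raw[i] = lists[i].Get(); }
    queue->ExecuteCommandLists(threadCount, raw.data());
}
```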

Vulkan®

Pipeline state objects (PSO)

All the smaller state structures were combined into a single state known as a Pipeline State Object. This allows the driver to know everything up front so that it can compile the shaders into the correct assembly. It removes the stutters that could occur in previous API generations when the driver had to recompile shaders at draw time because of a state change. It also makes state optimization easier, since there is no longer a need to track many small pieces of state.

DirectX® 12
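As a minimal sketch (assuming a created device, a root signature, compiled vertex and pixel shader blobs, an input element array, and the d3dx12.h helpers), everything the driver needs is described in one structure before any draw is issued:

```cpp
D3D12_GRAPHICS_PIPELINE_STATE_DESC psoDesc = {};
psoDesc.pRootSignature        = rootSignature.Get();
psoDesc.VS                    = { vsBlob->GetBufferPointer(), vsBlob->GetBufferSize() };
psoDesc.PS                    = { psBlob->GetBufferPointer(), psBlob->GetBufferSize() };
psoDesc.BlendState            = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
psoDesc.RasterizerState       = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);
psoDesc.DepthStencilState     = CD3DX12_DEPTH_STENCIL_DESC(D3D12_DEFAULT);
psoDesc.InputLayout           = { inputElements, _countof(inputElements) };
psoDesc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
psoDesc.NumRenderTargets      = 1;
psoDesc.RTVFormats[0]         = DXGI_FORMAT_R8G8B8A8_UNORM;
psoDesc.DSVFormat             = DXGI_FORMAT_D32_FLOAT;
psoDesc.SampleMask            = UINT_MAX;
psoDesc.SampleDesc.Count      = 1;

// The driver can compile the final ISA here, ahead of time, instead of at draw time.
ComPtr<ID3D12PipelineState> pso;
device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pso));
```

Creating PSOs up front, for example during load on worker threads, keeps this compilation cost out of the frame.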

Vulkan®

Barriers

Barriers are how dependencies between operations are conveyed to the API and driver. They open up a whole new world of possibilities by letting the application decide when the GPU can overlap work. However, they are also an easy way to slow rendering down by issuing too many of them, and missing or incorrect resource transitions can cause corruption. The validation layers can often help identify missing barriers.

DirectX® 12
When possible, transition resources to either the non-pixel-shader-resource state (D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE) or the pixel-shader-resource state (D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE) individually, rather than combining both states, as in the sketch below.
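As a minimal sketch (assuming `cmdList` and `texture` already exist), a transition to the single state the next pass actually needs looks like this:

```cpp
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Transition.pResource   = texture;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
// The next pass only samples this texture from a pixel shader, so transition to
// that single state rather than OR-ing pixel and non-pixel shader resource states.
barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
cmdList->ResourceBarrier(1, &barrier);
```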

Memory

Explicit GPU memory management is exposed in both Vulkan® and DirectX® 12. While this opens up many new optimization opportunities, writing an efficient GPU memory manager is hard. That's why we created small open source libraries for that purpose for both Vulkan® and DirectX® 12.
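As a minimal sketch of what such a library gives you, here is a buffer allocation through the Vulkan® Memory Allocator (VMA), assuming an existing `instance`, `physicalDevice`, and `device`:

```cpp
#include "vk_mem_alloc.h"

// Create the allocator once at startup; it manages large device-memory blocks internally.
VmaAllocatorCreateInfo allocatorInfo = {};
allocatorInfo.instance       = instance;
allocatorInfo.physicalDevice = physicalDevice;
allocatorInfo.device         = device;

VmaAllocator allocator;
vmaCreateAllocator(&allocatorInfo, &allocator);

// Describe the buffer as usual; VMA picks a suitable memory type and sub-allocates.
VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufferInfo.size  = 64 * 1024;
bufferInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;

VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;

VkBuffer      buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &bufferInfo, &allocCreateInfo, &buffer, &allocation, nullptr);
```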

Related tool

Radeon™ Memory Visualizer (RMV) instruments every level of our Radeon™ driver stack and can show the full state of your application's memory allocations at any point during its lifetime, giving you a deep understanding of how your application uses memory for graphics resources.

DirectX® 12

Vulkan®

Resources

A lot of hardware optimizations depend on a resource being used in a certain way. The driver uses the provided data at resource creation time to determine what optimizations can be enabled. Thus, it is crucial that resource creation is handled with care to profit from as many optimizations as possible.
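As a minimal sketch in DirectX® 12, describe only the usage the resource will really have; requesting extra capabilities "just in case" (for example D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS on a texture that is never written as a UAV) can prevent the driver from enabling some optimizations:

```cpp
D3D12_RESOURCE_DESC desc = {};
desc.Dimension        = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
desc.Width            = 1920;
desc.Height           = 1080;
desc.DepthOrArraySize = 1;
desc.MipLevels        = 1;
desc.Format           = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.SampleDesc.Count = 1;
// This texture is only ever a render target and a shader resource, so do not
// also request UAV or simultaneous-access flags it will never use.
desc.Flags            = D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET;
```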

Descriptors

Descriptors are used by shaders to address resources. It is up to the application to provide a description of where the resources will be laid out during PSO creation. This allows the application to optimize the layout of descriptors based on the knowledge of what resources will be accessed the most.

DirectX® 12
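As a minimal sketch (using the d3dx12.h helpers, with an assumed `device`), small per-draw constants that change most frequently sit directly in the root signature, while the bulk of resources are reached through a descriptor table:

```cpp
CD3DX12_DESCRIPTOR_RANGE1 srvRange;
srvRange.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 8, 0);  // t0-t7

CD3DX12_ROOT_PARAMETER1 params[2];
params[0].InitAsConstants(4, 0);  // b0: small, frequently changing per-draw constants
params[1].InitAsDescriptorTable(1, &srvRange, D3D12_SHADER_VISIBILITY_PIXEL);

CD3DX12_VERSIONED_ROOT_SIGNATURE_DESC rootDesc;
rootDesc.Init_1_1(_countof(params), params, 0, nullptr,
                  D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

ComPtr<ID3DBlob> blob, error;
D3DX12SerializeVersionedRootSignature(&rootDesc, D3D_ROOT_SIGNATURE_VERSION_1_1, &blob, &error);

ComPtr<ID3D12RootSignature> rootSignature;
device->CreateRootSignature(0, blob->GetBufferPointer(), blob->GetBufferSize(),
                            IID_PPV_ARGS(&rootSignature));
```

Keeping the root signature small and grouping descriptors that are accessed together helps keep descriptor access cheap.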

Vulkan®

Synchronization

One large new addition to the graphics programmer's toolbox is control over synchronization between the GPU and CPU; synchronization is no longer hidden behind the API. Keep command buffer submissions to a minimum, as each submission requires a call into kernel mode as well as some implicit barriers on the GPU.
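As a minimal sketch (with the command lists, fence, event, and frame-pacing values assumed to exist, and hypothetical shadow/gbuffer/lighting passes), record the whole frame, submit it in one batch, and signal a single fence value per frame:

```cpp
// One submission for the frame instead of one per pass.
ID3D12CommandList* lists[] = { shadowPass, gbufferPass, lightingPass };
queue->ExecuteCommandLists(_countof(lists), lists);

// One fence signal per frame is enough for CPU/GPU pacing.
queue->Signal(frameFence, ++fenceValue);

// Only block the CPU when it is about to reuse resources the GPU may still be
// using, e.g. the command allocators recorded waitValue frames ago.
if (frameFence->GetCompletedValue() < waitValue)
{
    frameFence->SetEventOnCompletion(waitValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}
```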

Presenting

When overlapping frame rendering with async compute, present from a different queue to reduce the chances of stalling.

DirectX® 12

Vulkan®

Clears

AMD hardware has the ability to do a fast clear. It cannot be overstated how much faster these clears are when compared to filling the full target. Fast clears have a few requirements to get the most out of them.

Vulkan®
Prefer using LOAD_OP_CLEAR and vkCmdClearAttachments over vkCmdClearColorImage and vkCmdClearDepthStencilImage.
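As a minimal sketch, request the clear through the render pass load op so the driver can use a fast clear, and supply the clear value when the pass begins (the render pass and framebuffer are assumed to have been created from this attachment description):

```cpp
VkAttachmentDescription colorAttachment = {};
colorAttachment.format         = VK_FORMAT_R8G8B8A8_UNORM;
colorAttachment.samples        = VK_SAMPLE_COUNT_1_BIT;
colorAttachment.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;   // cleared as part of the pass
colorAttachment.storeOp        = VK_ATTACHMENT_STORE_OP_STORE;
colorAttachment.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
colorAttachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
colorAttachment.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
colorAttachment.finalLayout    = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

// The clear value itself is supplied when the render pass begins.
VkClearValue clearValue = {};
clearValue.color = { { 0.0f, 0.0f, 0.0f, 1.0f } };

VkRenderPassBeginInfo beginInfo = { VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO };
beginInfo.renderPass      = renderPass;
beginInfo.framebuffer     = framebuffer;
beginInfo.renderArea      = { { 0, 0 }, { width, height } };
beginInfo.clearValueCount = 1;
beginInfo.pClearValues    = &clearValue;
vkCmdBeginRenderPass(cmdBuffer, &beginInfo, VK_SUBPASS_CONTENTS_INLINE);
```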

Async compute

GCN and RDNA hardware can execute compute shaders submitted through an additional queue that is not blocked by the fixed-function graphics hardware. This allows filling the GPU with work while the graphics queue is bottlenecked on the front end of the graphics pipeline.
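As a minimal sketch in DirectX® 12 (with the compute command list, fence, and fence value assumed to exist), work submitted to a separate compute queue can overlap with the graphics queue, and a fence is used only where the graphics work actually consumes the results:

```cpp
D3D12_COMMAND_QUEUE_DESC computeDesc = {};
computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

ComPtr<ID3D12CommandQueue> computeQueue;
device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

// Submit the async compute workload on its own queue.
ID3D12CommandList* computeLists[] = { asyncComputeList };
computeQueue->ExecuteCommandLists(_countof(computeLists), computeLists);
computeQueue->Signal(computeFence, ++computeFenceValue);

// The graphics queue waits only at the point where it needs the compute results.
graphicsQueue->Wait(computeFence, computeFenceValue);
```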

Copying

In addition to a compute queue, GCN and RDNA also have dedicated copy queues. These map to special DMA engines on the GPU that were designed to maximize transfers across the PCIe® bus. Vulkan® and DX12 give direct access to this hardware through the copy/transfer queues.

DirectX® 12
Resources in UPLOAD memory are accessible to shaders. Consider using this memory directly instead of copying, if each byte is accessed at most once by the GPU with a high degree of spatial locality. This is usually faster than copying and reading and reduces memory usage and synchronization overhead. These savings are only relevant if data is accessed immediately after being written. Remember to profile.
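As a minimal sketch (with the staging and destination buffers, the COPY command list, fence, and sizes assumed to exist), an upload recorded on the copy queue keeps the PCIe® transfer off the graphics queue:

```cpp
D3D12_COMMAND_QUEUE_DESC copyDesc = {};
copyDesc.Type = D3D12_COMMAND_LIST_TYPE_COPY;

ComPtr<ID3D12CommandQueue> copyQueue;
device->CreateCommandQueue(&copyDesc, IID_PPV_ARGS(&copyQueue));

// Copy from a staging buffer in an UPLOAD heap to the final resource in a DEFAULT heap.
copyList->CopyBufferRegion(defaultHeapBuffer, 0, uploadHeapBuffer, 0, byteSize);
copyList->Close();

ID3D12CommandList* lists[] = { copyList };
copyQueue->ExecuteCommandLists(_countof(lists), lists);

// The graphics queue waits on the copy fence before the data is first used.
copyQueue->Signal(copyFence, ++copyFenceValue);
graphicsQueue->Wait(copyFence, copyFenceValue);
```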

ExecuteIndirect

ExecuteIndirect is a DirectX® 12 feature that allows generating work from the GPU without reading back to the CPU.
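As a minimal sketch (with `argumentBuffer` and `countBuffer` assumed to be filled by an earlier compute pass), a command signature describes the layout of one indirect draw, and ExecuteIndirect consumes up to maxDrawCount of them:

```cpp
D3D12_INDIRECT_ARGUMENT_DESC arg = {};
arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW;

D3D12_COMMAND_SIGNATURE_DESC sigDesc = {};
sigDesc.ByteStride       = sizeof(D3D12_DRAW_ARGUMENTS);
sigDesc.NumArgumentDescs = 1;
sigDesc.pArgumentDescs   = &arg;

ComPtr<ID3D12CommandSignature> commandSignature;
// No root signature is needed when the commands do not change root arguments.
device->CreateCommandSignature(&sigDesc, nullptr, IID_PPV_ARGS(&commandSignature));

// argumentBuffer holds an array of D3D12_DRAW_ARGUMENTS written on the GPU;
// countBuffer holds the number of draws actually generated.
cmdList->ExecuteIndirect(commandSignature.Get(), maxDrawCount,
                         argumentBuffer, 0, countBuffer, 0);
```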

Shaders

General

Compute shaders

Vertex shaders

Pixel shaders

Sampler feedback

16-bit math

Variable rate shading

Ray tracing

Radeon™ GPU Profiler (RGP) gives you unprecedented, in-depth access to a GPU. Easily analyze graphics, async compute usage, event timing, pipeline stalls, barriers, bottlenecks, and other performance inefficiencies.

Radeon™ Raytracing Analyzer (RRA) is a tool which allows you to investigate the performance of your raytracing applications and highlight potential bottlenecks.

Improving raytracing performance with the Radeon™ Raytracing Analyzer (RRA)

Optimizing the raytracing pipeline can be difficult. Discover how to spot and diagnose common RT pitfalls with RRA, and how to fix them!

DirectX® 12
DXR 1.0: avoid recursion by setting the maximum trace recursion depth to 1.
DXR 1.1 lets you call TraceRay() from any shader stage. The best performance comes from using it in compute shaders, on a compute queue.
Always have just one active RayQuery object in scope in your shader at any time.
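As a minimal sketch of the first tip, the recursion limit is part of the raytracing pipeline config subobject; the rest of the state object (DXIL libraries, hit groups, shader config) is assumed to be added alongside it:

```cpp
// Keep MaxTraceRecursionDepth at 1: any bounces are handled iteratively in the
// shaders rather than by recursive TraceRay() calls.
D3D12_RAYTRACING_PIPELINE_CONFIG pipelineConfig = {};
pipelineConfig.MaxTraceRecursionDepth = 1;

D3D12_STATE_SUBOBJECT configSubobject = {};
configSubobject.Type  = D3D12_STATE_SUBOBJECT_TYPE_RAYTRACING_PIPELINE_CONFIG;
configSubobject.pDesc = &pipelineConfig;
// This subobject goes into the D3D12_STATE_OBJECT_DESC passed to CreateStateObject().
```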

Debugging

While debugging is not optimization, it is still worth mentioning a few debugging tips that will come in handy when optimizing rendering.

DirectX® 12
RGP can pick up PIX3 markers via the pix3.h header included with RGP.
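As a minimal sketch (assuming the WinPixEventRuntime is linked and USE_PIX is defined), marking up a region of a command list makes it appear by name in both PIX and RGP captures:

```cpp
#include <pix3.h>   // the pix3.h shipped with RGP / WinPixEventRuntime

PIXBeginEvent(cmdList, PIX_COLOR(0, 255, 0), "Shadow pass");
// ... record the shadow pass draws ...
PIXEndEvent(cmdList);
```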