What is a Warp? | GPU Glossary

A warp is a group of threads that are scheduled together and execute in parallel. All threads in a warp are scheduled onto a single Streaming Multiprocessor (SM). A single SM typically executes multiple warps, at the very least all warps from the same Cooperative Thread Array, aka thread block.

Warps are the typical unit of execution on a GPU. In normal execution, all threads of a warp execute the same instruction in parallel: the so-called "Single-Instruction, Multiple Thread" (SIMT) model. When the threads in a warp split from one another to execute different instructions, also known as warp divergence, performance generally drops precipitously.
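As a minimal sketch of what divergence looks like in code (the kernel names and the even/odd split here are illustrative, not drawn from any particular codebase): when lanes of the same warp take different sides of a branch, the hardware runs both paths one after the other, masking off the inactive lanes each time.

```cpp
// Divergent: even and odd lanes of each warp take different paths,
// so the warp executes both branches serially with lanes masked off.
__global__ void divergent(float *out) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        out[i] = i * 2.0f;  // even lanes execute this...
    } else {
        out[i] = i * 3.0f;  // ...then odd lanes execute this
    }
}

// Uniform: the condition is identical for every lane in the warp,
// so the whole warp takes one path and there is no divergence.
__global__ void uniform(float *out) {
    int i = threadIdx.x;
    if (blockIdx.x % 2 == 0) {
        out[i] = i * 2.0f;
    } else {
        out[i] = i * 3.0f;
    }
}
```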

Warp size is technically a machine-dependent constant, but in practice (and elsewhere in this glossary) it is 32.
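Rather than hard-coding 32, the warp size can be read at runtime. A small sketch using the CUDA runtime API (querying device 0 is an arbitrary choice):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // query device 0
    printf("warp size: %d\n", prop.warpSize); // 32 on NVIDIA GPUs to date
    return 0;
}
```

Device code can likewise read the built-in `warpSize` variable instead of assuming 32.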

When a warp is issued an instruction, the results are generally not available within a single clock cycle, and so dependent instructions cannot be issued. While this is most obviously true for fetches from global memory, which generally go off-chip, it is also true for some arithmetic instructions (see the CUDA C++ Best Practices Guide for a table of results per clock cycle for specific instructions).
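To make the dependence concrete, here is an illustrative sketch (kernel and variable names are made up): in the first kernel each multiply consumes the previous result, so the next instruction cannot issue until the prior one's latency has elapsed, while the second kernel's operations are mutually independent and can be issued back-to-back.

```cpp
// Each step depends on the one before it: the warp cannot issue the
// next multiply until the previous result is available.
__global__ void dependent_chain(float *out, float x) {
    float a = x * x;
    a = a * x;
    a = a * x;
    out[threadIdx.x] = a;
}

// These operations have no mutual dependencies, so they can be issued
// without waiting on one another's results.
__global__ void independent_ops(float *out, float x) {
    float a = x * x;
    float b = x + x;
    float c = x - 1.0f;
    out[threadIdx.x] = a + b + c;
}
```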

A warp whose next instruction is delayed by missing operands is said to be stalled.

Instead of waiting for an instruction's results to return, when multiple warps are scheduled onto a single SM, the Warp Scheduler will select another warp to execute. This latency-hiding is how GPUs achieve high throughput and ensure work is always available for all of their cores during execution. For this reason, it is often beneficial to maximize the number of warps scheduled onto each SM, ensuring there is always an eligible warp for the SM to run. The fraction of cycles on which a warp was issued an instruction is known as the issue efficiency. The degree of concurrency in warp scheduling is known as occupancy.
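The CUDA runtime can report the occupancy a kernel can achieve at a given block size. A sketch, assuming a hypothetical `my_kernel` and a block size of 256 (both are illustrative choices, not from the source):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to query.
__global__ void my_kernel(float *out) {
    out[threadIdx.x + blockIdx.x * blockDim.x] = 1.0f;
}

int main() {
    int block_size = 256;
    int max_blocks_per_sm = 0;
    // Ask the runtime how many blocks of my_kernel fit on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int warps_per_block = block_size / prop.warpSize;
    int active_warps = max_blocks_per_sm * warps_per_block;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("occupancy: %d / %d warps per SM (%.0f%%)\n",
           active_warps, max_warps, 100.0 * active_warps / max_warps);
    return 0;
}
```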

Warps are not actually part of the CUDA programming model's thread hierarchy. Instead, they are an implementation detail of that model on NVIDIA GPUs. In that way, they are somewhat akin to cache lines in CPUs: a feature of the hardware that you don't directly control and don't need to consider for program correctness, but which is important for achieving maximum performance.

Warps are named in reference to weaving, "the first parallel thread technology", according to Lindholm et al., 2008. The equivalents of warps in other GPU programming models include subgroups in WebGPU, waves in DirectX, and simdgroups in Metal.