What is Parallel Thread Execution? | GPU Glossary

Parallel Thread eXecution (PTX) is an intermediate representation (IR) for code that will run on a parallel processor (almost always an NVIDIA GPU). It is one of the formats output by nvcc, the NVIDIA CUDA Compiler Driver. It is pronounced "pee-tecks" by many NVIDIA engineers and "pee-tee-ecks" by everyone else.

NVIDIA documentation refers to PTX as both a "virtual machine" and an "instruction set architecture".

From the programmer's perspective, PTX is an instruction set for programming against a virtual machine model. Programmers or compilers producing PTX can be confident their program will run with the same semantics on many distinct physical machines, including machines that do not yet exist. In this way, it is similar to CPU instruction set architectures like x86_64, aarch64, or SPARC.

Unlike those ISAs, PTX is very much an intermediate representation, like LLVM-IR. The PTX components of a CUDA binary will be just-in-time (JIT) compiled by the host CUDA Drivers into device-specific SASS for execution.

In the case of NVIDIA GPUs, PTX is forward-compatible: GPUs with a matching or higher compute capability version will be able to run the program, thanks to this mechanism of JIT compilation. In this way, PTX is a "narrow waist" that separates the worlds of hardware and software.
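The forward-compatibility rule above amounts to an ordered comparison of version pairs. The following is a minimal host-side C++ sketch of that rule only, not of anything the CUDA Driver actually exposes: the `ComputeCapability` struct and the `ptx_runs_on` function are hypothetical names introduced for illustration, and the sketch ignores any exceptions to the rule (such as architecture-specific targets).

```cpp
#include <tuple>

// Hypothetical type: a compute capability such as 8.6 is stored as {8, 6}.
struct ComputeCapability {
    int major_version;
    int minor_version;
};

// Hypothetical helper: returns true when a device's compute capability
// matches or exceeds the one the PTX was generated for, which is the
// condition the text gives for the PTX being JIT-compilable on that device.
bool ptx_runs_on(const ComputeCapability& ptx_target,
                 const ComputeCapability& device) {
    // std::tie gives a lexicographic comparison: major first, then minor.
    return std::tie(device.major_version, device.minor_version) >=
           std::tie(ptx_target.major_version, ptx_target.minor_version);
}
```

Under this sketch, PTX targeting 7.0 is accepted by a 9.0 device, while PTX targeting 9.0 is rejected by an 8.6 device.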

Some exemplary PTX:

ptx

.reg .f32 %f<7>;

a compiler directive for the PTX-to-SASS compiler indicating that this kernel consumes seven 32-bit floating point registers. Registers are dynamically allocated to groups of threads (warps) from the SM's register file.

ptx

fma.rn.f32 %f5, %f4, %f3, 0f3FC00000;

apply a fused multiply-add (fma) operation to multiply the contents of registers f3 and f4 and add the constant 0f3FC00000 (the hexadecimal bit pattern of the 32-bit float 1.5), storing the result in f5. All numbers are in 32-bit floating point representation. The rn suffix for the FMA operation sets the floating point rounding mode to IEEE 754 "round to nearest even" (the IEEE 754 default).
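To make the `0f` literal and the fused operation more concrete, here is a rough host-side analogue in C++. It is only a sketch: `decode_ptx_f32` and `fma_rn_f32` are hypothetical helper names, and the code assumes the host uses IEEE 754 single precision for `float` and the default round-to-nearest-even rounding mode.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: decode a PTX single-precision hex literal such as
// "0f3FC00000" into the float whose bit pattern it spells out.
float decode_ptx_f32(const std::string& literal) {
    // Skip the "0f" prefix and parse the remaining eight hex digits.
    const std::uint32_t bits =
        static_cast<std::uint32_t>(std::stoul(literal.substr(2), nullptr, 16));
    float value;
    std::memcpy(&value, &bits, sizeof value);  // reinterpret the bits as a float
    return value;
}

// Hypothetical helper: host-side analogue of `fma.rn.f32 d, a, b, c`.
// std::fma computes a * b + c with a single rounding step, using the
// current rounding mode (round-to-nearest-even unless changed).
float fma_rn_f32(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

With these helpers, `decode_ptx_f32("0f3FC00000")` yields 1.5, and `fma_rn_f32(2.0f, 3.0f, 1.5f)` yields 7.5.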

ptx

mov.u32 %r1, %ctaid.x;
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %tid.x;

move the x-axis values of the cooperative thread array index (ctaid), the cooperative thread array dimension (ntid, the number of threads per cooperative thread array), and the thread index (tid) into three u32 registers r1 - r3.
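Kernels commonly load these three values in order to combine them into a flat, grid-wide thread index. The source snippet only shows the moves, so the combination below is an assumption about what typically follows. The C++ sketch emulates it on the host; `ThreadCoords`, `global_index_x`, and `enumerate_indices` are hypothetical names used for illustration.

```cpp
#include <cstdint>
#include <vector>

// Values a single thread would read from the PTX special registers.
struct ThreadCoords {
    std::uint32_t ctaid_x;  // %ctaid.x: index of this CTA within the grid
    std::uint32_t ntid_x;   // %ntid.x: number of threads per CTA along x
    std::uint32_t tid_x;    // %tid.x: index of this thread within its CTA
};

// Flat index along x: skip the threads of all preceding CTAs, then add
// this thread's offset within its own CTA.
std::uint32_t global_index_x(const ThreadCoords& t) {
    return t.ctaid_x * t.ntid_x + t.tid_x;
}

// Host-side emulation: list the flat index of every thread in a 1-D grid.
std::vector<std::uint32_t> enumerate_indices(std::uint32_t num_ctas,
                                             std::uint32_t threads_per_cta) {
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(num_ctas) * threads_per_cta);
    for (std::uint32_t ctaid = 0; ctaid < num_ctas; ++ctaid) {
        for (std::uint32_t tid = 0; tid < threads_per_cta; ++tid) {
            indices.push_back(
                global_index_x(ThreadCoords{ctaid, threads_per_cta, tid}));
        }
    }
    return indices;
}
```

For example, a thread with ctaid.x = 2, ntid.x = 128, and tid.x = 5 gets flat index 2 * 128 + 5 = 261.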

The PTX programming model exposes multiple levels of parallelism to the programmer. These levels map directly onto the hardware through the PTX machine model, diagrammed below.

Notably, in this machine model there is a single instruction unit for multiple processors. While each processor runs one thread, those threads must execute the same instructions, hence parallel thread execution, or PTX. They coordinate with each other through shared memory and effect different results by means of private registers.

The documentation for the latest version of PTX is available from NVIDIA here. The instruction sets of PTX are versioned with a number called the "compute capability", which is synonymous with "minimum supported Streaming Multiprocessor architecture version".

Writing in-line PTX by hand is uncommon outside of the cutting edge of performance, similar to writing in-line x86_64 assembly, as is done in high-performance vectorized query operators in analytical databases and in performance-sensitive sections of operating system kernels. At time of writing in September of 2025, in-line PTX is the only way to take advantage of some Hopper-specific hardware features like the wgmma and tma instructions, as in Flash Attention 3 or in the Machete w4a16 kernels. Viewing CUDA C/C++, SASS, and PTX together is supported on Godbolt. See the NVIDIA "Inline PTX Assembly in CUDA" guide for details.