What is Parallel Thread Execution?
Parallel Thread eXecution (PTX) is an intermediate representation (IR) for code that will run on a parallel processor (almost always an NVIDIA GPU). It is one of the formats output by nvcc, the NVIDIA CUDA Compiler Driver. It is pronounced "pee-tecks" by many NVIDIA engineers and "pee-tee-ecks" by everyone else.
NVIDIA documentation refers to PTX as both a "virtual machine" and an "instruction set architecture".
From the programmer's perspective, PTX is an instruction set for programming against a virtual machine model. Programmers or compilers producing PTX can be confident their program will run with the same semantics on many distinct physical machines, including machines that do not yet exist. In this way, it is similar to CPU instruction set architectures like x86_64, aarch64, or SPARC.
Unlike those ISAs, PTX is very much an intermediate representation, like LLVM IR. The PTX components of a CUDA binary are just-in-time (JIT) compiled by the host's CUDA driver into device-specific SASS for execution.
In the case of NVIDIA GPUs, PTX is forward-compatible: GPUs with a matching or higher compute capability version will be able to run the program, thanks to this mechanism of JIT compilation. In this way, PTX is a "narrow waist" that separates the worlds of hardware and software.
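The forward-compatibility rule above can be sketched as a simple version comparison. This is a minimal illustration and not how the CUDA driver is implemented: the function name `can_jit` and the `(major, minor)` tuple representation are hypothetical, and the sketch ignores architecture-specific targets (such as `sm_90a`), which do not follow the same rule.

```python
from typing import Tuple

# Hypothetical representation of a compute capability, e.g. (8, 0) for 8.0.
ComputeCapability = Tuple[int, int]


def can_jit(ptx_target: ComputeCapability, device: ComputeCapability) -> bool:
    """Return True if PTX built for `ptx_target` may be JIT-compiled for `device`.

    Simplified rule from the text: the device needs a matching or higher
    compute capability. Tuples compare lexicographically, so (9, 0) > (8, 6).
    """
    return device >= ptx_target


# PTX built for compute capability 7.0 on an 9.0 device, and the reverse case.
older_ptx_on_newer_gpu = can_jit((7, 0), (9, 0))
newer_ptx_on_older_gpu = can_jit((9, 0), (8, 0))
print(older_ptx_on_newer_gpu, newer_ptx_on_older_gpu)
```

Under this rule, the first check succeeds and the second fails, which is why shipping PTX alongside SASS lets a binary run on GPUs released after it was built.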
Some example PTX:
.reg .f32 %f<7>; - a directive for the PTX-to-SASS compiler declaring seven 32-bit floating point registers, %f0 through %f6, for use by this kernel. Registers are dynamically allocated to groups of threads (warps) from the SM's register file.
fma.rn.f32 %f5, %f4, %f3, 0f3FC00000; - apply a fused multiply-add (fma) operation to multiply the contents of registers %f3 and %f4 and add the constant 0f3FC00000 (the hexadecimal encoding of 1.5), storing the result in %f5. All numbers are in 32-bit floating point representation. The rn suffix sets the floating point rounding mode to "round to nearest, ties to even", the default rounding mode of IEEE 754.
mov.u32 %r1, %ctaid.x; mov.u32 %r2, %ntid.x; mov.u32 %r3, %tid.x; - move the x-axis values of the cooperative thread array index (ctaid), the cooperative thread array dimension (ntid), and the thread index (tid) into three u32 registers, %r1 through %r3.
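To make the fma example concrete, the sketch below decodes the 0f3FC00000 literal and emulates a fused multiply-add in Python. The helper names (`decode_ptx_f32`, `to_f32`, `fma_f32`, `mul_then_add_f32`) and the operand values are assumptions chosen for illustration. The emulation is approximate: it rounds once in 64-bit arithmetic and once more to 32 bits, which matches a true single-rounding fma for the values shown but can differ in rare cases.

```python
import struct


def decode_ptx_f32(literal: str) -> float:
    """Decode a PTX single-precision literal such as '0f3FC00000'."""
    if not literal.lower().startswith("0f") or len(literal) != 10:
        raise ValueError(f"not a PTX f32 literal: {literal!r}")
    # The eight hex digits are the IEEE 754 binary32 bit pattern, big-endian.
    return struct.unpack(">f", bytes.fromhex(literal[2:]))[0]


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def fma_f32(a: float, b: float, c: float) -> float:
    """Approximate fma.rn.f32 for f32 inputs: round to f32 only at the end."""
    # The product of two binary32 values is exact in binary64.
    return to_f32(a * b + c)


def mul_then_add_f32(a: float, b: float, c: float) -> float:
    """Unfused version: round the product to f32 before adding."""
    return to_f32(to_f32(a * b) + c)


constant = decode_ptx_f32("0f3FC00000")  # 1.5
result = fma_f32(3.0, 2.0, constant)     # 3.0 * 2.0 + 1.5 = 7.5

# A case where fusing matters: the low bits of a * a survive only when fused.
a = 1.0 + 2.0 ** -23
c = -(1.0 + 2.0 ** -22)
fused = fma_f32(a, a, c)             # 2 ** -46
unfused = mul_then_add_f32(a, a, c)  # 0.0
print(constant, result, fused, unfused)
```

The last two values show why fused multiply-add is a distinct instruction and not shorthand for a multiply followed by an add: skipping the intermediate rounding preserves information that the unfused sequence discards.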
The PTX programming model exposes multiple levels of parallelism to the programmer. These levels map directly onto the hardware through the PTX machine model, diagrammed below.
Notably, in this machine model there is a single instruction unit for multiple processors. While each processor runs one thread, those threads must execute the same instructions, hence parallel thread execution, or PTX. They coordinate with each other through shared memory and effect different results by means of private registers.
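The "one instruction unit, many threads, private registers" idea can be sketched as a toy interpreter. This is a deliberately simplified model with hypothetical helper names (`run_cta`, `mov`, `mad_lo`); it ignores warps, divergence, and shared memory. The `mad_lo` helper is modelled on PTX's integer multiply-add and is used here to compute a global thread index from the three special registers.

```python
def mov(dst, src):
    """Build an instruction that copies one register into another."""
    def instr(regs):
        regs[dst] = regs[src]
    return instr


def mad_lo(dst, a, b, c):
    """Build an instruction computing a * b + c, keeping the low 32 bits."""
    def instr(regs):
        regs[dst] = (regs[a] * regs[b] + regs[c]) & 0xFFFFFFFF
    return instr


def run_cta(program, ctaid_x, ntid_x):
    """Run one cooperative thread array with a single shared instruction stream."""
    # Each thread has a private register file; only %tid.x differs at the start.
    threads = [
        {"%ctaid.x": ctaid_x, "%ntid.x": ntid_x, "%tid.x": tid_x}
        for tid_x in range(ntid_x)
    ]
    for instr in program:      # one instruction unit issues each instruction...
        for regs in threads:   # ...and every thread executes that same instruction
            instr(regs)
    return threads


program = [
    mov("%r1", "%ctaid.x"),
    mov("%r2", "%ntid.x"),
    mov("%r3", "%tid.x"),
    mad_lo("%r4", "%r1", "%r2", "%r3"),  # global index = ctaid * ntid + tid
]

threads = run_cta(program, ctaid_x=2, ntid_x=4)
global_ids = [regs["%r4"] for regs in threads]
print(global_ids)  # [8, 9, 10, 11]
```

Every thread ran the identical four instructions, yet each ends with a different value in %r4 because its private %tid.x differed from the start.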
The documentation for the latest version of PTX is available from NVIDIA. The instruction sets of PTX are versioned with a number called the "compute capability", which is synonymous with "minimum supported Streaming Multiprocessor architecture version".
Writing in-line PTX by hand is uncommon outside of the cutting edge of performance. It is comparable to writing in-line x86_64 assembly, as is done in high-performance vectorized query operators in analytical databases and in performance-sensitive sections of operating system kernels. At time of writing in September of 2025, in-line PTX is the only way to take advantage of some Hopper-specific hardware features like the wgmma and tma instructions, as in Flash Attention 3 or in the Machete w4a16 kernels. Viewing CUDA C/C++, SASS, and PTX together is supported on Godbolt. See the NVIDIA "Inline PTX Assembly in CUDA" guide for details.