Overview - NVIDIA CUTLASS Documentation

CUTLASS 4.x bridges the gap between productivity and performance for CUDA kernel development. By providing Python-based DSLs for the CUTLASS C++ template library, it enables faster iteration, easier prototyping, and a gentler learning curve for high-performance linear algebra on NVIDIA GPUs.

We envision CUTLASS DSLs as a family of domain-specific languages (DSLs). The 4.0 release introduces the first of these, CuTe DSL: a low-level programming model that is fully consistent with CuTe C++ abstractions and exposes core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.

Why CUTLASS DSLs?

While CUTLASS offers exceptional performance through its C++ template abstractions, that complexity can be a barrier for many developers. CUTLASS 4.x addresses this with Python-based DSLs that allow faster iteration, easier prototyping, and a gentler learning curve.

Students can learn GPU programming concepts without the complexity of C++ templates. Researchers and performance engineers can rapidly explore algorithms, prototype, and tune kernels before moving to production implementations.

Key Concepts and Approach

CUTLASS DSLs translate Python code into a custom intermediate representation (IR), which is then Just-In-Time (JIT) compiled into optimized CUDA kernels using MLIR and ptxas.
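The translate-then-compile flow described above can be illustrated with a toy pipeline in plain Python. This is a conceptual sketch only: `Value`, `trace`, and `jit_compile` are hypothetical names invented for this example, the "IR" is a simple list of tuples, and the "backend" is Python's built-in `exec` rather than MLIR and ptxas. It does not reflect how CUTLASS implements its compiler.

```python
# Conceptual sketch: a toy "Python -> IR -> compiled function" pipeline.
# None of these names come from CUTLASS; the real DSL lowers through MLIR to PTX.

class Value:
    """Symbolic value that records operations instead of computing them."""

    def __init__(self, name, ir):
        self.name = name
        self.ir = ir  # shared list of (dst, op, lhs, rhs) tuples

    def _emit(self, op, other):
        rhs = other.name if isinstance(other, Value) else repr(other)
        result = Value(f"v{len(self.ir)}", self.ir)
        self.ir.append((result.name, op, self.name, rhs))
        return result

    def __add__(self, other):
        return self._emit("+", other)

    def __mul__(self, other):
        return self._emit("*", other)


def trace(fn, arg_names):
    """Run fn once on symbolic values; return the recorded IR and result name."""
    ir = []
    args = [Value(name, ir) for name in arg_names]
    out = fn(*args)
    return ir, out.name


def jit_compile(fn, arg_names):
    """Lower the recorded IR to Python source and compile it at runtime."""
    ir, out_name = trace(fn, arg_names)
    lines = [f"def compiled({', '.join(arg_names)}):"]
    lines += [f"    {dst} = {lhs} {op} {rhs}" for dst, op, lhs, rhs in ir]
    lines.append(f"    return {out_name}")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["compiled"], ir


def axpy(a, x, y):
    return a * x + y


compiled_axpy, axpy_ir = jit_compile(axpy, ["a", "x", "y"])
```

Calling `compiled_axpy(2, 3, 4)` runs the generated function rather than the original `axpy`, and `axpy_ir` holds the two recorded operations. The point of the sketch is the separation of stages: the Python function is executed once to capture what it does, and a different artifact is what actually runs afterwards.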

Core CuTe DSL Abstractions

For more on CuTe abstractions, refer to the CuTe C++ library documentation.
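As a rough illustration of the layout concept, a flat CuTe-style layout can be modeled as a pair of shape and stride tuples that maps a logical coordinate to a linear offset through an inner product. The `Layout` class below is a hypothetical plain-Python stand-in written for this example; it is not the CuTe DSL API and it omits nested (hierarchical) shapes.

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple


@dataclass(frozen=True)
class Layout:
    """Toy flat layout: offset = sum(coord[i] * stride[i])."""

    shape: Tuple[int, ...]
    stride: Tuple[int, ...]

    def __call__(self, *coord: int) -> int:
        if len(coord) != len(self.shape):
            raise ValueError("coordinate rank must match layout rank")
        for c, extent in zip(coord, self.shape):
            if not 0 <= c < extent:
                raise IndexError(f"coordinate {c} out of range for extent {extent}")
        return sum(c * d for c, d in zip(coord, self.stride))

    def size(self) -> int:
        """Number of logical elements the layout addresses."""
        return prod(self.shape)


# The same 4x8 logical shape with two different memory orders.
col_major = Layout(shape=(4, 8), stride=(1, 4))  # written (4,8):(1,4) in CuTe notation
row_major = Layout(shape=(4, 8), stride=(8, 1))  # written (4,8):(8,1)

offset_col = col_major(2, 3)  # 2*1 + 3*4
offset_row = row_major(2, 3)  # 2*8 + 3*1
```

The same logical coordinate `(2, 3)` lands at different offsets under the two layouts, which is the core idea: the layout, not the kernel logic, decides where an element lives in memory.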

Pythonic Kernel Expression

Developers express kernel logic, data movement, and computation using familiar Python syntax and control flow.

The DSLs simplify expressing loop tiling, threading strategies, and data transformations using concise Python code.
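To make "loop tiling" concrete, here is a minimal sketch of a tiled matrix multiply in plain Python over nested lists. It is not CuTe DSL code and runs on the CPU; the function name `tiled_matmul` and the tile-size parameters are assumptions chosen for this illustration. In a GPU kernel, the three outer loops would typically map onto thread blocks and a main loop over the K dimension.

```python
def tiled_matmul(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Compute c = a @ b by iterating over (tile_m x tile_n x tile_k) tiles."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of a and b must match")

    c = [[0] * n for _ in range(m)]

    # Outer loops walk over tile origins.
    for m0 in range(0, m, tile_m):
        for n0 in range(0, n, tile_n):
            for k0 in range(0, k, tile_k):
                # Inner loops work within one tile; min() handles ragged edges.
                for i in range(m0, min(m0 + tile_m, m)):
                    for j in range(n0, min(n0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3],
     [4, 5, 6]]
b = [[7, 8],
     [9, 10],
     [11, 12]]
c = tiled_matmul(a, b)
```

The result is independent of the tile sizes; only the order in which partial products are accumulated changes. Choosing tile sizes to match the hardware is the kind of tuning the text refers to.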

JIT Compilation

Python kernels are compiled at runtime into CUDA device code using MLIR infrastructure and NVIDIA’s ptxas toolchain, enabling rapid iteration and interactive debugging.

Relationship to CUTLASS C++

CUTLASS DSLs are not a replacement for the CUTLASS C++ library or its 2.x and 3.x APIs. Instead, they aim to be a high-productivity kernel authoring framework that shares all concepts with the CUTLASS 3.x C++ API, such as CuTe, pipelines, and schedulers.

Getting Started

Current Status & Roadmap

CuTe DSL is in public beta and actively evolving. Interfaces and features are subject to change as we improve the system.

Upcoming Milestones

For known issues and workarounds, please consult the Limitations and FAQs.