cuTile Python

cuTile is a parallel programming model for NVIDIA GPUs and a Python-based DSL. It automatically leverages advanced hardware capabilities, such as tensor cores and tensor memory accelerators, while providing portability across different NVIDIA GPU architectures. cuTile enables the latest hardware features without requiring code changes.

cuTile kernels are GPU programs that execute in parallel on a logical grid of blocks. The @ct.kernel decorator marks a Python function as a kernel's entry point. Kernels cannot be called directly from host code; instead, the host queues kernels for execution on the GPU using the ct.launch() function:

```python
import cuda.tile as ct
import cupy

TILE_SIZE = 16

# cuTile kernel for adding two dense vectors. It runs in parallel on the GPU.
@ct.kernel
def vector_add_kernel(a, b, result):
    block_id = ct.bid(0)
    a_tile = ct.load(a, index=(block_id,), shape=(TILE_SIZE,))
    b_tile = ct.load(b, index=(block_id,), shape=(TILE_SIZE,))
    result_tile = a_tile + b_tile
    ct.store(result, index=(block_id,), tile=result_tile)

# Host-side function that launches the above kernel.
def vector_add(a: cupy.ndarray, b: cupy.ndarray, result: cupy.ndarray):
    assert a.shape == b.shape == result.shape
    grid = (ct.cdiv(a.shape[0], TILE_SIZE), 1, 1)
    ct.launch(cupy.cuda.get_current_stream(), grid, vector_add_kernel, (a, b, result))
```
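The grid size in the host function relies on ct.cdiv, which performs ceiling division so that a vector whose length is not a multiple of TILE_SIZE still gets a block for its final, partially filled tile. A minimal pure-Python sketch of that computation (the helper below is an illustration, not the cuTile implementation):

```python
def cdiv(n: int, d: int) -> int:
    """Ceiling division: smallest integer >= n / d (sketch of what ct.cdiv computes)."""
    return (n + d - 1) // d

TILE_SIZE = 16

# A 100-element vector needs 7 blocks of 16: six full tiles plus one partial tile.
grid_x = cdiv(100, TILE_SIZE)
```

With TILE_SIZE = 16, cdiv(100, 16) yields 7, so the launch covers all 100 elements.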

Kernels move data between arrays and tiles using functions like ct.load() and ct.store(). Both arrays and tiles are tensor-like data structures: each has a specific shape (i.e., the number of elements along each axis) and a dtype (i.e., the data type of its elements). However, the two differ in important ways.
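To build intuition for the block-and-tile decomposition described above, here is a GPU-free sketch in plain Python that mimics what vector_add_kernel does per block: each block_id "loads" one tile-sized slice of each input, adds them elementwise, and "stores" the result. This illustrates the execution model only; it is not the cuTile API, and the real kernel runs its blocks in parallel rather than in a loop.

```python
TILE_SIZE = 4  # small tile for illustration

def simulated_vector_add(a, b):
    """Sequentially emulate the per-block work the GPU performs in parallel."""
    n = len(a)
    result = [0] * n
    num_blocks = (n + TILE_SIZE - 1) // TILE_SIZE  # same math as ct.cdiv
    for block_id in range(num_blocks):
        start = block_id * TILE_SIZE
        end = min(start + TILE_SIZE, n)  # final tile may be partial
        # "load" one tile from each input array
        a_tile = a[start:end]
        b_tile = b[start:end]
        # elementwise add, then "store" the result tile back
        result[start:end] = [x + y for x, y in zip(a_tile, b_tile)]
    return result
```

For example, simulated_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]) returns [11, 22, 33, 44, 55], using one full tile and one partial tile.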

Proceed to the Quickstart page for installation instructions and a complete working example.