CUDA Stream Sanitizer — PyTorch 2.7 documentation (original) (raw)

Note

This is a prototype feature, which means it is at an early stage for feedback and testing, and its components are subject to change.

Overview

This module introduces CUDA Sanitizer, a tool for detecting synchronization errors between kernels ran on different streams.

It stores information on accesses to tensors to determine if they are synchronized or not. When enabled in a python program and a possible data race is detected, a detailed warning will be printed and the program will exit.

It can be enabled either by importing this module and callingenable_cuda_sanitizer() or by exporting the TORCH_CUDA_SANITIZERenvironment variable.

Usage

Here is an example of a simple synchronization error in PyTorch:

import torch

a = torch.rand(4, 2, device="cuda")

with torch.cuda.stream(torch.cuda.Stream()): torch.mul(a, 5, out=a)

The a tensor is initialized on the default stream and, without any synchronization methods, modified on a new stream. The two kernels will run concurrently on the same tensor, which might cause the second kernel to read uninitialized data before the first one was able to write it, or the first kernel might overwrite part of the result of the second. When this script is run on the commandline with:

TORCH_CUDA_SANITIZER=1 python example_error.py

the following output is printed by CSAN:

============================ CSAN detected a possible data race on tensor with data pointer 139719969079296 Access by stream 94646435460352 during kernel: aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!) writing to argument(s) self, out, and to the output With stack trace: File "example_error.py", line 6, in torch.mul(a, 5, out=a) ... File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch stack_trace = traceback.StackSummary.extract(

Previous access by stream 0 during kernel: aten::rand(int[] size, *, int? dtype=None, Device? device=None) -> Tensor writing to the output With stack trace: File "example_error.py", line 3, in a = torch.rand(10000, device="cuda") ... File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch stack_trace = traceback.StackSummary.extract(

Tensor was allocated with stack trace: File "example_error.py", line 3, in a = torch.rand(10000, device="cuda") ... File "pytorch/torch/cuda/_sanitizer.py", line 420, in _handle_memory_allocation traceback.StackSummary.extract(

This gives extensive insight into the origin of the error:

See also

The list of supported torch operators and their schemas can be viewedhere.

The bug can be fixed by forcing the new stream to wait for the default stream:

with torch.cuda.stream(torch.cuda.Stream()): torch.cuda.current_stream().wait_stream(torch.cuda.default_stream()) torch.mul(a, 5, out=a)

When the script is run again, there are no errors reported.

API Reference

torch.cuda._sanitizer.enable_cuda_sanitizer()[source][source]

Enable CUDA Sanitizer.

The sanitizer will begin to analyze low-level CUDA calls invoked by torch functions for synchronization errors. All data races found will be printed to the standard error output along with stack traces of suspected causes. For best results, the sanitizer should be enabled at the very beginning of the program.