torch.cuda — PyTorch 2.7 documentation

This package adds support for CUDA tensor types.

It implements the same functions as CPU tensors, but they utilize GPUs for computation.

It is lazily initialized, so you can always import it, and use is_available() to determine if your system supports CUDA.

CUDA semantics has more details about working with CUDA.
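
A minimal sketch of this lazy-initialization pattern (the tensor shape is illustrative):

    import torch

    # Importing torch.cuda never fails; CUDA is only initialized on first use.
    if torch.cuda.is_available():
        device = torch.device("cuda")       # the first CUDA call triggers initialization
        x = torch.ones(3, device=device)    # tensor allocated on the GPU
    else:
        device = torch.device("cpu")        # fall back to CPU tensors
        x = torch.ones(3)
    print(device, x)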

StreamContext Context-manager that selects a given stream.
can_device_access_peer Check if peer access between two devices is possible.
current_blas_handle Return a cublasHandle_t pointer to the current cuBLAS handle.
current_device Return the index of a currently selected device.
current_stream Return the currently selected Stream for a given device.
cudart Retrieve the CUDA runtime API module.
default_stream Return the default Stream for a given device.
device Context-manager that changes the selected device.
device_count Return the number of GPUs available.
device_memory_used Return used global (device) memory in bytes as given by nvidia-smi or amd-smi.
device_of Context-manager that changes the current device to that of given object.
get_arch_list Return the list of CUDA architectures this library was compiled for.
get_device_capability Get the CUDA capability of a device.
get_device_name Get the name of a device.
get_device_properties Get the properties of a device.
get_gencode_flags Return NVCC gencode flags this library was compiled with.
get_stream_from_external Return a Stream from an externally allocated CUDA stream.
get_sync_debug_mode Return the current value of the debug mode for CUDA synchronizing operations.
init Initialize PyTorch's CUDA state.
ipc_collect Force collection of GPU memory after it has been released by CUDA IPC.
is_available Return a bool indicating if CUDA is currently available.
is_initialized Return whether PyTorch's CUDA state has been initialized.
is_tf32_supported Return a bool indicating if the current CUDA/ROCm device supports dtype tf32.
memory_usage Return the percent of time over the past sample period during which global (device) memory was being read or written as given by nvidia-smi.
set_device Set the current device.
set_stream Set the current stream. This is a wrapper API to set the stream.
set_sync_debug_mode Set the debug mode for CUDA synchronizing operations.
stream Wrapper around the context manager StreamContext that selects a given stream.
synchronize Wait for all kernels in all streams on a CUDA device to complete.
utilization Return the percent of time over the past sample period during which one or more kernels were executing on the GPU as given by nvidia-smi.
temperature Return the average temperature of the GPU sensor in degrees Celsius.
power_draw Return the average power draw of the GPU sensor in mW (milliwatts).
clock_rate Return the clock speed of the GPU SM in MHz (megahertz) over the past sample period as given by nvidia-smi.
OutOfMemoryError Exception raised when the device is out of memory.
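
A brief sketch exercising a few of the device-management APIs listed above (is_available, device_count, get_device_name, device, current_device, synchronize):

    import torch

    if torch.cuda.is_available():
        print(torch.cuda.device_count())          # number of visible GPUs
        print(torch.cuda.get_device_name(0))      # name of the first GPU
        with torch.cuda.device(0):                # context manager: select device 0
            y = torch.randn(2, 2, device="cuda")  # allocated on the selected device
            print(torch.cuda.current_device())    # -> 0
        torch.cuda.synchronize()                  # wait for all kernels on the device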

Random Number Generator

get_rng_state Return the random number generator state of the specified GPU as a ByteTensor.
get_rng_state_all Return a list of ByteTensor representing the random number states of all devices.
set_rng_state Set the random number generator state of the specified GPU.
set_rng_state_all Set the random number generator state of all devices.
manual_seed Set the seed for generating random numbers for the current GPU.
manual_seed_all Set the seed for generating random numbers on all GPUs.
seed Set the seed for generating random numbers to a random number for the current GPU.
seed_all Set the seed for generating random numbers to a random number on all GPUs.
initial_seed Return the current random seed of the current GPU.
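
A minimal sketch of seeding and of saving and restoring generator state with the functions above:

    import torch

    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(42)          # seed every visible GPU
        state = torch.cuda.get_rng_state(0)     # snapshot device 0's RNG state
        a = torch.randn(3, device="cuda")
        torch.cuda.set_rng_state(state, 0)      # restore the snapshot
        b = torch.randn(3, device="cuda")
        assert torch.equal(a, b)                # same state -> same samples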

Communication collectives

Streams and events

Stream Wrapper around a CUDA stream.
ExternalStream Wrapper around an externally allocated CUDA stream.
Event Wrapper around a CUDA event.
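
A minimal sketch of timing work on a non-default stream with Stream and Event (the matrix size is illustrative):

    import torch

    if torch.cuda.is_available():
        s = torch.cuda.Stream()                       # a new, non-default stream
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        with torch.cuda.stream(s):                    # enqueue work on stream s
            start.record()                            # events record on the current stream
            a = torch.randn(1024, 1024, device="cuda")
            b = a @ a                                 # some GPU work to time
            end.record()
        end.synchronize()                             # block until the event completes
        print(start.elapsed_time(end), "ms")          # elapsed GPU time between events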

Graphs (beta)

Jiterator (beta)

TunableOp

Some operations could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the cublas/cublasLt libraries or hipblas/hipblasLt libraries, respectively. How does one know which implementation is the fastest and should be chosen? That’s what TunableOp provides. Certain operators have been implemented using multiple strategies as Tunable Operators. At runtime, all strategies are profiled and the fastest is selected for all subsequent operations.

See the documentation for information on how to use it.
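
A minimal sketch, assuming the torch.cuda.tunable Python API described in the TunableOp documentation (TunableOp can also be enabled for a whole process via the PYTORCH_TUNABLEOP_ENABLED=1 environment variable):

    import torch
    import torch.cuda.tunable as tunable

    # Sketch only: assumes the torch.cuda.tunable API from the TunableOp docs.
    if torch.cuda.is_available():
        tunable.enable(True)                  # turn TunableOp on for this process
        a = torch.randn(512, 512, device="cuda")
        b = torch.randn(512, 512, device="cuda")
        c = a @ b                             # first GEMM of this shape profiles all strategies
        d = a @ b                             # subsequent calls reuse the fastest one
        tunable.enable(False)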

Stream Sanitizer (prototype)

CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch. See the documentation for information on how to use it.
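
A minimal sketch of the kind of error it detects; run the script under TORCH_CUDA_SANITIZER=1 to have the access flagged:

    import torch

    # Run as: TORCH_CUDA_SANITIZER=1 python script.py
    a = torch.rand(4, 2, device="cuda")           # written on the default stream
    with torch.cuda.stream(torch.cuda.Stream()):
        torch.mul(a, 5, out=a)                    # touched on a side stream with no
                                                  # synchronization: reported as a data race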

GPUDirect Storage (prototype)

The APIs in torch.cuda.gds provide thin wrappers around certain cuFile APIs that allow direct memory access transfers between GPU memory and storage, avoiding a bounce buffer in the CPU. See the cuFile API documentation for more details.

These APIs can be used with CUDA versions 12.6 and above. To use them, one must ensure that the system is appropriately configured to use GPUDirect Storage per the GPUDirect Storage documentation.

See the docs for GdsFile for an example of how to use these.
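
A hedged sketch of a round trip through a GdsFile, based on its documented save_storage/load_storage methods (the filename is illustrative, and a GDS-configured system is required):

    import os
    import torch

    # Sketch only: assumes a system configured for GPUDirect Storage.
    src = torch.randn(1024, device="cuda")
    f = torch.cuda.gds.GdsFile("checkpoint.bin", os.O_CREAT | os.O_RDWR)
    f.save_storage(src.untyped_storage(), offset=0)  # GPU -> storage, no CPU bounce buffer

    dst = torch.empty(1024, device="cuda")
    f.load_storage(dst.untyped_storage(), offset=0)  # storage -> GPU
    assert torch.equal(src, dst)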