torch.cuda — PyTorch 2.7 documentation

This package adds support for CUDA tensor types.

It implements the same functions as CPU tensors, but they utilize GPUs for computation.

It is lazily initialized, so you can always import it, and use is_available() to determine if your system supports CUDA.

CUDA semantics has more details about working with CUDA.
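
A minimal sketch of this lazy-initialization pattern (the tensor shape is illustrative):

    import torch

    # Importing torch.cuda never fails; CUDA is only initialized on first use.
    if torch.cuda.is_available():
        device = torch.device("cuda")       # the first CUDA call triggers initialization
        x = torch.ones(3, device=device)    # tensor allocated on the GPU
    else:
        device = torch.device("cpu")        # fall back to CPU tensors
        x = torch.ones(3)
    print(device, x)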

StreamContext Context-manager that selects a given stream.
can_device_access_peer Check if peer access between two devices is possible.
current_blas_handle Return a cublasHandle_t pointer to the current cuBLAS handle.
current_device Return the index of a currently selected device.
current_stream Return the currently selected Stream for a given device.
cudart Retrieve the CUDA runtime API module.
default_stream Return the default Stream for a given device.
device Context-manager that changes the selected device.
device_count Return the number of GPUs available.
device_memory_used Return used global (device) memory in bytes as given by nvidia-smi or amd-smi.
device_of Context-manager that changes the current device to that of given object.
get_arch_list Return the list of CUDA architectures this library was compiled for.
get_device_capability Get the CUDA capability of a device.
get_device_name Get the name of a device.
get_device_properties Get the properties of a device.
get_gencode_flags Return NVCC gencode flags this library was compiled with.
get_stream_from_external Return a Stream from an externally allocated CUDA stream.
get_sync_debug_mode Return the current value of the debug mode for CUDA synchronizing operations.
init Initialize PyTorch's CUDA state.
ipc_collect Force collection of GPU memory after it has been released by CUDA IPC.
is_available Return a bool indicating if CUDA is currently available.
is_initialized Return whether PyTorch's CUDA state has been initialized.
is_tf32_supported Return a bool indicating if the current CUDA/ROCm device supports dtype tf32.
memory_usage Return the percent of time over the past sample period during which global (device) memory was being read or written as given by nvidia-smi.
set_device Set the current device.
set_stream Set the current stream. This is a wrapper API to set the stream.
set_sync_debug_mode Set the debug mode for CUDA synchronizing operations.
stream Wrapper around the context manager StreamContext that selects a given stream.
synchronize Wait for all kernels in all streams on a CUDA device to complete.
utilization Return the percent of time over the past sample period during which one or more kernels were executing on the GPU as given by nvidia-smi.
temperature Return the average temperature of the GPU sensor in degrees Celsius.
power_draw Return the average power draw of the GPU sensor in mW (milliwatts).
clock_rate Return the clock speed of the GPU SM in MHz (megahertz) over the past sample period as given by nvidia-smi.
OutOfMemoryError Exception raised when the device is out of memory.
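
A brief sketch exercising a few of the device-management APIs listed above (is_available, device_count, get_device_name, device, current_device, synchronize):

    import torch

    if torch.cuda.is_available():
        print(torch.cuda.device_count())          # number of visible GPUs
        print(torch.cuda.get_device_name(0))      # name of the first GPU
        with torch.cuda.device(0):                # context manager: select device 0
            y = torch.randn(2, 2, device="cuda")  # allocated on the selected device
            print(torch.cuda.current_device())    # -> 0
        torch.cuda.synchronize()                  # wait for all kernels on the device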

Random Number Generator

get_rng_state Return the random number generator state of the specified GPU as a ByteTensor.
get_rng_state_all Return a list of ByteTensor representing the random number states of all devices.
set_rng_state Set the random number generator state of the specified GPU.
set_rng_state_all Set the random number generator state of all devices.
manual_seed Set the seed for generating random numbers for the current GPU.
manual_seed_all Set the seed for generating random numbers on all GPUs.
seed Set the seed for generating random numbers to a random number for the current GPU.
seed_all Set the seed for generating random numbers to a random number on all GPUs.
initial_seed Return the current random seed of the current GPU.
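
A minimal sketch of seeding and of saving and restoring generator state with the functions above:

    import torch

    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(42)          # seed every visible GPU
        state = torch.cuda.get_rng_state(0)     # snapshot device 0's RNG state
        a = torch.randn(3, device="cuda")
        torch.cuda.set_rng_state(state, 0)      # restore the snapshot
        b = torch.randn(3, device="cuda")
        assert torch.equal(a, b)                # same state -> same samples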

Communication collectives

Streams and events

Stream Wrapper around a CUDA stream.
ExternalStream Wrapper around an externally allocated CUDA stream.
Event Wrapper around a CUDA event.
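
A minimal sketch of timing work on a non-default stream with Stream and Event (the matrix size is illustrative):

    import torch

    if torch.cuda.is_available():
        s = torch.cuda.Stream()                       # a new, non-default stream
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        with torch.cuda.stream(s):                    # enqueue work on stream s
            start.record()                            # events record on the current stream
            a = torch.randn(1024, 1024, device="cuda")
            b = a @ a                                 # some GPU work to time
            end.record()
        end.synchronize()                             # block until the event completes
        print(start.elapsed_time(end), "ms")          # elapsed GPU time between events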

Graphs (beta)

Jiterator (beta)

TunableOp

Some operations could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the cublas/cublasLt libraries or hipblas/hipblasLt libraries, respectively. How does one know which implementation is the fastest and should be chosen? That’s what TunableOp provides. Certain operators have been implemented using multiple strategies as Tunable Operators. At runtime, all strategies are profiled and the fastest is selected for all subsequent operations.

See the documentation for information on how to use it.
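
A minimal sketch, assuming the torch.cuda.tunable Python API described in the TunableOp documentation (TunableOp can also be enabled for a whole process via the PYTORCH_TUNABLEOP_ENABLED=1 environment variable):

    import torch
    import torch.cuda.tunable as tunable

    # Sketch only: assumes the torch.cuda.tunable API from the TunableOp docs.
    if torch.cuda.is_available():
        tunable.enable(True)                  # turn TunableOp on for this process
        a = torch.randn(512, 512, device="cuda")
        b = torch.randn(512, 512, device="cuda")
        c = a @ b                             # first GEMM of this shape profiles all strategies
        d = a @ b                             # subsequent calls reuse the fastest one
        tunable.enable(False)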

Stream Sanitizer (prototype)

CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch. See the documentation for information on how to use it.
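
A minimal sketch of the kind of error it detects; run the script under TORCH_CUDA_SANITIZER=1 to have the access flagged:

    import torch

    # Run as: TORCH_CUDA_SANITIZER=1 python script.py
    a = torch.rand(4, 2, device="cuda")           # written on the default stream
    with torch.cuda.stream(torch.cuda.Stream()):
        torch.mul(a, 5, out=a)                    # touched on a side stream with no
                                                  # synchronization: reported as a data race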

GPUDirect Storage (prototype)

The APIs in torch.cuda.gds provide thin wrappers around certain cuFile APIs that allow direct memory access transfers between GPU memory and storage, avoiding a bounce buffer in the CPU. See the cuFile API documentation for more details.

These APIs can be used with CUDA versions 12.6 and above. To use them, one must ensure that the system is appropriately configured to use GPUDirect Storage per the GPUDirect Storage documentation.

See the docs for GdsFile for an example of how to use these.
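
A hedged sketch of a round trip through a GdsFile, based on its documented save_storage/load_storage methods (the filename is illustrative, and a GDS-configured system is required):

    import os
    import torch

    # Sketch only: assumes a system configured for GPUDirect Storage.
    src = torch.randn(1024, device="cuda")
    f = torch.cuda.gds.GdsFile("checkpoint.bin", os.O_CREAT | os.O_RDWR)
    f.save_storage(src.untyped_storage(), offset=0)  # GPU -> storage, no CPU bounce buffer

    dst = torch.empty(1024, device="cuda")
    f.load_storage(dst.untyped_storage(), offset=0)  # storage -> GPU
    assert torch.equal(src, dst)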