torch.cuda — PyTorch 2.7 documentation
This package adds support for CUDA tensor types.
It implements the same functions as CPU tensors, but uses GPUs for computation.
It is lazily initialized, so you can always import it, and use is_available() to determine if your system supports CUDA.
CUDA semantics has more details about working with CUDA.
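As a minimal sketch of the lazy-initialization pattern described above, the snippet below checks is_available() before choosing a device and falls back to the CPU otherwise. The helper name `pick_device` is hypothetical and used only for illustration.

```python
import torch


def pick_device() -> torch.device:
    """Return a CUDA device when one is usable, otherwise the CPU."""
    # Importing torch (and torch.cuda) does not initialize CUDA, so this
    # check is safe on machines without a GPU or driver.
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device()
x = torch.ones(3, device=device)  # placing a tensor on a CUDA device triggers initialization
y = (x * 2).sum().item()
print(device.type, y)
```

The same script runs unchanged on CPU-only machines, because no CUDA call is made unless the check succeeds.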
Name | Description |
---|---|
StreamContext | Context-manager that selects a given stream. |
can_device_access_peer | Check if peer access between two devices is possible. |
current_blas_handle | Return a cublasHandle_t pointer to the current cuBLAS handle. |
current_device | Return the index of a currently selected device. |
current_stream | Return the currently selected Stream for a given device. |
cudart | Retrieves the CUDA runtime API module. |
default_stream | Return the default Stream for a given device. |
device | Context-manager that changes the selected device. |
device_count | Return the number of GPUs available. |
device_memory_used | Return used global (device) memory in bytes as given by nvidia-smi or amd-smi. |
device_of | Context-manager that changes the current device to that of given object. |
get_arch_list | Return the list of CUDA architectures this library was compiled for. |
get_device_capability | Get the CUDA capability of a device. |
get_device_name | Get the name of a device. |
get_device_properties | Get the properties of a device. |
get_gencode_flags | Return NVCC gencode flags this library was compiled with. |
get_stream_from_external | Return a Stream from an externally allocated CUDA stream. |
get_sync_debug_mode | Return the current value of the debug mode for CUDA synchronizing operations. |
init | Initialize PyTorch's CUDA state. |
ipc_collect | Force collects GPU memory after it has been released by CUDA IPC. |
is_available | Return a bool indicating if CUDA is currently available. |
is_initialized | Return whether PyTorch's CUDA state has been initialized. |
is_tf32_supported | Return a bool indicating if the current CUDA/ROCm device supports dtype tf32. |
memory_usage | Return the percent of time over the past sample period during which global (device) memory was being read or written as given by nvidia-smi. |
set_device | Set the current device. |
set_stream | Set the current stream. This is a wrapper API to set the stream. |
set_sync_debug_mode | Set the debug mode for CUDA synchronizing operations. |
stream | Wrap around the Context-manager StreamContext that selects a given stream. |
synchronize | Wait for all kernels in all streams on a CUDA device to complete. |
utilization | Return the percent of time over the past sample period during which one or more kernels was executing on the GPU as given by nvidia-smi. |
temperature | Return the average temperature of the GPU sensor in degrees Celsius. |
power_draw | Return the average power draw of the GPU sensor in mW (milliwatts). |
clock_rate | Return the clock speed of the GPU SM in MHz (megahertz) over the past sample period as given by nvidia-smi. |
OutOfMemoryError | Exception raised when the device is out of memory. |
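The device query functions listed in this table can be combined as in the following sketch. The helper `describe_devices` and the dictionary keys are illustrative choices, not part of the API; the code returns an empty list when CUDA is unavailable.

```python
import torch


def describe_devices() -> list:
    """Collect basic facts about every visible CUDA device (empty list without CUDA)."""
    if not torch.cuda.is_available():
        return []
    info = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        props = torch.cuda.get_device_properties(index)
        info.append(
            {
                "index": index,
                "name": torch.cuda.get_device_name(index),
                "capability": f"{major}.{minor}",
                "total_memory_bytes": props.total_memory,
            }
        )
    return info


devices = describe_devices()
for entry in devices:
    print(entry)

if devices:
    # The device context manager changes the selected device only inside the block.
    with torch.cuda.device(0):
        selected = torch.cuda.current_device()  # 0 inside this block
        torch.cuda.synchronize()  # wait for all queued kernels on device 0
```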
Random Number Generator
Name | Description |
---|---|
get_rng_state | Return the random number generator state of the specified GPU as a ByteTensor. |
get_rng_state_all | Return a list of ByteTensor representing the random number states of all devices. |
set_rng_state | Set the random number generator state of the specified GPU. |
set_rng_state_all | Set the random number generator state of all devices. |
manual_seed | Set the seed for generating random numbers for the current GPU. |
manual_seed_all | Set the seed for generating random numbers on all GPUs. |
seed | Set the seed for generating random numbers to a random number for the current GPU. |
seed_all | Set the seed for generating random numbers to a random number on all GPUs. |
initial_seed | Return the current random seed of the current GPU. |
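A short sketch of how the seeding and state functions above fit together: seed all GPUs, snapshot the generator state of the current device, draw numbers, restore the snapshot, and draw again. The helper name `reproducible_draws` and the seed value are arbitrary; the function returns None when no CUDA device is present.

```python
import torch


def reproducible_draws(seed: int = 1234):
    """Return True if restoring the RNG state reproduces the same draw, None without CUDA."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.manual_seed_all(seed)       # seed every visible GPU
    state = torch.cuda.get_rng_state()     # ByteTensor snapshot for the current GPU
    first = torch.rand(4, device="cuda")
    torch.cuda.set_rng_state(state)        # rewind the generator to the snapshot
    second = torch.rand(4, device="cuda")
    return torch.equal(first, second)


same = reproducible_draws()
print("draws match after restoring state:", same)
```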
Communication collectives
Streams and events
Name | Description |
---|---|
Stream | Wrapper around a CUDA stream. |
ExternalStream | Wrapper around an externally allocated CUDA stream. |
Event | Wrapper around a CUDA event. |
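One common use of Stream and Event together is timing work queued on a non-default stream. The sketch below assumes a matrix multiplication as the workload; the helper name `time_matmul_ms` and the matrix size are illustrative only, and the function returns None when CUDA is unavailable.

```python
import torch


def time_matmul_ms(size: int = 512):
    """Time one matmul on a side stream using CUDA events; None without CUDA."""
    if not torch.cuda.is_available():
        return None
    side = torch.cuda.Stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    a = torch.randn(size, size, device="cuda")
    b = torch.randn(size, size, device="cuda")
    # The inputs were produced on the current stream, so the side stream
    # must wait for that work before it reads them.
    side.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(side):
        start.record()  # recorded on the side stream, which is current here
        c = a @ b
        end.record()

    end.synchronize()  # block the host until the end event has completed
    return start.elapsed_time(end)  # elapsed time in milliseconds


elapsed = time_matmul_ms()
print("elapsed ms:", elapsed)
```

Events measure time on the device timeline, so the result is not distorted by the asynchronous kernel launch returning early on the host.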
Graphs (beta)
Jiterator (beta)
TunableOp
Some operations could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the cublas/cublasLt libraries or hipblas/hipblasLt libraries, respectively. How does one know which implementation is the fastest and should be chosen? That’s what TunableOp provides. Certain operators have been implemented using multiple strategies as Tunable Operators. At runtime, all strategies are profiled and the fastest is selected for all subsequent operations.
See the documentation for information on how to use it.
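As a rough sketch, TunableOp can be switched on from Python through the torch.cuda.tunable module. The helper name `enable_tunableop` is hypothetical, and the snippet assumes a build in which torch.cuda.tunable is present; it does nothing on machines without a GPU.

```python
import torch


def enable_tunableop() -> bool:
    """Turn TunableOp on when a CUDA/ROCm device is present; report whether it is enabled."""
    if not torch.cuda.is_available():
        return False  # nothing to tune without a GPU
    from torch.cuda import tunable

    tunable.enable(True)         # route supported operators through TunableOp
    tunable.tuning_enable(True)  # allow strategies to be profiled at runtime
    return tunable.is_enabled()


tunableop_on = enable_tunableop()
print("TunableOp enabled:", tunableop_on)
```

With tuning enabled, the first call to a supported operator for a given problem shape is slower, since that is when the candidate strategies are profiled.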
Stream Sanitizer (prototype)
CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch. See the documentation for information on how to use it.
GPUDirect Storage (prototype)
The APIs in torch.cuda.gds provide thin wrappers around certain cuFile APIs that allow direct memory access transfers between GPU memory and storage, avoiding a bounce buffer in the CPU. See the cuFile API documentation for more details.
These APIs can be used with CUDA 12.6 or later. To use them, ensure that your system is appropriately configured for GPUDirect Storage, as described in the GPUDirect Storage documentation.
See the docs for GdsFile for an example of how to use these.