PyTorch Distributed Overview¶
Created On: Jul 28, 2020 | Last Updated: Oct 08, 2024 | Last Verified: Nov 05, 2024
Author: Will Constable
This is the overview page for the torch.distributed package. The goal of this page is to categorize documents into different topics and briefly describe each of them. If this is your first time building distributed training applications using PyTorch, it is recommended to use this document to navigate to the technology that can best serve your use case.
Introduction¶
The PyTorch Distributed library includes a collection of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.
Sharding primitives¶
DTensor and DeviceMesh are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.
- DTensor represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
- DeviceMesh abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ProcessGroup instances for collective communications in multi-dimensional parallelism. Try out our Device Mesh Recipe to learn more, and see the sketch after this list for a small end-to-end example.
Launcher¶
torchrun is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs.
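As a hedged illustration (the script name and flag values below are assumptions, not prescribed by this overview), torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and the rendezvous environment variables for each process it spawns, so a script can simply read them when initializing the process group:

```python
# Launch with, e.g.:  torchrun --nnodes=1 --nproc_per_node=4 elastic_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # init_process_group() reads RANK/WORLD_SIZE/MASTER_ADDR from the
    # environment variables that torchrun populates.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes its rank id; after all_reduce every rank holds the sum.
    t = torch.tensor([float(rank)], device="cuda" if torch.cuda.is_available() else "cpu")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```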
PyTorch Distributed Developers¶
If you’d like to contribute to PyTorch Distributed, refer to our Developer Guide.