PyTorch Distributed Overview¶
Created On: Jul 28, 2020 | Last Updated: Oct 08, 2024 | Last Verified: Nov 05, 2024
Author: Will Constable
This is the overview page for the torch.distributed package. The goal of this page is to categorize documents into different topics and briefly describe each of them. If this is your first time building distributed training applications using PyTorch, it is recommended to use this document to navigate to the technology that can best serve your use case.
Introduction¶
The PyTorch Distributed library includes a collection of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.
Sharding primitives¶
DTensor and DeviceMesh are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.
- DTensor represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
- DeviceMesh abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ProcessGroup instances for collective communications in multi-dimensional parallelism. Try out our Device Mesh Recipe to learn more, and see the sketch after this list for a small end-to-end example.
Launcher¶
torchrun is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs.
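As a hedged illustration (the script name and flag values below are assumptions, not prescribed by this overview), torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and the rendezvous environment variables for each process it spawns, so a script can simply read them when initializing the process group:

```python
# Launch with, e.g.:  torchrun --nnodes=1 --nproc_per_node=4 elastic_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # init_process_group() reads RANK/WORLD_SIZE/MASTER_ADDR from the
    # environment variables that torchrun populates.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    if torch.cuda.is_available():
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes its rank id; after all_reduce every rank holds the sum.
    t = torch.tensor([float(rank)], device="cuda" if torch.cuda.is_available() else "cpu")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```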
PyTorch Distributed Developers¶
If you’d like to contribute to PyTorch Distributed, refer to our Developer Guide.