PyTorch Distributed Overview

Created On: Jul 28, 2020 | Last Updated: Oct 08, 2024 | Last Verified: Nov 05, 2024

Author: Will Constable

This is the overview page for the torch.distributed package. Its goal is to group the documentation by topic and briefly describe each area. If this is your first time building distributed training applications with PyTorch, we recommend using this document to navigate to the technology that best serves your use case.

Introduction

The PyTorch Distributed library includes a collection of parallelism modules, a communications layer, and infrastructure for launching and debugging large training jobs.
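To make the communications layer concrete, here is a minimal sketch of a collective operation. It assumes the script is launched with torchrun (covered below) so that the rendezvous environment variables are already set; the gloo backend keeps it runnable on CPU-only machines.

```python
# Minimal sketch of the communications layer: an all_reduce across ranks.
# Assumes launch via torchrun, which sets RANK, WORLD_SIZE, etc.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # use "nccl" for GPU jobs
t = torch.ones(1) * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank now holds 0 + 1 + ... + (N-1)
print(f"rank {dist.get_rank()}: {t.item()}")
dist.destroy_process_group()
```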

Sharding primitives

DTensor and DeviceMesh are primitives for expressing parallelism in terms of sharded or replicated tensors over N-dimensional process groups.
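As a minimal sketch, the snippet below shards a tensor across a 1-D device mesh. It assumes 4 GPUs and a torchrun launch with --nproc-per-node=4; the tensor sizes are arbitrary.

```python
# Minimal sketch: shard a tensor with DTensor and DeviceMesh.
# Assumes 4 GPUs and torchrun --nproc-per-node=4.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

# Build a 1-D mesh over 4 ranks; this also initializes the default
# process group if one is not already running.
mesh = init_device_mesh("cuda", (4,))

# Same seed on every rank, so each rank constructs the same global tensor.
torch.manual_seed(0)
big = torch.randn(16, 16)

# Shard along dim 0: each rank holds a 4x16 local shard of the 16x16 tensor.
dtensor = distribute_tensor(big, mesh, placements=[Shard(0)])
print(dtensor.to_local().shape)  # torch.Size([4, 16]) on every rank
```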

Launcher

torchrun is a widely used launcher script that spawns and monitors the worker processes of a distributed PyTorch program; for multi-node jobs, it is run once on each participating machine.
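Below is a minimal sketch of a torchrun launch, using a hypothetical script name (elastic_demo.py). torchrun populates the rendezvous environment variables, so the script can call init_process_group without passing any addresses explicitly.

```python
# elastic_demo.py (hypothetical filename)
# Launch with:  torchrun --nnodes=1 --nproc-per-node=4 elastic_demo.py
# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT,
# so init_process_group can read everything from the environment.
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # use "nccl" for GPU training
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up")
dist.destroy_process_group()
```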

PyTorch Distributed Developers

If you’d like to contribute to PyTorch Distributed, refer to our Developer Guide.