composer.utils.dist

Helper methods for torch.distributed.

To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run.

composer -n 8 path/to/train.py

The composer launcher will automatically configure the following environment variables, which are required for distributed training: RANK, LOCAL_RANK, NODE_RANK, WORLD_SIZE, and LOCAL_WORLD_SIZE.

If none of these environment variables are set, this module will safely assume a single-rank configuration, where:

RANK=0 LOCAL_RANK=0 NODE_RANK=0 WORLD_SIZE=1 LOCAL_WORLD_SIZE=1
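For example, running a script directly with python (no launcher and none of these variables set) still works; the helpers simply report a single-rank world. A minimal sketch, assuming the fallback behavior described above:

    from composer.utils import dist

    # No launcher and no distributed environment variables set:
    # the module assumes a single-rank configuration.
    assert dist.get_world_size() == 1
    assert dist.get_global_rank() == 0
    assert dist.get_local_rank() == 0
    assert dist.get_node_rank() == 0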

Functions

all_gather Collects a Tensor from each rank.
all_gather_object Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
all_reduce Reduces a tensor by applying the reduce_operation.
barrier Synchronizes all processes.
broadcast Broadcasts the tensor to the whole group.
broadcast_object_list Broadcasts picklable objects in object_list to the whole group.
get_global_rank Returns the global rank of the current process in the input process group, which is on [0; group.WORLD_SIZE - 1].
get_local_rank Returns the local rank for the current process, which is on [0; LOCAL_WORLD_SIZE - 1].
get_local_world_size Returns the local world size, which is the number of processes for the current node.
get_node_rank Returns the node rank.
get_sampler Constructs a DistributedSampler for a dataset.
get_world_size Returns the world size, which is the number of processes participating in this training run.
initialize_dist Initialize the default PyTorch distributed process group.
is_available Returns whether PyTorch was built with distributed support.
is_initialized Returns whether PyTorch distributed is initialized.
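
The following is an illustrative sketch of how several of these helpers might fit together in a script launched with composer -n 8. The toy dataset, the device string passed to initialize_dist, and the shuffle keyword of get_sampler are assumptions for illustration; check the current API for exact signatures.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    from composer.utils import dist

    # Toy dataset used only for illustration.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

    # Initialize the default process group when running multi-rank.
    # The 'gpu'/'cpu' device string here is an assumption.
    if dist.get_world_size() > 1 and not dist.is_initialized():
        dist.initialize_dist('gpu' if torch.cuda.is_available() else 'cpu')

    # Shard the dataset across ranks with a DistributedSampler.
    sampler = dist.get_sampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Collect each rank's shard size on every rank; all_gather_object
    # returns a list of the gathered objects indexed by rank.
    shard_sizes = dist.all_gather_object(len(sampler))

    if dist.get_global_rank() == 0:
        print(f'samples per epoch across all ranks: {sum(shard_sizes)}')

    # Synchronize all processes before continuing.
    dist.barrier()

If the same script is run without the launcher, the single-rank fallback described above is intended to let the sampler and collectives degrade to single-process behavior.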