composer.utils.dist

Helper methods for torch.distributed.

To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run.

composer -n 8 path/to/train.py

The composer launcher will automatically configure the following environment variables, which are required for distributed training: RANK, LOCAL_RANK, NODE_RANK, WORLD_SIZE, and LOCAL_WORLD_SIZE.

If none of these environment variables are set, this module will safely assume a single-rank configuration, where:

RANK=0 LOCAL_RANK=0 NODE_RANK=0 WORLD_SIZE=1 LOCAL_WORLD_SIZE=1
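For example, running a script directly with python (no launcher and none of these variables set) still works; the helpers simply report a single-rank world. A minimal sketch, assuming the fallback behavior described above:

    from composer.utils import dist

    # No launcher and no distributed environment variables set:
    # the module assumes a single-rank configuration.
    assert dist.get_world_size() == 1
    assert dist.get_global_rank() == 0
    assert dist.get_local_rank() == 0
    assert dist.get_node_rank() == 0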

Functions

all_gather Collects a Tensor from each rank.
all_gather_object Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
all_reduce Reduces a tensor by applying the reduce_operation.
barrier Synchronizes all processes.
broadcast Broadcasts the tensor to the whole group.
broadcast_object_list Broadcasts picklable objects in object_list to the whole group.
get_global_rank Returns the global rank of the current process in the input process group, which is on [0; group.WORLD_SIZE - 1].
get_local_rank Returns the local rank for the current process, which is on [0; LOCAL_WORLD_SIZE - 1].
get_local_world_size Returns the local world size, which is the number of processes for the current node.
get_node_rank Returns the node rank.
get_sampler Constructs a DistributedSampler for a dataset.
get_world_size Returns the world size, which is the number of processes participating in this training run.
initialize_dist Initialize the default PyTorch distributed process group.
is_available Returns whether PyTorch was built with distributed support.
is_initialized Returns whether PyTorch distributed is initialized.
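
The following is an illustrative sketch of how several of these helpers might fit together in a script launched with composer -n 8. The toy dataset, the device string passed to initialize_dist, and the shuffle keyword of get_sampler are assumptions for illustration; check the current API for exact signatures.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    from composer.utils import dist

    # Toy dataset used only for illustration.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

    # Initialize the default process group when running multi-rank.
    # The 'gpu'/'cpu' device string here is an assumption.
    if dist.get_world_size() > 1 and not dist.is_initialized():
        dist.initialize_dist('gpu' if torch.cuda.is_available() else 'cpu')

    # Shard the dataset across ranks with a DistributedSampler.
    sampler = dist.get_sampler(dataset, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Collect each rank's shard size on every rank; all_gather_object
    # returns a list of the gathered objects indexed by rank.
    shard_sizes = dist.all_gather_object(len(sampler))

    if dist.get_global_rank() == 0:
        print(f'samples per epoch across all ranks: {sum(shard_sizes)}')

    # Synchronize all processes before continuing.
    dist.barrier()

If the same script is run without the launcher, the single-rank fallback described above is intended to let the sampler and collectives degrade to single-process behavior.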