composer.utils.dist
Helper methods for torch.distributed.
To use torch.distributed, launch your training script with the composer launcher for distributed training. For example, the following command launches an eight-process training run.
composer -n 8 path/to/train.py
The composer launcher will automatically configure the following environment variables, which are required for distributed training:
RANK
: The global rank of the process, which should be in [0; WORLD_SIZE - 1].

LOCAL_RANK
: The local rank of the process, which should be in [0; LOCAL_WORLD_SIZE - 1].

NODE_RANK
: The rank of the node.

WORLD_SIZE
: The total number of processes.

LOCAL_WORLD_SIZE
: The number of processes on the current node.

MASTER_ADDR
: The hostname of the rank-zero process.

MASTER_PORT
: The port for the rank-zero process.
If none of these environment variables are set, this module will safely assume a single-rank configuration, where:
RANK=0 LOCAL_RANK=0 NODE_RANK=0 WORLD_SIZE=1 LOCAL_WORLD_SIZE=1
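As a quick sanity check, the query helpers documented below can be called directly from a training script; when none of the environment variables are set, they return the single-rank defaults shown above. A minimal sketch, assuming the helper names listed in the table below (exact return types and optional arguments should be checked against the installed Composer version):

```python
# Minimal sketch: print the distributed configuration seen by this process.
# Run under the composer launcher, or standalone to see the single-rank defaults.
from composer.utils import dist

print(f"global rank:      {dist.get_global_rank()}")
print(f"local rank:       {dist.get_local_rank()}")
print(f"node rank:        {dist.get_node_rank()}")
print(f"world size:       {dist.get_world_size()}")
print(f"local world size: {dist.get_local_world_size()}")
```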
Functions
Function | Description
---|---
all_gather | Collects a Tensor from each rank.
all_gather_object | Collects a pickleable object from each rank and returns a list of these objects indexed by rank.
all_reduce | Reduces a tensor by applying the reduce_operation.
barrier | Synchronizes all processes.
broadcast | Broadcasts the tensor to the whole group.
broadcast_object_list | Broadcasts picklable objects in object_list to the whole group.
get_global_rank | Returns the global rank of the current process in the input PG, which is in [0; group.WORLD_SIZE - 1].
get_local_rank | Returns the local rank for the current process, which is in [0; LOCAL_WORLD_SIZE - 1].
get_local_world_size | Returns the local world size, which is the number of processes for the current node.
get_node_rank | Returns the node rank.
get_sampler | Constructs a DistributedSampler for a dataset.
get_world_size | Returns the world size, which is the number of processes participating in this training run.
initialize_dist | Initializes the default PyTorch distributed process group.
is_available | Returns whether PyTorch was built with distributed support.
is_initialized | Returns whether PyTorch distributed is initialized.
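The sketch below illustrates how the collectives and the sampler helper might be used together. It assumes the process group has already been initialized (for example via initialize_dist, which Composer's Trainer typically handles for you), and the argument names shown (reduce_operation, shuffle) are assumptions that should be checked against the installed version:

```python
# Minimal sketch of the collectives and sampler helper; argument names
# (reduce_operation, shuffle) are assumptions and may differ from the
# installed Composer version.
import torch
from torch.utils.data import DataLoader, TensorDataset

from composer.utils import dist

# Sum a per-rank scalar across all ranks.
loss = torch.tensor([1.0])
dist.all_reduce(loss, reduce_operation='SUM')

# Gather one tensor per rank into a sequence indexed by rank.
gathered = dist.all_gather(torch.tensor([dist.get_global_rank()]))

# Wait for every rank before proceeding (e.g. before saving a checkpoint).
dist.barrier()

# Shard a dataset across ranks with a DistributedSampler.
dataset = TensorDataset(torch.randn(128, 4))
loader = DataLoader(dataset, batch_size=16, sampler=dist.get_sampler(dataset, shuffle=True))
```

Under the eight-process launch shown earlier, gathered would hold eight one-element tensors, one per rank, and each rank's DataLoader would iterate over its own shard of the dataset.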