Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism

This section explains how the ranking mechanism of model parallelism works with tensor parallelism. It extends the Ranking Basics for Core Features of the SageMaker Model Parallelism Library. With tensor parallelism, the library introduces three types of ranking and process group APIs: smp.tp_rank() for the tensor parallel rank, smp.pp_rank() for the pipeline parallel rank, and smp.rdp_rank() for the reduced-data parallel rank. The corresponding communication process groups are the tensor parallel group (TP_GROUP), the pipeline parallel group (PP_GROUP), and the reduced-data parallel group (RDP_GROUP). These groups are defined as follows:

- A tensor parallel group (TP_GROUP) is a subset of a data parallel group over which tensor parallel distribution of modules takes place.
- A pipeline parallel group (PP_GROUP) is the group of processes over which pipeline parallelism takes place.
- A reduced-data parallel group (RDP_GROUP) is a set of processes that hold both the same pipeline parallelism partition and the same tensor parallelism partition, and that perform data parallelism (gradient allreduce) among themselves.
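As a quick illustration, the minimal sketch below prints the current process's rank in each group using these APIs together with smp.dp_rank() and smp.mp_rank() from the common API. It assumes the script is launched with the SageMaker model parallelism library enabled (for example, through a SageMaker PyTorch estimator configured with pipeline and tensor parallel degrees), so that smp.init() can pick up the configured degrees.

```python
# Minimal sketch: print the current process's rank in each process group.
# Assumes the script is launched by the SageMaker model parallelism launcher,
# so smp.init() reads the configured pipeline/tensor parallel degrees.
import smdistributed.modelparallel.torch as smp

smp.init()

print(
    f"GPU{smp.rank()}: "          # global rank (equals the GPU index on a single node)
    f"pp_rank {smp.pp_rank()}, "
    f"tp_rank {smp.tp_rank()}, "
    f"rdp_rank {smp.rdp_rank()}, "
    f"dp_rank {smp.dp_rank()}, "
    f"mp_rank {smp.mp_rank()}"
)
```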

To learn more about the communication process APIs in the SageMaker model parallelism library, see the Common API and the PyTorch-specific APIs in the SageMaker Python SDK documentation.

Figure: Ranking mechanism, parameter distribution, and associated AllReduce operations of tensor parallelism.

For example, consider the process groups for a single node with 8 GPUs, where the degree of tensor parallelism is 2, the degree of pipeline parallelism is 2, and the degree of data parallelism is 4. The upper center part of the preceding figure shows an example of a model with 4 layers. The lower left and lower right parts of the figure illustrate the 4-layer model distributed across 4 GPUs using both pipeline parallelism and tensor parallelism, where tensor parallelism is used for the middle two layers. These two lower figures are simple copies to illustrate different group boundary lines. The partitioned model is replicated for data parallelism across GPUs 0-3 and 4-7. The lower left figure shows the definitions of MP_GROUP, PP_GROUP, and TP_GROUP. The lower right figure shows RDP_GROUP, DP_GROUP, and WORLD over the same set of GPUs. The gradients for the layers and layer slices that have the same color are allreduced together for data parallelism. For example, the first layer (light blue) gets the allreduce operations across the DP_GROUP, whereas the dark orange slice in the second layer gets the allreduce operations only within the RDP_GROUP of its process. The bold dark red arrows represent tensors with the batch of the entire TP_GROUP. In this example, the ranks are assigned to the 8 GPUs as follows:

GPU0: pp_rank 0, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 0
GPU1: pp_rank 1, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 1
GPU2: pp_rank 0, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 2
GPU3: pp_rank 1, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 3
GPU4: pp_rank 0, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 0
GPU5: pp_rank 1, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 1
GPU6: pp_rank 0, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 2
GPU7: pp_rank 1, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 3
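
The assignment above follows a fixed ordering. As a cross-check, here is a small standalone sketch (no SageMaker dependency; it only assumes the ordering convention visible in the listing, with pp_rank varying fastest, then tp_rank, then rdp_rank) that reconstructs the same table from the parallelism degrees.

```python
# Standalone sketch: reconstruct the per-GPU rank assignments above from the
# parallelism degrees, assuming the ordering convention visible in the listing
# (pp_rank varies fastest, then tp_rank, then rdp_rank).
PP_DEGREE = 2   # pipeline parallel degree
TP_DEGREE = 2   # tensor parallel degree
WORLD_SIZE = 8  # GPUs on the node
RDP_DEGREE = WORLD_SIZE // (PP_DEGREE * TP_DEGREE)  # 2 in this example

for gpu in range(WORLD_SIZE):
    pp_rank = gpu % PP_DEGREE
    tp_rank = (gpu // PP_DEGREE) % TP_DEGREE
    rdp_rank = gpu // (PP_DEGREE * TP_DEGREE)
    dp_rank = rdp_rank * TP_DEGREE + tp_rank   # data parallelism spans TP x RDP
    mp_rank = tp_rank * PP_DEGREE + pp_rank    # model parallelism spans TP x PP
    print(f"GPU{gpu}: pp_rank {pp_rank}, tp_rank {tp_rank}, "
          f"rdp_rank {rdp_rank}, dp_rank {dp_rank}, mp_rank {mp_rank}")
```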

In this example, pipeline parallelism occurs across the GPU pairs (0,1), (2,3), (4,5), and (6,7). In addition, data parallelism (allreduce) takes place across GPUs 0, 2, 4, and 6, and independently across GPUs 1, 3, 5, and 7. Tensor parallelism happens over subsets of DP_GROUPs, across the GPU pairs (0,2), (1,3), (4,6), and (5,7).
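
Under the same assumed ordering convention, the following standalone sketch enumerates the GPU indices that form each process group in this example; its output reproduces the pairs and quadruples listed above.

```python
# Standalone sketch: enumerate the process groups of the 8-GPU example by
# collecting the GPUs that agree on every rank except the one the group
# parallelizes over. Uses the same assumed rank-ordering convention as the
# previous sketch.
from collections import defaultdict

PP_DEGREE, TP_DEGREE, WORLD_SIZE = 2, 2, 8

groups = defaultdict(lambda: defaultdict(list))
for gpu in range(WORLD_SIZE):
    pp_rank = gpu % PP_DEGREE
    tp_rank = (gpu // PP_DEGREE) % TP_DEGREE
    rdp_rank = gpu // (PP_DEGREE * TP_DEGREE)
    groups["PP_GROUP"][(tp_rank, rdp_rank)].append(gpu)   # pipeline stages
    groups["TP_GROUP"][(pp_rank, rdp_rank)].append(gpu)   # tensor-parallel slices
    groups["RDP_GROUP"][(pp_rank, tp_rank)].append(gpu)   # identical partitions
    groups["DP_GROUP"][(pp_rank,)].append(gpu)            # all data-parallel replicas

for name, members in groups.items():
    print(name, sorted(members.values()))
# Prints PP_GROUP pairs (0,1) (2,3) (4,5) (6,7), TP_GROUP pairs (0,2) (1,3)
# (4,6) (5,7), RDP_GROUP pairs (0,4) (1,5) (2,6) (3,7), and DP_GROUP sets
# {0,2,4,6} and {1,3,5,7}, matching the description above.
```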