Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism
This section explains how the ranking mechanism of model parallelism works with tensor parallelism. It extends the Ranking Basics for Core Features of the SageMaker Model Parallelism Library. With tensor parallelism, the library introduces three types of ranking and process group APIs: `smp.tp_rank()` for the tensor parallel rank, `smp.pp_rank()` for the pipeline parallel rank, and `smp.rdp_rank()` for the reduced-data parallel rank. The corresponding communication process groups are the tensor parallel group (`TP_GROUP`), the pipeline parallel group (`PP_GROUP`), and the reduced-data parallel group (`RDP_GROUP`). These groups are defined as follows:
- A tensor parallel group (`TP_GROUP`) is an evenly divisible subset of the data parallel group, over which tensor parallel distribution of modules takes place. When the degree of pipeline parallelism is 1, `TP_GROUP` is the same as the model parallel group (`MP_GROUP`).
- A pipeline parallel group (`PP_GROUP`) is the group of processes over which pipeline parallelism takes place. When the degree of tensor parallelism is 1, `PP_GROUP` is the same as `MP_GROUP`.
- A reduced-data parallel group (`RDP_GROUP`) is a set of processes that hold both the same pipeline parallelism partitions and the same tensor parallel partitions, and perform data parallelism among themselves. This is called the reduced-data parallel group because it is a subset of the entire data parallel group, `DP_GROUP`. For the model parameters that are distributed within the `TP_GROUP`, the gradient `allreduce` operation is performed only over the reduced-data parallel group, while for the parameters that are not distributed, the gradient `allreduce` takes place over the entire `DP_GROUP`.
- A model parallel group (`MP_GROUP`) refers to a group of processes that collectively store the entire model. It consists of the union of the `PP_GROUP`s of all the ranks that are in the `TP_GROUP` of the current process. When the degree of tensor parallelism is 1, `MP_GROUP` is equivalent to `PP_GROUP`. It is also consistent with the existing definition of `MP_GROUP` from previous `smdistributed` releases. Note that the current `TP_GROUP` is a subset of both the current `DP_GROUP` and the current `MP_GROUP`.
To learn more about the communication process APIs in the SageMaker model parallelism library, see the Common API and the PyTorch-specific APIs in the SageMaker Python SDK documentation.
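The following minimal sketch shows how these ranking APIs might be queried from inside a training script. It is illustrative only: it assumes the library has been enabled through the estimator's distribution configuration (for example, with `partitions` and `tensor_parallel_degree` set in the `modelparallel` parameters), and it simply prints the rank of the current process in each group.

```python
# Minimal sketch: print the rank of the current process in each process group.
# Assumes the job was launched with the SageMaker model parallelism library
# enabled (for example, "partitions": 2 and "tensor_parallel_degree": 2 in the
# modelparallel parameters of the estimator's distribution configuration).
import smdistributed.modelparallel.torch as smp

smp.init()  # initialize the library before querying any ranks

print(
    f"global rank {smp.rank()}: "
    f"pp_rank={smp.pp_rank()}, "    # position within the pipeline parallel group (PP_GROUP)
    f"tp_rank={smp.tp_rank()}, "    # position within the tensor parallel group (TP_GROUP)
    f"rdp_rank={smp.rdp_rank()}, "  # position within the reduced-data parallel group (RDP_GROUP)
    f"dp_rank={smp.dp_rank()}, "    # position within the data parallel group (DP_GROUP)
    f"mp_rank={smp.mp_rank()}"      # position within the model parallel group (MP_GROUP)
)
```

Each process prints one line, producing output of the same shape as the per-GPU rank listing in the example that follows.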
For example, consider the process groups for a single node with 8 GPUs, where the degree of tensor parallelism is 2, the degree of pipeline parallelism is 2, and the degree of data parallelism is 4. The upper center part of the preceding figure shows an example of a model with 4 layers. The lower left and lower right parts of the figure illustrate the 4-layer model distributed across 4 GPUs using both pipeline parallelism and tensor parallelism, where tensor parallelism is used for the middle two layers. These two lower figures are simple copies that illustrate different group boundary lines. The partitioned model is replicated for data parallelism across GPUs 0-3 and 4-7. The lower left figure shows the definitions of `MP_GROUP`, `PP_GROUP`, and `TP_GROUP`. The lower right figure shows `RDP_GROUP`, `DP_GROUP`, and `WORLD` over the same set of GPUs. The gradients for the layers and layer slices that have the same color are `allreduce`d together for data parallelism. For example, the first layer (light blue) gets the `allreduce` operations across `DP_GROUP`, whereas the dark orange slice in the second layer only gets the `allreduce` operations within the `RDP_GROUP` of its process. The bold dark red arrows represent tensors with the batch of its entire `TP_GROUP`.
GPU0: pp_rank 0, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 0
GPU1: pp_rank 1, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 1
GPU2: pp_rank 0, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 2
GPU3: pp_rank 1, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 3
GPU4: pp_rank 0, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 0
GPU5: pp_rank 1, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 1
GPU6: pp_rank 0, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 2
GPU7: pp_rank 1, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 3
In this example, pipeline parallelism occurs across the GPU pairs (0,1); (2,3); (4,5); and (6,7). In addition, data parallelism (`allreduce`) takes place across GPUs 0, 2, 4, 6, and independently over GPUs 1, 3, 5, 7. Tensor parallelism happens over subsets of `DP_GROUP`s, across the GPU pairs (0,2); (1,3); (4,6); and (5,7).
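The group memberships in this example follow directly from the parallelism degrees. The following standalone sketch (plain Python, independent of the library) reconstructs them for the 8-GPU case above. The rank-to-GPU layout it uses is inferred from the listing shown earlier (`pp_rank` varies fastest, then `tp_rank`, then `rdp_rank`); the library's actual placement is an implementation detail, so treat this only as a way to reason about the grouping.

```python
# Illustrative sketch: reproduce the process groups of the 8-GPU example from
# the degrees alone. The rank layout is inferred from the listing above and is
# not an official guarantee of how the library assigns ranks to GPUs.
from collections import defaultdict

pp_degree, tp_degree, world_size = 2, 2, 8
rdp_degree = world_size // (pp_degree * tp_degree)  # = 2

groups = defaultdict(lambda: defaultdict(list))
for gpu in range(world_size):
    pp_rank = gpu % pp_degree                     # varies fastest
    tp_rank = (gpu // pp_degree) % tp_degree
    rdp_rank = gpu // (pp_degree * tp_degree)     # varies slowest

    groups["PP_GROUP"][(tp_rank, rdp_rank)].append(gpu)  # same TP and RDP ranks
    groups["TP_GROUP"][(pp_rank, rdp_rank)].append(gpu)  # same PP and RDP ranks
    groups["RDP_GROUP"][(pp_rank, tp_rank)].append(gpu)  # same PP and TP ranks
    groups["DP_GROUP"][pp_rank].append(gpu)              # same PP rank
    groups["MP_GROUP"][rdp_rank].append(gpu)             # same RDP rank

for name, members in groups.items():
    print(name, sorted(members.values()))
# PP_GROUP  [[0, 1], [2, 3], [4, 5], [6, 7]]
# TP_GROUP  [[0, 2], [1, 3], [4, 6], [5, 7]]
# RDP_GROUP [[0, 4], [1, 5], [2, 6], [3, 7]]
# DP_GROUP  [[0, 2, 4, 6], [1, 3, 5, 7]]
# MP_GROUP  [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The printed groups match the pairings described above: pipeline parallelism over (0,1), (2,3), (4,5), (6,7); tensor parallelism over (0,2), (1,3), (4,6), (5,7); and reduced-data parallelism over (0,4), (1,5), (2,6), (3,7), which are subsets of the two `DP_GROUP`s.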