Distribution Communication

In distributed training, different processes sometimes need to apply different logic depending on their ranks, local ranks, etc. They also need to communicate with each other and synchronize data. These demands rely on distributed communication. PyTorch provides a set of basic distributed communication primitives, and MMEngine builds higher-level APIs on top of them to meet more diverse demands. Using the APIs provided by MMEngine, modules can:

- work in both distributed and non-distributed environments without changing code
- transfer data types other than Tensor between processes

These APIs fall roughly into three categories:

- Initialization
- Query and control
- Collective communication

We describe these APIs in detail in the following sections.

Initialization
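
In a distributed run, the process group is typically set up once at the start of the program. A minimal sketch using `mmengine.dist.init_dist` follows; the choice of the `pytorch` launcher (for jobs started with `torchrun`) and the `nccl` backend are assumptions made for illustration.

```python
from mmengine.dist import get_rank, get_world_size, init_dist

# A minimal sketch: set up the default process group for a job launched
# with `torchrun` (the "pytorch" launcher). NCCL is assumed here for GPU
# training; "gloo" would be the usual choice for CPU-only runs.
init_dist(launcher='pytorch', backend='nccl')

print(f'rank {get_rank()} of {get_world_size()} processes is ready')
```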

Query and control

The query and control functions are all argument-free and can be used in both distributed and non-distributed environments. They cover querying the world size, the (local) rank, and the backend of the current process group, checking whether the current process is the main one, and synchronizing processes with a barrier, as shown in the sketch below.
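
A minimal sketch of these functions follows, using `get_world_size`, `get_rank`, `is_main_process`, `master_only`, and `barrier` from `mmengine.dist`. In a non-distributed run they fall back to trivial values (a world size of 1, rank 0) and `barrier` becomes a no-op; `save_checkpoint` is only a placeholder function defined for the example.

```python
from mmengine.dist import (barrier, get_rank, get_world_size,
                           is_main_process, master_only)

# Query the global layout of the job. In a non-distributed run these
# return world_size == 1 and rank == 0.
world_size = get_world_size()
rank = get_rank()
print(f'process {rank} / {world_size}')

# Run a piece of logic only on the main process (rank 0).
if is_main_process():
    print('only rank 0 reaches this branch')


# The same effect as a decorator: the body executes on the main process
# and is skipped on all other ranks.
@master_only
def save_checkpoint():
    print('saving checkpoint on rank 0')


save_checkpoint()

# Synchronize all processes before moving on (a no-op when not distributed).
barrier()
```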

Collective communication

Collective communication functions are used to transfer data between processes in the same process group. MMEngine provides wrappers around the native PyTorch primitives, including all_reduce, all_gather, gather, and broadcast. These wrappers are compatible with non-distributed environments and support data types other than Tensor; a usage sketch follows.
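
A minimal sketch, assuming the `all_reduce`, `all_gather`, and `broadcast` wrappers in `mmengine.dist`: `all_reduce` and `broadcast` operate in place on the given tensor, `all_gather` returns a list with one entry per process, and the reduction op is passed as the string `'sum'`. When the job is not distributed, the calls degrade gracefully (the data stays unchanged or is gathered into a single-element list).

```python
import torch

from mmengine.dist import all_gather, all_reduce, broadcast, get_rank

# Each process holds a tensor with its own rank as the value.
data = torch.tensor([get_rank()], dtype=torch.float32)

# Sum the tensor across all processes; the reduction happens in place,
# so every process ends up with the same summed tensor.
all_reduce(data, op='sum')

# Collect one tensor from every process into a list (length == world size).
gathered = all_gather(torch.tensor([get_rank()]))

# Broadcast a tensor from rank 0 to all other processes, in place.
flag = torch.tensor([1 if get_rank() == 0 else 0])
broadcast(flag, src=0)

print(f'rank {get_rank()}: reduced={data.item()}, '
      f'gathered={gathered}, flag={flag.item()}')
```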