DDP NCCL Parameters For Performance · Issue #7179 · Lightning-AI/pytorch-lightning
🚀 Feature
Motivation
In several experiments with DDP on the NCCL backend, the following environment variables proved important to tune for communication performance:
- NCCL_NSOCKS_PERTHREAD (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nsocks-perthread)
- NCCL_SOCKET_NTHREADS (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-nthreads)
- NCCL_MIN_NCHANNELS (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-min-nchannels)
For example:
- XLM-RoBERTa (https://arxiv.org/abs/1911.02116): 30% speedup with NCCL_NSOCKS_PERTHREAD=4 and NCCL_SOCKET_NTHREADS=2
- Detectron2 (https://github.com/facebookresearch/detectron2): 15% speedup with NCCL_NSOCKS_PERTHREAD=4 and NCCL_SOCKET_NTHREADS=2
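
As a rough illustration of how these settings can be tried today without any change to Lightning: NCCL reads its environment variables when it initializes, so they need to be in the environment of each training process before the process group is created. The sketch below assumes nothing about Lightning internals. The helper name `apply_nccl_tuning` and the optional `environ` argument are hypothetical and exist only for this example.

```python
import os

# Values taken from the examples above. Whether they help depends on the
# network and the workload, so treat them as a starting point only.
NCCL_TUNING = {
    "NCCL_NSOCKS_PERTHREAD": "4",
    "NCCL_SOCKET_NTHREADS": "2",
}


def apply_nccl_tuning(settings, environ=None):
    """Hypothetical helper: export NCCL settings unless the user already set them."""
    environ = os.environ if environ is None else environ
    for name, value in settings.items():
        # setdefault keeps any value the user exported in the shell
        environ.setdefault(name, str(value))
    # Return the values that are actually in effect
    return {name: environ[name] for name in settings}


# Call this before the DDP process group is initialized
applied = apply_nccl_tuning(NCCL_TUNING)
print(applied)
```

Using `setdefault` rather than plain assignment is a design choice for this sketch: a value exported in the launch script wins over the default in code, which makes it easier to sweep settings from the command line.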
Pitch
We could pass these parameters through kwargs (similarly to find_unused_parameters) and set them in setup_environment when initializing the DDP process.
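
The pitch could look roughly like the sketch below. It is not Lightning's actual plugin: `DDPPluginSketch`, the lowercase keyword names, and the `environ` argument are all assumptions made for illustration. Only the method name `setup_environment` and the `find_unused_parameters` kwarg come from the text above.

```python
import os

# NCCL variables named in this issue
NCCL_ENV_KEYS = (
    "NCCL_NSOCKS_PERTHREAD",
    "NCCL_SOCKET_NTHREADS",
    "NCCL_MIN_NCHANNELS",
)


class DDPPluginSketch:
    """Hypothetical stand-in for a DDP plugin; not Lightning's real class."""

    def __init__(self, **kwargs):
        # Split NCCL settings off from kwargs. The remaining kwargs
        # (for example find_unused_parameters) would be forwarded to
        # DistributedDataParallel as before.
        self.nccl_env = {
            key: str(kwargs.pop(key.lower()))
            for key in NCCL_ENV_KEYS
            if key.lower() in kwargs
        }
        self.ddp_kwargs = kwargs

    def setup_environment(self, environ=None):
        # Export the settings first; a real plugin would initialize the
        # process group after this point.
        environ = os.environ if environ is None else environ
        environ.update(self.nccl_env)


plugin = DDPPluginSketch(
    find_unused_parameters=False,
    nccl_nsocks_perthread=4,
    nccl_socket_nthreads=2,
)
plugin.setup_environment()
print(plugin.nccl_env, plugin.ddp_kwargs)
```

Variables that are not passed are left untouched, so existing shell exports for them keep working.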