Megatron-SWIFT Training — swift 3.6.0.dev0 documentation

SWIFT incorporates Megatron’s parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports the pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the Deepseek-R1 distillation series. For a complete list of supported models, please refer to the Supported Models and Datasets documentation.

Environment Setup

To use Megatron-SWIFT, in addition to installing the swift dependencies, you also need to install the following:

Recommended PyTorch version: 2.5 / 2.6

pip install pybind11

transformer_engine

If an installation error occurs, you can refer to this issue for resolution: https://github.com/modelscope/ms-swift/issues/3793

pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.3

apex

git clone https://github.com/NVIDIA/apex
cd apex

If an installation error occurs, refer to this issue: https://github.com/modelscope/ms-swift/issues/4176

git checkout e13873debc4699d39c6861074b9a3b2a02327f92
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

megatron-core

pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.12.0

Alternatively, you can use one of the following images:

modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1
modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.5.post1-modelscope1.27.0-swift3.5.1

swift will clone and install the training module from the Megatron-LM dependency via git clone. Alternatively, you can set the environment variable MEGATRON_LM_PATH to the path of an already downloaded repository (for offline environments, use the core_r0.12.0 branch), as sketched below.
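
For example, in an offline environment you could prepare the repository in advance and point swift at the local copy (the local path below is illustrative):

git clone -b core_r0.12.0 https://github.com/NVIDIA/Megatron-LM.git /path/to/Megatron-LM
export MEGATRON_LM_PATH=/path/to/Megatron-LM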

Quick Start Example

This section walks through a quick-start example of self-cognition fine-tuning for the Qwen2.5-7B-Instruct model on two 80GiB A100 GPUs. The full walkthrough can be completed within 10 minutes.

First, we need to convert the weights from HF (Hugging Face) format to Megatron format:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen2.5-7B-Instruct-mcore

Next, use the following script to start training; it requires 2*80GiB of GPU memory:

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --tensor_model_parallel_size 2 \
    --sequence_parallel true \
    --micro_batch_size 16 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 10 \
    --min_lr 1e-6 \
    --max_epochs 1 \
    --save megatron_output/Qwen2.5-7B-Instruct \
    --save_interval 100 \
    --max_length 2048 \
    --system 'You are a helpful assistant.' \
    --num_workers 4 \
    --no_save_optim true \
    --no_save_rng true \
    --dataset_num_proc 4 \
    --model_author swift \
    --model_name swift-robot
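
A note on the batch-size arithmetic above (this describes Megatron's standard behavior): the number of gradient-accumulation steps is global_batch_size / (micro_batch_size * data-parallel size). With tensor_model_parallel_size 2 on two GPUs, the data-parallel size is 1, so 16 / (16 * 1) = 1 and no gradient accumulation occurs; raising global_batch_size while keeping micro_batch_size fixed would add accumulation steps rather than per-GPU memory.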

Finally, convert the Megatron format weights back to HF format:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf

We then perform inference on the generated HF format weights:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048

The inference results are as follows:

<<< who are you?
I am a language model developed by swift, you can call me swift-robot. How can I assist you?

Benchmark

The speed comparison of full-parameter training for Dense/MoE models using megatron sft and swift sft on a single machine with eight A800 GPUs is shown below. The corresponding scripts can be found here.

Dense Qwen2.5-14B:

                   Megatron-LM   Deepspeed-ZeRO2   Deepspeed-ZeRO3
Training Speed     9.04s/it      10.32s/it         10.56s/it
GPU Memory Usage   8*64GB        8*80GB            8*58GB

MoE Qwen1.5-MoE-A2.7B:

                   Megatron-LM   Deepspeed-ZeRO2   Deepspeed-ZeRO3
Training Speed     2.93s/it      6.02s/it          24.30s/it
GPU Memory Usage   8*66GB        8*72GB            8*50GB

Command Line Arguments

Megatron Parameters

Training Parameters:

Learning Rate Parameters:

Regularization Parameters:

Checkpoint Parameters:

Distributed Parameters:

Logging Parameters:

Evaluation Parameters:

Mixed Precision Parameters:

Model Parameters: (these typically do not need to be set, as they are configured automatically from the HF model's config.json)

MoE Parameters:

DPO Parameters

Training Parameters

Megatron training parameters inherit from Megatron parameters and basic parameters. For information on basic parameters, see here. Additionally, the following parameters are included:

RLHF Parameters

In addition to inheriting the training parameters, the following parameters are also supported: