
This document is relevant for: Inf2, Trn1, Trn2

Training GPT-NeoX 20B with Tensor Parallelism and ZeRO-1 Optimizer#

In this section, we showcase how to pretrain a GPT-NeoX 20B model using the sequence parallel optimization of tensor parallelism in the neuronx-distributed package. Please refer to the Neuron Samples repository to view the files in this tutorial.

This GPT-NeoX 20B tutorial differs from the GPT-NeoX 6.9B tutorial mainly in that it enables sequence parallelism and parallel cross entropy, both of which are described below.

Setting up the environment is the same as in the GPT-NeoX 6.9B tutorial.

Let’s download the scripts for pretraining:

cd ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/
ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/adamw_fp32_optim_params.py ./
ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/get_dataset.py ./
ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/requirements.txt ./
python3 -m pip install -r requirements.txt

Next let’s download and pre-process the dataset:
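
Assuming the get_dataset.py script symlinked in the previous step is invoked the same way as in the GPT-NeoX 6.9B tutorial, this is a single command:

python3 get_dataset.py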

At this point, you are all set to start training.

Running training

We first pre-compile the graphs using neuron_parallel_compile. Let’s run the command below:

sbatch --exclusive \
--nodes 4 \
--cpus-per-task 128 \
--wrap="srun neuron_parallel_compile bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh"

This script uses a tensor-parallel size of 32, which automatically sets the ZeRO-1 sharding degree to 4 (4 nodes * 32 workers per node / tensor_parallel_size). Once the graphs are compiled, we can run training and observe the loss going down. To run the training, we use the same command as above, but without neuron_parallel_compile.
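
As a quick sanity check of that degree calculation, here is the arithmetic as a minimal Python sketch (the variable names are illustrative, not taken from the training script):

# Illustrative: how the ZeRO-1 (data-parallel) sharding degree is derived.
nodes = 4
workers_per_node = 32          # e.g. the 32 NeuronCores of a trn1.32xlarge node
tensor_parallel_size = 32      # set by the training script

world_size = nodes * workers_per_node                        # 128 workers in total
zero1_sharding_degree = world_size // tensor_parallel_size   # 128 / 32 = 4
print(zero1_sharding_degree)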

sbatch --exclusive \
--nodes 4 \
--cpus-per-task 128 \
--wrap="srun bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh"

Sequence Parallel

To enable sequence parallel, we made modifications at the model level and enabled the corresponding options at the training script level. Please check modeling_gpt_neox_nxd.py and tp_dp_gpt_neox_20b_hf_pretrain.py for details.
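
Conceptually, sequence parallelism keeps the layer norm (and dropout) activations split along the sequence dimension across the tensor-parallel ranks, and gathers them back before the tensor-parallel matrix multiplies. Below is a minimal single-process sketch of that reshaping to illustrate the idea; it is not the neuronx-distributed implementation, and in a real run the scatter/gather steps are collective communications between ranks rather than torch.chunk/torch.cat:

import torch

tp_degree = 4
seq_len, batch, hidden = 2048, 1, 6144   # hypothetical GPT-NeoX-like shapes
activations = torch.randn(seq_len, batch, hidden)

# Scatter: each tensor-parallel rank keeps seq_len / tp_degree of the sequence,
# so layer norm and dropout only touch 1/tp_degree of the activation memory.
shards = torch.chunk(activations, tp_degree, dim=0)

# Layer norm normalizes over the hidden dimension, so applying it per
# sequence shard is exact.
norm = torch.nn.LayerNorm(hidden)
normed_shards = [norm(s) for s in shards]

# Gather: before the column-parallel linear layers, the shards are gathered
# along the sequence dimension to rebuild the full sequence.
gathered = torch.cat(normed_shards, dim=0)
assert torch.allclose(gathered, norm(activations), atol=1e-5)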

Parallel Cross Entropy

To enable parallel cross entropy, we made model-level modifications so that the loss is computed directly on the vocabulary-sharded (tensor-parallel) logits rather than on gathered full-vocabulary logits. Please check modeling_gpt_neox_nxd.py for details.
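
For reference, the idea behind parallel cross entropy is that each tensor-parallel rank holds only a slice of the vocabulary logits, and the ranks exchange per-token max and sum-of-exponentials terms instead of gathering the full logits. The following single-process sketch illustrates the math; the reductions over shards stand in for all-reduce collectives, and this is not the neuronx-distributed kernel:

import torch

torch.manual_seed(0)
tp_degree, tokens, vocab = 4, 8, 1024
logits = torch.randn(tokens, vocab)
labels = torch.randint(0, vocab, (tokens,))

# Each tensor-parallel rank owns a contiguous slice of the vocabulary.
shards = torch.chunk(logits, tp_degree, dim=-1)
shard_size = vocab // tp_degree

# Per-token max across all shards (all-reduce MAX), for numerical stability.
global_max = torch.stack([s.max(dim=-1).values for s in shards]).max(dim=0).values

# Per-token sum of exp(logit - max) across all shards (all-reduce SUM).
global_sum = sum(torch.exp(s - global_max[:, None]).sum(dim=-1) for s in shards)

# Each rank contributes the target logit only when the label falls in its
# vocabulary slice; summing the contributions (all-reduce SUM) recovers it.
target_logit = torch.zeros(tokens)
for rank, shard in enumerate(shards):
    lo = rank * shard_size
    in_shard = (labels >= lo) & (labels < lo + shard_size)
    local_idx = torch.where(in_shard, labels - lo, torch.zeros_like(labels))
    picked = shard.gather(1, local_idx[:, None]).squeeze(1)
    target_logit += torch.where(in_shard, picked, torch.zeros_like(picked))

loss = (torch.log(global_sum) + global_max - target_logit).mean()
assert torch.allclose(loss, torch.nn.functional.cross_entropy(logits, labels), atol=1e-5)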
