Nvidia NeMo — SkyPilot documentation
Source: examples/nemo
This example shows how to launch Nvidia NeMo jobs with SkyPilot.
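All of the YAMLs below are launched the same way: install SkyPilot with access to at least one cloud, verify credentials, and point sky launch at the file. A minimal sketch, assuming GCP as the backing cloud (any cloud enabled in your SkyPilot installation works):

# Install SkyPilot with GCP support and verify that credentials are picked up.
pip install -U "skypilot[gcp]"
sky check

# Launch one of the examples below and stream its logs.
sky launch -c nemo_bert nemo_bert.yaml
sky logs nemo_bert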
Included files
nemo_bert.yaml
Distributed training of a BERT model with Nvidia NeMo
Finetunes a BERT-like model on the GLUE CoLA task. Uses the NeMo toolkit
to train across multiple nodes, with one V100 GPU per node.
Uses the glue_benchmark.py script from the NeMo examples:
https://github.com/NVIDIA/NeMo/blob/2ce45369f7ab6cd20c376d1ed393160f5e54be0c/examples/nlp/glue_benchmark/glue_benchmark.py
Usage:
sky launch -c nemo_bert nemo_bert.yaml
# Or try on spot A100 GPUs:
sky launch -c nemo_bert nemo_bert.yaml --use-spot --gpus A100:1
# Terminate cluster after you're done
sky down nemo_bert
resources:
  accelerators: V100:1
num_nodes: 2
setup: |
  conda activate nemo
  if [ $? -eq 0 ]; then
    echo "conda env exists"
  else
    conda create -y --name nemo python==3.10.12
    conda activate nemo
# Install PyTorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install nemo
sudo apt-get update
sudo apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install nemo_toolkit['all']
# Clone the NeMo repo to get the examples
git clone https://github.com/NVIDIA/NeMo.git
# Download GLUE dataset
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/70e86a10fbf4ab4ec3f04c9ba82ba58f87c530bf/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks CoLA
fi
run: |
  conda activate nemo
# Get the number of nodes and master address from SkyPilot envvars
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
# Run glue_benchmark.py
python -m torch.distributed.run \
  --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
  --nnodes=${num_nodes} \
  --node_rank=${SKYPILOT_NODE_RANK} \
  --master_addr=${master_addr} \
  --master_port=8008 \
  NeMo/examples/nlp/glue_benchmark/glue_benchmark.py \
  model.dataset.data_dir=glue_data/CoLA \
  model.task_name=cola \
  trainer.max_epochs=10 \
  trainer.num_nodes=${num_nodes}
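The run block above derives the torchrun rendezvous parameters from the SKYPILOT_* environment variables that SkyPilot sets inside every task: SKYPILOT_NODE_IPS is a newline-separated list of node IPs, so its line count gives the node count and its first line is used as the master address. Once the job is running, the usual follow-up is to watch it and eventually tear the cluster down. A minimal sketch with the SkyPilot CLI, assuming the cluster name nemo_bert from the usage above:

sky status                 # show all clusters and their states
sky queue nemo_bert        # list jobs submitted to the cluster
sky logs nemo_bert         # stream the logs of the most recent job
ssh nemo_bert              # SkyPilot adds launched clusters to ~/.ssh/config,
                           # so the head node is reachable by its cluster name
sky down nemo_bert         # terminate the cluster when finished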
nemo_gpt_distributed.yaml
Distributed training of a GPT-style model with Nvidia NeMo on multiple nodes.
Inspired by https://github.com/NVIDIA/NeMo/blob/main/docs/source/nlp/nemo_megatron/gpt/gpt_training.rst
Note that we provide a read-only bucket at gs://sky-wiki-data that is used to
download preprocessed data to local disk. If you want to preprocess the data
yourself, see nemo_gpt_preprocessing.yaml.
We use a shared bucket to store the index files that are used to coordinate
between the head and worker nodes. This shared bucket is mounted as a
network filesystem (NFS) on the head and worker nodes.
After the script completes, the model checkpoints will be saved in
/ckpts on the head node (can be changed to /shared for cloud storage).
Usage:
sky launch --env SHARED_NFS_BUCKET_NAME=<unique_bucket_name> -c nemo_gpt nemo_gpt_distributed.yaml
# Terminate cluster after you're done
sky down nemo_gpt
resources:
  cpus: 8+
  memory: 64+
  accelerators: A100-80GB:1
  image_id: docker:nvcr.io/nvidia/nemo:24.05
num_nodes: 2
envs:
  DATASET_ROOT: /wiki
  SHARED_NFS_ROOT: /shared
  SHARED_NFS_BUCKET_NAME: # Enter a unique bucket name here for the shared directory - if it doesn't exist SkyPilot will create it
  CHECKPOINT_PATH: /ckpts # Store checkpoints at a local path. You can change this to /shared for checkpointing to the cloud bucket at every callback, but this will slow down training.
file_mounts:
  ${DATASET_ROOT}:
    source: gs://sky-wiki-data # This is a read-only bucket provided by SkyPilot for the dataset
    mode: COPY
  # The SHARED_NFS_ROOT path acts as a network filesystem (NFS) between the
  # head and worker nodes. In NeMo, the head node writes an indexmap to this
  # shared filesystem that is read by the workers.
  #
  # Note that NeMo requires this shared filesystem to be strongly consistent -
  # any writes made by the head should be immediately visible to the workers.
  ${SHARED_NFS_ROOT}:
    name: ${SHARED_NFS_BUCKET_NAME}
    store: gcs # We recommend using GCS in mount mode - S3 based mounts may fail with "transport endpoint is not connected" error.
    mode: MOUNT
setup: |
  conda deactivate
# Clone the NeMo repo if not already present
if [ ! -d NeMo ]; then
  git clone https://github.com/NVIDIA/NeMo.git
  cd NeMo
  git checkout 5df8e11255802a2ce2f33db6362e60990e215b64
fi
run: |
  conda deactivate
# ============= Training =============
# Get the number of nodes and master address from SkyPilot envvars
num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
# Kill any existing megatron processes
pkill -f -9 megatron
mkdir -p ${CHECKPOINT_PATH}
echo "Writing checkpoints to ${CHECKPOINT_PATH}" echo "Writing index files to shared storage ${SHARED_NFS_ROOT}"
python -m torch.distributed.run \
  --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE} \
  --nnodes=${num_nodes} \
  --node_rank=${SKYPILOT_NODE_RANK} \
  --master_addr=${master_addr} \
  --master_port=12375 \
  NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=conf \
  --config-name=megatron_gpt_config \
  trainer.devices=${SKYPILOT_NUM_GPUS_PER_NODE} \
  trainer.num_nodes=${num_nodes} \
  trainer.max_epochs=null \
  trainer.max_steps=300000 \
  trainer.val_check_interval=50 \
  trainer.log_every_n_steps=50 \
  trainer.limit_val_batches=50 \
  trainer.limit_test_batches=50 \
  trainer.accumulate_grad_batches=1 \
  trainer.precision=16 \
  model.mcore_gpt=True \
  model.micro_batch_size=6 \
  model.global_batch_size=192 \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=1 \
  model.max_position_embeddings=1024 \
  model.encoder_seq_length=1024 \
  model.hidden_size=768 \
  model.ffn_hidden_size=3072 \
  model.num_layers=12 \
  model.num_attention_heads=12 \
  model.init_method_std=0.021 \
  model.hidden_dropout=0.1 \
  model.layernorm_epsilon=1e-5 \
  model.tokenizer.vocab_file=${DATASET_ROOT}/gpt2-vocab.json \
  model.tokenizer.merge_file=${DATASET_ROOT}/gpt2-merges.txt \
  model.data.data_prefix=[1.0,${DATASET_ROOT}/hfbpe_gpt_training_data_text_document] \
  model.data.num_workers=2 \
  model.data.seq_length=1024 \
  model.data.splits_string='980,10,10' \
  model.data.index_mapping_dir=${SHARED_NFS_ROOT} \
  model.optim.name=fused_adam \
  model.optim.lr=6e-4 \
  model.optim.betas=[0.9,0.95] \
  model.optim.weight_decay=0.1 \
  model.optim.sched.name=CosineAnnealing \
  model.optim.sched.warmup_steps=750 \
  model.optim.sched.constant_steps=80000 \
  model.optim.sched.min_lr=6e-5 \
  exp_manager.resume_if_exists=True \
  exp_manager.resume_ignore_no_checkpoint=True \
  exp_manager.create_checkpoint_callback=True \
  +exp_manager.checkpoint_callback_params.dirpath=${CHECKPOINT_PATH} \
  exp_manager.checkpoint_callback_params.monitor=val_loss \
  exp_manager.checkpoint_callback_params.save_top_k=3 \
  exp_manager.checkpoint_callback_params.mode=min \
  exp_manager.checkpoint_callback_params.always_save_nemo=True
# Optional - if writing checkpoints to a local directory,
# copy final checkpoints to the shared bucket at the end of training (~6 GB)
if [ ${SKYPILOT_NODE_RANK} -eq 0 ]; then
mkdir -p ${SHARED_NFS_ROOT}/results
  cp -R ${CHECKPOINT_PATH} ${SHARED_NFS_ROOT}/results/
fi
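Because the head node copies the final checkpoints into the shared GCS bucket, you can pull them to your workstation even after the cluster is terminated. A minimal sketch, assuming you launched with a hypothetical bucket name my-nemo-shared and have gsutil configured locally:

# Launch training, pointing the shared mount at your bucket.
sky launch --env SHARED_NFS_BUCKET_NAME=my-nemo-shared -c nemo_gpt nemo_gpt_distributed.yaml

# After training finishes, fetch the checkpoints that were copied to
# ${SHARED_NFS_ROOT}/results (./nemo_ckpts is an arbitrary local directory),
# then tear the cluster down.
gsutil -m cp -r gs://my-nemo-shared/results ./nemo_ckpts
sky down nemo_gpt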
nemo_gpt_preprocessing.yaml
Prepares the wiki dataset for training with NeMo. Downloads the data, runs
preprocessing and saves the data in mmap format on a cloud bucket. This same
bucket can then be used for training.
This YAML is for demonstration purposes and is not a necessary step before
running nemo_gpt_train.yaml. Since this preprocessing can take
up to 6 hours, we provide a read-only bucket with the preprocessed data (gs://sky-wiki-data)
that can be downloaded to your bucket (see nemo_gpt_train.yaml).
Usage:
sky launch -s -c nemo_gpt_preprocessing nemo_gpt_preprocessing.yaml
# Terminate cluster after you're done
sky down nemo_gpt_preprocessing
num_nodes: 1
envs:
  LOCAL_DATASET_ROOT: /wiki
  DATASET_BUCKET_ROOT: /bucket
  BUCKET_NAME: # Enter a unique bucket name here - if it doesn't exist SkyPilot will create it
file_mounts:
  ${DATASET_BUCKET_ROOT}:
    name: ${BUCKET_NAME}
    store: gcs # We recommend using GCS for large datasets in mount mode - S3 based mounts may fail with "transport endpoint is not connected" error.
    mode: MOUNT
setup: |
  conda activate nemo
  if [ $? -eq 0 ]; then
    echo "Nemo conda env exists"
  else
    echo "Setup start"
conda create -y --name nemo python==3.10.12
conda activate nemo
# Install PyTorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install nemo
git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
git checkout b4ad7eaa7873d632391d6985aa6b359f39c20bab
pip install Cython
pip install .[all]
cd ..
# Install megatron-core
# We install in editable mode because setup.py does not install all
# required modules if we install in non-editable mode.
git clone https://github.com/NVIDIA/Megatron-LM
cd Megatron-LM
git checkout dc21350806361564b8ce61d4a8d247cb195cc5f0
pip install -e .
cd ..
# Install ninja for faster compilation
pip install ninja packaging
# Install transformer engine and flash-attn (Takes ~1hr to compile)
MAX_JOBS=4 pip install flash-attn==2.0.4 --no-build-isolation # Version upper capped by TransformerEngine
MAX_JOBS=4 pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
pip install pytorch-extension
# Install Apex
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 52e18c894223800cb611682dce27d88050edf1de
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
cd ..
fi
run: |
  conda activate nemo
# ======== Download and preprocess the wikipedia dataset ========
if [ -f ${LOCAL_DATASET_ROOT}/train_data.jsonl ]; then
  echo "Dataset exists"
else
  # Install axel for faster downloads
  sudo apt-get install -y axel
mkdir -p ${LOCAL_DATASET_ROOT}
cd ${LOCAL_DATASET_ROOT}
# Download the wikipedia dataset (takes ~15 min)
axel -n 20 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# Preprocess the wikipedia dataset (takes ~2 hours)
pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl
fi
# ======== Download tokenizer files ========
# Check if the tokenizer files exist
if [ -f ${LOCAL_DATASET_ROOT}/gpt2-vocab.json ]; then
  echo "Tokenizer files exist"
else
  # Download the tokenizer files
  cd ${LOCAL_DATASET_ROOT}
  axel -n 20 https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
  axel -n 20 https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
fi
# ======== Convert data to mmap format and write to bucket ========
# Check if the mmap files exist
if [ -f ${LOCAL_DATASET_ROOT}/hfbpe_gpt_training_data_text_document.bin ]; then
echo "Mmap files exist"
else
  # Convert the data to mmap format
cd ${LOCAL_DATASET_ROOT}
python $HOME/sky_workdir/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
  --input=train_data.jsonl \
  --json-keys=text \
  --tokenizer-library=megatron \
  --vocab gpt2-vocab.json \
  --dataset-impl mmap \
  --tokenizer-type GPT2BPETokenizer \
  --merge-file gpt2-merges.txt \
  --output-prefix=hfbpe_gpt_training_data \
  --append-eod \
  --workers=32
fi
echo "Done preprocessing dataset, copying to mounted bucket now." cp {gpt2-merges.txt,gpt2-vocab.json,hfbpe_gpt_training_data_text_document.bin,hfbpe_gpt_training_data_text_document.idx} ${DATASET_BUCKET_ROOT} echo "Done copying - data is now available on ${BUCKET_NAME} bucket."
nemo_gpt_singlenode.yaml
Single-node training of a GPT-style model with Nvidia NeMo.
This script downloads data from the read-only bucket at gs://sky-wiki-data.
If you want to preprocess the data yourself, see nemo_gpt_preprocessing.yaml.
The specific model used here should fit on a GPU with 16 GB of memory.
After the script completes, the model checkpoints will be saved in
/ckpts (configurable through CHECKPOINT_PATH env var) on the head node.
Usage:
sky launch -c nemo_gpt nemo_gpt_singlenode.yaml
# Or try on spot A100 GPUs:
sky launch -c nemo_gpt nemo_gpt_singlenode.yaml --use-spot --gpus A100:1
# Terminate cluster after you're done
sky down nemo_gpt
resources:
  cpus: 8+
  memory: 64+
  accelerators: A100-80GB:1
  image_id: docker:nvcr.io/nvidia/nemo:24.05
num_nodes: 1
envs:
  DATASET_ROOT: /wiki
  CHECKPOINT_PATH: /ckpts
file_mounts:
  ${DATASET_ROOT}:
    source: gs://sky-wiki-data # This is a read-only bucket provided by SkyPilot for the dataset
    mode: COPY
setup: |
  conda deactivate
# Clone the NeMo repo if not already present
if [ ! -d NeMo ]; then
  git clone https://github.com/NVIDIA/NeMo.git
  cd NeMo
  git checkout 5df8e11255802a2ce2f33db6362e60990e215b64
fi
# Install gsutil if it doesn't exist
if ! command -v gsutil &> /dev/null; then
  pip install gsutil
else
  echo "gsutil exists"
fi
run: |
  conda deactivate
# Kill any existing megatron processes
pkill -f -9 megatron
mkdir -p ${CHECKPOINT_PATH}
# ============= Training =============
python NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=conf \
  --config-name=megatron_gpt_config \
  trainer.devices=${SKYPILOT_NUM_GPUS_PER_NODE} \
  trainer.num_nodes=1 \
  trainer.max_epochs=null \
  trainer.max_steps=300000 \
  trainer.val_check_interval=50 \
  trainer.log_every_n_steps=50 \
  trainer.limit_val_batches=50 \
  trainer.limit_test_batches=50 \
  trainer.accumulate_grad_batches=1 \
  trainer.precision=16 \
  model.mcore_gpt=True \
  model.micro_batch_size=6 \
  model.global_batch_size=192 \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=1 \
  model.max_position_embeddings=1024 \
  model.encoder_seq_length=1024 \
  model.hidden_size=768 \
  model.ffn_hidden_size=3072 \
  model.num_layers=12 \
  model.num_attention_heads=12 \
  model.init_method_std=0.021 \
  model.hidden_dropout=0.1 \
  model.layernorm_epsilon=1e-5 \
  model.tokenizer.vocab_file=${DATASET_ROOT}/gpt2-vocab.json \
  model.tokenizer.merge_file=${DATASET_ROOT}/gpt2-merges.txt \
  model.data.data_prefix=[1.0,${DATASET_ROOT}/hfbpe_gpt_training_data_text_document] \
  model.data.num_workers=2 \
  model.data.seq_length=1024 \
  model.data.splits_string='980,10,10' \
  model.optim.name=fused_adam \
  model.optim.lr=6e-4 \
  model.optim.betas=[0.9,0.95] \
  model.optim.weight_decay=0.1 \
  model.optim.sched.name=CosineAnnealing \
  model.optim.sched.warmup_steps=750 \
  model.optim.sched.constant_steps=80000 \
  model.optim.sched.min_lr=6e-5 \
  exp_manager.resume_if_exists=True \
  exp_manager.resume_ignore_no_checkpoint=True \
  exp_manager.create_checkpoint_callback=True \
  +exp_manager.checkpoint_callback_params.dirpath=${CHECKPOINT_PATH} \
  exp_manager.checkpoint_callback_params.monitor=val_loss \
  exp_manager.checkpoint_callback_params.save_top_k=3 \
  exp_manager.checkpoint_callback_params.mode=min \
  exp_manager.checkpoint_callback_params.always_save_nemo=False
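In this single-node setup the checkpoints stay on the head node's local disk at /ckpts, so fetch them before terminating the cluster. A minimal sketch, relying on the SSH config entry SkyPilot creates for the cluster name (the local directory name ./nemo_gpt_ckpts is arbitrary):

# Copy the checkpoint directory from the head node to the local machine,
# then terminate the cluster.
rsync -avz nemo_gpt:/ckpts ./nemo_gpt_ckpts
sky down nemo_gpt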