# NeMo RL: A Scalable and Efficient Post-Training Library

NeMo RL is a scalable and efficient post-training library that supports everything from single-GPU setups to thousands of GPUs, and models from tiny to over 100 billion parameters.


## 📣 News

## Features

✅ Available now | 🔜 Coming in v0.3

## Prerequisites

Clone NeMo RL.

```sh
git clone git@github.com:NVIDIA/NeMo-RL.git nemo-rl
cd nemo-rl
```

Install uv. For faster setup and environment isolation, we use uv:

```sh
pip install uv
```

If you cannot install at the system level, you can install uv for your user:

```sh
pip install --user uv
```

Initialize the NeMo RL project virtual environment:

```sh
uv venv
```

NOTE: Please do not use -p/--python; instead, allow uv venv to read the Python version from .python-version. This ensures that the version of Python used is always what we prescribe.
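To sanity-check that the environment uses the prescribed interpreter, you can compare the pinned version against what uv actually runs (both are standard commands):

```sh
# Version pinned by the repository
cat .python-version
# Interpreter used inside the project environment
uv run python --version
```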

Use uv run to launch all commands. It handles pip installs implicitly and ensures your environment stays in sync with our lock file. Activating the venv directly is not recommended; uv run guarantees consistent environment usage across different shells and sessions.

Example:

```sh
uv run python examples/run_grpo_math.py
```
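If you prefer to materialize the locked environment up front rather than on the first uv run, uv's standard sync subcommand does the same resolution explicitly (a convenience, not a required step):

```sh
# Optional: install everything from the lock file ahead of time
uv sync
```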


## GRPO

We provide a reference GRPO experiment configuration, trained for math benchmarks using the OpenMathInstruct-2 dataset.

### GRPO Single Node

To run GRPO on a single GPU for Qwen/Qwen2.5-1.5B:

```sh
# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

```sh
# Run the GRPO math example using a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
  cluster.gpus_per_node=8
```

You can override any of the parameters listed in the YAML configuration file. For example:

```sh
uv run python examples/run_grpo_math.py \
  policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
  checkpointing.checkpoint_dir="results/llama1b_math" \
  logger.wandb_enabled=True \
  logger.wandb.name="grpo-llama1b_math" \
  logger.num_val_samples_to_print=10
```

### GRPO Multi-node

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=2

# grpo_math_8B uses the Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```
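After submission, standard Slurm tooling applies for monitoring and cleanup; for example:

```sh
# Check the queued/running state of your jobs
squeue -u $USER
# Cancel the job if needed
scancel <jobid>
```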

The required CONTAINER can be built by following the instructions in the Docker documentation.
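As a rough sketch, the build is a standard docker build against the repository's Dockerfile (the path and tag here are assumptions; the Docker documentation is authoritative for the exact target and arguments):

```sh
# Hypothetical invocation; consult the Docker documentation for the actual
# Dockerfile location, build target, and recommended tag.
docker build -f docker/Dockerfile -t nemo-rl:latest .
```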

### GRPO Qwen2.5-32B

This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length.

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=16

# Download Qwen before the job starts to avoid spending time downloading during the training loop
HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B

# Ensure HF_HOME is included in your MOUNTS
HF_HOME=/path/to/hf_home \
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```
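To confirm the download actually landed in the cache before submitting the job, huggingface_hub's standard cache inspector can be used:

```sh
# List cached repos under the chosen HF_HOME; Qwen/Qwen2.5-32B should appear
HF_HOME=/path/to/hf_home huggingface-cli scan-cache
```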

### GRPO Multi-Turn

We also support multi-turn generation and training (tool use, games, etc.). A reference example that trains a model to play a Sliding Puzzle Game:

```sh
uv run python examples/run_grpo_sliding_puzzle.py
```
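As with the math example, config values can be overridden on the command line; a hypothetical override, assuming the sliding-puzzle config exposes the same cluster section as the math configs:

```sh
# Hypothetical: scale the sliding-puzzle example to 8 GPUs
# (assumes the config has a cluster.gpus_per_node field like the math configs)
uv run python examples/run_grpo_sliding_puzzle.py \
  cluster.gpus_per_node=8
```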

## Supervised Fine-Tuning (SFT)

We provide an example SFT experiment using the SQuAD dataset.

### SFT Single Node

The default SFT configuration is set to run on a single GPU. To start the experiment:

```sh
uv run python examples/run_sft.py
```

This fine-tunes the Llama3.2-1B model on the SQuAD dataset using 1 GPU.

To use multiple GPUs on a single node, you can modify the cluster configuration. This adjustment also lets you increase the model size and batch size:

```sh
uv run python examples/run_sft.py \
  policy.model_name="meta-llama/Meta-Llama-3-8B" \
  policy.train_global_batch_size=128 \
  sft.val_global_batch_size=128 \
  cluster.gpus_per_node=8
```

Refer to examples/configs/sft.yaml for a full list of parameters that can be overridden.

### SFT Multi-node

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```

## DPO

We provide a sample DPO experiment that uses the HelpSteer3 dataset for preference-based training.

### DPO Single Node

The default DPO experiment is configured to run on a single GPU. To launch the experiment:

```sh
uv run python examples/run_dpo.py
```

This trains Llama3.2-1B-Instruct on one GPU.

If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration and switch to an 8B Llama3.1 Instruct model:

```sh
uv run python examples/run_dpo.py \
  policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
  policy.train_global_batch_size=256 \
  cluster.gpus_per_node=8
```

Any of the DPO parameters can be customized from the command line. For example:

```sh
uv run python examples/run_dpo.py \
  dpo.sft_loss_weight=0.1 \
  dpo.preference_average_log_probs=True \
  checkpointing.checkpoint_dir="results/llama_dpo_sft" \
  logger.wandb_enabled=True \
  logger.wandb.name="llama-dpo-sft"
```

Refer to examples/configs/dpo.yaml for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the DPO documentation.

### DPO Multi-node

For distributed DPO training across multiple nodes, modify the following script for your use case:

```sh
# Run from the root of the NeMo RL repo

# Number of nodes to use for your job
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
RAY_DEDUP_LOGS=0 \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```

## Evaluation

We provide evaluation tools to assess model capabilities.

### Convert Model Format (Optional)

If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the Hugging Face format before running evaluation:

```sh
# Example for a GRPO checkpoint at step 170
uv run python examples/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf
```

Note: Adjust the paths according to your training output directory structure.

For an in-depth explanation of checkpointing, refer to the Checkpointing documentation.

### Run Evaluation

Run the evaluation script with the converted model:

```sh
uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
```

Run the evaluation script with custom settings:

```sh
# Example: evaluate DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
# Pass@1 accuracy averaged over 16 samples for each problem
uv run python examples/run_eval.py \
  generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
  generation.temperature=0.6 \
  generation.top_p=0.95 \
  generation.vllm_cfg.max_model_len=32768 \
  data.dataset_name=HuggingFaceH4/MATH-500 \
  data.dataset_key=test \
  eval.num_tests_per_prompt=16 \
  cluster.gpus_per_node=8
```
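For reference, the pass@1 number reported here is the per-problem fraction of correct samples, averaged over the benchmark (a standard formulation; it assumes each of the k samples is scored independently):

```latex
% Pass@1 averaged over k samples per problem, then over the N benchmark problems.
\[
  \text{pass@1}
  = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{k} \sum_{j=1}^{k}
    \mathbf{1}\!\left[\text{sample } j \text{ of problem } i \text{ is correct}\right],
  \qquad k = 16
\]
```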

Note: Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.

Refer to examples/configs/eval.yaml for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the Evaluation documentation.

## Set Up Clusters

For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated Cluster Start documentation.

## Citation

If you use NeMo RL in your research, please cite it using the following BibTeX entry:

```bibtex
@misc{nemo-rl,
  title        = {NeMo RL: A Scalable and Efficient Post-Training Library},
  howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
  year         = {2025},
  note         = {GitHub repository},
}
```

## Contributing

We welcome contributions to NeMo RL! Please see our Contributing Guidelines for more information on how to get involved.

## Licenses

NVIDIA NeMo RL is licensed under the Apache License 2.0.