# NeMo RL: A Scalable and Efficient Post-Training Library

NeMo RL is a scalable and efficient post-training library that supports everything from single-GPU setups to thousands of GPUs, and models from tiny to over 100 billion parameters.


## 📣 News

## Features

✅ Available now | 🔜 Coming in v0.3

## Prerequisites

Clone NeMo RL.

```sh
git clone git@github.com:NVIDIA/NeMo-RL.git nemo-rl
cd nemo-rl
```

Install uv. For faster setup and environment isolation, we use uv:

```sh
pip install uv
```

If you cannot install at the system level, you can install uv for your user:

```sh
pip install --user uv
```

Initialize the NeMo RL project virtual environment:

```sh
uv venv
```

NOTE: Please do not use -p/--python; instead, allow uv venv to read the Python version from .python-version. This ensures that the version of Python used is always what we prescribe.
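To sanity-check that the environment uses the prescribed interpreter, you can compare the pinned version against what uv actually runs (both are standard commands):

```sh
# Version pinned by the repository
cat .python-version
# Interpreter used inside the project environment
uv run python --version
```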

Use uv run to launch all commands. It handles pip installs implicitly and ensures your environment stays in sync with our lock file. Activating the venv directly is not recommended; uv run guarantees consistent environment usage across different shells and sessions.

Example:

```sh
uv run python examples/run_grpo_math.py
```
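If you prefer to materialize the locked environment up front rather than on the first uv run, uv's standard sync subcommand does the same resolution explicitly (a convenience, not a required step):

```sh
# Optional: install everything from the lock file ahead of time
uv sync
```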


## GRPO

We provide a reference GRPO experiment configuration, trained for math benchmarks using the OpenMathInstruct-2 dataset.

### GRPO Single Node

To run GRPO on a single GPU for Qwen/Qwen2.5-1.5B:

```sh
# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py
```

By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

```sh
# Run the GRPO math example using a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
  cluster.gpus_per_node=8
```

You can override any of the parameters listed in the YAML configuration file. For example:

```sh
uv run python examples/run_grpo_math.py \
  policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
  checkpointing.checkpoint_dir="results/llama1b_math" \
  logger.wandb_enabled=True \
  logger.wandb.name="grpo-llama1b_math" \
  logger.num_val_samples_to_print=10
```

### GRPO Multi-node

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=2

# grpo_math_8B uses the Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```
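After submission, standard Slurm tooling applies for monitoring and cleanup; for example:

```sh
# Check the queued/running state of your jobs
squeue -u $USER
# Cancel the job if needed
scancel <jobid>
```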

The required CONTAINER can be built by following the instructions in the Docker documentation.
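As a rough sketch, the build is a standard docker build against the repository's Dockerfile (the path and tag here are assumptions; the Docker documentation is authoritative for the exact target and arguments):

```sh
# Hypothetical invocation; consult the Docker documentation for the actual
# Dockerfile location, build target, and recommended tag.
docker build -f docker/Dockerfile -t nemo-rl:latest .
```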

### GRPO Qwen2.5-32B

This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length.

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=16

# Download Qwen before the job starts to avoid spending time downloading during the training loop
HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B

# Ensure HF_HOME is included in your MOUNTS
HF_HOME=/path/to/hf_home \
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```
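To confirm the download actually landed in the cache before submitting the job, huggingface_hub's standard cache inspector can be used:

```sh
# List cached repos under the chosen HF_HOME; Qwen/Qwen2.5-32B should appear
HF_HOME=/path/to/hf_home huggingface-cli scan-cache
```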

### GRPO Multi-Turn

We also support multi-turn generation and training (tool use, games, etc.). A reference example that trains a model to play a Sliding Puzzle Game:

```sh
uv run python examples/run_grpo_sliding_puzzle.py
```
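As with the math example, config values can be overridden on the command line; a hypothetical override, assuming the sliding-puzzle config exposes the same cluster section as the math configs:

```sh
# Hypothetical: scale the sliding-puzzle example to 8 GPUs
# (assumes the config has a cluster.gpus_per_node field like the math configs)
uv run python examples/run_grpo_sliding_puzzle.py \
  cluster.gpus_per_node=8
```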

## Supervised Fine-Tuning (SFT)

We provide an example SFT experiment using the SQuAD dataset.

### SFT Single Node

The default SFT configuration is set to run on a single GPU. To start the experiment:

```sh
uv run python examples/run_sft.py
```

This fine-tunes the Llama3.2-1B model on the SQuAD dataset using 1 GPU.

To use multiple GPUs on a single node, you can modify the cluster configuration. This adjustment also lets you increase the model size and batch size:

```sh
uv run python examples/run_sft.py \
  policy.model_name="meta-llama/Meta-Llama-3-8B" \
  policy.train_global_batch_size=128 \
  sft.val_global_batch_size=128 \
  cluster.gpus_per_node=8
```

Refer to examples/configs/sft.yaml for a full list of parameters that can be overridden.

### SFT Multi-node

```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```

## DPO

We provide a sample DPO experiment that uses the HelpSteer3 dataset for preference-based training.

### DPO Single Node

The default DPO experiment is configured to run on a single GPU. To launch the experiment:

```sh
uv run python examples/run_dpo.py
```

This trains Llama3.2-1B-Instruct on one GPU.

If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration and switch to an 8B Llama3.1 Instruct model:

```sh
uv run python examples/run_dpo.py \
  policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
  policy.train_global_batch_size=256 \
  cluster.gpus_per_node=8
```

Any of the DPO parameters can be customized from the command line. For example:

```sh
uv run python examples/run_dpo.py \
  dpo.sft_loss_weight=0.1 \
  dpo.preference_average_log_probs=True \
  checkpointing.checkpoint_dir="results/llama_dpo_sft" \
  logger.wandb_enabled=True \
  logger.wandb.name="llama-dpo-sft"
```

Refer to examples/configs/dpo.yaml for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the DPO documentation.

### DPO Multi-node

For distributed DPO training across multiple nodes, modify the following script for your use case:

```sh
# Run from the root of the NeMo RL repo

# Number of nodes to use for your job
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
RAY_DEDUP_LOGS=0 \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
```

## Evaluation

We provide evaluation tools to assess model capabilities.

### Convert Model Format (Optional)

If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the Hugging Face format before running evaluation:

```sh
# Example for a GRPO checkpoint at step 170
uv run python examples/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf
```

Note: Adjust the paths according to your training output directory structure.

For an in-depth explanation of checkpointing, refer to the Checkpointing documentation.

### Run Evaluation

Run the evaluation script with the converted model:

```sh
uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
```

Run the evaluation script with custom settings:

```sh
# Example: evaluate DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
# Pass@1 accuracy averaged over 16 samples for each problem
uv run python examples/run_eval.py \
  generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
  generation.temperature=0.6 \
  generation.top_p=0.95 \
  generation.vllm_cfg.max_model_len=32768 \
  data.dataset_name=HuggingFaceH4/MATH-500 \
  data.dataset_key=test \
  eval.num_tests_per_prompt=16 \
  cluster.gpus_per_node=8
```
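For reference, the pass@1 number reported here is the per-problem fraction of correct samples, averaged over the benchmark (a standard formulation; it assumes each of the k samples is scored independently):

```latex
% Pass@1 averaged over k samples per problem, then over the N benchmark problems.
\[
  \text{pass@1}
  = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{k} \sum_{j=1}^{k}
    \mathbf{1}\!\left[\text{sample } j \text{ of problem } i \text{ is correct}\right],
  \qquad k = 16
\]
```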

Note: Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.

Refer to examples/configs/eval.yaml for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the Evaluation documentation.

## Set Up Clusters

For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated Cluster Start documentation.

## Citation

If you use NeMo RL in your research, please cite it using the following BibTeX entry:

```bibtex
@misc{nemo-rl,
  title        = {NeMo RL: A Scalable and Efficient Post-Training Library},
  howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
  year         = {2025},
  note         = {GitHub repository},
}
```

## Contributing

We welcome contributions to NeMo RL! Please see our Contributing Guidelines for more information on how to get involved.

## Licenses

NVIDIA NeMo RL is licensed under the Apache License 2.0.