GitHub - NVIDIA-NeMo/Skills: A project to improve skills of large language models
Nemo-Skills is a collection of pipelines to improve "skills" of large language models (LLMs). We support everything needed for LLM development, from synthetic data generation, to model training, to evaluation on a wide range of benchmarks. Start developing on a local workstation and move to a large-scale Slurm cluster with just a one-line change.
Here are some of the features we support:
- Flexible LLM inference:
- Seamlessly switch between API providers, a local server and large-scale Slurm jobs for LLM inference.
- Host models (on one or many nodes) with TensorRT-LLM, vLLM, SGLang or Megatron.
- Scale SDG (synthetic data generation) jobs from 1 GPU on a local machine all the way to tens of thousands of GPUs on a Slurm cluster.
- Model evaluation:
- Evaluate your models on many popular benchmarks.
* Math (natural language): e.g. aime24, aime25, hmmt_feb25
* Math (formal language): e.g. minif2f, proofnet, putnam-bench
* Code: e.g. swe-bench, livecodebench, bird
* Scientific knowledge: e.g. hle, scicode, gpqa
* Instruction following: e.g. ifbench, ifeval
* Long-context: e.g. ruler, mrcr, aalcr, longbench-v2
* Tool-calling: e.g. bfcl_v3
* Multilingual: e.g. mmlu-prox, flores-200, wmt24pp
* Speech & Audio: e.g. asr-leaderboard, mmau-pro
* Vision-Language Models (VLM): e.g. mmmu-pro
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
- Model training: Train models using NeMo-RL or verl.
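To make "parallelize each evaluation across many Slurm jobs" concrete, here is a minimal, self-contained sketch of the general idea: split a benchmark into near-equal chunks, evaluate each chunk independently, then merge the per-chunk results. This is not the Nemo-Skills implementation or API. The names `split_into_chunks` and `merge_accuracy` are hypothetical and exist only for this illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], num_chunks: int) -> List[List[T]]:
    """Split items into num_chunks contiguous chunks whose sizes differ by at most one.

    Hypothetical helper for illustration; each chunk could be handed to a separate job.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks: List[List[T]] = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def merge_accuracy(per_chunk_results: List[List[bool]]) -> float:
    """Combine per-chunk correctness flags into one overall accuracy."""
    flat = [flag for chunk in per_chunk_results for flag in chunk]
    return sum(flat) / len(flat) if flat else 0.0


# Placeholder problem IDs standing in for a real benchmark.
problems = [f"problem-{i}" for i in range(10)]
chunks = split_into_chunks(problems, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```

Merging by flattening the per-sample flags, and not by averaging per-chunk accuracies, keeps the overall score correct when chunks have different sizes.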
News
- [12/15/2025]: Released the recipe for reproducing Nemotron-Math-v2 and Nemotron-Math-Proofs-v1 datasets that were used as part of the training data for NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.
- [11/25/2025]: Added the recipe for reproducing the main experimental results for Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection.
- [08/22/2025]: Added details for reproducing evals for the NVIDIA-Nemotron-Nano-9B-v2 model by NVIDIA.
- [08/15/2025]: Added details for reproducing evals for the Llama-3_3-Nemotron-Super-49B-v1_5 model by NVIDIA.
- [07/30/2025]: The datasets used to train OpenReasoning models are released! Math and code are available as part of Nemotron-Post-Training-Dataset-v1 and science is available in OpenScienceReasoning-2. See our documentation for more details.
- [07/18/2025]: We released OpenReasoning models! SOTA scores on math, coding and science benchmarks.
- [04/23/2025]: We released OpenMathReasoning dataset and models!
- OpenMathReasoning dataset has 306K unique mathematical problems sourced from AoPS forums with:
* 3.2M long chain-of-thought (CoT) solutions
* 1.7M long tool-integrated reasoning (TIR) solutions
* 566K samples that select the most promising solution out of many candidates (GenSelect)
- OpenMath-Nemotron models are SoTA open-weight models on math reasoning benchmarks at the time of release!
- [10/03/2024]: We released OpenMathInstruct-2 dataset and models!
- OpenMathInstruct-2 is a math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model.
- OpenMath-2-Llama models show significant improvements compared to their Llama3.1-Instruct counterparts.
Getting started
To get started, follow these steps, browse available pipelines, or run `ns --help` to see all available commands and their options.
You can find more examples of how to use Nemo-Skills in the tutorials page.
We've built and released many popular models and datasets using Nemo-Skills. See all of them in the Papers & Releases documentation.
You can find the full documentation here.
Contributing
We welcome contributions to Nemo-Skills! Please see our Contributing Guidelines for more information on how to get involved.
Disclaimer: This project is strictly for research purposes, and not an official product from NVIDIA.

