GitHub - NVIDIA-NeMo/Gym: Evaluate and improve models and agents using environments
Aalcr
other
-
-
-
-
-
-
Abstention
rlhf
Train models to abstain when unsure, using a three-tier reward on HotPotQA with an LLM judge
Improve calibration by rewarding abstention over incorrect answers
✓
✓
Creative Commons Attribution-ShareAlike 4.0 International
-
Arc Agi
knowledge
Solve puzzles designed to test intelligence. See https://arcprize.org/arc-agi.
Improve puzzle-solving capabilities.
-
✓
-
-
Arena Judge
-
-
-
-
-
-
Asr With Pc
other
ASR with WER scoring (standard, case-sensitive, punctuation+capitalization)
Improve transcription quality with structural detail
-
-
-
-
Aviary
agent
Multi-hop question answering on the HotPotQA dataset with Wikipedia search
Improve knowledge and agentic capability
✓
✓
Apache 2.0
-
Aviary
math
GSM8k benchmark with calculator tool
Test math and agentic capability
✓
✓
Apache 2.0
-
Bigcodebench
coding
Verifies model-generated Python solutions against the BigCodeBench unittest suite.
Improve practical, library-rich Python coding capabilities.
-
-
-
-
Bird Sql
coding
Text-to-SQL with execution-based evaluation on BIRD dev (1534 SQLite tasks). Binary reward from unordered result-set equality.
Improve text-to-SQL capabilities on BIRD's realistic dev split using execution-based binary reward without an LLM judge.
-
-
-
-
Blackjack
games
Blackjack. Model hits or stands. Reward +1 win, 0 draw, -1 loss/bust.
Example gymnasium-style multi-step environment
-
-
-
-
Browsecomp Advanced Harness
agent
Model uses search tools to satisfy a user query.
Measure agentic search capability
-
-
-
browsecomp_advanced_harness.yaml
-
Calendar
agent
Multi-turn calendar scheduling dataset. User states events and constraints in natural language; model schedules events to satisfy all constraints.
Improve multi-turn instruction following capabilities
✓
✓
Apache 2.0
Nemotron-RL-agent-calendar_scheduling
Calendar
agent
Multi-turn calendar scheduling dataset. User states events and constraints in natural language; model schedules events to satisfy all constraints.
Improve multi-turn instruction following capabilities
✓
✓
Creative Commons Attribution 4.0 International
Nemotron-RL-Instruction-Following-Calendar-v2
Circle Click
other
Click on circles in images
Improve visual grounding and spatial reasoning
-
-
-
-
Circle Count
other
Count circles of a given color in images
Improve visual counting and color recognition
-
-
-
-
Code Fim
coding
Code Fill-in-the-Middle judged by HumanEval-Infilling test suite (single_line, multi_line, random_span, random_span_light)
Improve Python code-infilling capabilities (prefix + completion + suffix)
-
-
-
-
Code Gen
coding
Model must submit the right code to solve a problem
Improve competitive coding capabilities
✓
✓
Apache 2.0
nemotron-RL-coding-competitive_coding
Competitive Coding Challenges
coding
Execution of competitive programming competition questions
Improve competitive coding capabilities on contest-style problems
-
-
-
competitive_coding_challenges.yaml
-
Critpt
other
Research-level physics problems scored by the Artificial Analysis API
Evaluate model performance on research-level physics reasoning
-
-
-
-
Cvdp
coding
CVDP benchmark dataset for code generation
Evaluate RTL code generation capabilities
-
✓
-
-
Equivalence Llm Judge
agent
Short bash command generation questions with LLM-as-a-judge
Improve foundational bash and IF capabilities
✓
✓
GNU General Public License v3.0
-
Equivalence Llm Judge
knowledge
Short answer questions with LLM-as-a-judge
Improve knowledge-related benchmarks like GPQA / HLE
-
-
-
-
Equivalence Rule
knowledge
Question answering with rule-based reward
Improve retrieval and counting capabilities
-
-
-
-
Ether0
knowledge
ether0 chemistry benchmark verifiers
Evaluate chemistry knowledge and reasoning with the ether0 benchmark
-
✓
-
-
Evalplus
coding
Function-completion code judged by EvalPlus base + plus tests (HumanEval+, MBPP+)
Improve Python function-completion capabilities
-
-
-
-
Finance Sec Search
agent
SEC EDGAR filing search for financial analysis questions
Enable LLMs to search and analyze SEC filings
-
-
-
-
Format Verification
instruction_following
Verify citation/reference markers in model responses via string matching
Improve instruction following for citation format adherence
✓
-
Apache 2.0
-
Format Verification
instruction_following
Verify freeform text formatting (bullets, headings, tables, etc.) via regex patterns
Improve instruction following for text formatting constraints
✓
-
Apache 2.0
-
Frontierscience Judge
other
FrontierScience answer grading via single-pass LLM judge
Evaluate FrontierScience Olympiad short answers or Research rubric-scored answers
-
-
-
-
Genrm Compare
rlhf
GenRM pairwise comparison for RLHF training
Compare multiple candidate responses using GenRM model
-
-
-
-
Google Search
agent
Multiple-choice question answering problems with search tools integrated
Improve knowledge-related benchmarks with search tools
✓
-
Apache 2.0
Nemotron-RL-knowledge-web_search-mcqa
Gpqa Diamond
knowledge
GPQA Diamond multiple-choice question answering problems
Evaluate graduate-level scientific reasoning via MCQ verification
✓
-
MIT
-
Graphwalks
other
Long-context graph-walks (BFS / parents) with F1-over-node-sets grading from openai/graphwalks
Improve long-context multi-step graph reasoning and adjacency-list traversal
-
-
-
-
Grl Sokoban
games
Single-box Sokoban in Gymnasium API style.
Model emits one move per turn until the puzzle is solved.
-
-
-
-
Grl Tetris
games
Tetris in Gymnasium API style. Model emits one or more moves per turn.
Multi-step Tetris environment
-
-
-
-
Gymnasium
other
Base class for Gymnasium-style servers. Not a standalone server.
Reusable base class for step/reset style environments
-
-
-
-
Harbor Agent
agent
Harbor integration for agent harnesses and environments.
Improve models in popular agentic environments supported by Harbor such as Terminus2.
✓
-
-
-
Harbor Agent
agent
Harbor integration for agent harnesses and environments.
Improve models in popular agentic environments supported by Harbor such as Terminus2.
✓
-
-
-
Hotpotqa Qa
knowledge
Short-answer QA with deterministic SQuAD-style + alternative-aware substring verification (HotpotQA closed-book).
Improve closed-book multi-hop question-answering accuracy.
-
-
-
-
Ifbench
instruction_following
IFBench instruction following evaluation using AllenAI's IFBench library (57 instruction types)
Improve IFBench instruction following
-
-
-
-
Imo Gradingbench
math
Four-class grading of math proofs — the policy model reads a problem plus a candidate proof and emits one of correct / almost / partial / incorrect as the last word.
Improve the IMO-GradingBench benchmark and proof-grading skill.
-
-
-
-
Imo Proofbench Judge
math
IMO ProofBench grader using a strong LLM judge with the IMO 0-7 rubric
Score IMO-style proof submissions with a problem-specific grading rubric
-
-
-
-
Indirect Prompt Injection
safety
Indirect prompt injection resistance for multi-domain tool-use agents
Improve agentic security by teaching robustness against tool outputs containing malicious instructions
✓
✓
Apache 2.0
indirect_prompt_injection.yaml
-
Instruction Following
instruction_following
Instruction following datasets targeting IFEval and IFBench style instruction following capabilities
Improve IFEval and IFBench
✓
-
Apache 2.0
Nemotron-RL-instruction_following
Inverse If
knowledge
Inverse IF instruction-following benchmark with per-task LLM judge
-
✓
-
TBD
-
Jailbreak Detection
safety
Jailbreak detection with Nemotron judge + combined reward
Improve Jailbreak Robustness and Safety/Security Behavior Guide Enforcement
-
-
-
jailbreak_detection_nemotron_combined_reward_tp8.yaml
-
Labbench2 Vlm
knowledge
labbench2 VLM benchmarks: scientific figure/table QA (figqa2, tableqa2), protocol troubleshooting (protocolqa2), LLM-as-judge
Measure scientific reasoning on figures, tables, and lab protocols
-
✓
-
-
Longmt Eval
other
Document-level MT verifier for pg19 books using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score)
Rewards long-form book translation at the document level using reference-free COMETKiwi scores as the RL reward signal.
-
-
-
-
Longmt Eval
other
Document-level MT verifier for wmt24pp short docs using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score).
Rewards document-level translation quality across 55 language pairs using reference-free COMETKiwi scores as the RL reward signal.
-
-
-
-
Longmt Eval
other
Document-level MT verifier using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score)
Rewards long-form translation quality at the document level using reference-free COMETKiwi scores as the RL reward signal.
-
-
-
-
Math Advanced Calculations
agent
An instruction following math environment with counter-intuitive calculators
Improve instruction following capabilities in specific math environments
✓
-
Apache 2.0
math_advanced_calculations.yaml
Nemotron-RL-math-advanced_calculations
Math Formal Lean
math
Lean4 formal proof verification environment
Improve formal theorem proving capabilities
✓
-
Apache 2.0
-
Math Formal Lean
math
Lean4 formal proof verification environment
Improve formal theorem proving capabilities
✓
-
Apache 2.0
-
Math Formal Lean
math
Lean4 formal proof verification environment
Improve formal theorem proving capabilities
✓
-
Apache 2.0
-
Math Formal Lean
math
Lean4 formal proof verification environment
Improve formal theorem proving capabilities
✓
-
Apache 2.0
-
Math Formal Lean
math
Lean4 formal proof verification environment
Improve formal theorem proving capabilities
✓
-
MIT
-
Math Formal Lean
math
Lean4 formal proof verification environment with multi-turn self-correction
Improve formal theorem proving capabilities
✓
-
MIT
math_formal_lean_multi_turn.yaml
-
Math Proof Judgement
math
Binary judgement of math proofs — the policy model reads a problem plus a candidate proof and outputs Judgement: Yes/No.
Improve the NVIDIA ProofBench judge benchmark and math-proof verification skill.
-
-
-
-
Math With Autograder
math
Math QA verified by a Skills-style autograder LLM judge with math-verify symbolic fallback
Score hard-math benchmarks (e.g. IMO AnswerBench) where the judge is a unidirectional Correct/Incorrect grader
-
-
-
-
Math With Code
math
Model solves competitive math problems using simple calculator tools
Improve math and simple tool use capabilities
✓
-
Apache 2.0
-
Math With Judge
math
DAPO17k math dataset with math-verify
Improve math capabilities including AIME 24 / 25
✓
✓
Apache 2.0
-
Math With Judge
math
Hermes Agent with terminal, file, code_execution, skills, todo toolsets on OpenMathReasoning math dataset with math-verify and LLM-as-a-judge
Improve math capabilities (e.g. AIME25) when the model runs in the Hermes Agent harness
✓
-
Creative Commons Attribution 4.0 International
math_with_judge_hermes_agent.yaml
Nemotron-RL-math-OpenMathReasoning
Math With Judge
math
MathStackOverflow math dataset with math-verify
Improve math capabilities including AIME 24 / 25
✓
✓
Creative Commons Attribution-ShareAlike 4.0 International
Nemotron-RL-math-stack_overflow
Math With Judge
math
OpenMathReasoning math dataset with math-verify and LLM-as-a-judge
Improve math capabilities including AIME 24 / 25
✓
✓
Creative Commons Attribution 4.0 International
Nemotron-RL-math-OpenMathReasoning
Mcqa
knowledge
Multiple-choice question answering problems
Improve benchmarks like MMLU / GPQA / HLE
✓
✓
Apache 2.0
Mini Swe Agent
coding
Software engineering tasks driven by the mini-swe-agent harness.
Improve agentic software engineering capabilities.
✓
✓
MIT
Mrcr
other
Multi-round coreference resolution over multi-turn conversations with prefix-gated SequenceMatcher grading
Improve long-context in-context retrieval and needle-count-aware reasoning
-
-
-
-
Multichallenge
knowledge
Targets inference memory, instruction retention, version editing, and self-coherence.
Improve complex multi-turn conversational capability
✓
-
Creative Commons Attribution 4.0 International
Nemotron-RL-Instruction-Following-MultiTurnChat-v1
Newton Bench
math
Scientific law discovery tasks through agentic experimentation across 12 physics domains
Improve science, reasoning, and tool use capabilities
✓
-
Apache 2.0
-
Ns Tools
agent
NeMo Skills tool execution with math verification
-
-
-
-
-
Nvarc
knowledge
ARC-AGI inductive mode: model outputs Python code with transform()
Improve ARC-AGI puzzle-solving by inducing executable transformation programs
✓
✓
Apache 2.0
-
Nvarc
knowledge
ARC-AGI transductive mode: model outputs grid directly
Improve ARC-AGI puzzle-solving by directly predicting transformed grids
✓
✓
Apache 2.0
-
Omniscience
knowledge
Omniscience factual knowledge QA with LLM judge verification
Evaluate factual recall and calibration via LLM-graded open-ended QA
-
-
-
-
Openenv
agent
Echo environment via OpenEnv (MCP). Echoes messages back with length-based rewards.
-
-
-
-
-
Openenv
coding
Python code execution environment via OpenEnv. Executes code and returns stdout/stderr.
-
-
-
-
-
Openenv
games
Maze navigation environment via OpenEnv. Agent navigates an 8x8 grid to find the exit.
-
-
-
-
-
Over Refusal Detection
-
-
✓
-
TBD
-
Physics Judge
math
Physics QA verified by NeMo Skills' physics judge LLM with math-verify symbolic fallback
Score open-ended physics benchmarks (e.g. PHYSICS) where the judge emits [Correct] / [Incorrect] verdicts
-
-
-
-
Polymath
math
PolyMath multilingual math benchmark with weighted (difficulty) and per-language metrics
Improve multilingual math reasoning across 18 languages and 4 difficulty tiers
-
-
-
-
Proof Genselect
math
Pairwise proof selection with binary correctness reward
-
-
-
-
-
Proof Judge
math
Theorem proving with verifier + meta-verifier judge (combined env)
-
-
-
-
-
Proof Verification
math
Proof verification scored against ground truth and meta-verifier agreement
-
-
-
-
-
Rdkit Chemistry
knowledge
Molecular chemistry question answering: calculate properties of SMILES. Includes a mix of tool-use (python + rdkit) and no-tool-use questions.
Improve molecular reasoning and SMILES parsing.
✓
-
TBD
-
Reasoning Gym
knowledge
Claude Code agent harness for reasoning gym tasks
Evaluate model capabilities in the Claude Code agent harness
✓
-
Creative Commons Attribution 4.0 International
reasoning_gym_claude_code_agent.yaml
Reasoning Gym
knowledge
LangGraph orchestrator agent compatible with resource servers that do not use tools; enables diverse agent training data and test time scaling vs a simple agent, extensible to use tools or other agent architectures
Iterative test time scaling for improved performance in reasoning tasks
✓
-
Apache 2.0
-
Reasoning Gym
knowledge
LangGraph parallel thinking agent compatible with resource servers that do not use tools; enables diverse agent training data and test time scaling vs a simple agent, extensible to use tools or other agent architectures
Iterative test time scaling for improved performance in reasoning tasks
✓
-
Apache 2.0
-
Reasoning Gym
knowledge
LangGraph reflection agent compatible with resource servers that do not use tools; provides iterative reflection for diverse agent training data and test time scaling, extensible to use tools or other agent architectures
Iterative test time scaling for improved performance in reasoning tasks
✓
-
Apache 2.0
-
Reasoning Gym
knowledge
LangGraph ReWOO agent compatible with resource servers that do not use tools; enables diverse agent training data and test time scaling vs a simple agent, extensible to use tools or other agent architectures
Iterative test time scaling for improved performance in reasoning tasks
✓
-
Apache 2.0
-
Reasoning Gym
knowledge
Over 100 tasks including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and many common games.
Improve robustness, generalization, broad knowledge and reasoning
✓
-
Creative Commons Attribution 4.0 International
Ruler
other
-
-
-
-
-
-
Scicode
coding
Multi-step scientific code generation; sub-step solutions executed against SciCode test cases
Improve multi-step scientific code generation
-
-
-
-
Simpleqa
knowledge
SimpleQA short-form factual QA with 3-tier LLM judge (CORRECT / INCORRECT / NOT_ATTEMPTED)
Evaluate factual recall and abstention calibration via LLM-graded open-ended QA
-
-
-
-
Single Step Tool Use With Argument Comparison
agent
Conversational tool-use RL from expert trajectories; behavior cloning per step across auth, lookup, and servicing domains.
-
✓
✓
Creative Commons Attribution 4.0 International
single_step_tool_use_with_argument_comparison.yaml
Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1
Single Step Tool Use With Argument Comparison
agent
Droid agent pivot dataset; behavior cloning per step from successful Droid rollouts.
-
-
-
-
droid_pivot_single_step_tool_use_with_argument_comparison.yaml
-
Single Step Tool Use With Argument Comparison
agent
General function-calling RL dataset using expert trajectories; behavior cloning to match expert tool calls per step.
-
✓
✓
Creative Commons Attribution 4.0 International
toolcall_schema_single_step_tool_use_with_argument_comparison.yaml
Nemotron-RL-Agentic-Function-Calling-Pivot-v1
Single Step Tool Use With Argument Comparison
agent
GitHub-issue dataset for software-engineering agents; refactored from SWE-Gym and SWE-Bench-Verified for NeMo Gym.
-
✓
✓
Creative Commons Attribution 4.0 International
swe_pivot_single_step_tool_use_with_argument_comparison.yaml
Nemotron-RL-Agentic-SWE-Pivot-v1
Single Step Tool Use With Argument Comparison
agent
The model must output the next correct call in a given trajectory involving search tools.
Improve agentic search capability.
✓
✓
Apache 2.0
search_pivot_single_step_tool_use_with_argument_comparison.yaml
-
Speed Bench
other
Speculative-decoding throughput benchmark. Reads vLLM /metrics Prometheus counters before/after generation to compute acceptance length and acceptance rate.
Measure inference-time speculative-decoding effectiveness for serving research and regression tests.
-
-
-
-
Spider2 Lite
coding
Text-to-SQL with execution-based evaluation on Spider 2.0-Lite (135 SQLite tasks). Binary reward based on result-set equivalence.
Improve text-to-SQL capabilities for real-world enterprise queries using execution-based binary reward without an LLM judge.
-
✓
-
-
Structeval
instruction_following
StructEval non-renderable format verification (JSON, YAML, CSV, TOML, XML)
Improve structured output generation quality
✓
-
Apache 2.0
-
Structured Outputs
instruction_following
Check whether responses follow the structured output requirements in prompts
Improve instruction following capabilities
✓
✓
Apache 2.0
Nemotron-RL-instruction_following-structured_outputs
Structured Outputs
instruction_following
Check whether responses follow the structured output requirements in prompts
Improve instruction following capabilities
✓
✓
Apache 2.0
structured_outputs_json_yaml_xml_v1.yaml
-
Structured Outputs
instruction_following
Check if responses follow structured output requirements (JSON, YAML, XML, TOML, CSV). Created 20260409.
Improve schema adherence across all structured output formats
✓
-
Apache 2.0
-
Structured Outputs
instruction_following
Check if responses follow tool-call structured output schemas. Created 20260424.
Improve schema adherence when structured output is expressed through tool calls
✓
-
Apache 2.0
-
Swe Agents
-
-
✓
✓
Apache 2.0
-
Swe Agents
-
-
✓
✓
Apache 2.0
-
Swe Agents
-
-
✓
✓
Apache 2.0
swebench_openhands_training.yaml
-
Swe Agents
coding
Software engineering tasks with the OpenHands agent harness.
Improve agentic software engineering capabilities.
✓
✓
MIT
-
Swe Pivot
agent
SWE pivot verifier for PivotRL on coding agent trajectories
Improve coding agent fix-design decisions
✓
✓
Apache 2.0
-
Swerl Gen
coding
Sandboxed evaluation for SWE-style tasks (either patch generation or reproduction test generation)
Improve SWE capabilities useful for benchmarks like SWE-bench
✓
✓
Apache 2.0
-
Swerl Llm Judge
coding
SWE-style multiple-choice LLM-judge tasks scored via ... choice.
Improve SWE capabilities useful for benchmarks like SWE-bench
✓
✓
MIT
-
Tau2
agent
Tau2 benchmark integration
Evaluate multi-turn agentic capability with user simulation.
-
-
-
-
Tavily Search
agent
Model uses search tools to satisfy a user query.
Measure agentic search capability
✓
✓
Apache 2.0
tavily_search_judge_vllm_model.yaml
-
Terminal Multi Harness
agent
Agent006 harness structured-action verifier for next-step pivot RL.
-
-
-
-
terminal_multi_harness_agent006.yaml
-
Terminal Multi Harness
agent
Codex harness structured-action verifier for next-step pivot RL.
-
-
-
-
terminal_multi_harness_codex.yaml
-
Terminal Multi Harness
agent
OpenCode harness structured-action verifier for next-step pivot RL.
-
-
-
-
terminal_multi_harness_opencode.yaml
-
Terminal Multi Harness
agent
Stirrup harness structured-action verifier for next-step pivot RL.
-
-
-
-
terminal_multi_harness_stirrup.yaml
-
Terminus Judge
agent
Single-step terminal-based task (rubrics v4 judge prompt)
Improve on terminal-style tasks
✓
✓
Apache 2.0
-
Terminus Judge
agent
Single-step terminal-based task (simple judge prompt)
Improve on terminal-style tasks
✓
✓
Apache 2.0
-
Terminus Judge
agent
Single-step terminal-based task (string similarity only)
Improve on terminal-style tasks
✓
✓
Apache 2.0
terminus_judge_string_only.yaml
-
Text To Sql
coding
Text-to-SQL generation with LLM-as-a-judge equivalence checking
Improve text-to-SQL capabilities across multiple dialects
-
-
-
-
Ugphysics Judge
knowledge
Undergraduate physics QA verified by a TRUE/FALSE LLM judge with math-verify symbolic fallback
Score undergraduate-physics benchmarks (e.g. UGPhysics) where the judge is a TRUE/FALSE equivalence grader using a reference solution
-
-
-
-
Verifiers Agent
math
Prime Intellect verifiers and environments hub integration, with an ace-reason math environment example.
Improve math reasoning capabilities.
✓
-
-
-
Verifif
instruction_following
VerifIF instruction following validators with rule-based and LLM judge support
Improve instruction following capabilities with comprehensive validation
-
-
-
-
Vlm Eval Kit
other
-
Measure VLM capabilities
-
✓
-
-
Vlm Eval Kit
other
-
Measure VLM capabilities
-
✓
-
-
Vlm Eval Kit
other
Run all supported VLMEvalKit benchmarks.
Measure VLM capabilities
-
✓
-
-
Wmt Translation
other
Machine-translation verifier computing corpus-level BLEU per language pair plus optional xCOMET-XXL neural QE via a Ray GPU actor.
Improve multilingual translation quality across BLEU and COMET-family metrics.
-
-
-
-
Workplace Assistant
agent
Workplace assistant multi-step tool-using environment
Improve multi-step tool use capability
✓
✓
Apache 2.0
Nemotron-RL-agent-workplace_assistant
Xlam Fc
agent
Salesforce xlam-function-calling-60k tool calling tasks
Improve tool-calling capabilities
✓
✓
Apache 2.0
-
Xstest
safety
XSTest safety benchmark - exaggerated safety (over-refusal) evaluation
Evaluate model safety calibration between helpfulness and harmlessness
-
-
-
-
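
Several rows above state their reward rule in one phrase: Blackjack ("+1 win, 0 draw, -1 loss/bust"), Bird Sql ("binary reward from unordered result-set equality"), and Graphwalks ("F1-over-node-sets grading"). The sketch below illustrates those three rules in plain Python. It is not taken from the Gym repository: the function names, the outcome labels, the choice to compare SQL rows as multisets (so duplicate rows count), and the score of 1.0 when both node sets are empty are all assumptions made for illustration.

```python
from collections import Counter


def blackjack_reward(outcome: str) -> int:
    """Map a finished hand to a reward: +1 win, 0 draw, -1 loss or bust.

    The outcome labels are hypothetical; the table only gives the reward values.
    """
    rewards = {"win": 1, "draw": 0, "loss": -1, "bust": -1}
    return rewards[outcome]


def result_set_reward(predicted_rows, expected_rows) -> float:
    """Binary reward: 1.0 if both queries return the same rows in any order.

    Assumption: rows are compared as multisets, so duplicates must match too.
    """
    predicted = Counter(tuple(row) for row in predicted_rows)
    expected = Counter(tuple(row) for row in expected_rows)
    return 1.0 if predicted == expected else 0.0


def node_set_f1(predicted_nodes, expected_nodes) -> float:
    """F1 score between a predicted and a reference set of graph nodes."""
    predicted, expected = set(predicted_nodes), set(expected_nodes)
    if not predicted and not expected:
        return 1.0  # assumption: two empty sets count as a perfect match
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(blackjack_reward("bust"))
    print(result_set_reward([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))
    print(node_set_f1(["a", "b"], ["b", "c"]))
```

The first two rules are all-or-nothing, while the set-level F1 gives partial credit when the model recovers only some of the expected nodes.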