GitHub - NVIDIA-NeMo/Gym: Evaluate and improve models and agents using environments

Aalcr

other

-

-

-

-

-

aalcr.yaml

-

Abstention

rlhf

Train models to abstain when unsure using three-tier reward on HotPotQA with LLM judge

Improve calibration by rewarding abstention over incorrect answers

Creative Commons Attribution-ShareAlike 4.0 International

abstention.yaml

-
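The three-tier reward described in the Abstention row can be sketched roughly as below. The verdict labels and the numeric values are assumptions chosen for illustration; the description only states that abstaining is rewarded above an incorrect answer.

```python
# Hypothetical three-tier reward table; labels and values are illustrative
# assumptions, not the values used by abstention.yaml.
REWARDS = {"correct": 1.0, "abstain": 0.0, "incorrect": -1.0}


def three_tier_reward(verdict: str) -> float:
    """Map a judge verdict to a scalar reward."""
    if verdict not in REWARDS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return REWARDS[verdict]
```

Any table that keeps the ordering correct > abstain > incorrect gives the policy a reason to abstain when it would otherwise guess wrong.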

Arc Agi

knowledge

Solve puzzles designed to test intelligence. See https://arcprize.org/arc-agi.

Improve puzzle-solving capabilities.

-

-

arc_agi.yaml

-

Arena Judge

-

-

-

-

-

arena_judge.yaml

-

Asr With Pc

other

ASR with WER scoring (standard, case-sensitive, punctuation+capitalization)

Improve transcription quality with structural detail

-

-

-

asr_with_pc.yaml

-
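As a rough illustration of case-sensitive, punctuation-aware WER scoring, the sketch below computes word error rate as word-level edit distance divided by reference length. Splitting on whitespace without any normalization is an assumption made here so that case and punctuation differences count as errors; the actual scorer may tokenize differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()  # no lowercasing, no punctuation stripping
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the words consumed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

Under this scheme "Hello, world." versus "hello world" scores two substitutions out of two reference words, which is why punctuation and capitalization matter for the reward.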

Aviary

agent

Multi-hop question answering on the HotPotQA dataset with Wikipedia search

Improve knowledge and agentic capability

Apache 2.0

hotpotqa_aviary.yaml

-

Aviary

math

GSM8k benchmark with calculator tool

Test math and agentic capability

Apache 2.0

gsm8k_aviary.yaml

-

Bigcodebench

coding

Verifies model-generated Python solutions against the BigCodeBench unittest suite.

Improve practical, library-rich Python coding capabilities.

-

-

-

bigcodebench.yaml

-

Bird Sql

coding

Text-to-SQL with execution-based evaluation on BIRD dev (1534 SQLite tasks). Binary reward from unordered result-set equality.

Improve text-to-SQL capabilities on BIRD's realistic dev split using execution-based binary reward without an LLM judge.

-

-

-

bird_sql.yaml

-
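A binary reward from unordered result-set equality might look like the sketch below. The function name, the in-memory demo table, and the choice to compare rows as sets (which ignores both order and duplicate rows) are assumptions for illustration; the environment's own comparison may differ in detail.

```python
import sqlite3


def execution_reward(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> float:
    """Return 1.0 if both queries yield the same set of rows, else 0.0."""
    gold_rows = set(conn.execute(gold_sql).fetchall())
    try:
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        # A query that fails to execute earns no reward.
        return 0.0
    return 1.0 if predicted_rows == gold_rows else 0.0


# Tiny demo database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
```

Because the comparison is on executed results, a prediction with different SQL text or a different `ORDER BY` still earns the reward as long as the rows match.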

Blackjack

games

Blackjack. Model hits or stands. Reward +1 win, 0 draw, -1 loss/bust.

Example gymnasium-style multi-step environment

-

-

-

blackjack.yaml

-
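The Blackjack reward scheme (+1 win, 0 draw, -1 loss/bust) can be sketched as a function of the final hand totals. The function name is hypothetical, and treating a dealer bust as a player win is an assumption based on standard blackjack rules rather than something the description states.

```python
def blackjack_reward(player_total: int, dealer_total: int) -> int:
    """Terminal reward from final hand totals: +1 win, 0 draw, -1 loss or bust."""
    if player_total > 21:
        return -1  # player bust always loses
    if dealer_total > 21:
        return 1  # assumption: dealer bust counts as a player win
    if player_total > dealer_total:
        return 1
    if player_total < dealer_total:
        return -1
    return 0  # equal totals: draw
```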

Browsecomp Advanced Harness

agent

Model uses search tools to satisfy a user query.

Measure agentic search capability

-

-

-

browsecomp_advanced_harness.yaml

-

Calendar

agent

Multi-turn calendar scheduling dataset. User states events and constraints in natural language; model schedules events to satisfy all constraints.

Improve multi-turn instruction following capabilities

Apache 2.0

calendar.yaml

Nemotron-RL-agent-calendar_scheduling

Calendar

agent

Multi-turn calendar scheduling dataset. User states events and constraints in natural language; model schedules events to satisfy all constraints.

Improve multi-turn instruction following capabilities

Creative Commons Attribution 4.0 International

calendar_v2.yaml

Nemotron-RL-Instruction-Following-Calendar-v2

Circle Click

other

Click on circles in images

Improve visual grounding and spatial reasoning

-

-

-

circle_click.yaml

-

Circle Count

other

Count circles of a given color in images

Improve visual counting and color recognition

-

-

-

circle_count.yaml

-

Code Fim

coding

Code Fill-in-the-Middle judged by HumanEval-Infilling test suite (single_line, multi_line, random_span, random_span_light)

Improve Python code-infilling capabilities (prefix + completion + suffix)

-

-

-

code_fim.yaml

-

Code Gen

coding

Model must submit the right code to solve a problem

Improve competitive coding capabilities

Apache 2.0

code_gen.yaml

nemotron-RL-coding-competitive_coding

Competitive Coding Challenges

coding

Execution of competitive programming competition questions

Improve competitive coding capabilities on contest-style problems

-

-

-

competitive_coding_challenges.yaml

-

Critpt

other

Research-level physics problems scored by the Artificial Analysis API

Evaluate model performance on research-level physics reasoning

-

-

-

critpt.yaml

-

Cvdp

coding

CVDP benchmark dataset for code generation

Evaluate RTL code generation capabilities

-

-

cvdp.yaml

-

Equivalence Llm Judge

agent

Short bash command generation questions with LLM-as-a-judge

Improve foundational bash and IF capabilities

GNU General Public License v3.0

nl2bash-equivalency.yaml

-

Equivalence Llm Judge

knowledge

Short answer questions with LLM-as-a-judge

Improve knowledge-related benchmarks like GPQA / HLE

-

-

-

equivalence_llm_judge.yaml

-

Equivalence Rule

knowledge

Question answering with rule-based reward

Improve retrieval and counting capabilities

-

-

-

lc.yaml

-

Ether0

knowledge

ether0 chemistry benchmark verifiers

Evaluate chemistry knowledge and reasoning with the ether0 benchmark

-

-

ether0.yaml

-

Evalplus

coding

Function-completion code judged by EvalPlus base + plus tests (HumanEval+, MBPP+)

Improve Python function-completion capabilities

-

-

-

evalplus.yaml

-

Finance Sec Search

agent

SEC EDGAR filing search for financial analysis questions

Enable LLMs to search and analyze SEC filings

-

-

-

finance_sec_search.yaml

-

Format Verification

instruction_following

Verify citation/reference markers in model responses via string matching

Improve instruction following for citation format adherence

-

Apache 2.0

citation_format.yaml

-

Format Verification

instruction_following

Verify freeform text formatting (bullets, headings, tables, etc.) via regex patterns

Improve instruction following for text formatting constraints

-

Apache 2.0

freeform_formatting.yaml

-

Frontierscience Judge

other

FrontierScience answer grading via single-pass LLM judge

Evaluate FrontierScience Olympiad short answers or Research rubric-scored answers

-

-

-

frontierscience_judge.yaml

-

Genrm Compare

rlhf

GenRM pairwise comparison for RLHF training

Compare multiple candidate responses using GenRM model

-

-

-

genrm_compare.yaml

-

Google Search

agent

Multi-choice question answering problems with search tools integrated

Improve knowledge-related benchmarks with search tools

-

Apache 2.0

google_search.yaml

Nemotron-RL-knowledge-web_search-mcqa

Gpqa Diamond

knowledge

GPQA Diamond multiple-choice question answering problems

Evaluate graduate-level scientific reasoning via MCQ verification

-

MIT

gpqa_diamond.yaml

-

Graphwalks

other

Long-context graph-walks (BFS / parents) with F1-over-node-sets grading from openai/graphwalks

Improve long-context multi-step graph reasoning and adjacency-list traversal

-

-

-

graphwalks.yaml

-
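F1-over-node-sets grading can be sketched as follows. The function name is hypothetical, and scoring 1.0 when both the predicted and expected sets are empty is an assumption about the edge case.

```python
from typing import Iterable


def node_set_f1(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """F1 between the predicted and expected node sets (order and duplicates ignored)."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0  # assumption: an empty answer to an empty target is fully correct
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this gives partial credit when a traversal finds some but not all of the target nodes.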

Grl Sokoban

games

Single-box Sokoban in Gymnasium API style.

Model emits one move per turn until the puzzle is solved.

-

-

-

grl_sokoban.yaml

-

Grl Tetris

games

Tetris in Gymnasium API style. Model emits one or more moves per turn.

Multi-step Tetris environment

-

-

-

grl_tetris.yaml

-

Gymnasium

other

Base class for Gymnasium-style servers. Not a standalone server.

Reusable base class for step/reset style environments

-

-

-

gymnasium.yaml

-

Harbor Agent

agent

Harbor integration for agent harnesses and environments.

Improve models in popular agentic environments supported by Harbor such as Terminus2.

-

-

harbor_agent.yaml

-

Harbor Agent

agent

Harbor integration for agent harnesses and environments.

Improve models in popular agentic environments supported by Harbor such as Terminus2.

-

-

harbor_agent_daytona.yaml

-

Hotpotqa Qa

knowledge

Short-answer QA with deterministic SQuAD-style + alternative-aware substring verification (HotpotQA closed-book).

Improve closed-book multi-hop question-answering accuracy.

-

-

-

hotpotqa_qa.yaml

-
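A deterministic SQuAD-style check with alternative answers could be sketched as below. The normalization steps follow the widely used SQuAD evaluation recipe (lowercase, strip punctuation, drop English articles, collapse whitespace); the `matches_any` helper and its substring rule are assumptions about how "alternative-aware substring verification" might work.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, no punctuation, no articles, single spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def matches_any(response: str, alternatives: list[str]) -> bool:
    """True if any normalized alternative appears inside the normalized response."""
    normalized_response = normalize_answer(response)
    for alternative in alternatives:
        normalized_alt = normalize_answer(alternative)
        if normalized_alt and normalized_alt in normalized_response:
            return True
    return False
```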

Ifbench

instruction_following

IFBench instruction following evaluation using AllenAI's IFBench library (57 instruction types)

Improve IFBench instruction following

-

-

-

ifbench.yaml

-

Imo Gradingbench

math

Four-class grading of math proofs — the policy model reads a problem plus a candidate proof and emits one of correct / almost / partial / incorrect as the last word.

Improve the IMO-GradingBench benchmark and proof-grading skill.

-

-

-

imo_gradingbench.yaml

-

Imo Proofbench Judge

math

IMO ProofBench grader using a strong LLM judge with the IMO 0-7 rubric

Score IMO-style proof submissions with a problem-specific grading rubric

-

-

-

imo_proofbench_judge.yaml

-

Indirect Prompt Injection

safety

Indirect prompt injection resistance for multi-domain tool-use agents

Improve agentic security by teaching robustness against tool outputs containing malicious instructions

Apache 2.0

indirect_prompt_injection.yaml

-

Instruction Following

instruction_following

Instruction following datasets targeting IFEval and IFBench style instruction following capabilities

Improve IFEval and IFBench

-

Apache 2.0

instruction_following.yaml

Nemotron-RL-instruction_following

Inverse If

knowledge

Inverse IF instruction-following benchmark with per-task LLM judge

-

-

TBD

inverse_if.yaml

-

Jailbreak Detection

safety

Jailbreak detection with Nemotron judge + combined reward

Improve Jailbreak Robustness and Safety/Security Behavior Guide Enforcement

-

-

-

jailbreak_detection_nemotron_combined_reward_tp8.yaml

-

Labbench2 Vlm

knowledge

labbench2 VLM benchmarks: scientific figure/table QA (figqa2, tableqa2), protocol troubleshooting (protocolqa2), LLM-as-judge

Measure scientific reasoning on figures, tables, and lab protocols

-

-

labbench2_vlm.yaml

-

Longmt Eval

other

Document-level MT verifier for pg19 books using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score)

Rewards long-form book translation at the document level using reference-free COMETKiwi scores as the RL reward signal.

-

-

-

longmt_pg19.yaml

-

Longmt Eval

other

Document-level MT verifier for wmt24pp short docs using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score).

Rewards document-level translation quality across 55 language pairs using reference-free COMETKiwi scores as the RL reward signal.

-

-

-

longmt_wmt24pp.yaml

-

Longmt Eval

other

Document-level MT verifier using the SEGALE pipeline (ersatz segment → LASER2 embed → vecalign align → COMETKiwi score)

Rewards long-form translation quality at the document level using reference-free COMETKiwi scores as the RL reward signal.

-

-

-

longmt_eval.yaml

-

Math Advanced Calculations

agent

An instruction following math environment with counter-intuitive calculators

Improve instruction following capabilities in specific math environments

-

Apache 2.0

math_advanced_calculations.yaml

Nemotron-RL-math-advanced_calculations

Math Formal Lean

math

Lean4 formal proof verification environment

Improve formal theorem proving capabilities

-

Apache 2.0

nemotron_clean_easy.yaml

-

Math Formal Lean

math

Lean4 formal proof verification environment

Improve formal theorem proving capabilities

-

Apache 2.0

nemotron_first_try_hard.yaml

-

Math Formal Lean

math

Lean4 formal proof verification environment

Improve formal theorem proving capabilities

-

Apache 2.0

nemotron_medium_500.yaml

-

Math Formal Lean

math

Lean4 formal proof verification environment

Improve formal theorem proving capabilities

-

Apache 2.0

nemotron_very_easy.yaml

-

Math Formal Lean

math

Lean4 formal proof verification environment

Improve formal theorem proving capabilities

-

MIT

math_formal_lean.yaml

-

Math Formal Lean

math

Lean4 formal proof verification environment with multi-turn self-correction

Improve formal theorem proving capabilities

-

MIT

math_formal_lean_multi_turn.yaml

-

Math Proof Judgement

math

Binary judgement of math proofs — the policy model reads a problem plus a candidate proof and outputs Judgement: Yes/No.

Improve the NVIDIA ProofBench judge benchmark and math-proof verification skill.

-

-

-

math_proof_judgement.yaml

-

Math With Autograder

math

Math QA verified by a Skills-style autograder LLM judge with math-verify symbolic fallback

Score hard-math benchmarks (e.g. IMO AnswerBench) where the judge is a unidirectional Correct/Incorrect grader

-

-

-

math_with_autograder.yaml

-

Math With Code

math

Model solves competitive math problems using simple calculator tools

Improve math and simple tool use capabilities

-

Apache 2.0

math_with_code.yaml

-

Math With Judge

math

DAPO17k math dataset with math-verify

Improve math capabilities including AIME 24 / 25

Apache 2.0

dapo17k.yaml

-

Math With Judge

math

Hermes Agent with terminal, file, code_execution, skills, todo toolsets on OpenMathReasoning math dataset with math-verify and LLM-as-a-judge

Improve model math capabilities (e.g., AIME25) in the Hermes Agent harness

-

Creative Commons Attribution 4.0 International

math_with_judge_hermes_agent.yaml

Nemotron-RL-math-OpenMathReasoning

Math With Judge

math

MathStackOverflow math dataset with math-verify

Improve math capabilities including AIME 24 / 25

Creative Commons Attribution-ShareAlike 4.0 International

math_stack_overflow.yaml

Nemotron-RL-math-stack_overflow

Math With Judge

math

OpenMathReasoning math dataset with math-verify and LLM-as-a-judge

Improve math capabilities including AIME 24 / 25

Creative Commons Attribution 4.0 International

math_with_judge.yaml

Nemotron-RL-math-OpenMathReasoning

Mcqa

knowledge

Multi-choice question answering problems

Improve benchmarks like MMLU / GPQA / HLE

Apache 2.0

mcqa.yaml

Nemotron-RL-knowledge-mcqa

Mini Swe Agent

coding

Software engineering tasks driven by the mini-swe-agent harness.

Improve agentic software engineering capabilities.

MIT

mini_swe_agent.yaml

SWE-Gym

Mrcr

other

Multi-round coreference resolution over multi-turn conversations with prefix-gated SequenceMatcher grading

Improve long-context in-context retrieval and needle-count-aware reasoning

-

-

-

mrcr.yaml

-
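Prefix-gated SequenceMatcher grading can be sketched with Python's standard `difflib`. The function and parameter names are hypothetical; the idea shown is that a response missing the required prefix scores zero, and otherwise the score is the similarity ratio between the response and the reference once the prefix is stripped from both.

```python
from difflib import SequenceMatcher


def prefix_gated_score(response: str, answer: str, required_prefix: str) -> float:
    """Return 0.0 unless the response starts with the prefix, else a similarity ratio."""
    if not response.startswith(required_prefix):
        return 0.0
    # Strip the prefix from both sides so it does not inflate the ratio.
    response_body = response.removeprefix(required_prefix)
    answer_body = answer.removeprefix(required_prefix)
    return SequenceMatcher(None, response_body, answer_body).ratio()
```

`str.removeprefix` requires Python 3.9 or later.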

Multichallenge

knowledge

Targets inference memory, instruction retention, version editing, and self-coherence.

Improve complex multi-turn conversational capability

-

Creative Commons Attribution 4.0 International

multichallenge_nrl.yaml

Nemotron-RL-Instruction-Following-MultiTurnChat-v1

Newton Bench

math

Scientific law discovery tasks through agentic experimentation across 12 physics domains

Improve science, reasoning, and tool use capabilities

-

Apache 2.0

newton_bench.yaml

-

Ns Tools

agent

NeMo Skills tool execution with math verification

-

-

-

-

ns_tools.yaml

-

Nvarc

knowledge

ARC-AGI inductive mode: model outputs Python code with transform()

Improve ARC-AGI puzzle-solving by inducing executable transformation programs

Apache 2.0

inductive.yaml

-

Nvarc

knowledge

ARC-AGI transductive mode: model outputs grid directly

Improve ARC-AGI puzzle-solving by directly predicting transformed grids

Apache 2.0

transductive.yaml

-

Omniscience

knowledge

Omniscience factual knowledge QA with LLM judge verification

Evaluate factual recall and calibration via LLM-graded open-ended QA

-

-

-

omniscience.yaml

-

Openenv

agent

Echo environment via OpenEnv (MCP). Echoes messages back with length-based rewards.

-

-

-

-

openenv_echo.yaml

-

Openenv

coding

Python code execution environment via OpenEnv. Executes code and returns stdout/stderr.

-

-

-

-

openenv_coding.yaml

-

Openenv

games

Maze navigation environment via OpenEnv. Agent navigates an 8x8 grid to find the exit.

-

-

-

-

openenv_maze.yaml

-

Over Refusal Detection

-

-

-

TBD

over_refusal_detection.yaml

-

Physics Judge

math

Physics QA verified by NeMo Skills' physics judge LLM with math-verify symbolic fallback

Score open-ended physics benchmarks (e.g. PHYSICS) where the judge emits [Correct] / [Incorrect] verdicts

-

-

-

physics_judge.yaml

-

Polymath

math

PolyMath multilingual math benchmark with weighted (difficulty) and per-language metrics

Improve multilingual math reasoning across 18 languages and 4 difficulty tiers

-

-

-

polymath.yaml

-

Proof Genselect

math

Pairwise proof selection with binary correctness reward

-

-

-

-

proof_genselect.yaml

-

Proof Judge

math

Theorem proving with verifier + meta-verifier judge (combined env)

-

-

-

-

proof_judge.yaml

-

Proof Verification

math

Proof verification scored against ground truth and meta-verifier agreement

-

-

-

-

proof_verification.yaml

-

Rdkit Chemistry

knowledge

Molecular chemistry question answering: calculate properties of SMILES. Includes a mix of tool-use (python + rdkit) and no-tool-use questions.

Improve molecular reasoning and SMILES parsing.

-

TBD

rdkit_chemistry.yaml

-

Reasoning Gym

knowledge

Claude Code agent harness for reasoning gym tasks

Evaluate model capabilities in the Claude Code agent harness

-

Creative Commons Attribution 4.0 International

reasoning_gym_claude_code_agent.yaml

Nemotron-RL-ReasoningGym-v1

Reasoning Gym

knowledge

LangGraph orchestrator agent compatible with resource servers that do not use tools; enables diverse agent training data and test time scaling vs a simple agent, extensible to use tools or other agent architectures

Iterative test time scaling for improved performance in reasoning tasks

-

Apache 2.0

orchestrator_agent.yaml

-

Reasoning Gym

knowledge

LangGraph parallel thinking agent compatible with resource servers that do not use tools; enables diverse agent training data and test time scaling vs a simple agent, extensible to use tools or other agent architectures

Iterative test time scaling for improved performance in reasoning tasks

-

Apache 2.0

parallel_thinking_agent.yaml

-

Reasoning Gym

knowledge

LangGraph reflection agent compatible with resource servers that do not use tools; provides iterative reflection for diverse agent training data and test time scaling, extensible to use tools or other agent architectures

Iterative test time scaling for improved performance in reasoning tasks

-

Apache 2.0

reflection_agent.yaml

-

Reasoning Gym

knowledge

LangGraph ReWOO agent compatible with resource servers that do not use tools; enables diverse agent training data and test time scaling vs a simple agent, extensible to use tools or other agent architectures

Iterative test time scaling for improved performance in reasoning tasks

-

Apache 2.0

rewoo_agent.yaml

-

Reasoning Gym

knowledge

Over 100 tasks including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and many common games.

Improve robustness, generalization, broad knowledge and reasoning

-

Creative Commons Attribution 4.0 International

reasoning_gym.yaml

Nemotron-RL-ReasoningGym-v1

Ruler

other

-

-

-

-

-

ruler.yaml

-

Scicode

coding

Multi-step scientific code generation; sub-step solutions executed against SciCode test cases

Improve multi-step scientific code generation

-

-

-

scicode.yaml

-

Simpleqa

knowledge

SimpleQA short-form factual QA with 3-tier LLM judge (CORRECT / INCORRECT / NOT_ATTEMPTED)

Evaluate factual recall and abstention calibration via LLM-graded open-ended QA

-

-

-

simpleqa.yaml

-

Single Step Tool Use With Argument Comparison

agent

Conversational tool-use RL from expert trajectories; behavior cloning per step across auth, lookup, and servicing domains.

-

Creative Commons Attribution 4.0 International

single_step_tool_use_with_argument_comparison.yaml

Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1

Single Step Tool Use With Argument Comparison

agent

Droid agent pivot dataset; behavior cloning per step from successful Droid rollouts.

-

-

-

-

droid_pivot_single_step_tool_use_with_argument_comparison.yaml

-

Single Step Tool Use With Argument Comparison

agent

General function-calling RL dataset using expert trajectories; behavior cloning to match expert tool calls per step.

-

Creative Commons Attribution 4.0 International

toolcall_schema_single_step_tool_use_with_argument_comparison.yaml

Nemotron-RL-Agentic-Function-Calling-Pivot-v1

Single Step Tool Use With Argument Comparison

agent

GitHub-issue dataset for software-engineering agents; refactored from SWE-Gym and SWE-Bench-Verified for NeMo Gym.

-

Creative Commons Attribution 4.0 International

swe_pivot_single_step_tool_use_with_argument_comparison.yaml

Nemotron-RL-Agentic-SWE-Pivot-v1

Single Step Tool Use With Argument Comparison

agent

The model must output the next correct call in a given trajectory involving search tools.

Improve agentic search capability.

Apache 2.0

search_pivot_single_step_tool_use_with_argument_comparison.yaml

-

Speed Bench

other

Speculative-decoding throughput benchmark. Reads vLLM /metrics Prometheus counters before/after generation to compute acceptance length and acceptance rate.

Measure inference-time speculative-decoding effectiveness for serving research and regression tests.

-

-

-

speed_bench.yaml

-
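The before/after counter arithmetic can be sketched as below. The dictionary keys are hypothetical stand-ins for whatever Prometheus counter names the server exposes, and the formulas (acceptance rate as accepted tokens over drafted tokens; acceptance length as one plus accepted tokens per draft step) are the commonly used definitions, assumed here rather than taken from the benchmark's code.

```python
def spec_decode_stats(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Compute speculative-decoding stats from counter snapshots taken around a run."""
    # Counters are cumulative, so only the difference belongs to this run.
    drafts = after["num_drafts"] - before["num_drafts"]
    draft_tokens = after["num_draft_tokens"] - before["num_draft_tokens"]
    accepted = after["num_accepted_tokens"] - before["num_accepted_tokens"]
    return {
        "acceptance_rate": accepted / draft_tokens if draft_tokens else 0.0,
        # Each draft step also yields one token from the target model itself.
        "acceptance_length": 1.0 + accepted / drafts if drafts else 1.0,
    }
```

Taking deltas rather than raw values keeps the result independent of any traffic the server handled before the benchmark started.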

Spider2 Lite

coding

Text-to-SQL with execution-based evaluation on Spider 2.0-Lite (135 SQLite tasks). Binary reward based on result-set equivalence.

Improve text-to-SQL capabilities for real-world enterprise queries using execution-based binary reward without an LLM judge.

-

-

spider2_lite.yaml

-

Structeval

instruction_following

StructEval non-renderable format verification (JSON, YAML, CSV, TOML, XML)

Improve structured output generation quality

-

Apache 2.0

structeval_nonrenderable.yaml

-

Structured Outputs

instruction_following

Check if responses are following structured output requirements in prompts

Improve instruction following capabilities

Apache 2.0

structured_outputs_json.yaml

Nemotron-RL-instruction_following-structured_outputs

Structured Outputs

instruction_following

Check if responses are following structured output requirements in prompts

Improve instruction following capabilities

Apache 2.0

structured_outputs_json_yaml_xml_v1.yaml

-

Structured Outputs

instruction_following

Check if responses follow structured output requirements (JSON, YAML, XML, TOML, CSV). Created 20260409.

Improve schema adherence across all structured output formats

-

Apache 2.0

structured_outputs_v3.yaml

-

Structured Outputs

instruction_following

Check if responses follow tool-call structured output schemas. Created 20260424.

Improve schema adherence when structured output is expressed through tool calls

-

Apache 2.0

structured_outputs_v4.yaml

-

Swe Agents

-

-

Apache 2.0

swebench_multi_tools.yaml

-

Swe Agents

-

-

Apache 2.0

swebench_openhands.yaml

-

Swe Agents

-

-

Apache 2.0

swebench_openhands_training.yaml

-

Swe Agents

coding

Software engineering tasks with OpenHands agent harness.

Improve agentic software engineering capabilities.

MIT

swebench_swe_agent.yaml

-

Swe Pivot

agent

SWE pivot verifier for PivotRL on coding agent trajectories

Improve coding agent fix-design decisions

Apache 2.0

swe_pivot.yaml

-

Swerl Gen

coding

Running sandboxed evaluation for SWE-style tasks (either patch generation or reproduction test generation)

Improve SWE capabilities useful for benchmarks like SWE-bench

Apache 2.0

swerl_gen.yaml

-

Swerl Llm Judge

coding

SWE-style multiple-choice LLM-judge tasks scored via ... choice.

Improve SWE capabilities useful for benchmarks like SWE-bench

MIT

swerl_llm_judge.yaml

-

Tau2

agent

Tau2 benchmark integration

Evaluate multi-turn agentic capability with user simulation.

-

-

-

tau2_agent.yaml

-

Tavily Search

agent

Model uses search tools to satisfy a user query.

Measure agentic search capability

Apache 2.0

tavily_search_judge_vllm_model.yaml

-

Terminal Multi Harness

agent

Agent006 harness structured-action verifier for next-step pivot RL.

-

-

-

-

terminal_multi_harness_agent006.yaml

-

Terminal Multi Harness

agent

Codex harness structured-action verifier for next-step pivot RL.

-

-

-

-

terminal_multi_harness_codex.yaml

-

Terminal Multi Harness

agent

OpenCode harness structured-action verifier for next-step pivot RL.

-

-

-

-

terminal_multi_harness_opencode.yaml

-

Terminal Multi Harness

agent

Stirrup harness structured-action verifier for next-step pivot RL.

-

-

-

-

terminal_multi_harness_stirrup.yaml

-

Terminus Judge

agent

Single-step terminal-based task (rubrics v4 judge prompt)

Improve on terminal-style tasks

Apache 2.0

terminus_judge.yaml

-

Terminus Judge

agent

Single-step terminal-based task (simple judge prompt)

Improve on terminal-style tasks

Apache 2.0

terminus_judge_simple.yaml

-

Terminus Judge

agent

Single-step terminal-based task (string similarity only)

Improve on terminal-style tasks

Apache 2.0

terminus_judge_string_only.yaml

-

Text To Sql

coding

Text-to-SQL generation with LLM-as-a-judge equivalence checking

Improve text-to-SQL capabilities across multiple dialects

-

-

-

text_to_sql.yaml

-

Ugphysics Judge

knowledge

Undergraduate physics QA verified by a TRUE/FALSE LLM judge with math-verify symbolic fallback

Score undergraduate-physics benchmarks (e.g. UGPhysics) where the judge is a TRUE/FALSE equivalence grader using a reference solution

-

-

-

ugphysics_judge.yaml

-

Verifiers Agent

math

Prime Intellect verifiers and environments hub integration, with an ace-reason math environment as the example.

Improve math reasoning capabilities.

-

-

acereason-math.yaml

-

Verifif

instruction_following

VerifIF instruction following validators with rule-based and LLM judge support

Improve instruction following capabilities with comprehensive validation

-

-

-

verifif.yaml

-

Vlm Eval Kit

other

-

Measure VLM capabilities

-

-

MMBench_DEV_EN_V11.yaml

-

Vlm Eval Kit

other

-

Measure VLM capabilities

-

-

OCRBench.yaml

-

Vlm Eval Kit

other

Run all supported VLMEvalKit benchmarks.

Measure VLM capabilities

-

-

vlm_eval_kit.yaml

-

Wmt Translation

other

Machine-translation verifier computing corpus-level BLEU per language pair plus optional xCOMET-XXL neural QE via a Ray GPU actor.

Improves multilingual translation quality across BLEU and COMET-family metrics.

-

-

-

wmt_translation.yaml

-

Workplace Assistant

agent

Workplace assistant multi-step tool-using environment

Improve multi-step tool use capability

Apache 2.0

workplace_assistant.yaml

Nemotron-RL-agent-workplace_assistant

Xlam Fc

agent

Salesforce xlam-function-calling-60k tool calling tasks

Improve tool-calling capabilities

Apache 2.0

xlam_fc.yaml

-

Xstest

safety

XSTest safety benchmark - exaggerated safety (over-refusal) evaluation

Evaluate model safety calibration between helpfulness and harmlessness

-

-

-

xstest.yaml

-