Scientific Knowledge - Nemo-Skills

Nemo-Skills can be used to evaluate an LLM on various STEM datasets.

Dataset Overview

| Dataset | Questions | Types | Domain | Images? | Nemo-Skills (NS) default |
| --- | --- | --- | --- | --- | --- |
| HLE | 2,500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only |
| HLE-Verified | 2,500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | gold+revision text only |
| GPQA | 448 (main), 198 (diamond), 546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| SuperGPQA | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test |
| MMLU-Pro | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| SciCode | 80 (338 subtasks) | Code gen | Scientific computing | No | test+val |
| FrontierScience | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| Physics | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
| UGPhysics | 5,520 (EN), 5,520 (ZH) | Open-ended MCQ | Physics | No | EN |
| MMLU | 14,042 | MCQ (4) | Multiple Subjects | No | test |
| MMLU-Redux | 5,385 | MCQ (4) | Multiple Subjects | No | test |
| SimpleQA | 4,326 (test), 1,000 (verified) | Open ended | Factuality, Parametric knowledge | No | verified |

Evaluate NVIDIA-Nemotron-3-Nano on an MCQ dataset

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals"
)
```
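The inference settings in the call above are passed as one space-separated string of `++key=value` overrides. As a minimal sketch, that string could be assembled from a dictionary so the settings are easier to read and reuse. The helper name `build_overrides` is hypothetical and is not part of Nemo-Skills; values are passed as strings or numbers and rendered verbatim.

```python
def build_overrides(params: dict) -> str:
    """Render a flat dict as a space-separated string of ++key=value overrides.

    Hypothetical helper for illustration only; not a Nemo-Skills API.
    """
    # Each value is rendered with str(), so pass "true"/"True" as strings
    # to reproduce the exact spelling used in the example.
    return " ".join(f"++{key}={value}" for key, value in params.items())


overrides = build_overrides({
    "inference.temperature": 1.0,
    "inference.top_p": 1.0,
    "inference.tokens_to_generate": 131072,
    "chat_template_kwargs.enable_thinking": "true",
    "parse_reasoning": "True",
})
print(overrides)
```

The resulting string has the same content as the concatenated literals given to `wrap_arguments(...)` in the example, minus the trailing space.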

Evaluate NVIDIA-Nemotron-3-Nano using LLM-as-a-judge

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="hle:4",
    output_dir="/workspace/Nano_V3_evals",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
    extra_judge_args="++chat_template_kwargs.reasoning_effort=high ++inference.temperature=1.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=120000 "
)
```
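The `benchmarks` argument in these examples takes values such as `"gpqa:4"` and `"hle:4"`. The sketch below splits such a specifier into a benchmark name and a count, assuming the `:N` suffix is the number of repeated runs per benchmark and that several benchmarks can be comma-separated. Both are assumptions to check against the Nemo-Skills documentation. The function `parse_benchmarks` is hypothetical, as is its default of one run when no suffix is given.

```python
def parse_benchmarks(spec: str) -> dict:
    """Split 'name:N,name:N' into {name: N}.

    Hypothetical helper; the meaning of N (repeated runs) and the default
    of 1 when the suffix is missing are assumptions, not Nemo-Skills behavior.
    """
    result = {}
    for item in spec.split(","):
        # partition() returns an empty suffix when there is no ':' in the item
        name, _, repeats = item.strip().partition(":")
        result[name] = int(repeats) if repeats else 1
    return result


print(parse_benchmarks("gpqa:4,hle:4"))
```

A check like this can help confirm how many generations to expect in the output directory before launching a long job.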

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=0.6 ++inference.top_p=0.95 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32 --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
    with_sandbox=True,
)
```
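The calls on this page repeat most of their keyword arguments and differ only in a few settings. As a rough sketch, the shared arguments could be kept in one dictionary and merged with per-run differences. The names `BASE_KWARGS` and `make_run` are hypothetical and only illustrate the pattern; every value is copied from the examples on this page.

```python
# Keyword arguments common to the eval(...) calls on this page.
BASE_KWARGS = {
    "cluster": "slurm",
    "server_type": "vllm",
    "server_gpus": 1,
    "server_args": "--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    "output_dir": "/workspace/Nano_V3_evals",
}


def make_run(**overrides) -> dict:
    """Return a copy of BASE_KWARGS with per-run overrides applied.

    Hypothetical helper for illustration; not part of Nemo-Skills.
    """
    return {**BASE_KWARGS, **overrides}


# Plain MCQ run: only the benchmark is added.
mcq_run = make_run(benchmarks="gpqa:4")

# Tool-use run: extends the server flags and enables the sandbox.
tool_run = make_run(
    benchmarks="gpqa:4",
    server_args=BASE_KWARGS["server_args"]
    + " --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    with_sandbox=True,
)

# Show which keys the tool-use run changes or adds relative to the MCQ run.
print(sorted(k for k in tool_run if tool_run[k] != mcq_run.get(k)))
```

Each dictionary could then be unpacked into the call in the same shape as the examples, for instance `eval(ctx=wrap_arguments("..."), **tool_run)`, with the `ctx` string chosen per run.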