Scientific Knowledge - Nemo-Skills
Nemo-Skills can be used to evaluate an LLM on various STEM datasets.
Dataset Overview
| Dataset | Questions | Types | Domain | Images? | NS default |
|---|---|---|---|---|---|
| HLE | 2,500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only |
| HLE-Verified | 2,500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | gold+revision text only |
| GPQA | 448 (main), 198 (diamond), 546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| SuperGPQA | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test |
| MMLU-Pro | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| SciCode | 80 (338 subtasks) | Code gen | Scientific computing | No | test+val |
| FrontierScience | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| Physics | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
| UGPhysics | 5,520 (EN), 5,520 (ZH) | Open-ended, MCQ | Physics | No | EN |
| MMLU | 14,042 | MCQ (4) | Multiple Subjects | No | test |
| MMLU-Redux | 5,385 | MCQ (4) | Multiple Subjects | No | test |
| SimpleQA | 4,326 (test), 1,000 (verified) | Open ended | Factuality, Parametric knowledge | No | verified |
Evaluate NVIDIA-Nemotron-3-Nano on an MCQ dataset
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
)
```
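As a rough way to estimate the size of a run, the sketch below splits a `benchmarks` string such as `"gpqa:4"` into a name and a repeat count, then multiplies by the question count from the dataset table. It assumes the `:4` suffix means four repeated generations per question, and it treats a missing suffix as a single run; both are assumptions of this sketch. `parse_benchmarks` and `total_generations` are hypothetical helpers, not part of Nemo-Skills.

```python
def parse_benchmarks(spec: str) -> list[tuple[str, int]]:
    """Split a spec such as "gpqa:4,hle:4" into (name, repeats) pairs.

    Assumption: the number after ':' is a repeat count. A missing suffix
    is treated as 1 here purely for illustration.
    """
    parsed = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, repeats = item.partition(":")
        parsed.append((name, int(repeats) if repeats else 1))
    return parsed


# Question count for the GPQA diamond split, taken from the dataset table.
GPQA_DIAMOND_QUESTIONS = 198


def total_generations(num_questions: int, repeats: int) -> int:
    """Number of model generations needed for one benchmark."""
    return num_questions * repeats


if __name__ == "__main__":
    for name, repeats in parse_benchmarks("gpqa:4"):
        print(name, repeats, total_generations(GPQA_DIAMOND_QUESTIONS, repeats))
```

Under these assumptions, `gpqa:4` on the diamond split corresponds to 198 × 4 generations, each of which may use up to the configured `tokens_to_generate`.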
Evaluate NVIDIA-Nemotron-3-Nano using LLM-as-a-judge
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="hle:4",
    output_dir="/workspace/Nano_V3_evals",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
    extra_judge_args="++chat_template_kwargs.reasoning_effort=high ++inference.temperature=1.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=120000 ",
)
```
To let the model call a Python tool during evaluation, add the tool module to the generation arguments, enable tool calling on the server, and start the sandbox:

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=0.6 ++inference.top_p=0.95 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32 --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
    with_sandbox=True,
)
```
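The calls on this page pass generation settings as long strings of `++key=value` overrides, which are easy to mistype. A minimal sketch of one way to build such a string from a dictionary is shown below. `build_overrides` and `format_value` are hypothetical helpers written for illustration, not part of Nemo-Skills, and rendering booleans in lowercase is a choice of this sketch (the examples on this page use both `true` and `True`).

```python
def format_value(value) -> str:
    """Render a Python value as override text (booleans in lowercase)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_overrides(settings: dict) -> str:
    """Join settings into a space-separated '++key=value' string."""
    return " ".join(
        f"++{key}={format_value(value)}" for key, value in settings.items()
    )


if __name__ == "__main__":
    # Values copied from the judge settings in the LLM-as-a-judge example.
    judge_settings = {
        "chat_template_kwargs.reasoning_effort": "high",
        "inference.temperature": 1.0,
        "inference.top_p": 1.0,
        "inference.tokens_to_generate": 120000,
    }
    print(build_overrides(judge_settings))
```

The resulting string could then be passed wherever the examples use a literal, for instance as `extra_judge_args` or as the argument to `wrap_arguments`.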