Vision-Language Models (VLM) - Nemo-Skills (original) (raw)
This section details how to evaluate Vision-Language Model (VLM) benchmarks that require both text and image understanding.
VLM-specific features¶
VLM evaluation uses the standard vllm server type with multimodal support:
- Automatically converts local image paths to base64 data URLs
- Supports HTTP/HTTPS image URLs and pre-encoded base64 data URLs
- Works seamlessly with any vLLM-supported VLM model
Prompt configuration¶
VLM prompts support two additional fields in the prompt config YAML:
[](#%5F%5Fcodelineno-0-1)image_field: image_path # Field name in the input data containing the image path [](#%5F%5Fcodelineno-0-2)image_position: before # "before" or "after" - where to place image relative to text
For example, the MMMU-Pro prompt config:
[](#%5F%5Fcodelineno-1-1)image_field: image_path [](#%5F%5Fcodelineno-1-2)image_position: before [](#%5F%5Fcodelineno-1-3) [](#%5F%5Fcodelineno-1-4)user: |- [](#%5F%5Fcodelineno-1-5) Answer the following multiple choice question. The last line of your response should be in the following format: 'Answer: A/B/C/D/E/F/G/H/I/J' (e.g. 'Answer: A'). [](#%5F%5Fcodelineno-1-6) [](#%5F%5Fcodelineno-1-7) {problem}
Image path resolution¶
The image_path field in input data supports multiple formats:
| Format | Example | Behavior |
|---|---|---|
| Relative path | images/test.png | Resolved relative to input JSONL directory |
| Absolute path | /data/images/test.png | Used directly |
| HTTP URL | https://example.com/img.png | Passed through to vLLM |
| Data URL | data:image/png;base64,... | Passed through to vLLM |
Supported benchmarks¶
mmmu-pro¶
MMMU-Pro is a robust multi-discipline multimodal understanding benchmark from the MMMU team. It evaluates VLMs on expert-level tasks across various academic disciplines using the "vision" configuration where images are critical for problem-solving.
- Benchmark is defined in nemo_skills/dataset/mmmu-pro/__init__.py
- Original benchmark source is here.
- Evaluation follows AAI methodology for 10-choice MCQ.
Preparing data¶
VLM benchmarks require image files which need to be downloaded separately:
[](#%5F%5Fcodelineno-2-1)ns prepare_data mmmu-pro --data_dir=/workspace/ns-data --cluster=<cluster>
Running evaluation¶
Instruction-following VLMsReasoning VLMs
For standard instruction-following VLMs (e.g., Qwen3-VL-4B-Instruct):
[](#%5F%5Fcodelineno-3-1)from nemo_skills.pipeline.cli import wrap_arguments, eval [](#%5F%5Fcodelineno-3-2) [](#%5F%5Fcodelineno-3-3)eval( [](#%5F%5Fcodelineno-3-4) ctx=wrap_arguments("++inference.temperature=0 ++inference.tokens_to_generate=16384"), [](#%5F%5Fcodelineno-3-5) cluster="slurm", [](#%5F%5Fcodelineno-3-6) output_dir="/workspace/mmmu-pro-eval", [](#%5F%5Fcodelineno-3-7) server_type="vllm", [](#%5F%5Fcodelineno-3-8) server_gpus=1, [](#%5F%5Fcodelineno-3-9) model="Qwen/Qwen3-VL-4B-Instruct", [](#%5F%5Fcodelineno-3-10) benchmarks="mmmu-pro", [](#%5F%5Fcodelineno-3-11) data_dir="/workspace/ns-data", [](#%5F%5Fcodelineno-3-12))
Alternative: Command-line usage
[](#%5F%5Fcodelineno-4-1)ns eval \ [](#%5F%5Fcodelineno-4-2) --cluster=slurm \ [](#%5F%5Fcodelineno-4-3) --output_dir=/workspace/mmmu-pro-eval \ [](#%5F%5Fcodelineno-4-4) --server_type=vllm \ [](#%5F%5Fcodelineno-4-5) --server_gpus=1 \ [](#%5F%5Fcodelineno-4-6) --model=Qwen/Qwen3-VL-4B-Instruct \ [](#%5F%5Fcodelineno-4-7) --benchmarks=mmmu-pro \ [](#%5F%5Fcodelineno-4-8) --data_dir=/workspace/ns-data \ [](#%5F%5Fcodelineno-4-9) "++inference.temperature=0" \ [](#%5F%5Fcodelineno-4-10) "++inference.tokens_to_generate=16384"
For reasoning-enhanced VLMs (e.g., Qwen3-VL-30B-A3B-Thinking):
[](#%5F%5Fcodelineno-5-1)from nemo_skills.pipeline.cli import wrap_arguments, eval [](#%5F%5Fcodelineno-5-2) [](#%5F%5Fcodelineno-5-3)eval( [](#%5F%5Fcodelineno-5-4) ctx=wrap_arguments("++inference.temperature=0.7 ++inference.tokens_to_generate=131072"), [](#%5F%5Fcodelineno-5-5) cluster="slurm", [](#%5F%5Fcodelineno-5-6) output_dir="/workspace/mmmu-pro-eval", [](#%5F%5Fcodelineno-5-7) server_type="vllm", [](#%5F%5Fcodelineno-5-8) server_gpus=8, [](#%5F%5Fcodelineno-5-9) model="/hf_models/Qwen3-VL-30B-A3B-Thinking", [](#%5F%5Fcodelineno-5-10) benchmarks="mmmu-pro", [](#%5F%5Fcodelineno-5-11) data_dir="/workspace/ns-data", [](#%5F%5Fcodelineno-5-12))
Alternative: Command-line usage
[](#%5F%5Fcodelineno-6-1)ns eval \ [](#%5F%5Fcodelineno-6-2) --cluster=slurm \ [](#%5F%5Fcodelineno-6-3) --output_dir=/workspace/mmmu-pro-eval \ [](#%5F%5Fcodelineno-6-4) --server_type=vllm \ [](#%5F%5Fcodelineno-6-5) --server_gpus=8 \ [](#%5F%5Fcodelineno-6-6) --model=/hf_models/Qwen3-VL-30B-A3B-Thinking \ [](#%5F%5Fcodelineno-6-7) --benchmarks=mmmu-pro \ [](#%5F%5Fcodelineno-6-8) --data_dir=/workspace/ns-data \ [](#%5F%5Fcodelineno-6-9) "++inference.temperature=0.7" \ [](#%5F%5Fcodelineno-6-10) "++inference.tokens_to_generate=131072"
vLLM configuration tips¶
Based on vLLM VLM documentation:
- For image-only inference, add
--limit-mm-per-prompt.video 0to save memory - Set
--max-model-len 128000for most use cases (default 262K consumes more memory) - Use
--async-schedulingfor better performance
These can be passed via server_args:
[](#%5F%5Fcodelineno-7-1)eval( [](#%5F%5Fcodelineno-7-2) server_args="--limit-mm-per-prompt.video 0 --max-model-len 128000 --async-scheduling", [](#%5F%5Fcodelineno-7-3) ... [](#%5F%5Fcodelineno-7-4))