Vision-Language Models (VLM) - Nemo-Skills (original) (raw)

This section details how to evaluate Vision-Language Model (VLM) benchmarks that require both text and image understanding.

VLM-specific features¶

VLM evaluation uses the standard vllm server type with multimodal support:

Automatically converts local image paths to base64 data URLs
Supports HTTP/HTTPS image URLs and pre-encoded base64 data URLs
Works seamlessly with any vLLM-supported VLM model

Prompt configuration¶

VLM prompts support two additional fields in the prompt config YAML:

[](#%5F%5Fcodelineno-0-1)image_field: image_path # Field name in the input data containing the image path [](#%5F%5Fcodelineno-0-2)image_position: before # "before" or "after" - where to place image relative to text

For example, the MMMU-Pro prompt config:

[](#%5F%5Fcodelineno-1-1)image_field: image_path [](#%5F%5Fcodelineno-1-2)image_position: before [](#%5F%5Fcodelineno-1-3) [](#%5F%5Fcodelineno-1-4)user: |- [](#%5F%5Fcodelineno-1-5) Answer the following multiple choice question. The last line of your response should be in the following format: 'Answer: A/B/C/D/E/F/G/H/I/J' (e.g. 'Answer: A'). [](#%5F%5Fcodelineno-1-6) [](#%5F%5Fcodelineno-1-7) {problem}

Image path resolution¶

The image_path field in input data supports multiple formats:

Format	Example	Behavior
Relative path	images/test.png	Resolved relative to input JSONL directory
Absolute path	/data/images/test.png	Used directly
HTTP URL	https://example.com/img.png	Passed through to vLLM
Data URL	data:image/png;base64,...	Passed through to vLLM

Supported benchmarks¶

mmmu-pro¶

MMMU-Pro is a robust multi-discipline multimodal understanding benchmark from the MMMU team. It evaluates VLMs on expert-level tasks across various academic disciplines using the "vision" configuration where images are critical for problem-solving.

Benchmark is defined in nemo_skills/dataset/mmmu-pro/__init__.py
Original benchmark source is here.
Evaluation follows AAI methodology for 10-choice MCQ.

Preparing data¶

VLM benchmarks require image files which need to be downloaded separately:

[](#%5F%5Fcodelineno-2-1)ns prepare_data mmmu-pro --data_dir=/workspace/ns-data --cluster=<cluster>

Running evaluation¶

Instruction-following VLMsReasoning VLMs

For standard instruction-following VLMs (e.g., Qwen3-VL-4B-Instruct):

[](#%5F%5Fcodelineno-3-1)from nemo_skills.pipeline.cli import wrap_arguments, eval [](#%5F%5Fcodelineno-3-2) [](#%5F%5Fcodelineno-3-3)eval( [](#%5F%5Fcodelineno-3-4) ctx=wrap_arguments("++inference.temperature=0 ++inference.tokens_to_generate=16384"), [](#%5F%5Fcodelineno-3-5) cluster="slurm", [](#%5F%5Fcodelineno-3-6) output_dir="/workspace/mmmu-pro-eval", [](#%5F%5Fcodelineno-3-7) server_type="vllm", [](#%5F%5Fcodelineno-3-8) server_gpus=1, [](#%5F%5Fcodelineno-3-9) model="Qwen/Qwen3-VL-4B-Instruct", [](#%5F%5Fcodelineno-3-10) benchmarks="mmmu-pro", [](#%5F%5Fcodelineno-3-11) data_dir="/workspace/ns-data", [](#%5F%5Fcodelineno-3-12))

Alternative: Command-line usage

[](#%5F%5Fcodelineno-4-1)ns eval \ [](#%5F%5Fcodelineno-4-2) --cluster=slurm \ [](#%5F%5Fcodelineno-4-3) --output_dir=/workspace/mmmu-pro-eval \ [](#%5F%5Fcodelineno-4-4) --server_type=vllm \ [](#%5F%5Fcodelineno-4-5) --server_gpus=1 \ [](#%5F%5Fcodelineno-4-6) --model=Qwen/Qwen3-VL-4B-Instruct \ [](#%5F%5Fcodelineno-4-7) --benchmarks=mmmu-pro \ [](#%5F%5Fcodelineno-4-8) --data_dir=/workspace/ns-data \ [](#%5F%5Fcodelineno-4-9) "++inference.temperature=0" \ [](#%5F%5Fcodelineno-4-10) "++inference.tokens_to_generate=16384"

For reasoning-enhanced VLMs (e.g., Qwen3-VL-30B-A3B-Thinking):

[](#%5F%5Fcodelineno-5-1)from nemo_skills.pipeline.cli import wrap_arguments, eval [](#%5F%5Fcodelineno-5-2) [](#%5F%5Fcodelineno-5-3)eval( [](#%5F%5Fcodelineno-5-4) ctx=wrap_arguments("++inference.temperature=0.7 ++inference.tokens_to_generate=131072"), [](#%5F%5Fcodelineno-5-5) cluster="slurm", [](#%5F%5Fcodelineno-5-6) output_dir="/workspace/mmmu-pro-eval", [](#%5F%5Fcodelineno-5-7) server_type="vllm", [](#%5F%5Fcodelineno-5-8) server_gpus=8, [](#%5F%5Fcodelineno-5-9) model="/hf_models/Qwen3-VL-30B-A3B-Thinking", [](#%5F%5Fcodelineno-5-10) benchmarks="mmmu-pro", [](#%5F%5Fcodelineno-5-11) data_dir="/workspace/ns-data", [](#%5F%5Fcodelineno-5-12))

Alternative: Command-line usage

[](#%5F%5Fcodelineno-6-1)ns eval \ [](#%5F%5Fcodelineno-6-2) --cluster=slurm \ [](#%5F%5Fcodelineno-6-3) --output_dir=/workspace/mmmu-pro-eval \ [](#%5F%5Fcodelineno-6-4) --server_type=vllm \ [](#%5F%5Fcodelineno-6-5) --server_gpus=8 \ [](#%5F%5Fcodelineno-6-6) --model=/hf_models/Qwen3-VL-30B-A3B-Thinking \ [](#%5F%5Fcodelineno-6-7) --benchmarks=mmmu-pro \ [](#%5F%5Fcodelineno-6-8) --data_dir=/workspace/ns-data \ [](#%5F%5Fcodelineno-6-9) "++inference.temperature=0.7" \ [](#%5F%5Fcodelineno-6-10) "++inference.tokens_to_generate=131072"

vLLM configuration tips¶

Based on vLLM VLM documentation:

For image-only inference, add --limit-mm-per-prompt.video 0 to save memory
Set --max-model-len 128000 for most use cases (default 262K consumes more memory)
Use --async-scheduling for better performance

These can be passed via server_args:

[](#%5F%5Fcodelineno-7-1)eval( [](#%5F%5Fcodelineno-7-2) server_args="--limit-mm-per-prompt.video 0 --max-model-len 128000 --async-scheduling", [](#%5F%5Fcodelineno-7-3) ... [](#%5F%5Fcodelineno-7-4))