ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

Why a new benchmark?

Since the release of the original ViDoRe Benchmark, which evaluates vision models on document retrieval tasks, visual retrieval models have advanced significantly! While the original ColPali model reported an average score of 81.3 nDCG@5, current SOTA models on the leaderboard surpass 90 nDCG@5, and some tasks have become “too easy” to yield a meaningful signal.

With the benchmark approaching saturation for SOTA models, there is limited room to truly measure improvements and understand model capabilities in realistic scenarios. To continue pushing the boundaries of visual retrieval, it became essential to introduce a new benchmark designed specifically to challenge these advanced models: ViDoRe Benchmark V2.

Motivating the Creation of ViDoRe Benchmark V2

In developing ViDoRe Benchmark V2, our main goal was to create a benchmark reflective of real-world retrieval challenges: difficult, diverse, and meaningful. Current benchmarks exhibit limitations that prevent them from accurately reflecting real user behavior and complex retrieval scenarios. We identified three critical issues in existing benchmarks:

  1. Extractive Nature of Queries: Current benchmarks typically rely on extractive queries, which yields unrealistic retrieval contexts, since real users rarely formulate queries using exact phrases from documents.
  2. Single-Page Query Bias: Many benchmarks overly emphasize retrieval from single-page contexts, neglecting the complex, multi-document or cross-document queries common in real-world applications.
  3. Challenges in Synthetic Query Generation: Purely synthetic benchmarks, while appealing in theory, are difficult to implement effectively without extensive manual oversight. They often produce outlier, irrelevant, or trivial queries, making human filtering essential yet costly.

Design Decisions and Techniques Used

To address these challenges and create a robust, realistic benchmark, ViDoRe Benchmark V2 includes several innovative features:

Dataset Selection for ViDoRe Benchmark V2

The selected datasets for ViDoRe Benchmark V2 are diverse, publicly available, and challenging. Each dataset presents distinct visual complexity and is suitable for realistic retrieval tasks, including multilingual versions with queries translated into French, English, Spanish, and German. This multilingual approach further extends the applicability and challenge level of the benchmark.

| Dataset Name | Original Version | Multilingual Version | Original Doc Lang | Query Lang | # Docs | # Queries | # Pages | # Qrels | Avg. Pages/Query | Comments |
|---|---|---|---|---|---|---|---|---|---|---|
| Insurance | vidore/synthetic_insurance_filtered_v1.0 | vidore/synthetic_insurance_filtered_v1.0_multilingual | French | French | 4 | 18 | 260 | 86 | 4.7 | Small but challenging, multi-document |
| MIT Tissue Interaction | vidore/synthetic_mit_biomedical_tissue_interactions_unfiltered | vidore/synthetic_mit_biomedical_tissue_interactions_unfiltered_multilingual | English | English | 27 | 160 | 1016 | 515 | 3.2 | Largest dataset, most extractive |
| World Economic Reports | vidore/synthetic_economics_macro_economy_2024_filtered_v1.0 | vidore/synthetic_economics_macro_economy_2024_filtered_v1.0_multilingual | English | English | 4 | 18 | 260 | 86 | 4.7 | Cross-document queries, high complexity |
| ESG Reports | vidore/synthetic_rse_restaurant_filtered_v1.0 | vidore/synthetic_rse_restaurant_filtered_v1.0_multilingual | English | French | 30 | 57 | 1538 | 222 | 3.9 | Natively cross-lingual, industry-specific |
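To inspect one of these datasets locally, you can pull it from the Hugging Face Hub. Below is a minimal sketch assuming the datasets follow a BEIR-style layout with separate corpus, queries, and qrels subsets; the config and split names are assumptions here, so check each dataset card for the exact ones.

    # pip install datasets
    from datasets import load_dataset

    repo = "vidore/synthetic_rse_restaurant_filtered_v1.0"

    # BEIR-style layout assumed: corpus (page images), queries, and qrels
    # (relevance judgments). Config and split names below are illustrative.
    corpus = load_dataset(repo, "corpus", split="test")
    queries = load_dataset(repo, "queries", split="test")
    qrels = load_dataset(repo, "qrels", split="test")

    print(corpus)        # page-level documents, typically with an image field
    print(queries[0])    # one retrieval query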

Evaluating Models

To evaluate models on ViDoRe Benchmark V2, there are two options:

Option 1: Using the CLI

Here is a CLI example using a ColPali-type retriever on ViDoRe Benchmark V2. For other retrievers, please refer to the vidore-benchmark repository.

    vidore-benchmark evaluate-retriever \
        --model-class colpali \
        --model-name vidore/colpali-v1.3 \
        --collection-name vidore/vidore-benchmark-v2-dev-67ae03e3924e85b36e7f53b0 \
        --dataset-format beir \
        --split test
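
This assumes the vidore-benchmark Python package is installed (for example via pip install vidore-benchmark; see the repository README for any extras required by a given retriever class). The command runs retrieval over every dataset in the collection and reports metrics, including the nDCG@5 scores shown below.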

Option 2: Creating a custom retriever

Detailed instructions on how to do that are available in the vidore-benchmark repository.
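
To give a rough idea of what implementing a custom retriever involves, here is a minimal bi-encoder-style sketch. The method names (forward_queries, forward_passages, get_scores) mirror the retriever interface in the vidore-benchmark repository as we understand it, but treat them as assumptions and defer to the repository's base class for the authoritative signatures.

    from typing import List

    import torch
    from PIL import Image

    class DummyVisionRetriever:
        """Illustrative only; the real base class lives in vidore-benchmark
        and should be subclassed instead of this stand-in."""

        use_visual_embedding = True  # passages are page images, not OCR text

        def forward_queries(self, queries: List[str], batch_size: int, **kwargs) -> torch.Tensor:
            # Embed text queries; random vectors stand in for a real model.
            return torch.randn(len(queries), 128)

        def forward_passages(self, passages: List[Image.Image], batch_size: int, **kwargs) -> torch.Tensor:
            # Embed page images; a real model would encode the pixels.
            return torch.randn(len(passages), 128)

        def get_scores(self, query_embeddings: torch.Tensor, passage_embeddings: torch.Tensor, batch_size=None) -> torch.Tensor:
            # Dense dot-product similarity, shape (n_queries, n_passages).
            return query_embeddings @ passage_embeddings.T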

Results

Here are, for example, some nDCG@5 results of visual retrieval models on ViDoRe Benchmark V2:

| Dataset | voyageai | metrics-colqwen2.5-3B | colsmolvlm-v0.1 | colqwen2-v1.0 | colpali-v1.2 | dse-qwen2-2b-mrl-v1 | colSmol-256M | colpali-v1.3 | colqwen2.5-v0.2 | dse-llamaindex | tsystems-colqwen2.5-3b-multilingual-v1.0 | gme-qwen2-VL-7B | visrag-ret | colSmol-500M | colpali-v1.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| restaurant_esg_reports_beir | 0.561 | 0.645 | 0.624 | 0.622 | 0.321 | 0.614 | 0.460 | 0.511 | 0.684 | 0.631 | 0.721 | 0.658 | 0.537 | 0.522 | 0.465 |
| insurance | 0.641 | 0.579 | 0.555 | 0.651 | 0.560 | 0.655 | 0.504 | 0.598 | 0.603 | 0.688 | 0.693 | 0.607 | 0.505 | 0.587 | 0.547 |
| insurance_multilingual | 0.595 | 0.557 | 0.432 | 0.572 | 0.458 | 0.563 | 0.341 | 0.501 | 0.532 | 0.610 | 0.600 | 0.554 | 0.452 | 0.377 | 0.484 |
| synthetic_economics_macro_economy_2024 | 0.588 | 0.566 | 0.609 | 0.615 | 0.531 | 0.615 | 0.534 | 0.516 | 0.598 | 0.612 | 0.548 | 0.629 | 0.596 | 0.503 | 0.567 |
| synthetic_mit_biomedical_tissue_interactions | 0.564 | 0.639 | 0.581 | 0.618 | 0.585 | 0.592 | 0.532 | 0.597 | 0.636 | 0.606 | 0.653 | 0.640 | 0.548 | 0.543 | 0.564 |
| synthetic_mit_biomedical_tissue_interactions_multilingual | 0.515 | 0.569 | 0.505 | 0.565 | 0.557 | 0.551 | 0.340 | 0.565 | 0.611 | 0.569 | 0.617 | 0.551 | 0.477 | 0.421 | 0.507 |
| synthetic_rse_restaurant | 0.472 | 0.496 | 0.511 | 0.534 | 0.519 | 0.549 | 0.272 | 0.570 | 0.574 | 0.503 | 0.517 | 0.543 | 0.459 | 0.392 | 0.461 |
| synthetic_rse_restaurant_multilingual | 0.462 | 0.492 | 0.476 | 0.542 | 0.540 | 0.557 | 0.313 | 0.557 | 0.574 | 0.512 | 0.533 | 0.567 | 0.464 | 0.391 | 0.481 |
| synthetics_economics_macro_economy_2024_multilingual | 0.550 | 0.535 | 0.474 | 0.532 | 0.479 | 0.528 | 0.273 | 0.499 | 0.565 | 0.528 | 0.512 | 0.562 | 0.487 | 0.361 | 0.438 |
| Average | 0.550 | 0.564 | 0.530 | 0.583 | 0.505 | 0.580 | 0.397 | 0.546 | 0.597 | 0.584 | 0.599 | 0.590 | 0.503 | 0.455 | 0.502 |
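
All numbers above are nDCG@5. As a refresher, here is a minimal, self-contained computation of the metric for a single query (this is the standard definition, not the benchmark's exact evaluation code):

    import math

    def ndcg_at_k(ranked_doc_ids, relevance, k=5):
        """nDCG@k for one query: DCG of the top-k ranking divided by the
        DCG of an ideal, relevance-sorted ranking."""
        dcg = sum(
            relevance.get(doc_id, 0) / math.log2(rank + 2)  # ranks are 0-indexed
            for rank, doc_id in enumerate(ranked_doc_ids[:k])
        )
        ideal = sum(
            rel / math.log2(rank + 2)
            for rank, rel in enumerate(sorted(relevance.values(), reverse=True)[:k])
        )
        return dcg / ideal if ideal > 0 else 0.0

    # Two relevant pages, one retrieved at rank 3 of the top 5.
    print(ndcg_at_k(["p3", "p7", "p1", "p9", "p2"], {"p1": 1, "p4": 1}))  # ~0.307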

Notes on the benchmark:
We adapted the evaluation procedure for the voyageAI API, resulting in slightly lower performance on the ViDoRe benchmark v1 compared to the values reported by voyageAI. This discrepancy likely arises from our resizing of input images to a maximum image height of 1200 pixels to facilitate efficient benchmarking, a preprocessing step presumably not applied in voyageAI's original benchmarking setup.
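
For reference, the resizing step we applied is conceptually the following (a sketch; the exact implementation in our pipeline may differ):

    from PIL import Image

    MAX_HEIGHT = 1200  # cap used in our benchmarking setup

    def resize_page(image: Image.Image) -> Image.Image:
        """Downscale a page image to at most MAX_HEIGHT pixels tall,
        preserving aspect ratio; smaller images are returned unchanged."""
        if image.height <= MAX_HEIGHT:
            return image
        scale = MAX_HEIGHT / image.height
        return image.resize((round(image.width * scale), MAX_HEIGHT), Image.LANCZOS)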

The best models so far appear to be based on Qwen2.5. Be careful, however: these models are not released under an open license.

Insights on the Results

[Two figures summarizing insights from the ViDoRe V2 benchmark results.]

Our goal is for ViDoRe V2 to become a dynamic, "living benchmark" that regularly grows with new tasks and datasets. To achieve this, we welcome and encourage the community to contribute datasets and evaluation tasks. This collaborative approach helps ensure that the benchmark stays relevant, useful, and reflective of real-world challenges.

Note

Since the initial release, the insurance dataset has been removed from the benchmark for copyright reasons.

Cite

@misc{macé2025vidorebenchmarkv2raising,
      title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval}, 
      author={Quentin Macé and António Loison and Manuel Faysse},
      year={2025},
      eprint={2505.17166},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.17166}, 
}

Acknowledgements

For professionals interested in deeper discussions and projects around Visual RAG, ColPali, or agentic systems, don't hesitate to reach out at contact@illuin.tech to get in touch with our team of experts at Illuin Technology, who can help accelerate your AI efforts!
We look forward to your feedback and contributions! If you have any sets of documents and associated queries that you would find interesting or challenging for a retrieval task, feel free to shoot us an email!
We look forward to your feedback and contributions! If you have any sets of documents and associated queries that you would find interesting / challenging for a retrieval task feel free to shoot us a mail!