TIGER-Lab/MMEB-eval · Datasets at Hugging Face

Massive Multimodal Embedding Benchmark

We compile a large set of evaluation tasks to understand the capabilities of multimodal embedding models. The benchmark covers 4 meta-tasks and 36 meticulously selected evaluation datasets.

The dataset is published in our paper VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks.

Dataset Usage

For each dataset, we provide 1,000 examples for evaluation. Each example contains a query and a set of candidate targets; both the query and the targets can be any combination of image and text. The first item in the candidate list is the ground-truth target.
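Since the ground truth always sits at index 0 of the candidate list, retrieval accuracy can be computed by checking whether the highest-scoring candidate is index 0. The sketch below illustrates this convention with toy embedding vectors and plain cosine similarity; the vectors and function names are illustrative assumptions, not part of the dataset's API.

```python
# Sketch of MMEB-eval-style scoring: the first candidate in each
# example's target list is the ground truth, so a retrieval is a
# "hit" when the highest-scoring candidate has index 0.
# The embeddings below are toy vectors, not real model outputs.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def hit_at_1(query_emb, candidate_embs):
    """True if the ground-truth candidate (index 0) scores highest."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__) == 0

# Toy example: one query, three candidates; index 0 is ground truth.
query = [1.0, 0.0, 0.2]
candidates = [
    [0.9, 0.1, 0.2],   # ground-truth target (index 0)
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.5],
]
print(hit_at_1(query, candidates))  # True
```

Averaging `hit_at_1` over all 1,000 examples of a dataset yields its precision@1, which is the style of score reported in the per-dataset results below.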

Statistics

We show the statistics of all the datasets as follows:

Per-dataset Results

We list the performance of different embedding models as follows:

Submission

We will set up a formal leaderboard soon. If you want your results added to the leaderboard, please email us at ziyanjiang528@gmail.com.

Cite Us

@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
