TIGER-AI-Lab/VLM2Vec — code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]

VLM2Vec-V2: Unified Multimodal Embedding for Videos, Images, and Documents

This repository contains the official code and data for VLM2Vec-V2, a unified framework for learning powerful multimodal embeddings across diverse visual formats including images, videos, and visual documents.

Our work introduces MMEB-V2, a comprehensive benchmark with 78 tasks designed to systematically evaluate embedding models across these modalities. VLM2Vec-V2 sets a new state-of-the-art, outperforming strong baselines across all categories.

This is an open-source project, and we welcome contributions from the community. We are particularly interested in new functionality, support for new datasets, bug fixes, and documentation improvements. Please feel free to open an issue to discuss your ideas or submit a pull request!

📌 Please see our CHANGELOG for the latest features and bug fixes!

🚨 Major V2 Update Alert (June 2025) 🚨

This repository has been updated to V2, which is a complete overhaul of the codebase. The previous VLM2Vec code has been archived and can be found in the v1 branch.

Warning: Please back up any local work before proceeding. If you have a local clone from before this update, you must reset your main branch to sync with the new code.

For detailed instructions, please see the "How to Upgrade to V2" section below.

Your feedback on this transition process is highly appreciated. If you run into any problems, please let us know by opening an issue.


🔥 News

Key Updates

How to Upgrade to V2

  1. Back Up Your Local Changes (Critical!) The update process will discard any uncommitted changes on your local main branch. If you have work you want to save, commit it to a new branch or use git stash.
  2. Reset Your Local Repository to V2. Run the following commands to fetch the new main branch and reset your local copy to match it.

```shell
# Make sure you are on your main branch first
git checkout main

# Fetch all recent updates from the remote and remove stale branch references
git fetch --all --prune

# Force your local main branch to match the new remote main branch
git reset --hard origin/main
```

Model

VLM2Vec-V2 fine-tunes a state-of-the-art Vision-Language Model (VLM) using instruction-guided contrastive training. The model learns to produce a single, powerful fixed-dimensional embedding for any combination of text, image, video, and document inputs.

For current V2 models, we use Qwen2-VL as the model backbone, which capably handles interleaved sequences of text and visuals, variable resolutions, and long-form inputs like videos and visual documents.
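The exact training objective lives in the codebase, but the core idea of instruction-guided contrastive training is an in-batch InfoNCE loss: each query embedding is pulled toward its paired target and pushed away from the other targets in the batch. A minimal NumPy sketch (the function name, temperature, and toy data are illustrative, not taken from the repo):

```python
import numpy as np

def info_nce_loss(query_embs: np.ndarray, target_embs: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch InfoNCE: the positive for query i is target i;
    all other targets in the batch act as negatives."""
    # L2-normalize so dot products become cosine similarities
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    logits = q @ t.T / temperature                     # (B, B) similarity matrix
    # Row-wise log-softmax; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return float(-log_probs[idx, idx].mean())

# Toy batch: 4 query embeddings and paired targets close to them
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
targets = queries + 0.1 * rng.normal(size=(4, 8))
loss = info_nce_loss(queries, targets)
```

In the real model the query and target embeddings come from the VLM backbone conditioned on a task instruction; the sketch only shows the loss geometry.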

Released checkpoints

MMEB-V2 Benchmark

We introduce MMEB-V2, an expanded benchmark that includes 78 total datasets covering images, videos, and visual documents.

MMEB-V2 Overview

Data Download

Please refer to experiments/public/data/download_data.sh.

Our training process uses a curated dataset from three main sources: video-language data (LLaVA-Hound), visual document data (Vidore, VisRAG), and image-text data (MMEB-train). We use an interleaved sub-batching strategy for stable and effective contrastive learning.
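The repo's actual sub-batching logic is in the training code; as a rough, hypothetical illustration of the idea, each global batch can be composed of homogeneous sub-batches drawn round-robin from the different data sources, so every sub-batch stays single-source while training still cycles through all modalities:

```python
from itertools import cycle
from typing import Dict, List

def interleaved_batches(sources: Dict[str, List[str]],
                        sub_batch_size: int) -> List[List[str]]:
    """Emit sub-batches by cycling over data sources; each sub-batch
    contains examples from exactly one source."""
    # Split every source's examples into fixed-size sub-batches
    pools = {
        name: [data[i:i + sub_batch_size]
               for i in range(0, len(data), sub_batch_size)]
        for name, data in sources.items()
    }
    batches = []
    order = cycle(pools)  # round-robin over source names
    while any(pools.values()):
        name = next(order)
        if pools[name]:
            batches.append(pools[name].pop(0))
    return batches

sources = {
    "video": [f"vid_{i}" for i in range(4)],
    "document": [f"doc_{i}" for i in range(4)],
    "image": [f"img_{i}" for i in range(4)],
}
batches = interleaved_batches(sources, sub_batch_size=2)
```

Keeping each sub-batch single-source keeps in-batch negatives comparable, while the round-robin ordering ensures all sources contribute throughout training.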

How to run: please see examples in experiments/public/train.

Evaluation

DDP inference on multiple GPUs is supported. The evaluation pipeline is streamlined and a full benchmark run completes within hours.

How to run: please see examples in experiments/public/eval.

Heads-up for Reproducing Baseline Models

Citation

```bibtex
@article{jiang2024vlm2vec,
  title   = {VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author  = {Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal = {arXiv preprint arXiv:2410.05160},
  year    = {2024}
}
```

```bibtex
@article{meng2025vlm2vecv2,
  title   = {VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents},
  author  = {Meng, Rui and Jiang, Ziyan and Liu, Ye and Su, Mingyi and Yang, Xinyi and Fu, Yuepeng and Qin, Can and Chen, Zeyuan and Xu, Ran and Xiong, Caiming and Zhou, Yingbo and Chen, Wenhu and Yavuz, Semih},
  journal = {arXiv preprint arXiv:2507.04590},
  year    = {2025}
}
```
