LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

Abstract

The proposed LLaVE framework dynamically re-weights negative pairs by their discriminative difficulty during contrastive training, achieving state-of-the-art performance across a range of multimodal embedding tasks with strong scalability and efficiency.

Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap between the similarity distributions of positive and negative pairs, making it difficult to distinguish hard negative pairs effectively. To address this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
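The paper's exact loss is not reproduced on this page, but the core idea of weighting negative pairs by how hard they are to discriminate can be sketched in a few lines of PyTorch. In the sketch below, the softmax-based weighting of off-diagonal similarities and the beta hyperparameter are illustrative assumptions, not the formulation used in LLaVE:

import torch
import torch.nn.functional as F

def hardness_weighted_info_nce(query_emb, target_emb, temperature=0.05, beta=1.0):
    # InfoNCE-style loss in which harder negatives (higher similarity to the
    # query) contribute more to the denominator. The softmax-based weighting
    # and the beta hyperparameter are illustrative, not the paper's exact scheme.
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)

    sim = q @ t.T                                  # (B, B) cosine similarities
    logits = sim / temperature
    batch_size = q.size(0)

    # Off-diagonal entries are in-batch negatives.
    neg_mask = ~torch.eye(batch_size, dtype=torch.bool, device=q.device)

    # Hardness weights: softmax over each row's negatives, rescaled so that
    # beta = 0 recovers uniform weights (standard InfoNCE). Weights are
    # detached so they act as constants rather than extra gradient paths.
    neg_sim = sim.detach()[neg_mask].view(batch_size, -1)
    weights = torch.ones_like(logits)
    weights[neg_mask] = (torch.softmax(beta * neg_sim, dim=-1) * (batch_size - 1)).view(-1)

    # Weighted contrastive objective: positives sit on the diagonal.
    exp_logits = torch.exp(logits) * weights
    log_prob = logits.diagonal() - torch.log(exp_logits.sum(dim=-1))
    return -log_prob.mean()

In practice, the query and target embeddings would be the pooled representations produced by the LMM for the two sides of each training pair; setting beta to 0 in this sketch recovers the standard InfoNCE loss.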


Get this paper in your agent:

hf papers read 2503.04812

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (3)

zhibinlan/LLaVE-2B Image-Text-to-Text • 2B • Updated Mar 14, 2025 • 112 • 45

zhibinlan/LLaVE-0.5B Image-Text-to-Text • 0.9B • Updated Mar 14, 2025 • 24 • 7

zhibinlan/LLaVE-7B Image-Text-to-Text • 8B • Updated Mar 14, 2025 • 16 • 5

Collections including this paper (3)