MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Junjie Zhou1, Zheng Liu2,5, Ze Liu3, Shitao Xiao2, Yueze Wang2, Bo Zhao2,4
Chen Jason Zhang5, Defu Lian3, Yongping Xiong1
1 Beijing University of Posts and Telecommunications, 2 Beijing Academy of Artificial Intelligence,
3 University of Science and Technology of China, 4 Shanghai Jiaotong University
5 The Hong Kong Polytechnic University
zhoujunjie@bupt.edu.cn zhengliu1026@gmail.com

Abstract

Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70× more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. At this stage, we have produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.

1 Introduction

Multimodal retrieval is a critical research problem for IR and AI communities. It aims to satisfy people’s information needs across different data modalities, especially texts and images. Nowadays, multimodal retrieval has been applied to a wide variety of real-world scenarios, such as image search Chen et al. (2015); Wu et al. (2021); Zhang et al. (2024), visual question answering (VQA) Marino et al. (2019); Mathew et al. (2021), and retrieval-augmented generation (RAG) of vision language models Chen et al. (2022); Yu et al. (2024). Given the widespread application scenarios, it’s necessary to develop universal multimodal retrievers which can uniformly support any task requirements and working domains.

Progress on universal multimodal retrievers has been substantially advanced by pre-trained vision-language models such as CLIP Radford et al. (2021), ALIGN Jia et al. (2021), and SigLIP Zhai et al. (2023). These models are pre-trained to produce discriminative and unified representations for texts and images, thus creating a solid foundation for multimodal retrieval. However, existing vision-language encoders are mostly pre-trained on text-image matching tasks. Although these models have achieved an initial capability for text-to-image retrieval Young et al. (2014); Chen et al. (2015), they are insufficient for other common multimodal tasks, such as composed image retrieval Liu et al. (2021); Baldrati et al. (2023); Zhang et al. (2024) and multimodal document retrieval Chang et al. (2022); Liu et al. (2022).

To enhance the multi-task capacity, fine-tuning pre-trained models with comprehensive instructions, commonly known as instruction-tuning, has gained significant popularity. This approach was first applied in the supervised fine-tuning of large language models (LLMs) Ouyang et al. (2022); Wei et al. (2021); Chung et al. (2024), and later introduced for training text embeddings Su et al. (2022); Asai et al. (2022); Zhang et al. (2023); Xiao et al. (2024). Building on these successes, instruction-tuning has been further extended to multimodal embedding models Wei et al. (2024); Sharifymoghaddam et al. (2024), where pre-trained vision-language encoders are continually fine-tuned using a variety of multimodal retrieval instructions. Given the scarcity of instruction-tuning data for embedding models, researchers have proposed leveraging LLMs to generate synthetic data from Internet resources Wang et al. (2023). In the field of multimodal retrieval, a notable example is presented by MagicLens Zhang et al. (2024), which synthesizes open-ended search instructions for co-existing images within the same webpage.

Despite recent advancements by MagicLens, current data synthesis methods still face significant limitations in data scalability, quality, diversity, and availability. Specifically, only a small fraction of webpages on the internet contain multiple images (scalability), not to mention that many of these co-existing images are either unrelated or near-duplicates (quality). Besides, the remaining correlated images often exhibit monotonous relationships, such as different angles of the same object (diversity). Finally, large-scale instruction-tuning datasets for multimodal retrieval are typically held privately by individual research labs (availability).

In this paper, we introduce a novel data synthesis method called MegaPairs, accompanied by a large-scale instruction dataset generated using this approach. MegaPairs is distinguished by its construction of a heterogeneous KNN triplet for open-domain images. In particular, it leverages three different similarity models to sample correlated image pairs: the CLIP vision encoder for visual-semantic correlations Sun et al. (2023), the DINO vision encoder for visual-pattern correlations Oquab et al. (2024), and the CLIP text encoder for caption correlations. The sampled image pairs are presented to the VLM and LLM annotators, which generate comprehensive descriptions of the relationships between the two images and create pseudo-retrieval instructions based on the descriptions. This approach enables a huge number of instructions to be generated from a general dataset such as DataComp Gadre et al. (2024), which significantly improves the scalability of data synthesis. It also introduces diverse instructions of guaranteed quality, given its sampling of heterogeneous relationships from open-ended image corpora. Additionally, by utilizing open-source VLMs and LLMs (e.g., InternVL2-26B Chen et al. (2024b), Llama-3-8B Dubey et al. (2024)), the entire process can operate at a low cost.

We have produced 26 million data instances at this stage, achieving superior data quality compared to the existing datasets. In our pilot experiment, with just 500K sampled instances from MegaPairs, the same pre-trained model’s fine-tuning performance already surpasses that obtained with the entire 36.7M training instances from MagicLens, i.e., it delivers better results with 70× less training data. We further trained three multimodal retrievers of varying sizes, collectively named MMRet, on the whole synthetic dataset and performed comprehensive evaluations on a wide range of multimodal retrieval tasks. Remarkably, MMRet achieved state-of-the-art performance on 4 popular composed image retrieval (CIR) benchmarks and the 36 datasets provided by MMEB Jiang et al. (2024b) in the zero-shot setting. Furthermore, the models demonstrated substantial improvements and maintained leading positions after downstream fine-tuning. The entire suite of assets, including the dataset, the well-trained models, and the data production pipeline, will be made publicly available to advance future progress in this field.

Multimodal Retrieval.

Traditionally, retrieval tasks have focused on scenarios where queries and candidates each consist of a single modality, such as unimodal retrieval Thakur et al. (2021) and cross-modal retrieval Chen et al. (2015). However, there is a growing demand for multimodal retrieval tasks, where queries or candidates integrate both image and text modalities. These tasks have wide applications, including image retrieval with instructions Wu et al. (2021); Liu et al. (2021); Zhang et al. (2024), multimodal document retrieval Chang et al. (2022); Liu et al. (2022), knowledge retrieval with multimodal queries Luo et al. (2023), and retrieval-augmented generation Yasunaga et al. (2023); Yu et al. (2024). Most existing methods employ pre-trained vision-language models (VLMs) to address these tasks Radford et al. (2021); Li et al. (2023); Saito et al. (2023). However, common VLMs are trained purely on image-text matching datasets Changpinyo et al. (2021); Schuhmann et al. (2022), and therefore lack the ability to jointly encode and comprehend both modalities effectively. As a result, it is necessary to create proper datasets that extend VLMs to diversified multimodal retrieval tasks.

Figure 1: Construction pipeline of multimodal triplets: (a) mining of image pairs, (b) generation of open-ended instructions. Multiple similarity models are used to introduce diversified correlations for the image pairs.

Instruction Tuning for Multimodal Retrieval.

Instruction-tuning is a popular strategy to enhance the multi-task capacity of both large language models Ouyang et al. (2022); Wei et al. (2021); Chung et al. (2024) and embedding models Su et al. (2022); Asai et al. (2022); Zhang et al. (2023); Xiao et al. (2024); Chen et al. (2024a). While a few instruction datasets have been proposed for multimodal retrieval Liu et al. (2021, 2022); Chang et al. (2022); Wei et al. (2024); Zhou et al. (2024), they are limited in scale and diversity due to their reliance on human annotation. Recently, notable progress was made by MagicLens Zhang et al. (2024), which creates a large-scale open-ended search instruction dataset from images that co-occur within the same webpage. However, given the shortage of multi-image webpages, MagicLens is limited in scalability and data quality. Moreover, the dataset remains private and inaccessible to public users. As a result, the creation and release of high-quality instruction-tuning datasets have become imperative for advancing multimodal retrieval research.

3 Methodology

3.1 MegaPairs Construction

Training on large-scale open-world data significantly enhances the generalization capabilities of foundation models. For instance, CLIP Radford et al. (2021) has achieved remarkable advancements in cross-modal retrieval and various downstream tasks due to its extensive training on text-image pairs. However, multimodal instruction-tuning data, despite its importance to multimodal retrieval Zhang et al. (2024), is scarce in the natural world and expensive to annotate by human effort. In this paper, we propose to construct large-scale multimodal instruction-tuning datasets through data synthesis. Formally, each data instance contains the following triplet: a pair of images $(\mathcal{I}_q, \mathcal{I}_t)$, together with a textual instruction $\mathcal{T}_{q \rightarrow t}$ specifying the transition relationship from the query image $\mathcal{I}_q$ to the target image $\mathcal{I}_t$.

We identify two primary technical challenges in acquiring such triplets: (1) sampling relevant and diversified image pairs at scale, and (2) precisely annotating an instruction for each sampled image pair. To address these challenges, we propose leveraging common open-domain image corpora. Intuitively, a large-scale corpus contains abundant correlated images with diverse semantic relationships, which can be mined and annotated for our instruction-tuning data. Our data synthesis pipeline is illustrated in Figure 1 and involves two main components: the mining of image pairs and the generation of open-ended instructions.

Mining Correlated Image Pairs.

As illustrated in Figure 1(a), we propose sampling correlated image pairs from a large-scale image corpus. For each query image $(\mathcal{I}_q, \mathcal{C}_q)$, we utilize multiple similarity models to search for a diverse set of correlated target images with heterogeneous correlations, $\{\mathcal{I}_{t_1}, \mathcal{I}_{t_2}, \ldots, \mathcal{I}_{t_n}\}$. In our work, the following types of correlations are used: (1) visual-semantic correlation, which measures the semantic correlation of two images regardless of visual similarity, e.g., two different views of the same car; (2) visual-pattern correlation, which captures the visual similarity of two images regardless of semantic correlation, e.g., different cars in similar backgrounds; (3) caption correlation, which measures the textual similarity between two images’ captions.

Recognizing the importance of hard negatives in training retrieval models Xiong et al. (2020); Hofstätter et al. (2021); Zhang et al. (2022), for each pair $(\mathcal{I}_q, \mathcal{I}_{t_i})$, we include additional images $\{\mathcal{I}_{t_j} \mid t_j \neq t_i\}$ from the retrieved set as hard negative samples. This approach is simple but empirically effective. We validate the scalability and quality of our data pairs in Section 4.3, with additional examples visualized in Appendix F.

Generating Open-Ended Instructions.

As shown in Figure 1(b), we utilize open-source multimodal large language models (MLLM) and large language models (LLM) for the automated annotation of mined image pairs $\mathcal{P} = \{(\mathcal{I}_q, \mathcal{I}_{t_i})\}$. Initially, each image pair $(\mathcal{I}_q, \mathcal{I}_{t_i})$ is processed by the MLLM to generate a detailed description $\mathcal{D}_i$ of the common concepts and differences between the query image $\mathcal{I}_q$ and the target image $\mathcal{I}_{t_i}$. This description $\mathcal{D}_i$ is then refined by the LLM to produce textual instructions $\mathcal{T}_{q \rightarrow t_i}$. We prompt the LLM to generate multiple $\mathcal{T}_{q \rightarrow t_i}$ for each pair, enhancing the diversity of the textual instructions. Ultimately, we construct a multimodal triplet $(\mathcal{I}_q, \mathcal{T}_{q \rightarrow t_i}, \mathcal{I}_{t_i})$, where $(\mathcal{I}_q, \mathcal{T}_{q \rightarrow t_i})$ can be used to retrieve $\mathcal{I}_{t_i}$. This two-step annotation method ensures both accuracy and diversity in the automated annotation process while leveraging open-source models. The detailed prompts can be found in Appendix A.
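The two-step annotation can be sketched as follows. This is a minimal illustration rather than our released pipeline: `describe` and `instruct` are hypothetical callables standing in for the MLLM and LLM calls, images are represented by file paths, and the default of three instructions per pair is an assumption chosen to match the Implementations paragraph.

```python
from typing import Callable, List, Tuple

# (query image, instruction text, target image); images are file paths here.
Triplet = Tuple[str, str, str]


def annotate_pair(
    query_img: str,
    target_img: str,
    describe: Callable[[str, str], str],        # hypothetical MLLM wrapper
    instruct: Callable[[str, int], List[str]],  # hypothetical LLM wrapper
    num_instructions: int = 3,
) -> List[Triplet]:
    """Turn one mined image pair into several (I_q, T_{q->t}, I_t) triplets."""
    # Step 1: the MLLM describes common concepts and differences of the pair.
    description = describe(query_img, target_img)
    # Step 2: the LLM rewrites the description into several instructions.
    instructions = instruct(description, num_instructions)
    return [(query_img, text, target_img) for text in instructions]
```

Keeping the two model calls behind plain callables means either annotator can be swapped without touching the triplet assembly.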

Implementations.

A dataset of 26,235,105 image pairs is created based on the above data synthesis pipeline. We utilize a subset of Recap-DataComp-1B Li et al. (2024b) as our image corpus, containing 20 million captioned images. For similarity models, we employ EVA-CLIP’s image encoder for visual-semantic correlation Sun et al. (2023), DINOv2 Oquab et al. (2024) for visual-pattern correlation, and EVA-CLIP’s text encoder for caption similarity. We retain only the image pairs whose similarity score falls within (0.8, 0.96), thus eliminating weak associations and near-duplicates. We further leverage InternVL2-26B Chen et al. (2024b) and LLaMA3-8B Dubey et al. (2024) to generate the open-ended instructions. For each image pair, we create at least three different textual instructions and introduce five hard negatives.
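A minimal sketch of the mining step under stated assumptions: the embeddings of each similarity model are pre-computed and L2-normalized, the (0.8, 0.96) filter is applied separately per model, and search is brute force. The function name `mine_pairs` and the dictionary layout are hypothetical, and a corpus of millions of images would need an approximate nearest-neighbor index instead of a dense product.

```python
import numpy as np


def mine_pairs(embeddings, query_idx, low=0.8, high=0.96, num_negatives=5):
    """Return (target, hard_negatives) tuples for one query image.

    embeddings: dict mapping a similarity-model name to an (N, d) array of
    L2-normalized vectors, one row per corpus image (hypothetical layout).
    """
    retrieved = []
    for emb in embeddings.values():
        sims = emb @ emb[query_idx]  # cosine similarity to the query image
        # Keep targets that are related (> low) but not near-duplicates (< high).
        keep = np.where((sims > low) & (sims < high))[0]
        retrieved.extend(int(i) for i in keep if i != query_idx)
    retrieved = list(dict.fromkeys(retrieved))  # de-duplicate, keep order

    pairs = []
    for target in retrieved:
        # Other images retrieved for the same query serve as hard negatives.
        negatives = [t for t in retrieved if t != target][:num_negatives]
        pairs.append((target, negatives))
    return pairs
```

Because the negatives come from the same retrieved set, they are by construction close to the query, which is what makes them hard.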

3.2 MMRet Model

We propose MMRet, a series of models designed for universal multimodal retrieval based on pre-trained vision-language models (VLMs). Our MMRet integrates two distinct VLM architectures to achieve a universal multimodal embedding.

CLIP-based MMRet.

The original CLIP Radford et al. (2021) model employs a dual encoder architecture that independently encodes image and text data. We denote the image encoder as $\Phi_I$ and the text encoder as $\Phi_T$. Given an image $I$ or text $T$, their embeddings are computed as follows:

$$
\begin{aligned}
\mathbf{e}_i &= \Phi_I(I) \\
\mathbf{e}_t &= \Phi_T(T)
\end{aligned}
\tag{1}
$$

To produce the multimodal embedding for a composed image-text sample $(I, T)$, we employ the score-fusion strategy as used by UniIR Wei et al. (2024), which directly uses an element-wise addition of the outputs from the dual encoders:

$$
\mathbf{e}_{it} = \Phi_I(I) + \Phi_T(T) \tag{2}
$$

In our CLIP-based MMRet, we trained both base and large models.
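As a small illustration of Eq. (2), the sketch below fuses two encoder outputs by addition. The function name `fuse` and the random vectors are assumptions for illustration only; real inputs would be the outputs of $\Phi_I$ and $\Phi_T$.

```python
import numpy as np


def fuse(image_emb, text_emb):
    """Score fusion (Eq. 2): element-wise sum of the two encoder outputs."""
    return image_emb + text_emb


# Stand-in vectors; in practice these come from the image and text encoders.
rng = np.random.default_rng(0)
e_i, e_t, candidate = rng.normal(size=(3, 8))

fused_score = fuse(e_i, e_t) @ candidate
separate_scores = e_i @ candidate + e_t @ candidate
```

Because the dot product is linear, `fused_score` equals `separate_scores`: adding the embeddings is the same as adding the per-modality similarity scores, which is where the name score fusion comes from.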

MLLM-based MMRet.

The multimodal large language models (MLLMs) incorporate a visual encoder, typically based on a vision transformer Dosovitskiy et al. (2021), into a large language model (LLM). This integration allows image tokens to be directly processed by the LLM. Consequently, MLLMs can effectively handle diverse multimodal inputs by converting any type of input into a sequence of tokens. For instance, composed image-text data is transformed into interleaved sequences of image and text tokens, enabling the model to process them seamlessly.

Our MLLM-based MMRet builds upon LLaVA-1.6 Liu et al. (2024). In both training and inference stages, MMRet uses task-specific instructions for query inputs to improve generalization, aligning with standard practices in LLM-based embedding models Wang et al. (2023); Li et al. (2024a). A typical multimodal query input is structured as follows:

$$
\langle\text{instruct}\rangle~\text{\{task\_inst\}}~\langle\text{query}\rangle~\{q_t\}~\{q_i\}~\texttt{[EOS]} \tag{3}
$$

where {task_inst} represents the task-specific instruction, $\{q_t\}$ denotes the input query text, and $\{q_i\}$ is the input query image. The normalized last hidden state of the [EOS] token in the MLLM is used as the embedding of any given input sequence.
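The query template and the [EOS] pooling can be sketched as follows. The literal marker strings (`<instruct>`, `<query>`, `<image>`) are hypothetical spellings, the tokenizer is assumed to append [EOS], and right padding is assumed so that the last attended position is the [EOS] token. The hidden states are plain NumPy arrays standing in for the MLLM output.

```python
import numpy as np


def build_query(task_inst, query_text, image_placeholder="<image>"):
    """Assemble the query string of Eq. (3); marker spellings are hypothetical."""
    return f"<instruct> {task_inst} <query> {query_text} {image_placeholder}"


def eos_embedding(hidden_states, attention_mask):
    """Take the last attended position of each sequence and L2-normalize it.

    hidden_states: (batch, seq_len, dim) final-layer states.
    attention_mask: (batch, seq_len) integer mask, 1 for real tokens,
    with right padding assumed.
    """
    last = attention_mask.sum(axis=1) - 1  # index of the final real token
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last]
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

With left padding the [EOS] state would instead sit at the final position of every row, so the pooling index depends on the tokenizer configuration.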

3.3 Multimodal Contrastive Learning

We employ multimodal contrastive learning to transform the original CLIP and MLLM into our MMRet model, enabling various multimodal retrieval tasks. We use the standard InfoNCE loss Oord et al. (2018) as our training objective:

$$
\mathcal{L} = -\frac{1}{|\mathcal{Q}|} \sum_{q_i \in \mathcal{Q}} \log \frac{\exp(\mathbf{e}_{q_i} \cdot \mathbf{e}_{c_i^{+}} / \tau)}{\sum_{c_j \in \mathcal{C}} \exp(\mathbf{e}_{q_i} \cdot \mathbf{e}_{c_j} / \tau)} \tag{4}
$$

where the set $\mathcal{Q}$ includes all query samples $q_i$ in a batch. The vectors $\mathbf{e}_{q_i}$ and $\mathbf{e}_{c_i^{+}}$ are the embeddings of the query $q_i$ and its positive candidate $c_i^{+}$, respectively. The set $\mathcal{C}$ contains all in-batch candidates. Notably, both $q$ and $c$ can be images, text, or composed image-text data. The temperature parameter $\tau$ modulates the penalties on negative samples and is set to 0.02 unless otherwise specified in this paper.
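A minimal NumPy sketch of Eq. (4), for illustration rather than training. It assumes L2-normalized embeddings and that row $i$ of the candidate matrix is the positive for query $i$; any further rows (for example hard negatives) act only as negatives. The function name `info_nce` is hypothetical.

```python
import numpy as np


def info_nce(query_emb, cand_emb, tau=0.02):
    """InfoNCE loss of Eq. (4) over one batch.

    query_emb: (B, d) query embeddings.
    cand_emb:  (M, d) candidate embeddings with M >= B; row i is the
               positive for query i (an assumed layout).
    """
    logits = query_emb @ cand_emb.T / tau                # (B, M) scaled scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    rows = np.arange(query_emb.shape[0])
    return float(-log_prob[rows, rows].mean())
```

With $\tau = 0.02$ a cosine similarity gap of 0.1 becomes a logit gap of 5, so the small temperature sharpens the softmax and penalizes near-miss negatives heavily.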

| Methods | Backbone | # Params | CIRCO mAP@5 | CIRR R@1 | CIRR Rs@1 | FashionIQ R@10 | GeneCIS Rs@1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SEARLE Baldrati et al. (2023) | CLIP-B | 165M | 9.4 | 24.0 | 54.9 | 22.9 | - |
| CIReVL Karthik et al. (2023) | CLIP-B | 12.3B† | 14.9 | 23.9 | 60.2 | 28.3 | 15.9 |
| LDRE Yang et al. (2024) | CLIP-B | 7.9B† | 18.0 | 25.7 | 60.5 | 24.8 | - |
| MagicLens-B Zhang et al. (2024) | CLIP-B | 166M | 23.1 | 27.0 | 66.7 | 26.3 | 15.0 |
| MagicLens-B‡ Zhang et al. (2024) | CoCa-B | 267M | 30.8 | 31.6 | 69.3 | 35.2 | 17.4∗ |
| MMRet-Base | CLIP-B | 149M | 34.3 | 36.1 | 71.6 | 31.9 | 18.0 |
| Pic2Word Saito et al. (2023) | CLIP-L | 429M | 8.7 | 23.9 | - | 24.7 | 11.2 |
| PLI Chen and Lai (2023) | CLIP-L | 428M | 10.4 | 25.5 | 55.6 | 35.4 | - |
| SEARLE Baldrati et al. (2023) | CLIP-L | 442M | 11.7 | 24.2 | 53.8 | 25.6 | 12.3 |
| CompoDiff Gu et al. (2024a) | CLIP-L | 568M | 12.6 | 18.2 | 57.4 | 36.0 | 14.9 |
| CIReVL Karthik et al. (2023) | CLIP-L | 12.5B† | 18.6 | 24.6 | 59.5 | 28.6 | 15.9 |
| LDRE Yang et al. (2024) | CLIP-L | 8.2B† | 23.4 | 26.5 | 60.4 | 28.5 | - |
| MagicLens-L Zhang et al. (2024) | CLIP-L | 465M | 29.6 | 30.1 | 68.1 | 30.7 | 16.3 |
| MagicLens-L‡ Zhang et al. (2024) | CoCa-L | 613M | 34.1∗ | 33.3∗ | 70.9∗ | 38.0 | 16.7 |
| MMRet-Large | CLIP-L | 428M | 39.2 | 38.0 | 73.2 | 34.6 | 18.1 |
| LDRE Yang et al. (2024) | CLIP-G | 10.3B† | 31.1 | 36.2 | 68.8 | 32.5 | - |
| CIReVL Karthik et al. (2023) | CLIP-G | 14.6B† | 26.8 | 34.7 | 68.0 | 32.2 | 17.4∗ |
| IP-CIR Li et al. (2024c) | CLIP-G | 43.8B† | 32.8 | 39.3 | 70.0 | 45.7∗ | - |
| E5-V Jiang et al. (2024a) | LLaVA-1.6 | 8.35B | 19.1 | 33.9 | - | 31.8 | - |
| MM-Embed Lin et al. (2024) | LLaVA-1.6 | 7.57B | 32.3 | - | - | - | - |
| MMRet-MLLM | LLaVA-1.6 | 7.57B | 42.2 | 46.7 | 75.4 | 35.6 | 21.1 |

Table 1: Zero-shot retrieval performance on various CIR benchmarks. ∗ denotes the previous best performance for each benchmark prior to MMRet. † indicates methods with multiple components (e.g., GPT-3.5, Qwen1.5-32B); we report # parameters of components with known sizes. The CoCa-based MagicLens‡ models are proprietary. Results in bold and underline denote the best and second-best performances for each model scale, respectively. Our MMRet model achieves state-of-the-art results across different model sizes and benchmarks, surpassing the previous SOTA by 8.1% on the main benchmark CIRCO, significantly advancing zero-shot CIR methods.

4 Experiments

In this section, we first evaluate the effectiveness of MegaPairs on zero-shot composed image retrieval (CIR) tasks in Section 4.1. Next, we explore the impact of MegaPairs on broader multimodal retrieval tasks in Section 4.2. Finally, we conduct detailed analysis on our MegaPairs in Section 4.3.

4.1 Zero-shot Performance on CIR tasks

4.1.1 Implementation Details

We utilize our MegaPairs dataset to perform multimodal contrastive training for our MMRet models. For the CLIP-based MMRet, we initialize the model using both the base (https://huggingface.co/openai/clip-vit-base-patch16) and large (https://huggingface.co/openai/clip-vit-large-patch14) versions of CLIP, referred to as MMRet-Base and MMRet-Large, respectively. For the MLLM-based MMRet, we leverage the LLaVA-1.6 Mistral 7B architecture (https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and initialize the model parameters accordingly, which we denote as MMRet-MLLM. The training details of MMRet on MegaPairs can be found in Appendix B.

4.1.2 Benchmarks

We evaluate our MMRet in a zero-shot setting across four different composed image retrieval benchmarks: CIRCO Baldrati et al. (2023), CIRR Liu et al. (2021), FashionIQ Wu et al. (2021), and GeneCIS Vaze et al. (2023). Following previous practice Zhang et al. (2024), CIRCO is considered our main benchmark due to its extensive candidate pool and high-quality annotations. Detailed information and metrics for each benchmark can be found in Appendix C.

Table 2: Zero-shot performance on the Massive Multimodal Embedding Benchmark (MMEB). †UniIR was trained on M-BEIR Wei et al. (2024), which includes 10 of the 12 datasets in the MMEB retrieval tasks, so it does not strictly adhere to a zero-shot setting. In contrast, our MMRet-MLLM, trained exclusively on the MegaPairs dataset, achieves state-of-the-art zero-shot performance in overall scores and multiple meta-tasks on MMEB.

4.1.3 Evaluation Results

The main evaluation results of MMRet across four benchmarks are shown in Table 1, with full results for each benchmark provided in Appendix D. We have identified three key observations:

(1) Our MMRet-MLLM model achieves leading performance on three of the four benchmarks. Specifically, on our main benchmark, CIRCO, MMRet-MLLM surpasses the current SOTA CoCa-based MagicLens-L by achieving 42.2% mAP@5 compared to 34.1% (an increase of 8.1%). On the CIRR test set, it exceeds the current SOTA by 7.4% and 4.5% in R@1 and Rs@1, respectively. Additionally, on GeneCIS, it leads the current SOTA by 3.7% in Rs@1.

(2) MMRet exhibits superior performance across all model scales. For instance, MMRet-Base and MMRet-Large outperform comparable models by 4.5% and 4.7% in R@1 on the CIRR test set, respectively. Additionally, they surpass similar models by 3.5% and 5.1% in mAP@5 on the CIRCO benchmark. In the fashion-domain benchmark FashionIQ, while not achieving the highest scores, our CLIP-based MMRet shows competitive performance against other CLIP-based models.

(3) The MMRet-Base model surpasses most larger models, underscoring the exceptional quality of our MegaPairs dataset. Despite being our smallest model, MMRet-Base outperforms many larger models such as MagicLens-L. For instance, it achieves the best result on CIRCO with a mAP@5 of 34.3%, excluding our own MMRet-Large and MMRet-MLLM models. It even exceeds the performance of models with dozens of times more parameters (e.g., MM-Embed), emphasizing the effectiveness of our MegaPairs dataset.

4.2 Performance on MMEB

To further validate the generalization ability of MegaPairs for broader multimodal embedding tasks, we evaluate MMRet on the Massive Multimodal Embedding Benchmark (MMEB) Jiang et al. (2024b). MMEB is a comprehensive benchmark that includes 36 datasets across four meta-task categories: Classification, Visual Question Answering, Retrieval, and Visual Grounding. It is designed to evaluate the quality of multimodal embeddings and assesses models across diverse combinations of text and image modalities. We present the performance of MMRet in both zero-shot and supervised fine-tuning scenarios. Following previous works Jiang et al. (2024a, b), we conduct experiments using our MMRet-MLLM.

Table 3: Supervised fine-tuning results on the MMEB benchmark. The backbone of our MMRet-MLLM is LLaVA-1.6 Liu et al. (2024). We compare our results with the following baselines: CLIP Radford et al. (2021), OpenCLIP Cherti et al. (2023), and two versions of VLM2Vec Jiang et al. (2024b) that employ the LLaVA-1.6 Liu et al. (2024) and Phi-3.5-V Abdin et al. (2024) backbones. All baseline results are sourced from Jiang et al. (2024b). IND: in-distribution dataset; OOD: out-of-distribution dataset.

4.2.1 Zero-shot Performance

Implementation Details.

In the zero-shot evaluation on MMEB, we directly utilize our MMRet-MLLM from Section 4.1, maintaining implementation details consistent with Section 4.1.1.

Metrics.

We evaluate Precision@1 for all tasks, which measures the ratio of positive candidates ranked in the top position for all queries. We report the average scores for the four meta tasks as well as the overall average. Following the MMEB setting, we incorporate the predefined task-specific instructions into queries for all tasks during evaluation.
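The Precision@1 metric described above can be sketched as follows. This is a minimal illustration, assuming each query has exactly one positive candidate and a precomputed list of similarity scores; the function name `precision_at_1` and the toy scores are hypothetical and not part of the MMEB tooling.

```python
from typing import Sequence


def precision_at_1(scores: Sequence[Sequence[float]], positives: Sequence[int]) -> float:
    """Fraction of queries whose highest-scoring candidate is the positive.

    scores[q][c] is the similarity between query q and its candidate c;
    positives[q] is the index of the correct candidate for query q.
    """
    if len(scores) != len(positives):
        raise ValueError("one positive index is required per query")
    hits = 0
    for row, pos in zip(scores, positives):
        # argmax over candidates; ties resolve to the lowest index
        top = max(range(len(row)), key=row.__getitem__)
        hits += int(top == pos)
    return hits / len(scores)


example_scores = [
    [0.9, 0.1, 0.3],  # positive (index 0) is ranked first
    [0.2, 0.8, 0.5],  # positive (index 2) is ranked second
]
example_p1 = precision_at_1(example_scores, [0, 2])
```

In this toy case one of the two queries ranks its positive first, so the score is one half.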

Results.

The zero-shot performance of our MMRet-MLLM on MMEB is presented in Table 2. MMRet-MLLM achieved state-of-the-art zero-shot performance across various embedding meta-tasks, recording the highest overall average performance. Compared to the recent E5-V Jiang et al. (2024a), which uses a similar LLaVA-1.6 Liu et al. (2024) backbone for universal multimodal embedding, MMRet-MLLM trained on our MegaPairs dataset demonstrated superior performance. Notably, the second-best model, UniIR, was trained on M-BEIR Wei et al. (2024), which encompasses datasets from 10 of the 12 retrieval meta-tasks in MMEB, and thus is not considered zero-shot for this meta-task. Consequently, our MMRet-MLLM significantly outperforms the remaining methods in the retrieval meta-task and demonstrates strong generalization capabilities across all tasks.

4.2.2 Supervised Fine-tuning Performance

Implementation Details.

We further fine-tune our MMRet-MLLM on MMEB to investigate the impact of MegaPairs on downstream task performance. The MMEB dataset includes 20 in-distribution (IND) datasets for training and 16 out-of-distribution (OOD) datasets for evaluation. We utilize the training sets from the 20 IND datasets, comprising approximately 662K data points. The learning rate is set to $5\times 10^{-6}$, and we employ LoRA with a rank of 32. The batch size is set to 192, and we train for one epoch. Following the VLM2Vec configuration Jiang et al. (2024b), we incorporate task-specific instructions into the queries during training.
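As a rough check on the training budget implied by these settings, the number of optimizer steps in one epoch can be estimated as below. The example count is the rounded "approximately 662K" figure from the text, so the result is only an estimate; the variable names are illustrative, and gradient accumulation is assumed not to be used.

```python
import math

num_examples = 662_000  # rounded figure from the text; the exact count is not given
batch_size = 192        # batch size stated in the text

# One epoch visits every example once; the last batch may be partial.
steps_per_epoch = math.ceil(num_examples / batch_size)
```

Under these assumptions a single epoch corresponds to roughly 3.4K steps.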

Metrics.

We employ the same metrics as outlined in Section 4.2.1. Additionally, we report the average scores for both the IND and OOD datasets.

Results.

Table 3 compares the supervised fine-tuning performance of our MMRet model with various baselines on the MMEB dataset. Our MMRet-MLLM achieves state-of-the-art performance, with an overall average Precision@1 of 64.1%. Compared to VLM2Vec (LLaVA-1.6) Jiang et al. (2024b), which directly fine-tunes LLaVA-1.6 on MMEB, MMRet-MLLM enhances downstream task performance by 9.1% through multimodal contrastive training on our MegaPairs. Notably, our model shows improvements of 11.6% and 7.1% on out-of-distribution (OOD) datasets compared to the two versions of VLM2Vec, highlighting the superior generalization capability of our MegaPairs for broader downstream multimodal embedding tasks.

4.3 Detailed Investigation on MegaPairs

We first assess the quality and scalability of our MegaPairs dataset in Section 4.3.1. Next, we evaluate the effectiveness of the hard negative samples provided by MegaPairs in Section 4.3.2. Finally, we explore the strategies used for mining image pairs from open-domain image corpora in Section 4.3.3. Unless otherwise specified, all subsequent experiments are conducted using our MMRet-base model.

4.3.1 Data Scalability and Quality

Figure 2: Performance scaling of MMRet-base on MegaPairs as data size increases. The dashed lines indicate the performance of MagicLens-B (CLIP) trained on their dataset of 36.7M data pairs.

We first evaluated the performance trend of MMRet by training it on different sizes of subsets from the MegaPairs dataset to verify its scalability. Subsequently, we compared it with existing datasets to highlight the high-quality features of MegaPairs.

Performance Scaling. As shown in Figure 2, the performance of MMRet-base across various benchmarks consistently improves with the increasing size of training data. This upward trend highlights the effectiveness and scalability of MegaPairs.

Dataset Quality Comparison with Existing Datasets. The dashed lines in Figure 2 represent the performance of the MagicLens-B (CLIP) model, trained on their 36.7M dataset Zhang et al. (2024). Remarkably, with only 0.5M samples from our MegaPairs dataset, constituting less than 2% of the MagicLens data, MMRet significantly surpasses MagicLens across all benchmarks using the same CLIP-base backbone. This result underscores the superior quality and efficiency of our MegaPairs dataset.

4.3.2 The Impact of Hard Negatives

In MegaPairs, images from the retrieved set that are not the target are marked as hard negatives, providing a diverse and ample set of hard negative target images for each image pair. As shown in Table 4, compared to not using negatives or only using the query image as a negative, training with our mined hard negatives significantly enhances model performance across all benchmarks.
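To illustrate why harder negatives give a stronger training signal, the following is a minimal sketch assuming a standard InfoNCE-style contrastive loss over dot-product similarities. The loss form, the temperature of 0.05, the toy vectors, and the function names are assumptions for illustration; this section does not specify the exact objective.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, temperature: float = 0.05) -> float:
    """Cross-entropy of the positive against the positive plus all negatives."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]      # unrelated to the query
hard_negative = [0.9, 0.436]    # close to the query but not the target

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
```

With the easy negative the loss is already near zero, so it contributes almost no gradient; the hard negative yields a clearly larger loss, which is the intuition behind mining negatives from the retrieved set.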

Table 4: Performance comparison of MMRet-base using different negative strategies at a 1M scale. Qry: query image negative; HN: our mined hard negatives. †We report CIRR validation set performance due to their test server submission limits.

4.3.3 Data Pair Search Strategy

We explored the impact of various search strategies in constructing heterogeneous triplets. For a fair comparison, we selected 1M data entries for each construction strategy and trained the model for 2000 steps.

Table 5 presents the results of various data pairing strategies across multiple benchmarks. Initially, when evaluating individual strategies, we observed that triplets based on text similarity achieved the highest zero-shot CIR performance. We hypothesize that text similarity captures more diverse relationships than image similarity. Furthermore, combining any two pairing strategies consistently outperformed using a single strategy. This enhancement is likely due to the increased diversity within the dataset, which is essential for training robust multimodal embedding models. Ultimately, employing all three strategies simultaneously provided the most robust performance across all datasets. As a result, this approach was chosen for constructing the MegaPairs.
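The idea of combining several similarity models to diversify candidate pairs can be sketched as a union of nearest neighbours across embedding spaces. This is a toy illustration only: the space labels D, I, and T mirror the abbreviations in Table 5, while the 2-D vectors, the value of k, and the helper names are hypothetical, and any filtering applied in the actual pipeline is omitted.

```python
import math
from typing import Dict, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    # Assumes non-zero vectors.
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def top_k(query_idx: int, embeddings: List[List[float]], k: int) -> List[int]:
    """Indices of the k items most similar to the query item, excluding itself."""
    scored = [
        (cosine(embeddings[query_idx], e), i)
        for i, e in enumerate(embeddings)
        if i != query_idx
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


def mine_candidates(
    query_idx: int, spaces: Dict[str, List[List[float]]], k: int
) -> Dict[int, Set[str]]:
    """Union of per-space neighbours, recording which space proposed each one."""
    found: Dict[int, Set[str]] = {}
    for name, embeddings in spaces.items():
        for i in top_k(query_idx, embeddings, k):
            found.setdefault(i, set()).add(name)
    return found


# Four items embedded in three hypothetical spaces; item 0 is the query.
spaces = {
    "D": [[1, 0], [1, 0.1], [0, 1], [-1, 0]],
    "I": [[1, 0], [0, 1], [1, 0.1], [-1, 0]],
    "T": [[1, 0], [0, 1], [-1, 0], [1, 0.1]],
}
candidates = mine_candidates(0, spaces, k=1)
```

In this toy setup each space proposes a different neighbour for the same query, so the union contains three distinct targets where any single space would contribute only one.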

Table 5: Performance comparison of MMRet-base using different data pairing strategies at 1M scale. D: DINOv2 Encoder; I: CLIP Image Encoder; T: CLIP Text Encoder. FIQ and CIS represent the FashionIQ and GeneCIS benchmarks, respectively. † We report CIRR validation set performance due to test server submission limits.

5 Conclusion

In this paper, we introduce MegaPairs, a large-scale multimodal pairing dataset designed for training universal multimodal retrievers. MegaPairs comprises diverse image pairs from the open world, annotated with open-ended textual instructions that capture their visual and semantic relationships. Using MegaPairs, we trained our MMRet models, achieving state-of-the-art zero-shot performance in four composed image retrieval tasks and on the Massive Multimodal Embedding Benchmark (MMEB), which consists of 36 different datasets. Extensive experiments further demonstrate the generalization capability and high-quality features of MegaPairs.

Limitations

In constructing MegaPairs, we discovered that using diverse retrievers can generate richer image pairs. Our study employed three distinct retrievers, which offered substantial diversity. However, there remains potential to explore additional pairing methods, such as leveraging more advanced text domain retrievers (e.g., BGE Xiao et al. (2024)) or incorporating varied strategies like image-text retrieval.

Ethics Statement

All images in our MegaPairs dataset are sourced from the Recap-Datacomp-1B dataset Li et al. (2024b), and have undergone rigorous screening by the Datacomp team to remove harmful content Gadre et al. (2024). Despite our best efforts, we acknowledge that these screenings may not be entirely comprehensive or without omissions. Additionally, we strongly discourage the use of MMRet models for encoding and retrieving sensitive content.

References

Appendix

Appendix A Detailed Prompt for Annotating Open-Ended Instructions

To annotate open-ended instructions, we begin by using the MLLM to generate a detailed description of the commonalities and differences between the query image and the target image, where the corresponding prompt is illustrated in Figure 3. Subsequently, the description is refined by the LLM to produce textual instructions, with the associated prompt provided in Figure 4.

Figure 3: The specific prompts for MLLM. The value of WORD_NUM ranges from 60 to 100 in our practical data generation to enhance the diversity of the generated description.

Figure 4: The specific prompts for LLM. The figure showcases two demonstrations, while in our practical data generation process, five demonstrations are randomly selected from a pool of 50 and fed into the LLM.

Appendix B Training Details of MMRet on MegaPairs

For the CLIP-based MMRet, the training process employs a batch size of 2048, with each query paired with one positive image and four hard negatives. All input images are resized to 224x224 to match the model’s configuration. During training, all CLIP parameters remain unfrozen. MMRet-base is trained for 15,000 steps, while MMRet-large is trained for 25,000 steps on the MegaPairs dataset.

For the MLLM-based MMRet, we use a batch size of 144 during training, with each query associated with one positive image and three hard negatives. We apply LoRA Hu et al. (2022) to both the ViT encoder and the LLM backbone of LLaVA-1.6, setting the LoRA rank to 32. Although the original model supports variable resolution image processing, we use a fixed resolution of 512x512 for all images to manage the token sequence length. MMRet-MLLM is trained for 20,000 steps on the MegaPairs dataset.

For both CLIP-based and MLLM-based MMRet models, we set an initial learning rate of $5\times 10^{-6}$ and employ a linear decay strategy.
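A linear decay schedule of this kind can be sketched as below. The text gives only the initial rate and the decay type, so decaying to zero at the final step and using no warmup are assumptions, and `linear_decay_lr` is a hypothetical helper name.

```python
def linear_decay_lr(step: int, total_steps: int, initial_lr: float = 5e-6) -> float:
    """Learning rate that falls linearly from initial_lr to 0 over total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)  # clamp once training is past the end
    return initial_lr * remaining


# Example: the 15,000-step schedule stated for MMRet-base.
lr_start = linear_decay_lr(0, 15_000)
lr_middle = linear_decay_lr(7_500, 15_000)
lr_end = linear_decay_lr(15_000, 15_000)
```

Under these assumptions the rate is halved at the midpoint of training and reaches zero at the last step.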

Appendix C Detailed Information and Evaluation Metrics of Zero-Shot CIR Benchmarks

The detailed information and metrics of our evaluation in zero-shot composed image retrieval (CIR) tasks for each benchmark are as follows:

CIRCO Baldrati et al. (2023) is a challenging zero-shot CIR benchmark comprising 123,403 candidate natural images. We evaluate our MMRet models on its test set, which contains 800 composed image-text queries, each annotated with multiple ground-truth images. We use mean Average Precision (mAP) as the evaluation metric. Due to its extensive candidate pool and high-quality annotations, CIRCO serves as a robust and comprehensive benchmark for zero-shot CIR evaluation, and we consider it our main benchmark.
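For queries with multiple ground-truth images, a truncated mAP@k can be computed as in the sketch below. Normalizing by the smaller of k and the number of ground truths is a common convention that is assumed here; the official benchmark code should be consulted for the exact definition. The function names and toy rankings are hypothetical.

```python
from typing import List, Sequence, Set


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """AP@k for one query; `ranked` is assumed to contain no duplicates."""
    if not relevant:
        raise ValueError("each query needs at least one ground-truth item")
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: List[Sequence[str]], relevants: List[Set[str]], k: int
) -> float:
    aps = [average_precision_at_k(r, g, k) for r, g in zip(rankings, relevants)]
    return sum(aps) / len(aps)


ap_first = average_precision_at_k(["a", "b", "c", "d", "e"], {"a", "c"}, k=5)
map_toy = mean_average_precision_at_k(
    [["a", "b", "c", "d", "e"], ["x", "y", "z"]],
    [{"a", "c"}, {"z"}],
    k=5,
)
```

In the first toy query the two ground truths appear at ranks 1 and 3, giving (1/1 + 2/3) / 2 = 5/6; averaging with the second query's 1/3 gives 7/12.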

CIRR Liu et al. (2021) is the first dataset for conducting the CIR task using natural images. We conduct zero-shot evaluations on its test set, which comprises 4,148 queries and a corpus of 2,315 images. Each query in CIRR is annotated with exactly one positive target image, but it suffers from some false negative issues. For each query, CIRR provides a subset retrieval setting that retrieves target images from a small corpus. We assess both standard and subset retrieval performance using recall metrics (R and Rs).
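The difference between standard recall (R) and subset recall (Rs) can be sketched for a single query as follows. The sketch assumes that subset retrieval re-ranks only the candidates in the query's predefined subset using the same similarity ordering; the helper names and image identifiers are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the single target appears among the top-k results, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def subset_recall_at_k(ranked: Sequence[str], target: str, subset: Set[str], k: int) -> float:
    """Recall@k after restricting the ranking to the query's candidate subset."""
    restricted = [c for c in ranked if c in subset]
    return recall_at_k(restricted, target, k)


ranked = ["img7", "img2", "img9", "img4"]   # full-corpus ranking for one query
target = "img4"
subset = {"img2", "img4", "img5"}           # small per-query candidate pool

r_at_1 = recall_at_k(ranked, target, 1)
r_at_4 = recall_at_k(ranked, target, 4)
rs_at_1 = subset_recall_at_k(ranked, target, subset, 1)
rs_at_2 = subset_recall_at_k(ranked, target, subset, 2)
```

Here the target sits at rank 4 in the full corpus but at rank 2 within its subset, so the subset setting is easier at small k. Dataset-level scores are the mean of these per-query values.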

FashionIQ Wu et al. (2021) is another CIR task focusing on fashion products. We conduct zero-shot evaluations on its validation set, which includes 6,016 queries and 15,536 images. FashionIQ comprises three sub-tasks: dress, shirt, and toptee. We evaluate each sub-task separately and report their average recall values.

GeneCIS Vaze et al. (2023) is a benchmark for conditional image similarity measurement, comprising four sub-tasks about changing or focusing the attribute or object in the given image. In each sub-task, models need to retrieve the most similar images from a dedicated small subset for the given query image and the condition keyword. We approach it as a CIR task by combining the query image and the text description of the sub-task derived from the condition keyword as a composed image-text query. Each query’s candidate subset averages 13.8 images, and the mean Rs across all four subsets is reported.

Appendix D Full results on CIR Benchmarks

We report the full results on four CIR benchmarks Baldrati et al. (2023); Liu et al. (2021); Wu et al. (2021); Vaze et al. (2023) in Tables 6, 7, 8, and 9, respectively. Our MMRet model achieves state-of-the-art performance across various model sizes on the CIRCO, CIRR, and GeneCIS benchmarks.

Appendix E Full Results on MMEB Benchmark

We list the full results on the MMEB benchmark Jiang et al. (2024b) in Table 10. The MMEB benchmark consists of 36 datasets spanning four meta-task categories, including 20 in-distribution datasets and 16 out-of-distribution (OOD) datasets. The results on the OOD datasets are highlighted with a gray background in the table. Our MMRet model achieves state-of-the-art performance in both zero-shot and fine-tuning settings. Notably, MMRet surpasses the second-best performance on the OOD datasets, demonstrating its remarkable generalization capability.

Appendix F Visualized Examples of MegaPairs

We present several examples of MegaPairs in Figure 5. Each row corresponds to a single example, where the query item, comprising an image and its corresponding alt-text caption, is associated with multiple target images. These target images include both visually similar ones and semantically related ones beyond visual features.

For example, in the 4th row, the query image showing an ottoman with the alt-text caption Round ottoman, tufted surface is paired with target items that feature visually similar images (e.g., the 1st target image, which shows an ottoman, and the 3rd target image, which depicts a sofa with a similar style) as well as semantically related images that transcend visual features (e.g., the 2nd and 4th target images, depicting the interior of a car and a living room wall, respectively. These share few visual features with the query image but also exhibit a tufted surface). In the 5th row, the query image showing an F1 car with the alt-text caption AMG F1 W09 is paired with target items featuring visually similar images (e.g., the 1st target image, which shows an F1 car in red, and the 3rd target image, which displays a race scene with multiple F1 cars) as well as semantically related images that transcend visual features (e.g., the 2nd target image, which shows an F1 driver, and the 4th target image, depicting an F1 circuit. These images bear no visual similarity to the query image but share the F1 concept).

Appendix G Qualitative Results of MMRet on Zero-shot CIR Tasks

We present several top-5 retrieved images of our MMRet and the SOTA MagicLens Zhang et al. (2024) on zero-shot CIR tasks, as shown in Figure 6. Since only the CLIP-based checkpoint is available for MagicLens, we select the CLIP-L backbone for both methods. 1) For the blue ties query, MMRet accurately interprets the query and identifies both the specific attire and indoor setting, retrieving multiple images that meet the specified requirements. In contrast, MagicLens focuses solely on the individual object, overlooking the broader semantic context. 2) For the sweet, beverage, boats and sky query, MMRet demonstrates a solid understanding of real-world entity concepts, successfully integrating both foreground and background elements to retrieve the most relevant image. 3) The success on the bench top query highlights MMRet’s ability to comprehend specific pose and angle requirements. 4) The success on the darker ground and closer distance query illustrates MMRet’s capacity to recognize lighting conditions and shooting distance. 5) The success on the wheel in the air query indicates that MMRet can identify dynamic actions and contextual scene elements.

Table 6: Full results on the CIRCO benchmark Baldrati et al. (2023). † indicates methods with multiple components (e.g., GPT-3.5, Qwen1.5-32B); we report # parameters of components with known sizes. The CoCa-based MagicLens‡ models are proprietary. Results in bold denote the best performances for each model scale.

| Methods | Backbone | # Params | Index Set R@1 | Index Set R@5 | Index Set R@10 | Index Set R@50 | Subset Set R@1 | Subset Set R@2 | Subset Set R@3 |
|---|---|---|---|---|---|---|---|---|---|
| PALAVRA Cohen et al. (2022) | CLIP-B | 176M | 16.6 | 43.5 | 58.5 | 84.0 | 41.6 | 65.3 | 80.9 |
| PLI Chen and Lai (2023) | BLIP-B | 224M | 27.2 | 58.9 | 71.4 | 91.3 | 55.1 | 77.4 | 89.1 |
| SEARLE Baldrati et al. (2023) | CLIP-B | 165M | 24.0 | 53.4 | 66.8 | 89.8 | 54.9 | 76.6 | 88.2 |
| CIReVL Karthik et al. (2023) | CLIP-B | 12.3B† | 23.9 | 52.5 | 66.0 | 87.0 | 60.2 | 80.1 | 90.2 |
| LDRE Yang et al. (2024) | CLIP-B | 7.9B† | 25.7 | 55.1 | 69.0 | 89.9 | 60.5 | 80.7 | 90.7 |
| MagicLens-B Zhang et al. (2024) | CLIP-B | 166M | 27.0 | 58.0 | 70.9 | 91.1 | 66.7 | 83.9 | 92.4 |
| MagicLens-B‡ Zhang et al. (2024) | CoCa-B | 267M | 31.6 | 64.0 | 76.9 | 93.8 | 69.3 | 86.0 | 94.0 |
| MMRet-Base | CLIP-B | 149M | 36.1 | 68.1 | 79.5 | 94.7 | 71.6 | 87.2 | 94.0 |
| Pic2Word Saito et al. (2023) | CLIP-L | 429M | 23.9 | 51.7 | 65.3 | 87.8 | - | - | - |
| PLI Chen and Lai (2023) | CLIP-L | 428M | 25.5 | 54.6 | 67.6 | 88.7 | 55.6 | 77.5 | 89.5 |
| SEARLE Baldrati et al. (2023) | CLIP-L | 442M | 24.2 | 52.5 | 66.3 | 88.8 | 53.8 | 75.0 | 88.2 |
| CIReVL Karthik et al. (2023) | CLIP-L | 12.5B† | 24.6 | 52.3 | 64.9 | 86.3 | 59.5 | 79.9 | 89.7 |
| LinCIR Gu et al. (2024b) | CLIP-L | 442M | 25.0 | 53.3 | 66.7 | - | 57.1 | 77.4 | 88.9 |
| CompoDiff Gu et al. (2024a) | CLIP-L | 568M | 18.2 | 53.1 | 70.8 | 90.3 | 57.4 | 77.1 | 87.9 |
| MagicLens-L Zhang et al. (2024) | CLIP-L | 465M | 30.1 | 61.7 | 74.4 | 92.6 | 68.1 | 84.8 | 93.2 |
| MagicLens-L‡ Zhang et al. (2024) | CoCa-L | 613M | 33.3 | 67.0 | 77.9 | 94.4 | 70.9 | 87.3 | 94.5 |
| MMRet-Large | CLIP-L | 428M | 38.0 | 70.3 | 81.1 | 94.7 | 73.2 | 88.0 | 94.3 |
| Pic2Word Saito et al. (2023) | CLIP-H | 987M | 32.9 | 63.1 | 73.9 | - | 62.2 | 81.4 | 91.2 |
| SEARLE Baldrati et al. (2023) | CLIP-H | 1.0B | 34.0 | 64.0 | 75.3 | - | 64.6 | 83.2 | 92.8 |
| LinCIR Gu et al. (2024b) | CLIP-H | 1.0B | 33.8 | 63.5 | 73.4 | - | 62.4 | 81.5 | 92.1 |
| Pic2Word Saito et al. (2023) | CLIP-G | 2.5B | 30.4 | 58.1 | 69.2 | - | 68.9 | 85.5 | 93.0 |
| SEARLE Baldrati et al. (2023) | CLIP-G | 2.6B | 34.8 | 64.1 | 75.1 | - | 68.7 | 84.7 | 93.2 |
| CompoDiff Gu et al. (2024a) | CLIP-G | 2.9B | 26.7 | 55.1 | 74.5 | 92.0 | 64.5 | 82.4 | 91.8 |
| CIReVL Karthik et al. (2023) | CLIP-G | 14.6B† | 34.7 | 64.3 | 75.1 | 91.7 | 68.0 | 84.9 | 93.2 |
| LinCIR Gu et al. (2024b) | CLIP-G | 2.6B | 35.3 | 64.7 | 76.1 | - | 63.4 | 82.2 | 92.0 |
| LDRE Yang et al. (2024) | CLIP-G | 10.3B† | 36.2 | 66.4 | 77.3 | 94.0 | 68.8 | 85.7 | 93.8 |
| IP-CIR Li et al. (2024c) | CLIP-G | 43.8B† | 39.3 | 70.1 | 80.0 | 94.9 | 70.0 | 86.9 | 94.2 |
| E5-V Jiang et al. (2024a) | LLaVA-1.6 | 8.35B | 33.9 | 64.1 | 75.9 | 93.5 | - | - | - |
| MMRet-MLLM | LLaVA-1.6 | 7.57B | 46.7 | 76.0 | 85.1 | 96.5 | 75.4 | 89.6 | 95.7 |

Table 7: Full results on the CIRR benchmark Liu et al. (2021). † indicates methods with multiple components (e.g., GPT-3.5, Qwen1.5-32B); we report # parameters of components with known sizes. The CoCa-based MagicLens‡ models are proprietary. Results in bold denote the best performance for each model scale.

Table 8: Full results on the FashionIQ benchmark Wu et al. (2021). † indicates methods with multiple components (e.g., GPT-3.5, Qwen1.5-32B); we report # parameters of components with known sizes. The CoCa-based MagicLens‡ models are proprietary. Results in bold denote the best performance for each model scale.

| Methods | Backbone | # Params | Focus Attribute R@1 | Focus Attribute R@2 | Focus Attribute R@3 | Change Attribute R@1 | Change Attribute R@2 | Change Attribute R@3 | Focus Object R@1 | Focus Object R@2 | Focus Object R@3 | Change Object R@1 | Change Object R@2 | Change Object R@3 | Avg R@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIReVL Karthik et al. (2023) | CLIP-B | 12.3B† | 17.9 | 29.4 | 40.4 | 14.8 | 25.8 | 35.8 | 14.6 | 24.3 | 33.3 | 16.1 | 27.8 | 37.6 | 15.9 |
| MagicLens-B Zhang et al. (2024) | CLIP-B | 166M | 15.5 | 28.4 | 39.1 | 12.3 | 23.0 | 32.1 | 14.4 | 26.2 | 35.5 | 17.7 | 28.4 | 39.2 | 15.0 |
| MagicLens-B‡ Zhang et al. (2024) | CoCa-B | 267M | 16.2 | 27.8 | 38.6 | 16.2 | 27.2 | 36.6 | 17.1 | 27.7 | 38.2 | 20.2 | 32.2 | 42.9 | 17.4 |
| MMRet-Base | CLIP-B | 149M | 18.3 | 30.9 | 39.6 | 15.2 | 25.6 | 34.8 | 16.6 | 27.3 | 35.8 | 21.7 | 34.9 | 45.0 | 18.0 |
| Pic2Word Saito et al. (2023) | CLIP-L | 429M | 15.7 | 28.2 | 38.7 | 13.9 | 24.7 | 33.1 | 8.4 | 18.0 | 25.8 | 6.7 | 15.1 | 24.0 | 11.2 |
| SEARLE Baldrati et al. (2023) | CLIP-L | 442M | 17.0 | 29.7 | 40.7 | 16.4 | 25.3 | 34.1 | 8.0 | 16.9 | 25.6 | 7.9 | 16.8 | 24.8 | 12.3 |
| CIReVL Karthik et al. (2023) | CLIP-L | 12.5B† | 19.5 | 31.8 | 42.0 | 14.4 | 26.0 | 35.2 | 12.3 | 21.8 | 30.5 | 17.2 | 28.9 | 37.6 | 15.9 |
| LinCIR Gu et al. (2024b) | CLIP-L | 442M | 16.9 | 30.0 | 41.5 | 16.2 | 28.0 | 36.8 | 8.3 | 17.4 | 26.2 | 7.4 | 15.7 | 25.0 | 12.2 |
| CompoDiff Gu et al. (2024a) | CLIP-L | 568M | 13.5 | 24.3 | 36.1 | 19.2 | 28.6 | 37.2 | 8.1 | 16.4 | 25.1 | 18.7 | 31.7 | 40.6 | 14.9 |
| MagicLens-L Zhang et al. (2024) | CLIP-L | 465M | 16.1 | 28.2 | 39.0 | 15.6 | 27.5 | 36.3 | 16.3 | 26.2 | 35.5 | 17.1 | 29.5 | 39.7 | 16.3 |
| MagicLens-L‡ Zhang et al. (2024) | CoCa-L | 613M | 16.6 | 28.7 | 39.3 | 16.0 | 27.5 | 36.5 | 15.7 | 27.6 | 37.3 | 18.7 | 31.7 | 40.2 | 16.7 |
| MMRet-Large | CLIP-L | 428M | 18.4 | 30.0 | 38.5 | 15.4 | 27.6 | 35.7 | 17.4 | 26.6 | 36.3 | 21.0 | 34.0 | 42.4 | 18.1 |
| Pic2Word Saito et al. (2023) | CLIP-H | 987M | 18.6 | 30.7 | 42.1 | 13.2 | 23.9 | 33.1 | 9.2 | 17.6 | 27.1 | 6.6 | 16.5 | 25.4 | 11.9 |
| SEARLE Baldrati et al. (2023) | CLIP-H | 1.0B | 18.8 | 31.5 | 42.3 | 15.5 | 26.9 | 35.9 | 10.6 | 18.7 | 26.5 | 8.5 | 17.9 | 26.2 | 13.3 |
| LinCIR Gu et al. (2024b) | CLIP-H | 1.0B | 19.6 | 31.5 | 41.6 | 16.6 | 27.6 | 37.5 | 9.8 | 18.8 | 27.9 | 9.0 | 17.6 | 25.7 | 13.8 |
| Pic2Word Saito et al. (2023) | CLIP-G | 2.5B | 12.5 | 23.4 | 33.7 | 11.7 | 21.9 | 30.9 | 9.9 | 19.3 | 27.4 | 8.6 | 18.2 | 26.1 | 10.7 |
| SEARLE Baldrati et al. (2023) | CLIP-G | 2.6B | 16.3 | 29.4 | 40.7 | 16.2 | 27.3 | 35.5 | 10.8 | 18.2 | 27.9 | 8.3 | 15.6 | 25.8 | 12.9 |
| CompoDiff Gu et al. (2024a) | CLIP-G | 2.9B | 14.3 | 26.7 | 38.4 | 19.7 | 28.8 | 37.4 | 9.2 | 19.1 | 25.8 | 18.7 | 31.7 | 40.2 | 15.5 |
| CIReVL Karthik et al. (2023) | CLIP-G | 14.6B† | 20.5 | 34.0 | 44.5 | 16.1 | 28.6 | 39.4 | 14.7 | 25.2 | 33.0 | 18.1 | 31.2 | 41.0 | 17.4 |
| LinCIR Gu et al. (2024b) | CLIP-G | 2.6B | 19.1 | 33.0 | 42.3 | 17.6 | 30.2 | 38.1 | 10.1 | 19.1 | 28.1 | 7.9 | 16.3 | 25.7 | 13.7 |
| MMRet-MLLM | LLaVA-1.6 | 7.57B | 18.4 | 31.4 | 41.0 | 16.7 | 27.7 | 36.4 | 22.4 | 32.5 | 41.6 | 26.9 | 40.4 | 49.9 | 21.1 |

Table 9: Full results on the GeneCIS benchmark Vaze et al. (2023). † indicates methods with multiple components (e.g., GPT-3.5, Qwen1.5-32B); we report # parameters of components with known sizes. The CoCa-based MagicLens‡ models are proprietary. Results in bold denote the best performance for each model scale.

Table 10: The detailed results on the MMEB benchmark Jiang et al. (2024b). We report the performance of our MMRet under both zero-shot and fine-tuning settings.

Figure 5: The visualized examples of MegaPairs. Each row represents a single example, with the query item highlighted in a blue rectangle and the target items enclosed within a dashed box.

Figure 6: Top-5 retrieved images of MMRet and MagicLens on zero-shot CIR tasks, both using the CLIP-L backbone. Queries are shown with a blue background, and the most correct retrieved images are marked with green outlines.