VLDeformer: Vision–Language Decomposed Transformer for fast cross-modal retrieval

VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing

ArXiv, 2021

Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval. However, most of the existing VL transformers use early-interaction dataflow that computes a joint representation for the text-image input. In the retrieval stage, such models need to infer on all the matched text-image combinations, which causes high computing costs. The goal of this paper is to decompose the early-interaction dataflow inside the pre-trained VL transformer to achieve acceleration while maintaining its outstanding accuracy. To achieve this, we propose a novel Vision-language Transformer Decomposing (VLDeformer) to modify the VL transformer as an individual encoder for a single image or text through contrastive learning, which accelerates retrieval speed by thousands of times. Meanwhile, we propose to compose bimodal hard negatives for the contrastive learning objective, which enables the VLDeformer to maintain the outstanding accuracy of the backbone VL transformer...
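The cost argument in this abstract can be illustrated with a minimal sketch. It is not the VLDeformer implementation: the function names, the use of cosine similarity, and the toy two-dimensional embeddings are assumptions made for illustration only. It counts transformer forward passes for an early-interaction model (one per text-image pair) against a decomposed model (one per item), and ranks precomputed image embeddings for a text query.

```python
import math


def encoder_passes(num_texts, num_images, decomposed):
    """Count transformer forward passes needed to score every text-image pair.

    Early interaction encodes each pair jointly; a decomposed (individual)
    encoder embeds each text and each image once. Hypothetical helper.
    """
    if decomposed:
        return num_texts + num_images
    return num_texts * num_images


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_vec, image_vecs):
    """Return image indices sorted from most to least similar to the text."""
    scores = [(cosine(text_vec, v), i) for i, v in enumerate(image_vecs)]
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]


if __name__ == "__main__":
    print(encoder_passes(1000, 1000, decomposed=False))
    print(encoder_passes(1000, 1000, decomposed=True))
    print(rank_images([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```

With embeddings computed once and stored, a query reduces to cheap vector comparisons, which is the source of the speed-up the abstract describes.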

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Proceedings of the AAAI Conference on Artificial Intelligence

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models, such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for the cross-modal pre-training, where three pre-training tasks are employed, including Masked Language Modeling (MLM), Masked Object Classification (MOC) and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both two...

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

ArXiv, 2020

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the problem of accurate cross-media retrieval through image-sentence matching based on word-region alignments using supervision only at the global image-sentence level. In particular, we present an approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. The proposed approach obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Given our long-term interest in scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separat...

Self-Supervised Visual Representations for Cross-Modal Retrieval

Proceedings of the 2019 on International Conference on Multimedia Retrieval - ICMR '19, 2019

Cross-modal retrieval methods have been significantly improved in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places. However, collecting and annotating such datasets requires a tremendous amount of human effort and, besides, their annotations are usually limited to discrete sets of popular visual classes that may not be representative of the richer semantics found on large-scale cross-modal retrieval datasets. In this paper, we present a self-supervised cross-modal retrieval framework that leverages as training data the correlations between images and text on the entire set of Wikipedia articles. Our method consists in training a CNN to predict: (1) the semantic context of the article in which an image is more probable to appear as an illustration (global context), and (2) the semantic context of its caption (local context). Our experiments demonstrate that the proposed method is not only capable of learning discriminative visual representations for solving vision tasks like image classification and object detection, but that the learned representations are better for cross-modal retrieval when compared to supervised pre-training of the network on the ImageNet dataset.

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

ArXiv, 2020

This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is a challenging task since the features and representations of text and image are not comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking. To learn about the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that the images and tweets are not standardized the same as the benchmarks. Furthermore, there can be a higher semantic correlation between the pictures and tweets contrary to benchmarks in which the descriptions are well-organized. Experimental results on MS-COCO benchm...
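The hinge-based triplet ranking objective named in this abstract can be sketched as follows. This is a generic formulation, not the authors' code: the margin value of 0.2, the summation over all negatives, and the function name are assumptions, and the abstract does not say whether the loss is applied in both retrieval directions.

```python
def triplet_hinge_loss(pos_score, neg_scores, margin=0.2):
    """Hinge-based triplet ranking loss for one anchor.

    pos_score: similarity between the anchor and its matching item.
    neg_scores: similarities between the anchor and mismatched items.
    margin: assumed hyperparameter; the abstract does not give a value.

    Each negative contributes max(0, margin - pos_score + neg_score), so a
    negative is penalised only when it scores within `margin` of the positive.
    """
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)


if __name__ == "__main__":
    # The first negative is far below the positive and contributes nothing;
    # the second is within the margin and contributes a small penalty.
    print(triplet_hinge_loss(0.9, [0.1, 0.8]))
```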

Continual learning in cross-modal retrieval

2021

Multimodal representations and continual learning are two areas closely related to human intelligence. The former considers the learning of shared representation spaces where information from different modalities can be compared and integrated (we focus on cross-modal retrieval between language and visual representations). The latter studies how to prevent forgetting a previously learned task when learning a new one. While humans excel in these two aspects, deep neural networks are still quite limited. In this paper, we propose a combination of both problems into a continual cross-modal retrieval setting, where we study how the catastrophic interference caused by new tasks impacts the embedding spaces and their cross-modal alignment required for effective retrieval. We propose a general framework that decouples the training, indexing and querying stages. We also identify and study different factors that may lead to forgetting, and propose tools to alleviate it. We found that the indexing stage plays an important role and that simply avoiding reindexing the database with updated embedding networks can lead to significant gains. We evaluated our methods on two image-text retrieval datasets, obtaining significant gains with respect to the fine-tuning baseline.

MURAL: Multimodal, Multitask Retrieval Across Languages

ArXiv, 2021

Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., 2021), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL’s performance matches or exceeds ALIGN’s cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-BASE improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8...
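As a rough illustration of the recall metric mentioned in this abstract, the sketch below computes Recall@K and averages it over several cutoffs. The cutoffs (1, 5, 10), the single-gold-item assumption, and the function names are assumptions; the abstract does not define how its mean recall is aggregated.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold item appears in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mean_recall(rankings, golds, ks=(1, 5, 10)):
    """Average Recall@K over queries, then average over the cutoffs in `ks`.

    rankings: one ranked list of candidate ids per query.
    golds: the single correct id for each query (assumed one per query).
    """
    per_k = [
        sum(recall_at_k(r, g, k) for r, g in zip(rankings, golds)) / len(golds)
        for k in ks
    ]
    return sum(per_k) / len(per_k)


if __name__ == "__main__":
    rankings = [[3, 1, 2], [2, 3, 1]]
    golds = [1, 2]
    print(mean_recall(rankings, golds, ks=(1, 2)))
```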

Improvement of deep cross-modal retrieval by generating real-valued representation

PeerJ Computer Science

Cross-modal retrieval (CMR) has attracted much attention in the research community due to its flexible and comprehensive retrieval. The core challenge in CMR is the heterogeneity gap, which is generated due to different statistical properties of multi-modal data. The most common solution to bridge the heterogeneity gap is representation learning, which generates a common sub-space. In this work, we propose a framework called “Improvement of Deep Cross-Modal Retrieval (IDCMR)”, which generates real-valued representation. The IDCMR preserves both intra-modal and inter-modal similarity. The intra-modal similarity is preserved by selecting an appropriate training model for text and image modality. The inter-modal similarity is preserved by reducing modality-invariance loss. The mean average precision (mAP) is used as a performance measure in the CMR system. Extensive experiments are performed, and results show that IDCMR outperforms state-of-the-art methods by a margin of 4% and 2% re...
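The mean average precision (mAP) measure referred to in this abstract can be sketched as below. This uses one common convention, normalising by the number of relevant items found in the ranked list; the abstract does not state which variant or cutoff the authors use, so treat the details as assumptions.

```python
def average_precision(relevance):
    """Average precision for one query.

    relevance: list of 0/1 flags in ranked order (1 = relevant result).
    Precision is accumulated at each rank where a relevant item appears and
    divided by the number of relevant items found (assumed convention).
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Mean of per-query average precision over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


if __name__ == "__main__":
    print(average_precision([1, 0, 1]))
    print(mean_average_precision([[1, 0, 1], [0, 1]]))
```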

Joint-teaching: Learning to Refine Knowledge for Resource-constrained Unsupervised Cross-modal Retrieval

Proceedings of the 29th ACM International Conference on Multimedia, 2021

Cross-modal retrieval has received considerable attention owing to its applicability to enable users to search desired information with diversified forms. Existing retrieval methods retain good performance mainly relying on complex deep neural networks and high-quality supervision signals, which deters them from real-world resource-constrained development and deployment. In this paper, we propose an effective unsupervised learning framework named JOint-teachinG (JOG) to pursue a high-performance yet lightweight cross-modal retrieval model. The key idea is to utilize the knowledge of a pre-trained model (a.k.a. the "teacher") to endow the to-be-learned model (a.k.a. the "student") with strong feature learning ability and predictive power. Considering that a teacher model serving the same task as the student is not always available, we resort to a cross-task teacher to leverage transferable knowledge to guide student learning. To eliminate the inevitable noise in the distilled knowledge resulting from the task discrepancy, an online knowledge-refinement strategy is designed to progressively improve the quality of the cross-task knowledge in a joint-teaching manner, where a peer student is engaged. In addition, the proposed JOG learns to represent the original high-dimensional data with compact binary codes to accelerate the query processing, further facilitating resource-limited retrieval. Through extensive experiments, we demonstrate that in various network structures, the proposed method can yield promising learning results on widely used benchmarks. The proposed research is a pioneering work for resource-constrained cross-modal retrieval, which has strong potential to be applied to on-device deployment and is hoped to pave the way for further study.
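Why compact binary codes speed up query processing can be shown with a small sketch. It is not the JOG method: it assumes the codes are already learned and stored as Python integers, and the function names are hypothetical. Comparing two codes then reduces to an XOR and a bit count.

```python
def hamming(code_a, code_b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code, db_codes):
    """Return database indices ordered by increasing Hamming distance."""
    return sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))


if __name__ == "__main__":
    # Toy 4-bit codes; real systems typically use longer codes.
    database = [0b0100, 0b1011, 0b0011]
    print(rank_by_hamming(0b1011, database))
```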

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

arXiv (Cornell University), 2023

Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each document depicts or describes a single object, or on scene-centric datasets, meaning that each image depicts or describes a complex scene that involves multiple objects and relations between them. We posit that a robust CMR model should generalize well across both dataset types. Despite recent advances in CMR, the reproducibility of the results and their generalizability across different dataset types has not been studied before. We address this gap and focus on the reproducibility of the state-of-the-art CMR results when evaluated on object-centric and scene-centric datasets. We select two state-of-the-art CMR models with different architectures: (i) CLIP; and (ii) X-VLM. Additionally, we select two scene-centric datasets, and three object-centric datasets, and determine the relative performance of the selected models on these datasets. We focus on reproducibility, replicability, and generalizability of the outcomes of previously published CMR experiments. We discover that the experiments are not fully reproducible and replicable. Besides, the relative performance results partially generalize across object-centric and scene-centric datasets. On top of that, the scores obtained on object-centric datasets are much lower than the scores obtained on scene-centric datasets. For reproducibility and transparency we make our source code and the trained models publicly available.

