Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

ArXiv, 2020

This paper considers the task of matching images and sentences by learning a visual-textual embedding space for cross-modal retrieval. Finding such a space is challenging because the features and representations of text and images are not directly comparable. In this work, we introduce an end-to-end deep multimodal convolutional-recurrent network that learns vision and language representations simultaneously to infer image-text similarity. The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss. To learn the joint representations, we leverage our newly extracted collection of tweets from Twitter. The main characteristic of our dataset is that its images and tweets are not standardized in the way benchmark data are. Furthermore, there can be a higher semantic correlation between the pictures and tweets, in contrast to benchmarks in which the descriptions are well organized. Experimental results on MS-COCO benchm...
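
As an illustration of the hinge-based triplet ranking objective mentioned above, the following is a minimal NumPy sketch; the margin value, the use of cosine similarity, and the use of in-batch negatives are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity matrix between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking loss over a batch of matched (image, text) pairs.

    img_emb, txt_emb: (batch, dim) arrays; row i of each forms a positive pair.
    All other rows in the batch serve as negatives.
    """
    sim = cosine_sim(img_emb, txt_emb)            # (batch, batch)
    pos = np.diag(sim)                            # similarity of matched pairs
    # image -> text direction: every column j != i is a negative caption
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # text -> image direction: every row i != j is a negative image
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    # zero out the diagonal (the positives themselves)
    np.fill_diagonal(cost_i2t, 0.0)
    np.fill_diagonal(cost_t2i, 0.0)
    return cost_i2t.sum() + cost_t2i.sum()

# toy usage with random embeddings
rng = np.random.default_rng(0)
imgs, txts = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
print(triplet_ranking_loss(imgs, txts))
```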

Multimodal indexing based on semantic cohesion for image retrieval

Information Retrieval, 2012

This paper introduces two novel strategies for representing multimodal images, with application to multimedia image retrieval. We consider images that are accompanied by both text and labels: while the text describes the image content at a very high semantic level (e.g., making reference to places, dates, or events), the labels provide a mid-level description of the image (i.e., in terms of the objects that can be seen in it). Accordingly, the main assumption of this work is that by combining information from text and labels we can develop very effective retrieval methods. We study standard information-fusion techniques for combining both sources of information. However, although the performance of such techniques is highly competitive, they cannot capture the content of images effectively. Therefore, we propose two novel representations for multimodal images that attempt to exploit the semantic cohesion among terms from different modalities. These representations are based on the distributional term representations widely used in computational linguistics. Under the considered representations, the content of an image is modeled by a distribution of co-occurrences over terms, or of occurrences over other images, in such a way that the representation can be considered an expansion of the multimodal terms in the image. We report experimental results on the SAIAPR TC12 benchmark, using two sets of topics from ImageCLEF competitions, with both manually and automatically generated labels. The results show that the proposed representations significantly outperform both standard multimodal techniques and unimodal methods. Results with manually assigned labels provide an upper bound on the attainable retrieval performance, whereas results with automatically generated labels are encouraging. The novel representations capture the content of multimodal images more effectively. We emphasize that although we have applied our representations to multimedia image retrieval, the same formulation can be adopted for modeling other multimodal documents (e.g., videos).
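
The distributional, co-occurrence-based expansion described above can be sketched in a few lines. This is a simplified illustration under assumed choices (a shared vocabulary of textual and label terms, raw co-occurrence counts, L2 normalization), not the paper's exact weighting scheme.

```python
import numpy as np

def cooccurrence_matrix(docs, vocab):
    """Term-term co-occurrence counts over multimodal documents.

    docs: list of sets of terms, where each set mixes free-text words and
          object labels (both treated simply as 'multimodal terms').
    """
    idx = {t: i for i, t in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for terms in docs:
        ids = [idx[t] for t in terms if t in idx]
        for a in ids:
            for b in ids:
                if a != b:
                    C[a, b] += 1.0
    return C

def expand_document(terms, vocab, C):
    """Distributional representation: the document is the (normalized) sum of
    the co-occurrence profiles of its terms, i.e. an expansion of its
    multimodal terms."""
    idx = {t: i for i, t in enumerate(vocab)}
    vec = np.zeros(len(vocab))
    for t in terms:
        if t in idx:
            vec += C[idx[t]]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# toy usage: text words and visual labels share one vocabulary
vocab = ["beach", "sunset", "sky", "person", "sand"]
docs = [{"beach", "sunset", "sky"}, {"beach", "sand", "person"}, {"sunset", "sky"}]
C = cooccurrence_matrix(docs, vocab)
print(expand_document({"beach"}, vocab, C))
```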

Deep Multi-Modal Metric Learning with Multi-Scale Correlation for Image-Text Retrieval

Electronics, 2020

Multi-modal retrieval is challenging due to the heterogeneity gap and the complex semantic relationships between data of different modalities. Typical approaches map different modalities into a common subspace with a one-to-one correspondence or a similarity/dissimilarity relationship between inter-modal data, in which the distances of heterogeneous data can be compared directly; thus, inter-modal retrieval can be achieved by nearest-neighbor search. However, most of them ignore intra-modal relations and the complicated semantics between multi-modal data. In this paper, we propose a deep multi-modal metric learning method with multi-scale semantic correlation to deal with retrieval tasks between the image and text modalities. A deep model with two branches is designed to nonlinearly map raw heterogeneous data into comparable representations. In contrast to binary similarity, we formulate the semantic relationship as a multi-scale similarity to learn fine-grained multi-modal distances. Inter-modal and intra-...
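
A multi-scale (graded) similarity of the kind contrasted here with binary similarity can, for example, be derived from label overlap. The sketch below uses Jaccard overlap of multi-hot label vectors as an assumed instantiation; the paper's precise definition may differ.

```python
import numpy as np

def multiscale_similarity(labels_a, labels_b):
    """Graded (multi-scale) similarity between two sets of multi-hot label vectors.

    Instead of a binary match/mismatch, similarity is the Jaccard overlap of the
    label sets, giving values in [0, 1] that reflect partial semantic overlap.
    labels_a: (n, L), labels_b: (m, L) binary arrays.
    """
    inter = labels_a @ labels_b.T                                        # shared labels
    union = labels_a.sum(1)[:, None] + labels_b.sum(1)[None, :] - inter  # all labels
    return np.where(union > 0, inter / np.maximum(union, 1), 0.0)

# toy usage: 3 images vs. 2 texts annotated with 4 categories
img_labels = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]])
txt_labels = np.array([[1, 0, 0, 0], [0, 1, 1, 1]])
print(multiscale_similarity(img_labels, txt_labels))
```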

Contrastive Label Correlation Enhanced Unified Hashing Encoder for Cross-modal Retrieval

Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022

Cross-modal hashing (CMH) has been widely used in multimedia retrieval applications for its low storage cost and fast indexing speed. Thanks to the success of deep learning, cross-modal hashing has made significant progress with high-quality deep features. However, the modality gap is still a crucial bottleneck for existing cross-modal hashing methods: the commonly used convolutional neural network and bag-of-words encoders are customized for single-modality priors, which limits the models' ability to learn semantic representations in a cross-modal space. To overcome modality heterogeneity, we propose a shared transformer encoder (UniHash) to unify cross-modal hashing into the same semantic space. A contrastive label correlation learning (CLC) loss, which uses the category labels as a bridge between modalities, is designed to improve representation quality. Moreover, we take advantage of the multi-hot label space and propose a negative label generation (NegLG) strategy to obtain richer and more uniformly distributed negative labels for contrast. Extensive experiments on three benchmarks verify the advantage of the proposed method. Furthermore, UniHash significantly outperforms state-of-the-art cross-modal hashing methods, establishing an important new baseline for cross-modal hashing research. Code is released at github.com/idealwhite/Unihash.
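
As a rough illustration of a label-bridged contrastive term, the sketch below computes an InfoNCE-style loss between an anchor representation and explicit positive/negative label embeddings. The temperature, the cosine similarity, and the way negatives are supplied are assumptions; this is not the authors' exact CLC/NegLG implementation.

```python
import numpy as np

def label_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss with explicit positive and negative vectors.

    anchor:    (dim,)   a hash-code / feature vector from either modality
    positives: (p, dim) e.g. embeddings of the anchor's category labels
    negatives: (n, dim) e.g. embeddings of generated negative labels
    """
    def cos(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)

    pos = np.exp(cos(anchor, positives) / temperature)
    neg = np.exp(cos(anchor, negatives) / temperature)
    # average the per-positive InfoNCE terms
    return float(np.mean(-np.log(pos / (pos + neg.sum()))))

# toy usage: one anchor, two positive label embeddings, three negative ones
rng = np.random.default_rng(1)
anchor = rng.normal(size=32)
print(label_contrastive_loss(anchor, rng.normal(size=(2, 32)), rng.normal(size=(3, 32))))
```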

Multimodal Information Retrieval: Challenges and Future Trends

International Journal of Computer Applications, 2013

Multimodal information retrieval is a research problem of great interest in all domains, due to the huge collections of multimedia data available in different forms such as text, image, audio, and video. Researchers are trying to build multimodal information retrieval systems using machine learning techniques such as support vector machines and neural networks, as well as insights from neuroscience, to provide efficient retrieval that fulfills user needs. This paper gives an overview of multimodal information retrieval and of the challenges in its progress.

Text-to-image retrieval based on incremental association via multimodal hypernetworks

2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2012

Text-to-image retrieval retrieves the images associated with a textual query. A text-to-image retrieval model requires an incremental learning method for practical use, since multimodal data grow dramatically. Here we propose an incremental text-to-image retrieval method using a multimodal association model. The association model is based on a hypernetwork (HN), where a vertex corresponds to a textual word or a visual patch and a hyperedge represents a higher-order multimodal association. Using an HN incrementally learned by sequential Bayesian sampling, the multimodal hypernetwork-based text-to-image retrieval cross-modally expands a given text query into a visual query and then retrieves the images most similar to the expanded visual query. We evaluated the proposed method using 3,000 images with textual descriptions from Flickr.com. The experimental results show that the proposed method achieves very competitive retrieval performance compared to a baseline method. Moreover, we demonstrate that our method provides robust text-to-image retrieval results as the data grow.
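
The cross-modal query-expansion step can be illustrated without the hypernetwork machinery: given a learned word-to-visual-term association matrix, a text query is expanded into a visual query and images are ranked against it. The association matrix, histogram representation, and cosine ranking below are assumptions for a toy sketch, not the paper's sequential Bayesian sampling procedure.

```python
import numpy as np

def expand_text_query(query_words, assoc, word_idx, n_visual):
    """Cross-modal query expansion: map a text query to a visual-term vector
    using a word-to-visual-patch association matrix (rows: words, cols: patches)."""
    v = np.zeros(n_visual)
    for w in query_words:
        if w in word_idx:
            v += assoc[word_idx[w]]
    return v

def retrieve(visual_query, image_hists, k=3):
    """Rank images by cosine similarity between the expanded visual query and
    each image's histogram over visual terms."""
    q = visual_query / (np.linalg.norm(visual_query) + 1e-12)
    H = image_hists / (np.linalg.norm(image_hists, axis=1, keepdims=True) + 1e-12)
    scores = H @ q
    return np.argsort(-scores)[:k], scores

# toy usage: 4 words, 5 visual terms, 6 images
rng = np.random.default_rng(2)
assoc = rng.random((4, 5))
word_idx = {"dog": 0, "beach": 1, "sunset": 2, "car": 3}
image_hists = rng.random((6, 5))
vq = expand_text_query(["dog", "beach"], assoc, word_idx, n_visual=5)
print(retrieve(vq, image_hists))
```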

Multimodal Information Spaces for Content-based Image Retrieval

Currently, image retrieval by content is a research problem of great interest in academia and industry, due to the large collections of images available in different contexts. One of the main challenges in developing effective image retrieval systems is the automatic identification of semantic image content. This research proposal aims to design a model for image retrieval that can take advantage of different data sources, i.e., multimodal information, to improve the response of an image retrieval system. In particular, two data modalities associated with the contents and context of images are considered in this proposal: visual features and unstructured text annotations. The proposed framework is based on kernel methods, which provide two important advantages over traditional multimodal approaches: first, the structure of each modality is preserved in a high-dimensional feature space, and second, they provide natural ways to fuse the feature spaces into a single information space. This document presents the research agenda for building a Multimodal Information Space for searching images by content.
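
A minimal sketch of the kernel-based fusion idea, assuming RBF kernels per modality and a convex combination as the fusion rule (the proposal itself may use different kernels and fusion operators):

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """RBF kernel matrix for one modality's feature vectors (rows of X)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fused_kernel(K_visual, K_text, alpha=0.5):
    """Multimodal kernel as a convex combination of per-modality kernels.
    A sum of valid kernels is itself a valid kernel, so the fused space can be
    used by any kernel method (SVM, kernel PCA, ...) while each modality's
    structure is preserved in its own feature space."""
    return alpha * K_visual + (1.0 - alpha) * K_text

# toy usage: 5 images with visual features and bag-of-words text features
rng = np.random.default_rng(3)
K_v = rbf_kernel(rng.normal(size=(5, 16)))
K_t = rbf_kernel(rng.normal(size=(5, 50)), gamma=0.1)
print(fused_kernel(K_v, K_t).shape)
```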

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

arXiv (Cornell University), 2023

Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each document depicts or describes a single object, or on scene-centric datasets, meaning that each image depicts or describes a complex scene that involves multiple objects and relations between them. We posit that a robust CMR model should generalize well across both dataset types. Despite recent advances in CMR, the reproducibility of the results and their generalizability across different dataset types have not been studied before. We address this gap and focus on the reproducibility of state-of-the-art CMR results when evaluated on object-centric and scene-centric datasets. We select two state-of-the-art CMR models with different architectures: (i) CLIP; and (ii) X-VLM. Additionally, we select two scene-centric datasets and three object-centric datasets, and determine the relative performance of the selected models on these datasets. We focus on the reproducibility, replicability, and generalizability of the outcomes of previously published CMR experiments. We find that the experiments are not fully reproducible and replicable. In addition, the relative performance results only partially generalize across object-centric and scene-centric datasets. Furthermore, the scores obtained on object-centric datasets are much lower than the scores obtained on scene-centric datasets. For reproducibility and transparency, we make our source code and the trained models publicly available.

Supervised Intra- and Inter-Modality Similarity Preserving Hashing for Cross-Modal Retrieval

IEEE Access

Cross-modal hashing has drawn considerable interest in multimodal retrieval due to the explosive growth of multimedia big data. However, existing methods mainly focus on learning unified hash codes and investigate the local geometric structure in the original space, resulting in hash codes with low discriminative power for out-of-sample instances. To address this problem, this paper investigates hashing-function learning that preserves modality correlations in the expected low-dimensional common space. A cross-modal hashing method based on supervised collective matrix factorization is proposed, taking both intra-modality and inter-modality similarity preservation into account. For more flexible hashing functions, label information is embedded into the hashing-function learning procedure. Specifically, we explore intra-modality similarity preservation in the expected low-dimensional common space. In addition, a supervised shrinking scheme is used to enhance the local geometric consistency in each modality. The proposed method learns unified hash codes as well as hashing functions for the different modalities; the overall objective function, consisting of collective matrix factorization and intra- and inter-modality similarity embedding, is solved using an alternating optimization in an iterative scheme. Extensive experiments on three benchmark datasets demonstrate that the proposed method is more flexible toward newly arriving data and achieves performance superior to state-of-the-art supervised cross-modal hashing approaches in most cases.
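
For intuition, the sketch below shows a plain collective matrix factorization for hashing: both modalities' feature matrices are factorized onto one shared latent factor, which is then binarized, using alternating least-squares updates. The regularization, update rules, and sign binarization are assumptions; the paper's objective additionally embeds label information and intra-/inter-modality similarity-preserving terms.

```python
import numpy as np

def cmf_hashing(X1, X2, k=16, n_iter=50, lam=1e-2, seed=0):
    """Collective matrix factorization for cross-modal hashing (simplified).

    Both modality feature matrices X1 (d1 x n) and X2 (d2 x n) are factorized
    onto one shared latent factor V (k x n):
        min ||X1 - U1 V||^2 + ||X2 - U2 V||^2 + lam * regularization
    Unified k-bit hash codes are obtained by binarizing V.
    """
    rng = np.random.default_rng(seed)
    n = X1.shape[1]
    V = rng.normal(size=(k, n))
    I = lam * np.eye(k)
    for _ in range(n_iter):
        # update per-modality bases with ridge-regularized least squares
        U1 = X1 @ V.T @ np.linalg.inv(V @ V.T + I)
        U2 = X2 @ V.T @ np.linalg.inv(V @ V.T + I)
        # update the shared latent factor using both modalities
        V = np.linalg.inv(U1.T @ U1 + U2.T @ U2 + I) @ (U1.T @ X1 + U2.T @ X2)
    codes = np.sign(V)           # unified hash codes, one column per sample
    return codes, (U1, U2)

# toy usage: 100 samples with 64-d visual and 30-d text features
rng = np.random.default_rng(4)
X_img, X_txt = rng.normal(size=(64, 100)), rng.normal(size=(30, 100))
codes, _ = cmf_hashing(X_img, X_txt, k=16)
print(codes.shape)
```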

A new approach to cross-modal multimedia retrieval

Proceedings of the international conference on Multimedia - MM '10, 2010

The problem of jointly modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: 1) that there is a benefit to explicitly modeling correlations between the two components, and 2) that this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.
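
The correlation-matching step can be sketched with scikit-learn's CCA. Random features stand in for the LDA topic distributions and SIFT bag-of-words histograms used in the paper, and cosine matching in the shared subspace is an assumed retrieval rule.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# stand-in features: in the paper these would be LDA topic distributions for
# text and bag-of-visual-words (SIFT) histograms for images
rng = np.random.default_rng(5)
n_train, n_test = 200, 5
txt_train, img_train = rng.random((n_train, 32)), rng.random((n_train, 128))
txt_test,  img_test  = rng.random((n_test, 32)),  rng.random((n_test, 128))

# learn a shared correlated subspace from paired (text, image) training documents
cca = CCA(n_components=10, max_iter=1000)
cca.fit(txt_train, img_train)

# project both modalities into the shared space and match by cosine similarity
txt_c, img_c = cca.transform(txt_test, img_test)
txt_c /= np.linalg.norm(txt_c, axis=1, keepdims=True)
img_c /= np.linalg.norm(img_c, axis=1, keepdims=True)
sim = txt_c @ img_c.T            # rows: text queries, columns: candidate images
print("top image for each text query:", sim.argmax(axis=1))
```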