Contrastive Label Correlation Enhanced Unified Hashing Encoder for Cross-modal Retrieval

Cross-Modal Retrieval Using Deep De-correlated Subspace Ranking Hashing

2018

Cross-modal hashing has become a popular research topic in recent years due to the efficiency of storing and retrieving high-dimensional multimodal data represented by compact binary codes. While most cross-modal hash functions use binary space partitioning functions (e.g., the sign function), our method uses ranking-based hashing, which is based on numerically stable and scale-invariant rank correlation measures. In this paper, we propose a novel deep learning architecture called Deep De-correlated Subspace Ranking Hashing (DDSRH) that uses feature-ranking methods to determine the hash codes for the image and text modalities in a common Hamming space. Specifically, DDSRH learns a set of de-correlated nonlinear subspaces onto which to project the original features, so that the hash code can be determined by the relative ordering of projected feature values in a given optimized subspace. The network relies upon a pre-trained deep feature learning network for each modality, and a hashing...
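To make the ranking idea concrete, here is a minimal sketch of rank-based subspace hashing (not the authors' exact DDSRH network, which learns the subspaces): features are projected onto several random subspaces, and each sub-code is the index of the largest projected value, so the code depends only on the relative ordering of the projections and is invariant to positive scaling.

```python
# Minimal sketch of ranking-based subspace hashing with random (not learned)
# subspaces; sizes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def rank_hash(X, projections):
    """X: (n, d) features; projections: list of (d, k) subspace bases.
    Returns an (n, len(projections)) integer code per sample."""
    codes = [np.argmax(X @ P, axis=1) for P in projections]
    return np.stack(codes, axis=1)

d, n_subspaces, k = 128, 16, 4          # k values per subspace -> log2(k) bits each
projections = [rng.standard_normal((d, k)) for _ in range(n_subspaces)]

X = rng.standard_normal((100, d))
codes = rank_hash(X, projections)
# Scale invariance: multiplying features by a positive constant keeps the code.
assert np.array_equal(codes, rank_hash(5.0 * X, projections))
```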

Semantic Correlation Based Deep Cross-Modal Hashing For Faster Retrieval

International Journal of Innovative Technology and Exploring Engineering, 2019

With the growth of multi-modal applications, large amounts of data are being generated. Nearest Neighbor (NN) search is used to retrieve information, but it suffers with high-dimensional data. Approximate Nearest Neighbor (ANN) search is therefore widely used by researchers, with data represented as binary codes via semantic hashing. Such a representation reduces storage cost and retrieval time. In addition, deep learning has shown good performance in information retrieval and efficiently handles the scalability problem. Multi-modal data have different statistical properties, so a method is needed that finds the semantic correlation between modalities. In this paper, experiments are performed using the correlation methods CCA, KCCA, and DCCA on the NMIST dataset, a multi-view dataset; the results show that DCCA outperforms CCA and KCCA by learning representations with higher correlations. However, due to flexible requirements of...
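For reference, this is how the correlation comparison is typically set up with classical CCA; DCCA replaces the linear projections below with deep networks. The synthetic two-view data here is a hypothetical stand-in, not the dataset from the paper.

```python
# Toy illustration of measuring cross-view correlation with classical CCA.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 1000
shared = rng.standard_normal((n, 5))                 # latent signal shared by both views
view1 = shared @ rng.standard_normal((5, 20)) + 0.5 * rng.standard_normal((n, 20))
view2 = shared @ rng.standard_normal((5, 30)) + 0.5 * rng.standard_normal((n, 30))

cca = CCA(n_components=5)
z1, z2 = cca.fit_transform(view1, view2)

# Correlation of each pair of canonical components: higher values mean the
# method found more strongly correlated representations of the two views.
corrs = [np.corrcoef(z1[:, i], z2[:, i])[0, 1] for i in range(5)]
print(np.round(corrs, 3))
```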

Supervised Intra- and Inter-Modality Similarity Preserving Hashing for Cross-Modal Retrieval

IEEE Access

Cross-modal hashing has drawn considerable interest in multimodal retrieval due to the explosive growth of big multimedia data. However, existing methods mainly focus on learning unified hash codes and investigate the local geometric structure only in the original space, resulting in hash codes with low discriminative power for out-of-sample instances. To address this problem, this paper investigates hashing function learning by preserving the modality correlation in the expected low-dimensional common space. A cross-modal hashing method based on supervised collective matrix factorization is proposed, taking both intra-modality and inter-modality similarity preservation into account. For more flexible hashing functions, label information is embedded into the hashing function learning procedure. Specifically, we preserve the intra-modality similarity in the expected low-dimensional common space, and a supervised shrinking scheme is used to enhance the local geometric consistency in each modality. The proposed method learns unified hash codes as well as hashing functions for different modalities; the overall objective function, consisting of collective matrix factorization and intra- and inter-modality similarity embedding, is solved using alternating optimization in an iterative scheme. Extensive experiments on three benchmark datasets demonstrate that the proposed method handles newly arriving data more flexibly and achieves superior performance to state-of-the-art supervised cross-modal hashing approaches in most cases.
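A bare-bones sketch of the collective matrix factorization core (without the supervised similarity-preserving and shrinking terms described above): both modalities share one latent factor V, updated by alternating least squares, and the unified binary codes are its signs.

```python
# Sketch of collective matrix factorization for unified hash codes; toy data
# and sizes, not the paper's exact objective.
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, k = 500, 100, 80, 16          # samples, feature dims, code length

X1 = rng.standard_normal((n, d1))        # image features (toy data)
X2 = rng.standard_normal((n, d2))        # text features  (toy data)

V = rng.standard_normal((n, k))
lam = 1e-2                               # ridge term for numerical stability
for _ in range(20):                      # alternating least squares
    W1 = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ X1)
    W2 = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ X2)
    G = np.concatenate([W1, W2], axis=1)             # update V for both fits jointly
    X = np.concatenate([X1, X2], axis=1)
    V = np.linalg.solve(G @ G.T + lam * np.eye(k), G @ X.T).T

B = np.sign(V)                           # unified binary codes (+1 / -1)
print(B.shape)
```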

Unsupervised Deep Cross-modality Spectral Hashing

IEEE Transactions on Image Processing, 2020

This paper presents a novel framework, namely Deep Cross-modality Spectral Hashing (DCSH), to tackle the unsupervised learning problem of binary hash codes for efficient cross-modal retrieval. The framework is a two-step hashing approach that decouples the optimization into (1) binary optimization and (2) hashing function learning. In the first step, we propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations. While the former is capable of preserving the local structure of each modality well, the latter reveals the hidden patterns from all modalities. In the second step, to learn mapping functions from informative data inputs (images and word embeddings) to the binary codes obtained in the first step, we leverage powerful CNNs for images and propose a CNN-based deep architecture for the text modality. Quantitative evaluations on three standard benchmark datasets demonstrate that the proposed DCSH method consistently outperforms other state-of-the-art methods.
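An illustrative version of the first ("binary optimization") step in the spirit of two-step spectral hashing: build an affinity over paired samples, take the bottom eigenvectors of the normalized Laplacian, and threshold them into binary codes; a network would then be trained in step two to regress these codes. This toy sketch simplifies the actual DCSH algorithm considerably.

```python
# Spectral-embedding binary codes on toy data; a stand-in for DCSH's step one.
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 8                                  # samples, code length

feats = rng.standard_normal((n, 32))           # stand-in fused features
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())                    # Gaussian affinity
D = W.sum(1)
L = np.eye(n) - W / np.sqrt(D[:, None] * D[None])   # normalized Laplacian

vals, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
Y = vecs[:, 1:k + 1]                           # skip the trivial eigenvector
B = np.sign(Y - Y.mean(0))                     # threshold to +/-1 codes
print(B.shape)
```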

Multimodal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing

IEEE Transactions on Neural Networks and Learning Systems, 2022

In this paper, we adopt a mutual information (MI) maximization approach to tackle the problem of unsupervised learning of binary hash codes for efficient cross-modal retrieval. We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH). First, to learn informative representations that can preserve both intra- and inter-modal similarities, we leverage recent advances in estimating variational lower bounds of MI to maximize the MI between the binary representations and the input features, as well as between the binary representations of different modalities. By jointly maximizing these MIs under the assumption that the binary representations follow multivariate Bernoulli distributions, we can learn binary representations that preserve both intra- and inter-modal similarities effectively, in a mini-batch manner with gradient descent. Furthermore, we find that trying to minimize the modality gap by learning similar binary representations for the same instance from different modalities could result in less informative representations. Hence, balancing between reducing the modality gap and losing modality-private information is important for cross-modal retrieval tasks. Quantitative evaluations on standard benchmark datasets demonstrate that the proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
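A sketch of the inter-modal part of this idea using an InfoNCE-style variational lower bound on MI between the two modalities' relaxed binary representations. Here tanh stands in for the Bernoulli sampling model, and all names and sizes are illustrative, not the paper's exact architecture or estimator.

```python
# InfoNCE lower bound on MI between relaxed binary codes of two modalities.
import torch
import torch.nn.functional as F

def info_nce(z_img, z_txt, temperature=0.1):
    """Minimizing this loss maximizes a lower bound on MI across modalities:
    matched image/text pairs (the diagonal) should score higher than all
    mismatched pairs in the batch."""
    z_img = F.normalize(z_img, dim=1)
    z_txt = F.normalize(z_txt, dim=1)
    logits = z_img @ z_txt.t() / temperature
    labels = torch.arange(z_img.size(0))
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
img_feat, txt_feat = torch.randn(64, 512), torch.randn(64, 300)
img_head = torch.nn.Linear(512, 32)           # 32-bit codes
txt_head = torch.nn.Linear(300, 32)

b_img = torch.tanh(img_head(img_feat))        # relaxed binary codes in (-1, 1)
b_txt = torch.tanh(txt_head(txt_feat))
loss = info_nce(b_img, b_txt)                 # backpropagate as usual
print(loss.item())
```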

Efficient cross-modal retrieval via flexible supervised collective matrix factorization hashing

Multimedia Tools and Applications, 2018

Cross-modal retrieval has recently drawn much attention in multimedia analysis, yet it remains challenging, mainly due to its heterogeneous nature. In this paper, we propose flexible supervised collective matrix factorization hashing (FS-CMFH) for efficient cross-modal retrieval. First, we exploit a flexible collective matrix factorization framework to jointly learn an individual latent space of similar semantics for each modality. Meanwhile, label consistency across different modalities is simultaneously exploited to preserve both intra-modal and inter-modal semantics within these similar latent semantic spaces. Accordingly, these two ingredients are formulated as a joint graph regularization term in an overall objective function, through which similar hash codes for the different modalities of an instance can be discriminatively obtained to flexibly characterize that instance. As a result, the derived hash codes possess higher discriminative power and improve cross-modal search accuracy significantly. Extensive experiments on three popular benchmark datasets show that the proposed approach performs favorably against state-of-the-art competing approaches.
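The joint graph regularization mentioned above penalizes latent codes that differ for semantically similar instances. A common generic form (not the exact FS-CMFH objective) is tr(VᵀLV), with L the graph Laplacian of a label-consistency similarity matrix, as sketched below.

```python
# Graph regularization term tr(V^T L V) on a label-consistency graph.
import numpy as np

rng = np.random.default_rng(0)
n, k, c = 200, 16, 5
labels = rng.integers(0, c, size=n)

S = (labels[:, None] == labels[None]).astype(float)   # same label -> similar
L = np.diag(S.sum(1)) - S                             # graph Laplacian
V = rng.standard_normal((n, k))                       # latent codes

reg = np.trace(V.T @ L @ V)
# Equivalent pairwise form: 0.5 * sum_ij S_ij * ||v_i - v_j||^2,
# i.e. similar instances are pushed toward identical codes.
pairwise = 0.5 * (S * ((V[:, None] - V[None]) ** 2).sum(-1)).sum()
assert np.isclose(reg, pairwise)
print(reg)
```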

Improvement of deep cross-modal retrieval by generating real-valued representation

PeerJ Computer Science

Cross-modal retrieval (CMR) has attracted much attention in the research community due to its flexible and comprehensive retrieval capability. The core challenge in CMR is the heterogeneity gap, which arises from the different statistical properties of multi-modal data. The most common solution to bridging the heterogeneity gap is representation learning, which generates a common sub-space. In this work, we propose a framework called "Improvement of Deep Cross-Modal Retrieval (IDCMR)", which generates real-valued representations. IDCMR preserves both intra-modal and inter-modal similarity: intra-modal similarity is preserved by selecting an appropriate training model for the text and image modalities, while inter-modal similarity is preserved by minimizing a modality-invariance loss. Mean average precision (mAP) is used as the performance measure for the CMR system. Extensive experiments are performed, and the results show that IDCMR outperforms state-of-the-art methods by margins of 4% and 2% re...
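Since mAP is the quoted performance measure, here is a small reference implementation for retrieval evaluation: for each query, precision is averaged over the rank positions of its relevant items. The data and the dot-product ranking are toy stand-ins.

```python
# Mean average precision (mAP) for label-based retrieval relevance.
import numpy as np

def mean_average_precision(q_emb, db_emb, q_labels, db_labels):
    scores = q_emb @ db_emb.T                       # higher = more similar
    order = np.argsort(-scores, axis=1)             # ranked database indices
    aps = []
    for i in range(len(q_emb)):
        relevant = (db_labels[order[i]] == q_labels[i]).astype(float)
        if relevant.sum() == 0:
            continue                                # no relevant item for query
        cum_hits = np.cumsum(relevant)
        precision_at_k = cum_hits / np.arange(1, len(relevant) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))

rng = np.random.default_rng(0)
q, db = rng.standard_normal((20, 8)), rng.standard_normal((100, 8))
mAP = mean_average_precision(q, db, rng.integers(0, 4, 20), rng.integers(0, 4, 100))
print(round(mAP, 3))
```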

Transitive Hashing Network for Heterogeneous Multimedia Retrieval

2017

Hashing has been widely applied to large-scale multimedia retrieval due to its storage and retrieval efficiency. Cross-modal hashing enables efficient retrieval from a database of one modality in response to a query of another modality. Existing work on cross-modal hashing assumes that heterogeneous relationships across modalities are available for hash function learning. In this paper, we relax this strong assumption by only requiring such heterogeneous relationships in an auxiliary dataset different from the query/database domain. We craft a hybrid deep architecture that simultaneously learns the cross-modal correlation from the auxiliary dataset and aligns the data distributions of the auxiliary dataset and the query/database domain, generating transitive hash codes for heterogeneous multimedia retrieval. Extensive experiments show that the proposed approach yields state-of-the-art multimedia retrieval performance on public datasets, i.e., NUS-WIDE and ImageNet-YahooQA.
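The abstract does not specify the alignment mechanism here; maximum mean discrepancy (MMD) is one standard way to align an auxiliary dataset's feature distribution with the query/database domain, sketched below as an assumed, generic choice.

```python
# MMD-based distribution alignment loss between two feature domains.
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Squared MMD between samples x and y under an RBF kernel."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

torch.manual_seed(0)
aux = torch.randn(128, 64)            # auxiliary-domain features (toy)
tgt = torch.randn(128, 64) + 0.5      # query/database-domain features (toy)
print(gaussian_mmd(aux, tgt).item())  # would be added to the hashing loss
```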

Weakly-paired Cross-Modal Hashing

arXiv (Cornell University), 2019

Hashing has been widely adopted for large-scale data retrieval in many domains due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, these methods generally require the same number of samples across different modalities, which restricts their flexibility. We propose a flexible cross-modal hashing approach (FlexCMH) to learn effective hash codes from weakly-paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the local structure of each cluster and thus find the potential correspondence between clusters (and the samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes, in a unified objective function, the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantization loss. An alternating optimization technique is also proposed to coordinate the correspondence and the hash functions, and to reinforce the reciprocal effects of the two objectives. Experiments on public multi-modal datasets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it indeed offers a high degree of flexibility for practical cross-modal hashing tasks.
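A toy version of the clustering-based matching idea: cluster each modality separately, describe every cluster by a feature-space-agnostic signature (here, quantiles of normalized distances to its centroid), then match clusters across modalities with the Hungarian algorithm. This is an illustrative stand-in, not FlexCMH's exact matching strategy.

```python
# Cross-modal cluster matching on unpaired toy data.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def cluster_signatures(X, n_clusters, seed=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    sigs = []
    for c in range(n_clusters):
        d = np.linalg.norm(X[km.labels_ == c] - km.cluster_centers_[c], axis=1)
        d = d / (d.mean() + 1e-12)                       # scale-free signature
        sigs.append(np.quantile(d, [0.1, 0.3, 0.5, 0.7, 0.9]))
    return np.array(sigs), km.labels_

rng = np.random.default_rng(0)
X_img = rng.standard_normal((300, 50))                   # image features (toy)
X_txt = rng.standard_normal((300, 20))                   # unpaired text features

s_img, _ = cluster_signatures(X_img, n_clusters=6)
s_txt, _ = cluster_signatures(X_txt, n_clusters=6)
cost = np.linalg.norm(s_img[:, None] - s_txt[None], axis=-1)
rows, cols = linear_sum_assignment(cost)                 # cluster correspondence
print(list(zip(rows.tolist(), cols.tolist())))
```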

Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals

IEEE Transactions on Neural Networks and Learning Systems, 2020

Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages 2-D convolutional neural networks (CNNs) as the backbone to capture spatial information for image-text retrieval, and a 3-D CNN as the backbone to capture spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both inter-modality similarities and intra-modality semantic labels. Specifically, under the assumption that the learned hash codes should be optimal for the classification task, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels into the resultant hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by simultaneously exploiting feature representation learning, inter-modality similarity-preserving learning, semantic label-preserving learning, and hash function learning with different types of loss functions. The proposed DSMHN is a generic and scalable deep hashing framework for both image-text and video-text retrieval, which can be flexibly integrated with different types of loss functions. We conduct extensive experiments on both single-modal and cross-modal retrieval tasks on four widely used multimodal retrieval datasets. Experimental results on both image-text and video-text retrieval tasks demonstrate that DSMHN significantly outperforms state-of-the-art methods.
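A minimal two-stream sketch of the "hash codes should be optimal for classification" idea: each modality maps features to relaxed codes via tanh, a shared classifier predicts labels from the codes, and an inter-modality term pulls matched pairs together. The layer sizes and losses are illustrative, not the DSMHN networks.

```python
# Two-stream label-preserving hashing sketch in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

bits, n_classes = 48, 10
img_net = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, bits))
txt_net = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, bits))
classifier = nn.Linear(bits, n_classes)          # shared across modalities

torch.manual_seed(0)
img, txt = torch.randn(32, 2048), torch.randn(32, 300)
labels = torch.randint(0, n_classes, (32,))

b_img, b_txt = torch.tanh(img_net(img)), torch.tanh(txt_net(txt))
loss = (F.cross_entropy(classifier(b_img), labels)       # label-preserving, image
        + F.cross_entropy(classifier(b_txt), labels)     # label-preserving, text
        + F.mse_loss(b_img, b_txt))                      # inter-modality similarity
loss.backward()
print(loss.item())
```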