Cross-Modal Retrieval Using Deep De-correlated Subspace Ranking Hashing
Related papers
Contrastive Label Correlation Enhanced Unified Hashing Encoder for Cross-modal Retrieval
Proceedings of the 31st ACM International Conference on Information & Knowledge Management
Cross-modal hashing (CMH) has been widely used in multimedia retrieval applications for its low storage cost and fast indexing speed. Thanks to the success of deep learning, cross-modal hashing has made significant progress with high-quality deep features. However, the modality gap remains a crucial bottleneck for existing cross-modal hashing methods: the commonly used convolutional neural network and bag-of-words encoders are customized to single-modality priors, limiting the models' ability to learn semantic representations in a cross-modal space. To overcome modality heterogeneity, we propose a shared transformer encoder (UniHash) to unify cross-modal hashing into the same semantic space. A contrastive label correlation learning (CLC) loss, which uses the category labels as a bridge between modalities, is designed to improve representation quality. Moreover, we take advantage of the multi-hot label space and propose a negative label generation (NegLG) strategy to obtain richer and more uniformly distributed negative labels for contrast. Extensive experiments on three benchmarks verify the advantage of the proposed method. The proposed UniHash significantly outperforms state-of-the-art cross-modal hashing methods, establishing an important new baseline for cross-modal hashing research. Code is released at github.com/idealwhite/Unihash.
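The CLC loss and NegLG strategy are described only at a high level above. As a rough illustration, the sketch below pairs an InfoNCE-style contrast between sample embeddings and label embeddings with randomly sampled multi-hot negative labels; the names (`gen_negative_labels`, `clc_loss`) and the temperature value are hypothetical and not taken from the UniHash code.

```python
import torch
import torch.nn.functional as F

def gen_negative_labels(labels: torch.Tensor, n_neg: int) -> torch.Tensor:
    """Hypothetical stand-in for NegLG: sample uniformly distributed
    multi-hot negative label vectors from the label space."""
    n_classes = labels.size(1)
    return torch.randint(0, 2, (n_neg, n_classes), device=labels.device).float()

def clc_loss(emb: torch.Tensor, label_emb: torch.Tensor,
             neg_label_emb: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrast: each sample embedding should be closer to the
    embedding of its own label vector than to any negative-label embedding."""
    emb = F.normalize(emb, dim=1)
    pos = F.normalize(label_emb, dim=1)
    neg = F.normalize(neg_label_emb, dim=1)
    pos_logit = (emb * pos).sum(dim=1, keepdim=True) / tau   # (B, 1)
    neg_logits = emb @ neg.t() / tau                         # (B, n_neg)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    target = torch.zeros(emb.size(0), dtype=torch.long, device=emb.device)
    return F.cross_entropy(logits, target)                   # positive is index 0
```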
Semantic Correlation Based Deep Cross-Modal Hashing For Faster Retrieval
International Journal of Innovative Technology and Exploring Engineering, 2019
Due to the growth of multi-modal applications, a large amount of multi-modal data is being generated. Nearest Neighbor (NN) search is used to retrieve information, but it suffers with high-dimensional data. Approximate Nearest Neighbor (ANN) search is therefore extensively used by researchers, with data represented as binary codes via semantic hashing. Such a representation reduces storage cost and increases retrieval speed. In addition, deep learning has shown good performance in information retrieval and efficiently handles the scalability problem. Multi-modal data have different statistical properties, so a method is needed that finds the semantic correlation between them. In this paper, experiments are performed using the correlation methods CCA, KCCA, and DCCA on the MNIST dataset, treated as a multi-view dataset, and the results show that DCCA outperforms CCA and KCCA by learning representations with higher correlations. However, due to flexible requirements of...
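For reference, the linear CCA baseline being compared can be computed directly from the within- and cross-view covariance matrices. The NumPy sketch below (regularization value chosen arbitrarily) returns the canonical correlations that KCCA and DCCA generalize with kernels and neural networks, respectively.

```python
import numpy as np

def cca_correlations(X, Y, reg=1e-4):
    """Canonical correlations between views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])  # regularized covariances
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    Lx = np.linalg.cholesky(Sxx)                        # Sxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Syy)
    # Singular values of the whitened cross-covariance Lx^-1 Sxy Ly^-T
    # are the canonical correlations, all in [0, 1].
    T = np.linalg.solve(Lx, np.linalg.solve(Ly, Sxy.T).T)
    return np.linalg.svd(T, compute_uv=False)
```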
Unsupervised Deep Cross-modality Spectral Hashing
IEEE Transactions on Image Processing, 2020
This paper presents a novel framework, namely Deep Cross-modality Spectral Hashing (DCSH), to tackle the unsupervised learning problem of binary hash codes for efficient cross-modal retrieval. The framework is a two-step hashing approach which decouples the optimization into (1) binary optimization and (2) hashing function learning. In the first step, we propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations. While the former is capable of well preserving the local structure of each modality, the latter reveals the hidden patterns from all modalities. In the second step, to learn mapping functions from informative data inputs (images and word embeddings) to the binary codes obtained in the first step, we leverage powerful CNNs for images and propose a CNN-based deep architecture for the text modality. Quantitative evaluations on three standard benchmark datasets demonstrate that the proposed DCSH method consistently outperforms other state-of-the-art methods.
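The two-step recipe of the abstract (optimize binary codes first, then learn hash functions) can be illustrated independently of DCSH's specific spectral algorithm. The toy sketch below uses a generic spectral embedding of a k-NN graph for step one and a linear regressor for step two; it is an assumed simplification, not the paper's method.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import kneighbors_graph

def two_step_hash(X, n_bits=16, n_neighbors=10):
    """Step 1: spectral embedding of a k-NN affinity graph, binarized by
    sign. Step 2: a hash function (here linear) regressed onto the codes."""
    W = kneighbors_graph(X, n_neighbors, mode='connectivity')
    W = 0.5 * (W + W.T)                        # symmetrize the affinity
    L = laplacian(W, normed=True).toarray()
    _, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, 1:n_bits + 1]               # skip the trivial eigenvector
    B = np.where(Y >= 0, 1, -1)                # step-1 binary codes in {-1, +1}
    f = LinearRegression().fit(X, B)           # step-2 hash function
    return B, f

# Out-of-sample hashing: codes = np.sign(f.predict(X_new))
```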
Multimodal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing
IEEE Transactions on Neural Networks and Learning Systems, 2022
In this paper, we adopt the maximizing mutual information (MI) approach to tackle the problem of unsupervised learning of binary hash codes for efficient cross-modal retrieval. We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH). First, to learn informative representations that can preserve both intra- and inter-modal similarities, we leverage recent advances in estimating variational lower bounds of MI to maximize the MI between the binary representations and the input features, and between the binary representations of different modalities. By jointly maximizing these MIs under the assumption that the binary representations are modelled by multivariate Bernoulli distributions, we can learn binary representations that preserve both intra- and inter-modal similarities, effectively and in a mini-batch manner with gradient descent. Furthermore, we find that trying to minimize the modality gap by learning similar binary representations for the same instance from different modalities can result in less informative representations. Hence, balancing between reducing the modality gap and losing modality-private information is important for cross-modal retrieval tasks. Quantitative evaluations on standard benchmark datasets demonstrate that the proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
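Making Bernoulli-modelled binary representations trainable with gradient descent is the practical crux of the abstract. One common device for this (assumed here; the paper may use a different estimator) is a straight-through sign activation combined with a cross-modal agreement term, as sketched below.

```python
import torch

class STESign(torch.autograd.Function):
    """Sign in the forward pass, identity gradient in the backward pass, so
    binary codes can be trained with ordinary mini-batch gradient descent."""
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad):
        return grad

def cross_modal_agreement(b_img, b_txt):
    """Pull the codes of paired instances together -- a crude proxy for
    maximizing MI between binary representations of different modalities."""
    return ((b_img - b_txt) ** 2).mean()

# Hypothetical usage with modality encoders f_img and f_txt:
#   b_img = STESign.apply(f_img(images))
#   b_txt = STESign.apply(f_txt(texts))
#   loss = cross_modal_agreement(b_img, b_txt)  # plus intra-modal terms
```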
Supervised Intra- and Inter-Modality Similarity Preserving Hashing for Cross-Modal Retrieval
IEEE Access
Cross-modal hashing has drawn considerable interest in multimodal retrieval due to the explosive growth of multimedia big data. However, existing methods mainly focus on learning unified hash codes and investigate the local geometric structure in the original space, resulting in hash codes with low discriminative power for out-of-sample instances. To address this problem, this paper investigates hashing-function learning by preserving the modality correlation in the expected low-dimensional common space. A cross-modal hashing method based on supervised collective matrix factorization is proposed, taking both intra-modality and inter-modality similarity preservation into account. For more flexible hashing functions, label information is embedded into the hashing-function learning procedure. Specifically, we explore intra-modality similarity preservation in the expected low-dimensional common space. In addition, a supervised shrinking scheme is used to enhance the local geometric consistency in each modality. The proposed method learns unified hash codes as well as hashing functions for different modalities; the overall objective function, consisting of collective matrix factorization and intra- and inter-modality similarity embedding, is solved with alternating optimization in an iterative scheme. Extensive experiments on three benchmark data sets demonstrate that the proposed method is more flexible with respect to newly arriving data and achieves performance superior to state-of-the-art supervised cross-modal hashing approaches in most cases.
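The collective-matrix-factorization core of such methods reduces, in its simplest unsupervised form, to alternating least squares with a latent factor shared across modalities. The sketch below omits the paper's label embedding and similarity-preserving terms and is only meant to show the alternating scheme.

```python
import numpy as np

def collective_mf(X1, X2, n_bits=16, n_iter=30, lam=1e-2):
    """Minimize ||X1 - U1 V||^2 + ||X2 - U2 V||^2 (+ ridge terms) with a
    latent matrix V shared across modalities, then binarize V into codes.
    X1, X2 are (d_m x n) feature matrices with samples as columns."""
    rng = np.random.default_rng(0)
    n = X1.shape[1]
    V = rng.standard_normal((n_bits, n))
    I = lam * np.eye(n_bits)
    for _ in range(n_iter):
        U1 = X1 @ V.T @ np.linalg.inv(V @ V.T + I)   # modality-specific bases
        U2 = X2 @ V.T @ np.linalg.inv(V @ V.T + I)
        V = np.linalg.solve(U1.T @ U1 + U2.T @ U2 + I,
                            U1.T @ X1 + U2.T @ X2)   # shared latent factor
    B = np.where(V >= 0, 1, -1)                      # unified hash codes
    return B, U1, U2
```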
Deep Learning to Hash with Multiple Representations
2012 IEEE 12th International Conference on Data Mining, 2012
Hashing seeks an embedding of high-dimensional objects into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes with small Hamming distances. A variety of hashing methods have been developed, but most of them resort to a single view (representation) of data. However, objects are often described by multiple representations. For instance, images are described by several different visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate multiple representations into hashing, leading to multi-view hashing. In this paper we present a deep network for multi-view hashing, referred to as deep multi-view hashing, where each layer of hidden nodes is composed of view-specific and shared hidden nodes, in order to learn individual and shared hidden spaces from multiple views of data. Numerical experiments on image datasets demonstrate the useful behavior of our deep multi-view hashing (DMVH), compared to a recently proposed multi-modal deep network as well as existing shallow hashing models.
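One plausible reading of the layer structure described above (view-specific plus shared hidden units in each layer) is sketched below in PyTorch; the split sizes and the module name `MultiViewLayer` are hypothetical.

```python
import torch
import torch.nn as nn

class MultiViewLayer(nn.Module):
    """One hidden layer whose units split into view-specific and shared
    groups: each view feeds its own private units, while both views feed
    the shared units."""
    def __init__(self, d1, d2, private=64, shared=32):
        super().__init__()
        self.p1 = nn.Linear(d1, private)   # view-1 private units
        self.p2 = nn.Linear(d2, private)   # view-2 private units
        self.s1 = nn.Linear(d1, shared)    # view-1 contribution to shared units
        self.s2 = nn.Linear(d2, shared)    # view-2 contribution to shared units

    def forward(self, x1, x2):
        shared = torch.tanh(self.s1(x1) + self.s2(x2))
        h1 = torch.cat([torch.tanh(self.p1(x1)), shared], dim=1)
        h2 = torch.cat([torch.tanh(self.p2(x2)), shared], dim=1)
        return h1, h2
```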
Weakly-paired Cross-Modal Hashing
arXiv (Cornell University), 2019
Hashing has been widely adopted for large-scale data retrieval in many domains due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, these methods generally require the same number of samples across different modalities, which restricts their flexibility. We propose a flexible cross-modal hashing approach (FlexCMH) to learn effective hashing codes from weakly-paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the local structure of each cluster, and thus to find the potential correspondence between clusters (and the samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes, in a unified objective function, the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantization loss. An alternating optimization technique is also proposed to coordinate the correspondence and the hash functions, and to reinforce the reciprocal effects of the two objectives. Experiments on public multi-modal datasets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it indeed offers a high degree of flexibility for practical cross-modal hashing tasks.
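For illustration, the clustering-based matching step can be approximated by clustering each modality separately and aligning the clusters with the Hungarian algorithm. The cost used below (centroid distances after projecting both modalities to a common dimensionality with separate PCAs) is a placeholder, not FlexCMH's actual matching criterion.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def match_clusters(X_img, X_txt, k=10, d=16):
    """Cluster each modality, then align clusters across modalities by
    minimum total centroid distance (Hungarian algorithm)."""
    Zi = PCA(n_components=d).fit_transform(X_img)   # project each modality
    Zt = PCA(n_components=d).fit_transform(X_txt)   # to the same dimension
    km_i = KMeans(n_clusters=k, n_init=10).fit(Zi)
    km_t = KMeans(n_clusters=k, n_init=10).fit(Zt)
    cost = np.linalg.norm(km_i.cluster_centers_[:, None, :]
                          - km_t.cluster_centers_[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)        # optimal 1-to-1 matching
    return dict(zip(rows, cols))                    # image cluster -> text cluster
```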
Effective and efficient indexing in cross-modal hashing-based datasets
Signal Processing: Image Communication
To overcome the barriers of storage and computation, hashing techniques have recently been widely used for nearest neighbor search in multimedia retrieval applications. In particular, cross-modal retrieval, which searches across different modalities, has become an active but challenging problem. Although dozens of cross-modal hashing algorithms have been proposed to yield compact binary codes, exhaustive search is impractical for real-time purposes, and Hamming distance computation can produce inaccurate results. In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes for cross-modal retrieval. The proposed hash code indexing scheme exploits a few bits of the hash code as the index code. We construct an inverted index table based on the index codes and train a neural network to improve indexing accuracy and efficiency. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where the hash codes are generated by three cross-modal hashing methods. The results show that the proposed method effectively boosts the performance of these hashing methods.
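The indexing idea itself (use a few bits of each hash code as an index code for an inverted table, probe nearby buckets, and rank only the candidates) is easy to sketch without the paper's learned component:

```python
from collections import defaultdict

def build_index(codes, prefix_bits=8):
    """Inverted index keyed by the first `prefix_bits` bits of each code.
    `codes` is a list of equal-length 0/1 bit strings."""
    table = defaultdict(list)
    for i, code in enumerate(codes):
        table[code[:prefix_bits]].append(i)
    return table

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def search(query, codes, table, prefix_bits=8):
    """Probe the query's bucket plus every bucket one bit-flip away, then
    rank the candidates by full Hamming distance (multi-probe search)."""
    key = query[:prefix_bits]
    probes = [key] + [key[:i] + ('1' if key[i] == '0' else '0') + key[i + 1:]
                      for i in range(prefix_bits)]
    cands = {i for p in probes for i in table.get(p, ())}
    return sorted(cands, key=lambda i: hamming(query, codes[i]))
```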
Deep Hashing with Hash Center Update for Efficient Image Retrieval
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we propose an approach for learning binary hash codes for image retrieval. Canonical Correlation Analysis (CCA) is used to design two loss functions for training a neural network such that the correlation between the two views presented to CCA is maximized. The first loss maximizes the correlation between the hash centers and the learned hash codes. The second loss maximizes the correlation between the class labels and the classification scores. A novel weighted-mean-and-thresholding-based hash center update scheme is proposed to adapt the hash centers in each epoch. The training loss reaches the theoretical lower bound of the proposed loss functions, showing that the correlation coefficients are maximized during training and substantiating the formation of an efficient feature space for image retrieval. The measured mean average precision shows that the proposed approach outperforms other state-of-the-art approaches on both single-label and multi-label image datasets.
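The "weighted mean and thresholding" center update is not spelled out above; one plausible reading (assumed, not taken from the paper) is to recompute each class center as the sign of a weighted average of that class's current codes:

```python
import numpy as np

def update_centers(codes, labels, weights, n_classes):
    """Recompute each hash center as the thresholded weighted mean of the
    codes assigned to that class.
    codes: (n, n_bits) in {-1, +1}; labels: (n,); weights: (n,)."""
    centers = np.zeros((n_classes, codes.shape[1]))
    for c in range(n_classes):
        mask = labels == c
        if mask.any():
            w = weights[mask][:, None]
            mean = (w * codes[mask]).sum(axis=0) / w.sum()
            centers[c] = np.where(mean >= 0, 1, -1)   # threshold at zero
    return centers
```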
Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals
IEEE Transactions on Neural Networks and Learning Systems, 2020
Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. In this article, we propose a novel deep semantic multimodal hashing network (DSMHN) for scalable image-text and video-text retrieval. The proposed deep hashing framework leverages 2-D convolutional neural networks (CNNs) as the backbone network to capture spatial information for image-text retrieval, and 3-D CNNs as the backbone network to capture spatial and temporal information for video-text retrieval. In the DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both inter-modality similarities and intra-modality semantic labels. Specifically, under the assumption that the learned hash codes should be optimal for the classification task, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels in the resulting hash codes. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by simultaneously exploiting feature representation learning, inter-modality similarity-preserving learning, semantic label-preserving learning, and hash-function learning with different types of loss functions. The proposed DSMHN is a generic and scalable deep hashing framework for both image-text and video-text retrieval, which can be flexibly integrated with different types of loss functions. We conduct extensive experiments for both single-modal and cross-modal retrieval tasks on four widely used multimodal retrieval datasets. Experimental results on both image-text and video-text retrieval tasks demonstrate that DSMHN significantly outperforms state-of-the-art methods.
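In schematic form, the two supervision signals described above (inter-modality similarity preserving and semantic label preserving on the codes) combine into a joint loss like the sketch below; the pairwise term shown is a generic negative log-likelihood over code inner products, not necessarily DSMHN's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_hash_loss(h_img, h_txt, sim, labels, cls_img, cls_txt, alpha=1.0):
    """h_img, h_txt: (B, n_bits) relaxed codes; sim: (B, B) 0/1 cross-modal
    similarity matrix; cls_*: (B, n_classes) classification logits computed
    from each modality's codes."""
    # Inter-modality similarity preserving: negative log-likelihood of the
    # pairwise labels under a sigmoid of the code inner products.
    theta = h_img @ h_txt.t() / 2
    sim_loss = (F.softplus(theta) - sim * theta).mean()
    # Semantic label preserving: the codes should remain classifiable.
    cls_loss = F.cross_entropy(cls_img, labels) + F.cross_entropy(cls_txt, labels)
    return sim_loss + alpha * cls_loss
```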