Bayesian Triplet Loss: Uncertainty Quantification in Image Retrieval

Exploring Uncertainty in Conditional Multi-Modal Retrieval Systems

arXiv, 2019

We cast visual retrieval as a regression problem by posing triplet loss as a regression loss. This enables epistemic uncertainty estimation in retrieval using dropout as a Bayesian approximation framework. Accordingly, Monte Carlo (MC) sampling is leveraged to boost retrieval performance. Our approach is evaluated on two applications: person re-identification and autonomous car driving. Results comparable to the state of the art are achieved on multiple datasets for the former application. For the latter, we leverage the Honda Driving Dataset (HDD), which provides multiple modalities and similarity notions for ego-motion action understanding. Hence, we present a multi-modal conditional retrieval network that disentangles embeddings into separate representations to encode different similarities. This form of joint learning eliminates the need to train multiple independent networks, without any performance degradation. Quantitative evaluation highlights our approach's competence, achieving a 6% improvement in a highly uncertain environment.
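
To make the mechanism concrete, here is a minimal sketch of test-time MC-dropout sampling for retrieval embeddings, assuming a PyTorch encoder with dropout layers; the function name `mc_embeddings`, the sample count, and the variance-based score are illustrative choices, not the paper's exact procedure.

```python
import torch

def mc_embeddings(model, x, n_samples=20):
    """Draw stochastic embeddings by keeping dropout active at test time
    (MC-dropout). Assumes the encoder has dropout layers and no BatchNorm,
    or BatchNorm kept frozen, since train() toggles both."""
    model.train()  # activate dropout for sampling
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])  # (T, B, D)
    mean = samples.mean(dim=0)             # averaged embedding used for retrieval
    unc = samples.var(dim=0).mean(dim=-1)  # per-image epistemic uncertainty score
    return mean, unc
```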

Exploring Uncertainty Measures for Image-caption Embedding-and-retrieval Task

ACM Transactions on Multimedia Computing, Communications, and Applications, 2021

With the significant development of black-box machine learning algorithms, particularly deep neural networks, the practical demand for reliability assessment is rapidly increasing. On the basis of the concept that “Bayesian deep learning knows what it does not know,” the uncertainty of deep neural network outputs has been investigated as a reliability measure for classification and regression tasks. By considering an embedding task as a regression task, several existing studies have quantified the uncertainty of embedded features and improved the retrieval performance of cutting-edge models by model averaging. However, in image-caption embedding-and-retrieval tasks, well-known samples are not always easy to retrieve. This study shows that the existing method has poor performance in reliability assessment and investigates another aspect of image-caption embedding-and-retrieval tasks. We propose posterior uncertainty by considering the retrieval task as a classification task, which ca...
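
As one plausible reading of the "retrieval as classification" idea, the sketch below forms a softmax posterior over gallery items for each query and uses its entropy as the uncertainty score; the temperature and the entropy choice are assumptions, not the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def retrieval_posterior_entropy(query_emb, gallery_embs, temperature=0.1):
    """Treat retrieval as classification over the gallery: the softmax over
    similarities plays the role of a posterior, and its entropy serves as
    the uncertainty score (high entropy = ambiguous retrieval)."""
    sims = query_emb @ gallery_embs.t()             # (B, N) similarity logits
    probs = F.softmax(sims / temperature, dim=-1)   # posterior over gallery items
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy
```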

Distribution Structure Learning Loss (DSLL) Based on Deep Metric Learning for Image Retrieval

Entropy, 2019

The massive number of images demands highly efficient image retrieval tools. Deep distance metric learning (DDML) is proposed to learn image similarity metrics in an end-to-end manner based on the convolutional neural network, which has achieved encouraging results. The loss function is crucial in DDML frameworks. However, we found limitations in current methods: when learning the similarity of positive and negative examples, they aim to pull positive pairs as close as possible and to separate negative pairs by equal distances in the embedding space. Consequently, the data distribution may be neglected. In this work, we focus on the distribution structure learning loss (DSLL) algorithm, which aims to preserve the geometric information of images. To achieve this, we first propose a metric distance learning for highly matching figures to preserve the similarity structure within them. Second, we introduce an entropy weight-based structural distribution to set the weight of the rep...
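
The abstract is truncated, but the core idea of preserving similarity structure among positives can be sketched as follows; the `ref_dists` reference structure and the specific penalty are illustrative stand-ins, not the published DSLL terms.

```python
import torch

def structure_preserving_loss(anchor, positives, ref_dists):
    """Pull positives toward the anchor while keeping their pairwise
    distance gaps close to a reference structure (e.g., distances in the
    original feature space), instead of collapsing all positives equally."""
    d = torch.norm(positives - anchor, dim=-1)   # current anchor-positive distances
    attract = d.mean()                           # attraction term
    gap = d.unsqueeze(0) - d.unsqueeze(1)        # current distance gaps
    ref_gap = ref_dists.unsqueeze(0) - ref_dists.unsqueeze(1)
    structure = (gap - ref_gap).abs().mean()     # structure-preservation term
    return attract + structure
```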

Learning from the uncertain

Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business, 2015

Recent advances in visual concept detection based on deep convolutional neural networks have only been successful because of the availability of huge training datasets provided by benchmarking initiatives such as ImageNet. Assembling reliably annotated training data is still a largely manual effort and can only be approached efficiently through crowdworking tasks. On the other hand, user-generated photos and annotations are available at almost no cost in social photo communities such as Flickr. Leveraging the information available in these communities may help to extend existing datasets as well as to create new ones for completely different classification scenarios. However, user-generated annotations of photos are known to be incomplete, subjective, and not necessarily related to the depicted content. In this paper, we therefore present an approach to reliably identify photos relevant for a given visual concept category. We have downloaded additional metadata for 1 million Flickr images and have trained a language model based on user-generated annotations. Relevance estimation is based on the accordance of an image's annotation data with our language model and on subsequent visual re-ranking. Experimental results demonstrate the potential of the proposed method; comparison with a baseline approach based on single tag matching shows significant improvements.
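
A minimal sketch of the accordance idea: score a photo's tags under a smoothed unigram language model built from tags of concept-relevant photos. The add-alpha smoothing and length normalization here are assumptions; the abstract does not specify the model's exact form.

```python
import math

def tag_relevance(photo_tags, concept_tag_counts, total_count, alpha=1.0):
    """Length-normalized log-likelihood of a photo's user tags under a
    smoothed unigram model built from tags of concept-relevant photos."""
    vocab_size = len(concept_tag_counts)
    log_p = 0.0
    for tag in photo_tags:
        count = concept_tag_counts.get(tag, 0)
        log_p += math.log((count + alpha) / (total_count + alpha * vocab_size))
    return log_p / max(len(photo_tags), 1)

# Example: rank two photos for the concept "beach"
model = {"beach": 120, "sea": 90, "sand": 60, "party": 5}
photos = {"a": ["beach", "sea"], "b": ["party", "friends"]}
ranked = sorted(photos, key=lambda k: tag_relevance(photos[k], model, 275), reverse=True)
print(ranked)  # photo "a" scores higher
```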

Robust and Decomposable Average Precision for Image Retrieval

arXiv, 2021

In image retrieval, standard evaluation metrics rely on score ranking, e.g. average precision (AP). In this paper, we introduce a method for robust and decomposable average precision (ROADMAP) addressing two major challenges for end-to-end training of deep neural networks with AP: non-differentiability and non-decomposability. Firstly, we propose a new differentiable approximation of the rank function, which provides an upper bound of the AP loss and ensures robust training. Secondly, we design a simple yet effective loss function to reduce the decomposability gap between the AP on the whole training set and its averaged batch approximation, for which we provide theoretical guarantees. Extensive experiments conducted on three image retrieval datasets show that ROADMAP outperforms several recent AP approximation methods and highlight the importance of our two contributions. Finally, using ROADMAP for training deep models yields very good performance, outperforming state-of-the-art r...
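
The generic smooth-rank device that losses like ROADMAP build on can be sketched briefly: replace the step function inside the rank with a sigmoid so AP becomes differentiable. This is not ROADMAP itself, which additionally enforces an upper bound on the AP loss and adds a decomposability term; it is just the common starting point.

```python
import torch

def smooth_ap_loss(scores, labels, tau=0.01):
    """Differentiable AP surrogate: the Heaviside step inside the rank is
    replaced by a sigmoid so gradients can flow through the ranking."""
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # diff[i, j] = s_j - s_i
    sg = torch.sigmoid(diff / tau)                     # soft "j outranks i"
    pos = labels.bool()
    self_term = sg.diagonal()                          # sigmoid(0) = 0.5, removed below
    rank_all = 1.0 + sg.sum(dim=1) - self_term         # soft rank among all items
    rank_pos = 1.0 + (sg * pos.unsqueeze(0).float()).sum(dim=1) - self_term
    ap = (rank_pos[pos] / rank_all[pos]).mean()        # soft average precision
    return 1.0 - ap
```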

Modeling Latent Variable Uncertainty for Loss-based Learning

2012

We consider the problem of parameter estimation using weakly supervised datasets, where a training sample consists of the input and a partially specified annotation, which we refer to as the output. The missing information in the annotation is modeled using latent variables. Previous methods overburden a single distribution with two separate tasks: (i) modeling the uncertainty in the latent variables during training; and (ii) making accurate predictions for the output and the latent variables during testing. We propose a novel framework that separates the demands of the two tasks using two distributions: (i) a conditional distribution to model the uncertainty of the latent variables for a given input-output pair; and (ii) a delta distribution to predict the output and the latent variables for a given input. During learning, we encourage agreement between the two distributions by minimizing a loss-based dissimilarity coefficient. Our approach generalizes latent SVM in two important ways: (i) it models the uncertainty over latent variables instead of relying on a pointwise estimate; and (ii) it allows the use of loss functions that depend on latent variables, which greatly increases its applicability. We demonstrate the efficacy of our approach on two challenging problems, object detection and action detection, using publicly available datasets.
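
Schematically, and under the simplifying assumption of a linear joint feature map Φ as in the latent SVM lineage, the agreement between the two distributions can be written as the expected loss below; the paper's dissimilarity coefficient also contains diversity terms omitted here.

```latex
\min_{\theta, w} \;\sum_i \;
\mathbb{E}_{h \sim P_\theta(h \mid x_i, y_i)}
\Big[\, \Delta\big((y_i, h),\, (\hat{y}_i, \hat{h}_i)\big) \Big],
\qquad
(\hat{y}_i, \hat{h}_i) \;=\; \operatorname*{arg\,max}_{y,\, h} \; w^\top \Phi(x_i, y, h)
```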

Distribution Consistency Loss for Large-Scale Remote Sensing Image Retrieval

Remote Sensing, 2020

Remote sensing images are characterized by their massive volume, diversity, and complexity. These characteristics place higher demands on the speed and accuracy of remote sensing image retrieval, in which the feature extraction method plays a key role. Deep metric learning (DML) captures the semantic similarity information between data points by learning embeddings in vector space. However, due to the uneven distribution of sample data in remote sensing image datasets, the pair-based losses currently used in DML are not suitable. To address this, we propose a novel distribution consistency loss. First, we define a new way to mine samples by selecting five in-class hard samples and five inter-class hard samples to form an informative set. This method enables the network to extract more useful information in a short time. Second, in order to avoid inaccurate feature extraction due to sample imbalance, we assign dynamic weight to the positive samples accordin...
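
The mining step can be sketched directly: for each anchor, take the k farthest same-class samples and the k closest other-class samples (k = 5 above) as the informative set. Function and variable names are illustrative, and the sketch assumes at least k samples exist in each group.

```python
import torch

def informative_set(anchor_emb, embs, labels, anchor_label, k=5):
    """For one anchor, mine the k farthest same-class samples and the
    k closest other-class samples (the hardest of each kind)."""
    d = torch.norm(embs - anchor_emb, dim=-1)
    same = labels == anchor_label
    hard_pos = d.masked_fill(~same, float('-inf')).topk(k).indices    # farthest positives
    hard_neg = (-d).masked_fill(same, float('-inf')).topk(k).indices  # closest negatives
    return hard_pos, hard_neg
```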

Ladder Loss for Coherent Visual-Semantic Embedding

Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2020

For visual-semantic embedding, the existing methods normally treat the relevance between queries and candidates in a bipolar way, relevant or irrelevant, and all "irrelevant" candidates are uniformly pushed away from the query by an equal margin in the embedding space, regardless of their varying proximity to the query. This practice disregards relatively discriminative information and could lead to suboptimal ranking in the retrieval results and a poorer user experience, especially in the long-tail query scenario where a matching candidate may not necessarily exist. In this paper, we introduce a continuous variable to model the relevance degree between queries and multiple candidates, and propose to learn a coherent embedding space, where candidates with higher relevance degrees are mapped closer to the query than those with lower relevance degrees. In particular, the new ladder loss is proposed by extending the triplet loss inequality to a more general inequality chain, which implements variable push-away margins according to respective relevance degrees. In addition, a proper Coherent Score metric is proposed to better measure the ranking results, including those "irrelevant" candidates. Extensive experiments on multiple datasets validate the efficacy of our proposed method, which achieves significant improvement over existing state-of-the-art methods.
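
A compact sketch of the inequality-chain idea: bucket candidates by relevance degree and require each less-relevant bucket to sit at least one margin step farther from the query. The per-rung margins and the mean reduction are assumptions, not the paper's exact weighting.

```python
import torch

def ladder_loss(dists, relevance, margins):
    """Bucket candidates by relevance degree; each less-relevant bucket
    must sit at least one margin farther from the query than the bucket
    above it. `margins` holds one margin per adjacent pair of levels."""
    levels = sorted(relevance.unique().tolist(), reverse=True)  # high -> low
    loss = dists.new_zeros(())
    for m, (hi, lo) in zip(margins, zip(levels, levels[1:])):
        d_hi = dists[relevance == hi]   # more-relevant candidates
        d_lo = dists[relevance == lo]   # less-relevant candidates
        viol = d_hi.unsqueeze(1) - d_lo.unsqueeze(0) + m  # want d_hi + m <= d_lo
        loss = loss + torch.clamp(viol, min=0).mean()
    return loss
```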

Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches

2020 25th International Conference on Pattern Recognition (ICPR)

In this work, we evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings. Specifically, we train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss in order to study their clustering properties under the assumption of noisy labels. Additionally, we propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. We evaluate all three Triplet Loss formulations for k-means and correlation clustering on the CIFAR-10 image classification dataset.
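
For reference, the two formulations most often called "popular" are the margin (hinge) version and a softmax-ratio version; whether these are exactly the two variants studied here is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_margin(a, p, n, margin=0.2):
    """Hinge formulation: d(a,p) should beat d(a,n) by a fixed margin."""
    return F.relu(torch.norm(a - p, dim=-1) - torch.norm(a - n, dim=-1) + margin).mean()

def triplet_softmax_ratio(a, p, n):
    """Softmax-ratio formulation: push the relative distance of the
    positive toward 0 without a hard margin hyperparameter."""
    d = torch.stack([torch.norm(a - p, dim=-1), torch.norm(a - n, dim=-1)], dim=-1)
    probs = F.softmax(-d, dim=-1)           # probs[..., 0] ~ "positive is closer"
    return (1.0 - probs[..., 0]).pow(2).mean()
```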

Discriminative Jackknife: Quantifying Uncertainty in Deep Learning via Higher-Order Influence Functions

2020

Deep learning models achieve high predictive accuracy across a broad spectrum of tasks, but rigorously quantifying their predictive uncertainty remains challenging. Usable estimates of predictive uncertainty should (1) cover the true prediction targets with high probability, and (2) discriminate between high- and low-confidence prediction instances. Existing methods for uncertainty quantification are based predominantly on Bayesian neural networks; these may fall short of (1) and (2), i.e., Bayesian credible intervals do not guarantee frequentist coverage, and approximate posterior inference undermines discriminative accuracy. In this paper, we develop the discriminative jackknife (DJ), a frequentist procedure that utilizes influence functions of a model's loss functional to construct a jackknife (or leave-one-out) estimator of predictive confidence intervals. The DJ satisfies (1) and (2), is applicable to a wide range of deep learning models, is easy to implement, and can be ...
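
The influence-function mechanics are easiest to see on ridge regression, where the Hessian is closed-form: the sketch below forms first-order leave-one-out parameters theta_{-i} ~ theta + H^{-1} g_i and turns the resulting prediction shifts into an interval. This is a simplified illustration in the DJ spirit, not the paper's exact construction, and sign conventions for the influence step vary.

```python
import torch

def jackknife_interval(X, y, x_test, alpha=0.1, damping=1e-3):
    """First-order leave-one-out on ridge regression, where the Hessian is
    closed-form. The spread of the n leave-one-out test predictions yields
    a predictive interval around the fitted prediction."""
    n, d = X.shape
    H = X.t() @ X / n + damping * torch.eye(d)    # Hessian of the ridge objective
    theta = torch.linalg.solve(H, X.t() @ y / n)  # fitted parameters
    resid = X @ theta - y                          # training residuals
    G = X * resid.unsqueeze(1) / n                 # per-point gradients g_i, shape (n, d)
    deltas = torch.linalg.solve(H, G.t()).t()      # H^{-1} g_i for every i
    loo_preds = (theta.unsqueeze(0) + deltas) @ x_test  # predictions under theta_{-i}
    center = x_test @ theta
    width = (loo_preds - center).abs().quantile(1 - alpha)
    return center - width, center + width
```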