Local Consensus Enhanced Siamese Network with Reciprocal Loss for Two-view Correspondence Learning

Correspondence Networks With Adaptive Neighbourhood Consensus

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

In this paper, we tackle the task of establishing dense visual correspondences between images containing objects of the same category. This is a challenging task due to large intra-class variations and a lack of dense pixel level annotations. We propose a convolutional neural network architecture, called adaptive neighbourhood consensus network (ANC-Net), that can be trained end-to-end with sparse keypoint annotations, to handle this challenge. At the core of ANC-Net is our proposed non-isotropic 4D convolution kernel, which forms the building block for the adaptive neighbourhood consensus module for robust matching. We also introduce a simple and efficient multi-scale self-similarity module in ANC-Net to make the learned feature robust to intra-class variations. Furthermore, we propose a novel orthogonal loss that can enforce the one-to-one matching constraint. We thoroughly evaluate the effectiveness of our method on various benchmarks, where it substantially outperforms state-of-the-art methods.
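The one-to-one matching constraint mentioned above can be illustrated with a small orthogonality penalty. This is a minimal sketch under stated assumptions, not the loss defined in the ANC-Net paper: the correlation scores are assumed to be flattened into a square matrix `c` (source positions by target positions), and the function name `orthogonal_penalty` is hypothetical. The penalty is the squared Frobenius norm of `c @ c.T - I`, which is zero when `c` is a permutation matrix and grows when several source positions select the same target.

```python
def orthogonal_penalty(c):
    """Squared Frobenius norm of (C C^T - I) for a square score matrix C.

    Illustrative assumption: C[i][j] is the match score between source
    position i and target position j. This is not the paper's exact loss.
    """
    n = len(c)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of the Gram matrix C C^T.
            gram = sum(x * y for x, y in zip(c[i], c[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total


one_to_one = [[0.0, 1.0], [1.0, 0.0]]   # each source picks a distinct target
many_to_one = [[1.0, 0.0], [1.0, 0.0]]  # both sources pick target 0
print(orthogonal_penalty(one_to_one), orthogonal_penalty(many_to_one))
```

In this sketch the one-to-one assignment incurs no penalty, while the many-to-one assignment is penalised through the off-diagonal entries of the Gram matrix.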

Dual-Resolution Correspondence Networks

ArXiv, 2020

We tackle the problem of establishing dense pixel-wise correspondences between a pair of images. In this work, we introduce Dual-Resolution Correspondence Networks (DRC-Net) to obtain pixel-wise correspondences in a coarse-to-fine manner. DRC-Net extracts both coarse- and fine-resolution feature maps. The coarse maps are used to produce a full but coarse 4D correlation tensor, which is then refined by a learnable neighbourhood consensus module. The fine-resolution feature maps are used to obtain the final dense correspondences, guided by the refined coarse 4D correlation tensor. The selected coarse-resolution matching scores allow the fine-resolution features to focus only on a limited number of possible matches with high confidence. In this way, DRC-Net dramatically increases matching reliability and localisation accuracy, while avoiding the application of expensive 4D convolution kernels to fine-resolution feature maps. We comprehensively evaluate our method on large-scale public bench...
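The coarse 4D correlation tensor described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions rather than the DRC-Net implementation: feature maps are plain nested lists of descriptor vectors, the similarity is cosine similarity, and the helper names `correlation_4d` and `top_k_candidates` are hypothetical. The learnable neighbourhood consensus module and the fine-resolution stage are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def correlation_4d(feat_a, feat_b):
    """Return corr[i][j][k][l], the cosine similarity between
    feat_a[i][j] and feat_b[k][l]."""
    a = [[l2_normalize(v) for v in row] for row in feat_a]
    b = [[l2_normalize(v) for v in row] for row in feat_b]
    return [[[[sum(x * y for x, y in zip(va, vb)) for vb in row_b]
              for row_b in b]
             for va in row_a]
            for row_a in a]


def top_k_candidates(corr, i, j, k=1):
    """Positions (row, col) in image B with the k highest scores
    for source cell (i, j) in image A."""
    scores = [(s, (r, c))
              for r, row in enumerate(corr[i][j])
              for c, s in enumerate(row)]
    scores.sort(key=lambda t: t[0], reverse=True)
    return [pos for _, pos in scores[:k]]


# Two tiny 1x2 feature maps with 2-D descriptors.
feat_a = [[[1.0, 0.0], [0.0, 1.0]]]
feat_b = [[[0.0, 2.0], [3.0, 0.0]]]
corr = correlation_4d(feat_a, feat_b)
print(top_k_candidates(corr, 0, 0), top_k_candidates(corr, 0, 1))
```

Restricting each source cell to its top-scoring coarse candidates is what lets a second, finer stage compare only a handful of positions instead of every pair of pixels.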

Dual-Resolution Correspondence Networks –Supplementary Material–

2020

In the supplementary, we present more experimental results and analysis to show the effectiveness of DualRC-Net. In section 1, we provide five alternatives to the FPN-like structure for fusing the dual-resolution feature maps of the feature backbone. In section 2, we compare DualRC-Net with other neighbourhood consensus based methods in more detail. Finally, in section 3, we qualitatively compare DualRC-Net with the state-of-the-art methods on three benchmarks. DualRC-Net establishes the new state-of-the-art.

1 Investigation on more variants of FPN structure

Apart from the dual-resolution feature extractor we present in the main paper, we also investigate other possible FPN-like architectures (shown in Figure 1) and thoroughly evaluate their effects on the matching performance.

MCNet: Multiscale Clustering Network for Two-View Geometry Learning and Feature Matching

IEEE/CAA Journal of Automatica Sinica, 2023

Robust pose estimation and feature matching are core components of multi-view geometry and computer vision. This letter discusses how to recover two-view geometry and match features between a pair of images, and presents MCNet (a multiscale clustering network) as an algorithm for extracting multiscale features. It can identify the true inliers among the established putative correspondences, where outliers may degrade the geometry estimation. In particular, the proposed MCNet is based on graph clustering, in which the embedded correspondence features are mapped to a number of clusters by graph pooling. We integrate a multiscale clustering layer into the two-view correspondence learning framework in order to improve correspondence representation efficiency. Building on the multi-group feature fusion, we also construct two network architectures, termed MCNet-U and MCNet-M, which use the UNet and Pyramid techniques, respectively. Based on experimental results, the proposed model achieves state-of-the-art performance on feature matching with heavy outliers under weak supervision.

Convolutional Hough Matching Networks for Robust and Efficient Visual Correspondence

arXiv (Cornell University), 2021

Despite advances in feature representation, leveraging geometric relations is crucial for establishing reliable visual correspondences under large variations of images. In this work we introduce a Hough transform perspective on convolutional matching and propose an effective geometric matching algorithm, dubbed Convolutional Hough Matching (CHM). The method distributes similarities of candidate matches over a geometric transformation space and evaluates them in a convolutional manner. We cast it into a trainable neural layer with a semi-isotropic high-dimensional kernel, which learns non-rigid matching with a small number of interpretable parameters. To further improve the efficiency of high-dimensional voting, we also propose to use an efficient kernel decomposition with center-pivot neighbors, which significantly sparsifies the proposed semi-isotropic kernels without performance degradation. To validate the proposed techniques, we develop the neural network with CHM layers that perform convolutional matching in the space of translation and scaling. Our method sets a new state of the art on standard benchmarks for semantic visual correspondence, proving its strong robustness to challenging intra-class variations.

Index Terms: Semantic visual correspondence, Hough matching, convolutional matching, center-pivot convolution.

[Figure: CHM overview. Input: all possible candidate matches. Output: filtered matches. Labels: regions with different centers and scales; matches of object regions with different centers and scales; convolution with Hough voting kernel; non-linearity + maxpool + upsample.]

WarpNet: Weakly Supervised Matching for Single-View Reconstruction

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

We present an approach to matching images of objects in fine-grained datasets without using part annotations, with an application to the challenging problem of weakly supervised single-view reconstruction. This is in contrast to prior works that require part annotations, since matching objects across class and pose variations is challenging with appearance features alone. We overcome this challenge through a novel deep learning architecture, WarpNet, that aligns an object in one image with a different object in another. We exploit the structure of the fine-grained dataset to create artificial data for training this network in an unsupervised-discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint and articulation. On the CUB-200-2011 dataset of bird categories, we improve the AP over an appearance-only network by 13.6%. We further demonstrate that our WarpNet matches, together with the structure of fine-grained datasets, allow single-view reconstructions with quality comparable to using annotated point correspondences.

Learning Semantic Correspondence Exploiting an Object-level Prior

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal provides an object-level prior for the semantic correspondence task and offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms. Experimental results demonstrate the effectiveness of our approach, which significantly outperforms the state of the art on standard benchmarks.
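A differentiable stand-in for argmax, as mentioned in this abstract, is commonly built as a softmax-weighted average of indices. The following is a minimal sketch of that generic soft-argmax, assuming a 1-D list of matching scores and a hypothetical `temperature` parameter; it is not the specific variant used in SFNet, whose details are not given here.

```python
import math


def soft_argmax(scores, temperature=1.0):
    """Softmax-weighted expected index of a 1-D score list.

    Lower temperatures sharpen the softmax, so the result approaches
    the hard argmax when the scores have a single clear peak.
    """
    m = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    z = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / z


print(soft_argmax([0.0, 0.0, 10.0, 0.0], temperature=0.1))
print(soft_argmax([5.0, 0.0, 0.0, 5.0], temperature=0.1))
```

The second call shows a known limitation of the plain formulation: with two equally strong peaks the output lands between them rather than on either one, which is why practical variants usually constrain or reweight the scores first.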

Domain-Invariant Stereo Matching Networks

Lecture Notes in Computer Science, 2020

State-of-the-art stereo matching networks have difficulties in generalizing to new unseen environments due to significant domain differences, such as color, illumination, contrast, and texture. In this paper, we aim at designing a domain-invariant stereo matching network (DSM-Net) that generalizes well to unseen scenes. To achieve this goal, we propose i) a novel "domain normalization" approach that regularizes the distribution of learned representations to allow them to be invariant to domain differences, and ii) an end-to-end trainable structure-preserving graph-based filter for extracting robust structural and geometric representations that can further enhance domain-invariant generalizations. When trained on synthetic data and generalized to real test sets, our model performs significantly better than all state-of-the-art models. It even outperforms some deep neural network models (e.g. MC-CNN and DispNet) fine-tuned with test-domain data. The code is available at https://github.com/feihuzhang/DSMNet.

There and Back Again: Self-supervised Multispectral Correspondence Estimation

2021 IEEE International Conference on Robotics and Automation (ICRA)

Across a wide range of applications, from autonomous vehicles to medical imaging, multi-spectral images provide an opportunity to extract additional information not present in color images. One of the most important steps in making this information readily available is the accurate estimation of dense correspondences between different spectra. Due to the nature of cross-spectral images, most correspondence solving techniques for the visual domain are simply not applicable. Furthermore, most cross-spectral techniques utilize spectra-specific characteristics to perform the alignment. In this work, we aim to address the dense correspondence estimation problem in a way that generalizes to more than one spectrum. We do this by introducing a novel cycle-consistency metric that allows us to self-supervise. This, combined with our spectra-agnostic loss functions, allows us to train the same network across multiple spectra. We demonstrate our approach on the challenging task of dense RGB-FIR correspondence estimation. We also show the performance of our unmodified network on the cases of RGB-NIR and RGB-RGB, where we achieve higher accuracy than similar self-supervised approaches. Our work shows that cross-spectral correspondence estimation can be solved in a common framework that learns to generalize alignment across spectra.

Unsupervised Learning for Stereo Matching Using Single-View Videos

IEEE Access

This paper proposes an unsupervised approach to constructing a deep-learning-based stereo matching method using single-view videos (SMV). From videos, a set of corresponding points is computed between images, and image patches centered at the computed points are extracted. Negative and positive samples constitute a dataset to train a similarity network that is then used as a matching cost function. In addition, we propose a local-global matching cost network that combines the first feature maps (local features) with the last feature maps (global features) as the output feature of the network. The concatenated features are fed to fully connected layers, and the network outputs a similarity measure for an image patch pair as a matching cost. Computed matching costs are aggregated using semiglobal matching and cross-based cost aggregation, followed by sub-pixel interpolation, a left-right consistency check, and median and bilateral filtering. We evaluate the proposed stereo matching method on popular stereo matching datasets, including KITTI 2012, KITTI 2015, and Middlebury, and submit the disparity maps to their benchmark servers to evaluate the performance of SMV. We also compare the generalization of SMV and baseline methods using the training sets of the three datasets. The benchmark results show that SMV is the most accurate among unsupervised approaches, and it even outperforms several supervised deep-learning-based stereo matching methods. The generalization results show that SMV is comparable to the baseline method, MC-CNN, which is trained with supervision.

Index Terms: Stereo matching, unsupervised learning, video extraction.
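The left-right consistency check listed among the post-processing steps can be sketched for a single scanline. This is a generic illustration, not the SMV pipeline's code: it assumes the usual convention that a left-image pixel at column `x` with disparity `d` corresponds to right-image column `x - d`, and the function name and `threshold` parameter are hypothetical.

```python
def left_right_check(disp_left, disp_right, threshold=0.5):
    """Flag left-image disparities that agree with the right-image map.

    disp_left[x] is the disparity at column x of the left scanline;
    the matching right-image column is assumed to be x - disp_left[x].
    """
    width = len(disp_left)
    valid = []
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))  # corresponding column in the right image
        ok = 0 <= xr < width and abs(d - disp_right[xr]) <= threshold
        valid.append(ok)
    return valid


disp_left = [0, 1, 1, 3]
disp_right = [1, 1, 0, 0]
print(left_right_check(disp_left, disp_right))
```

Pixels that fail the check are typically treated as occluded or mismatched and filled in by later steps such as interpolation or filtering.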