Two-stage deep regression enhanced depth estimation from a single RGB image
Related papers
Depth estimation of a single RGB image with semi-supervised two-stage regression
Proceedings of the 5th International Conference on Communication and Information Processing, 2019
Obtaining accurate depth estimation at low computational cost is a major problem in the field of computer vision. To tackle this problem, we propose a framework that integrates different neural networks for predicting the corresponding depth from a single RGB image and sparse depth samples. This method combines two types of deep learning frameworks with strong performance: an improved Residual Neural Network (ResNet) and a conditional generative adversarial network (cGAN). Prior work has shown that the improved ResNet has strong depth prediction capability, but its depth maps remain incomplete in detail. We improve the existing cGAN model to enhance the ResNet-based depth prediction. Experiments comparing against the state of the art are performed on publicly available datasets, and the results demonstrate that the proposed two-stage deep regression model is superior to other existing methods of the same type. CCS Concepts: Computing methodologies → Computer graphics → Image manipulation → Image processing.
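The two-stage idea lends itself to a compact illustration. Below is a minimal PyTorch sketch, assuming a ResNet-18 backbone for the first stage and a residual cGAN generator for the refinement stage (the layer sizes, backbone choice, and residual connection are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Stage1Depth(nn.Module):
    """Stage 1: coarse depth prediction from RGB with a ResNet encoder."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Drop the average pool and fc head; keep the convolutional trunk.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(512, 1, kernel_size=3, padding=1),
        )

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))

class RefinerG(nn.Module):
    """Stage 2: cGAN generator that refines the coarse map, conditioned on RGB."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, rgb, coarse):
        # Predict a residual correction on top of the stage-1 estimate.
        return coarse + self.net(torch.cat([rgb, coarse], dim=1))
```

In training, RefinerG would be paired with a discriminator that judges (RGB, depth) pairs, as is standard for cGANs.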
IOS Press eBooks, 2021
Pose estimation is typically performed from 3D images. In contrast, estimating pose from a single RGB image is still a difficult task. RGB images represent not only objects' shape but also intensity, which depends on viewpoint, texture, and lighting conditions, whereas 3D pose estimation from depth images is considered a promising approach since a depth image represents only an object's shape. Thus, an appropriate method is needed to predict a depth image from a 2D RGB image, which can then be used for 3D pose estimation. In this paper, we propose a promising approach based on a deep learning model for depth estimation in order to improve 3D pose estimation. The proposed model consists of two successive networks. The first network is an autoencoder that maps from the RGB domain to the depth domain. The second network is a discriminator that compares a real depth image to a generated depth image, pushing the first network to generate accurate depth images. In this work, we do not use real depth images corresponding to the input color images. Our contribution is to use 3D CAD models corresponding to the objects appearing in the color images to render depth images from different viewpoints. These rendered images are then used as ground truth to guide the autoencoder network in learning the mapping from the image domain to the depth domain. The proposed model outperforms state-of-the-art models on the public PASCAL 3D+ dataset.
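A minimal sketch of this generator/discriminator pairing, assuming an L1 reconstruction term against the rendered depth maps and a standard BCE adversarial term (the layer sizes and the weighting factor lam are illustrative):

```python
import torch
import torch.nn as nn

# G: RGB -> depth autoencoder; D: judges depth maps (real = rendered from CAD models).
G = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),   # encode
    nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),   # decode
)
D = nn.Sequential(
    nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
)

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(rgb, rendered_depth, lam=10.0):
    fake = G(rgb)
    adv = bce(D(fake), torch.ones(rgb.size(0), 1))  # try to fool the discriminator
    rec = l1(fake, rendered_depth)                  # match the rendered ground truth
    return adv + lam * rec
```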
Robust Multimodal Depth Estimation using Transformer based Generative Adversarial Networks
Proceedings of the 30th ACM International Conference on Multimedia
Accurately measuring the absolute depth of every pixel captured by an imaging sensor is of critical importance in real-time applications such as autonomous navigation, augmented reality and robotics. In order to predict dense depth, a general approach is to fuse sensor inputs from different modalities such as LiDAR, camera and other time-of-flight sensors. LiDAR and other time-of-flight sensors provide accurate depth data but are quite sparse, both spatially and temporally. To augment missing depth information, generally RGB guidance is leveraged due to its high resolution information. Due to the reliance on multiple sensor modalities, design for robustness and adaptation is essential. In this work, we propose a transformer-like self-attention based generative adversarial network to estimate dense depth using RGB and sparse depth data. We introduce a novel training recipe for making the model robust so that it works even when one of the input modalities is not available. The multi-head self-attention mechanism can dynamically attend to the most salient parts of the RGB image or the corresponding sparse depth data, producing the most competitive results. Our proposed network also requires less memory for training and inference compared to other existing heavily residual connection based convolutional neural networks, making it more suitable for resource-constrained edge applications. The source code is available at: https://github.com/kocchop/robust-multimodal-fusion-gan. CCS Concepts: Computing methodologies → Reconstruction; Adversarial learning; Perception; Robustness; Hardware → Sensor applications and deployments.
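The "training recipe for robustness" described above can be illustrated by randomly suppressing one input modality during training, so the fusion network learns to cope with a missing sensor. A minimal sketch, assuming zeroed tensors stand in for an absent modality and a hypothetical drop rate p_drop:

```python
import random
import torch

def modality_dropout(rgb, sparse_depth, p_drop=0.3):
    """Randomly blank one modality per batch so the model learns to rely
    on whichever input is available (p_drop is an assumed hyperparameter)."""
    r = random.random()
    if r < p_drop / 2:
        rgb = torch.zeros_like(rgb)                    # simulate camera dropout
    elif r < p_drop:
        sparse_depth = torch.zeros_like(sparse_depth)  # simulate LiDAR dropout
    return rgb, sparse_depth
```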
On Regression Losses for Deep Depth Estimation
2018 25th IEEE International Conference on Image Processing (ICIP), 2018
Depth estimation from a single monocular image has reached strong performance thanks to recent works based on deep networks. However, as various choices of losses, architectures and experimental conditions are proposed in the literature, it is difficult to establish their respective influence on performance. In this paper we propose an in-depth study of various losses and experimental conditions for depth regression on the NYUv2 dataset. From this study we propose a new network for depth estimation combining an encoder-decoder architecture with an adversarial loss. This network achieves state-of-the-art results on the NYUv2 dataset while being simpler to train in a single phase.
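For reference, two losses that commonly appear in such comparisons are the reverse Huber (berHu) and the scale-invariant log loss of Eigen et al.; the abstract does not list the paper's exact loss set, so the sketch below shows the standard formulations:

```python
import torch

def berhu(pred, gt):
    """Reverse Huber: L1 for small residuals, scaled L2 beyond a threshold c."""
    err = (pred - gt).abs()
    c = (0.2 * err.max()).clamp(min=1e-6)
    l2 = (err ** 2 + c ** 2) / (2 * c)
    return torch.where(err <= c, err, l2).mean()

def scale_invariant_log(pred, gt, lam=0.5, eps=1e-6):
    """Scale-invariant log loss (Eigen et al., 2014)."""
    d = torch.log(pred + eps) - torch.log(gt + eps)
    return (d ** 2).mean() - lam * d.mean() ** 2
```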
DeepDNet: Deep Dense Network for Depth Completion Task
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
In this paper, we propose a Deep Dense Network for the Depth Completion Task (DeepDNet) towards generating a dense depth map from sparse depth and a captured view. A wide variety of scene understanding applications such as 3D reconstruction, mixed reality, and robotics demand accurate and dense depth maps. Existing depth sensors capture accurate and reliable sparse depth but find challenges in acquiring dense depth maps. Towards this, we utilise the accurate sparse depth together with the RGB image as input to generate dense depth. We model the transformation of random sparse input to grid-based sparse input using quad-tree decomposition. We propose a Dense-Residual-Skip (DRS) autoencoder along with attention to edge preservation using a Gradient Aware Mean Squared Error (GAMSE) loss. We demonstrate our results on the NYUv2 dataset and compare with other state-of-the-art methods. We also show results on sparse depth captured by the ARCore depth API alongside its dense depth maps. Extensive experiments suggest consistent improvements over existing methods.
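The abstract does not give the GAMSE formula; one plausible reading is a mean squared error whose per-pixel weight grows near ground-truth depth edges, so that edges are penalised more heavily. A sketch under that assumption (the finite-difference edge map and the weight alpha are illustrative):

```python
import torch
import torch.nn.functional as F

def gamse(pred, gt, alpha=1.0):
    """Gradient-aware MSE sketch: up-weight squared error near depth edges.
    This is an interpretation of the GAMSE idea, not the paper's formula."""
    gx = gt[..., :, 1:] - gt[..., :, :-1]   # horizontal depth gradients
    gy = gt[..., 1:, :] - gt[..., :-1, :]   # vertical depth gradients
    edge = F.pad(gx.abs(), (0, 1)) + F.pad(gy.abs(), (0, 0, 0, 1))
    weight = 1.0 + alpha * edge             # higher weight at edges
    return (weight * (pred - gt) ** 2).mean()
```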
Depth Estimation From a Single RGB Image Using Fine-Tuned Generative Adversarial Network
IEEE Access, 2021
Estimating the depth map from a single RGB image is important for understanding the nature of the terrain in robot navigation and has attracted considerable attention in the past decade. Existing approaches can accurately estimate depth from a single RGB image in a highly structured environment. The problem becomes more challenging when the terrain is highly dynamic. We propose a fine-tuned generative adversarial network to estimate the depth map effectively for a given single RGB image. The proposed network is composed of a fine-tuned generator and a global discriminator. The encoder part of the generator takes input RGB images and depth maps and generates their joint distribution in the latent space. Subsequently, the decoder part of the generator decodes the depth map from the joint distribution. The discriminator takes real and fake pairs in three different configurations and then guides the generator to estimate the depth map from the given RGB image accordingly. Finally, we conducted extensive experiments on a highly dynamic environment dataset to verify the effectiveness and feasibility of the proposed approach. The proposed approach decodes the depth map from the joint distribution more effectively and accurately than existing approaches. Index Terms: Generative adversarial network, convolutional neural network, image translation, autoencoders.
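The abstract does not spell out the three real/fake configurations, but a common recipe for conditional GANs pairs matched, generated, and mismatched inputs. A sketch under that assumption (the two-argument discriminator signature and the mismatched pairing are hypothetical):

```python
import torch

def discriminator_step(D, rgb, real_depth, fake_depth, bce):
    """Three pairings: (rgb, real) -> real; (rgb, fake) -> fake;
    (shuffled rgb, real) -> fake, which breaks RGB-depth correspondence."""
    ones = torch.ones(rgb.size(0), 1)
    zeros = torch.zeros(rgb.size(0), 1)
    mismatched = rgb[torch.randperm(rgb.size(0))]
    loss = (bce(D(rgb, real_depth), ones)
            + bce(D(rgb, fake_depth.detach()), zeros)
            + bce(D(mismatched, real_depth), zeros))
    return loss / 3
```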
CNN Based Monocular Depth Estimation
E3S Web of Conferences, 2021
In several applications, such as scene interpretation and reconstruction, precise depth measurement from images is a significant challenge. Current depth estimation techniques frequently produce fuzzy, low-resolution estimates. Using transfer learning, this research implements a convolutional neural network for generating a high-resolution depth map from a single RGB image. With a typical encoder-decoder architecture, we initialize the encoder with features extracted from high-performing pre-trained networks, together with augmentation and training procedures that lead to more accurate outcomes. We demonstrate how, even with a very basic decoder, our approach can produce complete high-resolution depth maps. A wide number of deep learning approaches have recently been presented, and they have shown significant promise in dealing with this classical ill-posed problem. The studies are carried out using KITTI and NYU Depth v2, two widely utilized public datasets. We also examine...
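A minimal sketch of this transfer-learning setup, assuming a DenseNet-169 encoder initialised from ImageNet weights and a deliberately simple decoder (both choices are assumptions; the abstract only says "high-performing pre-trained networks" and "a very basic decoder"):

```python
import torch.nn as nn
import torchvision.models as models

class TransferDepth(nn.Module):
    """Pretrained encoder + basic upsampling decoder for monocular depth."""
    def __init__(self):
        super().__init__()
        # DenseNet-169 feature trunk outputs 1664 channels at 1/32 resolution.
        self.encoder = models.densenet169(weights="IMAGENET1K_V1").features
        self.decoder = nn.Sequential(
            nn.Conv2d(1664, 256, kernel_size=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```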
FCDSN-DC: An Accurate and Lightweight Convolutional Neural Network for Stereo Estimation with Depth Completion
Cornell University - arXiv, 2022
We propose an accurate and lightweight convolutional neural network for stereo estimation with depth completion. We name this method fully-convolutional deformable similarity network with depth completion (FCDSN-DC). This method extends FC-DCNN by improving the feature extractor, adding a network structure for training highly accurate similarity functions and a network structure for filling in inconsistent disparity estimates. The whole method consists of three parts. The first part consists of fully-convolutional densely connected layers that compute expressive features of rectified image pairs. The second part of our network learns highly accurate similarity functions between these learned features. It consists of densely connected convolution layers with a deformable convolution block at the end to further improve the accuracy of the results. After this step an initial disparity map is created and a left-right consistency check is performed in order to remove inconsistent points. The last part of the network then uses this input together with the corresponding left RGB image to train a network that fills in the missing measurements. Consistent depth estimates are gathered around invalid points and are passed together with the RGB points into a shallow CNN structure to recover the missing values. We evaluate our method on challenging real-world indoor and outdoor scenes, in particular Middlebury, KITTI and ETH3D, where it produces competitive results. We furthermore show that this method generalizes well and is well suited for many applications without the need for further training.
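The left-right consistency check mentioned in the pipeline can be sketched as follows; the tolerance tau and the NaN marking of invalid points are illustrative choices:

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, tau=1.0):
    """Invalidate pixels whose left and right disparities disagree.
    A left pixel at column x maps to column x - d in the right image."""
    h, w = disp_left.shape
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    xr = np.clip(np.rint(xs - disp_left).astype(int), 0, w - 1)
    diff = np.abs(disp_left - disp_right[ys, xr])
    out = disp_left.copy()
    out[diff > tau] = np.nan   # invalid points, to be filled by the completion net
    return out
```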
MobileXNet: An Efficient Convolutional Neural Network for Monocular Depth Estimation
IEEE Transactions on Intelligent Transportation Systems
Depth is a vital piece of information for autonomous vehicles to perceive obstacles. Due to the relatively low price and small size of monocular cameras, depth estimation from a single RGB image has attracted great interest in the research community. In recent years, the application of Deep Neural Networks (DNNs) has significantly boosted the accuracy of monocular depth estimation (MDE). State-of-the-art methods are usually designed on top of complex and extremely deep network architectures, which require more computational resources and cannot run in real-time without high-end GPUs. Although some researchers have tried to accelerate the running speed, the accuracy of depth estimation degrades because the compressed model does not represent images well. In addition, the inherent characteristics of the feature extractor used by existing approaches result in severe spatial information loss in the produced feature maps, which also impairs the accuracy of depth estimation on small-sized images. In this study, we design a novel and efficient Convolutional Neural Network (CNN) that assembles two shallow encoder-decoder style subnetworks in succession to address these problems. In particular, we place our emphasis on the trade-off between the accuracy and speed of MDE. Extensive experiments have been conducted on the NYU Depth v2, KITTI, Make3D and Unreal datasets. Compared with state-of-the-art approaches that have extremely deep and complex architectures, the proposed network not only achieves comparable performance but also runs at a much faster speed on a single, less powerful GPU. Index Terms: Monocular depth estimation, depth prediction, convolutional neural networks, encoder-decoder, autonomous vehicles.
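A minimal sketch of the "two shallow encoder-decoder subnetworks in succession" idea (channel widths and depths are assumptions; the actual MobileXNet design is more elaborate):

```python
import torch.nn as nn

def shallow_unit(in_ch, out_ch):
    """One shallow encoder-decoder subnetwork."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # encode
        nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),               # decode
    )

class TwoStageShallowNet(nn.Module):
    """Two shallow encoder-decoders assembled in succession."""
    def __init__(self):
        super().__init__()
        self.net1 = shallow_unit(3, 16)   # RGB -> intermediate features
        self.net2 = shallow_unit(16, 1)   # features -> depth

    def forward(self, x):
        return self.net2(self.net1(x))
```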
A Review of Benchmark Datasets and Training Loss Functions in Neural Depth Estimation
IEEE Access, 2021
In many applications, such as robotic perception, scene understanding, augmented reality, 3D reconstruction, and medical image analysis, estimating depth from images is a fundamentally ill-posed problem. The success of depth estimation models relies on assembling a suitably large and diverse training dataset and on the selection of appropriate loss functions. It is critical for researchers in this field to be aware of the wide range of publicly available depth datasets along with the properties of the various loss functions that have been applied to depth estimation. Selection of the right training data combined with appropriate loss functions will accelerate new research and enable better comparison with the state of the art. Accordingly, this work offers a comprehensive review of available depth datasets as well as the loss functions that are applied in this problem domain. The depth datasets are categorised into five primary categories based on their application, namely (i) people detection and action recognition, (ii) faces and facial pose, (iii) perception-based navigation (i.e., street signs, roads), (iv) object and scene recognition, and (v) medical applications. The important characteristics and properties of each depth dataset are described and compared. A mixing strategy for depth datasets is presented in order to generalise model results across different environments and use cases. Furthermore, depth estimation loss functions that can help with training deep learning depth estimation models across different datasets are discussed. Evaluations of state-of-the-art deep learning-based depth estimation methods are presented for three of the most popular datasets. Finally, challenges and future research directions are discussed, along with recommendations for building comprehensive depth datasets, to help researchers select appropriate datasets and loss functions for evaluating their results and algorithms.
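The dataset-mixing strategy the review advocates can be sketched with standard PyTorch utilities, sampling each batch across several depth datasets according to per-dataset weights (the weighting scheme is an assumption):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(datasets, weights, batch_size=16):
    """Build a loader that draws samples from several depth datasets at once,
    so a model sees varied environments in every batch."""
    mixed = ConcatDataset(datasets)
    # Expand one weight per dataset into one weight per sample.
    per_sample = torch.cat([
        torch.full((len(d),), float(w)) for d, w in zip(datasets, weights)
    ])
    sampler = WeightedRandomSampler(per_sample, num_samples=len(mixed))
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```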