NVS-MonoDepth: Improving Monocular Depth Prediction with Novel View Synthesis

RealMonoDepth: Self-Supervised Monocular Depth Estimation for General Scenes

ArXiv, 2020

We present a generalised self-supervised learning approach for monocular estimation of the real depth across scenes with diverse depth ranges from 1--100s of meters. Existing supervised methods for monocular depth estimation require accurate depth measurements for training. This limitation has led to the introduction of self-supervised methods that are trained on stereo image pairs with a fixed camera baseline to estimate disparity, which is transformed to depth given known calibration. Self-supervised approaches have demonstrated impressive results but do not generalise to scenes with different depth ranges or camera baselines. In this paper, we introduce RealMonoDepth, a self-supervised monocular depth estimation approach which learns to estimate the real scene depth for a diverse range of indoor and outdoor scenes. A novel loss function with respect to the true scene depth, based on relative depth scaling and warping, is proposed. This allows self-supervised training of a single netw...
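The warping-based self-supervision described in this abstract can be illustrated with a minimal sketch: reconstruct one rectified view from the other using a per-pixel disparity map, then penalise the photometric error. The function names, the nearest-neighbour sampling, and the L1 penalty below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def warp_horizontal(src, disparity):
    """Reconstruct the target view by sampling `src` shifted left by
    `disparity` pixels (nearest-neighbour, border columns clamped).

    src: (H, W) grayscale image; disparity: (H, W) shift in pixels.
    """
    h, w = src.shape
    cols = np.arange(w)[None, :] - disparity           # source column per pixel
    cols = np.clip(np.round(cols).astype(int), 0, w - 1)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return src[rows, cols]

def photometric_l1(target, src, disparity):
    """Mean absolute photometric error after warping `src` toward `target`."""
    return float(np.abs(target - warp_horizontal(src, disparity)).mean())
```

With the correct disparity the warped image matches the target and the loss vanishes, so minimising this loss over a network's disparity predictions provides supervision without any depth labels.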

CNN Based Monocular Depth Estimation

E3S Web of Conferences, 2021

In several applications, such as scene interpretation and reconstruction, precise depth measurement from images is a significant challenge. Current depth estimation techniques frequently produce fuzzy, low-resolution estimates. Using transfer learning, this research trains a convolutional neural network to generate a high-resolution depth map from a single RGB image. With a typical encoder-decoder architecture, we initialize the encoder with features extracted from high-performing pre-trained networks, together with augmentation and training procedures that lead to more accurate results. We demonstrate that, even with a very basic decoder, our approach can produce complete high-resolution depth maps. A wide range of deep learning approaches have recently been presented, and they have shown significant promise in dealing with this classical ill-posed problem. The studies are carried out on KITTI and NYU Depth v2, two widely used public datasets. We also examine...

Learning Monocular Depth by Distilling Cross-Domain Stereo Networks

Computer Vision – ECCV 2018, 2018

Monocular depth estimation aims at estimating a pixelwise depth map for a single image, which has wide applications in scene understanding and autonomous driving. Existing supervised and unsupervised methods face great challenges. Supervised methods require large amounts of depth measurement data, which are generally difficult to obtain, while unsupervised methods are usually limited in estimation accuracy. Synthetic data generated by graphics engines provide a possible solution for collecting large amounts of depth data. However, the large domain gaps between synthetic and realistic data make directly training with them challenging. In this paper, we propose to use the stereo matching network as a proxy to learn depth from synthetic data and use predicted stereo disparity maps for supervising the monocular depth estimation network. Cross-domain synthetic data could be fully utilized in this novel framework. Different strategies are proposed to ensure that the learned depth perception capability transfers well across domains. Our extensive experiments show state-of-the-art monocular depth estimation results on the KITTI dataset.
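The distillation setup in this abstract — a stereo network's disparity maps serving as proxy labels for the monocular network — can be sketched in a few lines. The left-right consistency filter and the plain L1 loss below are common choices for this kind of proxy supervision, assumed here for illustration rather than taken from the paper:

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, thresh=1.0):
    """Mask of pixels whose left-view disparity agrees with the right-view
    disparity sampled at the matching column -- a common way to filter
    unreliable stereo proxy labels before distillation."""
    h, w = disp_left.shape
    cols = np.clip(np.round(np.arange(w)[None, :] - disp_left).astype(int), 0, w - 1)
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return np.abs(disp_left - disp_right[rows, cols]) < thresh

def distill_loss(mono_disp, proxy_disp, mask):
    """L1 distillation loss between the monocular prediction and the stereo
    proxy labels, restricted to pixels the consistency check trusts."""
    return float(np.abs(mono_disp - proxy_disp)[mask].mean())
```

Filtering before distillation matters because the stereo proxy is itself a prediction: pixels where the two views disagree (occlusions, textureless regions) would otherwise inject noise into the student network's supervision.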

A Lightweight Self-Supervised Training Framework for Monocular Depth Estimation

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Depth estimation attracts great interest in various sectors such as robotics, human-computer interfaces, intelligent visual surveillance, and wearable augmented-reality gear. Monocular depth estimation is of particular interest due to its low complexity and cost. Research in recent years has shifted away from supervised learning towards unsupervised or self-supervised approaches. While there have been great achievements, most of the research has focused on large, heavy networks that are highly resource-intensive, making them unsuitable for systems with limited resources. We are particularly concerned about the increased training complexity that current self-supervised approaches bring. In this paper, we propose a lightweight self-supervised training framework which utilizes computationally cheap methods to compute ground-truth approximations. In particular, we utilize a stereo pair of images during training, which is used to compute a photometric reprojection loss and a disparity ground-truth approximation. Thanks to the ground-truth approximation, our framework removes the need for pose estimation and the corresponding heavy prediction networks that current self-supervised methods require. In the experiments, we demonstrate that our framework is capable of increasing the generator's performance at a fraction of the size required by the current state-of-the-art self-supervised approach.
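A "computationally cheap disparity ground-truth approximation" of the kind this abstract describes could be as simple as classic winner-take-all block matching over a rectified stereo pair. The sketch below is one such cheap baseline, assumed for illustration; the paper's actual approximation method is not specified here:

```python
import numpy as np

def sad_disparity(left, right, max_disp=4):
    """Cheap winner-take-all matching: for every left-image pixel, pick the
    horizontal shift into the right image that minimises the absolute
    intensity difference. No learned pose network is involved."""
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # left pixel (y, x) is compared against right pixel (y, x - d)
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, : w - d])
    return cost.argmin(axis=0).astype(float)
```

Because the stereo baseline is fixed and known, such a disparity map doubles as an approximate depth label, which is what lets a framework like this drop the pose-estimation network that video-based self-supervision requires.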

Semantically-Guided Representation Learning for Self-Supervised Monocular Depth

2020

Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage this semantic structure more directly to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture that leverages fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic category.
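The core idea of a pixel-adaptive convolution — a fixed kernel re-weighted at every position by the similarity of guidance features — can be shown in one dimension. The Gaussian similarity, the 1-D setting, and the border clamping below are simplifying assumptions for illustration, not the paper's exact formulation:

```python
import math
import numpy as np

def pixel_adaptive_conv1d(signal, guide, kernel, sigma=1.0):
    """1-D sketch of a pixel-adaptive convolution: a fixed kernel is
    re-weighted at each position by a Gaussian similarity of guidance
    features, so the filter adapts to (e.g. semantic) content."""
    k = len(kernel) // 2
    out = np.zeros(len(signal))
    for i in range(len(signal)):
        acc, norm = 0.0, 0.0
        for j in range(-k, k + 1):
            p = min(max(i + j, 0), len(signal) - 1)    # clamp at the borders
            wt = kernel[j + k] * math.exp(-((guide[i] - guide[p]) ** 2)
                                          / (2 * sigma ** 2))
            acc += wt * signal[p]
            norm += wt
        out[i] = acc / norm                            # normalised response
    return out
```

With a flat guidance signal this reduces to an ordinary normalised convolution; with a guidance signal that jumps at a semantic boundary, the kernel weights collapse across the boundary and the filter stops mixing the two regions, which is exactly the behaviour that makes semantic guidance useful for depth features.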

MobileXNet: An Efficient Convolutional Neural Network for Monocular Depth Estimation

IEEE Transactions on Intelligent Transportation Systems

Depth is a vital piece of information for autonomous vehicles to perceive obstacles. Due to the relatively low price and small size of monocular cameras, depth estimation from a single RGB image has attracted great interest in the research community. In recent years, the application of Deep Neural Networks (DNNs) has significantly boosted the accuracy of monocular depth estimation (MDE). State-of-the-art methods are usually designed on top of complex and extremely deep network architectures, which require more computational resources and cannot run in real-time without using high-end GPUs. Although some researchers tried to accelerate the running speed, the accuracy of depth estimation is degraded because the compressed model does not represent images well. In addition, the inherent characteristic of the feature extractor used by the existing approaches results in severe spatial information loss in the produced feature maps, which also impairs the accuracy of depth estimation on small-sized images. In this study, we are motivated to design a novel and efficient Convolutional Neural Network (CNN) that assembles two shallow encoder-decoder-style subnetworks in succession to address these problems. In particular, we place our emphasis on the trade-off between the accuracy and speed of MDE. Extensive experiments have been conducted on the NYU Depth v2, KITTI, Make3D and Unreal data sets. Compared with the state-of-the-art approaches which have an extremely deep and complex architecture, the proposed network not only achieves comparable performance but also runs at a much faster speed on a single, less powerful GPU.

Index Terms: Monocular depth estimation, depth prediction, convolutional neural networks, encoder-decoder, autonomous vehicles.

Deep Classification Network for Monocular Depth Estimation

2019

Monocular depth estimation is usually treated as a supervised regression problem, yet it is closely related to semantic segmentation: both are fundamentally pixel-level classification tasks. We discretize depth values using increments that grow with depth, apply DeepLab v2 to the resulting classification problem, and obtain higher accuracy. We achieve a state-of-the-art result on the KITTI dataset, outperforming existing architectures by an 8% margin.
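One standard way to realise "depth increments that grow with depth" is to space the bin edges uniformly in log depth, so nearby depths get fine bins and far depths get coarse ones. The sketch below shows that discretization; the function names and the specific log-uniform spacing are illustrative assumptions, not necessarily the paper's exact scheme:

```python
import math

def sid_edges(d_min, d_max, num_bins):
    """Bin edges spaced uniformly in log depth, so each bin is wider than
    the last -- depth increments that increase with depth."""
    return [d_min * (d_max / d_min) ** (i / num_bins)
            for i in range(num_bins + 1)]

def depth_to_bin(depth, d_min, d_max, num_bins):
    """Class index of a metric depth under the log-spaced discretization."""
    t = math.log(depth / d_min) / math.log(d_max / d_min)
    return min(num_bins - 1, max(0, int(t * num_bins)))
```

For example, with `d_min=1`, `d_max=81`, and 4 bins the edges are 1, 3, 9, 27, 81 metres: a 2 m wide class near the camera and a 54 m wide class far away, matching the intuition that absolute depth errors matter less at range.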

Self-Supervised Correlational Monocular Depth Estimation using ResVGG Network

Proceedings of The 7th International Conference on Intelligent Systems and Image Processing 2019

Self-supervised monocular depth estimation (SMDE) has recently received significant attention in computer vision. Leveraging the development of deep learning approaches, SMDE provides a solution for applications in automation, navigation, and scene understanding. In this paper, we propose a novel training objective and learning network to perform single-image depth estimation with a convolutional neural network without ground-truth depth data. The proposed training objective enables the network to learn stereo image correlation during training and to estimate depth from a single input image at prediction time. The proposed learning network, ResVGG, is a hybrid structure of ResNet-50 and VGG-16. ResVGG performs similarly to ResNet-50 but at a much lower computational cost. We demonstrate that our proposed method has competitive accuracy compared to the current state of the art on the KITTI dataset and achieves a frame rate of 32 frames per second (FPS) in prediction using a single NVIDIA GTX 1080 GPU. Furthermore, the proposed method can potentially support visual odometry depth estimation.