Deep Camera Pose Regression Using Pseudo-LiDAR

I2D-Loc: Camera localization via image to LiDAR depth flow

ISPRS Journal of Photogrammetry and Remote Sensing, 2022

Accurate camera localization in existing LiDAR maps is promising since it potentially allows exploiting the strengths of both LiDAR-based and camera-based methods. However, effective methods that robustly address the appearance and modality differences for 2D-3D localization are still missing. To overcome these problems, we propose I2D-Loc, a scene-agnostic and end-to-end trainable neural network that estimates the 6-DoF pose from an RGB image to an existing LiDAR map by local optimization around an initial pose. Specifically, we first project the LiDAR map onto the image plane according to a rough initial pose and utilize a depth completion algorithm to generate a dense depth image. We further design a confidence map to weight the features extracted from the dense depth, yielding a more reliable depth representation. Then, we propose a neural network to estimate the correspondence flow between the depth and RGB images. Finally, we utilize the BPnP algorithm to estimate the 6-DoF pose, computing the gradients of the pose error to optimize the front-end network parameters. Moreover, by decoupling the camera intrinsic parameters from the end-to-end training process, I2D-Loc generalizes to images with different intrinsic parameters. Experiments on the KITTI, Argoverse, and Lyft5 datasets demonstrate that I2D-Loc achieves centimeter-level localization performance. The source code, dataset, trained models, and demo videos are released at https://levenberg.github.io/I2D-Loc/.
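
As a rough illustration of the first stage described above (and only that stage), the sketch below projects a LiDAR map into the image plane under an initial pose to form a sparse depth image. Function and variable names are my own, and the paper's depth completion, confidence weighting, flow estimation, and BPnP steps are not reproduced here.

```python
import numpy as np

def project_lidar_to_depth(points_world, T_cam_world, K, image_size):
    """Project a LiDAR map (N x 3, world frame) into a sparse depth image.

    points_world : (N, 3) array of map points.
    T_cam_world  : (4, 4) rough initial camera pose (world -> camera).
    K            : (3, 3) camera intrinsics.
    image_size   : (height, width) of the target depth image.
    """
    h, w = image_size
    # Transform map points into the camera frame.
    pts_h = np.hstack([points_world, np.ones((points_world.shape[0], 1))])
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Pinhole projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    z = pts_cam[:, 2]

    # Keep projections that land inside the image.
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[valid], v[valid], z[valid]

    # Z-buffer: keep the nearest depth per pixel.
    depth = np.full((h, w), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0  # 0 marks pixels with no LiDAR return
    return depth
```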

Urban Localization with Street Views using a Convolutional Neural Network for End-to-End Camera Pose Regression

IEEE Intelligent Vehicles Symposium, 2019

This paper presents an end-to-end real-time monocular absolute localization approach that uses Google Street View panoramas as a prior source of information to train a Convolutional Neural Network (CNN). We propose an adaptation of the PoseNet architecture [8] to a sparse database of panoramas. We show that we can expand this database by synthesizing new images and consequently improve the accuracy of the pose regressor. The main advantage of our method is that it does not require a first passage of an equipped vehicle to build a map. Moreover, the offline data generation and CNN training are automatic and do not require the input of an operator. In the online phase, the approach uses only one camera for localization and regresses poses in a global frame. The conducted experiments show that augmenting the training set as presented in this paper drastically improves the accuracy of the CNN. The results, when compared to a handcrafted-feature-based approach, are less accurate (around 7.5 to 8 m against 2.5 to 3 m) but also less dependent on the position of the camera inside the vehicle. Furthermore, our CNN-based method computes the pose approximately 40 times faster (75 ms per image instead of 3 s) than the handcrafted approach.
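
Since the approach adapts PoseNet, a minimal sketch of a PoseNet-style regression loss (translation error plus a beta-weighted quaternion term) may help make the training objective concrete. The beta value and loss variant actually used for the Street View adaptation are not specified here; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def posenet_loss(t_pred, q_pred, t_gt, q_gt, beta=500.0):
    """PoseNet-style pose regression loss: translation error plus a
    beta-weighted quaternion orientation error. beta is a tuning
    hyper-parameter balancing the two terms."""
    q_pred = F.normalize(q_pred, dim=-1)          # force a unit quaternion
    t_loss = torch.norm(t_pred - t_gt, dim=-1)    # position error (metres)
    q_loss = torch.norm(q_pred - q_gt, dim=-1)    # quaternion distance
    return (t_loss + beta * q_loss).mean()
```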

LiDAR ICPS-net: Indoor Camera Positioning based-on Generative Adversarial Network for RGB to Point-Cloud Translation

arXiv, 2019

Indoor positioning aims at navigation inside areas with no GPS data available and can be employed in many applications such as augmented reality and autonomous driving, especially inside closed areas and tunnels. In this paper, a deep neural-network-based architecture is proposed to address this problem. A tandem set of convolutional neural networks together with a Pix2Pix GAN are leveraged to serve as the scene classifier, the scene RGB-image-to-point-cloud converter, and the position regressor, respectively. The proposed architecture outperforms previous works, including our recent work, in that it makes the data generation task easier and more robust against small scene variations, while the positioning accuracy remains remarkably good for both the Cartesian position and the quaternion orientation of the camera.

ICPS-net: an end-to-end RGB-based indoor camera positioning system using deep convolutional neural networks

Twelfth International Conference on Machine Vision (ICMV 2019), 2020

Indoor positioning and navigation inside an area with no GPS data available is a challenging problem. There are applications such as augmented reality, autonomous driving, and navigation of drones inside tunnels in which indoor positioning becomes crucial. In this paper, a tandem architecture of deep network-based systems is developed, for the first time to our knowledge, to address this problem. This structure is trained on scene images obtained by scanning the desired area segments using photogrammetry. A CNN based on EfficientNet is trained as a classifier of the scenes, followed by a MobileNet CNN that is trained to perform as a regressor. The proposed system achieves remarkably fine precision for both the Cartesian position and the quaternion orientation of the camera.
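
A minimal sketch of the classify-then-regress idea: a scene classifier selects the area segment, and a per-segment regressor outputs the camera position and quaternion. The module names and interfaces below are placeholders for illustration, not the paper's exact EfficientNet/MobileNet configuration.

```python
import torch
import torch.nn as nn

class TandemLocalizer(nn.Module):
    """Illustrative classify-then-regress pipeline: a scene classifier picks
    the segment, then a segment-specific regressor predicts the pose."""
    def __init__(self, scene_classifier: nn.Module, regressors: nn.ModuleList):
        super().__init__()
        self.scene_classifier = scene_classifier  # image -> segment logits
        self.regressors = regressors              # one pose regressor per segment

    def forward(self, image: torch.Tensor):
        # Assumes a single-image batch for simplicity.
        logits = self.scene_classifier(image)
        segment = int(logits.argmax(dim=1))
        pose = self.regressors[segment](image)    # [x, y, z, qw, qx, qy, qz]
        return segment, pose
```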

Depth-DensePose: an efficient densely connected deep learning model for camera-based localization

International Journal of Electrical and Computer Engineering (IJECE), 2022

Camera/image-based localization is important for many emerging applications such as augmented reality (AR), mixed reality, robotics, and self-driving. Camera localization is the problem of estimating both the camera position and orientation with respect to a scene. Use cases for camera localization depend on two key factors: accuracy and speed (latency). Therefore, this paper proposes Depth-DensePose, an efficient deep learning model for 6-degrees-of-freedom (6-DoF) camera-based localization. Depth-DensePose combines the advantages of DenseNets and an adapted depthwise separable convolution (DS-Conv) to build a deeper and more efficient network. The proposed model consists of iterative depth-dense blocks. Each depth-dense block contains two adapted DS-Conv layers with kernel sizes 3 and 5, which helps retain both low-level and high-level features. We evaluate Depth-DensePose on the Cambridge Landmarks dataset and show that it outperforms related deep learning models for camera-based localization. Furthermore, extensive experiments show that the adapted DS-Conv is more efficient than standard convolution, especially in terms of memory and processing time, which is important for real-time and mobile applications.
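
One possible reading of the block described above, sketched below: a depthwise separable convolution (per-channel conv followed by a 1x1 pointwise conv) applied at kernel sizes 3 and 5, with the outputs densely concatenated to the input in the spirit of DenseNet. This is an assumption-laden illustration, not the paper's exact layer configuration.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) conv
    followed by a 1x1 (pointwise) conv."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DepthDenseBlock(nn.Module):
    """Sketch of a densely connected block with two DS-Conv branches
    (kernel sizes 3 and 5); the input and both branch outputs are
    concatenated along the channel dimension."""
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.branch3 = DSConv(in_ch, growth, kernel_size=3)
        self.branch5 = DSConv(in_ch, growth, kernel_size=5)

    def forward(self, x):
        return torch.cat([x,
                          torch.relu(self.branch3(x)),
                          torch.relu(self.branch5(x))], dim=1)
```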

G2L-Net: Global to Local Network for Real-time 6D Pose Estimation with Embedding Vector Features

arXiv, 2020

In this paper, we propose a novel real-time 6D object pose estimation framework, named G2L-Net. Our network operates on point clouds from RGB-D detection in a divide-and-conquer fashion. Specifically, our network consists of three steps. First, we extract the coarse object point cloud from the RGB-D image by 2D detection. Second, we feed the coarse object point cloud to a translation localization network to perform 3D segmentation and object translation prediction. Third, via the predicted segmentation and translation, we transfer the fine object point cloud into a local canonical coordinate frame, in which we train a rotation localization network to estimate the initial object rotation. In the third step, we define point-wise embedding vector features to capture viewpoint-aware information. To compute a more accurate rotation, we adopt a rotation residual estimator to estimate the residual between the initial rotation and the ground truth, which boosts the initial pose estimation performance. Our proposed G2L-Net runs in real time despite the fact that multiple steps are stacked in the proposed coarse-to-fine framework. Extensive experiments on two benchmark datasets show that G2L-Net achieves state-of-the-art performance in terms of both accuracy and speed.
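
The "global to local" transfer in the third step can be illustrated in a few lines: points predicted as belonging to the object are expressed relative to the predicted translation before rotation regression. Names and shapes below are assumptions for illustration only.

```python
import torch

def canonicalize_points(points, seg_logits, t_pred):
    """Sketch of the global-to-local transfer: keep points classified as the
    object and subtract the predicted translation, yielding a translation-free
    canonical point cloud for the rotation localization network.

    points     : (N, 3) coarse object point cloud from RGB-D detection.
    seg_logits : (N, 2) per-point background/object logits.
    t_pred     : (3,) predicted object translation.
    """
    mask = seg_logits.argmax(dim=1) == 1   # points predicted as object
    return points[mask] - t_pred           # local canonical coordinates
```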

xyzNet: Towards Machine Learning Camera Relocalization by Using a Scene Coordinate Prediction Network

2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), 2018

Camera relocalization is a common problem in several applications such as augmented reality and robot navigation. Augmented reality in particular requires fast, accurate, and robust camera localization, yet it is still challenging to obtain a method that is both real-time and accurate. In this paper, we present a hybrid method combining a machine learning approach and a geometric approach for real-time camera relocalization from a single RGB image. We propose a light Convolutional Neural Network (CNN) called xyzNet to efficiently and robustly regress the 3D world coordinates of key-points in an image. The geometric information about these 2D-3D correspondences then allows the removal of ambiguous predictions and the computation of a more accurate camera pose. Moreover, we show favorable results compared to previous machine-learning-based approaches regarding the accuracy and runtime of our method on different datasets, as well as its capacity to handle dynamic scenes.
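
The geometric back end described above (turning predicted 2D-3D correspondences into a pose while discarding ambiguous predictions) is commonly implemented with RANSAC-PnP; a sketch using OpenCV follows. The thresholds are chosen arbitrarily rather than taken from the paper.

```python
import cv2
import numpy as np

def pose_from_scene_coordinates(pts_2d, pts_3d, K):
    """Recover a camera pose from predicted 2D-3D correspondences with
    RANSAC-PnP; RANSAC rejects ambiguous or inconsistent predictions.

    pts_2d : (N, 2) keypoint locations in the image.
    pts_3d : (N, 3) 3D world coordinates regressed by the network.
    K      : (3, 3) camera intrinsics.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64),
        K.astype(np.float64), None,
        reprojectionError=8.0, iterationsCount=100)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # world -> camera rotation matrix
    return R, tvec, inliers
```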

Pseudo-LiDAR From Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies, a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations, essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance, raising the detection accuracy of objects within the 30 m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https://github.com/mileyan/pseudo_lidar.
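
The core conversion is the standard pinhole back-projection of a depth map into a point cloud in the camera frame; a minimal sketch is below. The paper applies additional post-processing (e.g., changing to the LiDAR coordinate convention and pruning implausible points), which is omitted here.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Convert a dense depth map (H, W) into a pseudo-LiDAR point cloud
    (M, 3) in the camera frame, using the pinhole back-projection
    x = (u - cx) * z / fx,  y = (v - cy) * z / fy.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth
    valid = z > 0                       # keep pixels with a valid depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)
```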

ViPR: Visual-Odometry-aided Pose Regression for 6DoF Camera Localization

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020

Visual Odometry (VO) accumulates a positional drift in long-term robot navigation tasks. Although Convolutional Neural Networks (CNNs) improve VO in various aspects, VO still suffers from moving obstacles, discontinuous observation of features, and poor textures or visual information. While recent approaches estimate a 6DoF pose either directly from (a series of) images or by merging depth maps with optical flow (OF), research that combines absolute pose regression with OF is limited. We propose ViPR, a novel modular architecture for long-term 6DoF VO that leverages temporal information and synergies between absolute pose estimates (from PoseNet-like modules) and relative pose estimates (from FlowNet-based modules) by combining both through recurrent layers. Experiments on known datasets and on our own Industry dataset show that our modular design outperforms the state of the art in long-term navigation tasks.
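
A minimal sketch of the recurrent fusion idea: per-frame absolute pose estimates and relative motion estimates are concatenated and passed through an LSTM that outputs a refined pose per time step. Dimensions, layer choices, and pose parameterizations below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RecurrentPoseFusion(nn.Module):
    """Illustrative fusion of absolute pose estimates (PoseNet-like) and
    relative pose estimates (FlowNet-like) through a recurrent layer."""
    def __init__(self, hidden=128):
        super().__init__()
        # 7 = absolute pose [x, y, z, qw, qx, qy, qz], 6 = relative motion
        self.lstm = nn.LSTM(input_size=7 + 6, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 7)   # refined absolute pose per frame

    def forward(self, abs_pose_seq, rel_pose_seq):
        # abs_pose_seq: (B, T, 7), rel_pose_seq: (B, T, 6)
        fused, _ = self.lstm(torch.cat([abs_pose_seq, rel_pose_seq], dim=-1))
        return self.head(fused)            # (B, T, 7) refined poses
```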

End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Reliable and accurate 3D object detection is a necessity for safe autonomous driving. Although LiDAR sensors can provide accurate 3D point cloud estimates of the environment, they are also prohibitively expensive for many settings. Recently, the introduction of pseudo-LiDAR (PL) has led to a drastic reduction in the accuracy gap between methods based on LiDAR sensors and those based on cheap stereo cameras. PL combines state-of-the-art deep neural networks for 3D depth estimation with those for 3D object detection by converting 2D depth map outputs to 3D point cloud inputs. However, so far these two networks have to be trained separately. In this paper, we introduce a new framework based on differentiable Change of Representation (CoR) modules that allow the entire PL pipeline to be trained end-to-end. The resulting framework is compatible with most state-of-the-art networks for both tasks and in combination with PointRCNN improves over PL consistently across all benchmarks, yielding the highest entry on the KITTI image-based 3D object detection leaderboard at the time of submission. Our code will be made available at https://github.com/mileyan/pseudo-LiDAR_e2e.
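
For the point-based detector path, the change of representation can be differentiable because back-projection is a smooth function of depth, so a detection loss can back-propagate into the depth network. A short sketch of that idea follows; the paper's CoR modules also handle quantized/voxel representations, which are not covered here, and the usage lines are illustrative pseudocode assumptions.

```python
import torch

def depth_to_points_differentiable(depth, K):
    """Differentiable depth -> point-cloud change of representation.
    Gradients of any loss on the output points flow back into `depth`
    through the back-projection equations."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return torch.stack([x, y, depth], dim=-1).reshape(-1, 3)  # (H*W, 3)

# Illustrative end-to-end usage (hypothetical module names):
# depth = depth_net(image)
# pts = depth_to_points_differentiable(depth, K)
# loss = detection_loss(detector(pts), targets)
# loss.backward()   # gradients reach depth_net through the conversion
```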