Stereo R-CNN Based 3D Object Detection for Autonomous Driving

Deep Learning-based Image 3D Object Detection for Autonomous Driving: Review

An accurate and robust perception system is key to understanding the driving environment for autonomous vehicles and robots. Autonomous driving needs 3D information about objects, including their location and pose, to understand the driving environment clearly. Camera sensors are widely used in autonomous driving because of their rich color and texture information and low price. The major problem with cameras is the lack of 3D information, which is necessary to understand the 3D driving environment. Additionally, changes in object scale and occlusion make 3D object detection more challenging. Many deep learning-based methods, such as depth estimation, have been developed to compensate for the lack of 3D information. This survey presents the 3D bounding box encoding techniques, feature extraction techniques, and evaluation metrics used in image-based 3D object detection. The image-based methods are categorized based on the technique used to estimate an image’s depth information, and ...
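
As a concrete illustration of the 3D bounding box encodings such a survey covers, the sketch below shows the common 7-parameter encoding, center (x, y, z), dimensions (l, w, h), and heading angle, converted to the eight box corners. The axis convention and function name are assumptions for illustration, not taken from the survey.

import numpy as np

def box7_to_corners(x, y, z, l, w, h, yaw):
    """Convert a 7-parameter 3D box (center, dimensions, heading) to its 8 corners.
    Assumed axis convention: x forward, y left, z up, yaw about the z axis."""
    # Corner offsets in the box's local frame, bottom face first, then top face.
    dx = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (l / 2.0)
    dy = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * (w / 2.0)
    dz = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) * (h / 2.0)
    corners = np.stack([dx, dy, dz], axis=0)            # shape (3, 8)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0],
                    [s, c, 0.0],
                    [0.0, 0.0, 1.0]])                   # rotation about the z axis
    return rot @ corners + np.array([[x], [y], [z]])    # (3, 8) corners in sensor frame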

PG-RCNN: Semantic Surface Point Generation for 3D Object Detection

arXiv (Cornell University), 2023

One of the main challenges in LiDAR-based 3D object detection is that the sensors often fail to capture the complete spatial information about objects due to long distance and occlusion. Two-stage detectors with point cloud completion approaches tackle this problem by adding more points to the regions of interest (RoIs) with a pretrained network. However, these methods generate dense point clouds of objects for all region proposals, assuming that objects always exist in the RoIs. This leads to indiscriminate point generation for incorrect proposals as well. Motivated by this, we propose Point Generation R-CNN (PG-RCNN), a novel end-to-end detector that generates semantic surface points of foreground objects for accurate detection. Our method uses a jointly trained RoI point generation module to process the contextual information of RoIs and estimate the complete shape and displacement of foreground objects. For every generated point, PG-RCNN assigns a semantic feature that indicates the estimated foreground probability. Extensive experiments show that the point clouds generated by our method provide geometrically and semantically rich information for refining false positive and misaligned proposals. PG-RCNN achieves competitive performance on the KITTI benchmark, with significantly fewer parameters than state-of-the-art models.
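
To make the RoI point generation idea concrete, here is a minimal PyTorch-style sketch of a head that predicts per-point offsets plus a foreground probability for each generated point. Module names, layer sizes, and the number of generated points are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class RoIPointGenerationHead(nn.Module):
    """Toy RoI point-generation head: from pooled RoI features, predict K surface
    points (as offsets from the RoI center) plus a per-point foreground probability."""
    def __init__(self, feat_dim=256, num_points=32):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_points * 4),   # 3 offsets + 1 foreground logit per point
        )

    def forward(self, roi_feats, roi_centers):
        # roi_feats: (N, feat_dim) pooled RoI features, roi_centers: (N, 3)
        out = self.mlp(roi_feats).view(-1, self.num_points, 4)
        points = roi_centers[:, None, :] + out[..., :3]   # generated surface points
        fg_prob = torch.sigmoid(out[..., 3])              # semantic foreground score
        return points, fg_prob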

RefineNet: Refining Object Detectors for Autonomous Driving

IEEE Transactions on Intelligent Vehicles

Highly accurate, camera-based object detection is an essential component of autonomous navigation and assistive technologies. In particular, for on-road applications, localization quality of objects in the image plane is important for accurate distance estimation, safe trajectory prediction, and motion planning. In this paper, we mathematically formulate and study a strategy for improving object localization with a deep convolutional neural network. An iterative region-of-interest pooling framework is proposed for predicting increasingly tight object boxes and addressing limitations in current state-of-the-art deep detection models. The method is shown to significantly improve performance on a variety of datasets, scene settings, and camera perspectives, producing high-quality object boxes at a minor additional computational expense. Specifically, the architecture achieves impressive gains in performance (up to 6% improvement in detection accuracy) at fast run-time speed (0.22 s per frame on 1242 × 375 sized images). The iterative refinement is also shown to benefit subsequent vision tasks, such as object tracking in the image plane and in the ground plane.
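
A minimal sketch of the iterative region-of-interest pooling idea is given below, assuming a torchvision-style roi_align and a simple box-delta regressor. The layer sizes and the direct delta update are illustrative simplifications, not the paper's architecture.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class IterativeBoxRefiner(nn.Module):
    """Toy iterative refinement: re-pool features at the current box estimate and
    regress a small correction, repeating for a fixed number of steps."""
    def __init__(self, feat_dim=256, pool_size=7, num_iters=3):
        super().__init__()
        self.num_iters = num_iters
        self.pool_size = pool_size
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim * pool_size * pool_size, 256), nn.ReLU(),
            nn.Linear(256, 4),   # (dx1, dy1, dx2, dy2) correction to the box corners
        )

    def forward(self, feature_map, boxes, spatial_scale=1.0 / 16):
        # feature_map: (1, C, H, W); boxes: (N, 4) as (x1, y1, x2, y2) in image coords.
        for _ in range(self.num_iters):
            batch_idx = torch.zeros(len(boxes), 1, dtype=boxes.dtype, device=boxes.device)
            pooled = roi_align(feature_map, torch.cat([batch_idx, boxes], dim=1),
                               self.pool_size, spatial_scale)
            boxes = boxes + self.regressor(pooled)   # tighten the boxes step by step
        return boxes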

Shift R-CNN: Deep Monocular 3D Object Detection With Closed-Form Geometric Constraints

2019 IEEE International Conference on Image Processing (ICIP)

We propose Shift R-CNN, a hybrid model for monocular 3D object detection, which combines deep learning with the power of geometry. We adapt a Faster R-CNN network for regressing initial 2D and 3D object properties and combine it with a least squares solution for the inverse 2D to 3D geometric mapping problem, using the camera projection matrix. The closed-form solution of the mathematical system, along with the initial output of the adapted Faster R-CNN, is then passed through a final ShiftNet network that refines the result using our newly proposed Volume Displacement Loss. Our novel, geometrically constrained deep learning approach to monocular 3D object detection obtains top results on the KITTI 3D Object Detection Benchmark [5], being the best among all monocular methods that do not use any pre-trained network for depth estimation.
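
For intuition, the sketch below sets up the kind of least-squares system such a closed-form step solves: each edge of the 2D box is assumed tangent to the projection of one corner of the 3D box, which yields a linear constraint on the unknown translation. The corner-to-edge assignment and axis conventions here are assumptions for illustration; in practice all plausible assignments are enumerated and the best-fitting one is kept.

import numpy as np

def solve_translation(K, box2d, dims, yaw, corner_assignment):
    """Least-squares recovery of the 3D box center from the 2D box, the estimated
    dimensions and yaw, and the camera intrinsics K (3x3). `corner_assignment` lists,
    for each 2D edge, the index of the local 3D corner assumed to touch it."""
    l, w, h = dims
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s],
                  [0, 1, 0],
                  [-s, 0, c]])                        # yaw about the camera y axis
    # Local box corners: x along length, y along height (down), z along width.
    corners = np.array([[sx * l / 2, sy * h / 2, sz * w / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    u_min, v_min, u_max, v_max = box2d
    # Each edge gives one equation: (K_row - coord * K_2) . (R X + t) = 0, linear in t.
    edges = [(u_min, 0), (u_max, 0), (v_min, 1), (v_max, 1)]
    A, b = [], []
    for (coord, row), corner_idx in zip(edges, corner_assignment):
        a = K[row] - coord * K[2]
        A.append(a)
        b.append(-a @ (R @ corners[corner_idx]))
    t, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return t                                          # box center in camera coordinates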

Multi-Camera 3D Object Detection for Autonomous Driving Using Deep Learning and Self-Attention Mechanism

IEEE Access

In the absence of depth-centric sensors, 3D object detection using only conventional cameras becomes ill-posed and inaccurate due to the lack of depth information in the RGB image. We propose a multi-camera perception solution that predicts the 3D properties of vehicles from information aggregated across multiple static, infrastructure-installed cameras. While a multi-bin regression loss has been adopted to predict the orientation of a 3D bounding box using a convolutional neural network, combining it with the geometrical constraints of a 2D bounding box to form a 3D bounding box is not accurate enough for all driving scenarios and orientations. This paper leverages a vision transformer that overcomes the drawbacks of convolutional neural networks when no external LiDAR or pseudo-LiDAR pre-trained datasets are available for depth map estimation, particularly in occluded regions. By combining the predicted 3D boxes from the various cameras using an average weighted score algorithm, we determine the best bounding box with the highest confidence score. A comprehensive simulation-based performance analysis is presented using KITTI-format data generated with the CARLA simulator.

Index Terms: Autonomous vehicles, deep learning, object detection, vision transformer.
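
The abstract does not spell out the exact fusion rule, so the sketch below shows one generic, confidence-weighted way to combine per-camera 3D boxes for the same vehicle, including a circular average for yaw. It is illustrative only and not the paper's algorithm.

import numpy as np

def fuse_boxes(boxes, scores):
    """Toy confidence-weighted fusion of per-camera 3D box predictions for one vehicle.
    Each box is (x, y, z, l, w, h, yaw); scores are the per-camera confidences."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    w = scores / scores.sum()
    fused = (w[:, None] * boxes).sum(axis=0)
    # Average yaw on the unit circle so angles near +pi and -pi do not cancel out.
    fused[6] = np.arctan2((w * np.sin(boxes[:, 6])).sum(),
                          (w * np.cos(boxes[:, 6])).sum())
    return fused, scores.max()          # keep the highest confidence as the fused score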

ApolloCar3D: A Large 3D Car Instance Understanding Benchmark for Autonomous Driving

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

Autonomous driving has attracted remarkable attention from both industry and academia. An important task is to estimate 3D properties (e.g. translation, rotation and shape) of a moving or parked vehicle on the road. This task, while critical, is still under-researched in the computer vision community, partially owing to the lack of a large-scale, fully-annotated 3D car database suitable for autonomous driving research. In this paper, we contribute the first large-scale database suitable for 3D car instance understanding, ApolloCar3D. The dataset contains 5,277 driving images and over 60K car instances, where each car is fitted with an industry-grade 3D CAD model with absolute model size and semantically labelled keypoints. This dataset is more than 20× larger than PASCAL3D+ [65] and KITTI [21], the current state of the art. To enable efficient labelling in 3D, we build a pipeline that exploits 2D-3D keypoint correspondences for a single instance and 3D relationships among multiple instances. Equipped with such a dataset, we build various baseline algorithms with state-of-the-art deep convolutional neural networks. Specifically, we first segment each car with a pre-trained Mask R-CNN [22], and then regress towards its 3D pose and shape based on a deformable 3D car model, with or without using semantic keypoints. We show that using keypoints significantly improves fitting performance. Finally, we develop a new 3D metric jointly considering 3D pose and 3D shape, allowing for comprehensive evaluation and ablation study. By comparing with human performance we suggest several future directions for further improvements.
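
As an illustration of the 2D-3D keypoint fitting that such baselines rely on, here is a minimal sketch using OpenCV's RANSAC PnP solver. The keypoint correspondences, CAD model points, and solver settings are placeholders, not the ApolloCar3D baseline code.

import cv2
import numpy as np

def fit_car_pose(model_keypoints_3d, image_keypoints_2d, K):
    """Recover a car's rotation and translation from matches between semantically
    labelled 3D CAD keypoints and their 2D detections, via a RANSAC PnP solve."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(model_keypoints_3d, dtype=np.float64),   # (N, 3) model-frame points
        np.asarray(image_keypoints_2d, dtype=np.float64),   # (N, 2) detected keypoints
        np.asarray(K, dtype=np.float64),                     # 3x3 camera intrinsics
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)            # axis-angle vector -> rotation matrix
    return R, tvec.reshape(3), inliers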

UrbanNet: Leveraging Urban Maps for Long Range 3D Object Detection

2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021

Relying on monocular image data for precise 3D object detection remains an open problem, whose solution has broad implications for cost-sensitive applications such as traffic monitoring. We present UrbanNet, a modular architecture for long-range monocular 3D object detection with static cameras. Our proposed system combines commonly available urban maps with a mature 2D object detector and an efficient 3D object descriptor to accomplish accurate detection at long range, even when objects are rotated about any of their three axes. We evaluate UrbanNet on a novel, challenging synthetic dataset and highlight the advantages of its design for traffic detection on roads with changing slope, where the flat-ground approximation does not hold.
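
One way to see how map information can replace the flat-ground assumption is the small sketch below: the bottom center of a 2D detection is back-projected onto the locally fitted ground plane that an urban map provides at that location. The interface and variable names are assumptions for illustration, not UrbanNet's actual modules.

import numpy as np

def ground_point_from_pixel(u, v, K, plane_point, plane_normal):
    """Back-project a pixel (e.g. the bottom center of a 2D detection) onto the local
    ground plane supplied by a map. Plane point and normal are in camera coordinates."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing ray through the pixel
    denom = plane_normal @ ray
    if abs(denom) < 1e-6:
        return None                                  # ray (nearly) parallel to the ground
    depth = (plane_normal @ plane_point) / denom     # ray-plane intersection parameter
    return depth * ray                               # 3D contact point on the road surface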

Convolutional Neural Network Using for Multi-Sensor 3D Object Detection

Journal of Physics: Conference Series, 2021

The purpose of this article is to detect 3D objects around the autonomous vehicle with high accuracy. The method adopts the Multi-View 3D (MV3D) framework, which encodes the sparse 3D point cloud in a compact multi-view representation, takes the LiDAR bird's-eye view and RGB images as inputs, and predicts 3D bounding boxes. The network comprises two sub-networks: one for generating 3D object proposals and one for multi-view feature fusion. The article also considers a stereo-based 3D object detection approach that exploits sparse and dense semantic and geometric information in stereo imagery. The Stereo R-CNN strategy extends Faster R-CNN to stereo inputs so that objects are simultaneously detected and associated in the left and right images. The resulting feature maps are then combined and fed into a 3D proposal generator to produce accurate 3D proposals for vehicles. In the second stage, a refinement network further extracts features from the proposal regions and carries out the classification, re...
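
For context on the multi-view LiDAR encoding that an MV3D-style framework consumes, here is a minimal sketch that rasterizes a point cloud into a simple bird's-eye-view image with height and density channels. The ranges, resolution, and channel choices are illustrative and not taken from the paper.

import numpy as np

def points_to_bev(points, x_range=(0, 70), y_range=(-40, 40), res=0.1):
    """Rasterize a LiDAR point cloud (N, 4: x, y, z, intensity) into a bird's-eye-view
    image with a max-height channel and a log point-density channel."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    w = int((x_range[1] - x_range[0]) / res)
    h = int((y_range[1] - y_range[0]) / res)
    ix = ((pts[:, 0] - x_range[0]) / res).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / res).astype(int)
    bev = np.zeros((2, h, w), dtype=np.float32)   # channel 0: max height (clipped at 0), 1: density
    np.maximum.at(bev[0], (iy, ix), pts[:, 2])
    np.add.at(bev[1], (iy, ix), 1.0)
    bev[1] = np.log1p(bev[1])                     # compress the point-count dynamic range
    return bev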

Towards Robust Robot 3D Perception in Urban Environments: The UT Campus Object Dataset

arXiv (Cornell University), 2023

We introduce the UT Campus Object Dataset (CODa), a mobile robot egocentric perception dataset collected on the University of Texas at Austin campus. Our dataset contains 8.5 hours of multimodal sensor data: synchronized 3D point clouds and stereo RGB video from a 128-channel 3D LiDAR and two 1.25MP RGB cameras at 10 fps; RGB-D video from an additional 0.5MP sensor at 7 fps; and a 9-DOF IMU sensor at 40 Hz. We provide 58 minutes of ground-truth annotations containing 1.3 million 3D bounding boxes with instance IDs for 53 semantic classes, 5,000 frames of 3D semantic annotations for urban terrain, and pseudo-ground-truth localization. We repeatedly traverse identical geographic locations across a wide range of indoor and outdoor areas, weather conditions, and times of day. Using CODa, we empirically demonstrate that: 1) 3D object detection performance in urban settings is significantly higher when trained using CODa compared to existing datasets, even when employing state-of-the-art domain adaptation approaches; 2) sensor-specific fine-tuning improves 3D object detection accuracy; and 3) pretraining on CODa improves cross-dataset 3D object detection performance in urban settings compared to pretraining on AV datasets. Using our dataset and annotations, we release benchmarks for 3D object detection and 3D semantic segmentation with established metrics. In the future, the CODa benchmark will include additional tasks such as unsupervised object discovery and re-identification. We publicly release CODa on the Texas Data Repository [1], along with pre-trained models, a dataset development package, and an interactive dataset viewer. We expect CODa to be a valuable dataset for research in egocentric 3D perception and planning for autonomous navigation in urban environments.