Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels (GCPR 2020)
- Lukas Koestler
- Nan Yang
- Rui Wang
- Daniel Cremers
Abstract
The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision that have to be generated by hand-labeling. We propose a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels. By representing the objects as triangular meshes and employing differentiable shape rendering, we define loss functions based on depth maps, segmentation masks, and ego- and object-motion, which are generated by pre-trained, off-the-shelf networks. We evaluate the proposed algorithm on the real-world KITTI dataset and achieve promising performance in comparison to state-of-the-art methods requiring 3D bounding box labels for training and superior performance to conventional baseline methods.
Architecture
The proposed model consists of a single-image network and a multi-image extension. The single-image network back-projects the input depth map, estimated from the image, into a point cloud. A Frustum-PointNet encoder predicts the pose and shape of the vehicle, which are then decoded into a predicted 3D mesh and segmentation mask through differentiable rendering. The predictions are compared against the input segmentation mask and the back-projected point cloud to define two loss terms. The multi-image network takes three temporally consecutive images as input, and the single-image network is applied to each image individually. Our network predicts a depth map for the middle frame based on the vehicle's pose and shape. A pre-trained network predicts ego-motion and object motion from the images, and the reconstruction loss is computed by differentiably warping the adjacent images into the middle frame. The back-projection and warping steps are sketched below; please see our paper for further details.
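As a minimal illustration of the back-projection step, the sketch below converts a depth map into a camera-frame point cloud using a pinhole intrinsics matrix. The function name, the NumPy interface, and the KITTI-like intrinsics are our own assumptions for illustration, not the code used in the paper.

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a depth map (H, W) into an (N, 3) point cloud.

    depth: per-pixel depth along the camera z-axis, in meters.
    K:     3x3 pinhole intrinsics matrix.
    Pixels with non-positive depth (e.g. masked-out regions) are dropped.
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)

    # Rays through each pixel, scaled by the measured depth.
    rays = np.linalg.inv(K) @ pix                  # (3, H*W)
    points = rays * depth.ravel()[None, :]         # (3, H*W)

    valid = depth.ravel() > 0.0
    return points[:, valid].T                      # (N, 3)


# Example with hypothetical KITTI-like intrinsics and a dummy 10 m depth map.
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])
depth = np.full((375, 1242), 10.0)
cloud = backproject_depth(depth, K)
print(cloud.shape)                                 # (465750, 3)
```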
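The reconstruction loss relies on warping adjacent frames into the middle frame. The following PyTorch sketch shows such an inverse warp under a simplified rigid-scene assumption (one 4x4 relative pose per frame, without the separate object-motion handling described in the paper); all names and tensor shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def warp_to_middle(src_img, mid_depth, K, K_inv, T_mid_to_src):
    """Inverse-warp a source frame into the middle frame.

    src_img:      (B, 3, H, W) adjacent (previous or next) frame
    mid_depth:    (B, 1, H, W) depth predicted for the middle frame
    K, K_inv:     (B, 3, 3) camera intrinsics and their inverse
    T_mid_to_src: (B, 4, 4) relative pose from middle to source camera
    Returns the source image resampled at the pixel locations that the
    middle-frame points project to, for comparison with the middle frame.
    """
    B, _, H, W = mid_depth.shape
    device = mid_depth.device

    # Homogeneous pixel grid of the middle frame, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack((xs, ys, torch.ones_like(xs)), dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project into the middle camera, then transform into the source camera.
    cam = (K_inv @ pix) * mid_depth.view(B, 1, -1)                     # (B, 3, H*W)
    cam_h = torch.cat((cam, torch.ones(B, 1, H * W, device=device)), dim=1)
    cam_src = (T_mid_to_src @ cam_h)[:, :3]                            # (B, 3, H*W)

    # Project into the source image and normalize to [-1, 1] for grid_sample.
    proj = K @ cam_src
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack((u, v), dim=-1).view(B, H, W, 2)

    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)
```

The difference between the warped source frame and the middle frame then yields a photometric reconstruction loss.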
Qualitative Results
Qualitative comparison of MonoGRNet (first row), Mono3D (second row), and our method (third row) with depth maps from BTS. We show ground-truth bounding boxes for cars (red), predicted bounding boxes (green), and the back-projected point cloud. Compared to Mono3D, the proposed approach is notably more accurate, particularly for vehicles farther away. As in the quantitative evaluation, the performance of MonoGRNet and our method is comparable. Best viewed with "open image in new tab". Please see our paper for further results.
Quantitative Results
| Method | Without 3D BBox labels | AP_BEV, 0.7 (Easy) | AP_BEV, 0.7 (Moderate) | AP_BEV, 0.7 (Hard) |
|---|---|---|---|---|
| MonoGRNet | | 23.07 | 16.37 | 10.05 |
| Mono3D | | 1.92 | 1.13 | 0.77 |
| Ours | ✔ | 19.23 | 9.06 | 5.34 |
Results on the proposed KITTI validation set, which is not identical to the standard KITTI validation split; please see the paper for details. We report the average precision (AP) in percent for the car category in the bird's-eye view (BEV). The AP is averaged over 40 recall positions, as introduced by Simonelli et al. (2019). Our method convincingly outperforms the supervised baseline method Mono3D and shows promising performance in comparison to the state-of-the-art supervised method MonoGRNet. Please see our paper for further results.
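For reference, the sketch below shows how average precision over 40 recall positions (AP|R40, Simonelli et al. 2019) can be computed from a precision-recall curve. It is a simplified stand-in for the official KITTI evaluation code; the function name and the toy curve are our own illustrative assumptions.

```python
import numpy as np

def ap_r40(recall, precision):
    """Average precision over 40 equally spaced recall positions (AP|R40).

    recall, precision: arrays describing a detector's precision-recall curve,
    sorted by increasing recall. At each sampled recall level r we take the
    maximum precision achieved at recall >= r (the standard interpolated
    precision used by KITTI / Pascal VOC style metrics).
    """
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    samples = np.linspace(1.0 / 40.0, 1.0, 40)   # 1/40, 2/40, ..., 40/40

    ap = 0.0
    for r in samples:
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / len(samples)
    return ap


# Toy precision-recall curve for illustration (not from the paper).
recall = np.linspace(0.0, 0.6, 50)
precision = 1.0 - 0.5 * recall
print(f"AP|R40 = {100.0 * ap_r40(recall, precision):.2f} %")
```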