Abstract
Monocular 3D object detection (Mono3OD) is a challenging yet cost-effective vision task in the fields of autonomous driving and mobile robotics. The lack of reliable depth information makes obtaining accurate 3D positional information extremely difficult. In recent years, center-guided monocular 3D object detectors have directly regressed the absolute depth of the object center on top of 2D detection. However, this approach relies heavily on local semantic information, ignoring contextual spatial cues and global-to-local visual correlations. Moreover, visual variations in the scene lead to inevitable depth prediction errors for objects at different scales. To address these limitations, we propose a Mono3OD framework based on scene-level adaptive instance depth estimation (MonoSAID). First, the continuous depth range is discretized into multiple bins, and the width distribution of the depth bins is adaptively generated from scene-level contextual semantic information. Then, the correlation between global contextual semantic features and local instance features is established, and the instance depth is estimated as a linear combination of the bin centers weighted by the probability distribution of the local instance features. In addition, a multi-scale spatial perception attention module is designed to extract attention maps at multiple scales through pyramid pooling operations. This design enlarges the model's receptive field and strengthens its multi-scale spatial perception, thereby improving its ability to model target objects. We conducted extensive experiments on the KITTI and Waymo datasets. The results show that MonoSAID effectively improves 3D detection accuracy and robustness, and our method achieves state-of-the-art performance.
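The depth-estimation scheme in the abstract (adaptive bin widths plus a probability-weighted combination of bin centers) can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the depth range `[d_min, d_max]`, and the use of a softmax to normalize the predicted bin widths are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def bin_centers(width_logits, d_min=0.0, d_max=60.0):
    """Turn scene-level width logits into adaptive bin centers.

    The widths are normalized so the bins partition [d_min, d_max];
    each center sits at the midpoint of its bin. (Hypothetical helper,
    not the paper's exact formulation.)
    """
    widths = softmax(width_logits)
    centers, edge = [], d_min
    for w in widths:
        span = w * (d_max - d_min)
        centers.append(edge + 0.5 * span)
        edge += span
    return centers

def instance_depth(prob_logits, width_logits, d_min=0.0, d_max=60.0):
    """Instance depth as the linear combination of bin centers
    weighted by the instance's probability distribution over bins."""
    probs = softmax(prob_logits)
    centers = bin_centers(width_logits, d_min, d_max)
    return sum(p * c for p, c in zip(probs, centers))
```

With uniform logits over four bins and a 0-60 m range, the bins are equally wide and the predicted depth is the midpoint, 30 m; non-uniform width logits shift the centers toward depths the scene context deems more likely.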
Code or data availability
The data of this experiment are still being collated, and data supporting the results of this study may be obtained from the corresponding author upon reasonable request.
Funding
This work was supported by the National Natural Science Foundation of China (62102003), the Anhui Postdoctoral Science Foundation (2022B623), the Natural Science Foundation of Anhui Province (2108085QF258), the University Synergy Innovation Program of Anhui Province (GXXT-2022-038, GXXT-2021-006), special funds of the central government for guiding local technology development (202107d06020001), and university-level general projects of Anhui University of Science and Technology (xjyb2020-04).
Author information
Authors and Affiliations
Contributions
Chenxing Xia developed the overall research idea and wrote the main manuscript text; Wenjun Zhao prepared the manuscript and contributed to conception, design, and implementation; Huidan Han collected the data and contributed to conception and design; Zhanpeng Tao contributed to conception, design, and implementation; Bin Ge performed data analysis; Xiuju Gao designed the experiments and performed data analysis; Kuan-Ching Li contributed to conception and edited the final version for clarity and accuracy; Yan Zhang contributed to implementation and edited the final version for clarity and accuracy.
Corresponding author
Ethics declarations
Ethics approval
Not applicable
Consent to participate
The authors declare their willingness to participate in this work.
Consent for publication
The authors agree to the publication of this work.
Conflicts of interest
The authors declare that they have no conflict of interest related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xia, C., Zhao, W., Han, H. et al. MonoSAID: Monocular 3D Object Detection based on Scene-Level Adaptive Instance Depth Estimation. J Intell Robot Syst 110, 2 (2024). https://doi.org/10.1007/s10846-023-02027-6
Received:
Accepted:
Published: