Abstract
Monocular 3D object detection (Mono3OD) is a challenging yet cost-effective vision task in the fields of autonomous driving and mobile robotics. The lack of reliable depth information makes obtaining accurate 3D positional information extremely difficult. In recent years, center-guided monocular 3D object detectors have directly regressed the absolute depth of the object center on top of 2D detection. However, this approach relies heavily on local semantic information, ignoring contextual spatial cues and global-to-local visual correlations. Moreover, visual variations in the scene lead to inevitable depth prediction errors for objects at different scales. To address these limitations, we propose a Mono3OD framework based on scene-level adaptive instance depth estimation (MonoSAID). First, the continuous depth range is discretized into multiple bins, and the width distribution of the depth bins is adaptively generated from scene-level contextual semantic information. Then, the correlation between global contextual semantic features and local instance features is established, and the instance depth is estimated as a linear combination of the bin centers weighted by the probability distribution of the local instance features. In addition, a multi-scale spatial perception attention module is designed to extract attention maps at multiple scales through pyramid pooling operations. This design enlarges the model's receptive field and strengthens its multi-scale spatial perception, thereby improving its ability to model target objects. We conducted extensive experiments on the KITTI and Waymo datasets. The results show that MonoSAID effectively improves 3D detection accuracy and robustness, and our method achieves state-of-the-art performance.
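The depth-estimation scheme in the abstract (adaptive bin widths plus a probability-weighted combination of bin centers) can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the depth range `[d_min, d_max]`, and the use of a softmax to normalize the predicted bin widths are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def bin_centers(width_logits, d_min=0.0, d_max=60.0):
    """Turn scene-level width logits into adaptive bin centers.

    The widths are normalized so the bins partition [d_min, d_max];
    each center sits at the midpoint of its bin. (Hypothetical helper,
    not the paper's exact formulation.)
    """
    widths = softmax(width_logits)
    centers, edge = [], d_min
    for w in widths:
        span = w * (d_max - d_min)
        centers.append(edge + 0.5 * span)
        edge += span
    return centers

def instance_depth(prob_logits, width_logits, d_min=0.0, d_max=60.0):
    """Instance depth as the linear combination of bin centers
    weighted by the instance's probability distribution over bins."""
    probs = softmax(prob_logits)
    centers = bin_centers(width_logits, d_min, d_max)
    return sum(p * c for p, c in zip(probs, centers))
```

With uniform logits over four bins and a 0-60 m range, the bins are equally wide and the predicted depth is the midpoint, 30 m; non-uniform width logits shift the centers toward depths the scene context deems more likely.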
Code or data availability
The data of this experiment are still being collated, and data supporting the results of this study may be obtained from the corresponding author upon reasonable request.
Funding
This work was supported by the National Natural Science Foundation of China (62102003), the Anhui Postdoctoral Science Foundation (2022B623), the Natural Science Foundation of Anhui Province (2108085QF258), the University Synergy Innovation Program of Anhui Province (GXXT-2022-038, GXXT-2021-006), special funds of the central government for guiding local technology development (202107d06020001), and university-level general projects of Anhui University of Science and Technology (xjyb2020-04).
Author information
Authors and Affiliations
Contributions
Chenxing Xia developed the overall research idea and wrote the main manuscript text; Wenjun Zhao prepared the manuscript and contributed to conception, design, and implementation; Huidan Han collected the data and contributed to conception and design; Zhanpeng Tao contributed to conception, design, and implementation; Bin Ge performed data analysis; Xiuju Gao designed the experiments and performed data analysis; Kuan-Ching Li contributed to conception and edited the final version for clarity and accuracy; Yan Zhang contributed to implementation and edited the final version for clarity and accuracy.
Corresponding author
Ethics declarations
Ethics approval
Not applicable
Consent to participate
The authors declare their willingness to participate in this work.
Consent for publication
The authors agree to the publication of this work.
Conflicts of interest
The authors declare that they have no conflict of interest related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xia, C., Zhao, W., Han, H. et al. MonoSAID: Monocular 3D Object Detection based on Scene-Level Adaptive Instance Depth Estimation. J Intell Robot Syst 110, 2 (2024). https://doi.org/10.1007/s10846-023-02027-6
Received:
Accepted:
Published: