MonoSAID: Monocular 3D Object Detection based on Scene-Level Adaptive Instance Depth Estimation
Journal of Intelligent & Robotic Systems (IF 3.3), Pub Date: 2023-12-18, DOI: 10.1007/s10846-023-02027-6
Chenxing Xia, Wenjun Zhao, Huidan Han, Zhanpeng Tao, Bin Ge, Xiuju Gao, Kuan-Ching Li, Yan Zhang

Monocular 3D object detection (Mono3OD) is a challenging yet cost-effective vision task in the fields of autonomous driving and mobile robotics. The lack of reliable depth information makes obtaining accurate 3D positional information extremely difficult. In recent years, center-guided monocular 3D object detectors have directly regressed the absolute depth of the object center on top of 2D detection. However, this approach relies heavily on local semantic information, ignoring contextual spatial cues and global-to-local visual correlations. Moreover, visual variations across the scene lead to unavoidable depth prediction errors for objects at different scales. To address these limitations, we propose a Mono3OD framework based on scene-level adaptive instance depth estimation (MonoSAID). First, the continuous depth range is discretized into multiple bins, and the width distribution of the depth bins is generated adaptively from scene-level contextual semantic information. Then, we establish the correlation between global contextual semantic features and the local semantic features of each instance, and recover instance depth as a linear combination of the bin centers weighted by the probability distribution predicted from the local instance features. In addition, a multi-scale spatial perception attention module is designed to extract attention maps at various scales through pyramid pooling operations. This design enlarges the model's receptive field and strengthens its multi-scale spatial perception, thereby improving its ability to model target objects. We conducted extensive experiments on the KITTI and Waymo datasets. The results show that MonoSAID effectively improves 3D detection accuracy and robustness, achieving state-of-the-art performance.
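The abstract describes two mechanisms: (1) scene-adaptive depth binning, where bin widths are predicted from a global scene feature and instance depth is recovered as a probability-weighted combination of bin centers, and (2) a multi-scale spatial attention module built on pyramid pooling. The following is a minimal PyTorch sketch of both ideas, not the authors' implementation; all module names, feature dimensions, pooling scales, and the 1–80 m depth range (typical for KITTI) are illustrative assumptions.

```python
# Hedged sketch of the two mechanisms named in the abstract. Shapes,
# hyperparameters, and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveBinDepth(nn.Module):
    """Scene-adaptive depth binning: predict per-image bin widths from a
    global scene feature, then recover each instance's depth as the
    expectation of its probability distribution over the bin centers."""

    def __init__(self, feat_dim=256, num_bins=80, d_min=1.0, d_max=80.0):
        super().__init__()
        self.d_min, self.d_max = d_min, d_max
        # Normalized bin widths from the global (scene-level) feature.
        self.bin_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, num_bins))
        # Per-instance logits over the bins from the local feature.
        self.prob_head = nn.Linear(feat_dim, num_bins)

    def forward(self, scene_feat, inst_feat):
        # scene_feat: (B, C) pooled global feature.
        # inst_feat: (B, N, C) features of N instances (e.g. sampled at
        # the predicted 2D object centers).
        widths = F.softmax(self.bin_head(scene_feat), dim=-1)      # (B, K)
        edges = self.d_min + (self.d_max - self.d_min) * torch.cumsum(
            widths, dim=-1)                                        # (B, K)
        edges = F.pad(edges, (1, 0), value=self.d_min)             # (B, K+1)
        centers = 0.5 * (edges[:, :-1] + edges[:, 1:])             # (B, K)
        probs = F.softmax(self.prob_head(inst_feat), dim=-1)       # (B, N, K)
        # Depth = probability-weighted linear combination of bin centers.
        return (probs * centers.unsqueeze(1)).sum(dim=-1)          # (B, N)


class PyramidPoolAttention(nn.Module):
    """Multi-scale spatial attention: pool the feature map at several
    scales, upsample back, and fuse into a single attention map."""

    def __init__(self, channels=256, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.reduce = nn.ModuleList(
            nn.Conv2d(channels, channels // 4, kernel_size=1)
            for _ in pool_sizes)
        self.fuse = nn.Conv2d(len(pool_sizes) * (channels // 4), 1,
                              kernel_size=1)

    def forward(self, x):
        # x: (B, C, H, W) backbone feature map.
        h, w = x.shape[-2:]
        scales = [F.interpolate(conv(F.adaptive_avg_pool2d(x, s)),
                                size=(h, w), mode='bilinear',
                                align_corners=False)
                  for conv, s in zip(self.reduce, self.pool_sizes)]
        attn = torch.sigmoid(self.fuse(torch.cat(scales, dim=1)))  # (B,1,H,W)
        return x * attn  # reweight features with the multi-scale attention
```

In MonoSAID these pieces would sit inside a center-guided detector; the sketch only shows how a scene-level feature can set the bin layout while each instance supplies the probability weights over those bins.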



Updated: 2023-12-20