Vision-state Fusion: Improving Deep Neural Networks for Autonomous Robotics
Journal of Intelligent & Robotic Systems ( IF 3.3 ) Pub Date : 2024-04-10 , DOI: 10.1007/s10846-024-02091-6
Elia Cereda , Stefano Bonato , Mirko Nava , Alessandro Giusti , Daniele Palossi

Vision-based deep learning perception fulfills a paramount role in robotics, facilitating solutions to many challenging scenarios, such as acrobatic maneuvers of autonomous unmanned aerial vehicles (UAVs) and robot-assisted high-precision surgery. Control-oriented end-to-end perception approaches, which directly output control variables for the robot, commonly take advantage of the robot’s state estimation as an auxiliary input. When intermediate outputs are estimated and fed to a lower-level controller, i.e., mediated approaches, the robot’s state is commonly used as an input only for egocentric tasks, which estimate physical properties of the robot itself. In this work, we propose to apply a similar approach, for the first time to the best of our knowledge, to non-egocentric mediated tasks, where the estimated outputs refer to an external subject. We show how our general methodology improves the regression performance of deep convolutional neural networks (CNNs) on a broad class of non-egocentric 3D pose estimation problems, with minimal computational cost. By analyzing three highly different use cases, spanning from grasping with a robotic arm to following a human subject with a pocket-sized UAV, our results consistently improve the R\(^{2}\) regression metric, by up to +0.51, compared to their stateless baselines. Finally, we validate the in-field performance of a closed-loop autonomous cm-scale UAV on the human pose estimation task. Our results show a significant reduction, i.e., 24% on average, in the mean absolute error of our stateful CNN, compared to a state-of-the-art stateless counterpart.
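The core idea is to feed the robot’s state estimate to the CNN alongside the camera frame, rather than relying on vision alone. Below is a minimal PyTorch sketch of one plausible late-fusion realization, in which the state vector is concatenated with the flattened visual features before the regression head; the backbone, state dimensionality, and output layout are illustrative assumptions, not the specific architectures evaluated in the paper.

```python
import torch
import torch.nn as nn

class VisionStateFusionNet(nn.Module):
    """Minimal vision-state fusion regressor: a small CNN backbone extracts
    visual features from a camera frame, and the robot's state vector (e.g.,
    an attitude estimate) is concatenated with those features before the
    regression head that outputs the external subject's pose."""

    def __init__(self, state_dim: int = 2, out_dim: int = 4):
        super().__init__()
        # Hypothetical lightweight backbone; the paper's CNNs are not detailed here.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Regression head takes visual features plus the state vector (late fusion).
        self.head = nn.Sequential(
            nn.Linear(64 + state_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, image: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        features = self.backbone(image)          # (B, 64)
        fused = torch.cat([features, state], 1)  # (B, 64 + state_dim)
        return self.head(fused)                  # (B, out_dim), e.g., x, y, z, yaw

# Example: grayscale frame plus a 2-D state (e.g., roll and pitch estimates).
model = VisionStateFusionNet(state_dim=2, out_dim=4)
pose = model(torch.randn(8, 1, 160, 96), torch.randn(8, 2))
print(pose.shape)  # torch.Size([8, 4])
```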



Updated: 2024-04-10