EMTNet: efficient mobile transformer network for real-time monocular depth estimation
Pattern Analysis and Applications (IF 3.9), Pub Date: 2023-10-07, DOI: 10.1007/s10044-023-01205-4
Long Yan, Fuyang Yu, Chao Dong

Estimating depth from a single image is inherently ill-posed and ambiguous, since one 2D image is consistent with many 3D scenes. Prior approaches to monocular depth estimation have relied mainly on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) as the primary feature extractors, but balancing speed and accuracy in real-time settings has proven difficult with both. In this study, we propose EMTNet, a model that combines CNN and ViT components to extract feature information at both local and global scales. To reduce the parameter count, EMTNet introduces the mobile transformer block (MTB), which reuses the parameters of its self-attention layers. High-resolution depth maps are generated by fusing multi-scale features in the decoder. Comprehensive validation on the NYU Depth V2 and KITTI datasets shows that EMTNet outperforms previous real-time monocular depth estimation models based on CNNs and hybrid architectures. We also conducted generalization tests and ablation experiments to verify our design choices. The depth maps produced by EMTNet exhibit fine detail, and the model runs at 32 FPS, striking a practical balance between real-time performance and accuracy.
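The abstract describes a hybrid block that pairs local CNN features with global self-attention and reuses self-attention parameters to stay lightweight. The paper's actual MTB definition is not given here, so the following is only a hypothetical PyTorch sketch of that general idea: a depthwise convolution supplies the local branch, and a single multi-head attention module is applied twice so its projection weights are shared, illustrating parameter reuse. All class and variable names are ours, not the authors'.

```python
import torch
import torch.nn as nn


class MobileTransformerBlockSketch(nn.Module):
    """Hypothetical MTB-style hybrid block (illustrative, not the authors' code).

    Local features come from a depthwise convolution; global context comes from
    self-attention whose weights are reused across two passes, mimicking the
    parameter-reuse idea mentioned in the abstract.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.local = nn.Sequential(                       # local (CNN) branch
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # one attention module, applied twice -> shared parameters
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x = self.local(x) + x                             # local residual
        t = x.flatten(2).transpose(1, 2)                  # (B, H*W, C) tokens
        q1 = self.norm1(t)
        t = t + self.attn(q1, q1, q1)[0]                  # first attention pass
        q2 = self.norm2(t)
        t = t + self.attn(q2, q2, q2)[0]                  # same weights reused
        return t.transpose(1, 2).reshape(b, c, h, w)


# toy usage: a 64-channel feature map passes through and keeps its shape
feats = torch.randn(1, 64, 16, 16)
out = MobileTransformerBlockSketch(64)(feats)
print(tuple(out.shape))  # (1, 64, 16, 16)
```

Because the two attention passes share one `nn.MultiheadAttention` instance, the block pays the parameter cost of a single attention layer while applying attention twice, which is one plausible way a design like MTB could trim its parameter budget.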




Updated: 2023-10-08