Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection
Journal of Ambient Intelligence and Humanized Computing (IF 3.662), Pub Date: 2024-03-02, DOI: 10.1007/s12652-024-04758-2
Shilpa Elsa Abraham, Binsu C. Kovoor

RGB-D salient object detection (SOD) has attracted intense research interest in recent years owing to its ability to handle complex and challenging image scenes. Despite substantial efforts in this field, two notable challenges persist. The first concerns the efficient extraction of saliency-relevant features from both modalities, which requires comprehensively understanding and capturing the intricate features present in the RGB and depth views. To tackle this, a novel approach is proposed that integrates a convolutional neural network (CNN) and a transformer architecture into a unified framework. This effectively captures spatial hierarchies and intricate dependencies, facilitating enhanced feature extraction and deeper contextual understanding. The second hurdle involves devising an optimal fusion strategy for the RGB and depth views, which is handled by a specialized cross-interactive multi-modal (CIMM) feature fusion module. Built from two stages of self-attention, this module generates a unified feature representation, effectively bridging the gap between the two modalities. Further, to sharpen the delineation of salient objects' boundaries, an edge enhancement module (EEM) is incorporated that enhances the visual distinctiveness of salient objects, thereby improving the overall quality and accuracy of salient object detection in complex scenes. Extensive experimental evaluations on seven benchmark datasets demonstrate that the proposed model performs favourably against CNN- and transformer-based state-of-the-art methods under standard saliency evaluation metrics. Notably, on the NJU-2K test set, it achieves an S-measure of 0.933, an F-measure of 0.930, an E-measure of 0.951, and a remarkably low mean absolute error of 0.025, underscoring the efficacy of the proposed model.
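To make the fusion idea concrete, the sketch below shows what a two-stage self-attention fusion block of the kind the abstract describes could look like in PyTorch. The class name, dimensions, and token layout are illustrative assumptions, not the authors' implementation: RGB and depth feature maps are flattened to token sequences, concatenated so each modality can attend to the other, and refined by two stacked self-attention stages.

```python
import torch
import torch.nn as nn

class CrossInteractiveFusion(nn.Module):
    """Hypothetical CIMM-style block: two stages of self-attention over the
    concatenated RGB and depth tokens produce a unified fused feature map."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.stage1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.stage2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (B, C, H, W) feature maps with C == dim
        b, c, h, w = rgb.shape
        # Concatenate the two modalities and flatten to a token sequence (B, 2HW, C)
        tokens = torch.cat([rgb, depth], dim=2).flatten(2).transpose(1, 2)
        # Stage 1: joint self-attention lets RGB tokens attend to depth tokens (and vice versa)
        x = self.norm1(tokens + self.stage1(tokens, tokens, tokens)[0])
        # Stage 2: a second round of self-attention refines the unified representation
        x = self.norm2(x + self.stage2(x, x, x)[0])
        # Split back per modality and merge into a single fused feature map
        rgb_f, depth_f = x.chunk(2, dim=1)
        return (rgb_f + depth_f).transpose(1, 2).reshape(b, c, h, w)
```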

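Of the metrics quoted above, the mean absolute error has the simplest standard definition: the average per-pixel absolute difference between the predicted saliency map and the binary ground-truth mask, both normalized to [0, 1]. A minimal reference implementation of that standard definition (not tied to the authors' evaluation code):

```python
import numpy as np

def mean_absolute_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Saliency MAE: mean |pred - gt| over all pixels, with both maps in [0, 1]."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    if pred.max() > 1.0:  # normalize 8-bit maps if needed
        pred /= 255.0
    if gt.max() > 1.0:
        gt /= 255.0
    return float(np.mean(np.abs(pred - gt)))
```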



Updated: 2024-03-03