Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

Abraham, Shilpa Elsa; Kovoor, Binsu C.

doi:10.1007/s12652-024-04758-2

Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

Original Research
Published: 02 March 2024

Volume 15, pages 2341–2359, (2024)
Cite this article

Journal of Ambient Intelligence and Humanized Computing Aims and scope Submit manuscript

133 Accesses
Explore all metrics

Abstract

RGB-D salient object detection (SOD) has received immense research interest in the recent past due to its capability to handle complex and challenging image scenes. Despite substantial endeavors in this field, two notable challenges endure. The first pertains to the efficient extraction of saliency-relevant features from both modalities, necessitating a comprehensive understanding and capture of intricate features present in RGB and depth views. To tackle this, a novel approach is proposed, which encompasses convolutional neural network (CNN) and transformer architecture into a unified framework. This effectively captures spatial hierarchies and intricate dependencies, facilitating enhanced feature extraction and deeper contextual understanding. The second hurdle involves devising an optimal fusion strategy for the RGB and depth views, which is handled by a specialized cross-interactive multi-modal (CIMM) feature fusion module. Constructed with two stages of self-attention, this module generates a unified feature representation, effectively bridging the gap between the two modalities. Further, to emphasize the delineation of salient objects’ boundaries, an edge enhancement module (EEM) is incorporated that enhances the visual distinctiveness of salient objects, thereby improving the overall quality and accuracy of salient object detection in complex scenes. Extensive experimental evaluations performed on seven benchmark datasets demonstrate that the proposed model performs favourably against CNN and transformer based state-of-the-art methods under standard saliency evaluation metric. Notably, on NJU-2K test set, it achieves an S-measure of 0.933, F-measure of 0.930, E-measure of 0.951 and a remarkably low mean absolute error of 0.025, underscoring the efficacy of the proposed model.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

CSNet: a ConvNeXt-based Siamese network for RGB-D salient object detection

Article 23 May 2023

Feature Enhancement and Multi-scale Cross-Modal Attention for RGB-D Salient Object Detection

Saliency Detection Framework Based on Deep Enhanced Attention Network

Data availability

The code and data are shared via GitFront and can be accessed at https://gitfront.io/r/user-9759679/eoWpi4HW5B72/SalFormer_GitFront/.

References

Achanta R, Hemami S, Estrada F, Susstrunk S (2009) Frequency-tuned salient region detection. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 1597–1604
Bai Z, Liu Z, Li G, Ye L, Wang Y (2021) Circular complement network for rgb-d salient object detection. Neurocomputing 451:95–106
Article Google Scholar
Banerjee S, Mitra S, Shankar BU (2018) Automated 3d segmentation of brain tumor using visual saliency. Inf Sci 424:337–353
Article MathSciNet Google Scholar
Borji A, Cheng MM, Jiang H, Li J (2015) Salient object detection: a benchmark. IEEE Trans Image Process 24(12):5706–5722
Article MathSciNet Google Scholar
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision. Springer, pp 213–229
Chen S, Fu Y (2020) Progressively guided alternate refinement network for rgb-d salient object detection. In: European conference on computer vision. Springer, pp 520–538
Chen H, Li Y (2018) Progressively complementarity-aware fusion network for rgb-d salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3051–3060
Cheng Y, Fu H, Wei X, Xiao J, Cao X (2014) Depth enhanced saliency detection method. In: Proceedings of international conference on internet multimedia computing and service, pp 23–27
Chen H, Deng Y, Li Y, Hung TY, Lin G (2020) Rgbd salient object detection via disentangled cross-modal fusion. IEEE Trans Image Process 29:8407–8416
Article Google Scholar
Chen Z, Cong R, Xu Q, Huang Q (2020) Dpanet: depth potentiality-aware gated attention network for rgb-d salient object detection. IEEE Trans Image Process 30:7012–7024
Article Google Scholar
Chen Q, Fu K, Liu Z, Chen G, Du H, Qiu B, Shao L (2021) Ef-net: a novel enhancement and fusion network for rgb-d saliency detection. Pattern Recognit 112:107740
Article Google Scholar
Chen T, Hu X, Xiao J, Zhang G, Wang S (2022) Cfidnet: cascaded feature interaction decoder for rgb-d salient object detection. Neural Comput Appl 34(10):7547–7563
Article Google Scholar
Ciptadi A, Hermans T, Rehg JM (2013) An in depth view of saliency. Georgia Institute of Technology
De Boer PT, Kroese DP, Mannor S, Rubinstein RY (2005) A tutorial on the cross-entropy method. Ann Oper Res 134:19–67
Article MathSciNet Google Scholar
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929
Durga BK, Rajesh V (2022) A resnet deep learning based facial recognition design for future multimedia applications. Comput Electr Eng 104:108384
Article Google Scholar
Fan DP, Cheng MM, Liu Y, Li T, Borji A (2017) Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE international conference on computer vision, pp 4548–4557
Fan DP, Gong C, Cao Y, Ren B, Cheng MM, Borji A (2018) Enhanced-alignment measure for binary foreground map evaluation. arXiv:1805.10421
Fan DP, Lin Z, Zhang Z, Zhu M, Cheng MM (2020) Rethinking rgb-d salient object detection: models, data sets, and large-scale benchmarks. IEEE Trans Neural Netw Learn Syst 32(5):2075–2089
Article Google Scholar
Fang Y, Chen Z, Lin W, Lin CW (2012) Saliency detection in the compressed domain for adaptive image retargeting. IEEE Trans Image Process 21(9):3888–3901
Article MathSciNet Google Scholar
Fang X, Zhu J, Shao X, Wang H (2022) Grouptransnet: group transformer network for rgb-d salient object detection. arXiv:2203.10785
Feng D, Barnes N, You S, McCarthy C (2016) Local background enclosure for rgb-d salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2343–2350
Feng G, Meng J, Zhang L, Lu H (2022) Encoder deep interleaved network with multi-scale aggregation for rgb-d salient object detection. Pattern Recognit 128:108666
Article Google Scholar
Gao L, Liu B, Fu P, Xu M (2023) Depth-aware inverted refinement network for rgb-d salient object detection. Neurocomputing 518:507–522
Article Google Scholar
Guo J, Han K, Wu H, Tang Y, Chen X, Wang Y, Xu C (2022) Cmt: convolutional neural networks meet vision transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12175–12185
Gupta S, Girshick R, Arbeláez P, Malik J (2014) Learning rich features from rgb-d images for object detection and segmentation. In: European conference on computer vision. Springer, pp 345–360
Han J, Chen H, Liu N, Yan C, Li X (2017) Cnns-based rgb-d saliency detection via cross-view transfer and multiview fusion. IEEE Trans Cybern 48(11):3171–3183
Article Google Scholar
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hu K, Zhao L, Feng S, Zhang S, Zhou Q, Gao X, Guo Y (2022) Colorectal polyp region extraction using saliency detection network with neutrosophic enhancement. Comput Biol Med 147:105760
Article Google Scholar
Huang P, Shen CH, Hsiao HF (2018) Rgbd salient object detection using spatially coherent deep learning framework. In: 2018 IEEE 23rd international conference on digital signal processing (DSP). IEEE, pp 1–5
Huang Z, Chen HX, Zhou T, Yang YZ, Liu BY (2021) Multi-level cross-modal interaction network for rgb-d salient object detection. Neurocomputing 452:200–211
Article Google Scholar
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
Jia S, Zhang Y (2018) Saliency-based deep convolutional neural network for no-reference image quality assessment. Multimed Tools Appl 77(12):14859–14872
Article Google Scholar
Jia X, DongYe C, Peng Y (2022) Siatrans: siamese transformer network for rgb-d salient object detection with depth image classification. Image Vis Comput 127:104549
Article Google Scholar
Jin X, Guo C, He Z, Xu J, Wang Y, Su Y (2022) Fcmnet: frequency-aware cross-modality attention networks for rgb-d salient object detection. Neurocomputing 491:414–425
Article Google Scholar
Ju R, Ge L, Geng W, Ren T, Wu G (2014) Depth saliency based on anisotropic center-surround difference. In: 2014 IEEE international conference on image processing (ICIP). IEEE, pp 1115–1119
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, p 25
Kroner A, Senden M, Driessens K, Goebel R (2020) Contextual encoder-decoder network for visual saliency prediction. Neural Netw 129:261–270
Article Google Scholar
Lee M, Park C, Cho S, Lee S (2022) Spsn: superpixel prototype sampling network for rgb-d salient object detection. arXiv:2207.07898
Li N, Ye J, Ji Y, Ling H, Yu J (2014) Saliency detection on light field. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2806–2813
Li C, Cong R, Kwong S, Hou J, Fu H, Zhu G, Zhang D, Huang Q (2020) Asif-net: attention steered interweave fusion network for rgb-d salient object detection. IEEE Trans Cybern 51(1):88–100
Article Google Scholar
Li G, Liu Z, Chen M, Bai Z, Lin W, Ling H (2021) Hierarchical alternate interaction network for rgb-d salient object detection. IEEE Trans Image Process 30:3528–3542
Article Google Scholar
Li H, Wu P, Wang Z, Mao J, Alsaadi FE, Zeng N (2022) A generalized framework of feature learning enhanced convolutional neural network for pathology-image-oriented cancer diagnosis. Comput Biol Med 151:106265
Article Google Scholar
Li J, Ji W, Zhang M, Piao Y, Lu H, Cheng L (2023) Delving into calibrated depth for accurate rgb-d salient object detection. Int J Comput Vis 131(4):855–876
Article Google Scholar
Liang J, Cao J, Sun G, Zhang K, Van Gool L, Timofte R (2021) Swinir: image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1833–1844
Liu Z, Shi S, Duan Q, Zhang W, Zhao P (2019) Salient object detection for rgb-d image by single stream recurrent convolution neural network. Neurocomputing 363:46–57
Article Google Scholar
Liu N, Zhang N, Wan K, Shao L, Han J (2021a) Visual saliency transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4722–4732
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021b) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Liu Z, Tan Y, He Q, Xiao Y (2021) Swinnet: swin transformer drives edge-aware rgb-d and rgb-t salient object detection. IEEE Trans Circuits Syst Video Technol 32(7):4486–97
Article Google Scholar
Liu Z, Wang Y, Tu Z, Xiao Y, Tang B (2021d) Tritransnet: Rgb-d salient object detection with a triplet transformer embedding network. In: Proceedings of the 29th ACM international conference on multimedia, pp 4481–4490
Liu C, Yang G, Wang S, Wang H, Zhang Y, Wang Y (2023) Tanet: transformer-based asymmetric network for rgb-d salient object detection. In: IET Computer Vision
Mashrur FR, Islam MS, Saha DK, Islam SR, Moni MA (2021) Scnn: scalogram-based convolutional neural network to detect obstructive sleep apnea using single-lead electrocardiogram signals. Comput Biol Med 134:104532
Article Google Scholar
Michel P, Levy O, Neubig G (2019) Are sixteen heads really better than one? In: Advances in neural information processing systems, p 32
Ning X, Gong K, Li W, Zhang L (2021) Jwsaa: joint weak saliency and attention aware for person re-identification. Neurocomputing 453:801–811
Article Google Scholar
Niu Y, Geng Y, Li X, Liu F (2012) Leveraging stereopsis for saliency analysis. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 454–461
Pantazis G, Dimas G, Iakovidis DK (2020) Salsum: saliency-based video summarization using generative adversarial networks. arXiv:2011.10432
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, p 32
Patel Y, Appalaraju S, Manmatha R (2021) Saliency driven perceptual image compression. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 227–236
Peng H, Li B, Xiong W, Hu W, Ji R (2014) Rgbd salient object detection: a benchmark and algorithms. In: European conference on computer vision. Springer, pp 92–109
Perazzi F, Krähenbühl P, Pritch Y, Hornung A (2012) Saliency filters: contrast based filtering for salient region detection. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 733–740
Piao Y, Ji W, Li J, Zhang M, Lu H (2019) Depth-induced multi-scale recurrent attention network for saliency detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7254–7263
Rahman MA, Wang Y (2016) Optimizing intersection-over-union in deep neural networks for image segmentation. In: International symposium on visual computing. Springer, pp 234–244
Ren J, Gong X, Yu L, Zhou W, Ying Yang M (2015) Exploiting global priors for rgb-d saliency detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 25–32
Shigematsu R, Feng D, You S, Barnes N (2017) Learning rgb-d salient object detection using background enclosure, depth contrast, and top-down features. In: Proceedings of the IEEE international conference on computer vision workshops, pp 2749–2757
Strudel R, Garcia R, Laptev I, Schmid C (2021) Segmenter: transformer for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7262–7272
Sun F, Ren P, Yin B, Wang F, Li H (2023) Catnet: a cascaded and aggregated transformer network for rgb-d salient object detection. IEEE Trans Multimed
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, p 30
Wang N, Gong X (2019) Adaptive fusion for rgb-d salient object detection. IEEE Access 7:55277–55284
Article Google Scholar
Wang X, Li S, Chen C, Hao A, Qin H (2021) Depth quality-aware selective saliency fusion for rgb-d image salient object detection. Neurocomputing 432:44–56
Article Google Scholar
Wei W, Xu M, Wang J, Luo X (2023) Bidirectional attentional interaction networks for rgb-d salient object detection. Image Vis Comput 138:104792
Article Google Scholar
Wu J, Hao F, Liang W, Xu J (2023) Transformer fusion and pixel-level contrastive learning for rgb-d salient object detection. IEEE Trans Multimed
Yang Y, Wang J, Xie F, Liu J, Shu C, Wang Y, Zheng Y, Zhang H (2021) A convolutional neural network trained with dermoscopic images of psoriasis performed on par with 230 dermatologists. Comput Biol Med 139:104924
Article Google Scholar
Yang Y, Qin Q, Luo Y, Liu Y, Zhang Q, Han J (2022) Bi-directional progressive guidance network for rgb-d salient object detection. IEEE Trans Circuits Syst Video Technol 32(8):5346–5360
Article Google Scholar
Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang ZH, Tay FE, Feng J, Yan S (2021) Tokens-to-token vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 558–567
Zhang M, Ren W, Piao Y, Rong Z, Lu H (2020) Select, supplement and focus for rgb-d saliency detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3472–3481
Zhang J, Fan DP, Dai Y, Anwar S, Saleh F, Aliakbarian S, Barnes N (2021a) Uncertainty inspired rgb-d saliency detection. IEEE Trans Pattern Anal Mach Intell
Zhang Y, Zheng J, Li L, Liu N, Jia W, Fan X, Xu C, He X (2021) Rethinking feature aggregation for deep rgb-d salient object detection. Neurocomputing 423:463–473
Article Google Scholar
Zhao JX, Cao Y, Fan DP, Cheng MM, Li XY, Zhang L (2019) Contrast prior and fluid pyramid integration for rgbd salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3927–3936
Zhao X, Zhang L, Pang Y, Lu H, Zhang L (2020) A single stream network for robust and real-time rgb-d salient object detection. In: European conference on computer vision. Springer, pp 646–662
Zhou W, Zhu Y, Lei J, Wan J, Yu L (2021) Ccafnet: crossflow and cross-scale adaptive fusion network for detecting salient objects in rgb-d images. IEEE Trans Multimed 24:2192–2204
Article Google Scholar
Zhou W, Yue Y, Fang M, Qian X, Yang R, Yu L (2023) Bcinet: bilateral cross-modal interaction network for indoor scene understanding in rgb-d images. Inf Fusion 94:32–42
Article Google Scholar

Download references

Acknowledgements

This work was partly supported by Rashtriya Uchchatar Shiksha Abhiyan (RUSA) 2.0 of India under “Research, Innovation and Quality Improvement” for Project ID. T4G

Author information

Authors and Affiliations

School of Engineering, Cochin University of Science and Technology, Kochi, Kerala, India
Shilpa Elsa Abraham & Binsu C. Kovoor

Authors

Shilpa Elsa Abraham
View author publications
You can also search for this author in PubMed Google Scholar
Binsu C. Kovoor
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shilpa Elsa Abraham.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Abraham, S.E., Kovoor, B.C. Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection. J Ambient Intell Human Comput 15, 2341–2359 (2024). https://doi.org/10.1007/s12652-024-04758-2

Download citation

Received: 31 May 2023
Accepted: 22 January 2024
Published: 02 March 2024
Issue Date: April 2024
DOI: https://doi.org/10.1007/s12652-024-04758-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

Abstract

Access this article

Similar content being viewed by others

CSNet: a ConvNeXt-based Siamese network for RGB-D salient object detection

Feature Enhancement and Multi-scale Cross-Modal Attention for RGB-D Salient Object Detection

Saliency Detection Framework Based on Deep Enhanced Attention Network

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection

Abstract

Access this article

Similar content being viewed by others

CSNet: a ConvNeXt-based Siamese network for RGB-D salient object detection

Feature Enhancement and Multi-scale Cross-Modal Attention for RGB-D Salient Object Detection

Saliency Detection Framework Based on Deep Enhanced Attention Network

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation