
Robust cross-modal retrieval with alignment refurbishment


  • Research Article
  • Published in Frontiers of Information Technology & Electronic Engineering

Abstract

Cross-modal retrieval aims to enable mutual retrieval between modalities by establishing consistent alignments between data of different modalities. Many cross-modal retrieval methods have been proposed and have achieved excellent results; however, they are trained on clean cross-modal pairs, which are semantically matched but costly to collect compared with easily available data with noisy alignments (i.e., paired but semantically mismatched). When these methods are trained on noisy-aligned data, their performance degrades dramatically. We therefore propose robust cross-modal retrieval with alignment refurbishment (RCAR), which significantly reduces the impact of noise on the model. Specifically, RCAR first conducts multi-task learning to slow down overfitting to the noise, making the clean and noisy data separable. It then fits a two-component beta-mixture model to divide the pairs into clean and noisy alignments, and refurbishes each alignment label according to the posterior probability of the noise-alignment component. In addition, we define partial and complete noise in the noisy-alignment paradigm. Experimental results show that, compared with popular cross-modal retrieval methods, RCAR achieves more robust performance under both types of noise.
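To make the refurbishment step concrete, the sketch below illustrates the idea described in the abstract: fit a two-component beta-mixture model to normalized per-pair losses with EM, treat the higher-loss component as the noise-alignment component, and soften each alignment label by its posterior noise probability. This is a minimal illustrative sketch under our own assumptions (the function names, the method-of-moments M-step, the simple soft-label rule, and the synthetic data are all ours), not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of beta-mixture-based alignment
# refurbishment: per-pair losses are modeled as a mixture of two beta
# distributions; the higher-mean component is treated as noise, and each
# pair's alignment label is softened by its posterior noise probability.
import numpy as np
from scipy.stats import beta


def fit_beta_mixture(losses, n_iter=20, eps=1e-4):
    """EM for a 2-component beta mixture over losses scaled into (0, 1).

    Returns each pair's posterior probability of being a noisy alignment.
    """
    x = np.clip(losses, eps, 1.0 - eps)
    # Initialize responsibilities with a median split: low loss ~ clean.
    resp = np.stack([x < np.median(x), x >= np.median(x)], axis=1).astype(float)
    weights = np.array([0.5, 0.5])
    params = [(2.0, 5.0), (5.0, 2.0)]  # (alpha, beta) per component
    for _ in range(n_iter):
        # M-step: weighted method-of-moments fit of each beta component.
        params = []
        for k in range(2):
            w = resp[:, k] / max(resp[:, k].sum(), eps)
            mean = float(np.sum(w * x))
            var = float(np.sum(w * (x - mean) ** 2)) + eps
            common = mean * (1.0 - mean) / var - 1.0
            params.append((max(mean * common, eps),
                           max((1.0 - mean) * common, eps)))
        weights = resp.mean(axis=0)
        # E-step: posterior responsibility of each component for each pair.
        like = np.stack([weights[k] * beta.pdf(x, *params[k])
                         for k in range(2)], axis=1)
        resp = like / (like.sum(axis=1, keepdims=True) + 1e-12)
    # The component with the larger mean loss is the noise component.
    noise_k = int(np.argmax([a / (a + b) for a, b in params]))
    return resp[:, noise_k]


def refurbish_labels(labels, p_noise):
    """Soften alignment labels in proportion to the posterior noise prob."""
    return (1.0 - p_noise) * labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic per-pair losses: 90% clean (low) + 10% noisy (high) pairs.
    losses = np.concatenate([rng.beta(2, 8, 900), rng.beta(8, 2, 100)])
    p_noise = fit_beta_mixture(losses)
    soft_labels = refurbish_labels(np.ones_like(losses), p_noise)
    print("mean noise posterior, clean pairs:", p_noise[:900].mean())
    print("mean noise posterior, noisy pairs:", p_noise[900:].mean())
```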



Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Author information


Contributions

Jinyi GUO and Jieyu DING designed the research. Jinyi GUO processed the data and drafted the paper. Jieyu DING helped organize the paper. Jinyi GUO and Jieyu DING revised and finalized the paper.

Corresponding author

Correspondence to Jieyu Ding (丁洁玉).

Ethics declarations

Jinyi GUO and Jieyu DING declare that they have no conflict of interest.

Additional information

Project supported by the National Natural Science Foundation of China (No. 12172186)


About this article


Cite this article

Guo, J., Ding, J. Robust cross-modal retrieval with alignment refurbishment. Front Inform Technol Electron Eng 24, 1403–1415 (2023). https://doi.org/10.1631/FITEE.2200514

