
Robust cross-modal retrieval with alignment refurbishment


  • Research Article
  • Published in Frontiers of Information Technology & Electronic Engineering

Abstract

Cross-modal retrieval aims to enable mutual retrieval between modalities by establishing consistent alignments between data of different modalities. Many cross-modal retrieval methods have been proposed and have achieved excellent results; however, they are trained on clean cross-modal pairs, which are semantically matched but costly to collect compared with easily available data with noisy alignments (i.e., paired but semantically mismatched). When these methods are trained on noisy-aligned data, their performance degrades dramatically. We therefore propose robust cross-modal retrieval with alignment refurbishment (RCAR), which significantly reduces the impact of noise on the model. Specifically, RCAR first conducts multi-task learning to slow down overfitting to the noise, making the clean and noisy data separable. It then fits a two-component beta-mixture model to divide the pairs into clean and noisy alignments, and refurbishes each alignment label according to the posterior probability of the noise-alignment component. In addition, we define partial and complete noise in the noisy-alignment paradigm. Experimental results show that, compared with popular cross-modal retrieval methods, RCAR achieves more robust performance under both types of noise.
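To make the refurbishment step concrete, the sketch below illustrates the idea described in the abstract: fit a two-component beta-mixture model to normalized per-pair losses with EM, treat the higher-loss component as the noise-alignment component, and soften each alignment label by its posterior noise probability. This is a minimal illustrative sketch under our own assumptions (the function names, the method-of-moments M-step, the simple soft-label rule, and the synthetic data are all ours), not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of beta-mixture-based alignment
# refurbishment: per-pair losses are modeled as a mixture of two beta
# distributions; the higher-mean component is treated as noise, and each
# pair's alignment label is softened by its posterior noise probability.
import numpy as np
from scipy.stats import beta


def fit_beta_mixture(losses, n_iter=20, eps=1e-4):
    """EM for a 2-component beta mixture over losses scaled into (0, 1).

    Returns each pair's posterior probability of being a noisy alignment.
    """
    x = np.clip(losses, eps, 1.0 - eps)
    # Initialize responsibilities with a median split: low loss ~ clean.
    resp = np.stack([x < np.median(x), x >= np.median(x)], axis=1).astype(float)
    weights = np.array([0.5, 0.5])
    params = [(2.0, 5.0), (5.0, 2.0)]  # (alpha, beta) per component
    for _ in range(n_iter):
        # M-step: weighted method-of-moments fit of each beta component.
        params = []
        for k in range(2):
            w = resp[:, k] / max(resp[:, k].sum(), eps)
            mean = float(np.sum(w * x))
            var = float(np.sum(w * (x - mean) ** 2)) + eps
            common = mean * (1.0 - mean) / var - 1.0
            params.append((max(mean * common, eps),
                           max((1.0 - mean) * common, eps)))
        weights = resp.mean(axis=0)
        # E-step: posterior responsibility of each component for each pair.
        like = np.stack([weights[k] * beta.pdf(x, *params[k])
                         for k in range(2)], axis=1)
        resp = like / (like.sum(axis=1, keepdims=True) + 1e-12)
    # The component with the larger mean loss is the noise component.
    noise_k = int(np.argmax([a / (a + b) for a, b in params]))
    return resp[:, noise_k]


def refurbish_labels(labels, p_noise):
    """Soften alignment labels in proportion to the posterior noise prob."""
    return (1.0 - p_noise) * labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic per-pair losses: 90% clean (low) + 10% noisy (high) pairs.
    losses = np.concatenate([rng.beta(2, 8, 900), rng.beta(8, 2, 100)])
    p_noise = fit_beta_mixture(losses)
    soft_labels = refurbish_labels(np.ones_like(losses), p_noise)
    print("mean noise posterior, clean pairs:", p_noise[:900].mean())
    print("mean noise posterior, noisy pairs:", p_noise[900:].mean())
```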



Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Author information


Contributions

Jinyi GUO and Jieyu DING designed the research. Jinyi GUO processed the data and drafted the paper. Jieyu DING helped organize the paper. Jinyi GUO and Jieyu DING revised and finalized the paper.

Corresponding author

Correspondence to Jieyu Ding (丁洁玉).

Ethics declarations

Jinyi GUO and Jieyu DING declare that they have no conflict of interest.

Additional information

Project supported by the National Natural Science Foundation of China (No. 12172186)


About this article


Cite this article

Guo, J., Ding, J. Robust cross-modal retrieval with alignment refurbishment. Front Inform Technol Electron Eng 24, 1403–1415 (2023). https://doi.org/10.1631/FITEE.2200514

