Unsupervised multimodal learning for image-text relation classification in tweets

  • Theoretical Advances
  • Published in Pattern Analysis and Applications

Abstract

Recent studies show that multimodality can effectively enhance the understanding of social media content, and the relation between a post's text and image has become an important basis for building multimodal datasets and models. Some studies have attempted to label image-text relations (ITRs) and train supervised learning models, but manually labeling ITRs is challenging and produces many controversial labels because of disagreements among annotators. In this paper, we present a novel unsupervised multimodal method called ITR pseudo-labeling (ITRp), which learns multimodal representations for the various ITR types using different finetuning strategies. ITRp generates pseudo-labels by clustering and uses them as supervision to train the classifier and encoders. We evaluate ITRp on the ITR dataset and analyze the effects of incorrectly labeled samples on both supervised and unsupervised models. The code and data are available at https://github.com/SuYindu/ITRp.
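
The abstract sketches ITRp's core loop: cluster joint image-text representations to obtain pseudo-labels, then use those pseudo-labels as supervision to finetune the encoders and classifier, and repeat. The following is a minimal PyTorch/scikit-learn sketch of that deep-clustering-style loop; the encoder architecture, feature dimensions, the four-way label space, and all names below are illustrative assumptions, not the paper's actual implementation (see https://github.com/SuYindu/ITRp for the authors' code).

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_ITR_TYPES = 4  # assumption: one cluster per image-text relation type


class MultimodalEncoder(nn.Module):
    """Toy stand-in for the finetuned text/image encoders (e.g. BERT + ResNet)."""

    def __init__(self, text_dim=768, image_dim=2048, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, text_feat, image_feat):
        # Project each modality, concatenate, and fuse into one joint vector.
        joint = torch.cat([self.text_proj(text_feat),
                           self.image_proj(image_feat)], dim=-1)
        return torch.relu(self.fuse(joint))


def train_with_pseudo_labels(encoder, classifier, text_feats, image_feats,
                             epochs=5, lr=1e-4):
    """Alternate (1) k-means over joint features -> pseudo-labels and
    (2) supervised training of encoder + classifier on those labels."""
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(classifier.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        # Step 1: cluster the current representations to get pseudo-labels.
        with torch.no_grad():
            feats = encoder(text_feats, image_feats).numpy()
        pseudo = KMeans(n_clusters=NUM_ITR_TYPES, n_init=10).fit_predict(feats)
        labels = torch.as_tensor(pseudo, dtype=torch.long)
        # Cluster indices are arbitrary after each reassignment, so reset
        # the classifier head before training on the new pseudo-labels.
        classifier.reset_parameters()
        # Step 2: use the pseudo-labels as supervision.
        logits = classifier(encoder(text_feats, image_feats))
        loss = loss_fn(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(f"epoch {epoch}: pseudo-label loss = {loss.item():.4f}")


if __name__ == "__main__":
    # Random features standing in for precomputed BERT/ResNet outputs.
    n = 64
    text_feats, image_feats = torch.randn(n, 768), torch.randn(n, 2048)
    encoder = MultimodalEncoder()
    classifier = nn.Linear(256, NUM_ITR_TYPES)
    train_with_pseudo_labels(encoder, classifier, text_feats, image_feats)

The classifier reset before each round compensates for k-means assigning arbitrary cluster indices on every reassignment, the same device used by DeepCluster (Caron et al., 2018); how ITRp itself handles this is a detail left to the paper and repository.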

Data availability

The datasets generated and/or analyzed during the current study are available in the GitHub repository: https://github.com/SuYindu/ITRp.

Author information

Corresponding author

Correspondence to Lin Sun.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, L., Li, Q., Liu, L. et al. Unsupervised multimodal learning for image-text relation classification in tweets. Pattern Anal Applic 26, 1793–1804 (2023). https://doi.org/10.1007/s10044-023-01204-5
