
Visual Topic Semantic Enhanced Machine Translation for Multi-Modal Data Efficiency

  • Regular Paper
  • Published in the Journal of Computer Science and Technology

Abstract

The scarcity of bilingual parallel corpora limits the exploitation of state-of-the-art supervised translation technology. One research direction is to exploit relations among multi-modal data to enhance performance. However, the reliance on manually annotated multi-modal datasets makes data labeling expensive. In this paper, the topic semantics of images is proposed to alleviate this problem. First, topic-related images can be collected automatically from the Internet by search engines. Second, topic semantics is sufficient to encode the relations between multi-modal data such as texts and images. Specifically, we propose a visual topic semantic enhanced translation (VTSE) model that utilizes topic-related images to construct a cross-lingual and cross-modal semantic space, allowing the model to integrate syntactic structure and semantic features simultaneously. In this process, topic-similar texts and images are wrapped into groups so that the model can extract more robust topic semantics from a set of similar images and further optimize the feature integration. The results show that our model outperforms competitive baselines by a large margin on the Multi30K and Ambiguous COCO datasets. Our model can use external images to bring gains to translation, improving data efficiency.
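As a rough illustration of the grouping idea described in the abstract (this is a minimal sketch, not the authors' VTSE implementation; the module name, feature dimensions, mean pooling over the image group, and the concatenation-based fusion are all assumptions), the following code pools features from a set of topic-related images into a single topic embedding and fuses it with a sentence encoding in a shared space:

```python
import torch
import torch.nn as nn

class TopicSemanticFusion(nn.Module):
    """Hypothetical sketch: fuse a group of topic-related image features
    with a source-sentence encoding in a shared semantic space."""

    def __init__(self, img_dim=2048, txt_dim=512, topic_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, topic_dim)    # project image features into the shared space
        self.txt_proj = nn.Linear(txt_dim, topic_dim)    # project text features into the shared space
        self.fuse = nn.Linear(2 * topic_dim, topic_dim)  # combine topic and text vectors

    def forward(self, img_feats, txt_feat):
        # img_feats: (num_images, img_dim) -- features of a group of topic-related images
        # txt_feat:  (txt_dim,)            -- encoding of the source sentence
        # Averaging over the group yields a topic vector that is more robust
        # to noise in any single retrieved image (assumed pooling choice).
        topic = self.img_proj(img_feats).mean(dim=0)
        txt = self.txt_proj(txt_feat)
        return torch.tanh(self.fuse(torch.cat([topic, txt], dim=-1)))

# Usage: a group of five topic-related images and one source sentence.
model = TopicSemanticFusion()
fused = model(torch.randn(5, 2048), torch.randn(512))
print(fused.shape)  # torch.Size([512])
```

The fused vector would then condition a standard encoder-decoder translation model; in the paper's setting, the grouping of topic-similar texts and images is what lets automatically retrieved web images substitute for manually annotated image-caption pairs.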



Author information


Corresponding author

Correspondence to Zhi-Hong Chong.

Supplementary Information

ESM 1

(PDF 151 kb)


About this article

Cite this article

Wang, C., Cai, SJ., Shi, BX. et al. Visual Topic Semantic Enhanced Machine Translation for Multi-Modal Data Efficiency. J. Comput. Sci. Technol. 38, 1223–1236 (2023). https://doi.org/10.1007/s11390-023-1302-6

