Consensus-Based Machine Translation for Code-Mixed Texts
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2024-03-09, DOI: 10.1145/3628427
Sainik Kumar Mahata₁, Dipankar Das₂, Sivaji Bandyopadhyay₂

Multilingualism is widespread in India owing to its long history of foreign contact, so many speakers routinely converse in more than one language. The social media boom has further extended the use of multiple languages in everyday communication. A translation system that can serve novice and monolingual users is therefore urgently needed. Such systems can be built with methods such as statistical machine translation (SMT) and neural machine translation (NMT), each of which has its own advantages and disadvantages. Moreover, the parallel corpus of code-mixed data needed to build such a system is not readily available. In the present work, we present two translation frameworks that leverage the individual advantages of these existing approaches by building an ensemble model that takes a consensus of their final outputs and generates the target output. The developed models were used to translate English-Bengali code-mixed data (written in Roman script) into equivalent monolingual Bengali. A code-mixed to monolingual parallel corpus was also developed to train these systems. Empirical results show improved BLEU and TER scores of 17.23 and 53.18 for one framework and 19.12 and 51.29 for the other.
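
The abstract does not say how the consensus over system outputs is computed. As an illustration only, the sketch below shows one common way to take a consensus among candidate translations: pick the candidate that agrees most, on average, with all the others (a simplified minimum-Bayes-risk style selection). The function names `token_f1` and `consensus_pick`, the token-overlap similarity, and the placeholder strings are assumptions made for this example; they are not the authors' method.

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized strings (assumed similarity)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(h)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def consensus_pick(candidates: list[str]) -> str:
    """Return the candidate with the highest mean agreement with the others.

    Hypothetical helper: ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate translation")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(token_f1(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


# Placeholder outputs standing in for translations from different systems.
outputs = ["a b c d", "a b c e", "a b f d"]
print(consensus_pick(outputs))
```

With only two candidates (for example, one SMT output and one NMT output), pairwise agreement is symmetric and the choice falls to the tie-break, so a scheme like this is only informative with three or more candidates, such as n-best lists from each system.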

Updated: 2024-03-10
