Transcribing Bengali Text with Regional Dialects to IPA using District Guided Tokens
arXiv - CS - Artificial Intelligence Pub Date : 2024-03-26 , DOI: arxiv-2403.17407
S M Jishanul Islam, Sadia Ahmmed, Sahid Hossain Mustakim

Accurate transcription of Bengali text to the International Phonetic Alphabet (IPA) is challenging because of the language's complex phonology and its context-dependent sound changes. The challenge is even greater for regional Bengali dialects, which lack standardized spelling conventions, contain local and foreign words popular in their regions, and vary phonologically from region to region. This paper approaches this sequence-to-sequence problem by introducing the District Guided Tokens (DGT) technique on a new dataset spanning six districts of Bangladesh. The key idea is to give the model explicit information about the regional dialect, or "district", of the input text before it generates the IPA transcription. This is achieved by prepending a district token to the input sequence, which guides the model toward the phonetic patterns associated with each district. The DGT technique is used to fine-tune several transformer-based models on this new dataset. Experimental results demonstrate the effectiveness of DGT, with the ByT5 model outperforming word-based models such as mT5, BanglaT5, and umT5. This is attributed to ByT5's ability to handle the high percentage of out-of-vocabulary words in the test set. The proposed approach highlights the importance of incorporating regional dialect information into ubiquitous natural language processing systems for languages with diverse phonological variations. This work resulted from the "Bhashamul" challenge, which is dedicated to transcribing Bengali text with regional dialects to IPA (https://www.kaggle.com/competitions/regipa/). The training and inference notebooks are available through the competition link.
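
The prepending step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag format `<District>`, the helper names, and the district name "Sylhet" are all assumptions made for the example, and the abstract does not state how the token is actually written. The second helper mimics byte-level encoding in the style of ByT5 (UTF-8 bytes shifted past a few reserved special ids) to suggest why a byte-level model has no out-of-vocabulary words.

```python
from typing import List


def add_district_token(text: str, district: str) -> str:
    """Prepend a district tag to the input text.

    The "<District> text" layout is a hypothetical format chosen for
    illustration; the paper's actual token format may differ.
    """
    return f"<{district}> {text}"


def byte_level_ids(text: str, offset: int = 3) -> List[int]:
    """Encode text as UTF-8 bytes shifted by `offset`.

    This imitates ByT5-style tokenization, where the first few ids are
    reserved for special tokens. Every string maps to some id sequence,
    so unseen dialect spellings never fall outside the vocabulary.
    """
    return [b + offset for b in text.encode("utf-8")]


# Illustrative input: a Bengali word tagged with an example district name.
tagged = add_district_token("আমি", "Sylhet")
ids = byte_level_ids(tagged)
```

The tagged string, rather than the raw text, would then be passed to the model as the source sequence during both fine-tuning and inference, so the district information is available before any IPA output is generated.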

Updated: 2024-03-28