Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2) Pub Date: 2024-04-15, DOI: 10.1145/3648362
Anil Ahmed, Degen Huang, Syed Yasser Arafat, Imran Hameed

Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled in NER for well-resourced languages such as English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages such as Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity, a context-dependent lexicon, and the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from Language Models (ELMo). In particular, fine-tuning is performed on BERT-Base and ELMo using Urdu Wikipedia and news articles. Second, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens and employs pre-trained masked language models to predict the masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short- and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and an innovative DA approach, attains the highest F1-score of 93.99%, highlighting its efficacy for the U-NER task.
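
As a rough illustration of the mask-and-predict augmentation described above, the Python sketch below masks each named-entity token and asks a pre-trained masked language model to propose replacements. The bert-base-multilingual-cased checkpoint, the BIO-style "O" label convention, and the augment() helper are assumptions made for illustration; the paper itself fine-tunes BERT-Base on Urdu Wikipedia and news text rather than using this off-the-shelf model.

from transformers import pipeline

# Assumption: an off-the-shelf multilingual masked LM stands in for the
# paper's Urdu-fine-tuned BERT-Base.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def augment(tokens, labels, top_k=3):
    """Mask each entity token in turn and let the masked LM propose
    substitutes, yielding new (tokens, labels) training examples.
    Labels stay aligned because exactly one token is swapped per example."""
    new_examples = []
    for i, label in enumerate(labels):
        if label == "O":                      # only mask named-entity tokens
            continue
        masked = list(tokens)
        masked[i] = fill_mask.tokenizer.mask_token
        for pred in fill_mask(" ".join(masked), top_k=top_k):
            candidate = list(tokens)
            # Predictions may be WordPiece subwords ("##..."); a real
            # pipeline would filter or detokenize these.
            candidate[i] = pred["token_str"].strip()
            new_examples.append((candidate, list(labels)))
    return new_examples

Each entity token yields up to top_k new sentences, each reusing the original label sequence, so the training set grows without any manual annotation.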
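Similarly, here is a minimal PyTorch sketch of the hybrid Transformer-Encoder + CNN tagger, assuming pre-computed BERT token embeddings as input; the layer count, attention heads, kernel width, and nine-label tag set are illustrative guesses, not the paper's reported configuration.

import torch
import torch.nn as nn

class HybridEncoderCNN(nn.Module):
    def __init__(self, emb_dim=768, num_labels=9, kernel_size=3):
        super().__init__()
        # Transformer encoder layers capture long-range dependencies.
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # A 1-D convolution captures short-range, morphology-like patterns.
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size,
                              padding=kernel_size // 2)
        self.classifier = nn.Linear(emb_dim, num_labels)

    def forward(self, embeddings):            # (batch, seq_len, emb_dim)
        x = self.encoder(embeddings)          # long-range context
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local patterns
        return self.classifier(torch.relu(x))  # per-token label logits

logits = HybridEncoderCNN()(torch.randn(2, 16, 768))  # -> shape (2, 16, 9)

Pairing the two views in this way is what lets a model pick up both sentence-level context (for polysemy) and local cues (for Urdu's rich morphology), as the abstract describes.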




Updated: 2024-04-15