Large scale annotated dataset for code-mix abusive short noisy text
Language Resources and Evaluation (IF 2.7), Pub Date: 2024-01-25, DOI: 10.1007/s10579-023-09707-7
Paras Tiwari, Sawan Rai, C. Ravindranath Chowdary

Abstract

With globalization and worldwide cultural exchange, most of the population has gained knowledge of at least two languages. The bilingual user base on Social Media Platforms (SMPs) has contributed significantly to the popularity of code-mixing. However, alongside their many vital uses, SMPs also suffer from abusive text content. Identifying abusive instances in a single language is a challenging task, and it is even more challenging for code-mixed text. The detection of abusive posts is more complicated than it seems because the data is unseemly and noisy and the context is uncertain. To analyze this content, the research community needs an appropriate dataset, and a small dataset is not a suitable sample for such research. In this paper, we analyze the dimensions of Devanagari-Roman code-mixing in short noisy text and discuss the challenges posed by abusive instances. We propose a cost-effective methodology, with a relevancy score of 20.38%, to collect and annotate code-mixed abusive text instances. Our dataset is eight times the size of the related state-of-the-art dataset. It is balanced, with 55.81% of instances in the abusive class and 44.19% in the non-abusive class. To verify the usefulness of the dataset, we conducted experiments with traditional machine learning techniques, a traditional neural network architecture, recurrent neural network architectures, and a pre-trained Large Language Model (LLM). These experiments indicate that the dataset is suitable for further scientific work.
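As a rough illustration of what Devanagari-Roman mixing looks like at the token level, the sketch below labels each whitespace-separated token by the script of its letters. It is not the authors' methodology: the function names `script_of` and `script_profile` are hypothetical, the sample sentence is invented, and the check only detects script mixing. It cannot tell Romanized Hindi from English, which a real code-mix analysis would also need to handle.

```python
from collections import Counter

# Assumption: the Devanagari script is identified by its main Unicode
# block, U+0900 to U+097F. Extended Devanagari blocks are ignored here.
DEVANAGARI_START = "\u0900"
DEVANAGARI_END = "\u097F"


def script_of(token: str) -> str:
    """Label a token as 'devanagari', 'roman', 'mixed', or 'other'."""
    has_devanagari = any(DEVANAGARI_START <= ch <= DEVANAGARI_END for ch in token)
    has_roman = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_roman:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_roman:
        return "roman"
    return "other"  # digits, punctuation, emoji, other scripts


def script_profile(text: str) -> Counter:
    """Count tokens per script label in a short text."""
    return Counter(script_of(token) for token in text.split())


if __name__ == "__main__":
    # Invented example sentence, not taken from the dataset.
    sample = "yeh movie बहुत bakwaas hai !!"
    print(script_profile(sample))
```

A profile with non-zero counts for both `devanagari` and `roman` (or any `mixed` token) flags a post as script-mixed; posts written entirely in Roman script would need a language-identification step instead.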




Updated: 2024-01-25