THAR- Targeted Hate Speech Against Religion: A high-quality Hindi-English code-mixed Dataset with the Application of Deep Learning Models for Automatic Detection, ACM Transactions on Asian and Low-Resource Language Information Processing


ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2024-03-18, DOI: 10.1145/3653017
Deepawali Sharma₁, Aakash Singh₁, Vivek Kumar Singh₂


During the last decade, social media has gained significant popularity as a medium for individuals to express their views on various topics. However, some individuals also exploit social media platforms to spread hatred through their comments and posts, some of which target individuals, communities or religions. Given the deep emotional connections people have to their religious beliefs, this form of hate speech can be divisive and harmful, and may lead to mental health issues as well as social disorder. Therefore, there is a need for algorithmic approaches to the automatic detection of hate speech. Most existing studies in this area focus on social media content in English, and as a result several low-resource languages lack computational resources for the task. This study attempts to address this research gap by providing a high-quality annotated dataset designed specifically for identifying hate speech against religions in the Hindi-English code-mixed language. The dataset, “Targeted Hate Speech Against Religion” (THAR), consists of 11,549 comments and has been annotated by five independent annotators. It comprises two subtasks: (i) Subtask-1 (binary classification) and (ii) Subtask-2 (multi-class classification). To ensure the quality of annotation, the Fleiss Kappa measure has been employed. The suitability of the dataset is then further explored by applying different standard deep learning and transformer-based models. The transformer-based model Multilingual Representations for Indian Languages (MuRIL) is found to outperform the other implemented models in both subtasks, achieving macro-average and weighted-average F1 scores of 0.78 and 0.78 for Subtask-1, and 0.65 and 0.72 for Subtask-2, respectively. The experimental results not only confirm the suitability of the dataset but also advance research towards the automatic detection of hate speech, particularly in the low-resource Hindi-English code-mixed language.
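The abstract reports annotation quality via Fleiss' kappa and model quality via macro-average and weighted-average F1, without showing how these are computed. The following is a minimal sketch of the textbook definitions in plain Python, not the authors' code; the function names `fleiss_kappa` and `macro_and_weighted_f1` and the input layout are illustrative assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Assumes every item was labelled by the same number of annotators (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p * p for p in proportions)

    if p_e == 1.0:  # only one category was ever used; kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, weighted F1) over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    f1 = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1[label] = 2 * tp / denom if denom else 0.0

    macro = sum(f1.values()) / len(labels)  # every class counts equally
    weighted = sum(f1[l] * support[l] for l in labels) / len(y_true)  # by class size
    return macro, weighted
```

Macro averaging gives every class equal weight, while weighted averaging scales each class by its number of true instances. The two can therefore diverge when classes are imbalanced, which is one possible reading of the gap between the 0.65 and 0.72 scores reported for Subtask-2.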


Updated: 2024-03-18
