Crossing Linguistic Barriers: Authorship Attribution in Sinhala Texts
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2024-03-30, DOI: 10.1145/3655620
Raheem Sarwar 1 , Maneesha Perera 2 , Pin Shen Teh 1 , Raheel Nawaz 3 , Muhammad Umair Hassan 4

Authorship attribution involves determining the original author of an anonymous text from a pool of potential authors. The task has applications in several domains, such as plagiarism detection, digital text forensics, and information retrieval. Although these applications are not tied to any single language, existing research has predominantly centered on English, and its methods are difficult to apply to languages like Sinhala because of linguistic differences and a lack of language processing tools. We present the first comprehensive study on cross-topic authorship attribution for Sinhala texts and propose a solution that can effectively perform the authorship attribution task even if the topics of the test and training samples differ. Our solution consists of three main parts: (i) extraction of topic-independent stylometric features, (ii) generation of a small candidate author set with the help of similarity search, and (iii) identification of the true author. Several experimental studies were carried out to demonstrate that the proposed solution can effectively handle real-world scenarios involving a large number of candidate authors and a limited number of text samples for each candidate author.
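
The three-part structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the features, the similarity measure, or the final decision rule. Everything below is an assumption made for illustration, including the feature set (punctuation rates and a word-length histogram), whitespace tokenization, cosine similarity, the parameter `k`, and the function names `features`, `cosine`, and `attribute`.

```python
import math
from collections import Counter, defaultdict

# Assumed punctuation inventory; a real Sinhala system would likely need its own.
PUNCT = ".,;:!?\"'()-"

def features(text):
    """Hypothetical topic-independent features: punctuation rates per character
    followed by a word-length histogram (lengths 1..10, longer words capped at 10)."""
    words = text.split()  # assumption: whitespace tokenization
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    punct = [text.count(p) / n_chars for p in PUNCT]
    lengths = Counter(min(len(w), 10) for w in words)
    hist = [lengths.get(i, 0) / n_words for i in range(1, 11)]
    return punct + hist

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attribute(query, training, k=3):
    """training is a list of (author, text) pairs.
    Step (i): featurize; step (ii): keep the authors of the k most similar
    samples as the candidate set; step (iii): pick the candidate with the
    highest summed similarity. Returns (predicted_author, candidate_set)."""
    q = features(query)
    scored = sorted(((cosine(q, features(t)), a) for a, t in training), reverse=True)
    top = scored[:k]
    candidates = {a for _, a in top}
    totals = defaultdict(float)
    for score, author in top:
        totals[author] += score
    return max(totals, key=totals.get), candidates
```

A call such as `attribute(anonymous_text, training_pairs, k=5)` returns the predicted author together with the reduced candidate set. The brute-force scan over all training samples is a stand-in for whatever similarity search the paper uses with large candidate pools.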




Updated: 2024-03-30