ArabicaQA: A Comprehensive Dataset for Arabic Question Answering
arXiv - CS - Information Retrieval, Pub Date: 2024-03-26, DOI: arxiv-2403.17848
Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt

In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. The dataset consists of 89,095 answerable questions and 3,701 unanswerable questions, the latter written by crowdworkers to look similar to answerable ones, along with additional labels for open-domain questions. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, designed specifically to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs on Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly available for further research at https://github.com/DataScienceUIBK/ArabicaQA.


