TAJA Corpus: Linguistically Tagged Written Algerian Judeo-Arabic Corpus
Journal of Jewish Languages, Pub Date: 2022-07-07, DOI: 10.1163/22134638-bja10020
Ofra Tirosh-Becker, Oren M. Becker

The Tagged Algerian Judeo-Arabic (TAJA) corpus is the first linguistically annotated corpus of any Judeo-Arabic dialect regardless of geography and period. The corpus is a genre-diverse collection of written Modern Algerian Judeo-Arabic texts, encompassing translations of the Bible and of liturgical texts, commentaries and original Judeo-Arabic books and journals. The TAJA corpus was manually annotated with parts-of-speech (POS) tags and detailed morphology tags. The goal of the new corpus is twofold. First, it preserves this endangered Judeo-Arabic language, expanding on previous fieldwork and going beyond the study of individual written texts. The corpus has already enabled us to make strides towards a grammar of written Algerian Judeo-Arabic. Second, this tagged corpus serves as a foundation for the development of Judeo-Arabic-specific Natural Language Processing (NLP) tools, which allow automatic POS tagging and morphological annotation of large collections of yet untapped texts in Algerian Judeo-Arabic and other Judeo-Arabic varieties.

Updated: 2022-07-07