Automatic genre identification: a survey, Language Resources and Evaluation


Automatic genre identification: a survey
Language Resources and Evaluation (IF 2.7), Pub Date: 2023-11-16, DOI: 10.1007/s10579-023-09695-8
Taja Kuzman, Nikola Ljubešić

Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text categories defined by the author’s purpose, the common function of the text, and the text’s conventional form. Obtaining genre information has been shown to be beneficial for a wide range of disciplines, including linguistics, corpus linguistics, computational linguistics, natural language processing, information retrieval and information security. Consequently, in the past 20 years, numerous researchers have collected genre datasets with the aim of developing an efficient genre classifier. However, their approaches to the definition of genre schemata, data collection and manual annotation vary substantially, resulting in significantly different datasets. As most AGI experiments are dataset-dependent, a sufficient understanding of the differences between the available genre datasets is of great importance for researchers venturing into this area. In this paper, we present a detailed overview of different approaches to each of the steps of the AGI task, from the definition of the genre concept and the genre schema, to the dataset collection and annotation methods, and, finally, to machine learning strategies. Special focus is dedicated to the description of the most relevant genre schemata and datasets, and details on the availability of all the datasets are provided. In addition, the paper presents recent advances in machine learning approaches to automatic genre identification, and concludes by proposing directions towards developing a stable multilingual genre classifier.
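To make the task framing concrete, the following is a minimal sketch of genre identification as supervised text classification. It is not the method surveyed in the paper: it assumes a bag-of-words multinomial naive Bayes model with add-one smoothing, and the genre labels ("news", "instruction"), the toy training texts and the function names are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def train(samples):
    """Count documents and word occurrences per genre.

    samples: list of (text, genre) pairs.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, genre in samples:
        doc_counts[genre] += 1
        for tok in tokenize(text):
            word_counts[genre][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the genre with the highest log-posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_genre, best_score = None, -math.inf
    for genre in doc_counts:
        # Log prior from genre frequency in the training data.
        score = math.log(doc_counts[genre] / total_docs)
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(word_counts[genre].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are skipped
                score += math.log((word_counts[genre][tok] + 1) / denom)
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre


# Hypothetical toy dataset; real genre datasets are far larger and use
# a full genre schema rather than two labels.
SAMPLES = [
    ("the government announced new measures on monday", "news"),
    ("police said the suspect was arrested yesterday", "news"),
    ("mix the flour and sugar then bake for twenty minutes", "instruction"),
    ("press the button and hold it until the light turns on", "instruction"),
]

model = train(SAMPLES)
print(predict(model, "bake the cake and mix the sugar"))
print(predict(model, "police announced the suspect was arrested"))
```

Because the model learns only from the labelled examples it is given, its predictions depend on the genre schema and the annotated data, which illustrates why the abstract describes AGI experiments as dataset-dependent.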


Updated: 2023-11-17
