Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm
Automatic Control and Computer Sciences Pub Date : 2024-02-27 , DOI: 10.3103/s0146411623070076
K. V. Lagutina

Abstract

This article investigates modern vector text models for classifying Russian-language texts by genre. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and statistical analysis of the rhythmic characteristics made it possible to identify both the genres that are most diverse in terms of rhythm (novels and reviews) and the genre that is least diverse (scientific articles). These same genres are subsequently classified best when the rhythmic characteristics are used with an LSTM neural network classifier. Clustering and classifying texts by genre with the ELMo and BERT embeddings separates one genre from another with a small number of errors. The multiclass F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the set of rhythmic characteristics for genre classification.




Updated: 2024-02-28