Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm

Lagutina, K. V.

doi:10.3103/S0146411623070076

Published: 27 February 2024

Automatic Control and Computer Sciences, Volume 57, pages 817–827 (2023)

K. V. Lagutina (ORCID: orcid.org/0000-0002-1742-3240)

Abstract

This article investigates modern vector text models for classifying Russian-language texts by genre. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments were carried out on a corpus of 10 000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and analysis of statistics for the rhythmic characteristics made it possible to identify both the most rhythmically diverse genres (novels and reviews) and the least diverse one (scientific articles). These are the genres that are subsequently classified best using rhythm and the LSTM neural network classifier. Clustering and classifying texts by genre using the ELMo and BERT embeddings make it possible to separate one genre from another with a small number of errors. The multiclass classification F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the set of rhythmic characteristics for genre classification.
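As an illustration of the reported metric, the following is a minimal sketch of how a multiclass F-measure over the five genres could be computed. The abstract does not state which averaging scheme was used, so macro-averaging is an assumption here, and the label names and the function name `macro_f_measure` are hypothetical.

```python
from collections import Counter

# Illustrative label names for the five genres; not taken from the article.
GENRES = ["novel", "scientific_article", "review", "vk_post", "news"]


def macro_f_measure(y_true, y_pred, labels):
    """Return the macro-averaged F1 score: the mean of per-class F1 values.

    Assumption: macro-averaging; the article may use another scheme.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted class gains a false positive
            fn[true_label] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


if __name__ == "__main__":
    gold = ["novel", "novel", "news", "news"]
    predicted = ["novel", "news", "news", "news"]
    print(macro_f_measure(gold, predicted, ["novel", "news"]))
```

In this toy two-genre example, one novel is mislabeled as news, so the per-class scores are 2/3 and 0.8 and the macro average is about 0.73.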

REFERENCES

Kochetova, L.A. and Popov, V.V., Research of axiological dominants in press release genre based on automatic extraction of key words from corpus, Nauchnyy Dialog, 2019, no. 6, pp. 32–49. https://doi.org/10.24224/2227-1295-2019-6-32-49
Kessler, B., Nunberg, G., and Schütze, H., Automatic detection of text genre, Proc. 35th Annu. Meeting on Association for Computational Linguistics and Eighth Conf. of the European Chapter of the Association for Computational Linguistics, Madrid, 1997, Stroudsburg, Pa.: Association for Computational Linguistics, 1997, pp. 32–38. https://doi.org/10.3115/976909.979622
Onan, A., An ensemble scheme based on language function analysis and feature engineering for text genre classification, J. Inf. Sci., 2018, vol. 44, no. 1, pp. 28–47. https://doi.org/10.1177/0165551516677911
Dai, Z. and Huang, R., A joint model for structure-based news genre classification with application to text summarization, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Zong, Ch., Xia, F., Li, W., and Navigli, R., Eds., Association for Computational Linguistics, 2021, pp. 3332–3342. https://doi.org/10.18653/v1/2021.findings-acl.295
Lagutina, K.V., Lagutina, N.S., and Boychuk, E.I., Text classification by genres based on rhythmic characteristics, Autom. Control Comput. Sci., 2022, vol. 56, no. 7, pp. 735–743. https://doi.org/10.3103/S0146411622070136
Lagutina, K., Poletaev, A., Lagutina, N., Boychuk, E., and Paramonov, I., Automatic extraction of rhythm figures and analysis of their dynamics in prose of 19th-21st centuries, 2020 26th Conf. of Open Innovations Association (FRUCT), Yaroslavl, 2020, IEEE, 2020, pp. 247–255. https://doi.org/10.23919/fruct48808.2020.9087430
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L., Deep contextualized word representations, Proc. 2018 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans: Association for Computational Linguistics, 2018, vol. 1, pp. 2227–2237. https://doi.org/10.18653/v1/n18-1202
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., BERT: Pre-training of deep bidirectional transformers for language understanding, Proc. 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Burstein, J., Doran, Ch., and Solorio, Th., Eds., Minneapolis: Association for Computational Linguistics, 2019, vol. 1, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423
Wang, C., Nulty, P., and Lillis, D., A comparative study on word embeddings in deep learning for text classification, Proc. 4th Int. Conf. on Natural Language Processing and Information Retrieval, Seoul, 2020, New York: Association for Computing Machinery, 2020, pp. 37–46. https://doi.org/10.1145/3443279.3443304
Kuratov, Y. and Arkhipov, M., Adaptation of deep bidirectional multilingual transformers for Russian language, Komp’yuternaya lingvistika i intellektual’nye tekhnologii po materialam ezhegodnoi mezhdunarodnoi konf. Dialog-2019 (Computer Linguistics and Intelligent Technologies from the Annu. Int. Conf. Dialogue-2019), Moscow: 2019, pp. 333–339.
Kutuzov, A. and Pivovarova, L., RuShiftEval: A shared task on semantic shift detection for Russian, Komp’yuternaya lingvistika i intellektual’nye tekhnologii po materialam ezhegodnoi mezhdunarodnoi konf. Dialog-2021 (Computational Linguistics and Intellectual Technologies Papers from the Annu. Int. Conf. Dialogue-2021), 2021, vol. 20, pp. 533–545.
Rodina, J., Trofimova, Yu., Kutuzov, A., and Artemova, E., ELMo and BERT in semantic change detection for Russian, Analysis of Images, Social Networks and Texts. AIST 2020, Van der Aalst, W.M.P., Ed., Lecture Notes in Computer Science, Cham: Springer, 2020, pp. 175–186. https://doi.org/10.1007/978-3-030-72610-2_13
Glazkova, A.V., Topical classification of text fragments accounting for their nearest context, Autom. Remote Control, 2020, vol. 81, no. 12, pp. 2262–2276. https://doi.org/10.1134/s0005117920120097
Batraeva, I.A., Nartsev, A.D., and Lezgyan, A.S., Using the analysis of semantic proximity of words in solving the problem of determining the genre of texts within deep learning, Vestn. Tomsk. Gos. Univ. Upr., Vychisl. Tekh. Inf., 2020, no. 50, pp. 14–22. https://doi.org/10.17223/19988605/50/2
Bocharov, V., Alexeeva, S., Granovsky, D., Protopopova, E., Stepanova, M., and Surikov, A., Crowdsourcing morphological annotation, Komp’yuternaya lingvistika i intellektual’nye tekhnologii po materialam ezhegodnoi mezhdunarodnoi konf. Dialog-2013 (Computational Linguistics and Intellectual Technologies: Papers from the Annu. Int. Conf. Dialogue-2013), 2013, vol. 1, pp. 109–114.
Lagutina, K., Lagutina, N., Boychuk, E., Larionov, V., and Paramonov, I., Authorship verification of literary texts with rhythm features, 2021 28th Conf. of Open Innovations Association (FRUCT), Moscow, 2021, IEEE, 2021, pp. 240–251. https://doi.org/10.23919/fruct50888.2021.9347649

Funding

This work was supported by a President of the Russian Federation Scholarship for young scientists and postgraduates, project no. SP-2109.2021.5.

Author information

Authors and Affiliations

Demidov State University, 150003, Yaroslavl, Russia
K. V. Lagutina

Corresponding author

Correspondence to K. V. Lagutina.

Ethics declarations

The author of this work declares that she has no conflicts of interest.

Additional information

Translated by A. Kolemesin

Publisher’s Note.

Allerton Press remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lagutina, K.V. Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm. Aut. Control Comp. Sci. 57, 817–827 (2023). https://doi.org/10.3103/S0146411623070076

Received: 17 August 2022
Revised: 04 November 2022
Accepted: 09 November 2022
Published: 27 February 2024
Issue Date: December 2023
DOI: https://doi.org/10.3103/S0146411623070076
