Abstract
This article investigates modern vector text models for genre classification of Russian-language texts. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and statistical analysis of the rhythmic characteristics made it possible to identify both the genres that are most diverse in terms of rhythm (novels and reviews) and the least diverse one (scientific articles). These same genres are subsequently classified best when rhythm is used with an LSTM neural network classifier. Clustering and classifying texts by genre with ELMo and BERT embeddings separates one genre from another with a small number of errors: the multiclass F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the set of rhythmic characteristics when applied to genre classification.
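To make the reported metric concrete, the sketch below computes a multiclass F-measure over five genre labels. It assumes macro-averaging (the unweighted mean of per-class F1 scores), which the abstract does not specify. The label names, the toy predictions, and the helper functions `per_class_f1` and `macro_f1` are hypothetical and serve only as an illustration; they are not the article's data or code.

```python
from collections import Counter

# Hypothetical label names for the five genres named in the abstract.
GENRES = ["novel", "scientific_article", "review", "social_post", "news"]


def per_class_f1(y_true, y_pred, labels):
    """Return a dict mapping each label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (macro-averaging is an assumption here)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy example (invented data): two texts per genre, one review mislabeled as a novel.
y_true = [g for g in GENRES for _ in range(2)]
y_pred = list(y_true)
y_pred[5] = "novel"  # index 5 is the second "review" text

scores = per_class_f1(y_true, y_pred, GENRES)
overall = macro_f1(y_true, y_pred, GENRES)
print(scores)
print(round(overall, 4))
```

With macro-averaging, every genre contributes equally to the final score regardless of how many texts it contains, so a single confusion between two genres lowers the score of both affected classes.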
Funding
This work was supported by the President of the Russian Federation Scholarship for young scientists and postgraduates, project no. SP-2109.2021.5.
Ethics declarations
The author of this work declares that she has no conflicts of interest.
Additional information
Translated by A. Kolemesin
Publisher’s Note.
Allerton Press remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Lagutina, K.V. Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm. Aut. Control Comp. Sci. 57, 817–827 (2023). https://doi.org/10.3103/S0146411623070076