Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm

Automatic Control and Computer Sciences

Abstract

This article investigates modern vector text models for the genre classification of Russian-language texts. The models include ELMo embeddings, a pretrained BERT language model, and a set of numerical rhythmic characteristics based on lexico-grammatical tools. The experiments were carried out on a corpus of 10 000 texts in five genres: novels, scientific articles, reviews, posts from the VKontakte social network, and news from OpenCorpora. Visualization and statistical analysis of the rhythmic characteristics made it possible to identify both the most rhythmically diverse genres (novels and reviews) and the least diverse one (scientific articles). These are the genres that are subsequently classified best using rhythm and an LSTM neural network classifier. Clustering and classifying texts by genre using ELMo and BERT embeddings make it possible to separate one genre from another with a small number of errors. The multiclass F-measure reaches 99%. This study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the set of rhythmic characteristics when applied to genre classification.
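The abstract reports a multiclass F-measure without stating how it is averaged over the five genres. As a minimal sketch, assuming macro averaging (the unweighted mean of per-genre F1 scores), the metric can be computed as below. The genre label strings, the function name `macro_f1`, and the toy predictions are hypothetical and serve only as illustration; they are not taken from the article.

```python
from collections import Counter

# Hypothetical label names for the five genres listed in the abstract.
GENRES = ["novel", "scientific_article", "review", "vk_post", "news"]


def macro_f1(y_true, y_pred, labels=GENRES):
    """Macro-averaged F-measure: unweighted mean of per-class F1 scores.

    Assumption: the article may use a different averaging scheme
    (for example micro or weighted); this is only one common choice.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1   # correct prediction for this genre
        else:
            fp[pred] += 1   # predicted genre received a false positive
            fn[true] += 1   # true genre received a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
y_true = ["novel", "novel", "review", "news", "vk_post", "scientific_article"]
y_pred = ["novel", "review", "review", "news", "vk_post", "scientific_article"]
score = macro_f1(y_true, y_pred)
```

Macro averaging gives each genre equal weight regardless of how many texts it contains, so a genre that is classified poorly lowers the score even when it is small.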

Funding

This work was supported by the Scholarship of the President of the Russian Federation for young scientists and postgraduates, project no. SP-2109.2021.5.

Author information

Corresponding author

Correspondence to K. V. Lagutina.

Ethics declarations

The author of this work declares that she has no conflicts of interest.

Additional information

Translated by A. Kolemesin

Publisher’s Note.

Allerton Press remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lagutina, K.V. Genre Classification of Russian Texts Based on Modern Embeddings and Rhythm. Aut. Control Comp. Sci. 57, 817–827 (2023). https://doi.org/10.3103/S0146411623070076

  • DOI: https://doi.org/10.3103/S0146411623070076
