
Artificially Generated Text Fragments Search in Academic Documents

Doklady Mathematics

Abstract

Recent advances in generative language models make it possible to create artificial texts that closely resemble human-written ones. Many methods for detecting texts produced by large language models have already been developed; however, detection methods improve in parallel with generation methods. It is therefore necessary to study new generative models and to modernize existing detection approaches. In this paper, we present an extensive analysis of existing detection methods, as well as a study of the lexical, syntactic, and stylistic features of generated fragments. Building on these findings, we tested the detection methods we consider most effective for machine-generated documents, with a view to their application in the scientific domain. Experiments were conducted for Russian and English on the collected datasets. The developed methods improved detection quality to an F1-score of 0.968 for Russian and 0.825 for English. The described techniques can be applied to detect generated fragments in scientific papers, research reports, and theses.
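The abstract describes feature-based detection of machine-generated text evaluated with the F1-score. The sketch below is purely illustrative and is not the authors' implementation: it trains a simple classifier on character n-gram features (a crude proxy for the lexical and stylistic cues the paper studies) over a toy placeholder corpus, then scores it with F1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy placeholder corpus: 1 = machine-generated, 0 = human-written.
train_texts = [
    "The results demonstrate significant improvements across all metrics.",
    "In conclusion, the proposed method outperforms existing baselines.",
    "honestly i just threw the data at the model and hoped for the best",
    "we argued for hours about which tokenizer to use, then gave up",
]
train_labels = [1, 1, 0, 0]
test_texts = [
    "The experimental results demonstrate consistent improvements.",
    "we spent the whole night debugging that one cursed regex",
]
test_labels = [1, 0]

# Character n-grams crudely capture lexical/stylistic regularities;
# real detectors (as surveyed in the paper) use far richer features.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)
print("F1:", f1_score(test_labels, pred))
```

On a realistic corpus the same pipeline would be fit on thousands of documents per class; the F1-score is then the metric reported in the paper (0.968 for Russian, 0.825 for English with the authors' developed methods).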

Fig. 1.
Fig. 2.
Fig. 3.



Funding

This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.

Author information


Corresponding authors

Correspondence to G. M. Gritsay, A. V. Grabovoy, A. S. Kildyakov or Yu. V. Chekhovich.

Ethics declarations

The authors of this work declare that they have no conflicts of interest.

Additional information

Translated by E. Oborin

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Gritsay, G.M., Grabovoy, A.V., Kildyakov, A.S. et al. Artificially Generated Text Fragments Search in Academic Documents. Dokl. Math. 108 (Suppl 2), S434–S442 (2023). https://doi.org/10.1134/S1064562423701211
