Abstract
Recent advances in generative language models make it possible to produce artificial texts that closely resemble human-written ones. A large number of methods for detecting texts produced by large language models have already been developed; however, detection methods improve in parallel with generation methods. It is therefore necessary to study new generative models and to modernize existing detection approaches. In this paper, we present an extensive analysis of existing detection methods, together with a study of the lexical, syntactic, and stylistic features of generated fragments. Building on this analysis, we tested the detection methods we consider most effective for machine-generated documents, with a view to their application in the scientific domain. Experiments were conducted on collected datasets for Russian and English. The developed methods improved detection quality to an F1-score of 0.968 for Russian and 0.825 for English. The described techniques can be applied to detect generated fragments in scientific, research, and graduate papers.
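To make the evaluation setup concrete, the following is a minimal, hypothetical sketch (not the authors' actual pipeline) of the two ingredients the abstract mentions: extracting simple lexical/stylistic surface features from a text fragment, and scoring a binary detector with the F1 metric. All function names and the specific features chosen here are illustrative assumptions.

```python
# Hypothetical illustration only: simple surface features of the kind used
# in machine-generated-text detection, plus a hand-rolled binary F1 score.
import re

def lexical_features(text):
    """Return a few lexical/stylistic surface features of a text fragment."""
    words = re.findall(r"[\w']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Vocabulary richness: unique words over total words.
        "ttr": len(set(words)) / max(len(words), 1),
        # Average sentence length in words.
        "avg_sent_len": len(words) / max(len(sentences), 1),
        # Average word length in characters.
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
    }

def f1_score(y_true, y_pred):
    """Binary F1, with label 1 meaning 'machine-generated'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

In practice such features would feed a trained classifier (e.g., gradient boosting or a fine-tuned transformer); the sketch only fixes the interface and the metric.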
Funding
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
Ethics declarations
The authors of this work declare that they have no conflicts of interest.
Additional information
Translated by E. Oborin
Cite this article
Gritsay, G.M., Grabovoy, A.V., Kildyakov, A.S. et al. Artificially Generated Text Fragments Search in Academic Documents. Dokl. Math. 108 (Suppl 2), S434–S442 (2023). https://doi.org/10.1134/S1064562423701211