Abstract
Recent advances in generative language models make it possible to produce artificial texts that closely resemble human-written ones. A large number of methods for detecting texts produced by large language models have already been developed; however, detection methods improve in parallel with generation methods. It is therefore necessary to study new generative models and to modernize existing detection approaches. In this paper, we present an extensive analysis of existing detection methods, together with a study of the lexical, syntactic, and stylistic features of generated fragments. Building on this analysis, we tested the detection methods we consider most effective for machine-generated documents, with a view to their application in the scientific domain. Experiments were conducted on collected datasets for Russian and English. The developed methods improved detection quality to an F1-score of 0.968 for Russian and 0.825 for English. The described techniques can be applied to detect generated fragments in scientific, research, and graduate papers.
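To make the evaluation setup concrete, the following is a minimal, hypothetical sketch (not the authors' actual pipeline) of the two ingredients the abstract mentions: extracting simple lexical/stylistic surface features from a text fragment, and scoring a binary detector with the F1 metric. All function names and the specific features chosen here are illustrative assumptions.

```python
# Hypothetical illustration only: simple surface features of the kind used
# in machine-generated-text detection, plus a hand-rolled binary F1 score.
import re

def lexical_features(text):
    """Return a few lexical/stylistic surface features of a text fragment."""
    words = re.findall(r"[\w']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Vocabulary richness: unique words over total words.
        "ttr": len(set(words)) / max(len(words), 1),
        # Average sentence length in words.
        "avg_sent_len": len(words) / max(len(sentences), 1),
        # Average word length in characters.
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
    }

def f1_score(y_true, y_pred):
    """Binary F1, with label 1 meaning 'machine-generated'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

In practice such features would feed a trained classifier (e.g., gradient boosting or a fine-tuned transformer); the sketch only fixes the interface and the metric.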
Funding
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
Ethics declarations
The authors of this work declare that they have no conflicts of interest.
Additional information
Translated by E. Oborin
Cite this article
Gritsay, G.M., Grabovoy, A.V., Kildyakov, A.S. et al. Artificially Generated Text Fragments Search in Academic Documents. Dokl. Math. 108 (Suppl 2), S434–S442 (2023). https://doi.org/10.1134/S1064562423701211