Machine learning model for chatGPT usage detection in students’ answers to open-ended questions: Case of Lithuanian language

Education and Information Technologies

Abstract

The public availability of large language models, such as chatGPT, brings additional possibilities and challenges to education. Education institutions have to identify when a text has been generated by a large language model and when it has been written by the students themselves. In this paper, chatGPT usage in students' answers is investigated. The main aim of the research was to build a machine learning model that could be used in the evaluation of students' answers to open-ended questions written in the Lithuanian language. The model should determine whether the answers were originally written by students or produced with the help of chatGPT. A new dataset of student answers has been collected to train machine learning models. The dataset consists of original student answers, chatGPT answers, and paraphrased chatGPT answers. More than 1000 answers have been prepared in total. Twenty-four combinations of text pre-processing algorithms have been analyzed. In text pre-processing, the main focus was on various tokenization methods, such as Bag of Words and n-grams, the stemming algorithm, and the stop words list. For the analyzed dataset, these pre-processing methods were more effective than applying multilingual BERT for document embedding. Based on the properties of the dataset, the following learning algorithms have been investigated: artificial neural networks, decision trees, random forests, gradient boosting trees, k-nearest neighbours, and naive Bayes. The main results show that the highest accuracy, 87% in some cases, can be obtained using gradient boosting trees, random forests, and artificial neural networks. The lowest accuracy has been obtained using the k-nearest neighbours algorithm. Furthermore, the results of the experimental research suggest that the usage of chatGPT in student answers can be automatically identified.
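The pipeline named in the abstract (stop-word removal, Bag of Words over n-grams, then a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the function and class names (`ngram_tokens`, `NaiveBayes`), the stop-word list, and the four English toy answers are hypothetical placeholders, stemming is omitted, and naive Bayes is used only because it is the shortest of the investigated algorithms to write out, not because it performed best.

```python
import math
from collections import Counter


def ngram_tokens(text, n_max=2, stop_words=frozenset()):
    """Lowercase, drop stop words, and return all word n-grams up to n_max."""
    words = [w for w in text.lower().split() if w not in stop_words]
    tokens = []
    for n in range(1, n_max + 1):
        tokens += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-n-grams counts, add-one smoothing."""

    def fit(self, docs, labels, **tok_kwargs):
        self.tok_kwargs = tok_kwargs
        self.doc_counts = Counter(labels)   # documents per class
        self.counts = {}                    # token counts per class
        for doc, label in zip(docs, labels):
            tokens = ngram_tokens(doc, **tok_kwargs)
            self.counts.setdefault(label, Counter()).update(tokens)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in ngram_tokens(doc, **self.tok_kwargs):
                if tok in self.vocab:       # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy answers (English placeholders, not taken from the dataset).
train_docs = [
    "i think the loop stops because the counter hits zero",
    "honestly not sure but the loop ends when counter is zero",
    "in summary the loop terminates when the condition evaluates to false",
    "in summary it is important to note that the condition becomes false",
]
train_labels = ["student", "student", "chatgpt", "chatgpt"]

model = NaiveBayes().fit(
    train_docs, train_labels, n_max=2, stop_words={"the", "is", "to"}
)
prediction = model.predict("in summary it is important to note the loop terminates")
print(prediction)
```

Changing `n_max` or the stop-word set corresponds to trying a different pre-processing combination; the study compared 24 such combinations across several classifiers.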

Data availability

The datasets gathered and analysed during the current study are available in the Kaggle repository, https://www.kaggle.com/datasets/pavelstefanovi/students-and-chatgpt-answers-in-lithuanian.

Funding

The research received no external funding.

Author information

Contributions

Pavel Stefanovič was responsible for the study conceptualization, while all authors contributed to the research design. Data collection, pre-processing, methodology, and study supervision were done by Pavel Stefanovič. Birutė Pliuskuvienė and Urtė Radvilaitė carried out the experimental research and the initial investigation. Simona Ramanauskaitė carried out the formal analysis and data validation. The manuscript was written through the joint effort of all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Pavel Stefanovič.

Ethics declarations

Conflict of interest

There is no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Stefanovič, P., Pliuskuvienė, B., Radvilaitė, U. et al. Machine learning model for chatGPT usage detection in students’ answers to open-ended questions: Case of Lithuanian language. Educ Inf Technol (2024). https://doi.org/10.1007/s10639-024-12589-z

  • DOI: https://doi.org/10.1007/s10639-024-12589-z