Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study.
Journal of Educational Evaluation for Health Professions. Pub Date: 2023-11-20, DOI: 10.3352/jeehp.2023.20.30
Betzy Clariza Torres-Zegarra 1, Wagner Rios-Garcia 2, Alvaro Micael Ñaña-Cordova 1, Karen Fatima Arteaga-Cisneros 1, Xiomara Cristina Benavente Chalco 1, Marina Atena Bustamante Ordoñez 1, Carlos Jesus Gutierrez Rios 1, Carlos Alberto Ramos Godoy 3, 4, Kristell Luisa Teresa Panta Quezada 4, Jesus Daniel Gutierrez-Arratia 4, 5, Javier Alejandro Flores-Cohaila 1, 4

PURPOSE We aimed to describe the performance of artificial intelligence chatbots (GPT-3.5, GPT-4, Bard, Claude, and Bing) on the Peruvian National Medical Licensing Examination (P-NLME) and to evaluate the educational value of the justifications they provided.

METHODS This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Four medical educators then categorized the MCQs by medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).

RESULTS GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude; the historical performance of Peruvian examinees is 55%. Among the factors examined, only MCQs requiring Peru-specific knowledge were associated with lower odds of a correct answer (odds ratio, 0.23; 95% confidence interval, 0.09-0.61); the remaining factors showed no association. In the assessment of the educational value of the justifications provided by GPT-4 and Bing, no significant differences were found in certainty, usefulness, or potential use in the classroom.

CONCLUSION Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. The educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.
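The abstract reports the association between Peru-specific knowledge and correct answers as an odds ratio with a 95% confidence interval but does not specify how the estimate was obtained. As a minimal illustration only, the sketch below computes an odds ratio and a Wald confidence interval from a 2x2 table; the counts are invented for illustration and are not the study's data, and the actual analysis may have used a regression model instead.

```python
import math

# Hypothetical 2x2 table (illustrative counts, not from the study):
# rows: MCQ requires Peru-specific knowledge (yes / no)
# cols: chatbot answered correctly (yes / no)
a, b = 10, 8     # Peru-specific: correct, incorrect
c, d = 120, 22   # not Peru-specific: correct, incorrect

# Odds ratio: odds of a correct answer on Peru-specific items
# divided by the odds on the remaining items.
or_hat = (a / b) / (c / d)

# Wald 95% confidence interval on the log-odds-ratio scale.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
log_or = math.log(or_hat)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {or_hat:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```

An odds ratio below 1, with a confidence interval that excludes 1 (as in the reported 0.23, 0.09-0.61), indicates that the chatbots were significantly less likely to answer Peru-specific MCQs correctly.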
