Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study.
Journal of Educational Evaluation for Health Professions. Pub Date: 2023-11-20, DOI: 10.3352/jeehp.2023.20.30
Betzy Clariza Torres-Zegarra 1, Wagner Rios-Garcia 2, Alvaro Micael Ñaña-Cordova 1, Karen Fatima Arteaga-Cisneros 1, Xiomara Cristina Benavente Chalco 1, Marina Atena Bustamante Ordoñez 1, Carlos Jesus Gutierrez Rios 1, Carlos Alberto Ramos Godoy 3, 4, Kristell Luisa Teresa Panta Quezada 4, Jesus Daniel Gutierrez-Arratia 4, 5, Javier Alejandro Flores-Cohaila 1, 4

PURPOSE We aimed to describe the performance of artificial intelligence chatbots (GPT-3.5, GPT-4, Bard, Claude, and Bing) on the Peruvian National Medical Licensing Examination (P-NLME) and to evaluate the educational value of the justifications they provided.

METHODS This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Four medical educators then categorized the MCQs by medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).

RESULTS GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude; the historical performance of Peruvian examinees is 55%. Among the factors examined, only MCQs requiring Peru-specific knowledge were associated with lower odds of a correct answer (odds ratio, 0.23; 95% confidence interval, 0.09-0.61); the remaining factors showed no association. In the assessment of the educational value of the justifications provided by GPT-4 and Bing, no significant differences were found in certainty, usefulness, or potential use in the classroom.

CONCLUSION Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. The educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.
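The abstract reports the association between Peru-specific knowledge and correct answers as an odds ratio with a 95% confidence interval but does not specify how the estimate was obtained. As a minimal illustration only, the sketch below computes an odds ratio and a Wald confidence interval from a 2x2 table; the counts are invented for illustration and are not the study's data, and the actual analysis may have used a regression model instead.

```python
import math

# Hypothetical 2x2 table (illustrative counts, not from the study):
# rows: MCQ requires Peru-specific knowledge (yes / no)
# cols: chatbot answered correctly (yes / no)
a, b = 10, 8     # Peru-specific: correct, incorrect
c, d = 120, 22   # not Peru-specific: correct, incorrect

# Odds ratio: odds of a correct answer on Peru-specific items
# divided by the odds on the remaining items.
or_hat = (a / b) / (c / d)

# Wald 95% confidence interval on the log-odds-ratio scale.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
log_or = math.log(or_hat)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {or_hat:.2f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```

An odds ratio below 1, with a confidence interval that excludes 1 (as in the reported 0.23, 0.09-0.61), indicates that the chatbots were significantly less likely to answer Peru-specific MCQs correctly.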
