Evaluating ChatGPT responses on thyroid nodules for patient education.
Thyroid (IF 6.6), Pub Date: 2023-11-27, DOI: 10.1089/thy.2023.0491
Daniel J Campbell, Leonard E Estephan, Elliott M Sina, Eric V Mastrolonardo, Rahul Alapati, Dev R Amin, Elizabeth E Cottrill

BACKGROUND: ChatGPT, an artificial intelligence (AI) chatbot, is the fastest-growing consumer application in history. Given recent trends showing increasing patient use of Internet sources for self-education, we sought to evaluate the quality of ChatGPT-generated responses for patient education on thyroid nodules.

METHODS: ChatGPT was queried 4 times with 30 identical questions. Queries differed by initial chatbot prompting: no prompting, patient-friendly prompting, 8th-grade-level prompting, and prompting for references. Answers were scored on a hierarchical scale: incorrect, partially correct, correct, or correct with references. Proportions of responses at incremental score thresholds were compared by prompt type using chi-squared analysis. The Flesch-Kincaid grade level was calculated for each answer, and the relationship between prompt type and grade level was assessed using analysis of variance. References provided within ChatGPT answers were totaled and analyzed for veracity.

RESULTS: Across all prompts (n=120 questions), 83 answers (69.2%) were at least correct. Proportions of responses that were at least partially correct (p=0.795) or at least correct (p=0.402) did not differ by prompt type; proportions that were correct with references did (p<0.0001). Responses to 8th-grade-level prompting had the lowest mean grade level (13.43 ± 2.86), significantly lower than no prompting (14.97 ± 2.01, p=0.01) and prompting for references (16.43 ± 2.05, p<0.0001). Prompting for references generated cited publications within all 80/80 (100%) answers. Seventy references (87.5%) were legitimate citations, and 58/80 (72.5%) accurately reported information from the referenced publications.

CONCLUSION: Overall, ChatGPT provides appropriate answers to most questions on thyroid nodules regardless of prompting. Despite targeted prompting strategies, however, it reliably generates responses at grade levels well above accepted recommendations for presenting medical information to patients. Significant rates of AI hallucination may preclude clinicians from recommending the current version of ChatGPT as a patient-education tool at this time.
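The readability metric used in the study, the Flesch-Kincaid grade level, can be computed directly from word, sentence, and syllable counts. A minimal sketch is below; note that the syllable counter is a crude vowel-group heuristic for illustration, not the dictionary-based counting that readability tools typically use, so its scores will differ somewhat from the paper's.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels, dropping a trailing silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1  # treat final 'e' as silent (e.g. "nodule" -> 2)
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A grade level of 13.43, the study's lowest mean, corresponds to college-freshman reading difficulty, which is why the authors flag it as well above patient-education recommendations (commonly 6th- to 8th-grade).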
