Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients
Vascular ( IF 1.1 ) Pub Date : 2024-03-19 , DOI: 10.1177/17085381241240550
Ethan Chervonski 1 , Keerthi B. Harish 1 , Caron B. Rockman 2 , Mikel Sadek 2 , Katherine A. Teter 2 , Glenn R. Jacobowitz 2 , Todd L. Berland 2 , Joann Lohr 3 , Colleen Moore 4 , Thomas S. Maldonado 2
Affiliation  

Objectives: Generative artificial intelligence (AI) has emerged as a promising tool for engaging with patients. The objective of this study was to assess the quality of AI responses to common patient questions regarding vascular surgery disease processes.

Methods: OpenAI's ChatGPT-3.5 and Google Bard were queried with 24 mock patient questions spanning seven vascular surgery disease domains. Six experienced vascular surgery faculty at a tertiary academic center independently graded AI responses on accuracy (rated 1–4, from completely inaccurate to completely accurate), completeness (rated 1–4, from totally incomplete to totally complete), and appropriateness (binary). Responses were also evaluated with three readability scales.

Results: ChatGPT responses were rated, on average, more accurate than Bard responses (3.08 ± 0.33 vs 2.82 ± 0.40, p < .01) and more complete (2.98 ± 0.34 vs 2.62 ± 0.36, p < .01). Most ChatGPT responses (75.0%, n = 18) and almost half of Bard responses (45.8%, n = 11) were unanimously deemed appropriate. Almost one-third of Bard responses (29.2%, n = 7) were deemed inappropriate by at least two reviewers, and two Bard responses (8.4%) were considered inappropriate by the majority. The mean Flesch Reading Ease, Flesch–Kincaid Grade Level, and Gunning Fog Index of ChatGPT responses were 29.4 ± 10.8, 14.5 ± 2.2, and 17.7 ± 3.1, respectively, indicating that responses required a post-secondary reading level. Bard's mean readability scores were 58.9 ± 10.5, 8.2 ± 1.7, and 11.0 ± 2.0, respectively, indicating that responses required only a high-school reading level (p < .0001 for all three metrics). ChatGPT's mean response length (332 ± 79 words) was greater than Bard's (183 ± 53 words, p < .001). Neither model's accuracy, completeness, readability, or response length differed across disease domains (p > .05 for all analyses).

Conclusions: AI offers a novel means of educating patients that avoids both the inundation of information from "Dr Google" and the time constraints of physician–patient encounters. ChatGPT provides largely valid, though imperfect, responses to myriad patient questions, at the expense of readability. While Bard responses are more readable and concise, their quality is poorer. Further research is warranted to better understand failure points for large language models in vascular surgery patient education.
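The three readability metrics reported above follow standard published formulas. As a minimal sketch (the abstract does not specify the tool used to compute them, and robust counting of syllables and complex words requires a tokenizer, so the counts are taken here as inputs for a hypothetical passage):

```python
# Standard published formulas for the three readability metrics named in the
# study. This is an illustrative sketch, not the study's actual pipeline.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher = easier to read; scores below 30 suggest a college-graduate level."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate U.S. school grade level needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Grade level; 'complex words' are those with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# Hypothetical passage: 100 words, 5 sentences, 150 syllables, 10 complex words.
print(flesch_reading_ease(100, 5, 150))   # ≈ 59.6 (plain English)
print(flesch_kincaid_grade(100, 5, 150))  # ≈ 9.9 (high school)
print(gunning_fog(100, 5, 10))            # ≈ 12.0 (high-school senior)
```

On these scales, ChatGPT's mean Flesch Reading Ease of 29.4 falls in the college-graduate band, while Bard's 58.9 falls near plain English, consistent with the grade-level estimates reported above.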

Updated: 2024-03-19