Assessing The Potential Of Mid-Sized Language Models For Clinical QA,arXiv - CS - Computation and Language

arXiv - CS - Computation and Language Pub Date : 2024-04-24 , DOI: arxiv-2404.15894
Elliot Bolton, Betty Xiong, Vijaytha Muralidharan, Joel Schamroth, Vivek Muralidharan, Christopher D. Manning, Roxana Daneshjou

Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-sized models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model they should use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best-performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches that of the original Med-PaLM, and it can often produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open-source mid-sized models on clinical tasks.

Updated: 2024-04-25
