Cerebrovascular disease case identification in inpatient electronic medical record data using natural language processing
Brain Informatics Pub Date : 2023-09-02 , DOI: 10.1186/s40708-023-00203-w
Jie Pan 1, 2 , Zilong Zhang 1 , Steven Ray Peters 3 , Shabnam Vatanpour 1 , Robin L Walker 1, 4 , Seungwon Lee 1, 2, 4 , Elliot A Martin 1, 4 , Hude Quan 1, 2

Abstracting cerebrovascular disease (CeVD) from inpatient electronic medical records (EMRs) through natural language processing (NLP) is pivotal for automated disease surveillance and improving patient outcomes. Existing methods rely on coders’ abstraction, which suffers from time delays and under-coding. This study sought to develop an NLP-based method to detect CeVD using EMR clinical notes. CeVD status was confirmed through chart review of randomly selected hospitalized patients who were 18 years or older and discharged from 3 hospitals in Calgary, Alberta, Canada, between January 1 and June 30, 2015. These patients’ chart data were linked to administrative Discharge Abstract Database (DAD) and Sunrise™ Clinical Manager (SCM) EMR database records by Personal Health Number (a unique lifetime identifier) and admission date. We trained multiple NLP predictive models by combining two clinical concept extraction methods and two supervised machine learning (ML) methods: random forest and XGBoost. Using chart review as the reference standard, we compared model performance with that of the commonly applied International Classification of Diseases (ICD-10-CA) codes on sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Of the study sample (n = 3036), the prevalence of CeVD was 11.8% (n = 360); the median patient age was 63 years; and females accounted for 50.3% (n = 1528) based on chart data. Among 49 clinical documents extracted from the EMR, four document types were identified as the most influential text sources for identifying CeVD (“nursing transfer report,” “discharge summary,” “nursing notes,” and “inpatient consultation”).
The best-performing NLP model was XGBoost, combining the Unified Medical Language System concepts extracted by cTAKES (e.g., the top-ranked concepts “Cerebrovascular accident” and “Transient ischemic attack”) with the term frequency-inverse document frequency vectorizer. Compared with ICD codes, the model achieved higher validity overall (values given as ICD codes vs. NLP model): sensitivity (25.0% vs. 70.0%), specificity (99.3% vs. 99.1%), PPV (82.6% vs. 87.8%), and NPV (90.8% vs. 97.1%). The NLP algorithm developed in this study performed better than the ICD code algorithm in detecting CeVD. The NLP models could lead to an automated EMR tool for identifying CeVD cases and could be applied in future work such as surveillance and longitudinal studies.
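
The four validity metrics reported above all derive from a 2×2 comparison between a case-identification method and the chart-review reference standard. The following is a minimal Python sketch of that calculation, not the authors' code: the function name `validity_metrics` and the ten example labels are hypothetical and chosen only to illustrate the arithmetic; they are not study data.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def validity_metrics(reference: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compare binary predictions (1 = CeVD, 0 = no CeVD) with a reference standard.

    `reference` plays the role of chart review; `predicted` is the output of
    any case-identification method (for example ICD codes or an NLP model).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # true negatives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives

    return {
        "sensitivity": _ratio(tp, tp + fn),  # share of true cases that were detected
        "specificity": _ratio(tn, tn + fp),  # share of non-cases correctly ruled out
        "ppv": _ratio(tp, tp + fp),          # share of flagged patients who are true cases
        "npv": _ratio(tn, tn + fn),          # share of unflagged patients who are non-cases
    }


if __name__ == "__main__":
    # Hypothetical toy labels (NOT study data): 4 true cases, 6 non-cases.
    chart_review = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    method_flags = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    for name, value in validity_metrics(chart_review, method_flags).items():
        print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on the true status, while PPV and NPV condition on the method's output, so PPV and NPV also depend on how common CeVD is in the sample (11.8% here).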


