Use of Natural Language Processing to Identify Sexual and Reproductive Health Information in Clinical Text, Methods of Information in Medicine



Methods of Information in Medicine (IF 1.7), Pub Date: 2024-02-20, DOI: 10.1055/a-2233-2736
Elizabeth Harrison₁, Laura Kirkpatrick₁, Patrick Harrison₂, Traci Kazmerski₁, Yoshimi Sogawa₁, Harry Hochheiser₃


Objectives This study aimed to enable clinical researchers without expertise in natural language processing (NLP) to extract and analyze information about sexual and reproductive health (SRH), or other sensitive health topics, from large sets of clinical notes.

Methods (1) We retrieved text from the electronic health record as individual notes. (2) We segmented notes into sentences using one of scispaCy's NLP toolkits. (3) We exported sentences to the labeling application Watchful and annotated subsets of these as relevant or irrelevant to various SRH categories by applying a combination of regular expressions and manual annotation. (4) The labeled sentences served as training data to create machine learning models for classifying text; specifically, we used spaCy's default text classification ensemble, comprising a bag-of-words model and a neural network with attention. (5) We applied each model to unlabeled sentences to identify additional references to SRH with novel relevant vocabulary. We used this information and repeated steps 3 to 5 iteratively until the models identified no new relevant sentences for each topic. Finally, we aggregated the labeled data for analysis.
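As a rough illustration of steps 2 and 3, the sketch below splits a note into sentences and pre-labels each sentence against the six SRH categories using regular expressions. It is a minimal sketch under stated assumptions, not the study's pipeline: the naive splitter stands in for the scispaCy segmentation step, the Watchful labeling application is not modeled, and every pattern in `SEED_PATTERNS` is a hypothetical example rather than an expression used by the authors.

```python
import re

# Hypothetical seed patterns, one per SRH category named in the abstract.
# These are illustrative assumptions, not the expressions used in the study.
SEED_PATTERNS = {
    "menstruation": re.compile(r"\b(menstrua\w*|menses|menarche)\b", re.I),
    "sexual activity": re.compile(r"\bsexual(ly)? activ\w*\b", re.I),
    "contraception": re.compile(r"\b(contracept\w*|birth control)\b", re.I),
    "folic acid": re.compile(r"\b(folic acid|folate)\b", re.I),
    "teratogens": re.compile(r"\bteratogen\w*\b", re.I),
    "pregnancy": re.compile(r"\bpregnan\w*\b", re.I),
}


def split_sentences(note):
    """Naive punctuation-based splitter, standing in for scispaCy segmentation."""
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [p for p in parts if p]


def prelabel(sentences):
    """Map each sentence to the set of categories whose seed pattern matches it."""
    return {
        s: {cat for cat, pat in SEED_PATTERNS.items() if pat.search(s)}
        for s in sentences
    }


if __name__ == "__main__":
    note = (
        "Patient reports regular menses. She takes folic acid daily. "
        "No seizures since last visit."
    )
    for sentence, categories in prelabel(split_sentences(note)).items():
        print(sorted(categories), "|", sentence)
```

Sentences that match no pattern are not necessarily irrelevant; in the workflow described above they would be left for manual annotation or for the trained classifier to surface in step 5.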

Results This methodology was applied to 3,663 Child Neurology notes for 971 female patients. Our search focused on six SRH categories. We validated the approach using two subject matter experts, who independently labeled a sample of 400 sentences. Cohen's kappa values were calculated for each category between the reviewers (menstruation: 1, sexual activity: 0.9499, contraception: 0.9887, folic acid: 1, teratogens: 0.8864, pregnancy: 0.9499). After removing the sentences on which reviewers did not agree, we compared the reviewers' labels to those produced via our methodology, again using Cohen's kappa (menstruation: 1, sexual activity: 1, contraception: 0.9885, folic acid: 1, teratogens: 0.9841, pregnancy: 0.9871).
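For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two label sequences, together with the second comparison (dropping sentences on which the reviewers disagree, then comparing their shared label with the method's label). The function names and the toy labels are assumptions made for illustration; the abstract does not state which software the authors used to compute kappa.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention assumed by this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def kappa_vs_consensus(rater_a, rater_b, method):
    """Keep only items the two raters agree on, then compare that label to the method's."""
    kept = [(a, m) for a, b, m in zip(rater_a, rater_b, method) if a == b]
    if not kept:
        raise ValueError("raters agree on no items")
    consensus, predicted = zip(*kept)
    return cohens_kappa(consensus, predicted)


if __name__ == "__main__":
    # Toy binary labels (1 = relevant to a category, 0 = irrelevant).
    reviewer_1 = [1, 1, 0, 0, 1]
    reviewer_2 = [1, 1, 0, 0, 0]
    pipeline = [1, 1, 0, 0, 1]
    print(cohens_kappa(reviewer_1, reviewer_2))
    print(kappa_vs_consensus(reviewer_1, reviewer_2, pipeline))
```

Kappa is computed separately for each category, which is why the results list one value per SRH topic.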

Conclusion Our methodology is reproducible, enables analysis of large amounts of text, and has produced results that are highly comparable to subject matter expert manual review.


Updated: 2024-02-21
