Semi-automated ontology development scheme via text mining of scientific records
The Electronic Library (IF 1.675), Pub Date: 2024-02-06, DOI: 10.1108/el-06-2023-0165
Somayeh Tamjid, Fatemeh Nooshinfard, Molouk Sadat Hosseini Beheshti, Nadjla Hariri, Fahimeh Babalhavaeji

Purpose

The purpose of this study is to develop a domain-independent, cost-effective, time-saving, semi-automated ontology generation framework that can extract taxonomic concepts from an unstructured text corpus. In the human disease domain, ontologies have proved extremely useful for managing the diversity of technical expressions in support of information retrieval. The boundaries of such domains are expanding so quickly that new ontologies must be developed, or existing ones upgraded, continuously.

Design/methodology/approach

This paper proposes a semi-automated approach that extracts entities and relations by text mining scientific publications. A program named TmbOnt (text mining-based ontology) was written to assist a user in capturing, processing and establishing ontology elements. It takes a collection of unstructured text files as input and turns them into high-value entities or relations as output. Because the approach is semi-automated, a user supervises the process, filters the meaningful predecessor/successor phrases and finalizes the required ontology-taxonomy. To verify the practical capabilities of the scheme, a case study was performed to derive a glaucoma ontology-taxonomy. For this purpose, text files containing 10,000 records were collected from PubMed.
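The abstract does not describe how TmbOnt works internally, so the following is only a minimal sketch of the general idea of collecting predecessor/successor phrases around a seed term for a user to review. The function names (`tokenize`, `neighbour_phrases`), the simple regular-expression tokenizer and the three sample sentences are all assumptions made for illustration; they are not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return alphabetic tokens (inner hyphens kept)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def neighbour_phrases(texts, seed, top_n=5):
    """Count the word directly before and directly after each `seed` occurrence.

    Hypothetical helper: returns two ranked lists of (phrase, count) pairs
    that a human reviewer could filter into taxonomy candidates.
    """
    predecessors, successors = Counter(), Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok != seed:
                continue
            if i > 0:
                predecessors[f"{tokens[i - 1]} {seed}"] += 1
            if i + 1 < len(tokens):
                successors[f"{seed} {tokens[i + 1]}"] += 1
    return predecessors.most_common(top_n), successors.most_common(top_n)


# Made-up sample records, standing in for PubMed abstracts.
records = [
    "Primary open-angle glaucoma is the most common form of glaucoma.",
    "Patients with open-angle glaucoma surgery outcomes were reviewed.",
    "Angle-closure glaucoma progression differs from open-angle glaucoma progression.",
]

pred, succ = neighbour_phrases(records, "glaucoma")
print("predecessor phrases:", pred)
print("successor phrases:", succ)
```

In a semi-automated workflow of the kind the abstract describes, ranked lists like these would be shown to the user, who keeps meaningful candidates (for example "open-angle glaucoma") and discards noise (for example "of glaucoma").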

Findings

The proposed approach processed more than 3.8 million tokenized terms from those records and yielded a glaucoma ontology-taxonomy. In a comparison with two well-known disease ontologies, the TmbOnt-derived taxonomy showed a coverage ratio of 60% to 100% against established medical thesauri and ontology taxonomies, such as the Human Disease Ontology, Medical Subject Headings and the National Cancer Institute Thesaurus, and on average it recommended 70% additional terms for ontology development.
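The abstract does not define how the coverage ratio or the share of additional terms was calculated. As a rough sketch under stated assumptions, coverage is taken here as the fraction of reference terms that also appear in the derived taxonomy, and the additional share as the fraction of derived terms absent from the reference. The function name `coverage_report` and the toy term lists are hypothetical and are not results from the paper.

```python
def coverage_report(derived_terms, reference_terms):
    """Compare a derived term list with a reference vocabulary.

    Assumed definitions (not from the paper):
      coverage         = |derived AND reference| / |reference|
      additional_share = |derived minus reference| / |derived|
    """
    derived = {t.strip().lower() for t in derived_terms}
    reference = {t.strip().lower() for t in reference_terms}
    shared = derived & reference
    additional = derived - reference
    return {
        "coverage": len(shared) / len(reference) if reference else 0.0,
        "additional_share": len(additional) / len(derived) if derived else 0.0,
        "additional_terms": sorted(additional),
    }


# Illustrative term lists only.
derived = [
    "Open-angle glaucoma",
    "Angle-closure glaucoma",
    "Neovascular glaucoma",
    "Pigmentary glaucoma",
]
reference = [
    "open-angle glaucoma",
    "angle-closure glaucoma",
    "congenital glaucoma",
]

report = coverage_report(derived, reference)
print(report)
```

Exact string matching after lowercasing is the simplest possible choice; a real comparison against a thesaurus would probably also need synonym and spelling-variant handling.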

Originality/value

According to the literature, the proposed scheme demonstrates a novel capability: expanding an ontology-taxonomy structure through semi-automated text mining, as a step towards future fully automated approaches.




Updated: 2024-02-06