A BERT-based sequential deep neural architecture to identify contribution statements and extract phrases for triplets from scientific publications
International Journal on Digital Libraries, Pub Date: 2024-01-23, DOI: 10.1007/s00799-023-00393-y
Komal Gupta, Ammaar Ahmad, Tirthankar Ghosal, Asif Ekbal

Research in Natural Language Processing (NLP) is growing rapidly, and a large number of research papers are being published as a result. Identifying the contributions of a research paper in a specific domain within this huge amount of unstructured data is challenging, so there is a need to structure the relevant contributions in a Knowledge Graph (KG). In this paper, we describe our work on four tasks toward building a Scientific Knowledge Graph (SKG). We propose a pipelined system that performs contribution sentence identification, phrase extraction from contribution sentences, Information Unit (IU) classification, and organization of the extracted phrases into (subject, predicate, object) triplets from NLP scholarly publications. We develop a multitask system (ContriSci) for contribution sentence identification with two supporting tasks, viz. Section Identification and Citance Classification. We use a Bidirectional Encoder Representations from Transformers (BERT)-Conditional Random Field (CRF) model for phrase extraction and train it with two additional datasets: SciERC and SciClaim. To classify the contribution sentences into IUs, we use a BERT-based model. For triplet extraction, we group the triplets into five categories and classify them with a BERT-based classifier. In the non-end-to-end setting, our approach yields F1 scores of 64.21%, 77.47%, 84.52%, and 62.71% for contribution sentence identification, phrase extraction, IU classification, and triplet extraction, respectively. On the NLPContributionGraph (NCG) dataset, this corresponds to relative improvements of 8.08, 2.46, and 2.31 F1 points for contribution sentence identification, IU classification, and triplet extraction. Our system achieves the best performance (57.54% F1 score) in the end-to-end pipeline combining all four sub-tasks. Our code is available at: https://github.com/92Komal/pipeline_triplet_extraction.
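The first pipeline stage described in the abstract is sentence-level classification with a BERT encoder. Below is a minimal sketch of that stage using the Hugging Face transformers library; the checkpoint name, label mapping, and example sentences are illustrative assumptions rather than the authors' released configuration (their code is at the GitHub link above), and in practice the classifier would first be fine-tuned on the NCG training data.

# Minimal sketch (an assumption, not the authors' released code) of the first
# pipeline stage: binary classification of sentences as contribution vs.
# non-contribution with a BERT encoder. Fine-tuning on NCG data is required
# before the predictions below become meaningful.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = other, 1 = contribution (assumed labels)
)
model.eval()

sentences = [
    "We propose a multitask architecture for contribution sentence identification.",
    "Previous work has studied citation networks in digital libraries.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    preds = model(**inputs).logits.argmax(dim=-1)

for sentence, label in zip(sentences, preds.tolist()):
    print(f"{('CONTRIBUTION' if label == 1 else 'OTHER'):12s} | {sentence}")

The downstream stages follow the same pattern with different heads: the BERT-CRF phrase extractor tags tokens inside contribution sentences, and the IU and triplet classifiers reuse a BERT sequence classifier with larger label sets, as the abstract describes.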


