Generation of training data for named entity recognition of artworks, Semantic Web

Semantic Web (IF 3) Pub Date: 2022-08-08, DOI: 10.3233/sw-223177
Nitisha Jain, Alejandro Sierra-Múnera, Jan Ehmueller, Ralf Krestel

Abstract

As machine learning techniques are increasingly employed for text processing tasks, the need for training data has become a major bottleneck for their application. Manually creating large-scale training datasets tailored to each task is time-consuming and expensive, which makes automated generation necessary. In this work, we turn our attention to the creation of training datasets for named entity recognition (NER) in the cultural heritage domain. NER plays an important role in many natural language processing systems. Most NER systems are limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources such as digitized art archives, the recognition of fine-grained entity types such as titles of artworks is of high importance. Current state-of-the-art tools are unable to adequately identify artwork titles because relevant training datasets are unavailable. We analyse the particular difficulties presented by this domain and motivate the need for quality annotations to train machine learning models for the identification of artwork titles. We present a framework with a heuristic-based approach to create high-quality training data by leveraging existing cultural heritage resources from knowledge bases such as Wikidata. Experimental evaluation shows a significant improvement over the baseline in NER performance for artwork titles when models are trained on the dataset generated using our framework.
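The abstract does not spell out the heuristics, so the following is only a minimal sketch of one generic way a list of known artwork titles could be turned into NER annotations: exact longest-match lookup that emits BIO tags. It is not the authors' framework. The function name `tag_titles`, the `ARTWORK` label, the whitespace tokenization, and the tiny hand-written gazetteer are all assumptions made for illustration; in practice the titles might be drawn from a knowledge base such as Wikidata.

```python
from typing import List


def tag_titles(tokens: List[str], titles: List[List[str]],
               label: str = "ARTWORK") -> List[str]:
    """Assign BIO tags by exact, case-sensitive lookup of known titles.

    Illustrative only: a real pipeline would need to handle ambiguity,
    spelling variants, and partial mentions of a title.
    """
    tags = ["O"] * len(tokens)
    # Try longer titles first so a long title wins over a shorter one it contains.
    ordered = sorted(titles, key=len, reverse=True)
    i = 0
    while i < len(tokens):
        for title in ordered:
            n = len(title)
            if n > 0 and tokens[i:i + n] == title:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n  # skip past the matched title
                break
        else:
            i += 1  # no title starts at this token
    return tags


# Hypothetical gazetteer of tokenized titles (assumption, not from the paper).
gazetteer = [["The", "Starry", "Night"], ["Guernica"]]
sentence = "Van Gogh painted The Starry Night in 1889 .".split()
tags = tag_titles(sentence, gazetteer)

for token, tag in zip(sentence, tags):
    print(f"{token}\t{tag}")
```

Exact matching of this kind is easy to apply at scale but is noisy for artwork titles, which often coincide with ordinary phrases; that difficulty is part of what motivates the need for quality annotations described above.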

Updated: 2022-08-10
