Alfa: active learning for graph neural network-based semantic schema alignment
The VLDB Journal (IF 4.2), Pub Date: 2023-11-21, DOI: 10.1007/s00778-023-00822-z
Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, Berthold Reinwald

Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effective alignment of semantically rich schemas, often captured as ontologies. Most of these approaches are supervised and require large amounts of labeled training data, which is expensive in terms of cost and manual labor. Active learning (AL) techniques can alleviate this issue by intelligently choosing the data to be labeled utilizing a human-in-the-loop approach, while minimizing the amount of labeled training data required. However, existing active learning techniques are limited in their ability to utilize the rich semantic information from underlying schemas. Therefore, they cannot drive the effective and efficient sample selection for human labeling that is necessary to scale to larger datasets. In this paper, we propose Alfa, an active learning framework to overcome these limitations. Alfa exploits the schema element properties as well as the relationships between schema elements (structure) to drive a novel ontology-aware sample selection and label propagation algorithm for training highly accurate alignment models. We propose semantic blocking to scale to larger datasets without compromising model quality. Our experimental results across three real-world datasets show that (1) Alfa leads to a substantial reduction (27–82%) in the cost of human labeling, (2) semantic blocking reduces label skew by up to 40× without adversely affecting model quality and scales AL to large datasets, and (3) sample selection achieves comparable schema matching quality (90% F1-score) to models trained on the entire set of available training data.
We also show that Alfa outperforms the state-of-the-art ontology alignment system, BERTMap, in terms of (1) 10× shorter time per AL iteration and (2) requiring half of the AL iterations to achieve the highest convergent F1-score.
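To make the human-in-the-loop idea concrete, the sketch below shows a generic uncertainty-based sample selection step of the kind an AL framework performs each iteration. This is not the paper's ontology-aware algorithm; `margin_uncertainty`, `select_for_labeling`, and the toy probability table are hypothetical names used purely for illustration, assuming a classifier that exposes per-sample class probabilities.

```python
def margin_uncertainty(probs):
    # Margin-based uncertainty: the smaller the gap between the top two
    # class probabilities, the less confident the model is (closer to 1.0).
    a, b = sorted(probs, reverse=True)[:2]
    return 1.0 - (a - b)

def select_for_labeling(pool, predict_proba, budget):
    # Rank the unlabeled pool by uncertainty and return the `budget` most
    # ambiguous samples; these would be sent to the human annotator.
    ranked = sorted(pool, key=lambda s: margin_uncertainty(predict_proba(s)),
                    reverse=True)
    return ranked[:budget]

# Toy stand-in model: fixed match/non-match probabilities per candidate pair.
probs = {"a": [0.9, 0.1], "b": [0.55, 0.45], "c": [0.5, 0.5]}
picked = select_for_labeling(["a", "b", "c"], lambda s: probs[s], budget=2)
print(picked)  # the two most ambiguous candidates
```

An ontology-aware selector like the one the paper proposes would additionally score candidates using schema element properties and structure, and propagate the obtained labels to related elements, rather than ranking by model uncertainty alone.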




Updated: 2023-11-21