Efficient computation of comprehensive statistical information of large OWL datasets: a scalable approach



Enterprise Information Systems (IF 4.4), Pub Date: 2022-04-24, DOI: 10.1080/17517575.2022.2062683
Heba Mohamed ₁,₂, Said Fathalla ₁,₂, Jens Lehmann ₁,₃, Hajira Jabeen ₄


ABSTRACT

Computing dataset statistics is crucial for exploring the structure of datasets; however, it becomes challenging for large-scale datasets. Such statistics have several key benefits, such as link target identification, vocabulary reuse, quality analysis, big data analytics, and coverage analysis. In this paper, we present the first attempt at developing a distributed approach (OWLStats) for collecting comprehensive statistics over large-scale OWL datasets. OWLStats is a distributed in-memory approach that uses Apache Spark to compute 50 statistical criteria for OWL datasets. We have successfully integrated OWLStats into the SANSA framework. Experimental results show that OWLStats scales linearly with both the number of nodes and the size of the data.


Updated: 2022-04-24
