Gen-T: Table Reclamation in Data Lakes
arXiv - CS - Databases, Pub Date: 2024-03-21, DOI: arxiv-2403.14128
Grace Fan, Roee Shraga, Renée J. Miller

We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the Source Table as closely as possible. Unlike query discovery problems such as Query-by-Example or Query-by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, which measures how close a reclaimed table is to a Source Table and is grounded in the instance similarity used in data exchange. Our search covers not only SELECT-PROJECT-JOIN queries, but also integration queries with unions, outer joins, and the unary operators subsumption and complementation, which have been shown to be important in data integration and fusion. Using reclamation, a data scientist can determine whether any tables in a repository can be used to exactly reclaim a tuple in the Source Table. If not, one can determine whether this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, and then integrates these tables to reclaim the Source Table as closely as possible. We show that our solution, while approximate, is accurate, efficient, and scalable in the size of the table repository, with experiments on real data lakes containing up to 15K tables, where the average number of tuples per table ranges from small (web tables) to extremely large (open data tables, with up to 1M tuples).
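
The integration step described in the abstract can be illustrated with a small Python sketch. This is not Gen-T's implementation: it is a toy version, assuming the usual data-fusion definitions of subsumption (drop a tuple when another tuple agrees with it and has strictly more non-null values) and complementation (merge two non-conflicting tuples that share a value and fill each other's gaps). The names `outer_union`, `integrate`, and `diagnose`, the example tables, and the choice of the first column as a key are all hypothetical, and `diagnose` is a simplified stand-in, not the error-aware instance similarity measure.

```python
from itertools import combinations

NULL = None  # marker for a missing value

def outer_union(tables, columns):
    """Pad every row (a dict) to the full column list, filling gaps with NULL."""
    return [tuple(row.get(c, NULL) for c in columns) for t in tables for row in t]

def subsumes(a, b):
    """a subsumes b: a agrees with b wherever b is non-null, and a has more non-null values."""
    agrees = all(y is NULL or x == y for x, y in zip(a, b))
    return agrees and sum(x is not NULL for x in a) > sum(y is not NULL for y in b)

def complement(a, b):
    """Merge a and b if they share a value, never conflict, and each fills a gap in the other."""
    pairs = list(zip(a, b))
    shared = any(x is not NULL and x == y for x, y in pairs)
    conflict = any(x is not NULL and y is not NULL and x != y for x, y in pairs)
    a_fills = any(x is not NULL and y is NULL for x, y in pairs)
    b_fills = any(y is not NULL and x is NULL for x, y in pairs)
    if shared and not conflict and a_fills and b_fills:
        return tuple(x if x is not NULL else y for x, y in pairs)
    return None

def integrate(rows):
    """Apply complementation until no pair merges, then remove subsumed tuples."""
    rows = set(rows)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(sorted(rows, key=repr), 2):
            merged = complement(a, b)
            if merged is not None:
                rows -= {a, b}
                rows.add(merged)
                changed = True
                break
    return {r for r in rows if not any(subsumes(o, r) for o in rows)}

def diagnose(source_row, reclaimed, key=0):
    """Classify how well one source tuple was reclaimed (assumes column `key` identifies tuples)."""
    candidates = [r for r in reclaimed if r[key] == source_row[key]]
    if not candidates:
        return "missing"
    if source_row in candidates:
        return "exact"
    best = max(candidates, key=lambda r: sum(x == y for x, y in zip(r, source_row)))
    if any(x is not NULL and x != y for x, y in zip(best, source_row)):
        return "conflicting values"
    return "incomplete"

# Hypothetical toy data: two repository tables and a Source Table.
columns = ["id", "name", "city"]
t1 = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
t2 = [{"id": 1, "city": "Paris"}, {"id": 2}]
source = [(1, "Ada", "Paris"), (2, "Bob", "Rome")]

reclaimed = integrate(outer_union([t1, t2], columns))
report = {row: diagnose(row, reclaimed) for row in source}
print(reclaimed)
print(report)
```

In this toy run, the tuple with `id` 1 is reclaimed exactly, while the tuple with `id` 2 is reported as incomplete because no repository table supplies its `city` value. This mirrors the distinction the abstract draws between differences in values and incompleteness in the data.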

Updated: 2024-03-22