Reservoir Sampling over Joins,arXiv - CS - Databases

Reservoir Sampling over Joins
arXiv - CS - Databases Pub Date : 2024-04-04 , DOI: arxiv-2404.03194
Binyang Dai, Xiao Hu, Ke Yi

Sampling over joins is a fundamental task in large-scale data analytics. Instead of computing the full join results, which could be massive, a uniform sample of the join results would suffice for many purposes, such as answering analytical queries or training machine learning models. In this paper, we study the problem of how to maintain a random sample over joins while the tuples are streaming in. Without the join, this problem can be solved by some simple and classical reservoir sampling algorithms. However, the join operator makes the problem significantly harder, as the join size can be polynomially larger than the input. We present a new algorithm for this problem that achieves a near-linear complexity. The key technical components are a generalized reservoir sampling algorithm that supports a predicate, and a dynamic index for sampling over joins. We also conduct extensive experiments on both graph and relational data over various join queries, and the experimental results demonstrate significant performance improvement over the state of the art.
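For reference, the join-free case mentioned in the abstract can be sketched in a few lines of Python. This is the classical reservoir sampling scheme (often called Algorithm R), not the algorithm proposed in the paper. The function name `reservoir_sample` and its `predicate` and `rng` parameters are illustrative assumptions. The optional predicate is handled here in the most naive way, by inspecting and filtering every stream item, which is only meant to show what "sampling with a predicate" means.

```python
import random
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T],
    k: int,
    predicate: Optional[Callable[[T], bool]] = None,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Maintain a uniform sample of size k (without replacement) over a stream.

    Hypothetical helper for illustration: classical reservoir sampling,
    with an optional predicate applied as a plain per-item filter.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    seen = 0  # number of items that passed the predicate so far

    for item in stream:
        if predicate is not None and not predicate(item):
            continue  # naive filtering: every item is still inspected
        seen += 1
        if len(reservoir) < k:
            # The first k qualifying items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / seen by drawing a
            # slot uniformly from [0, seen) and replacing only if it
            # falls inside the reservoir.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=5)
    print(sample)
```

Applying this directly to a join would require enumerating every join result as a stream item, and the abstract notes that the join size can be polynomially larger than the input. That cost is what motivates the near-linear algorithm the paper describes.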


Updated: 2024-04-05
