Cardinality estimation using normalizing flow, The VLDB Journal


Cardinality estimation using normalizing flow
The VLDB Journal (IF 4.2), Pub Date: 2023-08-29, DOI: 10.1007/s00778-023-00808-x
Jiayi Wang, Chengliang Chai, Jiabin Liu, Guoliang Li

Cardinality estimation is one of the most important problems in query optimization. Recently, machine learning-based techniques have been proposed to effectively estimate cardinality, which can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model from a query to its cardinality, while data-driven approaches learn a distribution of tuples, select some samples that satisfy a SQL query, and use the data distributions of these selected tuples to estimate the cardinality of the SQL query. As query-driven methods rely on training queries, the estimation quality is not reliable when there are no high-quality training queries, while data-driven methods have no such limitation and have high adaptivity. In this work, we focus on data-driven methods. A good data-driven model should achieve three optimization goals. First, the model needs to capture data dependencies between columns and support large domain sizes (achieving high accuracy). Second, the model should achieve high inference efficiency, because many data samples are needed to estimate the cardinality (achieving low inference latency). Third, the model should not be too large (achieving a small model size). However, existing data-driven methods cannot simultaneously optimize the three goals. To address the limitations, we propose a novel cardinality estimator FACE, which leverages the normalizing flow-based model to learn a continuous joint distribution for relational data. FACE can transform a complex distribution over continuous random variables into a simple distribution (e.g., multivariate normal distribution) and use the probability density to estimate the cardinality for both sequential queries and parallel queries. First, we design a dequantization method to make data more "continuous." Second, we propose encoding and indexing techniques to handle LIKE predicates for string data. Third, we propose a Monte Carlo method to estimate the cardinality based on the FACE model. Fourth, we propose a grouping technique to process parallel queries. Fifth, we discuss how to support join queries. Experimental results show that our method significantly outperforms existing approaches in terms of estimation accuracy while keeping similar latency and model size.
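
To make the pipeline in the abstract concrete, here is a minimal Python sketch of three of its ingredients: dequantizing integer columns, evaluating a density through the change-of-variables formula, and Monte Carlo integration of that density over a range predicate. It is not the FACE implementation. The class `AffineFlow` is a hypothetical stand-in for a trained normalizing flow (a single affine layer, which is equivalent to fitting a multivariate Gaussian), and the function names, the synthetic two-column table, and the sample sizes are all assumptions made for illustration. NumPy is assumed to be available.

```python
import numpy as np


def dequantize(codes, rng):
    """Uniform dequantization: integer code v becomes a real value in [v, v + 1)."""
    return codes + rng.uniform(0.0, 1.0, size=codes.shape)


class AffineFlow:
    """Hypothetical stand-in for a trained flow: x = mean + L z with z ~ N(0, I)."""

    def __init__(self, data):
        self.mean = data.mean(axis=0)
        self.chol = np.linalg.cholesky(np.cov(data, rowvar=False))
        # log|det(dz/dx)| for the inverse map z = L^{-1} (x - mean)
        self.log_det = -np.sum(np.log(np.diag(self.chol)))

    def log_prob(self, x):
        """Change of variables: log p(x) = log N(z; 0, I) + log|det(dz/dx)|."""
        z = np.linalg.solve(self.chol, (x - self.mean).T).T
        dim = x.shape[1]
        log_base = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * dim * np.log(2.0 * np.pi)
        return log_base + self.log_det


def estimate_cardinality(flow, lows, highs, n_rows, n_samples, rng):
    """Monte Carlo estimate of the row count for lows <= row <= highs (integer columns)."""
    lows = np.asarray(lows, dtype=float)
    # After dequantization, the inclusive integer range [lo, hi] occupies [lo, hi + 1).
    highs = np.asarray(highs, dtype=float) + 1.0
    points = rng.uniform(lows, highs, size=(n_samples, lows.size))
    volume = np.prod(highs - lows)
    # Probability mass of the box ~= volume * mean density at uniform sample points.
    mass = volume * np.mean(np.exp(flow.log_prob(points)))
    return n_rows * mass


# Synthetic table with two correlated integer columns (illustrative data only).
rng = np.random.default_rng(0)
a = rng.normal(50.0, 10.0, size=20_000)
b = a + rng.normal(0.0, 5.0, size=a.size)
table = np.floor(np.column_stack([a, b])).astype(np.int64)

flow = AffineFlow(dequantize(table, rng))
estimate = estimate_cardinality(flow, [40, 40], [60, 60], len(table), 20_000, rng)
truth = int(np.sum(np.all((table >= 40) & (table <= 60), axis=1)))
print(f"estimated {estimate:.0f} rows, true count {truth}")
```

The query here corresponds to `WHERE a BETWEEN 40 AND 60 AND b BETWEEN 40 AND 60`. Uniform sampling inside the query box is the simplest Monte Carlo scheme and is used only to show the idea; a real flow would stack many learned invertible layers, and the estimator described in the paper may sample differently.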


Updated: 2023-08-29
