Corra: Correlation-Aware Column Compression
arXiv - CS - Databases, Pub Date: 2024-03-25, DOI: arxiv-2403.17229
Hanwen Liu, Mihail Stoian, Alexander van Renen, Andreas Kipf

Column encoding schemes have witnessed a surge of interest lately. This is not surprising: as data volume increases, being able to keep one's dataset in main memory for fast processing is highly desirable. However, it also seems that single-column encoding schemes have reached a plateau in terms of the compressed size they can achieve. We argue that this is because they do not exploit correlations in the data. Consider for instance the column pair ($\texttt{city}$, $\texttt{zip-code}$) of the DMV dataset: a city has only a few dozen unique zip codes. Such information, if properly exploited, can significantly reduce the space consumption of the latter column. In this work, we depart from the established, well-trodden path of compressing data using only single-column encoding schemes and introduce $\textit{correlation-aware}$ encoding schemes. We demonstrate their advantages compared to single-column encoding schemes on four well-known datasets: TPC-H's $\texttt{lineitem}$, LDBC's $\texttt{message}$, DMV, and Taxi. For example, we obtain a saving rate of 58.3% for $\texttt{lineitem}$'s $\texttt{shipdate}$, while the dropoff timestamps in Taxi witness a saving rate of 30.6%.
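
To make the ($\texttt{city}$, $\texttt{zip-code}$) intuition concrete, here is a minimal Python sketch of one way such a correlation could be exploited: each zip code is stored as a small index into a dictionary that is local to its city, so the code width depends on the number of zip codes per city rather than on the number of zip codes overall. This is an illustration only and is not taken from the paper; the function names (`encode_dependent`, `decode_dependent`, `bits_for`) and the toy rows are assumptions, and the paper's actual encoding schemes may work differently.

```python
from collections import defaultdict
from math import ceil, log2


def bits_for(n_distinct: int) -> int:
    """Width in bits of a fixed-width code over n_distinct >= 1 values."""
    return max(1, ceil(log2(n_distinct)))


def encode_dependent(ref_col, dep_col):
    """Encode dep_col as indexes into per-reference-value local dictionaries.

    Hypothetical helper: ref_col plays the role of `city`,
    dep_col the role of `zip-code`.
    """
    local_dicts = defaultdict(list)  # ref value -> distinct dep values, in first-seen order
    positions = defaultdict(dict)    # ref value -> {dep value: local index}
    codes = []
    for ref, dep in zip(ref_col, dep_col):
        pos = positions[ref]
        if dep not in pos:
            pos[dep] = len(local_dicts[ref])
            local_dicts[ref].append(dep)
        codes.append(pos[dep])
    return dict(local_dicts), codes


def decode_dependent(ref_col, local_dicts, codes):
    """Recover the dependent column from the reference column and the local codes."""
    return [local_dicts[ref][code] for ref, code in zip(ref_col, codes)]


# Toy rows for illustration only (not taken from the DMV dataset).
city = ["Albany", "Albany", "Buffalo", "Albany", "Buffalo", "Buffalo"]
zip_code = ["12203", "12208", "14201", "12203", "14215", "14201"]

local_dicts, codes = encode_dependent(city, zip_code)

# Code width with one global dictionary vs. per-city dictionaries.
global_bits = bits_for(len(set(zip_code)))
local_bits = max(bits_for(len(values)) for values in local_dicts.values())

print("local codes:", codes)
print("bits per value (global dictionary):", global_bits)
print("bits per value (per-city dictionaries):", local_bits)
```

Decoding needs the $\texttt{city}$ value of the same row, so in this sketch the dependent column can only be read together with its reference column. The per-city dictionaries also occupy space themselves, which any real saving rate would have to account for.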

Updated: 2024-03-27