Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan, IEEE Transactions on Cloud Computing

IEEE Transactions on Cloud Computing (IF 6.5), Pub Date: 2023-12-05, DOI: 10.1109/tcc.2023.3339208
Grisha Weintraub₁, Ehud Gudes₁, Shlomi Dolev₁, Jeffrey D. Ullman₂

Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage holds the data, while compute engines run analytics on it in “on-demand” mode. However, to perform any computation in this architecture, the data must be moved from the storage layer to the compute layer over the network for each calculation, which hurts performance and requires huge network bandwidth. In this paper, we study different approaches to improving query performance in a data lake architecture. We define an optimization problem whose solution can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. We then demonstrate experimentally that our approach is feasible and efficient (up to a 30× improvement in query execution time on the TPC-H benchmark).


Updated: 2023-12-05
