ROME: All Overlays Lead to Aggregation, but Some Are Faster than Others (Just Accepted), ACM Transactions on Computer Systems


ACM Transactions on Computer Systems ( IF 1.5 ) Pub Date : 2022-03-16 , DOI: 10.1145/3516430
Marcel Blöcher, Emilio Coppa, Pascal Kleber, Patrick Eugster, William Culhane, Masoud Saeida Ardekani


Aggregation is common in data analytics and crucial to distilling information from large datasets, but current data analytics frameworks do not fully exploit the optimization potential of aggregation phases. This gap is particularly notable in current “online” approaches, which store data in main memory across nodes. Doing so shifts the bottleneck away from disk I/O toward network and compute resources, and thus increases the relative performance impact of distributed aggregation phases.

We present ROME, an aggregation system for use within data analytics frameworks or in isolation. ROME uses a set of novel heuristics, based primarily on basic knowledge of aggregation functions combined with deployment constraints, to efficiently aggregate the results of computations performed on individual data subsets across nodes (e.g., merging the sorted lists resulting from top-k). The user can either provide minimal information that allows our heuristics to be applied directly, or ROME can autodetect the relevant information at little cost. We integrated ROME as a subsystem into the Spark and Flink data analytics frameworks. Using real-world data, we experimentally demonstrate speedups of up to 3× over single-level aggregation overlays, up to 21% over other multi-level overlays, and 50% for iterative algorithms like gradient descent at 100 iterations.
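To make the distinction between single-level and multi-level aggregation overlays concrete, the following is a minimal sketch of the top-k example mentioned in the abstract. It is not ROME's algorithm or API, and it does not model the heuristics described above. The names `merge_top_k`, `aggregate_overlay`, and the `fan_in` parameter are hypothetical and introduced only for illustration. The sketch assumes each node has already produced a partial top-k list sorted in descending order.

```python
import heapq
from itertools import islice


def merge_top_k(partials, k):
    """Merge descending-sorted partial lists and keep the k largest items."""
    # heapq.merge lazily merges already-sorted inputs; reverse=True
    # means the inputs (and the output) are in descending order.
    return list(islice(heapq.merge(*partials, reverse=True), k))


def aggregate_overlay(partials, k, fan_in):
    """Aggregate level by level (hypothetical helper, not ROME's API).

    Each aggregator merges at most `fan_in` inputs. With fan_in >= number
    of partials this is a single-level overlay; a smaller fan_in yields a
    multi-level tree. Returns (top_k_result, number_of_levels).
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = [list(p) for p in partials]
    levels = 0
    while len(level) > 1:
        # Group the current level into chunks of fan_in and merge each chunk.
        level = [
            merge_top_k(level[i:i + fan_in], k)
            for i in range(0, len(level), fan_in)
        ]
        levels += 1
    result = level[0][:k] if level else []
    return result, levels


# Four nodes, each holding a descending-sorted partial result.
partials = [[9, 7, 1], [8, 6, 2], [10, 3, 0], [5, 4, 4]]
single_result, single_levels = aggregate_overlay(partials, k=3, fan_in=4)
tree_result, tree_levels = aggregate_overlay(partials, k=3, fan_in=2)
```

Both overlays produce the same top-3 result; they differ only in how many inputs each aggregator merges and how many levels the data passes through. That trade-off between fan-in and depth is what an overlay choice has to balance.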


Updated: 2022-03-16
