Tuning parameters of Apache Spark with Gauss–Pareto-based multi-objective optimization
Knowledge and Information Systems ( IF 2.7 ) Pub Date : 2023-12-13 , DOI: 10.1007/s10115-023-02032-z
M. Maruf Öztürk

When making a final decision about the distinctive features of big data platforms, one should note that they expose configurable parameters. Apache Spark is an open-source big data processing platform that can process real-time data, and it requires an advanced central processing unit and high memory capacity. It therefore exposes a great number of configurable parameters, such as the number of cores and the driver memory, that are tuned during execution. In contrast to preceding works, this study develops a Kriging-based multi-objective optimization method. Kriging-based means that a surrogate model is executed to create a response surface that provides a set of optimal solutions. The most important advantage of the proposed method over the alternatives is that it consists of three fitness functions. The method is evaluated on the MLlib library and the HiBench benchmarks. MLlib provides various machine learning algorithms that are suitable for execution on resilient distributed datasets. The experimental results show that the proposed method outperformed the alternatives in hypervolume improvement and in reducing uncertainty. Further, the results support the hypothesis that focusing on the parameters associated with data compression and memory usage improves the effectiveness of multi-objective optimization methods developed for Spark. Multi-objective optimization leads to an inevitable complexity in Spark due to the dimensionality of the objective functions. Although simplifying the setup and steps of optimization has proven to be the most effective way to reduce that complexity, it does little to avoid ambiguity in the Pareto front. The proposed method achieved a 1.93x speedup in the benchmark experiments, a remarkable margin of 0.63x over the closest competitor.
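The parameters in question are standard Spark configuration properties. A candidate configuration of the kind such an optimizer would explore over cores, memory, and compression might look like the following fragment (property names are real Spark settings; the values are illustrative, not taken from the paper):

```properties
# Hypothetical candidate configuration explored during tuning
spark.executor.cores        4
spark.driver.memory         4g
spark.executor.memory       8g
spark.memory.fraction       0.6
spark.io.compression.codec  lz4
spark.rdd.compress          true
```

The compression and memory keys here correspond to the parameter group the abstract singles out as most influential for the optimization objectives.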
Increasing the number of cores in multi-objective optimization does not contribute to speedup; rather, it leads to a waste of CPU resources. Instead, the optimal number of cores should be determined by examining how speedup changes across varying Spark configurations.
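The recommendation above can be sketched as a small selection routine: measure job runtimes under varying core counts, convert them to speedups over a baseline, and pick the smallest core count whose speedup is close to the best, so that extra CPUs are not wasted on marginal gains. The runtimes and the 5% plateau threshold below are hypothetical, not figures from the paper:

```python
# Sketch: choose the executor core count by checking how speedup changes,
# rather than assuming that more cores is always better.

def best_core_count(runtimes_s, baseline_cores):
    """runtimes_s maps core count -> measured job runtime in seconds.

    Returns (cores, speedup) for the smallest core count within 5%
    of the best observed speedup over the baseline configuration.
    """
    baseline = runtimes_s[baseline_cores]
    speedups = {cores: baseline / t for cores, t in runtimes_s.items()}
    best = max(speedups.values())
    for cores in sorted(speedups):
        if speedups[cores] >= 0.95 * best:
            return cores, speedups[cores]

# Illustrative measurements: speedup plateaus after 8 cores,
# so 16 cores would only waste CPU resources.
measured = {2: 120.0, 4: 66.0, 8: 40.0, 16: 39.0}
cores, sp = best_core_count(measured, baseline_cores=2)
print(cores, round(sp, 2))  # -> 8 3.0
```

In a real tuning loop the runtimes would come from repeated Spark job executions under each configuration, averaged to reduce measurement noise.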




Updated: 2023-12-15