Abstract
Big data platforms expose many configurable parameters, and choosing appropriate values for them has a decisive effect on performance. Apache Spark is an open-source big data processing platform that supports real-time processing and demands substantial CPU and memory resources; accordingly, it exposes a large number of configurable parameters, such as the number of cores and the driver memory, that can be tuned for each execution. Unlike previous works, this study develops a Kriging-based multi-objective optimization method: a surrogate model builds a response surface and yields a set of Pareto-optimal solutions. The main advantage of the proposed method over the alternatives is that it optimizes three fitness functions simultaneously. The method is evaluated on the MLlib library and the HiBench benchmarks; MLlib provides a range of machine learning algorithms suited to execution on resilient distributed datasets (RDDs). The experimental results show that the proposed method outperforms the alternatives in hypervolume improvement and in reducing uncertainty. Further, the results support the hypothesis that focusing on the parameters associated with data compression and memory usage improves the effectiveness of multi-objective optimization methods developed for Spark. Multi-objective optimization in Spark is inevitably complex because of the dimensionality of the objective functions. Although simplifying the optimization setup and its steps has proven to be the most effective way to reduce that complexity, it does little to resolve the ambiguity of the Pareto front. The proposed method achieved a 1.93x speedup in the benchmark experiments, a remarkable margin of 0.63 over the speedup of the closest competitor.
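The paper's full Gauss–Pareto method is not reproduced here; as a minimal, single-objective sketch of the Kriging idea it describes, the snippet below fits a Gaussian process surrogate (via scikit-learn) to hypothetical (cores, driver memory) configurations and their measured runtimes, then scores unseen candidates by expected improvement. All configuration values and runtimes are invented for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical observed configurations: (executor cores, driver memory in GB).
X = np.array([[1, 2], [2, 4], [4, 4], [4, 8], [8, 8]], dtype=float)
# Hypothetical measured execution times in seconds (lower is better).
y = np.array([420.0, 310.0, 250.0, 235.0, 260.0])

# Kriging surrogate: a GP regression model of runtime over the config space.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Unseen candidate configurations to score.
candidates = np.array([[2, 8], [6, 8], [8, 16]], dtype=float)
mu, sigma = gp.predict(candidates, return_std=True)

# Expected improvement over the best observed runtime (minimization).
best = y.min()
sigma = np.maximum(sigma, 1e-9)  # guard against zero predictive variance
z = (best - mu) / sigma
ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# The candidate with the highest EI would be benchmarked next.
next_config = candidates[int(np.argmax(ei))]
```

The actual method optimizes three fitness functions over a Pareto front rather than this single objective, but the surrogate-plus-acquisition loop is the same shape.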
Increasing the number of cores beyond a certain point does not contribute to speedup in multi-objective optimization; it merely wastes CPU resources. Instead, the optimal number of cores should be determined by observing how the speedup changes across varying Spark configurations.
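That selection rule can be sketched as follows. The runtimes below are invented, and the 10% tolerance is an illustrative threshold, not a value from the paper; real numbers would come from benchmark runs with `spark.executor.cores` varied.

```python
# Hypothetical wall-clock times (seconds) for a fixed workload as the
# number of executor cores varies; real values would come from benchmarks.
runtimes = {1: 480.0, 2: 260.0, 4: 150.0, 8: 140.0, 16: 142.0}

# Speedup relative to the single-core baseline.
baseline = runtimes[1]
speedups = {cores: baseline / t for cores, t in runtimes.items()}

# Pick the smallest core count whose speedup is within 10% of the best,
# so that cores which no longer improve speedup are not wasted.
best = max(speedups.values())
optimal_cores = min(c for c, s in speedups.items() if s >= 0.90 * best)
```

With these sample numbers the rule settles on 4 cores: 8 and 16 cores give only marginally better speedup, which is exactly the waste the text warns against.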
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
MMÖ was involved in conceptualization, methodology, data curation, writing—original draft, visualization, and investigation.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Öztürk, M.M. Tuning parameters of Apache Spark with Gauss–Pareto-based multi-objective optimization. Knowl Inf Syst 66, 1065–1090 (2024). https://doi.org/10.1007/s10115-023-02032-z