Abstract
Big data platforms expose many configurable parameters, and choosing appropriate values for them has a decisive effect on performance. Apache Spark is an open-source big data processing platform that supports real-time processing and demands substantial CPU and memory resources; accordingly, it exposes a large number of configurable parameters, such as the number of cores and the driver memory, that can be tuned for each execution. Unlike previous works, this study develops a Kriging-based multi-objective optimization method: a surrogate model builds a response surface and yields a set of Pareto-optimal solutions. The main advantage of the proposed method over the alternatives is that it optimizes three fitness functions simultaneously. The method is evaluated on the MLlib library and the HiBench benchmarks; MLlib provides a range of machine learning algorithms suited to execution on resilient distributed datasets (RDDs). The experimental results show that the proposed method outperforms the alternatives in hypervolume improvement and in reducing uncertainty. Further, the results support the hypothesis that focusing on the parameters associated with data compression and memory usage improves the effectiveness of multi-objective optimization methods developed for Spark. Multi-objective optimization in Spark is inevitably complex because of the dimensionality of the objective functions. Although simplifying the optimization setup and its steps has proven to be the most effective way to reduce that complexity, it does little to resolve the ambiguity of the Pareto front. The proposed method achieved a 1.93x speedup in the benchmark experiments, a remarkable margin of 0.63 over the speedup of the closest competitor.
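The paper's full Gauss–Pareto method is not reproduced here; as a minimal, single-objective sketch of the Kriging idea it describes, the snippet below fits a Gaussian process surrogate (via scikit-learn) to hypothetical (cores, driver memory) configurations and their measured runtimes, then scores unseen candidates by expected improvement. All configuration values and runtimes are invented for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical observed configurations: (executor cores, driver memory in GB).
X = np.array([[1, 2], [2, 4], [4, 4], [4, 8], [8, 8]], dtype=float)
# Hypothetical measured execution times in seconds (lower is better).
y = np.array([420.0, 310.0, 250.0, 235.0, 260.0])

# Kriging surrogate: a GP regression model of runtime over the config space.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Unseen candidate configurations to score.
candidates = np.array([[2, 8], [6, 8], [8, 16]], dtype=float)
mu, sigma = gp.predict(candidates, return_std=True)

# Expected improvement over the best observed runtime (minimization).
best = y.min()
sigma = np.maximum(sigma, 1e-9)  # guard against zero predictive variance
z = (best - mu) / sigma
ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# The candidate with the highest EI would be benchmarked next.
next_config = candidates[int(np.argmax(ei))]
```

The actual method optimizes three fitness functions over a Pareto front rather than this single objective, but the surrogate-plus-acquisition loop is the same shape.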
Increasing the number of cores beyond a certain point does not contribute to speedup in multi-objective optimization; it merely wastes CPU resources. Instead, the optimal number of cores should be determined by observing how the speedup changes across varying Spark configurations.
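That selection rule can be sketched as follows. The runtimes below are invented, and the 10% tolerance is an illustrative threshold, not a value from the paper; real numbers would come from benchmark runs with `spark.executor.cores` varied.

```python
# Hypothetical wall-clock times (seconds) for a fixed workload as the
# number of executor cores varies; real values would come from benchmarks.
runtimes = {1: 480.0, 2: 260.0, 4: 150.0, 8: 140.0, 16: 142.0}

# Speedup relative to the single-core baseline.
baseline = runtimes[1]
speedups = {cores: baseline / t for cores, t in runtimes.items()}

# Pick the smallest core count whose speedup is within 10% of the best,
# so that cores which no longer improve speedup are not wasted.
best = max(speedups.values())
optimal_cores = min(c for c, s in speedups.items() if s >= 0.90 * best)
```

With these sample numbers the rule settles on 4 cores: 8 and 16 cores give only marginally better speedup, which is exactly the waste the text warns against.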
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
MMÖ was involved in conceptualization, methodology, data curation, writing—original draft, visualization, and investigation.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Öztürk, M.M. Tuning parameters of Apache Spark with Gauss–Pareto-based multi-objective optimization. Knowl Inf Syst 66, 1065–1090 (2024). https://doi.org/10.1007/s10115-023-02032-z