
Tuning parameters of Apache Spark with Gauss–Pareto-based multi-objective optimization

  • Regular Paper
Knowledge and Information Systems

Abstract

Big data platforms expose a large number of configurable parameters, and choosing appropriate values for them is a nontrivial decision. Apache Spark is an open-source big data processing platform that can process real-time data; because it demands substantial CPU and memory resources, it exposes many tunable parameters, such as the number of cores and the driver memory, that can be adjusted per execution. Unlike preceding works, this study develops a Kriging-based multi-objective optimization method: a surrogate model is executed to create a response surface that yields a set of optimal solutions. The most important advantage of the proposed method over the alternatives is that it employs three fitness functions. The method is evaluated on the MLlib library and the HiBench benchmarks; MLlib provides various machine learning algorithms suited to execution on resilient distributed data sets. The experimental results show that the proposed method outperforms the alternatives in hypervolume improvement and in reducing uncertainty. Further, the results support the hypothesis that focusing on the parameters associated with data compression and memory usage improves the effectiveness of multi-objective optimization methods developed for Spark. Multi-objective optimization in Spark is inevitably complex because of the dimensionality of the objective functions; although simplifying the optimization setup and steps has proven to be the most effective way to reduce that complexity, it does little to resolve the ambiguity of the Pareto front. The proposed method achieved a 1.93x speedup in the benchmark experiments, a remarkable margin of 0.63 over the speedup of the closest competitor.
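Although the paper's full Kriging-based method is not reproduced here, the Pareto-based selection at its core can be illustrated. The sketch below is a minimal, hypothetical example (not the authors' implementation): it filters non-dominated Spark configurations under two objectives, execution time and memory usage, both to be minimized; the measurement values are invented for illustration.

```python
def dominates(a, b):
    """Return True if objective vector a dominates b (all objectives minimized):
    a is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the non-dominated points (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (runtime_s, memory_gb) measurements for four Spark configurations
measurements = [(120.0, 8.0), (95.0, 12.0), (150.0, 6.0), (130.0, 9.0)]
front = pareto_front(measurements)
# (130.0, 9.0) is dominated by (120.0, 8.0); the other three are non-dominated
```

In the paper's setting, a surrogate (Kriging) model predicts such objective vectors instead of measuring every configuration, and the hypervolume enclosed by the resulting front serves as the quality indicator.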
Increasing the number of cores does not by itself contribute to speedup in multi-objective optimization; rather, it wastes CPU resources. Instead, the optimal number of cores should be determined by observing how the speedup changes across varying Spark configurations.
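The parameter groups the abstract highlights (core count, driver memory, and the compression and memory-usage settings) correspond to standard Spark configuration properties. The sketch below is illustrative only: the keys are real Spark properties, but the values are placeholders, not the tuned optima reported in the paper.

```python
# Illustrative candidate configuration (not the paper's tuned values) covering
# the parameter groups the study highlights: cores, memory, and compression.
candidate = {
    "spark.executor.cores": "4",
    "spark.driver.memory": "4g",
    "spark.executor.memory": "8g",
    "spark.memory.fraction": "0.6",       # fraction of heap for execution/storage
    "spark.rdd.compress": "true",         # compress serialized RDD partitions
    "spark.io.compression.codec": "lz4",  # codec used for internal compression
}

def to_submit_flags(conf):
    """Render a configuration as spark-submit --conf flags."""
    return " ".join(f"--conf {k}={v}" for k, v in sorted(conf.items()))

print(to_submit_flags(candidate))
```

An optimizer such as the one proposed would generate many such candidates, launch the workload with each, and feed the measured objectives back into the surrogate model.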


(Figures 1–14 and Algorithm 1 appear in the full article.)




Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Authors

Contributions

MMÖ was involved in conceptualization, methodology, data curation, writing—original draft, visualization, and investigation.

Corresponding author

Correspondence to M. Maruf Öztürk.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Öztürk, M.M. Tuning parameters of Apache Spark with Gauss–Pareto-based multi-objective optimization. Knowl Inf Syst 66, 1065–1090 (2024). https://doi.org/10.1007/s10115-023-02032-z

