Improving resource utilization and fault tolerance in large simulations via actors

Cluster Computing

Abstract

Large simulations with many independent sub-simulations are common in scientific computing. There are numerous challenges, however, associated with performing such simulations in shared computing environments. For example, sub-simulations may have wildly varying completion times or not complete at all, leading to unpredictable runtimes as well as unbalanced and inefficient use of human and computational resources. In this study, we use the actor model of concurrent computation to improve both the resource utilization and fault tolerance for large-scale scientific computing simulations. More specifically, we use actors in the SUMMA model to manage a large-scale hydrological simulation over the North American continent with over 500,000 independent sub-simulations. We find that the actors implementation outperforms a standard array job submission as well as the job submission tool GNU Parallel by better balancing the computational load across processors. The actors implementation also improves fault tolerance and can eliminate the user intervention required to detect and re-submit failed jobs.
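The load-balancing argument in the abstract can be illustrated with a toy model. The sketch below is not SUMMA-Actors code; the function names and task durations are hypothetical and chosen only for illustration. It compares a static, array-job-style split of tasks against a coordinator that hands the next task to whichever worker is free and re-queues any task that fails, which is the behaviour the abstract attributes to the actor-based approach.

```python
import heapq
from collections import deque


def static_makespan(durations, n_workers):
    """Array-job style: each worker gets a fixed contiguous chunk up front."""
    chunk = -(-len(durations) // n_workers)  # ceiling division
    return max(sum(durations[i:i + chunk])
               for i in range(0, len(durations), chunk))


def dynamic_makespan(durations, n_workers, fails_once=frozenset()):
    """Coordinator style: the next task goes to whichever worker is idle first.

    Tasks whose index is in `fails_once` fail on their first attempt and are
    re-queued automatically. Simplifying assumption: a failed attempt still
    costs its full duration.
    Returns (makespan, total number of attempts).
    """
    free_at = [0.0] * n_workers           # min-heap of worker idle times
    heapq.heapify(free_at)
    queue = deque(enumerate(durations))
    pending_failures = set(fails_once)
    attempts = 0
    while queue:
        idx, duration = queue.popleft()
        start = heapq.heappop(free_at)    # earliest available worker
        heapq.heappush(free_at, start + duration)
        attempts += 1
        if idx in pending_failures:       # first attempt failed: retry later
            pending_failures.discard(idx)
            queue.append((idx, duration))
    return max(free_at), attempts


durations = [10, 10, 1, 1, 1, 1, 1, 1]    # two slow tasks, six fast ones
print(static_makespan(durations, 2))                    # 22
print(dynamic_makespan(durations, 2))                   # (13.0, 8)
print(dynamic_makespan(durations, 2, fails_once={2}))   # (14.0, 9)
```

In this toy case the static split places both slow tasks on the same worker, while dynamic dispatch spreads them out, and the failed task is retried without any manual resubmission.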

Notes

  1. A large set of jobs is typically run through the Slurm workload manager [26] as an array job.

  2. https://docs.alliancecan.ca/wiki/National_systems.

  3. https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html

References

  1. Agha, G.: Concurrent object-oriented programming. Commun. ACM 33(9), 125–141 (1990). https://doi.org/10.1145/83880.84528

  2. Agha, G.A.: Actors - a model of concurrent computation in distributed systems. MIT Press series in artificial intelligence, Tech. Rep. (1985)

  3. Anderson, D.P.: BOINC: A platform for volunteer computing. J. Grid Comput. 18(1), 99–122 (2020). https://doi.org/10.1007/s10723-019-09497-9

  4. Armstrong, J.: Erlang – a survey of the language and its industrial applications. In: Proc. INAP, pp 16–18 (1996)

  5. Babuji, Y., Woodard, A., Li, Z., et al.: Parsl: Pervasive parallel programming in python. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp 25–36 (2019)

  6. Balis, B., Borowski, K.: Using an actor framework for scientific computing: opportunities and challenges. Comput. Inform. 35, 870–889 (2016)

  7. Charousset, D., Hiesgen, R., Schmidt, T.C.: Revisiting actor programming in C++. Comput. Lang. Syst. Struct. 45, 105–131 (2016)

  8. Clark, M.P., Nijssen, B., Lundquist, J.D., et al.: A unified approach for process-based hydrologic modeling: 2. model implementation and case studies. Water Resour. Res. 51(4), 2515–2542 (2015)

  9. Cueto, C., Bates, O., Strong, G., et al.: Stride: A flexible software platform for high-performance ultrasound computed tomography. Comput. Methods Programs Biomed. 221, 106855 (2022)

  10. De Koster, J., Van Cutsem, T., De Meuter, W.: 43 years of actors: A taxonomy of actor models and their key properties. In: Proceedings of the 6th International Workshop on Programming Based on Actors, Agents, and Decentralized Control (AGERE 2016), pp 31–40. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/3001886.3001890

  11. Dongarra, J., Herault, T., Robert, Y.: Fault Tolerance Techniques for High-Performance Computing. Springer, Cham, chap 1, 3–85 (2015). https://doi.org/10.1007/978-3-319-20943-2_1

  12. Haller, P., Odersky, M.: Scala actors: Unifying thread-based and event-based programming. Theor. Comput. Sci. 410, 202–220 (2009)

  13. Hewitt, C., Bishop, P., Steiger, R.: A universal modular ACTOR formalism for artificial intelligence. In: Proc. International Joint Conference on Artificial Intelligence, pp 235–245 (1973)

  14. Hewitt, C.E.: Actor model of computation: scalable robust information systems. arXiv: Programming Languages (2010)

  15. Jain, A., Ong, S.P., Chen, W., et al.: Fireworks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. Pract. Exp. 27(17), 5037–5059 (2015)

  16. Janczykowski, M., Turek, W., Malawski, M., et al.: Large-scale urban traffic simulation with scala and high-performance computing system. J. Comput. Sci. 35, 91–101 (2019)

  17. Klenk, K., Green, K.R., Spiteri, R.J.: Summa actors. https://git.cs.usask.ca/numerical_simulations_lab/actors/Summa-Actors (2023)

  18. Knoben, W.J.M., Clark, M.P., Bales, J., et al.: Community workflows to advance reproducibility in hydrologic modeling: Separating model-agnostic and model-specific configuration steps in applications of large-domain hydrologic models. Water Resour. Res. 58(11), e2021WR031753 (2022). https://doi.org/10.1029/2021WR031753

  19. Merzky, A., Santcroos, M., Turilli, M., et al.: Radical-pilot: Scalable execution of heterogeneous and dynamic workloads on supercomputers. CoRR, arXiv:1512.08194 (2015)

  20. Miller, M.S., Tribble, E.D., Shapiro, J.: Concurrency among strangers. In: De Nicola, R., Sangiorgi, D. (eds.) Trustworthy Global Computing, pp. 195–229. Springer, Berlin, Heidelberg (2005)

  21. Pellegrino, M., Lombardo, G., Cagnoni, S., et al.: High-performance computing and abms for high-resolution covid-19 spreading simulation. Future Internet 14(3), 83 (2022)

  22. Starzec, M., Starzec, G., Byrski, A., et al.: Distributed ant colony optimization based on actor model. Parallel Comput. 90, 102573 (2019)

  23. Tange, O.: GNU Parallel 2018. Ole Tange (2018). https://doi.org/10.5281/zenodo.1146014

  24. Tulika, E., Doroshenko, A., Zhereb, K.: Using choreography of actors and rewriting rules to adapt legacy Fortran programs to cloud computing. In: Ginige, A., Mayr, H.C., Plexousakis, D., et al. (eds.) Information and Communication Technologies in Education, Research, and Industrial Applications, pp. 76–96. Springer, Cham (2017)

  25. Varela, C., Agha, G.: Programming dynamically reconfigurable open systems with SALSA. SIGPLAN Not. 36(12), 20–34 (2001). https://doi.org/10.1145/583960.583964

  26. Yoo, A.B., Jette, M.A., Grondona, M.: Slurm: Simple linux utility for resource management. In: Job Scheduling Strategies for Parallel Processing, pp. 44–60. Springer, Berlin, Heidelberg, Lecture Notes in Computer Science (2003)

  27. Yuan, Y., Wu, Y., Wang, Q., et al.: Job failures in high performance computing systems: a large-scale empirical study. Comput. Math. Appl. 63(2), 365–377 (2012). https://doi.org/10.1016/j.camwa.2011.07.040

Acknowledgements

The authors would like to thank Dr. Reza Zolfaghari and Dr. Kevin R. Green for their help in the initial development of SUMMA-Actors and this paper.

Author information

Corresponding author

Correspondence to Kyle Klenk.

About this article

Cite this article

Klenk, K., Spiteri, R.J. Improving resource utilization and fault tolerance in large simulations via actors. Cluster Comput (2024). https://doi.org/10.1007/s10586-024-04318-5

  • DOI: https://doi.org/10.1007/s10586-024-04318-5
