
A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC

International Journal of Parallel Programming

Abstract

Large-scale HPC systems experience failures arising from faults in hardware, software, and networking, and failure rates continue to grow as systems scale up and out. Crash fault tolerance has up to now been the focus of efforts to augment the Message Passing Interface (MPI) for fault-tolerant operation. This narrow fault model (usually restricted to process or node failures) is insufficient: a chain of underlying component faults can lead to system failures in MPI that a crash-only model does not capture. Without a more general model for consensus, gaps will arise in the ability to detect, isolate, mitigate, and recover HPC applications efficiently. What is more, clusters and leadership-class machines alike often provide Reliability, Availability, and Serviceability (RAS) systems that convey predictive and real-time fault and error information, which does not map strictly to process and node crashes. A broader study of failures beyond crashes in MPI, together with consensus mechanisms, will therefore be useful to developers as they continue to design, develop, and implement fault-tolerant HPC systems that reflect the faults observable in actual systems. We describe key factors that must be considered during consensus-mechanism design. We illustrate some current MPI fault tolerance models through use cases. We offer a novel classification of common consensus mechanisms based on these factors, such as the network model and failure types, and on the use cases of consensus in the computation process (e.g., fault detection, synchronization), with crash fault tolerance as one category.
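The crash-only model discussed above is what current MPI fault-tolerance proposals expose to applications; ULFM [25, 62], for example, adds an agreement primitive, MPIX_Comm_agree, that is itself a crash-tolerant consensus. A minimal sketch in C, assuming an MPI library that ships the ULFM extensions (failure acknowledgment and work re-balancing details are omitted):

```c
/* Minimal sketch of crash-only MPI fault tolerance in the ULFM style
 * [25, 62]: survivors detect a failed collective, run a crash-tolerant
 * agreement, and shrink to a working communicator. Assumes an MPI
 * library providing the ULFM extensions (mpi-ext.h). */
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm world = MPI_COMM_WORLD;
    /* Report failures as error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

    int rank, data = 0;
    MPI_Comm_rank(world, &rank);

    int rc = MPI_Bcast(&data, 1, MPI_INT, 0, world);
    int ok = (rc == MPI_SUCCESS);

    /* ULFM's agreement primitive: a fault-tolerant consensus (bitwise
     * AND over 'ok') among the surviving processes. */
    MPIX_Comm_agree(world, &ok);

    if (!ok) {
        MPIX_Comm_revoke(world);              /* make the failure globally visible */
        MPI_Comm survivors;
        MPIX_Comm_shrink(world, &survivors);  /* agree on the surviving group */
        world = survivors;
        fprintf(stderr, "rank %d: continuing on shrunken communicator\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```

The agreement call is an instance of the crash-fault consensus category in the classification above; tolerating Byzantine faults or acting on RAS-reported errors requires different mechanisms.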


Notes

  1. Consensus under fault-free operation is also an inherent property of typical bulk-synchronous parallel (BSP) and data-parallel programs (see the sketch following these notes).

  2. Byzantine failures include crash failures.

  3. In theory, reaching agreement cannot be guaranteed in an asynchronous system subject to failures, but it is often achievable heuristically in practice (cf. FLP [34]).

  4. There can be security concerns about enabling a parallel program to receive fault information from outside the parallel system. Coping with possible covert channels by translating and vetting such information appears tractable in practice.
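To illustrate Note 1, a minimal sketch in standard MPI (no fault-tolerance extensions assumed): in a fault-free bulk-synchronous step, a collective such as MPI_Allreduce already yields agreement, because every rank finishes the step holding the identical reduced value.

```c
/* Note 1 illustrated: under fault-free operation, a collective
 * reduction is an implicit consensus step, since all ranks end up
 * with the same decision value. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, local_vote, decision;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_vote = (rank % 2 == 0);  /* each rank proposes 0 or 1 */

    /* Logical AND across all proposals: every rank receives the same
     * result, i.e., the group has implicitly agreed on a value. */
    MPI_Allreduce(&local_vote, &decision, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);

    printf("rank %d decided %d\n", rank, decision);
    MPI_Finalize();
    return 0;
}
```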

References

  1. Fromentin, E., Raynal, M., Tronel, F.: On classes of problems in asynchronous distributed systems with process crashes. In: Proceedings of the 19th IEEE International Conference on Distributed Computing Systems, pp. 470–477 (1999). https://doi.org/10.1109/ICDCS.1999.776549

  2. Hassani, A., Skjellum, A., Brightwell, R.: Design and evaluation of FA-MPI, a transactional resilience scheme for non-blocking MPI. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 750–755 (2014). https://doi.org/10.1109/DSN.2014.78

  3. Sultana, N., Rüfenacht, M., Skjellum, A., Laguna, I., Mohror, K.: Failure recovery for bulk synchronous applications with MPI stages. Parallel Comput. 84, 1–14 (2019). https://doi.org/10.1016/j.parco.2019.02.007


  4. Hassani, A.: Toward a scalable, transactional, fault-tolerant message passing interface for petascale and exascale machines. PhD dissertation, The University of Alabama at Birmingham (2014)

  5. Altarawneh, A., Herschberg, T., Medury, S., Kandah, F., Skjellum, A.: Buterin’s scalability trilemma viewed through a state-change-based classification for common consensus algorithms. In: 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0727–0736 (2020). https://doi.org/10.1109/CCWC47524.2020.9031204

  6. Dolev, D., Reischuk, R.: Bounds on information exchange for Byzantine agreement. J. ACM 32(1), 191–204 (1985)


  7. Giménez, A., Gamblin, T., Bhatele, A., Wood, C., Shoga, K., Marathe, A., Bremer, P.-T., Hamann, B., Schulz, M.: ScrubJay: deriving knowledge from the disarray of HPC performance data. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '17. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3126908.3126935

  8. Guo, H., Di, S., Gupta, R., Peterka, T., Cappello, F.: La VALSE: scalable log visualization for fault characterization in supercomputers. In: Childs, H., Cucchietti, F. (eds.) Eurographics Symposium on Parallel Graphics and Visualization. The Eurographics Association (2018)

  9. Martino, C.D., Jha, S., Kramer, W., Kalbarczyk, Z., Iyer, R.K.: LogDiver: a tool for measuring resilience of extreme-scale systems and applications. In: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS '15, pp. 11–18. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2751504.2751511

  10. Buntinas, D.: Scalable distributed consensus to support MPI fault tolerance. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249 (2012). https://doi.org/10.1109/IPDPS.2012.113

  11. Nowakowski, W.: Network management software for redundant Ethernet ring. Theor. Appl. Sci. 48, 24–29 (2017)


  12. Libby, R.: Effective HPC hardware management and failure prediction strategy using IPMI. In: Proceedings of the Linux Symposium. Citeseer (2003)

  13. Baudet, M., Ching, A., Chursin, A., Danezis, G., Garillot, F., Li, Z., Malkhi, D., Naor, O., Perelman, D., Sonnino, A.: State machine replication in the Libra blockchain (2019)

  14. Driscoll, K., Hall, B., Paulitsch, M., Zumsteg, P., Sivencrona, H.: The real Byzantine generals. In: The 23rd Digital Avionics Systems Conference, vol. 2, pp. 6.D.4–61 (2004). https://doi.org/10.1109/DASC.2004.1390734

  15. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard, Version 3.1. High-Performance Computing Center Stuttgart, University of Stuttgart (2015). https://books.google.com/books?id=Fbv7jwEACAAJ

  16. Bar-Noy, A., Dolev, D.: Consensus algorithms with one-bit messages. Distrib. Comput. 4(3), 105–110 (1991)


  17. Castro, M., Liskov, B.: Practical Byzantine fault tolerance. In: Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, Louisiana, USA, pp. 173–186 (1999). https://dl.acm.org/citation.cfm?id=296824

  18. El-Sayed, N., Schroeder, B.: Reading between the lines of failure logs: understanding how HPC systems fail. In: 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1–12. IEEE (2013)

  19. King, S., Nadal, S.: PPCoin: peer-to-peer crypto-currency with proof-of-stake. Self-published paper (2012)

  20. Ismail, L., Materwala, H.: A review of blockchain architecture and consensus protocols: use cases, challenges, and solutions. Symmetry (2019). https://doi.org/10.3390/sym11101198


  21. Ongaro, D., Ousterhout, J.: In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC '14, pp. 305–320. USENIX Association, USA (2014)

  22. Ferreira, K., Stearley, J., Laros, J. H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P. G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Association for Computing Machinery, SC ’11, New York (2011a). https://doi.org/10.1145/2063384.2063443

  23. Chang, T.-H., Hong, M., Liao, W.-C., Wang, X.: Asynchronous distributed ADMM for large-scale optimization, part I: algorithm and convergence analysis. IEEE Trans. Signal Process. 64(12), 3118–3130 (2016). https://doi.org/10.1109/TSP.2016.2537271


  24. Yin, M., Malkhi, D., Reiter, M.K., Gueta, G.G., Abraham, I.: HotStuff: BFT consensus with linearity and responsiveness. In: Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pp. 347–356. ACM (2019)

  25. Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.: Post-failure recovery of MPI communication capability: design and rationale. Int. J. High Perform. Comput. Appl. 27(3), 244–254 (2013). https://doi.org/10.1177/1094342013488238


  26. Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Scalable and fault tolerant failure detection and consensus. In: Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI '15. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2802658.2802660

  27. Popov, S.: The Tangle. White paper 1(3) (2018)

  28. Altarawneh, A., Skjellum, A.: The security ingredients for correct and Byzantine fault-tolerant blockchain consensus algorithms. In: 2020 International Symposium on Networks, Computers and Communications (ISNCC), pp. 1–9 (2020). https://doi.org/10.1109/ISNCC49221.2020.9297326

  29. Al-Mamun, A., Li, T., Sadoghi, M., Jiang, L., Shen, H.-T., Zhao, D.: HPChain: an MPI-based blockchain framework for data fidelity in high-performance computing systems (2019)

  30. Cachin, C., Vukolić, M.: Blockchain consensus protocols in the wild. arXiv preprint arXiv:1707.01873 (2017)

  31. De Angelis, S.: Assessing security and performances of consensus algorithms for permissioned blockchains. arXiv preprint arXiv:1805.03490 (2018)

  32. Dwork, C., Naor, M.: Pricing via processing or combatting junk mail. In: Brickell, E.F. (ed.) Advances in Cryptology – CRYPTO’ 92, pp. 139–147. Springer, Berlin Heidelberg (1993)


  33. Lamport, L.: The weak Byzantine generals problem. J. ACM 30(3), 668–676 (1983). https://doi.org/10.1145/2402.322398


  34. Bosilca, G., Bouteiller, A., Herault, T., Le Fèvre, V., Robert, Y., Dongarra, J.: Revisiting credit distribution algorithms for distributed termination detection. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 611–620 (2021). https://doi.org/10.1109/IPDPSW52791.2021.00095

  35. Moise, I.: Efficient agreement protocols in asynchronous distributed systems. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp. 2022–2025. IEEE (2011)

  36. Ropars, T., Lefray, A., Kim, D., Schiper, A.: Efficient process replication for MPI applications: sharing work between replicas. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 645–654 (2015). https://doi.org/10.1109/IPDPS.2015.29

  37. Fischer, M.J.: The consensus problem in unreliable distributed systems (a brief survey). In: Karpinski, M. (ed.) Foundations of Computation Theory, pp. 127–140. Springer, Berlin Heidelberg (1983)


  38. Sankaran, S., Squyres, J.M., Barrett, B., Sahay, V., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005). https://doi.org/10.1177/1094342005056139


  39. Ferreira, K., Stearley, J., Laros, J.H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P.G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2011b). https://doi.org/10.1145/2063384.2063443

  40. Woo, S., Lang, S., Latham, R., Ross, R., Thakur, R.: Reliable MPI-IO through layout-aware replication (2011)

  41. Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998). https://doi.org/10.1145/279227.279229


  42. Borowsky, E., Gafni, E.: Generalized FLP impossibility result for t-resilient asynchronous computations. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, STOC '93, pp. 91–100. Association for Computing Machinery, New York (1993). https://doi.org/10.1145/167088.167119

  43. Brokaw, T., Koziuk, G.: The intelligent platform management interface (IPMI) and enclosure management. Electron. Eng. (Lond.) 72, 19 (2000)


  44. Costa, C.H.A., Park, Y., Rosenburg, B.S., Cher, C.-Y., Ryu, K.D.: A system software approach to proactive memory-error avoidance. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, pp. 707–718. IEEE Press (2014). https://doi.org/10.1109/SC.2014.63

  45. Di, S., Gupta, R., Snir, M., Pershey, E., Cappello, F.: LogAider: a tool for mining potential correlations of HPC log events. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 442–451 (2017). https://doi.org/10.1109/CCGRID.2017.18

  46. Leners, J.B., Wu, H., Hung, W.-L., Aguilera, M.K., Walfish, M.: Detecting failures in distributed systems with the Falcon spy network. In: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 279–294. Association for Computing Machinery, New York (2011). https://doi.org/10.1145/2043556.2043583

  47. Moses, Y., Raynal, M.: Revisiting simultaneous consensus with crash failures. J. Parallel Distrib. Comput. 69(4), 400–409 (2009). https://doi.org/10.1016/j.jpdc.2009.01.001


  48. Bano, S., Sonnino, A., Al-Bassam, M., Azouvi, S., McCorry, P., Meiklejohn, S., Danezis, G.: SoK: consensus in the age of blockchains. In: Proceedings of the 1st ACM Conference on Advances in Financial Technologies, pp. 183–198 (2019)

  49. Aguilera, M. K., Toueg, S.: Randomization and failure detection: a hybrid approach to solve consensus. Technical report (1996)

  50. Al-Mamun, A., Zhao, D.: BAASH: enabling blockchain-as-a-service on high-performance computing systems. arXiv preprint arXiv:2001.07022 (2020)

  51. Lamport, L., Shostak, R.E., Pease, M.C.: The Byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3), 382–401 (1982)


  52. Duan, S.: Building reliable and practical Byzantine fault tolerance. PhD dissertation, University of California, Davis (2016)

  53. Fan, X., Chai, Q.: Roll-DPoS: a randomized delegated proof of stake scheme for scalable blockchain-based internet of things systems. In: Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, MobiQuitous '18, pp. 482–484. New York (2018). https://doi.org/10.1145/3286978.3287023

  54. Snir, M., Wisniewski, R.W., Abraham, J.A., Adve, S.V., Bagchi, S., Balaji, P., Belak, J., Bose, P., Cappello, F., Carlson, B., Chien, A.A., Coteus, P., Debardeleben, N.A., Diniz, P.C., Engelmann, C., Erez, M., Fazzari, S., Geist, A., Gupta, R., Johnson, F., Krishnamoorthy, S., Leyffer, S., Liberty, D., Mitra, S., Munson, T., Schreiber, R., Stearley, J., Hensbergen, E.V.: Addressing failures in exascale computing. Int. J. High Perform. Comput. Appl. 28(2), 129–173 (2014)


  55. Buntinas, D.: Scalable distributed consensus to support MPI fault tolerance. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 1240–1249. IEEE (2012)

  56. Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. IEEE Trans. Depend. Secur. Comput. 7(4), 337–350 (2009)


  57. Leesatapornwongsa, T., Lukman, J.F., Lu, S., Gunawi, H.S.: TaxDC: a taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. SIGPLAN Not. 51(4), 517–530 (2016). https://doi.org/10.1145/2954679.2872374


  58. Omwenga, M., Otim, J., Lumala, A.: Robust mobile cloud services through offline support, pp. 90–93 (2012). https://doi.org/10.1109/ACSEAC.2012.27

  59. Stone, J., Partridge, C.: When the CRC and TCP checksum disagree. SIGCOMM Comput. Commun. Rev. 30(4), 309–319 (2000). https://doi.org/10.1145/347057.347561


  60. Huang, S.-T.: Detecting termination of distributed computations by external agents. In: Proceedings of the 9th International Conference on Distributed Computing Systems, pp. 79–84 (1989). https://doi.org/10.1109/ICDCS.1989.37933

  61. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011). https://doi.org/10.1561/2200000016


  62. Losada, N., González, P., Martín, M.J., Bosilca, G., Bouteiller, A., Teranishi, K.: Fault tolerance of MPI applications in exascale systems: the ULFM solution. Future Gener. Comput. Syst. 106, 467–481 (2020). https://doi.org/10.1016/j.future.2020.01.026


  63. Hassani, A., Skjellum, A., Bangalore, P.V., Brightwell, R.: Practical resilient cases for FA-MPI, a transactional fault-tolerant MPI. In: Proceedings of the 3rd Workshop on Exascale MPI, ExaMPI '15. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2831129.2831130

  64. Hursey, J., Naughton, T., Vallee, G., Graham, R.L.: A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds.) Recent Advances in the Message Passing Interface, pp. 255–263. Springer, Berlin Heidelberg (2011)


  65. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35(2), 288–323 (1988). https://doi.org/10.1145/42282.42283


  66. García-Pérez, Á., Gotsman, A., Meshman, Y., Sergey, I.: Paxos consensus, deconstructed and abstracted. In: Ahmed, A. (ed.) Programming Languages and Systems, pp. 912–939. Springer International Publishing, Cham (2018)


  67. Castro, M., Liskov, B.: Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst. 20(4), 398–461 (2002). https://doi.org/10.1145/571637.571640



Acknowledgements

This work was performed with partial support from the National Science Foundation under Grant Nos. CCF-1562659, CCF-1562306, CCF-1617690, CCF-1822191, and CCF-1821431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information


Contributions

The first draft of the manuscript was written by all authors, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Grace Nansamba.

Ethics declarations

Conflict of interest

The authors confirm that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nansamba, G., Altarawneh, A. & Skjellum, A. A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC. Int J Parallel Prog 51, 128–149 (2023). https://doi.org/10.1007/s10766-022-00749-y



