Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning

International Journal of Parallel Programming

Abstract

As multicore systems built from ever-smaller transistors become widespread, soft errors have become an important concern for parallel program execution. Fault injection is a prevalent method for quantifying the soft error rates of applications, but detailed fault injection experiments are very time-consuming. Prediction-based techniques have therefore been proposed to evaluate soft error vulnerability more quickly. In this work, we present a soft error vulnerability prediction approach for parallel applications using machine learning (ML) algorithms. We define a set of features covering thread communication, data sharing, parallel programming, and performance characteristics, and train our models with three ML algorithms. This study is the first to use parallel programming features, as well as the combination of all these features, in vulnerability prediction of parallel programs. We propose two models for soft error vulnerability prediction: (1) a regression model with rigorous feature selection analysis that estimates correct execution rates, and (2) a novel classification model that predicts the vulnerability level of the target programs. The regression-based model reaches a maximum prediction accuracy of 73.2%, and the classification model achieves an F-score of 89%.
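
The abstract reports an F-score for the classification model without saying how it is computed. As a minimal sketch of the metric, the Python snippet below computes a per-class F-score, F1 = 2TP / (2TP + FP + FN), and its unweighted (macro) average. The level names "low", "medium", and "high", the sample labels, and the choice of macro averaging are assumptions made for illustration; they are not taken from the article.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} from paired true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical vulnerability levels for six programs (illustrative only).
y_true = ["low", "low", "high", "high", "medium", "medium"]
y_pred = ["low", "medium", "high", "high", "medium", "medium"]

print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

A weighted average, which scales each class by its number of true samples, would be an equally plausible reading of the reported figure; the abstract does not specify which variant is used.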

Author information

Corresponding author

Correspondence to Işıl Öz.

Cite this article

Öz, I., Arslan, S. Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning. Int J Parallel Prog 49, 410–439 (2021). https://doi.org/10.1007/s10766-021-00707-0

  • DOI: https://doi.org/10.1007/s10766-021-00707-0
