Partitioning-Aware Performance Modeling of Distributed Graph Processing Tasks

International Journal of Parallel Programming

Abstract

Much of the data produced at large scale by modern applications represents connected entities and their relationships, which can be modeled as large graphs. To extract valuable information from these datasets, several parallel and distributed graph processing engines have been proposed. These systems are designed to run on large clusters, where resources must be allocated efficiently. To address this problem, this paper presents a performance prediction model for GPS, a popular Pregel-based graph processing framework. By leveraging a micro-partitioning technique, our system can use various partitioning algorithms that greatly reduce execution time compared with the simple hash partitioning commonly used in graph processing systems. Experimental results show that the prediction model has accuracy close to 90%, allowing it to be used in schedulers or to estimate the cost of running graph processing tasks.
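
The contrast the abstract draws between hash partitioning and micro-partitioning can be illustrated with a small sketch. It is not the paper's implementation and does not use any GPS API. It assumes that micro-partitioning means splitting the graph into many small partitions once and then grouping them onto however many workers are available. Contiguous vertex-id ranges stand in for a locality-aware partitioner, and all function names are hypothetical.

```python
def hash_partition(vertices, num_workers):
    """Simple hash partitioning: worker chosen by vertex id modulo worker count."""
    return {v: v % num_workers for v in vertices}


def micro_partition(vertices, micro_size):
    """Assumed stand-in for a locality-aware partitioner:
    contiguous id ranges of `micro_size` vertices form one micro-partition."""
    return {v: v // micro_size for v in vertices}


def assign_micro_to_workers(micro_of, num_workers):
    """Group consecutive micro-partitions onto workers (hypothetical policy)."""
    num_micro = max(micro_of.values()) + 1
    per_worker = -(-num_micro // num_workers)  # ceiling division
    return {v: m // per_worker for v, m in micro_of.items()}


def edge_cut(edges, worker_of):
    """Count edges whose endpoints live on different workers."""
    return sum(1 for u, v in edges if worker_of[u] != worker_of[v])


# Toy input: a ring of 12 vertices, where neighbours have consecutive ids.
n = 12
vertices = range(n)
edges = [(i, (i + 1) % n) for i in range(n)]

hash_cut = edge_cut(edges, hash_partition(vertices, 3))

# Micro-partitions are computed once and reused for different cluster sizes.
micro_of = micro_partition(vertices, 2)
micro_cut_3 = edge_cut(edges, assign_micro_to_workers(micro_of, 3))
micro_cut_2 = edge_cut(edges, assign_micro_to_workers(micro_of, 2))

print(hash_cut, micro_cut_3, micro_cut_2)
```

On this toy ring, hash partitioning cuts every edge, while the grouped micro-partitions cut only the edges at worker boundaries. Real graphs and real partitioners will behave differently; the sketch only shows why the choice of partitioning affects cross-worker communication.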

Notes

  1. https://aws.amazon.com/ec2/instance-types/.

  2. https://docs.aws.amazon.com/AWSEC2/latest/APIReference/.

Acknowledgements

This study was financed in part by CAPES (http://www.capes.gov.br) and by CNPq (http://cnpq.br).

Author information

Corresponding author

Correspondence to Daniel Presser.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Presser, D., Siqueira, F. Partitioning-Aware Performance Modeling of Distributed Graph Processing Tasks. Int J Parallel Prog 51, 231–255 (2023). https://doi.org/10.1007/s10766-023-00753-w

  • DOI: https://doi.org/10.1007/s10766-023-00753-w
