A Methodology for Efficient Tile Size Selection for Affine Loop Kernels

International Journal of Parallel Programming

Abstract

Reducing the number of data accesses in the memory hierarchy is of paramount importance on modern computer systems. One of the key optimizations addressing this problem is loop tiling, a well-known loop transformation that enhances data locality in the memory hierarchy. The selection of an appropriate tile size is tackled by using both static (analytical) and dynamic, empirical (auto-tuning) methods. Current analytical models are not accurate enough to effectively model complex modern memory hierarchies and loop kernels with diverse characteristics, while auto-tuning methods are either too time-consuming (due to the huge search space) or less accurate (when heuristics are used to reduce the search space). In this paper, we reveal two important inefficiencies of current analytical loop tiling methods and we provide the theoretical background on how current methods can address these inefficiencies. To this end, we propose a new loop tiling method for affine loop kernels in which the cache size, cache line size and cache associativity are better utilized than in existing methods. Our evaluation demonstrates the efficiency of the proposed method in terms of cache misses and execution time, against related works, the icc/gcc compilers and the Pluto tool, on x86 and ARM-based platforms.
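For readers unfamiliar with the transformation, the following is a minimal sketch of loop tiling applied to a matrix-matrix multiplication kernel. It is a generic textbook illustration and not the tile size selection method proposed in the paper. The function names `matmul_naive` and `matmul_tiled` and the single square `tile` parameter are assumptions made for this example; Python is used only to show the loop structure, since the locality benefit itself would be measured on compiled code.

```python
def matmul_naive(A, B, n):
    """Reference kernel: plain triple loop over n x n matrices (lists of lists)."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, n, tile):
    """Tiled kernel: same arithmetic, iterated tile by tile.

    `tile` is a hypothetical tile edge length chosen by the caller; how to
    choose it well is exactly the tile size selection problem.
    """
    C = [[0.0] * n for _ in range(n)]
    # The three outer loops step over tiles of the iteration space.
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # The inner loops stay inside one tile, so the sub-blocks of
                # A, B and C they touch are reused before moving on.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Both functions accumulate each `C[i][j]` in the same order of `k`, so they return identical results; only the order in which array elements are visited changes. The `min(...)` bounds handle matrix sizes that are not a multiple of the tile size.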

Acknowledgements

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 957210 - XANDAR: X-by-Construction Design framework for Engineering Autonomous & Distributed Real-time Embedded Software Systems.

Author information

Corresponding author

Correspondence to Vasilios Kelefouras.

About this article

Cite this article

Kelefouras, V., Djemame, K., Keramidas, G. et al. A Methodology for Efficient Tile Size Selection for Affine Loop Kernels. Int J Parallel Prog 50, 405–432 (2022). https://doi.org/10.1007/s10766-022-00734-5

  • DOI: https://doi.org/10.1007/s10766-022-00734-5
