A Methodology for Efficient Tile Size Selection for Affine Loop Kernels

Kelefouras, Vasilios; Djemame, Karim; Keramidas, Georgios; Voros, Nikolaos

doi:10.1007/s10766-022-00734-5

Published: 23 May 2022

International Journal of Parallel Programming, Volume 50, pages 405–432 (2022)

Vasilios Kelefouras (ORCID: orcid.org/0000-0001-9591-913X)¹, Karim Djemame², Georgios Keramidas³ & Nikolaos Voros⁴

Abstract

Reducing the number of data accesses in the memory hierarchy is of paramount importance on modern computer systems. One of the key optimizations addressing this problem is loop tiling, a well-known loop transformation that enhances data locality in the memory hierarchy. The selection of an appropriate tile size is tackled using both static (analytical) and dynamic (empirical, auto-tuning) methods. Current analytical models are not accurate enough to model complex modern memory hierarchies and loop kernels with diverse characteristics, while auto-tuning methods are either too time-consuming (due to the huge search space) or less accurate (when heuristics are used to reduce the search space). In this paper, we reveal two important inefficiencies of current analytical loop tiling methods and provide the theoretical background on how current methods can address them. To this end, we propose a new loop tiling method for affine loop kernels in which the cache size, cache line size and cache associativity are better utilized than in existing methods. Our evaluation results demonstrate the efficiency of the proposed method in terms of cache misses and execution time, against related works, the icc/gcc compilers and the Pluto tool, on x86 and ARM-based platforms.
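To make the transformation concrete, the following minimal sketch tiles all three loops of a matrix-matrix multiplication and checks that the result matches the untiled loop nest. It is only an illustration of loop tiling in general, not the tile size selection method proposed in the paper: the function names and the tile size of 3 are hypothetical, hand-picked for the example, and pure Python is used to show the iteration order rather than to measure cache behaviour.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: C = A * B for n x n matrices stored as lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation with the i, j and k loops tiled by `tile`.

    `tile` is a hypothetical, hand-picked parameter; choosing it well is
    the problem the paper addresses and is not reproduced here.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # min() clips tiles that would run past the matrix edge.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


# Small integer inputs so the comparison is exact; n is deliberately not
# a multiple of the tile size to exercise the edge handling.
n = 7
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
same = matmul_tiled(a, b, n, tile=3) == matmul_naive(a, b, n)
print(same)
```

Within one (ii, jj, kk) iteration the inner loops touch only a tile-sized block of each array, which is what gives tiling its data reuse; whether those blocks actually stay in cache depends on the cache size, line size and associativity that the abstract refers to.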

Acknowledgements

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 957210 - XANDAR: X-by-Construction Design framework for Engineering Autonomous & Distributed Real-time Embedded Software Systems.

Author information

Authors and Affiliations

University of Plymouth, Plymouth, UK
Vasilios Kelefouras
School of Computing, University of Leeds, Leeds, UK
Karim Djemame
School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Georgios Keramidas
Electrical & Computer Engineering Department, University of Peloponnese, Patras, Greece
Nikolaos Voros

Corresponding author

Correspondence to Vasilios Kelefouras.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kelefouras, V., Djemame, K., Keramidas, G. et al. A Methodology for Efficient Tile Size Selection for Affine Loop Kernels. Int J Parallel Prog 50, 405–432 (2022). https://doi.org/10.1007/s10766-022-00734-5

Received: 30 July 2021
Accepted: 30 April 2022
Published: 23 May 2022
Issue Date: August 2022
DOI: https://doi.org/10.1007/s10766-022-00734-5
