Abstract
Loop tiling and loop fusion are two essential transformations used by optimizing compilers to enhance the data locality of programs. Existing heuristics either perform tiling and fusion in a fixed order, missing some of their profitable compositions, or resort to ad hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers.
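As a minimal sketch (not taken from the article) of why the two transformations interact, consider a producer loop feeding a consumer loop. Fusing them and tiling the fused loop keeps each tile of the intermediate array in fast memory while it is consumed, instead of streaming the whole array through the cache between the two loops:

```python
def untiled(A):
    """Producer and consumer run as two separate loops over all of B."""
    n = len(A)
    B = [A[i] * 2 for i in range(n)]      # producer loop
    C = [B[i] + 1 for i in range(n)]      # consumer loop
    return C

def fused_tiled(A, T=4):
    """Fused version: produce and consume one tile of B at a time,
    so only T elements of B need to stay in fast memory."""
    n = len(A)
    C = [0] * n
    for t in range(0, n, T):                          # loop over tiles
        hi = min(t + T, n)
        B_tile = [A[i] * 2 for i in range(t, hi)]     # producer, one tile
        for i in range(t, hi):                        # consumer, same tile
            C[i] = B_tile[i - t] + 1
    return C
```

Choosing the tile size T and the fusion structure together, rather than in a fixed order, is exactly the interplay the abstract refers to.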
In this article, we present a strategy called basteln (an abbreviation of backward slicing of tiled loop nests) in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests while preserving their parallelism/tilability and then applies rectangular/parallelogram tiling to the output groups, that is, those that produce data consumed outside the considered program fragment. The memory footprint required by each tile is then computed, from which the upward exposed data are extracted to determine the tile shapes of the remaining fusion groups. This tiling mechanism can construct the complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm that enhances data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computation and the fusion of independent groups, exhibiting general applicability.
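A hedged illustration (an assumed one-dimensional example, not the article's algorithm) of how a tile's memory footprint propagates backward: for a stencil consumer C[i] = B[i-1] + B[i] + B[i+1], a consumer tile covering iterations [lo, hi) reads B over [lo-1, hi+1), so the producer group's tile must be widened by the stencil radius, giving it a non-rectangular, overlapped shape:

```python
def consumer_footprint(lo, hi, radius=1):
    """Half-open interval of B indices read by consumer iterations [lo, hi)
    of a stencil with the given radius."""
    return (lo - radius, hi + radius)

def producer_tile(lo, hi, n, radius=1):
    """Producer iterations needed so the consumer tile [lo, hi) finds all
    of its inputs, clamped to the valid iteration range [0, n)."""
    f_lo, f_hi = consumer_footprint(lo, hi, radius)
    return (max(0, f_lo), min(n, f_hi))
```

For instance, `producer_tile(4, 8, 16)` yields `(3, 9)`: the producer tile extends one iteration beyond the consumer tile on each side, and the one-element overlap between adjacent producer tiles is the redundant computation the strategy accounts for.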
We integrate the basteln strategy into two optimizing compilers: a general-purpose optimizer and a domain-specific compiler for deploying deep learning models. Experiments conducted on CPU, GPU, and a deep learning accelerator demonstrate the effectiveness of the approach across a wide range of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8× over cuBLAS/cuDNN and 1.1× over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.
Index Terms
- Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations