
Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations

Published: 15 January 2024

Abstract

Loop tiling and fusion are two essential transformations in optimizing compilers for enhancing the data locality of programs. Existing heuristics either perform loop tiling and fusion in a fixed order, missing some of their profitable compositions, or resort to ad-hoc implementations tailored to domain-specific applications, calling for a generalized and systematic solution in optimizing compilers.

In this article, we present a so-called basteln (an abbreviation for backward slicing of tiled loop nests) strategy in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests by preserving their parallelism/tilability and then applies rectangular/parallelogram tiling to the output groups, i.e., those that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upward exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct the complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm that enhances data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computation and the fusion of independent groups, exhibiting general applicability.
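As background for readers less familiar with these two transformations, the following is a minimal, self-contained sketch (not the paper's basteln algorithm; all names are hypothetical) contrasting a naive two-stage producer/consumer pipeline with a fused and tiled version, where the producer's results are consumed within the same tile to improve temporal locality:

```python
# Toy two-stage pipeline: B[i] = A[i] * 2 (producer), C[i] = B[i] + 1 (consumer).

def pipeline_naive(A):
    """Separate loops: all of B is materialized before C is computed."""
    n = len(A)
    B = [A[i] * 2 for i in range(n)]   # producer loop over the full array
    C = [B[i] + 1 for i in range(n)]   # consumer loop, revisits B after it went cold
    return C

def pipeline_fused_tiled(A, tile=4):
    """Fused and tiled: both statements run within each tile, so only a
    tile-sized slice of the intermediate B is live at any time and its
    values are still cache-resident when the consumer reads them."""
    n = len(A)
    C = [0] * n
    for t in range(0, n, tile):                     # loop over tiles
        hi = min(t + tile, n)                       # handle a partial last tile
        B_tile = [A[i] * 2 for i in range(t, hi)]   # producer, one tile's worth
        for i in range(t, hi):                      # consumer, same tile
            C[i] = B_tile[i - t] + 1
    return C
```

The point of the sketch is the interplay the abstract describes: the tile shape chosen for the consumer determines which slice of the producer's output (the upward exposed data) must be available per tile, which in turn constrains how the two loop nests can be fused.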

We integrate the basteln strategy into two optimizing compilers, with one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8× over cuBLAS/cuDNN and 1.1× over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.



Published in
ACM Transactions on Computer Systems, Volume 41, Issue 1-4 (November 2023), 188 pages.
ISSN: 0734-2071; EISSN: 1557-7333
DOI: 10.1145/3637801
Editor: Michael Swift


Publisher
Association for Computing Machinery, New York, NY, United States

Publication History
• Received: 21 March 2022
• Revised: 11 April 2023
• Accepted: 1 November 2023
• Online AM: 1 December 2023
• Published: 15 January 2024
