Abstract
Loop tiling and loop fusion are two essential transformations used by optimizing compilers to enhance the data locality of programs. Existing heuristics either perform tiling and fusion in a fixed order, missing some of their profitable compositions, or resort to ad hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers.
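As a minimal sketch (not taken from the article) of why the two transformations interact, consider a producer loop feeding a consumer loop. Fusing them and tiling the fused loop keeps each tile of the intermediate array in fast memory while it is consumed, instead of streaming the whole array through the cache between the two loops:

```python
def untiled(A):
    """Producer and consumer run as two separate loops over all of B."""
    n = len(A)
    B = [A[i] * 2 for i in range(n)]      # producer loop
    C = [B[i] + 1 for i in range(n)]      # consumer loop
    return C

def fused_tiled(A, T=4):
    """Fused version: produce and consume one tile of B at a time,
    so only T elements of B need to stay in fast memory."""
    n = len(A)
    C = [0] * n
    for t in range(0, n, T):                          # loop over tiles
        hi = min(t + T, n)
        B_tile = [A[i] * 2 for i in range(t, hi)]     # producer, one tile
        for i in range(t, hi):                        # consumer, same tile
            C[i] = B_tile[i - t] + 1
    return C
```

Choosing the tile size T and the fusion structure together, rather than in a fixed order, is exactly the interplay the abstract refers to.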
In this article, we present a strategy called basteln (an abbreviation of backward slicing of tiled loop nests) in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests while preserving their parallelism/tilability and then applies rectangular/parallelogram tiling to the output groups, that is, those that produce data consumed outside the considered program fragment. The memory footprint required by each tile is then computed, from which the upward exposed data are extracted to determine the tile shapes of the remaining fusion groups. This tiling mechanism can construct the complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm that enhances data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computation and the fusion of independent groups, exhibiting general applicability.
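A hedged illustration (an assumed one-dimensional example, not the article's algorithm) of how a tile's memory footprint propagates backward: for a stencil consumer C[i] = B[i-1] + B[i] + B[i+1], a consumer tile covering iterations [lo, hi) reads B over [lo-1, hi+1), so the producer group's tile must be widened by the stencil radius, giving it a non-rectangular, overlapped shape:

```python
def consumer_footprint(lo, hi, radius=1):
    """Half-open interval of B indices read by consumer iterations [lo, hi)
    of a stencil with the given radius."""
    return (lo - radius, hi + radius)

def producer_tile(lo, hi, n, radius=1):
    """Producer iterations needed so the consumer tile [lo, hi) finds all
    of its inputs, clamped to the valid iteration range [0, n)."""
    f_lo, f_hi = consumer_footprint(lo, hi, radius)
    return (max(0, f_lo), min(n, f_hi))
```

For instance, `producer_tile(4, 8, 16)` yields `(3, 9)`: the producer tile extends one iteration beyond the consumer tile on each side, and the one-element overlap between adjacent producer tiles is the redundant computation the strategy accounts for.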
We integrate the basteln strategy into two optimizing compilers: a general-purpose optimizer and a domain-specific compiler for deploying deep learning models. Experiments conducted on CPU, GPU, and a deep learning accelerator demonstrate the effectiveness of the approach across a wide range of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8× over cuBLAS/cuDNN and 1.1× over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.
Index Terms
- Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations