Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM
ACM Transactions on Mathematical Software (IF 2.7), Pub Date: 2024-03-16, DOI: 10.1145/3638532
Guillermo Alaejos 1 , Adrián Castelló 1 , Pedro Alonso-Jordá 1 , Francisco D. Igual 2 , Héctor Martínez 3 , Enrique S. Quintana-Ortí 1

We explore the use of the Apache TVM open-source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS, and OpenBLAS, to obtain high-performance blocked formulations of the general matrix multiplication (gemm). In addition, we fully automate the generation process by also leveraging Apache TVM to derive a complete variety of processor-specific micro-kernels for gemm. This contrasts with the convention in high-performance libraries, which hand-encode a single micro-kernel per architecture in assembly code. Overall, the combination of our TVM-generated blocked algorithms and micro-kernels for gemm (1) improves portability and maintainability and streamlines the software life cycle; (2) provides high flexibility to tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par with (or, for specific matrix shapes, even superior to) that of hand-tuned libraries; and (3) features a small memory footprint.
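For readers unfamiliar with the blocked formulation the abstract refers to, the sketch below illustrates the general GotoBLAS/BLIS-style structure: five loops that partition the operands into blocks, wrapped around a small micro-kernel that updates one tile of C. It is a minimal pure-Python illustration, not the paper's TVM generator. The function names (`blocked_gemm`, `micro_kernel`) and the block sizes (`mc`, `nc`, `kc`, `mr`, `nr`) are assumptions chosen for readability, and the packing of blocks of A and B into contiguous buffers that real libraries perform is omitted.

```python
# Illustrative sketch only: a GotoBLAS/BLIS-style blocked GEMM in pure Python.
# Block sizes (mc, nc, kc) and micro-tile sizes (mr, nr) are hypothetical
# placeholders, not values taken from the paper. Packing is omitted.

def micro_kernel(A, B, C, i0, j0, p0, mr, nr, kc):
    """Update the mr x nr tile of C at (i0, j0) using kc rank-1 updates."""
    for p in range(p0, p0 + kc):
        row_b = B[p]
        for i in range(i0, i0 + mr):
            a = A[i][p]
            row_c = C[i]
            for j in range(j0, j0 + nr):
                row_c[j] += a * row_b[j]


def blocked_gemm(A, B, C, mc=4, nc=4, kc=4, mr=2, nr=2):
    """Compute C += A * B with five blocking loops around a micro-kernel."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                  # loop 1: column panels of B and C
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):              # loop 2: slices of the k dimension
            kb = min(kc, k - pc)
            for ic in range(0, m, mc):          # loop 3: row blocks of A and C
                mb = min(mc, m - ic)
                for jr in range(jc, jc + nb, nr):       # loop 4: micro-panels of B
                    nr_eff = min(nr, jc + nb - jr)
                    for ir in range(ic, ic + mb, mr):   # loop 5: micro-panels of A
                        mr_eff = min(mr, ic + mb - ir)
                        micro_kernel(A, B, C, ir, jr, pc, mr_eff, nr_eff, kb)
    return C
```

In hand-tuned libraries the micro-kernel is the part typically written in assembly for each architecture; the abstract's point is that both the outer loops and this inner routine can instead be generated automatically, which makes it practical to try several micro-tile shapes rather than a single fixed one.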




Updated: 2024-03-16