Accelerating Sparse DNNs Based on Tiled GEMM
IEEE Transactions on Computers (IF 3.7), Pub Date: 2024-02-14, DOI: 10.1109/tc.2024.3365942
Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo

Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models often produce randomly distributed weights to maintain accuracy, leading to irregular computation. Consequently, unstructured sparse models cannot achieve meaningful speedup on commodity hardware built for dense matrix computation. Accelerators are therefore usually modified or designed with architectures optimized for structured sparsity. For example, the Ampere architecture introduces a sparse tensor core that adopts the 2:4 sparsity pattern. We propose a pruning method built on the insight that matrix multiplication generally breaks a large matrix into multiple smaller tiles for parallel execution. We present the “tile-wise” sparsity pattern, which maintains a structured sparsity pattern at the tile level for efficient execution while allowing irregular pruning at the global scale to preserve high accuracy. In addition, tile-wise sparsity is implemented at the global-memory level, whereas 2:4 sparsity executes at the register level inside the sparse tensor core, so the two patterns can be combined into a “tile-vector-wise” (TVW) sparsity pattern that exploits finer-grained sparsity and further accelerates sparse DNN models. We evaluate TVW on the GPU, achieving average speedups of 1.85×, 2.75×, and 22.18× over the dense model, block sparsity, and unstructured sparsity, respectively.
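As a rough, illustrative sketch of the two sparsity patterns described in the abstract (not the authors' implementation; the function names, tile size, magnitude-based pruning criterion, and the assumption that the column count is a multiple of 4 are all choices made here for illustration), the following Python/NumPy code prunes a weight matrix tile by tile and then applies a 2:4 pattern within each group of four consecutive weights:

import numpy as np

def prune_tile_wise(W, tile_cols=64, keep_ratio=0.5):
    # Within each column tile, zero out whole columns with the smallest
    # L1 norm: every tile keeps the same regular per-tile structure,
    # while the surviving columns differ from tile to tile (irregular globally).
    W = W.copy()
    for start in range(0, W.shape[1], tile_cols):
        tile = W[:, start:start + tile_cols]          # view into W
        scores = np.abs(tile).sum(axis=0)             # per-column importance
        n_drop = tile.shape[1] - int(tile.shape[1] * keep_ratio)
        tile[:, np.argsort(scores)[:n_drop]] = 0.0    # prune least-important columns
    return W

def prune_2_to_4(W):
    # In every group of 4 consecutive weights along a row, zero the 2 entries
    # with the smallest magnitude (the 2:4 pattern of the sparse tensor core).
    # Assumes the number of columns is divisible by 4.
    W = W.copy()
    rows, cols = W.shape
    groups = W.reshape(rows, cols // 4, 4)
    smallest = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, smallest, 0.0, axis=-1)
    return groups.reshape(rows, cols)

# Combining the two patterns, in the spirit of "tile-vector-wise" sparsity:
W = np.random.randn(128, 256).astype(np.float32)
W_tvw = prune_2_to_4(prune_tile_wise(W, tile_cols=64, keep_ratio=0.5))

The point of the per-tile regularity is that each GEMM tile can then be executed as a smaller, regularly structured multiplication, which is what makes the pattern hardware-friendly while the pruning decisions remain irregular across the whole matrix.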
