ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors
ACM Transactions on Architecture and Code Optimization (IF 1.6) | Pub Date: 2024-03-21 | DOI: 10.1145/3653363
Ching-Jui Lee, Tsung Tai Yeh

Systolic array architectures have significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs), each performing multiply-accumulate (MAC) operations. Traditionally, a systolic array processes, in each cycle, an amount of tensor data that matches the array's size. However, the hyper-parameters of a DNN model differ across layers and yield tensors of varying sizes in each layer. Mapping these irregular tensors onto the systolic array while fully utilizing all of its PEs is challenging. Furthermore, modern systolic DNN accelerators typically employ a single dataflow, which is not optimal for every DNN model.
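To make the underutilization problem concrete, the following minimal Python sketch estimates how many PEs of a fixed-size array stay busy when a layer's weight tile is smaller than the array under a weight-stationary mapping. The array size, layer shapes, and the function ws_pe_utilization are illustrative assumptions for this page, not details taken from the paper.

```python
import math

def ws_pe_utilization(k, n, rows=32, cols=32):
    """Estimate PE utilization when a K x N weight tensor is mapped
    weight-stationary onto a rows x cols systolic array.

    Each tile occupies at most min(k, rows) x min(n, cols) PEs; the
    remaining PEs idle for that tile, which is the fragmentation issue
    described above for tiny tensors.
    """
    tiles = math.ceil(k / rows) * math.ceil(n / cols)  # tiles needed to cover the weight
    busy = k * n                                       # PE slots holding a real weight value
    total = tiles * rows * cols                        # PE slots reserved across all tiles
    return busy / total

# A tiny layer leaves most PEs idle, while a well-matched layer fills the array:
print(ws_pe_utilization(k=8, n=12))    # ~0.09 on a 32x32 array
print(ws_pe_utilization(k=64, n=64))   # 1.0 when the tensor matches the array size
```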

This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors onto a spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA datapath controller enables input-stationary, weight-stationary, and output-stationary dataflows to execute on the PEs. ReSA also decomposes the coarse-grained systolic array into multiple small sub-arrays to reduce fragmentation in tensor mapping. Each systolic sub-array unit relies on our data arbiter to dispatch tensors to other sub-arrays through a simple interconnection network. Furthermore, ReSA reorders memory accesses to overlap the memory-load and execution stages, hiding memory latency when tackling tiny tensors. Finally, ReSA splits the tensors of each layer into multiple small tensors and searches for the best dataflow for each tensor on the host side. ReSA then encodes the chosen dataflow in our proposed instruction to notify the systolic array to switch dataflows correctly. As a result, our optimizations on the systolic array architecture achieve a geometric mean speedup of 1.87X over a weight-stationary systolic array architecture across 9 different DNN models.
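The host-side tile splitting and per-tensor dataflow selection can be pictured with a short Python sketch. Everything below is an assumption for illustration only: the sub-array dimension, the toy cost model in est_cycles, and the TileInstr/schedule_layer names are hypothetical stand-ins, not ReSA's actual cost model or instruction format.

```python
from dataclasses import dataclass
from itertools import product

DATAFLOWS = ("input_stationary", "weight_stationary", "output_stationary")

@dataclass
class TileInstr:
    """Illustrative stand-in for a dataflow-switching instruction."""
    sub_array: int   # which sub-array executes this tile
    dataflow: str    # dataflow the sub-array should switch to
    m: int           # tile rows of the output
    k: int           # tile reduction dimension
    n: int           # tile columns of the output

def est_cycles(m, k, n, dataflow, dim=8):
    """Toy analytical cost for one m x k x n tile on a dim x dim sub-array:
    the stationary operand is loaded once, the streamed operands dominate."""
    if dataflow == "weight_stationary":
        return k * n + m        # hold K x N weights, stream M input rows
    if dataflow == "input_stationary":
        return m * k + n        # hold M x K inputs, stream N weight columns
    return m * n + k            # output_stationary: hold outputs, accumulate K terms

def schedule_layer(m, k, n, sub_arrays=4, dim=8):
    """Split a layer's GEMM into dim-sized tiles, pick the cheapest dataflow
    per tile, and round-robin the tiles over the sub-arrays."""
    instrs = []
    offsets = product(range(0, m, dim), range(0, k, dim), range(0, n, dim))
    for i, (mo, ko, no) in enumerate(offsets):
        tm, tk, tn = min(dim, m - mo), min(dim, k - ko), min(dim, n - no)
        best = min(DATAFLOWS, key=lambda df: est_cycles(tm, tk, tn, df, dim))
        instrs.append(TileInstr(sub_array=i % sub_arrays, dataflow=best,
                                m=tm, k=tk, n=tn))
    return instrs

for instr in schedule_layer(m=6, k=20, n=10)[:3]:
    print(instr)
```

Under this kind of scheme, the encoded dataflow choice travels with each tile, so the hardware only has to decode the field and reconfigure the affected sub-array before the tile arrives.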



Updated: 2024-03-22