Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs
arXiv - CS - Hardware Architecture. Pub Date: 2024-04-17, DOI: arXiv-2404.11066
Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., AMD/Xilinx Versal ACAP and Intel Stratix 10 NX, employ significantly different architectural approaches. This paper presents novel systematic frameworks to optimize the performance of General Matrix Multiplication (GEMM), a fundamental operation in DL workloads, by exploiting the unique and distinct architectural characteristics of each FPGA. Our evaluation on GEMM workloads for int8 precision shows up to 77 and 68 TOPs (int8) throughput, with up to 0.94 and 1.35 TOPs/W energy efficiency for Versal VC1902 and Stratix 10 NX, respectively. This work provides insights and guidelines for optimizing GEMM-based applications on both platforms, while also delving into their programmability trade-offs and associated challenges.
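To make the workload concrete: GEMM computes C = A x B, and in int8 DL inference the int8 inputs are typically accumulated in a wider integer type. The sketch below is an illustration only (not the paper's framework); the shapes M, K, N are arbitrary, and the operation count 2*M*N*K is the standard convention (one multiply plus one add per MAC) behind the TOPs figures quoted above.

```python
import numpy as np

# Reference int8 GEMM with int32 accumulation, as commonly done for
# int8 DL inference. Shapes are illustrative; the paper's frameworks
# target much larger tiled GEMMs mapped onto the FPGA architectures.
M, K, N = 64, 64, 64
rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(M, K), dtype=np.int8)
B = rng.integers(-128, 128, size=(K, N), dtype=np.int8)

# Widen to int32 before multiplying so partial sums cannot overflow.
C = A.astype(np.int32) @ B.astype(np.int32)

# A GEMM performs 2*M*N*K operations (multiply + add per MAC), so
# throughput in TOPs = 2*M*N*K / (runtime_in_seconds * 1e12).
ops = 2 * M * N * K
print(ops)  # 524288
```

At the reported 77 TOPs, a device sustains 77e12 such operations per second; dividing that rate by board power gives the TOPs/W energy-efficiency figures.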

Updated: 2024-04-18