DGEMM on integer matrix multiplication unit
The International Journal of High Performance Computing Applications (IF 3.1), Pub Date: 2024-03-16, DOI: 10.1177/10943420241239588
Hiroyuki Ootomo, Katsuhisa Ozaki, Rio Yokota

Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMU). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on computing double-precision-equivalent matrix multiplication using the Ozaki scheme, which computes a high-precision matrix multiplication by using lower-precision computing units, and show the advantages and disadvantages of using IMMUs. The experiment using integer Tensor Cores shows that we can compute double-precision matrix multiplication faster than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum circuit simulation by up to 4.85× while maintaining FP64 accuracy.
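The core idea of the Ozaki scheme is to split each FP64 operand into a sum of low-bit integer slices, multiply the slices exactly on an integer unit, and recombine the partial products in FP64. The sketch below illustrates that idea in NumPy; the slicing strategy, parameter values, and function names are illustrative assumptions rather than the authors' GPU implementation, and an int64 matrix product stands in for the integer Tensor Core (a real IMMU kernel would keep slices within int8 and accumulate in int32, bounding the inner dimension to avoid overflow).

# Minimal, self-contained sketch of an Ozaki-style split-and-recombine GEMM.
# Hypothetical helper names; not the paper's implementation.
import numpy as np

def split_into_int_slices(M, num_slices, bits, axis):
    """Split an FP64 matrix into `num_slices` integer slices of roughly `bits`
    bits each, scaled per row (axis=1) or per column (axis=0)."""
    amax = np.max(np.abs(M), axis=axis, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)
    exp = np.ceil(np.log2(amax))                 # per-row / per-column exponent
    slices, units = [], []
    R = M.astype(np.float64)                     # working remainder (copy)
    for i in range(num_slices):
        unit = 2.0 ** (exp - bits * (i + 1))     # value of one integer step in this slice
        S = np.floor(R / unit)                   # integer-valued slice
        R = R - S * unit                         # remainder carried to the next slice
        slices.append(S.astype(np.int64))
        units.append(unit)
    return slices, units

def ozaki_style_matmul(A, B, num_slices=6, bits=8):
    """Approximate FP64 GEMM by accumulating exact integer slice products."""
    A_sl, A_un = split_into_int_slices(A, num_slices, bits, axis=1)  # per-row scaling
    B_sl, B_un = split_into_int_slices(B, num_slices, bits, axis=0)  # per-column scaling
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(num_slices):
        for j in range(num_slices):
            P = A_sl[i] @ B_sl[j]                # exact integer matmul (the IMMU's job)
            C += P.astype(np.float64) * (A_un[i] * B_un[j])
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((128, 64))
    B = rng.standard_normal((64, 96))
    C_ref = A @ B
    C_ozk = ozaki_style_matmul(A, B)
    print("normwise error:", np.max(np.abs(C_ozk - C_ref)) / np.max(np.abs(C_ref)))

Increasing num_slices captures more mantissa bits of each operand, so the error shrinks toward FP64 accuracy at the cost of more integer matrix products; this accuracy/cost trade-off is the knob the scheme exposes.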
