Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs, ACM Transactions on Architecture and Code Optimization



ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2024-01-19, DOI: 10.1145/3632956
Xueying Wang₁, Guangli Li₂, Zhen Jia₃, Xiaobing Feng₂, Yida Wang₃


Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and is widely supported on modern hardware. Nevertheless, it has rarely been applied to fast convolution algorithms such as Winograd because of numerical issues. In this article, we propose LoWino, an effective quantized Winograd convolution that applies an in-side quantization method in the Winograd domain to reduce the precision loss caused by the transformations. We also present an efficient implementation that integrates well-designed optimization techniques to fully exploit the low-precision computation capabilities of modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results show that our approach achieves average operator speedups of 1.84× and 1.91× over state-of-the-art implementations in the vendor library while keeping accuracy loss at a reasonable level.
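To make the idea of quantizing in the Winograd domain concrete, the following is a minimal sketch and not the LoWino implementation. It assumes a 1-D F(2,3) Winograd tile, symmetric per-tensor int8 scaling, and hypothetical function names chosen for illustration. The input and filter are transformed in floating point first, the transformed values are quantized, the element-wise product is taken on integers, and the result is rescaled before the output transform.

```python
def filter_transform(g):
    """G * g for F(2,3): maps a 3-tap filter to 4 Winograd-domain values."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2.0, (g0 - g1 + g2) / 2.0, g2]


def input_transform(d):
    """B^T * d for F(2,3): maps a 4-element input tile to 4 Winograd-domain values."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def output_transform(m):
    """A^T * m for F(2,3): maps 4 element-wise products to 2 outputs."""
    m0, m1, m2, m3 = m
    return [m0 + m1 + m2, m1 - m2 - m3]


def quantize(values, bits=8):
    """Symmetric per-tensor quantization (an assumption of this sketch)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def direct_conv(d, g):
    """Reference: two outputs of a 3-tap sliding dot product over 4 inputs."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_float(d, g):
    """Full-precision Winograd F(2,3)."""
    u, v = filter_transform(g), input_transform(d)
    return output_transform([a * b for a, b in zip(u, v)])


def winograd_quantized(d, g):
    """Winograd F(2,3) with quantization applied after the transforms."""
    qu, su = quantize(filter_transform(g))
    qv, sv = quantize(input_transform(d))
    # Integer multiply, then rescale back to real values.
    m = [(a * b) * (su * sv) for a, b in zip(qu, qv)]
    return output_transform(m)


d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 0.25]
reference = direct_conv(d, g)
exact = winograd_float(d, g)
approx = winograd_quantized(d, g)
```

Comparing `approx` with `reference` gives a feel for the error introduced by quantizing the transformed values; a real kernel would operate on 2-D tiles and batched integer matrix multiplications rather than this scalar loop.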


Updated: 2024-01-21
