A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device
Design Automation for Embedded Systems (IF 1.4), Pub Date: 2023-04-26, DOI: 10.1007/s10617-023-09274-8
Rama Muni Reddy Yanamala, Muralidhar Pullakandam

The convolutional neural network (CNN) is a deep learning technique used in many recent applications. Recent years have seen rising demand for real-time CNN implementations on embedded devices with restricted resources. Field-programmable gate arrays are well suited for implementing CNN models because they offer flexible programmability and speed up the development process. However, CNN acceleration is hampered by complex computations, limited bandwidth, and scarce on-chip memory. In this paper, a reusable quantized hardware architecture is proposed to accelerate deep CNN models by addressing these issues. Twenty-five processing elements are employed to compute the convolutions in the CNN model. Pipelining, loop unrolling, and array partitioning are used to increase the speed of computation in both the convolutional and fully connected layers. The design is tested on MNIST handwritten-digit image classification using a low-cost, low-memory Xilinx PYNQ-Z2 system-on-chip edge device. The inference speed of the proposed design is 92.7% higher than an Intel Core3 CPU, 90.7% higher than a Haswell Core2 CPU, 87.7% higher than an NVIDIA Tesla K80 GPU, and 84.9% higher than a conventional hardware accelerator with a single processing element. The proposed quantized architecture achieves 4.4 GOP/s without compromising accuracy, twice the throughput of the conventional architecture.
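The abstract names three standard high-level synthesis optimizations: pipelining, loop unrolling, and array partitioning. As a rough illustration only, the sketch below shows how a quantized convolution might combine these pragmas, assuming the 25 processing elements correspond to a fully unrolled 5x5 multiply-accumulate window; the function name, array sizes, and int8 data types are illustrative assumptions, not details taken from the paper. In a real PYNQ-Z2 flow the pragmas would be interpreted by Vivado/Vitis HLS and the types would typically be Xilinx ap_int/ap_fixed; a standard C++ compiler simply ignores the pragmas.

// Hypothetical HLS-style sketch of a quantized 5x5 convolution.
// All names and dimensions are assumptions for illustration.
#include <cstdint>

constexpr int K = 5;      // kernel size -> K*K = 25 multiply-accumulate PEs (assumed mapping)
constexpr int IN_H = 28;  // MNIST input height (assumption)
constexpr int IN_W = 28;  // MNIST input width (assumption)
constexpr int OUT_H = IN_H - K + 1;
constexpr int OUT_W = IN_W - K + 1;

void conv2d_q8(const int8_t in[IN_H][IN_W],
               const int8_t w[K][K],
               int32_t out[OUT_H][OUT_W]) {
// Partition the weight array so all 25 values are readable in one cycle.
#pragma HLS ARRAY_PARTITION variable=w complete dim=0
  for (int r = 0; r < OUT_H; ++r) {
    for (int c = 0; c < OUT_W; ++c) {
// Pipeline the per-pixel loop: start a new output pixel every cycle (II=1).
#pragma HLS PIPELINE II=1
      int32_t acc = 0;
      for (int i = 0; i < K; ++i) {
// Fully unroll the 5x5 window: 25 parallel 8-bit MACs, one per processing element.
#pragma HLS UNROLL
        for (int j = 0; j < K; ++j) {
#pragma HLS UNROLL
          acc += static_cast<int32_t>(in[r + i][c + j]) * w[i][j];
        }
      }
      out[r][c] = acc;  // accumulate at higher precision to avoid 8-bit overflow
    }
  }
}

Under these assumptions, array partitioning removes the memory-port bottleneck that would otherwise serialize the 25 weight reads, which is what lets the unrolled MACs and the II=1 pipeline actually run in parallel.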




Updated: 2023-04-26