Abstract
Convolutional neural networks (CNNs) are among the most widely used deep learning techniques, and recent years have seen growing demand for real-time CNN inference on embedded devices with restricted resources. Implementing CNN models on field-programmable gate arrays ensures flexible programmability and speeds up the development process; however, CNN acceleration is hampered by complex computations, limited bandwidth, and scarce on-chip memory. In this paper, a reusable quantized hardware architecture is proposed to accelerate deep CNN models by addressing these issues. Twenty-five processing elements compute the convolutions in the CNN model, and pipelining, loop unrolling, and array partitioning increase the speed of computation in both the convolutional and fully connected layers. The design is evaluated on MNIST handwritten-digit image classification using a low-cost, low-memory Xilinx PYNQ-Z2 system-on-chip edge device. The inference speed of the proposed hardware design is 92.7% higher than an Intel Core3 CPU, 90.7% higher than a Haswell Core2 CPU, 87.7% higher than an NVIDIA Tesla K80 GPU, and 84.9% higher than a conventional hardware accelerator with one processing element. The proposed quantized architecture achieves 4.4 GOP/s without compromising accuracy, twice the performance of the conventional architecture.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Yanamala, R.M.R., Pullakandam, M. A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device. Des Autom Embed Syst 27, 165–189 (2023). https://doi.org/10.1007/s10617-023-09274-8