
A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device

Published in: Design Automation for Embedded Systems

Abstract

The convolutional neural network (CNN) is one of the most widely used deep learning techniques across many applications. Recent years have seen rising demand for real-time CNN implementations on embedded devices with restricted resources. Field-programmable gate arrays (FPGAs) are well suited to such implementations because they offer flexible programmability and shorten the development cycle. However, CNN acceleration is hampered by complex computations, limited bandwidth, and scarce on-chip memory. This paper proposes a reusable quantized hardware architecture that accelerates deep CNN models by addressing these issues. Twenty-five processing elements are employed to compute the convolutions in the CNN model, and pipelining, loop unrolling, and array partitioning are applied to speed up computation in both the convolution layers and the fully connected layers. The design is evaluated on MNIST handwritten digit image classification using a low-cost, low-memory Xilinx PYNQ-Z2 system-on-chip edge device. The inference speed of the proposed hardware design is 92.7% higher than an Intel Core i3 CPU, 90.7% higher than a Haswell Core2 CPU, 87.7% higher than an NVIDIA Tesla K80 GPU, and 84.9% higher than a conventional hardware accelerator with one processing element. The proposed quantized architecture achieves 4.4 GOP/s without compromising accuracy, twice the throughput of the conventional architecture.
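
To make the named optimizations concrete, the sketch below shows a single quantized 3x3 convolution written in Vivado-HLS-style C++, with pipelining, loop unrolling, and array partitioning expressed as HLS pragmas. The kernel size, feature-map dimensions, 8-bit integer quantization, and all identifiers are illustrative assumptions for exposition, not the authors' exact design.

    // Hypothetical sketch: one quantized convolution tile with the three
    // optimizations named in the abstract. Sizes and names are assumptions.
    #include <cstdint>

    constexpr int K   = 3;            // kernel size (assumed)
    constexpr int IN  = 28;           // input feature-map side (MNIST-sized, assumed)
    constexpr int OUT = IN - K + 1;   // output feature-map side, no padding

    void conv2d_quantized(const int8_t in[IN][IN],
                          const int8_t w[K][K],
                          int32_t out[OUT][OUT]) {
    // Partition the weight array so all K*K multiplies can issue in parallel.
    #pragma HLS ARRAY_PARTITION variable=w complete dim=0
    ROW:
        for (int r = 0; r < OUT; ++r) {
        COL:
            for (int c = 0; c < OUT; ++c) {
    // Pipeline the output-pixel loop: one result per clock once the pipe fills.
    #pragma HLS PIPELINE II=1
                int32_t acc = 0;
            KY:
                for (int i = 0; i < K; ++i) {
                KX:
                    for (int j = 0; j < K; ++j) {
    // Fully unroll the kernel loops; with the partitioned weights this maps to
    // K*K parallel multiply-accumulate units inside one processing element.
    #pragma HLS UNROLL
                        acc += static_cast<int32_t>(in[r + i][c + j]) * w[i][j];
                    }
                }
                out[r][c] = acc;  // 32-bit accumulation; rescale/requantize downstream
            }
        }
    }

In this style of design, the quantized 8-bit operands keep the multipliers small enough to replicate across many processing elements, while the pipeline pragma keeps each element busy every cycle.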


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.


Author information

Corresponding author

Correspondence to Rama Muni Reddy Yanamala.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yanamala, R.M.R., Pullakandam, M. A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device. Des Autom Embed Syst 27, 165–189 (2023). https://doi.org/10.1007/s10617-023-09274-8

