Abstract
Convolutional neural networks (CNNs) are among the most widely used deep learning techniques, and recent years have seen growing demand for real-time CNN inference on embedded devices with restricted resources. Implementing CNN models on field-programmable gate arrays ensures flexible programmability and speeds up the development process; however, CNN acceleration is hampered by complex computations, limited bandwidth, and scarce on-chip memory. In this paper, a reusable quantized hardware architecture is proposed to accelerate deep CNN models by addressing these issues. Twenty-five processing elements compute the convolutions in the CNN model, and pipelining, loop unrolling, and array partitioning increase the speed of computation in both the convolutional and fully connected layers. The design is evaluated on MNIST handwritten-digit image classification using a low-cost, low-memory Xilinx PYNQ-Z2 system-on-chip edge device. The inference speed of the proposed hardware design is 92.7% higher than an Intel Core3 CPU, 90.7% higher than a Haswell Core2 CPU, 87.7% higher than an NVIDIA Tesla K80 GPU, and 84.9% higher than a conventional hardware accelerator with one processing element. The proposed quantized architecture achieves 4.4 GOP/s without compromising accuracy, twice the performance of the conventional architecture.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Yanamala, R.M.R., Pullakandam, M. A high-speed reusable quantized hardware accelerator design for CNN on constrained edge device. Des Autom Embed Syst 27, 165–189 (2023). https://doi.org/10.1007/s10617-023-09274-8