Abstract
As artificial intelligence techniques spread throughout the economy, researchers are exploring new ways to reduce the energy consumption of Neural Network (NN) applications, especially as NN complexity continues to increase. Using analog Resistive RAM (ReRAM) devices to compute matrix-vector multiplication in O(1) time complexity is a promising approach, but such implementations often fail to cover the diversity of non-linearities required by modern NN applications. In this work, we propose a novel approach in which the ReRAM devices themselves are reprogrammed to compute not only the required matrix multiplications but also the activation functions, Softmax, and pooling layers, reducing energy consumption in complex NNs. Compared to custom logic, this approach offers greater versatility for exploring novel NN layouts. Results show that our device outperforms analog and digital field-programmable approaches by up to 8.5× in experiments on real-world human activity recognition and language modeling datasets with convolutional neural network, generative pre-trained Transformer, and long short-term memory models.
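As a point of reference, the sketch below illustrates the standard analog-crossbar principle the abstract relies on: weights stored as ReRAM conductances produce every output current of a matrix-vector product in a single read step, via Ohm's law and Kirchhoff's current law. This is an idealized illustration under our own assumptions (the names `G`, `V`, and `I` and the value ranges are ours), not the authors' device, circuit, or programming scheme.

```python
import numpy as np

# Idealized ReRAM crossbar model (illustrative assumption, not the paper's circuit):
# each weight is stored as a conductance G[i, j]; applying read voltages V[i] to the
# rows makes every column j sum its currents G[i, j] * V[i] in parallel, so the full
# matrix-vector product I = G.T @ V is available after one read step.

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))  # conductances in Siemens, one per weight
V = rng.uniform(0.0, 0.2, size=4)         # read voltages applied to the rows

I = G.T @ V                               # output currents on the column lines
print(I)
```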
Index Terms
- Reprogrammable Non-Linear Circuits Using ReRAM for NN Accelerators