Abstract
Training feed-forward deep neural networks is a memory-intensive operation that is usually performed on GPUs with limited memory capacity. This may force data scientists to limit the depth of their models or the resolution of the input data whenever the data does not fit in GPU memory. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, allows data scientists to limit the memory required to store intermediate data (activations), at the cost of additional computation.
This paper introduces a new re-materialization strategy for activations that significantly reduces memory usage. The strategy selects which activations are saved and which are deleted during the forward phase, and then recomputes the deleted activations when they are needed during the backward phase.
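As a baseline for what deleting and recomputing activations means in practice, PyTorch already exposes a simple periodic variant of this idea through torch.utils.checkpoint (see the PyTorch contributors reference below). The minimal sketch below only illustrates that baseline mechanism on a small hypothetical chain of layers; it is not the strategy proposed in this paper.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical chain of layers; all sizes are arbitrary example values.
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)
x = torch.randn(64, 1024, requires_grad=True)

# Split the chain into 2 segments: only segment boundaries keep their
# activations during the forward phase; activations inside a segment are
# deleted and recomputed when the backward phase reaches that segment.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```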
We propose an original computation model that combines two types of activation savings: either storing only the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, where the computation time and the memory requirement of each layer may differ. We prove that finding the optimal solution is NP-hard and that classical techniques from the Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search for optimal solutions, no longer holds. We therefore propose a weak memory-persistence property and provide a Dynamic Program to compute the optimal sequence of computations.
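To give a concrete flavour of such a dynamic program, the sketch below solves a deliberately simplified variant of the problem: a linear chain with heterogeneous forward costs, backward costs and activation sizes, a single kind of checkpoint (storing a layer input), and an integer memory budget. All numerical values are hypothetical, and the recurrence ignores several aspects handled by the actual model (notably the second type of activation saving and the memory occupied by gradients); it is an illustration, not the paper's algorithm.

```python
import math
from functools import lru_cache

# Hypothetical per-layer characteristics of a 5-layer chain (arbitrary units).
fw  = [4, 2, 6, 3, 5]   # forward computation cost of each layer
bw  = [5, 3, 7, 4, 6]   # backward computation cost of each layer
act = [2, 1, 3, 1, 2]   # size of the output activation of each layer

@lru_cache(maxsize=None)
def opt(i, j, mem):
    """Minimal compute time to backpropagate through layers i..j, assuming
    the input of layer i is already resident and `mem` memory units are
    available on top of it."""
    if i > j:
        return 0.0
    best = math.inf
    # Option 1: store every activation of the segment and do a single
    # forward + backward sweep (feasible only if they all fit).
    if sum(act[i:j + 1]) <= mem:
        best = sum(fw[i:j + 1]) + sum(bw[i:j + 1])
    # Option 2: replay the forward pass up to some layer k without storing
    # intermediates, checkpoint the input of layer k, backpropagate the
    # right part first, then come back to the left part.
    for k in range(i + 1, j + 1):
        budget_right = mem - act[k - 1]          # the checkpoint stays resident
        if budget_right < 0:
            continue
        cost = (sum(fw[i:k])                     # replay of layers i..k-1
                + opt(k, j, budget_right)        # backward of the right part
                + opt(i, k - 1, mem))            # checkpoint freed, full budget back
        best = min(best, cost)
    return best

# Total compute for one training step of the whole chain under a budget of 5 units.
print(opt(0, len(fw) - 1, 5))
```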
This algorithm is made available through the Rotor software, a PyTorch plug-in that handles any network consisting of a sequence of layers, each of which may have an arbitrarily complex structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes, and batch sizes.
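Rotor's own API is not reproduced here. As a rough illustration of the kind of input such a plug-in operates on, the sketch below rewrites a standard torchvision model as a sequence of layers, each of them a possibly complex block; the exact decomposition is only an assumption made for the example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Rewrite a ResNet as a plain sequence of (possibly complex) blocks.
# Each element of the sequence may itself contain branches and skip
# connections; only the chain structure between elements matters here.
resnet = resnet18().eval()
blocks = list(resnet.children())                 # stem, 4 stages, avgpool, fc
chain = nn.Sequential(*blocks[:-1], nn.Flatten(), blocks[-1])

x = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    # Same function as before, now expressed as a sequential chain of blocks.
    assert torch.allclose(chain(x), resnet(x))
```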
References
- Guillaume Aupy, Julien Herrmann, Paul Hovland, and Yves Robert. 2016. Optimal Multistage Algorithm for Adjoint Computation. SIAM Journal on Scientific Computing 38, 3 (2016), 232–255.
- Olivier Beaumont, Lionel Eyraud-Dubois, and Alena Shilova. 2019a. Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training. (Oct. 2019). Working paper or preprint. https://hal.inria.fr/hal-02316266
- Olivier Beaumont, Lionel Eyraud-Dubois, and Alena Shilova. 2021. Efficient Combination of Rematerialization and Offloading for Training DNNs. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 23844–23857. https://proceedings.neurips.cc/paper/2021/file/c8461bf13fca8a2b9912ab2eb1668e4b-Paper.pdf
- Olivier Beaumont, Julien Herrmann, Guillaume Pallez, and Alena Shilova. 2019b. Optimal Memory-aware Backpropagation of Deep Join Networks. Research Report RR-9273. Inria. https://hal.inria.fr/hal-02131552
- Jose Carranza-Rojas, Herve Goeau, Pierre Bonnet, Erick Mata-Montero, and Alexis Joly. 2017. Going deeper in the automated identification of Herbarium specimens. BMC Evolutionary Biology 17, 1 (2017), 181.
- Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
- Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, and Pradeep Dubey. 2016. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709 (2016).
- Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223–1231.
- Jianwei Feng and Dong Huang. 2018. Optimal Gradient Checkpoint Search for Arbitrary Computation Graphs. arXiv:1808.00079 [cs.LG]
- Yifan Feng, Zizhao Zhang, Xibin Zhao, Rongrong Ji, and Yue Gao. 2018. GVCNN: Group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 264–272.
- Michael R Garey and David S Johnson. 1979. Computers and Intractability. Vol. 174. Freeman, San Francisco.
- Sambit Ghadai, Xian Yeow Lee, Aditya Balu, Soumik Sarkar, and Adarsh Krishnamurthy. 2019. Multi-level 3D CNN for Learning Multi-scale Spatial Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. 2017. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems. 2214–2224.
- Priya Goyal. 2018. PyTorch Memory optimizations via gradient checkpointing. https://github.com/prigoyal/pytorch_memonger
- Andreas Griewank. 1989. On automatic differentiation. Mathematical Programming: Recent Developments and Applications 6, 6 (1989), 83–107.
- Andreas Griewank and Andrea Walther. 2000. Algorithm 799: Revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS) 26, 1 (2000), 19–45.
- Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Vol. 105. SIAM.
- Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems. 4125–4133.
- Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).
- Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6546–6555.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. In Computer Vision – ECCV 2016. Springer International Publishing, 630–645. https://arxiv.org/abs/1603.05027
- HiePACS team. 2019. Rotor. https://gitlab.inria.fr/hiepacs/rotor
- Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
- Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems. 103–112.
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18, 1 (2017), 6869–6898.
- Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
- Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. 2019. Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization. arXiv:1910.02653 [cs.LG]
- Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2020. Dynamic tensor rematerialization. arXiv preprint arXiv:2006.09616 (2020).
- Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, and Joshua Wang. 2019. Efficient rematerialization for deep networks. In Advances in Neural Information Processing Systems. 15146–15155.
- Mitsuru Kusumoto, Takuya Inoue, Gentaro Watanabe, Takuya Akiba, and Masanori Koyama. 2019. A Graph Theoretic Framework of Recomputation Algorithms for Memory-Efficient Backpropagation. arXiv preprint arXiv:1905.11722 (2019).
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
- Johannes Lotz, Uwe Naumann, and Sumit Mitra. 2016a. Mixed Integer Programming for Call Tree Reversal. 83–91. https://doi.org/10.1137/1.9781611974690.ch9
- Johannes Lotz, Uwe Naumann, and Sumit Mitra. 2016b. Mixed integer programming for call tree reversal. In 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing. SIAM, 83–91.
- Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
- Uwe Naumann. 2008. Call Tree Reversal is NP-Complete. In Advances in Automatic Differentiation, Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 13–22.
- Uwe Naumann. 2009. DAG reversal is NP-complete. Journal of Discrete Algorithms 7, 4 (2009), 402–410.
- Geoff Pleiss, Danlu Chen, Gao Huang, Tongcheng Li, Laurens van der Maaten, and Kilian Q Weinberger. 2017. Memory-efficient implementation of DenseNets. arXiv preprint arXiv:1707.06990 (2017).
- PyTorch contributors. 2018. Periodic Checkpointing in PyTorch. https://pytorch.org/docs/stable/checkpoint.html
- Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision. Springer, 525–542.
- Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 18.
- Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. 2018. In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5639–5647.
- Shriram S B, Anshuj Garg, and Purushottam Kulkarni. 2019. Dynamic Memory Management for GPU-based training of Deep Neural Networks. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE Press.
- Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. 2017. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5734–5743.
- Jeffrey Mark Siskind and Barak A. Pearlmutter. 2018. Divide-and-conquer checkpointing for arbitrary programs with no user annotation. Optimization Methods and Software 33, 4-6 (2018), 1288–1330. https://doi.org/10.1080/10556788.2018.1459621
- Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. 2015. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision. 945–953.
- Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
- Marian Verhelst and Bert Moons. 2017. Embedded deep neural network processing: Algorithmic and processor techniques bring deep learning to IoT and edge devices. IEEE Solid-State Circuits Magazine 9, 4 (2017), 55–65.
- Andrea Walther and Andreas Griewank. 2004. Advantages of binomial checkpointing for memory-reduced adjoint calculations. In Numerical Mathematics and Advanced Applications. Springer, 834–843.
- Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6848–6856.