Research Article | Free Access | Just Accepted

Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory

Online AM: 5 March 2024

Abstract

Training Feed-Forward Deep Neural Networks is a memory-intensive operation that is usually performed on GPUs with limited memory capacity. When the data do not fit in GPU memory, data scientists may be forced to limit the depth of their models or the resolution of the input data. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, limits the memory requirements related to the storage of intermediate data (activations), at the cost of an increase in computation.

This paper introduces a new strategy for the re-materialization of activations that significantly reduces memory usage. It consists in selecting which activations are saved and which are discarded during the forward phase, and then recomputing the discarded activations when they are needed during the backward phase.
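PyTorch already ships a simpler, periodic form of this mechanism, torch.utils.checkpoint.checkpoint_sequential, which splits a chain into equal segments and keeps only the segment boundaries during the forward pass. It illustrates the save/discard/recompute pattern described above, although it does not choose the checkpoints optimally. A minimal sketch follows; the toy model and tensor shapes are illustrative only.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy heterogeneous chain: layers with very different compute and memory profiles.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 10),
)

x = torch.randn(8, 3, 32, 32, requires_grad=True)

# Split the chain into 2 segments: only the activations at the segment
# boundaries are kept during the forward pass; everything inside a segment
# is discarded and recomputed on the fly during the backward pass.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```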

We propose an original computation model that combines two types of activation savings: either storing only the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, in which the computation time and the memory requirement of each layer are different. We prove that finding the optimal solution is NP-hard and that classical techniques from the Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search for optimal solutions, no longer holds. We therefore propose a weak memory persistence property and provide a Dynamic Program that computes the optimal sequence of computations.
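For intuition, the sketch below is a much-simplified dynamic program of the same flavor: it computes the minimal amount of recomputation needed to backpropagate through a chain with heterogeneous per-layer forward costs, but under a deliberately simplified memory model in which every stored activation occupies one identical "slot". It is not the paper's algorithm, which additionally handles heterogeneous activation sizes, the two kinds of activation savings, and the weak memory persistence property; all names below are illustrative.

```python
from functools import lru_cache

def chain_rematerialization_cost(fwd_cost, n_slots):
    """Minimal extra forward work needed to backpropagate a chain of
    len(fwd_cost) layers when at most `n_slots` activations can be
    checkpointed simultaneously (slot-based simplification: every
    activation is assumed to occupy exactly one slot)."""
    n = len(fwd_cost)
    # prefix[k] = cost of running the forward pass of layers 0..k-1
    prefix = [0.0]
    for c in fwd_cost:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def opt(i, j, m):
        # Cheapest recomputation schedule for layers i..j-1, assuming the
        # input of layer i is already stored and m free slots remain.
        if j - i <= 1:
            return 0.0              # input already stored: no recomputation
        if m == 0:
            return float("inf")     # nowhere to place a checkpoint
        best = float("inf")
        for k in range(i + 1, j):
            # Re-run layers i..k-1 to materialize the input of layer k,
            # hold it in one slot while the right part is backpropagated,
            # then release the slot and handle the left part.
            cost = (prefix[k] - prefix[i]) + opt(k, j, m - 1) + opt(i, k, m)
            best = min(best, cost)
        return best

    return opt(0, n, n_slots)

# Heterogeneous per-layer forward costs (e.g. measured in milliseconds).
print(chain_rematerialization_cost([4.0, 1.0, 6.0, 2.0, 3.0, 5.0], n_slots=2))
```

With memory expressed in identical slots, memoizing over (start, end, free slots) suffices; the fully heterogeneous setting studied in the paper is considerably harder, as noted above.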

This algorithm is made available through the Rotor software, a PyTorch plug-in that handles any network consisting of a sequence of layers, each with an arbitrarily complex internal structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes, and batch sizes.


        • Published in

          ACM Transactions on Mathematical Software (Just Accepted)
          ISSN: 0098-3500 | EISSN: 1557-7295

          Copyright © 2024 held by the owner/author(s).

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher: Association for Computing Machinery, New York, NY, United States

          Publication History

          • Online AM: 5 March 2024
          • Revised: 22 January 2024
          • Accepted: 22 January 2024
          • Received: 14 November 2020
