Optimal Re-Materialization Strategies for Heterogeneous Chains: How to Train Deep Neural Networks with Limited Memory
ACM Transactions on Mathematical Software (IF 2.7), Pub Date: 2024-03-05, DOI: 10.1145/3648633
Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly, Alena Shilova

Training feed-forward deep neural networks is a memory-intensive operation that is usually performed on GPUs with limited memory capacities. This may force data scientists to limit the depth of their models or the resolution of the input data when the data do not fit in GPU memory. The re-materialization technique, whose idea comes from the checkpointing strategies developed in the Automatic Differentiation literature, allows data scientists to limit the memory requirements related to the storage of intermediate data (activations), at the cost of additional computation.
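To make the trade-off concrete, the snippet below uses PyTorch's built-in gradient checkpointing (torch.utils.checkpoint), a generic form of re-materialization rather than the strategy proposed in this paper: the intermediate activations of the wrapped block are discarded after the forward pass and recomputed when the backward pass reaches them. The block and tensor sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Minimal sketch of re-materialization with PyTorch's generic gradient
# checkpointing (not the algorithm of this paper).
block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
x = torch.randn(32, 1024, requires_grad=True)

# Standard forward pass: every intermediate activation of `block` stays in memory.
y_full = block(x)

# Checkpointed forward pass: only the input of `block` is kept; the intermediate
# activations are re-materialized (recomputed) during the backward pass.
y_ckpt = checkpoint(block, x, use_reentrant=False)
y_ckpt.sum().backward()
```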

This paper introduces a new re-materialization strategy for activations that significantly reduces memory usage. It consists of selecting which activations are saved and which are deleted during the forward phase, and then recomputing the deleted activations when they are needed during the backward phase.

We propose an original computation model that combines two types of activation savings: either storing only the layer inputs, or recording the complete history of operations that produced the outputs. This paper focuses on the fully heterogeneous case, where the computation time and the memory requirement differ from layer to layer. We prove that finding the optimal solution is NP-hard and that classical techniques from the Automatic Differentiation literature do not apply. Moreover, the classical assumption of memory persistence of materialized activations, used to simplify the search for optimal solutions, no longer holds. We therefore introduce a weak memory persistence property and provide a dynamic program to compute the optimal sequence of computations.
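For intuition only, the sketch below solves a deliberately simplified variant of the selection problem under the classical memory-persistence assumption that the paper relaxes: every stored activation is assumed to stay in memory for the whole backward phase, and a missing activation is recomputed by re-running the forward pass from the nearest stored one. The function name optimal_checkpoints and the inputs f, m, and budget are hypothetical; this is not the paper's weak-persistence dynamic program.

```python
from functools import lru_cache

def optimal_checkpoints(f, m, budget):
    """Pick which activations of a chain to store under a memory budget so that
    total recomputation time is minimal, under a simplified model:
      * stored activations persist for the whole backward phase;
      * a missing activation i is recomputed from the closest stored j < i
        at cost f[j+1] + ... + f[i];
      * the chain input (index 0) is always kept and costs no memory here.
    f[i]: forward time of layer i, m[i]: size of activation i, for i = 1..n.
    """
    n = len(f) - 1

    # Prefix sums of forward times give segment recomputation costs in O(1).
    pref = [0.0] * (n + 1)
    for i in range(1, n + 1):
        pref[i] = pref[i - 1] + f[i]

    def recompute_cost(j, i):
        # Cost of recomputing activations j+1 .. i-1, each from stored activation j.
        return sum(pref[t] - pref[j] for t in range(j + 1, i))

    @lru_cache(maxsize=None)
    def best(j, mem_left):
        # Minimal recomputation cost for activations j+1..n when j is the most
        # recently stored activation and mem_left memory remains for storage.
        if j == n:
            return 0.0, ()
        # Option 1: store nothing else; everything after j is recomputed from j.
        candidates = [(recompute_cost(j, n + 1), ())]
        # Option 2: the next stored activation is i, if it fits in memory.
        for i in range(j + 1, n + 1):
            if m[i] <= mem_left:
                tail_cost, tail = best(i, mem_left - m[i])
                candidates.append((recompute_cost(j, i) + tail_cost, (i,) + tail))
        return min(candidates, key=lambda c: c[0])

    return best(0, budget)

# Illustrative heterogeneous chain: 4 layers with different times and sizes.
f = [0.0, 2.0, 1.0, 3.0, 1.0]
m = [0, 4, 2, 6, 2]
extra_time, stored = optimal_checkpoints(f, m, budget=6)
print(extra_time, stored)
```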

This algorithm is made available through the Rotor software, a PyTorch plug-in that handles any network consisting of a sequence of layers, each of which may have an arbitrarily complex internal structure. Through extensive experiments, we show that our implementation consistently outperforms existing re-materialization approaches for a large class of networks, image sizes, and batch sizes.
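Rotor's own interface is not reproduced here. As a stand-in, PyTorch's checkpoint_sequential shows the same usage pattern on a chain of layers: the chain is cut into segments, only segment inputs are stored during the forward phase, and each segment is re-run during the backward phase. Rotor differs in that it chooses the re-materialization schedule optimally for heterogeneous layers rather than using fixed, evenly sized segments.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A chain of 16 identical blocks stands in for a sequential network; the sizes
# and the number of segments are arbitrary illustrative choices.
chain = nn.Sequential(
    *[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(16)]
)
x = torch.randn(8, 512, requires_grad=True)

# Forward pass with 4 checkpointed segments: only the segment boundary
# activations are kept; each segment's internals are recomputed during backward.
out = checkpoint_sequential(chain, 4, x, use_reentrant=False)
out.mean().backward()
```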



Updated: 2024-03-06