The Limiting Dynamics of SGD: Modified Loss, Phase-Space Oscillations, and Anomalous Diffusion
Neural Computation (IF 2.9), Pub Date: 2024-01-01, DOI: 10.1162/neco_a_01626
Daniel Kunin, Javier Sagastuy-Brena, Lauren Gillespie, Eshed Margalit, Hidenori Tanaka, Surya Ganguli, Daniel L. K. Yamins

In this work, we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance traveled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction among the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase-space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents that cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD. Understanding the limiting dynamics of SGD, and its dependence on various important hyperparameters like batch size, learning rate, and momentum, can serve as a basis for future work that can turn these insights into algorithmic gains.
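
For context, the underdamped Langevin equation referenced above has the generic form

\[
d\theta_t = v_t\,dt, \qquad dv_t = -\gamma\, v_t\,dt - \nabla L(\theta_t)\,dt + \sqrt{2\gamma T}\,dW_t,
\]

where \theta_t denotes the parameters, v_t their velocities, L the loss, \gamma a friction coefficient, T an effective temperature, and W_t a Wiener process. This is the standard textbook form rather than the paper's exact result; how \gamma, T, and the noise covariance map onto the learning rate, momentum, and batch size is what the paper's derivation works out.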
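
As a concrete illustration of how the anomalous-diffusion claim can be checked empirically, the following is a minimal, hypothetical sketch (not code from the paper): it runs minibatch SGD with heavy-ball momentum on a toy linear regression problem, records how far the parameters travel after the loss has converged, and fits a power law, distance proportional to steps^c, on a log-log scale. Ordinary diffusion corresponds to c = 1/2; the nontrivial exponents reported in the paper are measured on deep networks such as ResNet-18 trained on ImageNet, which this toy example does not attempt to reproduce. All hyperparameter values below are illustrative.

# Hypothetical sketch: measuring a diffusion exponent for SGD with momentum
# on linear regression. Not the authors' code; hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50                      # samples, parameters
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

lr, momentum, batch = 0.05, 0.9, 32  # illustrative learning rate, momentum, batch size
w = np.zeros(d)
v = np.zeros(d)

burn_in, steps = 5000, 20000
ticks, distances = [], []
w_ref = None
for t in range(burn_in + steps):
    idx = rng.integers(0, n, size=batch)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch   # minibatch gradient of squared loss
    v = momentum * v - lr * grad                      # heavy-ball (SGD with momentum) update
    w = w + v
    if t == burn_in:
        w_ref = w.copy()              # reference point taken long after convergence
    elif t > burn_in:
        ticks.append(t - burn_in)
        distances.append(np.linalg.norm(w - w_ref))

# Fit distance ~ steps**c on a log-log scale; the slope estimates the exponent c.
c, _ = np.polyfit(np.log(ticks[10:]), np.log(distances[10:]), 1)
print(f"estimated diffusion exponent c ~ {c:.2f}")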




Updated: 2023-12-14