Abstract
This article surveys the results of several research studies [12–14, 26] that gradually resolved open questions in the high-probability convergence analysis of stochastic first-order optimization methods under mild assumptions on the noise. We begin by introducing gradient clipping, a technique that plays a pivotal role in making stochastic methods robust to heavy-tailed noise distributions. Next, we examine why high-probability convergence guarantees are important and how they relate to in-expectation convergence guarantees. The concluding sections of the article present the primary findings for minimization problems and the results of numerical experiments.
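To make the central notion concrete: the gradient clipping operator rescales a stochastic gradient whose norm exceeds a threshold λ, leaving short gradients untouched. The sketch below is a minimal illustration, not the authors' implementation; the step size, clipping level, and the toy quadratic objective are assumptions chosen for demonstration only.

```python
import numpy as np

def clip(g, lam):
    """Clipping operator: min(1, lam / ||g||) * g, so the result has norm <= lam."""
    norm = np.linalg.norm(g)
    if norm > lam:
        return (lam / norm) * g
    return g

# Toy clipped-SGD step on f(x) = ||x||^2 / 2 with a heavy-tailed
# (Student-t, 2 degrees of freedom) perturbation of the true gradient.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
gamma, lam = 0.1, 2.0  # hypothetical step size and clipping level
stoch_grad = x + rng.standard_t(df=2, size=x.shape)  # heavy-tailed noise
x = x - gamma * clip(stoch_grad, lam)
```

Because the clipped gradient is bounded by λ, a single heavy-tailed outlier cannot move the iterate arbitrarily far, which is the mechanism behind the high-probability guarantees discussed in the surveyed works.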
Notes
The code is available at https://github.com/ClippedStochasticMethods/clipped-SSTM.
REFERENCES
E. Gorbunov, M. Danilova, and A. Gasnikov, “Stochastic optimization with heavy-tailed noise via accelerated gradient clipping,” Adv. Neural Inf. Process. Syst. 33, 15042–15053 (2020).
A. V. Nazin, A. S. Nemirovsky, A. B. Tsybakov, and A. B. Juditsky, “Algorithms of robust stochastic optimization based on mirror descent method,” Autom. Remote Control 80, 1607–1627 (2019). https://doi.org/10.1134/s0005117919090042
D. Davis and D. Drusvyatskiy, “Stochastic model-based minimization of weakly convex functions,” SIAM J. Optim. 29, 207–239 (2019). https://doi.org/10.1137/18m1178244
A. Cutkosky and H. Mehta, “High-probability bounds for non-convex stochastic optimization with heavy tails,” Adv. Neural Inf. Process. Syst. 34 (2021).
T. D. Nguyen, T. H. Nguyen, A. Ene, and H. L. Nguyen, “High probability convergence of clipped-sgd under heavy-tailed noise,” arXiv Preprint (2023). https://doi.org/10.48550/arXiv.2302.05437
Z. Liu and Z. Zhou, “Stochastic nonsmooth convex optimization with heavy-tailed noises,” arXiv Preprint (2023). https://doi.org/10.48550/arXiv.2303.12277
Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “High probability convergence of stochastic gradient methods,” Proc. Mach. Learn. Res. 202, 21884–21914 (2023).
E. Gorbunov, M. Danilova, I. Shibaev, P. Dvurechensky, and A. Gasnikov, “Near-optimal high probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise,” arXiv Preprint (2021). https://doi.org/10.48550/arXiv.2106.05958
E. Gorbunov, M. Danilova, D. Dobre, P. Dvurechenskii, A. Gasnikov, and G. Gidel, “Clipped stochastic methods for variational inequalities with heavy-tailed noise,” Adv. Neural Inf. Process. Syst. 35, 31319–31332 (2022).
A. Sadiev, M. Danilova, E. Gorbunov, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik, “High-probability bounds for stochastic optimization and variational inequalities: The case of unbounded variance,” Proc. Mach. Learn. Res., 29563–29648 (2023).
R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” Proc. Mach. Learn. Res. 28, 1310–1318 (2013).
S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing LSTM language models,” in Int. Conf. on Learning Representations (2018).
M. Peters, W. Ammar, C. Bhagavatula, and R. Power, “Semi-supervised sequence tagging with bidirectional language models,” in Proc. 55th Annu. Meeting of the Association for Computational Linguistics, Vancouver, 2017, Ed. by R. Barzilay and M.-Ye. Kan (Association for Computational Linguistics, 2017), Vol. 1, pp. 1756–1765. https://doi.org/10.18653/v1/p17-1161
M. Mosbach, M. Andriushchenko, and D. Klakow, “On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines,” in Int. Conf. on Learning Representations (2020).
J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra, “Why are adaptive methods good for attention models?,” Adv. Neural Inf. Process. Syst. 33, 15383–15393 (2020).
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv Preprint (2014). https://doi.org/10.48550/arXiv.1412.6980
S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM J. Optim. 23, 2341–2368 (2013). https://doi.org/10.1137/120880811
O. Devolder, F. Glineur, and Yu. Nesterov, “First-order methods of smooth convex optimization with inexact oracle,” Math. Program. 146, 37–75 (2014). https://doi.org/10.1007/s10107-013-0677-5
S. Ghadimi and G. Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework,” SIAM J. Optim. 22, 1469–1492 (2012). https://doi.org/10.1137/110848864
G. Bennett, “Probability inequalities for the sum of independent random variables,” J. Am. Stat. Assoc. 57 (297), 33–45 (1962). https://doi.org/10.1080/01621459.1962.10482149
K. Dzhaparidze and J. H. Van Zanten, “On Bernstein-type inequalities for martingales,” Stochastic Processes Their Appl. 93, 109–117 (2001). https://doi.org/10.1016/s0304-4149(00)00086-7
D. A. Freedman, “On tail probabilities for martingales,” Ann. Probab. 3, 100–118 (1975). https://doi.org/10.1214/aop/1176996452
A. V. Gasnikov and Yu. E. Nesterov, “Universal method for stochastic composite optimization problems,” Comput. Math. Math. Phys. 58, 48–64 (2018). https://doi.org/10.1134/s0965542518010050
P. T. Harker and J.-Sh. Pang, “Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications,” Math. Program. 48, 161–220 (1990). https://doi.org/10.1007/bf01582255
E. K. Ryu and W. Yin, Large-Scale Convex Optimization: Algorithms and Analyses via Monotone Operators (Cambridge Univ. Press, 2022). https://doi.org/10.1017/9781009160865
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Yo. Bengio, “Generative adversarial nets,” Adv. Neural Inf. Process. Syst. 27 (2014).
G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien, “A variational inequality perspective on generative adversarial networks,” in Int. Conf. on Learning Representations (2019).
ACKNOWLEDGMENTS
This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 no. 70-2021-00138.
Funding
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
Ethics declarations
The author of this work declares that she has no conflicts of interest.
Cite this article
Danilova, M. Algorithms with Gradient Clipping for Stochastic Optimization with Heavy-Tailed Noise. Dokl. Math. 108 (Suppl 2), S248–S256 (2023). https://doi.org/10.1134/S1064562423701144