Abstract
This article surveys the results of several research studies [12–14, 26] that gradually resolved open questions in the high-probability convergence analysis of stochastic first-order optimization methods under mild assumptions on the noise. We begin by introducing gradient clipping, a technique that plays a pivotal role in making stochastic methods robust to heavy-tailed noise distributions. Next, we examine why high-probability convergence guarantees are important and how they relate to in-expectation convergence guarantees. The concluding sections of the article present the primary findings for minimization problems and the results of numerical experiments.
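To make the central notion concrete: the gradient clipping operator rescales a stochastic gradient whose norm exceeds a threshold λ, leaving short gradients untouched. The sketch below is a minimal illustration, not the authors' implementation; the step size, clipping level, and the toy quadratic objective are assumptions chosen for demonstration only.

```python
import numpy as np

def clip(g, lam):
    """Clipping operator: min(1, lam / ||g||) * g, so the result has norm <= lam."""
    norm = np.linalg.norm(g)
    if norm > lam:
        return (lam / norm) * g
    return g

# Toy clipped-SGD step on f(x) = ||x||^2 / 2 with a heavy-tailed
# (Student-t, 2 degrees of freedom) perturbation of the true gradient.
rng = np.random.default_rng(0)
x = np.array([5.0, -3.0])
gamma, lam = 0.1, 2.0  # hypothetical step size and clipping level
stoch_grad = x + rng.standard_t(df=2, size=x.shape)  # heavy-tailed noise
x = x - gamma * clip(stoch_grad, lam)
```

Because the clipped gradient is bounded by λ, a single heavy-tailed outlier cannot move the iterate arbitrarily far, which is the mechanism behind the high-probability guarantees discussed in the surveyed works.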
Notes
The code is available at https://github.com/ClippedStochasticMethods/clipped-SSTM.
REFERENCES
E. Gorbunov, M. Danilova, and A. Gasnikov, “Stochastic optimization with heavy-tailed noise via accelerated gradient clipping,” Adv. Neural Inf. Process. Syst. 33, 15042–15053 (2020).
A. V. Nazin, A. S. Nemirovsky, A. B. Tsybakov, and A. B. Juditsky, “Algorithms of robust stochastic optimization based on mirror descent method,” Autom. Remote Control 80, 1607–1627 (2019). https://doi.org/10.1134/s0005117919090042
D. Davis and D. Drusvyatskiy, “Stochastic model-based minimization of weakly convex functions,” SIAM J. Optim. 29, 207–239 (2019). https://doi.org/10.1137/18m1178244
A. Cutkosky and H. Mehta, “High-probability bounds for non-convex stochastic optimization with heavy tails,” Adv. Neural Inf. Process. Syst. 34 (2021).
T. D. Nguyen, T. H. Nguyen, A. Ene, and H. L. Nguyen, “High probability convergence of clipped-sgd under heavy-tailed noise,” arXiv Preprint (2023). https://doi.org/10.48550/arXiv.2302.05437
Z. Liu and Z. Zhou, “Stochastic nonsmooth convex optimization with heavy-tailed noises,” arXiv Preprint (2023). https://doi.org/10.48550/arXiv.2303.12277
Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “High probability convergence of stochastic gradient methods,” Proc. Mach. Learn. Res. 202, 21884–21914 (2023).
E. Gorbunov, M. Danilova, I. Shibaev, P. Dvurechensky, and A. Gasnikov, “Near-optimal high probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise,” arXiv Preprint (2021). https://doi.org/10.48550/arXiv.2106.05958
E. Gorbunov, M. Danilova, D. Dobre, P. Dvurechenskii, A. Gasnikov, and G. Gidel, “Clipped stochastic methods for variational inequalities with heavy-tailed noise,” Adv. Neural Inf. Process. Syst. 35, 31319–31332 (2022).
A. Sadiev, M. Danilova, E. Gorbunov, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik, “High-probability bounds for stochastic optimization and variational inequalities: The case of unbounded variance,” Proc. Mach. Learn. Res., 29563–29648 (2023).
R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” Proc. Mach. Learn. Res. 28, 1310–1318 (2013).
S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing LSTM language models,” in Int. Conf. on Learning Representations (2018).
M. Peters, W. Ammar, C. Bhagavatula, and R. Power, “Semi-supervised sequence tagging with bidirectional language models,” in Proc. 55th Annu. Meeting of the Association for Computational Linguistics, Vancouver, 2017, Ed. by R. Barzilay and M.-Ye. Kan (Association for Computational Linguistics, 2017), Vol. 1, pp. 1756–1765. https://doi.org/10.18653/v1/p17-1161
M. Mosbach, M. Andriushchenko, and D. Klakow, “On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines,” in Int. Conf. on Learning Representations (2020).
J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra, “Why are adaptive methods good for attention models?,” Adv. Neural Inf. Process. Syst. 33, 15383–15393 (2020).
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv Preprint (2014). https://doi.org/10.48550/arXiv.1412.6980
S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM J. Optim. 23, 2341–2368 (2013). https://doi.org/10.1137/120880811
O. Devolder, F. Glineur, and Yu. Nesterov, “First-order methods of smooth convex optimization with inexact oracle,” Math. Program. 146, 37–75 (2014). https://doi.org/10.1007/s10107-013-0677-5
S. Ghadimi and G. Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework,” SIAM J. Optim. 22, 1469–1492 (2012). https://doi.org/10.1137/110848864
G. Bennett, “Probability inequalities for the sum of independent random variables,” J. Am. Stat. Assoc. 57 (297), 33–45 (1962). https://doi.org/10.1080/01621459.1962.10482149
K. Dzhaparidze and J. H. Van Zanten, “On Bernstein-type inequalities for martingales,” Stochastic Processes Their Appl. 93, 109–117 (2001). https://doi.org/10.1016/s0304-4149(00)00086-7
D. A. Freedman, “On tail probabilities for martingales,” Ann. Probab. 3, 100–118 (1975). https://doi.org/10.1214/aop/1176996452
A. V. Gasnikov and Yu. E. Nesterov, “Universal method for stochastic composite optimization problems,” Comput. Math. Math. Phys. 58, 48–64 (2018). https://doi.org/10.1134/s0965542518010050
P. T. Harker and J.-Sh. Pang, “Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications,” Math. Program. 48, 161–220 (1990). https://doi.org/10.1007/bf01582255
E. K. Ryu and W. Yin, Large-Scale Convex Optimization: Algorithms and Analyses via Monotone Operators (Cambridge Univ. Press, 2022). https://doi.org/10.1017/9781009160865
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Yo. Bengio, “Generative adversarial nets,” Adv. Neural Inf. Process. Syst. 27 (2014).
G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien, “A variational inequality perspective on generative adversarial networks,” in Int. Conf. on Learning Representations (2019).
ACKNOWLEDGMENTS
This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 no. 70-2021-00138.
Funding
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
Ethics declarations
The author of this work declares that she has no conflicts of interest.
Cite this article
Danilova, M. Algorithms with Gradient Clipping for Stochastic Optimization with Heavy-Tailed Noise. Dokl. Math. 108 (Suppl 2), S248–S256 (2023). https://doi.org/10.1134/S1064562423701144