
Algorithms with Gradient Clipping for Stochastic Optimization with Heavy-Tailed Noise


Abstract

This article surveys the results of several studies [12–14, 26] that gradually resolved open questions related to the high-probability convergence analysis of stochastic first-order optimization methods under mild assumptions on the noise. We begin by introducing the concept of gradient clipping, which plays a pivotal role in designing stochastic methods that remain effective under heavy-tailed noise distributions. We then examine the importance of high-probability convergence guarantees and their connection with in-expectation convergence guarantees. The concluding sections present the main results for minimization problems and the findings of the numerical experiments.
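To make the clipping operation concrete, the following minimal Python sketch shows the standard clipping operator clip(g, λ) = min(1, λ/‖g‖)·g applied inside a plain stochastic gradient loop. It is an illustration only: the function names (clip, clipped_sgd), the constant step size gamma, the fixed clipping level lam, and the toy quadratic example are assumptions made here for exposition, not the accelerated clipped-SSTM method studied in the surveyed papers.

import numpy as np

def clip(g, lam):
    # Clipping operator: clip(g, lam) = min(1, lam / ||g||) * g
    norm = np.linalg.norm(g)
    if norm <= lam or norm == 0.0:
        return g
    return (lam / norm) * g

def clipped_sgd(stoch_grad, x0, gamma, lam, n_steps, seed=0):
    # Plain clipped SGD: x_{k+1} = x_k - gamma * clip(g_k, lam),
    # where g_k is a stochastic gradient at x_k whose noise may be heavy-tailed.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        g = stoch_grad(x, rng)            # stochastic gradient oracle
        x = x - gamma * clip(g, lam)      # clipped update
    return x

# Toy usage: quadratic objective 0.5*||x||^2 with heavy-tailed (Student's t, df = 2) gradient noise.
if __name__ == "__main__":
    grad_oracle = lambda x, rng: x + rng.standard_t(df=2, size=x.shape)
    x_out = clipped_sgd(grad_oracle, x0=np.ones(10), gamma=0.05, lam=1.0, n_steps=2000)
    print("final distance to optimum:", np.linalg.norm(x_out))

Clipping bounds the magnitude of every stochastic gradient actually used in the update, so rare heavy-tailed noise realizations cannot produce arbitrarily large steps; this boundedness is what makes Bernstein-type concentration arguments, and hence high-probability bounds, applicable.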

Notes

  1. The code is available at https://github.com/ClippedStochasticMethods/clipped-SSTM.

REFERENCES

  1. E. Gorbunov, M. Danilova, and A. Gasnikov, “Stochastic optimization with heavy-tailed noise via accelerated gradient clipping,” Adv. Neural Inf. Process. Syst. 33, 15042–15053 (2020).

  2. A. V. Nazin, A. S. Nemirovsky, A. B. Tsybakov, and A. B. Juditsky, “Algorithms of robust stochastic optimization based on mirror descent method,” Autom. Remote Control 80, 1607–1627 (2019). https://doi.org/10.1134/s0005117919090042

  3. D. Davis and D. Drusvyatskiy, “Stochastic model-based minimization of weakly convex functions,” SIAM J. Optim. 29, 207–239 (2019). https://doi.org/10.1137/18m1178244

  4. A. Cutkosky and H. Mehta, “High-probability bounds for non-convex stochastic optimization with heavy tails,” Adv. Neural Inf. Process. Syst. 34 (2021).

  5. T. D. Nguyen, T. H. Nguyen, A. Ene, and H. L. Nguyen, “High probability convergence of clipped-SGD under heavy-tailed noise,” arXiv Preprint (2023). https://doi.org/10.48550/arXiv.2302.05437

  6. Z. Liu and Z. Zhou, “Stochastic nonsmooth convex optimization with heavy-tailed noises,” arXiv Preprint (2023). https://doi.org/10.48550/arXiv.2303.12277

  7. Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “High probability convergence of stochastic gradient methods,” Proc. Mach. Learn. Res. 202, 21884–21914 (2023).

  8. E. Gorbunov, M. Danilova, I. Shibaev, P. Dvurechensky, and A. Gasnikov, “Near-optimal high probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise,” arXiv Preprint (2021). https://doi.org/10.48550/arXiv.2106.05958

  9. E. Gorbunov, M. Danilova, D. Dobre, P. Dvurechenskii, A. Gasnikov, and G. Gidel, “Clipped stochastic methods for variational inequalities with heavy-tailed noise,” Adv. Neural Inf. Process. Syst. 35, 31319–31332 (2022).

  10. A. Sadiev, M. Danilova, E. Gorbunov, S. Horváth, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richtárik, “High-probability bounds for stochastic optimization and variational inequalities: The case of unbounded variance,” Proc. Mach. Learn. Res. 202, 29563–29648 (2023).

  11. R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” Proc. Mach. Learn. Res. 28, 1310–1318 (2013).

  12. S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing LSTM language models,” in Int. Conf. on Learning Representations (2018).

  13. M. Peters, W. Ammar, C. Bhagavatula, and R. Power, “Semi-supervised sequence tagging with bidirectional language models,” in Proc. 55th Annu. Meeting of the Association for Computational Linguistics, Vancouver, 2017, Ed. by R. Barzilay and M.-Y. Kan (Association for Computational Linguistics, 2017), Vol. 1, pp. 1756–1765. https://doi.org/10.18653/v1/p17-1161

  14. M. Mosbach, M. Andriushchenko, and D. Klakow, “On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines,” in Int. Conf. on Learning Representations (2020).

  15. J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra, “Why are adaptive methods good for attention models?,” Adv. Neural Inf. Process. Syst. 33, 15383–15393 (2020).

  16. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv Preprint (2014). https://doi.org/10.48550/arXiv.1412.6980

  17. S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM J. Optim. 23, 2341–2368 (2013). https://doi.org/10.1137/120880811

  18. O. Devolder, F. Glineur, and Yu. Nesterov, “First-order methods of smooth convex optimization with inexact oracle,” Math. Program. 146, 37–75 (2014). https://doi.org/10.1007/s10107-013-0677-5

  19. S. Ghadimi and G. Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework,” SIAM J. Optim. 22, 1469–1492 (2012). https://doi.org/10.1137/110848864

  20. G. Bennett, “Probability inequalities for the sum of independent random variables,” J. Am. Stat. Assoc. 57 (297), 33–45 (1962). https://doi.org/10.1080/01621459.1962.10482149

  21. K. Dzhaparidze and J. H. Van Zanten, “On Bernstein-type inequalities for martingales,” Stochastic Processes Their Appl. 93, 109–117 (2001). https://doi.org/10.1016/s0304-4149(00)00086-7

  22. D. A. Freedman, “On tail probabilities for martingales,” Ann. Probab. 3, 100–118 (1975). https://doi.org/10.1214/aop/1176996452

  23. A. V. Gasnikov and Yu. E. Nesterov, “Universal method for stochastic composite optimization problems,” Comput. Math. Math. Phys. 58, 48–64 (2018). https://doi.org/10.1134/s0965542518010050

  24. P. T. Harker and J.-Sh. Pang, “Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications,” Math. Program. 48, 161–220 (1990). https://doi.org/10.1007/bf01582255

  25. E. K. Ryu and W. Yin, Large-Scale Convex Optimization: Algorithms and Analyses via Monotone Operators (Cambridge Univ. Press, 2022). https://doi.org/10.1017/9781009160865

  26. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Adv. Neural Inf. Process. Syst. 27 (2014).

  27. G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien, “A variational inequality perspective on generative adversarial networks,” in Int. Conf. on Learning Representations (2019).

ACKNOWLEDGMENTS

This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021, no. 70-2021-00138.

Funding

This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.

Author information

Corresponding author

Correspondence to M. Danilova.

Ethics declarations

The author of this work declares that she has no conflicts of interest.

Additional information

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Danilova, M. Algorithms with Gradient Clipping for Stochastic Optimization with Heavy-Tailed Noise. Dokl. Math. 108 (Suppl 2), S248–S256 (2023). https://doi.org/10.1134/S1064562423701144
