
Abstract

Motivated by the conspicuous use of momentum-based algorithms in deep learning, we study a nonsmooth nonconvex stochastic heavy ball method and show its convergence. Our approach builds upon semialgebraic (definable) assumptions commonly met in practical situations and combines nonsmooth calculus with a differential inclusion method. Additionally, we provide general conditions on the sample distribution that ensure the convergence of the objective function values. Our results are general enough to justify the use of subgradient sampling in modern implementations that heuristically apply rules of differential calculus to nonsmooth functions, such as backpropagation or implicit differentiation. As with the stochastic subgradient method, our analysis highlights that subgradient sampling can make the stochastic heavy ball method converge to artificial critical points. Thanks to the semialgebraic setting, we address this concern by showing that these artifacts are almost surely avoided when initializations are randomized, leading the method to converge to Clarke critical points.
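To make the object of study concrete, the following is a minimal sketch of the stochastic heavy ball iteration with subgradient sampling, x_{k+1} = x_k + β(x_k − x_{k−1}) − α_k g_k, written in its equivalent velocity form. The step-size schedule, the toy nonsmooth objective, and all names in this snippet are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def stochastic_heavy_ball(sample_subgradient, x0, steps, momentum=0.9, seed=0):
    """Stochastic heavy ball (Polyak momentum) with subgradient sampling.

    sample_subgradient(x, rng) returns a (sub)gradient estimate of the
    objective at x, e.g. computed by backpropagation on a sampled mini-batch.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # velocity term, v_k = x_k - x_{k-1}
    for k in range(steps):
        alpha = 1.0 / (k + 1) ** 0.75         # vanishing step sizes (illustrative choice)
        g = sample_subgradient(x, rng)        # sampled subgradient at the current iterate
        v = momentum * v - alpha * g          # v_{k+1} = beta * v_k - alpha_k * g_k
        x = x + v                             # x_{k+1} = x_k + beta*(x_k - x_{k-1}) - alpha_k * g_k
    return x

# Example: a nonsmooth nonconvex objective f(x) = |x_0| * (1 + sin(x_1))
# queried through noisy subgradient evaluations.
def noisy_subgradient(x, rng):
    g = np.array([np.sign(x[0]) * (1 + np.sin(x[1])),
                  np.abs(x[0]) * np.cos(x[1])])
    return g + 0.1 * rng.standard_normal(g.shape)

x_last = stochastic_heavy_ball(noisy_subgradient, x0=[1.0, 2.0], steps=10_000)
```

Randomizing `x0` in this sketch mirrors the randomized initializations under which the artificial critical points mentioned in the abstract are almost surely avoided.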



Acknowledgements

The author would like to thank Jérôme Bolte and Edouard Pauwels for their valuable feedback. This work has benefited from the support of the AI Interdisciplinary Institute ANITI. ANITI is funded by the French “Investing for the Future – PIA3” program under grant agreement no. ANR-19-PI3A-0004.

Author information


Corresponding author

Correspondence to Tam Le.

Additional information

Communicated by Johannes O. Royset.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Le, T. Nonsmooth Nonconvex Stochastic Heavy Ball. J Optim Theory Appl (2024). https://doi.org/10.1007/s10957-024-02408-3

