
Global stability of first-order methods for coercive tame functions

  • Full Length Paper
  • Series A

Mathematical Programming

Abstract

We consider first-order methods with constant step size for minimizing locally Lipschitz coercive functions that are tame in an o-minimal structure on the real field. We prove that if the method is approximated by subgradient trajectories, then the iterates eventually remain in a neighborhood of a connected component of the set of critical points. Under suitable method-dependent regularity assumptions, this result applies to the subgradient method with momentum, the stochastic subgradient method with random reshuffling and momentum, and the random-permutations cyclic coordinate descent method.
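For intuition, the first setting named in the abstract, the subgradient method with momentum at a constant step size, corresponds to the heavy-ball iteration x_{k+1} = x_k + β(x_k − x_{k−1}) − α g_k with g_k ∈ ∂f(x_k). The sketch below is an illustrative implementation under assumed values of the step size α, the momentum β, and a toy objective; it is not the paper's Algorithm 1.

```python
import numpy as np

def heavy_ball_subgradient(subgrad, x0, alpha=1e-3, beta=0.9, iters=10_000):
    """Constant-step-size subgradient method with heavy-ball momentum.

    subgrad(x) must return some Clarke subgradient of the objective at x;
    the step size alpha and momentum beta are held constant throughout.
    """
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(iters):
        g = subgrad(x)
        # momentum term plus a constant-step subgradient step
        x_next = x + beta * (x - x_prev) - alpha * g
        x_prev, x = x, x_next
    return x

# Toy coercive, locally Lipschitz, tame objective: f(x) = ||x||_1 + ||x||^2 / 2.
subgrad_f = lambda x: np.sign(x) + x      # one valid Clarke subgradient everywhere
x_final = heavy_ball_subgradient(subgrad_f, np.array([3.0, -2.0]))
```

With a small constant step size the iterates need not converge to a single point; consistent with the abstract, they eventually remain in a neighborhood of a connected component of the set of critical points (here, near the origin).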



Notes

  1. https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD.

  2. https://pytorch.org/docs/stable/generated/torch.optim.SGD.html.

  3. https://scikit-learn.org/stable/modules/sgd.html.
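The footnotes above point to library implementations of SGD with a constant learning rate and momentum. As a minimal usage sketch of the PyTorch variant (the learning rate, momentum value, and toy objective below are assumptions for illustration, not settings from the paper):

```python
import torch

# Toy parameters and a nonsmooth coercive objective; lr and momentum are illustrative.
w = torch.zeros(10, requires_grad=True)
opt = torch.optim.SGD([w], lr=1e-2, momentum=0.9)  # constant step size with momentum

for _ in range(200):
    opt.zero_grad()
    loss = (w - 1.0).abs().sum() + 0.5 * (w ** 2).sum()
    loss.backward()   # autograd returns a (sub)gradient of the nonsmooth loss
    opt.step()
```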


Acknowledgements

We thank the reviewers and the co-editor for their valuable feedback. We acknowledge support from NSF EPCN Grant 2023032 and ONR Grant N00014-21-1-2282.

Author information

Corresponding author

Correspondence to Cédric Josz.



About this article


Cite this article

Josz, C., Lai, L. Global stability of first-order methods for coercive tame functions. Math. Program. (2023). https://doi.org/10.1007/s10107-023-02020-9

