Abstract
We consider first-order methods with constant step size for minimizing locally Lipschitz coercive functions that are tame in an o-minimal structure on the real field. We prove that if the method is approximated by subgradient trajectories, then the iterates eventually remain in a neighborhood of a connected component of the set of critical points. Under suitable method-dependent regularity assumptions, this result applies to the subgradient method with momentum, the stochastic subgradient method with random reshuffling and momentum, and the random-permutations cyclic coordinate descent method.
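To make the setting concrete, the following is a minimal illustrative sketch (not the paper's analysis) of a constant-step subgradient method with momentum, i.e. Polyak's heavy-ball iteration, applied to a coercive, locally Lipschitz, semialgebraic (hence tame) function whose critical set is the single connected component {(0, 1)}. The test function, step size, and momentum parameter are assumptions chosen for illustration.

```python
import numpy as np

def subgradient(x):
    # A Clarke subgradient of f(x) = |x[0]| + (x[1] - 1)**2, a coercive,
    # locally Lipschitz, semialgebraic function. At x[0] = 0 any element
    # of [-1, 1] is a valid subgradient; we pick 0.
    g0 = np.sign(x[0])
    g1 = 2.0 * (x[1] - 1.0)
    return np.array([g0, g1])

def heavy_ball(x0, step=1e-3, beta=0.9, iters=20000):
    # Constant-step subgradient method with momentum:
    #   x_{k+1} = x_k - step * g_k + beta * (x_k - x_{k-1}),
    # with g_k a subgradient of f at x_k.
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(iters):
        g = subgradient(x)
        x_next = x - step * g + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

x = heavy_ball(np.array([2.0, -3.0]))
# The iterates eventually remain in a small neighborhood of the
# critical point (0, 1); along |x[0]| they oscillate at a scale set
# by step / (1 - beta) rather than converging exactly.
```

The oscillation around the kink of the absolute value is exactly why the paper's conclusion is phrased as the iterates remaining in a *neighborhood* of a connected component of the critical set, rather than converging to a critical point.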
Acknowledgements
We thank the reviewers and the co-editor for their valuable feedback.
We acknowledge support from NSF EPCN Grant 2023032 and ONR Grant N00014-21-1-2282.
Cite this article
Josz, C., Lai, L. Global stability of first-order methods for coercive tame functions. Math. Program. (2023). https://doi.org/10.1007/s10107-023-02020-9