Abstract

This work considers the non-convex finite-sum minimization problem. Several algorithms exist for such problems, but existing methods often perform poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. To that end, we include a preconditioner based on Hutchinson’s approach for approximating the diagonal of the Hessian and couple it with several gradient-based methods to obtain new ‘scaled’ algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL-condition are assumed. Because the adaptively scaled methods use approximate partial second-order curvature information, they can better mitigate the impact of badly scaled problems. This improved practical performance is demonstrated in the numerical experiments also presented in this work.
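To make the abstract's main ingredient concrete, the sketch below illustrates a Hutchinson-style estimate of the Hessian diagonal, built from Hessian-vector products and random \(\pm 1\) probe vectors, together with a gradient step rescaled elementwise by that estimate. It is a minimal illustration, not the paper's Scaled SARAH or Scaled L-SVRG pseudocode: the function names, the number of probe samples, and the lower bound alpha on the diagonal are assumptions made here for the example.

```python
import numpy as np

def hutchinson_diag_estimate(hvp, dim, num_samples=10, rng=None):
    """Estimate diag(H) via Hutchinson's approach: E[z * (H z)] = diag(H)
    when z has independent +/-1 (Rademacher) entries.

    hvp : callable v -> H @ v, e.g. a Hessian-vector product from autodiff.
    """
    rng = np.random.default_rng() if rng is None else rng
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe vector
        estimate += z * hvp(z)
    return estimate / num_samples

def scaled_step(x, grad, diag_estimate, step_size=0.1, alpha=1e-3):
    """One diagonally preconditioned gradient step.

    The absolute value and the floor alpha keep the scaling positive,
    so the update stays well defined even under negative curvature.
    """
    scaling = np.maximum(np.abs(diag_estimate), alpha)
    return x - step_size * grad / scaling

# Toy usage on a badly scaled quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])
d = hutchinson_diag_estimate(lambda v: A @ v, dim=2, num_samples=50)
x_new = scaled_step(x, A @ x, d)
```

In the algorithms analysed in the paper, the gradient direction comes from a variance-reduced estimator (SARAH or L-SVRG) rather than the exact gradient; the elementwise division by a (safeguarded) diagonal curvature estimate is what mitigates bad scaling.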

Notes

  1. That is, the components of \(z_t\) are \(\pm 1\) with equal probability (a Rademacher random vector).

  2. Note that, while PAGE allows minibatches for either option in the update Step 6, most of the theoretical results presented in [18] require the full gradient to be computed as the first option in Step 6 (see the estimator sketch after these notes).

  3. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (a loading example follows these notes).
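As a companion to note 2, here is a minimal sketch of a PAGE-style gradient estimator of the form described in [18]: with probability p the full gradient is recomputed, and otherwise the previous estimate is corrected with a minibatch of gradient differences. The function names and arguments are illustrative assumptions, not the notation of the paper or of [18].

```python
import numpy as np

def page_gradient_estimator(grad_full, grad_i, x_new, x_old, g_old, n,
                            p=0.5, batch_size=1, rng=None):
    """One PAGE-style update of the gradient estimator.

    grad_full : callable x -> full gradient (1/n) * sum_i grad f_i(x)
    grad_i    : callable (i, x) -> gradient of the i-th component f_i at x
    With probability p the full gradient is recomputed; otherwise the
    previous estimate g_old is corrected with a minibatch of differences.
    """
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < p:
        return grad_full(x_new)                    # option 1: full gradient
    idx = rng.integers(0, n, size=batch_size)      # option 2: minibatch correction
    correction = np.mean(
        [grad_i(i, x_new) - grad_i(i, x_old) for i in idx], axis=0)
    return g_old + correction
```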
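Regarding note 3, the datasets in that collection are stored in the LIBSVM (svmlight) text format. One common way to load them in Python, assuming scikit-learn is available and that a file such as a9a has been downloaded from the page above, is:

```python
from sklearn.datasets import load_svmlight_file

# "a9a" is an example file name from the LIBSVM collection; any downloaded
# dataset file in svmlight format can be passed instead.
X, y = load_svmlight_file("a9a")   # X: sparse CSR feature matrix, y: labels
print(X.shape, y.shape)
```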

References

  1. Bekas, C., Kokiopoulou, E., Saad, Y.: An estimator for the diagonal of a matrix. Appl. Numer. Math. 57(11), 1214–1229 (2007). (Numerical Algorithms, Parallelism and Applications (2))

  2. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 1, pp. 1646–1654 (2014)

  3. Défossez, A., Bottou, L., Bach, F., Usunier, N.: A simple convergence proof of Adam and Adagrad. arXiv preprint arXiv:2003.02395 (2020)

  4. Dennis, J.E., Jr., Moré, J.J.: Quasi-Newton methods, motivation and theory. SIAM Rev. 19(1), 46–89 (1977)

  5. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7), 2121–2159 (2011)

  6. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, New York (1987)

  7. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  8. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155, 267–305 (2016)

  9. Hofmann, T., Lucchi, A., Lacoste-Julien, S., McWilliams, B.: Variance reduced stochastic gradient descent with neighbors. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 2305–2313 (2015)

  10. Jahani, M., Nazari, M., Rusakov, S., Berahas, A., Takáč, M.: Scaling up quasi-Newton algorithms: communication efficient distributed SR1. In: International Conference on Machine Learning, Optimization and Data Science, PMLR, pp. 41–54 (2020)

  11. Jahani, M., Nazari, M., Tappenden, R., Berahas, A., Takáč, M.: SONIA: a symmetric blockwise truncated optimization algorithm. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp. 487–495 (2021)

  12. Jahani, M., Rusakov, S., Shi, Z., Richtárik, P., Mahoney, M., Takáč, M.: Doubly adaptive scaled algorithm for machine learning using second-order information. arXiv preprint arXiv:2109.05198v1 (2021)

  13. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 1, pp. 315–323 (2013)

  14. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  15. Konečný, J., Richtárik, P.: Semi-stochastic gradient descent methods. Front. Appl. Math. Stat. 3, 9 (2017)

  16. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, vol. 2, pp. 2663–2671 (2012)

  17. Lei, L., Ju, C., Chen, J., Jordan, M.: Non-convex finite-sum optimization via SCSG methods. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2345–2355 (2017)

  18. Li, Z., Bao, H., Zhang, X., Richtárik, P.: PAGE: a simple and optimal probabilistic gradient estimator for nonconvex optimization. In: Proceedings of the 38th International Conference on Machine Learning, PMLR 139, pp. 6286–6295 (2021)

  19. Li, Z., Hanzely, S., Richtárik, P.: ZeroSARAH: efficient nonconvex finite-sum optimization with zero full gradient computation. arXiv preprint arXiv:2103.01447 (2021)

  20. Li, Z., Richtárik, P.: A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013 (2020)

  21. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pp. 2613–2621 (2017)

  22. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research, 2nd edn. Springer, Cham (2006)

  23. Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21(110), 1–48 (2020)

  24. Qian, X., Qu, Z., Richtárik, P.: L-SVRG and L-Katyusha with arbitrary sampling. J. Mach. Learn. Res. 22, 1–49 (2021)

  25. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237 (2019)

  26. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  27. Sadiev, A., Beznosikov, A., Almansoori, A.J., Kamzolov, D., Tappenden, R., Takáč, M.: Stochastic gradient methods with preconditioned updates. arXiv preprint arXiv:2206.00285 (2022)

  28. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)

  29. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(1), 567–599 (2013)

  30. Tieleman, T., Hinton, G., et al.: Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  31. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  32. Yao, Z., Gholami, A., Shen, S., Keutzer, K., Mahoney, M.W.: ADAHESSIAN: an adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719 (2020)

  33. Zou, F., Shen, L., Jie, Z., Sun, J., Liu, W.: Weighted adagrad with unified momentum. arXiv preprint arXiv:1808.03408 (2018)

Acknowledgements

The work of A. Sadiev was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Ivannikov Institute for System Programming of the Russian Academy of Sciences dated November 2, 2021, No. 70-2021-00142.

Author information

Corresponding author

Correspondence to Aleksandr Beznosikov.

Additional information

Communicated by Alexander Vladimirovich Gasnikov.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sadiev, A., Beznosikov, A., Almansoori, A.J. et al. Stochastic Gradient Methods with Preconditioned Updates. J Optim Theory Appl (2024). https://doi.org/10.1007/s10957-023-02365-3
