Abstract
This work considers the non-convex finite-sum minimization problem. Several algorithms exist for such problems, but existing methods often perform poorly when the problem is badly scaled and/or ill-conditioned, and a primary goal of this work is to introduce methods that alleviate this issue. To that end, we include a preconditioner based on Hutchinson’s approach to approximating the diagonal of the Hessian and couple it with several gradient-based methods, giving new ‘scaled’ algorithms: Scaled SARAH and Scaled L-SVRG. Theoretical complexity guarantees under smoothness assumptions are presented, and we prove linear convergence when both smoothness and the PL-condition are assumed. Because our adaptively scaled methods use approximate partial second-order curvature information, they can better mitigate the impact of bad scaling. This improved practical performance is demonstrated in the numerical experiments also presented in this work.
Notes
i.e., the components of \(z_t\) are \(\pm 1\) with equal probability.
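The preconditioner described in the abstract rests on Hutchinson's diagonal estimator: for a matrix \(A\) accessible only through matrix–vector products, \(\operatorname{diag}(A) \approx \frac{1}{m}\sum_{k=1}^{m} z_k \odot (A z_k)\), where each \(z_k\) has i.i.d. \(\pm 1\) (Rademacher) entries as in the note above. Since \(\mathbb{E}[z z^\top] = I\), the estimator is unbiased. The following is a minimal sketch of this idea, not the authors' implementation; the function name `hutchinson_diagonal` and its parameters are illustrative.

```python
import numpy as np

def hutchinson_diagonal(matvec, dim, num_samples=100, rng=None):
    """Estimate diag(A) given only matrix-vector products v -> A v,
    via Hutchinson's estimator:
        diag(A) ~ (1/m) * sum_k  z_k * (A z_k)   (elementwise product),
    where each z_k has i.i.d. +/-1 (Rademacher) entries."""
    rng = np.random.default_rng(rng)
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher vector
        estimate += z * (matvec(z))            # z ⊙ (A z), unbiased for diag(A)
    return estimate / num_samples

# Illustrative check on a small symmetric matrix with known diagonal.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 3.0, -1.0],
              [0.0, -1.0, 4.0]])
est = hutchinson_diagonal(lambda v: A @ v, dim=3, num_samples=8000, rng=0)
```

In the scaled methods of the paper, an estimate of this kind (applied to the Hessian, whose products can be obtained without forming the matrix) yields a diagonal matrix used to rescale the stochastic gradient step, which is what counters bad scaling.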
Acknowledgements
The work of A. Sadiev was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Ivannikov Institute for System Programming of the Russian Academy of Sciences dated November 2, 2021, No. 70-2021-00142.
Communicated by Alexander Vladimirovich Gasnikov.
Sadiev, A., Beznosikov, A., Almansoori, A.J. et al. Stochastic Gradient Methods with Preconditioned Updates. J Optim Theory Appl (2024). https://doi.org/10.1007/s10957-023-02365-3