Abstract
The optimized gradient method (OGM) provides a factor-\(\sqrt{2}\) speedup over Nesterov’s celebrated accelerated gradient method in the convex (but non-strongly convex) setup. However, this improved acceleration mechanism has not been well understood; prior analyses of OGM relied on a computer-assisted proof methodology, so the proofs were opaque to humans despite being verifiable and correct. In this work, we present a new analysis of OGM based on a Lyapunov function and linear coupling. These analyses are developed and presented without the assistance of computers and are understandable by humans. Furthermore, we generalize OGM’s acceleration mechanism and obtain a factor-\(\sqrt{2}\) speedup in other setups: acceleration with a simpler rational stepsize, the strongly convex setup, and the mirror descent setup.
References
Ahn, K., Sra, S.: From Nesterov’s estimate sequence to Riemannian acceleration. COLT (2020)
Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. STOC (2017)
Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. ICML (2016)
Allen-Zhu, Z., Orecchia, L.: Linear coupling: An ultimate unification of gradient and mirror descent. ITCS (2017)
Allen-Zhu, Z., Lee, Y.T., Orecchia, L.: Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. SODA (2016)
Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. ICML (2016)
Aujol, J., Dossal, C.: Optimal rate of convergence of an ODE associated to the fast gradient descent schemes for \(b> 0\). HAL Archives Ouvertes (2017)
Aujol, J.F., Dossal, C., Fort, G., Moulines, É.: Rates of convergence of perturbed FISTA-based algorithms. HAL Archives Ouvertes (2019)
Aujol, J.F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. SIAM J. Optim. 29(4), 3131–3153 (2019)
Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of the heavy-ball method for quasi-strongly convex optimization. SIAM J. Optim. 32(3), 1817–1842 (2022)
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Baes, M.: Estimate sequence methods: extensions and approximations. Tech. rep, Institute for Operations Research, ETH, Zürich, Switzerland (2009)
Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory Comput. 15(4), 1–32 (2019)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
De Klerk, E., Glineur, F., Taylor, A.B.: Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation. SIAM J. Optim. 30(3), 2053–2082 (2020)
Dragomir, R.A., Taylor, A.B., d’Aspremont, A., Bolte, J.: Optimal complexity and certification of Bregman first-order methods. Math. Program. (2021)
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)
Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184(1), 183–220 (2020)
Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. J. Complex. 68, 101590 (2022)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)
Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)
Gu, G., Yang, J.: Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems. SIAM J. Optim. 30(3), 1905–1921 (2020)
Kim, D.: Accelerated proximal point method for maximally monotone operators. Math. Program. 190(1–2), 57–87 (2021)
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)
Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)
Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. J. Optim. Theory Appl. 178(1), 240–263 (2018)
Kim, D., Fessler, J.A.: Another look at the fast iterative shrinkage/thresholding algorithm (FISTA). SIAM J. Optim. 28(1), 223–250 (2018)
Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization. SIAM J. Optim. 28(2), 1920–1950 (2018)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Li, B., Coutiño, M., Giannakis, G.B.: Revisit of estimate sequence for accelerated gradient methods. ICASSP (2020)
Lieder, F.: On the convergence rate of the Halpern-iteration. Optim. Lett. 15(2), 405–418 (2021)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)
Nemirovsky, A.S.: On optimality of Krylov’s information when solving linear operator equations. J. Complex. 7(2), 121–130 (1991)
Nemirovsky, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(\cal{O} (1/k^2)\). Proc. USSR Acad. Sci. 269, 543–547 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Cham (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y., Stich, S.U.: Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM J. Optim. 27(1), 110–123 (2017)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Ryu, E.K., Yin, W.: Large-scale convex optimization via monotone operators. Draft (2021)
Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection. SIAM J. Optim. 30(3), 2251–2271 (2020)
Shi, B., Du, S.S., Su, W., Jordan, M.I.: Acceleration via symplectic discretization of high-resolution differential equations. NeurIPS (2019)
Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions. arXiv:1903.05671 (2019)
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. NeurIPS (2014)
Taylor, A.B., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. COLT (2019)
Taylor, A., Drori, Y.: An optimal gradient method for smooth strongly convex minimization. Math. Program. 199(1–2), 557–594 (2023)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)
Acknowledgements
JP and EKR were supported by the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02) and the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [NRF-2022R1C1C1010010]. We thank Gyumin Roh for reviewing the manuscript and providing valuable feedback. We thank Bryan Van Scoy and Suvrit Sra for the discussions regarding the triple momentum method and estimate sequences, respectively.
Funding
Funding was provided by the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02) and the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [NRF-2022R1C1C1010010].
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Method Reference
For reference, we restate all aforementioned methods. In all methods, we assume that f is an L-smooth function, that \(\{ \theta _k \}_{k=0}^\infty \) and \(\{ \varphi _k \}_{k=0}^\infty \) are sequences of positive scalars, and that \(x_0=y_0=z_0\).
OGM One form of OGM is
and an equivalent form with z-iterates is
for \(k=0,1,\dots \). The last-step modification on the secondary sequence can be written as
where \(k=0,1,\dots \).
OGM-simple OGM-simple is a simpler variant of OGM with \(\theta _k = \frac{k+2}{2}\) and \(\varphi _k = \frac{k+1 +\frac{1}{\sqrt{2}}}{\sqrt{2}}\). One form of OGM-simple is
and an equivalent form with z-iterates is
for \(k=0,1,\dots \). The last-step modification on the secondary sequence is written as
where \(k=0,1,\dots \).
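To make the recursion concrete, the following is a minimal numerical sketch, assuming the standard two-sequence OGM update of Kim and Fessler with the rational stepsize \(\theta _k = \frac{k+2}{2}\) substituted in; the quadratic test objective and the omission of the last-step modification are our simplifications.

```python
import numpy as np

# Illustrative sketch of OGM-simple, assuming the standard two-sequence OGM
# update of Kim and Fessler with the rational stepsize theta_k = (k+2)/2.
# (The last-step modification with phi_k is omitted in this sketch.)
A = np.diag([1.0, 10.0, 100.0])   # quadratic test problem; L = lambda_max(A)
L = 100.0
grad = lambda x: A @ x
f = lambda x: 0.5 * x @ A @ x

theta = lambda k: (k + 2) / 2
x = y = np.ones(3)                # x_0 = y_0
for k in range(300):
    y_new = x - grad(x) / L       # gradient step
    x = (y_new
         + (theta(k) - 1) / theta(k + 1) * (y_new - y)   # Nesterov momentum
         + theta(k) / theta(k + 1) * (y_new - x))        # extra OGM momentum
    y = y_new
print(f(y))                       # decays at the O(L/k^2) rate
```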
SC-OGM Here, we assume that f is a \(\mu \)-strongly convex function, that the condition number of f is \(\kappa = L/\mu \), and that \(\gamma = \frac{\sqrt{8\kappa +1}+3}{2\kappa -2}\). SC-OGM is written as
for \(k=0,1,\dots \).
LC-OGM LC-OGM (Linear Coupling OGM) is defined as
for \(k=0,1,\dots \), where \(V_z(y)\) is a Bregman divergence, \(\{\alpha _k\}_{k=1}^\infty \) and \(\{\tau _k\}_{k=1}^\infty \) are nonnegative sequences defined as \(\alpha _1 = \frac{2}{L}\), \(0 \le \alpha _{k+1}^2L -2\alpha _{k+1} \le \alpha _k^2L\), \(\tau _k = \frac{2}{\alpha _{k+1}L}\), and Q is a positive definite matrix defining \(\left\Vert x\right\Vert ^2 = x^T Q x \).
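For concreteness, we recall the standard definition of the Bregman divergence generated by a differentiable convex distance-generating function \(h\) (the symbol \(h\) is our notation):

$$ V_z(y) = h(y) - h(z) - \langle \nabla h(z),\, y - z \rangle . $$

In particular, for \(h(x) = \frac{1}{2}x^T Q x\), this gives \(V_z(y) = \frac{1}{2}(y-z)^T Q (y-z) = \frac{1}{2}\left\Vert y-z\right\Vert ^2\) in the norm defined by \(Q\).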
For the last-step modification, we define positive sequences \(\{\tilde{\alpha }_k\}_{k=1}^\infty \) and \(\{\tilde{\tau }_k\}_{k=1}^\infty \) by \(\tilde{\alpha }_1 = \frac{1}{L}\), \(0 \le \tilde{\alpha }_{k+1}^2L - \tilde{\alpha }_{k+1} \le \frac{1}{2}\alpha _k^2L\), and \(\tilde{\tau }_k = \frac{1}{\tilde{\alpha }_{k+1}L} \), and also define
for \(k=1,2,\dots \).
Unification of AGM and OGM Using LC-OGM, we can unify AGM and OGM as
for \(k=0,1,\dots \). This is equivalent to
LC-SC-OGM LC-SC-OGM (Linear Coupling Strongly Convex OGM) is
for \(k=0,1,\dots \), where Q is a positive definite matrix.
Appendix B: Co-coercivity Inequality in General Norm
Lemma 7
Let f be a closed convex proper function. Then,
and
Proof
By the definition of the conjugate function,
and
Therefore,
The statement with u follows from the same argument and the fact that \(f^{**} = f\). \(\square \)
Lemma 8
Consider a norm \(\Vert \cdot \Vert \) and its dual norm \(\Vert \cdot \Vert _*\). Then,
and
Proof
This follows from Lemma 7 with \(f(x) = \frac{1}{2}\left\Vert x\right\Vert ^2\) and \(\left( \frac{1}{2}\left\Vert \cdot \right\Vert ^2\right) ^* = \frac{1}{2}\left\Vert \cdot \right\Vert _*^2\). \(\square \)
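For instance (our illustration), combining the conjugacy relation \(\left( \frac{1}{2}\left\Vert \cdot \right\Vert ^2\right) ^* = \frac{1}{2}\left\Vert \cdot \right\Vert _*^2\) with the Fenchel–Young inequality \(\langle u, x\rangle \le f(x) + f^*(u)\) yields

$$ \langle u, x \rangle \le \frac{1}{2}\left\Vert x \right\Vert ^2 + \frac{1}{2}\left\Vert u \right\Vert _*^2 \qquad \text{for all } x, u ; $$

e.g., \(\langle u, x\rangle \le \frac{1}{2}\Vert x\Vert _1^2 + \frac{1}{2}\Vert u\Vert _\infty ^2\), since the dual of the \(\ell _1\) norm is the \(\ell _\infty \) norm.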
Lemma 9
Let
Then,
Proof
Let \(z= L (\texttt{Grad}(x) - x)\). By the definition of \(\texttt{Grad}(x)\) and Lemma 8, we have
\(\square \)
Lemma 10
Let \(f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a differentiable convex function such that
for all \(x,y\in {\mathbb {R}}^n\). Then
Proof
Since a differentiable convex function is continuously differentiable [45, Theorem 25.5],
\(\square \)
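For reference (our addition), continuous differentiability is what justifies the line-integral representation

$$ f(y) = f(x) + \int _0^1 \langle \nabla f(x + t(y-x)),\, y-x \rangle \, dt , $$

which is the standard ingredient in proofs of upper bounds of this type.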
Lemma 11
(Co-coercivity inequality with general norm) Let \(f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a differentiable convex function such that
for all \(x,y\in {\mathbb {R}}^n\). Then
Proof
Set \(\phi (y) = f(y) - \langle \nabla f(x), y-x\rangle \). Then \(x \in {{\,\mathrm{arg\,min}\,}}\phi \). So by Lemma 9,
Substituting f back in \(\phi \) yields the co-coercivity inequality. \(\square \)
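As a sanity check (our addition), when \(\Vert \cdot \Vert \) is the Euclidean norm, which is self-dual, the conclusion of Lemma 11 should recover the classical co-coercivity inequality for an L-smooth convex f:

$$ f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{1}{2L} \left\Vert \nabla f(y) - \nabla f(x) \right\Vert _2^2 . $$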
Appendix C: Telescoping Sum Argument
Suppose we have established an inequality of the form
$$ a_i E_i + b_i F_i + G_i \le c_i E_{i-1} + d_i F_{i-1} $$
for \(i=1,2,\dots \), where \(E_i, F_i, G_i\) are nonnegative quantities and \(a_i\), \(b_i\), \(c_i\), and \(d_i\) are nonnegative scalars. Assume \(c_i \le a_{i-1}\) and \(d_i \le b_{i-1}\). By summing the inequalities for \(i=1, 2, \dots , k\), we obtain
$$ a_k E_k + b_k F_k + \sum _{i=1}^{k} G_i \le a_0 E_0 + b_0 F_0 . $$
However, note that the
$$ (a_{i-1} - c_i)\, E_{i-1} + (b_{i-1} - d_i)\, F_{i-1} $$
terms are wasted in the analysis. If one has the freedom to do so, it may be good to choose parameters so that \(c_i = a_{i-1}\) and \(d_i = b_{i-1}\) and \(E_i = 0\) for \(i=1,2,\dots \). Not having wasted terms may be an indication that the analysis is tight.
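The bookkeeping above is easy to check numerically. The following is a small sketch (our illustration); the random instance is generated so that every stated assumption holds, with the per-step inequality holding with equality.

```python
import random

# Numerical sanity check of the telescoping argument (illustrative random
# instance; E_i, F_i, G_i are built so the per-step inequality holds exactly).
random.seed(0)
k = 100
a = [random.uniform(1.0, 2.0) for _ in range(k + 1)]
b = [random.uniform(1.0, 2.0) for _ in range(k + 1)]
E, F, G_total = [1.0], [1.0], 0.0
for i in range(1, k + 1):
    c_i = random.uniform(0.0, a[i - 1])   # enforce c_i <= a_{i-1}
    d_i = random.uniform(0.0, b[i - 1])   # enforce d_i <= b_{i-1}
    budget = c_i * E[i - 1] + d_i * F[i - 1]
    # Split the budget into a_i E_i, b_i F_i, and G_i (all nonnegative),
    # so that a_i E_i + b_i F_i + G_i = c_i E_{i-1} + d_i F_{i-1}.
    cut1, cut2 = sorted(random.uniform(0.0, budget) for _ in range(2))
    E.append(cut1 / a[i])
    F.append((cut2 - cut1) / b[i])
    G_total += budget - cut2
lhs = a[k] * E[k] + b[k] * F[k] + G_total
rhs = a[0] * E[0] + b[0] * F[0]
print(f"{lhs:.6f} <= {rhs:.6f}: {lhs <= rhs + 1e-12}")
```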
Appendix D: SC-OGM via Linear Coupling
In this section, we analyze SC-OGM via linear coupling. We consider the linear coupling form
where \(\tau \) is a coupling coefficient to be determined. As an aside, we can view \(z_{k+1}\) as a mirror descent update of the form
which is similar to what was considered in [6].
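For reference (our addition), a generic mirror descent step with stepsize \(\alpha > 0\) and Bregman divergence \(V\) reads

$$ z_{k+1} = \mathop{\mathrm{arg\,min}}_{z} \left\{ \alpha \langle \nabla f(x_k),\, z \rangle + V_{z_k}(z) \right\} , $$

and the \(z_{k+1}\) update above is of this general type.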
Lemma 12
Assume (A1), (A2) and (A3). Then,
for \(k = 0,1, \dots \).
Proof
This proof follows steps similar to those of [6, Lemma 5.4].
From the definition of \(z_{k+1}\), we have
By the three-point identity,
Plugging in the definition of \(z_{k+1}\),
Combining the results above, we obtain
\(\square \)
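For reference (our addition), the three-point identity for Bregman divergences invoked above states that, for any points \(a, b, c\),

$$ V_a(c) = V_a(b) + V_b(c) + \langle \nabla h(b) - \nabla h(a),\, c - b \rangle , $$

where \(h\) is the distance-generating function; it follows by expanding each divergence from its definition.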
Lemma 13
(Coupling lemma in SC-OGM) Assume (A1), (A2) and (A3). Then
holds for \(k = 1, 2, \dots \).
Proof
We have
where the last inequality is an application of Lemma 12. Note that
To eliminate the \(\langle \nabla f(x_k), \cdot \rangle \) term, we choose \(\tau \) to satisfy
Plugging this in, the inequality above becomes
To obtain a telescoping form such as
we choose \(B_k = \frac{1}{2L}\) and \(C_k = \frac{\mu }{2}\), which leads to the choice of \(\gamma \) satisfying
We get the desired result by plugging (9) and (10) into the above inequality. \(\square \)
Appendix E: Asymptotic Characterization of \(\theta _k\)
Theorem 7
Let the positive sequence \(\{\theta _k\}_{k=0}^\infty \) satisfy \(\theta _0 = 1\) and \(\theta _{k+1}^2 - \theta _{k+1} - \theta _{k}^2=0\) for \(k = 0,1, \dots \). Then, \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + O(1)\); more precisely, \(\theta _k - \frac{k+2}{2} - \frac{\log k}{4}\) converges as \(k \rightarrow \infty \).
Proof
Let \(\theta _k = \frac{k+2}{2} + c_k\log k \). The proof consists of the following three steps:
1. If \(c_k < \frac{1}{4}\), then \(c_{k+1} < \frac{1}{4}\).
2. \(c_k \rightarrow \frac{1}{4}\) as \(k \rightarrow \infty \).
3. If \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + e_k\), then \(e_k\) is convergent.
First step If \(c_k < \frac{1}{4}\), then \(c_{k+1}<\frac{1}{4}\).
For convenience, set \(c_0=0\) with the convention \(c_0 \log 0 = 0\). Plugging \(\theta _k = \frac{k+2}{2} + c_k\log k\) into \(\theta _{k+1}^2 - \theta _{k+1} - \theta _{k}^2=0\), we have
so
Suppose, for contradiction, that \(c_{k+1}\ge 1/4\). Then
which is a contradiction; this proves the first claim.
Second step \(c_k \rightarrow \frac{1}{4}\) as \(k \rightarrow \infty \).
Put \(d_k = \frac{1}{4}-c_k\); then \(0 < d_k \le \frac{1}{4}\).
Therefore
By Taylor expansion,
Summing the above inequalities from \(1\) to \(k\),
so \(d_{k+1} < \frac{C}{\log (k+1)}\). In conclusion, as \(k \rightarrow \infty \), \(d_k \rightarrow 0\).
Third step If \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + e_k\), then \(e_k\) converges.
From the previous step, we have \(|e_k| < \frac{1}{6}\log k\) for all sufficiently large k.
Then,
So,
Summing this inequality over \(k\), we get that \(e_{k+1} > D\) for some constant D. Moreover,
which indicates that \(e_{k+1} < e_k\). Since \(\{e_k\}_{k=0}^\infty \) is a decreasing sequence with a lower bound, it converges. \(\square \)
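Theorem 7 is also easy to check numerically. Below is a short sketch (our illustration) that iterates the recursion \(\theta _{k+1} = \frac{1+\sqrt{1+4\theta _k^2}}{2}\) and prints the residual \(\theta _k - \frac{k+2}{2} - \frac{\log k}{4}\), which is expected to approach a constant:

```python
import math

# Numerical check of Theorem 7: theta_k - (k+2)/2 - (log k)/4 should converge.
theta = 1.0                        # theta_0 = 1
for k in range(1, 10**6 + 1):
    # theta_k is the positive root of t^2 - t - theta_{k-1}^2 = 0.
    theta = (1.0 + math.sqrt(1.0 + 4.0 * theta * theta)) / 2.0
    if k in (10**2, 10**4, 10**6):
        print(k, theta - (k + 2) / 2 - math.log(k) / 4)
```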
Proof of equality in Section 2.1 We have
which verifies the equality in Section 2.1.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Park, C., Park, J. & Ryu, E.K. Factor-\(\sqrt{2}\) Acceleration of Accelerated Gradient Methods. Appl Math Optim 88, 77 (2023). https://doi.org/10.1007/s00245-023-10047-9