Factor-\(\sqrt{2}\) Acceleration of Accelerated Gradient Methods

Abstract

The optimized gradient method (OGM) provides a factor-\(\sqrt{2}\) speedup over Nesterov’s celebrated accelerated gradient method in the convex (but non-strongly convex) setup. However, this improved acceleration mechanism has not been well understood; prior analyses of OGM relied on a computer-assisted proof methodology, so the proofs were opaque to humans despite being verifiable and correct. In this work, we present a new analysis of OGM based on a Lyapunov function and linear coupling. These analyses are developed and presented without the assistance of computers and are understandable by humans. Furthermore, we generalize OGM’s acceleration mechanism and obtain a factor-\(\sqrt{2}\) speedup in other setups: acceleration with a simpler rational stepsize, the strongly convex setup, and the mirror descent setup.

References

  1. Ahn, K., Sra, S.: From Nesterov’s estimate sequence to Riemannian acceleration. COLT (2020)

  2. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. STOC (2017)

  3. Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. ICML (2016)

  4. Allen-Zhu, Z., Orecchia, L.: Linear coupling: An ultimate unification of gradient and mirror descent. ITCS (2017)

  5. Allen-Zhu, Z., Lee, Y.T., Orecchia, L.: Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. SODA (2016)

  6. Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. ICML (2016)

  7. Aujol, J., Dossal, C.: Optimal rate of convergence of an ODE associated to the fast gradient descent schemes for \(b> 0\). HAL Archives Ouvertes (2017)

  8. Aujol, J.F., Dossal, C., Fort, G., Moulines, É.: Rates of convergence of perturbed FISTA-based algorithms. HAL Archives Ouvertes (2019)

  9. Aujol, J.F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. SIAM J. Optim. 29(4), 3131–3153 (2019)

  10. Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of the heavy-ball method for quasi-strongly convex optimization. SIAM J. Optim. 32(3), 1817–1842 (2021)

  11. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)

  12. Baes, M.: Estimate sequence methods: extensions and approximations. Tech. rep., Institute for Operations Research, ETH Zürich, Switzerland (2009)

  13. Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory Comput. 15(4), 1–32 (2019)

  14. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)

  15. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)

  16. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)

  17. De Klerk, E., Glineur, F., Taylor, A.B.: Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation. SIAM J. Optim. 30(3), 2053–2082 (2020)

  18. Dragomir, R.A., Taylor, A.B., d’Aspremont, A., Bolte, J.: Optimal complexity and certification of Bregman first-order methods. Math. Program. (2021)

  19. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)

  20. Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184(1), 183–220 (2020)

  21. Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. J. Complex. 68, 101590 (2022)

  22. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)

  23. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)

  24. Gu, G., Yang, J.: Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems. SIAM J. Optim. 30(3), 1905–1921 (2020)

  25. Kim, D.: Accelerated proximal point method for maximally monotone operators. Math. Program. 190(1–2), 57–87 (2021)

  26. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)

  27. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)

  28. Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. J. Optim. Theory Appl. 178(1), 240–263 (2018)

  29. Kim, D., Fessler, J.A.: Another look at the fast iterative shrinkage/thresholding algorithm (FISTA). SIAM J. Optim. 28(1), 223–250 (2018)

  30. Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization. SIAM J. Optim. 28(2), 1920–1950 (2018)

  31. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  32. Li, B., Coutiño, M., Giannakis, G.B.: Revisit of estimate sequence for accelerated gradient methods. ICASSP (2020)

  33. Lieder, F.: On the convergence rate of the Halpern-iteration. Optim. Lett. 15(2), 405–418 (2020)

  34. Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)

  35. Nemirovsky, A.S.: On optimality of Krylov’s information when solving linear operator equations. J. Complex. 7(2), 121–130 (1991)

  36. Nemirovsky, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)

  37. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)

  38. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(\cal{O} (1/k^2)\). Proc. USSR Acad. Sci. 269, 543–547 (1983)

  39. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Cham (2004)

  40. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  41. Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)

  42. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)

  43. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  44. Nesterov, Y., Stich, S.U.: Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM J. Optim. 27(1), 110–123 (2017)

  45. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

  46. Ryu, E.K., Yin, W.: Large-scale convex optimization via monotone operators. Draft (2021)

  47. Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection. SIAM J. Optim. 30(3), 2251–2271 (2020)

  48. Shi, B., Du, S.S., Su, W., Jordan, M.I.: Acceleration via symplectic discretization of high-resolution differential equations. NeurIPS (2019)

  49. Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions. (2019) arXiv:1903.05671

  50. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. NeurIPS (2014)

  51. Taylor, A.B., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. COLT (2019)

  52. Taylor, A., Drori, Y.: An optimal gradient method for smooth strongly convex minimization. Math. Program. 199(1–2), 557–594 (2022)

  53. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)

  54. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)

  55. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)

Acknowledgements

JP and EKR were supported by the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02) and the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [NRF-2022R1C1C1010010]. We thank Gyumin Roh for reviewing the manuscript and providing valuable feedback. We thank Bryan Van Scoy and Suvrit Sra for the discussions regarding the triple momentum method and estimate sequences, respectively.

Funding

Funding was provided by Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02) - National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [NRF-2022R1C1C1010010].

Author information

Corresponding author

Correspondence to Ernest K. Ryu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Method Reference

For reference, we restate all aforementioned methods. In all methods, we assume that f is an L-smooth function, \(\{ \theta _k \}_{k=0}^\infty \) and \(\{ \varphi _k \}_{k=0}^\infty \) are sequences of positive scalars, and \(x_0=y_0=z_0\).

OGM One form of OGM is

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})}\\ x_{k+1}&= y_{k+1} + \frac{\theta _{k}-1}{\theta _{k+1}}(y_{k+1}- y_k) + \frac{\theta _k}{\theta _{k+1}}(y_{k+1} - x_{k}) \end{aligned}$$

and an equivalent form with z-iterates is

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})}\\ z_{k+1}&= z_{k} - \frac{2\theta _k}{L}\nabla {f(x_{k})}\\ x_{k+1}&= \left( 1-\frac{1}{\theta _{k+1}}\right) y_{k+1} + \frac{1}{\theta _{k+1}}z_{k+1} \end{aligned}$$

for \(k=0,1,\dots \). The last-step modification on the secondary sequence can be written as

$$\begin{aligned} \tilde{x}_{k+1}&= y_{k+1} + \frac{\theta _{k}-1}{\varphi _{k+1}}(y_{k+1}- y_k) + \frac{\theta _k}{\varphi _{k+1}}(y_{k+1} - x_{k})\\&= \left( 1-\frac{1}{\varphi _{k+1}}\right) y_{k+1} + \frac{1}{\varphi _{k+1}}z_{k+1} \end{aligned}$$

where \(k=0,1,\dots \).
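
To make the two equivalent forms above concrete, the following is a minimal numerical sketch (not part of the paper): it runs the \(y\)-form and the \(z\)-form of OGM side by side on an illustrative least-squares problem, using the stepsize sequence \(\theta _0 = 1\), \(\theta _{k+1}^2 - \theta _{k+1} - \theta _k^2 = 0\) analyzed in Appendix E, and checks that the two forms produce the same iterates. The problem instance, dimensions, and horizon are arbitrary assumptions.

```python
import numpy as np

# Illustrative least-squares instance f(x) = 0.5*||Ax - b||^2 (not from the paper);
# L is the largest eigenvalue of A^T A, the Lipschitz constant of grad f.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
L = np.linalg.norm(A, 2) ** 2
grad = lambda x: A.T @ (A @ x - b)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2

# Stepsize sequence: theta_0 = 1 and theta_{k+1}^2 - theta_{k+1} - theta_k^2 = 0
N = 200
theta = [1.0]
for _ in range(N):
    theta.append((1 + np.sqrt(1 + 4 * theta[-1] ** 2)) / 2)

x = y = np.zeros(20)        # y-form iterates
xz = yz = z = np.zeros(20)  # z-form iterates (x_0 = y_0 = z_0)
for k in range(N):
    # y-form update
    y_new = x - grad(x) / L
    x = y_new + (theta[k] - 1) / theta[k + 1] * (y_new - y) \
              + theta[k] / theta[k + 1] * (y_new - x)
    y = y_new
    # z-form update
    g = grad(xz)
    yz = xz - g / L
    z = z - 2 * theta[k] / L * g
    xz = (1 - 1 / theta[k + 1]) * yz + 1 / theta[k + 1] * z

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("two forms agree:", np.allclose(x, xz))
print("f(y_N) - f(x_star):", f(y) - f(x_star))
```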

OGM-simple OGM-simple is a simpler variant of OGM with \(\theta _k = \frac{k+2}{2}\) and \(\varphi _k = \frac{k+1 +\frac{1}{\sqrt{2}}}{\sqrt{2}}\). One form of OGM-simple is

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})}\\ x_{k+1}&= y_{k+1} + \frac{k}{k+3}(y_{k+1}- y_k) + \frac{k+2}{k+3}(y_{k+1} - x_{k}) \end{aligned}$$

and an equivalent form with z-iterates is

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})}\\ z_{k+1}&= z_{k} - \frac{k+2}{L}\nabla {f(x_{k})}\\ x_{k+1}&= \left( 1-\frac{2}{{k+3}} \right) y_{k+1} + \frac{2}{k+3}z_{k+1} \end{aligned}$$

for \(k=0,1,\dots \). The last-step modification on secondary sequence is written as

$$\begin{aligned} \tilde{x}_{k+1}&= y_{k+1} + \frac{k}{\sqrt{2}(k+2) + 1}(y_{k+1} - y_{k}) + \frac{k+2}{\sqrt{2}(k+2) + 1}(y_{k+1} - x_k) \end{aligned}$$

where \(k=0,1,\dots \).
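
As a sanity check spelled out for the reader, substituting \(\theta _k = \frac{k+2}{2}\) and \(\varphi _{k+1} = \frac{k+2+\frac{1}{\sqrt{2}}}{\sqrt{2}} = \frac{\sqrt{2}(k+2)+1}{2}\) into the general OGM coefficients recovers the rational coefficients displayed above:

$$\begin{aligned} \frac{\theta _k-1}{\theta _{k+1}}&=\frac{\frac{k+2}{2}-1}{\frac{k+3}{2}}=\frac{k}{k+3}, \qquad \frac{\theta _k}{\theta _{k+1}}=\frac{k+2}{k+3}, \qquad \frac{2\theta _k}{L}=\frac{k+2}{L}, \qquad \frac{1}{\theta _{k+1}}=\frac{2}{k+3},\\ \frac{\theta _k-1}{\varphi _{k+1}}&=\frac{k/2}{\frac{\sqrt{2}(k+2)+1}{2}}=\frac{k}{\sqrt{2}(k+2)+1}, \qquad \frac{\theta _k}{\varphi _{k+1}}=\frac{k+2}{\sqrt{2}(k+2)+1}. \end{aligned}$$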

SC-OGM Here, we additionally assume that f is \(\mu \)-strongly convex, write \(\kappa = L/\mu \) for the condition number of f, and set \(\gamma = \frac{\sqrt{8\kappa +1}+3}{2\kappa -2}\). SC-OGM is written as

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})}\\ x_{k+1}&= y_{k+1} + \frac{1}{2\gamma + 1}(y_{k+1}-y_k) + \frac{1}{2\gamma + 1}(y_{k+1} - x_{k}) \end{aligned}$$

for \(k=0,1,\dots \).

LC-OGM LC-OGM (Linear Coupling OGM) is defined as

$$\begin{aligned} y_{k+1}&= x_k-L^{-1}Q^{-1}\nabla f(x_k)\\ z_{k+1}&= \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{y\in {\mathbb {R}}^n}\left\{ V_{z_k}(y)+\langle \alpha _{k+1}\nabla f(x_k),y-x_k\rangle \right\} \\ x_{k+1}&= (1-\tau _{k+1}) y_{k+1} + \tau _{k+1} z_{k+1} \end{aligned}$$

for \(k=0,1,\dots \), where \(V_z(y)\) is a Bregman divergence, \(\{\alpha _k\}_{k=1}^\infty \) and \(\{\tau _k\}_{k=1}^\infty \) are nonnegative sequences defined as \(\alpha _1 = \frac{2}{L}\), \(0 \le \alpha _{k+1}^2L -2\alpha _{k+1} \le \alpha _k^2L\), \(\tau _k = \frac{2}{\alpha _{k+1}L}\), and Q is a positive definite matrix defining \(\left\Vert x\right\Vert ^2 = x^T Q x \).
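
For concreteness (this special case is only an illustration), if the Bregman divergence is generated by \(\frac{1}{2}\left\Vert \cdot \right\Vert ^2\) with \(\left\Vert x\right\Vert ^2 = x^TQx\), i.e. \(V_z(y) = \frac{1}{2}\left\Vert y-z\right\Vert ^2\), then the first-order optimality condition of the \(z\)-update gives it in closed form:

$$\begin{aligned} 0 = Q(z_{k+1}-z_k) + \alpha _{k+1}\nabla f(x_k) \qquad \Longleftrightarrow \qquad z_{k+1} = z_k - \alpha _{k+1}Q^{-1}\nabla f(x_k). \end{aligned}$$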

For the last-step modification, we define positive sequences \(\{\tilde{\alpha }_k\}_{k=1}^\infty \) and \(\{\tilde{\tau }_k\}_{k=1}^\infty \) by \(\tilde{\alpha }_1 = \frac{1}{L}\), \(0 \le \tilde{\alpha }_{k+1}^2L - \tilde{\alpha }_{k+1} \le \frac{1}{2}\alpha _k^2L\), and \(\tilde{\tau }_k = \frac{1}{\tilde{\alpha }_{k+1}L} \), and also define

$$\begin{aligned} \tilde{x}_k = (1-\tilde{\tau }_k)y_k + \tilde{\tau }_k z_k \end{aligned}$$

for \(k=1,2,\dots \).

Unification of AGM and OGM Using LC-OGM, we can unify AGM and OGM as

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})} \\ z_{k+1}&= z_{k} - \frac{2t\theta _k}{L}\nabla {f(x_{k})}\\ x_{k+1}&= \left( 1-\frac{1}{\theta _{k+1}}\right) y_{k+1} + \frac{1}{\theta _{k+1}}z_{k+1}. \end{aligned}$$

for \(k=0,1,\dots \). This is equivalent to

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})} \\ x_{k+1}&= y_{k+1} + \frac{\theta _{k}-1}{\theta _{k+1}}(y_{k+1}- y_k)+(2t-1)\frac{\theta _k}{\theta _{k+1}}(y_{k+1} - x_{k}). \end{aligned}$$
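
To record the role of \(t\) explicitly (a remark for the reader's convenience): with \(t=1\), the \(z\)-update \(z_{k+1} = z_k - \frac{2t\theta _k}{L}\nabla f(x_k)\) reduces to the OGM form restated at the beginning of this appendix, while with \(t=\frac{1}{2}\) the coefficient \(2t-1\) vanishes and the method reduces to the familiar AGM/FISTA-type momentum update (presumably the AGM endpoint of this unification),

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}\nabla {f(x_{k})}\\ x_{k+1}&= y_{k+1} + \frac{\theta _{k}-1}{\theta _{k+1}}(y_{k+1}- y_k). \end{aligned}$$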

LC-SC-OGM LC-SC-OGM (Linear coupling strongly convex OGM) is

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}Q^{-1}\nabla {f(x_{k})}\\ z_{k+1}&= \frac{1}{1+\gamma }\left( z_k + \gamma x_{k} -\frac{\gamma }{\mu }Q^{-1}\nabla f(x_{k})\right) \\ x_{k+1}&= \tau z_{k+1} + (1-\tau ) y_{k+1}, \end{aligned}$$

for \(k=0,1,\dots \), where Q is a positive definite matrix.

Appendix B: Co-coercivity Inequality in General Norm

Lemma 7

Let f be a closed convex proper function. Then,

$$\begin{aligned} 0 \le f(x) + f^* (u) - \langle x, u \rangle \end{aligned}$$

and

$$\begin{aligned}{} & {} \inf _{x} \{f(x) + f^*(u) - \langle x, u \rangle \} = 0\\{} & {} \inf _{u} \{f(x) + f^*(u) - \langle x, u \rangle \} = 0. \end{aligned}$$

Proof

By the definition of the conjugate function,

$$\begin{aligned} -f^*(u) = \inf _x \left\{ f(x) - \langle x, u \rangle \right\} \end{aligned}$$

which is equivalent to

$$\begin{aligned} \inf _{x} \{f(x) + f^*(u) - \langle x, u \rangle \} = 0. \end{aligned}$$

Therefore,

$$\begin{aligned} 0 \le f(x) + f^* (u) - \langle x, u \rangle \quad \forall x. \end{aligned}$$

The statement with u follows from the same argument and the fact that \(f^{**} = f\). \(\square \)

Lemma 8

Consider a norm \(\Vert \cdot \Vert \) and its dual norm \(\Vert \cdot \Vert _*\). Then,

$$\begin{aligned} 0 \le \frac{1}{2}\left\Vert x\right\Vert ^2 + \frac{1}{2} \left\Vert u\right\Vert _*^2 - \langle x, u \rangle \end{aligned}$$

and

$$\begin{aligned}{} & {} \inf _{x \in {\mathbb {R}}^n} \left\{ \frac{1}{2}\left\Vert x\right\Vert ^2 + \frac{1}{2} \left\Vert u\right\Vert _*^2 - \langle x, u \rangle \right\} =0\\{} & {} \inf _{u \in {\mathbb {R}}^n} \left\{ \frac{1}{2}\left\Vert x\right\Vert ^2 + \frac{1}{2} \left\Vert u\right\Vert _*^2 - \langle x, u \rangle \right\} = 0. \end{aligned}$$

Proof

This follows from Lemma 7 with \(f(x) = \frac{1}{2}\left\Vert x\right\Vert ^2\) and \(\left( \frac{1}{2}\left\Vert \cdot \right\Vert ^2\right) ^* = \frac{1}{2}\left\Vert \cdot \right\Vert _*^2\). \(\square \)

Lemma 9

Let

$$\begin{aligned} \texttt{Grad}(x) = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{y \in {\mathbb {R}}^n}\left\{ \frac{L}{2}\left\Vert y-x\right\Vert ^2 + \langle \nabla f(x), y-x \rangle \right\} . \end{aligned}$$

Then,

$$\begin{aligned} \langle \nabla f(x), \texttt{Grad}(x) - x \rangle + \frac{L}{2}\left\Vert \texttt{Grad}(x) - x \right\Vert ^2 = -\frac{1}{2L}\left\Vert \nabla f(x)\right\Vert _*^2. \end{aligned}$$

Proof

Let \(z= L (\texttt{Grad}(x) - x)\). By the definition of \(\texttt{Grad}(x)\) and Lemma 8, we have

$$\begin{aligned} \frac{1}{2L}\left\Vert \nabla f(x)\right\Vert _*^2 +&\frac{L}{2} \left\Vert \texttt{Grad}(x) - x\right\Vert ^2 + \langle \nabla f(x) , \texttt{Grad}(x) - x \rangle \\&= \inf _{z \in {\mathbb {R}}^n} \frac{1}{2L}\left\Vert \nabla f(x)\right\Vert _*^2 + \frac{1}{2L} \left\Vert z\right\Vert ^2 + \frac{1}{L} \langle \nabla f(x), z \rangle \\&= 0. \end{aligned}$$

\(\square \)
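
For instance (an illustrative special case), when \(\Vert \cdot \Vert \) is the Euclidean norm, so that \(\Vert \cdot \Vert _* = \Vert \cdot \Vert \) and \(\texttt{Grad}(x) = x - \frac{1}{L}\nabla f(x)\), the identity of Lemma 9 reads

$$\begin{aligned} \left\langle \nabla f(x), -\tfrac{1}{L}\nabla f(x) \right\rangle + \frac{L}{2}\left\Vert \tfrac{1}{L}\nabla f(x)\right\Vert ^2 = -\frac{1}{L}\left\Vert \nabla f(x)\right\Vert ^2 + \frac{1}{2L}\left\Vert \nabla f(x)\right\Vert ^2 = -\frac{1}{2L}\left\Vert \nabla f(x)\right\Vert ^2. \end{aligned}$$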

Lemma 10

Let \(f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a differentiable convex function such that

$$\begin{aligned} \left\Vert \nabla f(x) - \nabla f(y)\right\Vert _* \le L \left\Vert x-y\right\Vert \end{aligned}$$

for all \(x,y\in {\mathbb {R}}^n\). Then

$$\begin{aligned} f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2}\left\Vert y-x\right\Vert ^2. \end{aligned}$$

Proof

Since a differentiable convex function is continuously differentiable [45, Theorem 25.5],

$$\begin{aligned} f(y) - f(x)&= \int _0^1 \langle \nabla f(x + t(y-x)) , y-x \rangle dt\\&= \int _0^1 \langle \nabla f(x + t(y-x)) - \nabla f(x) , y-x \rangle dt + \langle \nabla f(x), y-x \rangle \\&\le \int _0^1 \left\Vert \nabla f(x + t(y-x)) - \nabla f(x)\right\Vert _* \left\Vert y-x\right\Vert dt + \langle \nabla f(x), y-x \rangle \\&\le \int _0^1 t L \left\Vert y-x\right\Vert ^2 dt + \langle \nabla f(x), y-x \rangle = \frac{L}{2} \left\Vert y-x\right\Vert ^2 + \langle \nabla f(x) , y-x \rangle . \end{aligned}$$

\(\square \)

Lemma 11

(Co-coercivity inequality with general norm) Let \(f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a differentiable convex function such that

$$\begin{aligned} \left\Vert \nabla f(x) - \nabla f(y)\right\Vert _* \le L \left\Vert x-y\right\Vert \end{aligned}$$

for all \(x,y\in {\mathbb {R}}^n\). Then

$$\begin{aligned} f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle + \frac{1}{2L}\left\Vert \nabla f(x)- \nabla f(y)\right\Vert _*^2. \end{aligned}$$

Proof

Set \(\phi (y) = f(y) - \langle \nabla f(x), y-x\rangle \), so that \(\nabla \phi (y) = \nabla f(y) - \nabla f(x)\) satisfies the same Lipschitz bound as \(\nabla f\) and \(x \in {{\,\mathrm{arg\,min}\,}}\phi \). Applying Lemma 10 and Lemma 9 to \(\phi \), with \(\texttt{Grad}\) defined through \(\nabla \phi \), we get

$$\begin{aligned} \phi (x)&\le \phi (\texttt{Grad}(y))\\&\le \phi (y) + \langle \nabla \phi (y), \texttt{Grad}(y) -y \rangle + \frac{L}{2}\left\Vert \texttt{Grad}(y) - y\right\Vert ^2\\&= \phi (y) - \frac{1}{2L}\left\Vert \nabla \phi (y)\right\Vert _*^2. \end{aligned}$$

Substituting f back in \(\phi \) yields the co-coercivity inequality. \(\square \)
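
For completeness, the final substitution is the following: since \(\phi (x) = f(x)\) and \(\nabla \phi (y) = \nabla f(y) - \nabla f(x)\), the chain of inequalities above reads

$$\begin{aligned} f(x) \le f(y) - \langle \nabla f(x), y-x \rangle - \frac{1}{2L}\left\Vert \nabla f(y)- \nabla f(x)\right\Vert _*^2, \end{aligned}$$

which rearranges to the claimed inequality.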

Appendix C: Telescoping Sum Argument

Suppose we have established the inequality

$$\begin{aligned} a_i F_i + b_i G_i \le c_i F_{i-1} + d_i G_{i-1} - E_i \end{aligned}$$

for \(i=1,2,\dots \), where \(E_i, F_i, G_i\) are nonnegative quantities and \(a_i\), \(b_i\), \(c_i\), and \(d_i\) are nonnegative scalars. Assume \(c_i \le a_{i-1}\) and \(d_i \le b_{i-1}\). By summing the inequalities for \(i=1, 2, \dots , k\), we obtain

$$\begin{aligned} a_k F_k&\le - b_k G_k - \sum _{i=2}^{k} (a_{i-1} - c_i) F_{i-1} - \sum _{i=2}^k (b_{i-1} - d_i) G_{i-1} - \sum _{i=1}^k E_i + c_1 F_{0} + d_1 G_{0}\\&\le c_1 F_{0} + d_1 G_{0}. \end{aligned}$$

However, note that the

$$\begin{aligned} - b_k G_k - \sum _{i=2}^{k} (a_{i-1} - c_i) F_{i-1} - \sum _{i=2}^k (b_{i-1} - d_i) G_{i-1} - \sum _{i=1}^k E_i \end{aligned}$$

terms are wasted in the analysis. If one has the freedom to do so, it may be good to choose parameters so that

$$\begin{aligned} a_{i-1} = c_i,\, b_{i-1} = d_i \end{aligned}$$

and \(E_i = 0\) for \(i=1,2,\dots \). Not having wasted terms may be an indication that the analysis is tight.
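
The following short script is an illustrative sketch (not from the paper): it generates synthetic nonnegative data satisfying the one-step inequality with the choice \(c_i = a_{i-1}\), \(d_i = b_{i-1}\) and checks the telescoped conclusion \(a_kF_k \le c_1F_0 + d_1G_0\). All numerical choices are arbitrary assumptions.

```python
import random

random.seed(0)
k = 20

# Nonnegative coefficients with the "no waste" choice c_i = a_{i-1}, d_i = b_{i-1}.
a = [random.uniform(0.5, 2.0) for _ in range(k + 1)]
b = [random.uniform(0.5, 2.0) for _ in range(k + 1)]
c = [None] + [a[i - 1] for i in range(1, k + 1)]
d = [None] + [b[i - 1] for i in range(1, k + 1)]

# Build F_i, G_i, E_i >= 0 so that the one-step inequality holds with equality:
#   a_i F_i + b_i G_i = c_i F_{i-1} + d_i G_{i-1} - E_i.
F, G, E = [1.0], [1.0], [None]
for i in range(1, k + 1):
    rhs = c[i] * F[i - 1] + d[i] * G[i - 1]
    E.append(0.1 * rhs)                 # waste part of the budget as E_i
    budget = rhs - E[i]
    t = random.uniform(0.0, 1.0)        # split the rest between F_i and G_i
    F.append(t * budget / a[i])
    G.append((1.0 - t) * budget / b[i])

# Telescoped conclusion of the argument: a_k F_k <= c_1 F_0 + d_1 G_0.
print(a[k] * F[k], "<=", c[1] * F[0] + d[1] * G[0])
assert a[k] * F[k] <= c[1] * F[0] + d[1] * G[0] + 1e-12
```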

Appendix D: SC-OGM via Linear Coupling

In this section, we analyze SC-OGM through the linear coupling framework. We consider the linear coupling form

$$\begin{aligned} y_{k+1}&= x_{k} - \frac{1}{L}Q^{-1}\nabla {f(x_{k})}\\ z_{k+1}&= \frac{1}{1+\gamma }\left( z_k + \gamma x_{k} -\frac{\gamma }{\mu }Q^{-1}\nabla f(x_{k})\right) \\ x_{k+1}&= \tau z_{k+1} + (1-\tau ) y_{k+1}, \end{aligned}$$

where \(\tau \) is a coupling coefficient to be determined. As an aside, we can view \(z_{k+1}\) as a mirror descent update of the form

$$\begin{aligned} z_{k+1} =\mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{z} \left\{ \frac{1}{2}\left\Vert z-z_k\right\Vert ^2 + \frac{\gamma }{2}\left\Vert z-x_k\right\Vert ^2 +\frac{\gamma }{\mu } \langle \nabla f(x_k), z \rangle \right\} , \end{aligned}$$

which is similar to what was considered in [6].

Lemma 12

Assume (A1), (A2) and (A3). Then,

$$\begin{aligned}&\frac{\gamma }{\mu }\langle \nabla f(x_{k}) , z_{k+1} - x_\star \rangle - \frac{\gamma }{2}\left\Vert x_k - x_\star \right\Vert ^2\\&\quad \le -\frac{\gamma ^2}{2(1+\gamma )\mu ^2}\left\Vert \nabla f(x_k)\right\Vert _*^2+ \frac{1}{2} \left\Vert z_k - x_\star \right\Vert ^2 - \frac{1+\gamma }{2} \left\Vert z_{k+1} - x_\star \right\Vert ^2 \end{aligned}$$

for \(k = 0,1, \dots \).

Proof

This proof follows steps similar to those of [6, Lemma 5.4].

From the optimality condition defining \(z_{k+1}\), we have

$$\begin{aligned} 0=&\left\langle \frac{\partial }{\partial z}\left\{ \frac{1}{2}\left\Vert z-z_k\right\Vert ^2 + \frac{\gamma }{2}\left\Vert z-x_k\right\Vert ^2 +\frac{\gamma }{\mu } \langle \nabla f(x_k), z \rangle \right\} \bigg |_{z_{k+1}}, z_{k+1} - x_\star \right\rangle \\ =&\langle Q(z_{k+1} - z_k), z_{k+1} - x_\star \rangle + \frac{\gamma }{\mu } \langle \nabla f(x_{k}), z_{k+1} - x_\star \rangle + \gamma \langle Q(z_{k+1} - x_k),z_{k+1} - x_\star \rangle \end{aligned}$$

By the three-point identity,

$$\begin{aligned}&\frac{\gamma }{\mu }\langle \nabla f(x_{k}) ,z_{k+1} - x_\star \rangle + \gamma \left( \frac{1}{2} \left\Vert x_k - z_{k+1}\right\Vert ^2 - \frac{1}{2} \left\Vert x_k - x_\star \right\Vert ^2 \right) \\&\quad = -\frac{1}{2} \left\Vert z_{k}- z_{k+1}\right\Vert ^2 + \frac{1}{2} \left\Vert z_k - x_\star \right\Vert ^2 - \frac{1+\gamma }{2} \left\Vert z_{k+1} - x_\star \right\Vert ^2. \end{aligned}$$

Plugging in the definition of \(z_{k+1}\),

$$\begin{aligned}&\frac{\gamma }{2} \left\Vert x_k - z_{k+1}\right\Vert ^2 + \frac{1}{2}\left\Vert z_k - z_{k+1}\right\Vert ^2 \\&\quad = \frac{\gamma }{2} \left\Vert \frac{1}{1+\gamma }(x_k - z_k) + \frac{\gamma }{(1+\gamma )\mu } Q^{-1}\nabla f(x_k)\right\Vert ^2 \\&\qquad + \frac{1}{2}\left\Vert -\frac{\gamma }{1+\gamma }(x_k - z_k) + \frac{\gamma }{(1+\gamma )\mu }Q^{-1}\nabla f(x_k)\right\Vert ^2\\&\quad \ge \frac{\gamma ^2}{2(1+\gamma )\mu ^2}\left\Vert \nabla f(x_k)\right\Vert _*^2. \end{aligned}$$

Combining the results above, we get

$$\begin{aligned}&\frac{\gamma }{\mu }\langle \nabla f(x_{k}) , z_{k+1} - x_\star \rangle - \frac{\gamma }{2}\left\Vert x_k - x_\star \right\Vert ^2\\&\quad \le -\frac{\gamma ^2}{2(1+\gamma )\mu ^2}\left\Vert \nabla f(x_k)\right\Vert _*^2+ \frac{1}{2} \left\Vert z_k - x_\star \right\Vert ^2 - \frac{1+\gamma }{2} \left\Vert z_{k+1} - x_\star \right\Vert ^2 . \end{aligned}$$

\(\square \)

Lemma 13

(Coupling lemma in SC-OGM) Assume (A1), (A2) and (A3). Then

$$\begin{aligned}&(1+\gamma )\biggl (f(x_k) - \frac{1}{2L} \left\Vert \nabla f(x_k)\right\Vert _*^2 + \frac{\mu }{2}\left\Vert z_k - x_\star \right\Vert ^2 \biggr ) \\&\quad \le \left( f(x_{k-1}) - \frac{1}{2L}\left\Vert \nabla f(x_{k-1})\right\Vert _*^2 + \frac{\mu }{2}\left\Vert z_{k-1} - x_\star \right\Vert ^2 \right) \end{aligned}$$

holds for \(k = 1, 2, \dots \).

Proof

We have

$$\begin{aligned}&\gamma \left( f(x_{k}) - f(x_\star ) \right) \\&\quad \le \gamma \langle \nabla f(x_{k}), x_{k}- x_\star \rangle - \frac{\mu \gamma }{2}\left\Vert x_k - x_\star \right\Vert ^2 \\&\quad = \gamma \langle \nabla f(x_{k}), x_{k}- z_{k}\rangle + \gamma \langle \nabla f(x_{k}), z_{k}- x_\star \rangle - \frac{\mu \gamma }{2}\left\Vert x_k - x_\star \right\Vert ^2\\&\quad = \frac{1-\tau }{\tau }\gamma \langle \nabla f(x_k) ,y_k - x_k \rangle + \gamma \langle \nabla f(x_{k}), z_{k}- x_\star \rangle - \frac{\mu \gamma }{2}\left\Vert x_k - x_\star \right\Vert ^2\\&\quad = \frac{1-\tau }{\tau }\gamma \langle \nabla f(x_k) , x_{k-1} - x_{k} - \frac{1}{L}Q^{-1}\nabla f(x_{k-1}) \rangle \\&\qquad + \gamma \langle \nabla f(x_{k}), z_{k}- x_\star \rangle - \frac{\mu \gamma }{2}\left\Vert x_k - x_\star \right\Vert ^2 \\&\quad \le \left( \frac{1-\tau }{\tau }\gamma - 1 \right) \langle \nabla f(x_k) , x_{k-1} - x_{k} - \frac{1}{L}Q^{-1}\nabla f(x_{k-1}) \rangle \\&\qquad +\left( f(x_{k-1}) - f(x_k) - \frac{1}{2L}\left\Vert \nabla f(x_{k-1})\right\Vert _*^2 - \frac{1}{2L}\left\Vert \nabla f(x_k)\right\Vert _*^2\right) \\&\qquad +\gamma \langle \nabla f(x_{k}), z_{k}-z_{k+1} \rangle + \gamma \langle \nabla f(x_{k}), z_{k+1}- x_\star \rangle - \frac{\mu \gamma }{2}\left\Vert x_k - x_\star \right\Vert ^2\\&\quad \le \left( \frac{1-\tau }{\tau }\gamma - 1 \right) \langle \nabla f(x_k) , y_{k} - x_{k}\rangle \\&\qquad +\left( f(x_{k-1}) - f(x_k) - \frac{1}{2L}\left\Vert \nabla f(x_{k-1})\right\Vert _*^2 - \frac{1}{2L}\left\Vert \nabla f(x_k)\right\Vert _*^2\right) \\&\qquad +\gamma \langle \nabla f(x_{k}), z_{k}-z_{k+1} \rangle -\frac{\gamma ^2}{2(1+\gamma )\mu }\left\Vert \nabla f(x_k)\right\Vert _*^2\\&\qquad +\frac{\mu }{2} \left\Vert z_k - x_\star \right\Vert ^2 - \frac{(1+\gamma )\mu }{2} \left\Vert z_{k+1} - x_\star \right\Vert ^2, \end{aligned}$$

where the last inequality is an application of Lemma 12. Note that

$$\begin{aligned} z_k - z_{k+1}&= z_k - \frac{1}{1+\gamma }\left( z_k + \gamma x_k -\frac{\gamma }{\mu }Q^{-1}\nabla f(x_k) \right) \\&= \frac{\gamma }{1+\gamma }(z_k-x_k) + \frac{\gamma }{(1+\gamma )\mu } Q^{-1} \nabla f(x_k) \\&= \frac{\gamma }{1+\gamma }\frac{1-\tau }{\tau }(x_k - y_k) + \frac{\gamma }{(1+\gamma )\mu } Q^{-1} \nabla f(x_k). \end{aligned}$$

To eliminate the \(\langle \nabla f(x_k), \cdot \rangle \) term, we choose \(\tau \) to satisfy

$$\begin{aligned} \frac{1-\tau }{\tau }\gamma - 1 = \frac{\gamma }{1+\gamma }\frac{1-\tau }{\tau }. \end{aligned}$$
(9)

Plugging this in, the inequality above is

$$\begin{aligned}&\gamma \left( f(x_{k}) - f(x_\star ) \right) \\&\quad \le \left( f(x_{k-1}) - f(x_k) - \frac{1}{2L}\left\Vert \nabla f(x_{k-1})\right\Vert _*^2 - \frac{1}{2L}\left\Vert \nabla f(x_k)\right\Vert _*^2\right) \\&\quad +\frac{\gamma ^2}{2(1+\gamma )\mu }\left\Vert \nabla f(x_k)\right\Vert _*^2+ \frac{\mu }{2} \left\Vert z_k - x_\star \right\Vert ^2 - \frac{(1+\gamma )\mu }{2} \left\Vert z_{k+1} - x_\star \right\Vert ^2. \end{aligned}$$

In order to obtain a telescoping form such as

$$\begin{aligned}&M_{k}\biggl (f(x_{k}) - B_{k}\left\Vert \nabla f(x_{k})\right\Vert _*^2 + C_k \left\Vert z_{k+1} - x_\star \right\Vert ^2 \biggr ) \\&\quad \le N_{k-1}\left( f(x_{k-1}) - B_{k-1} \left\Vert \nabla f(x_{k-1})\right\Vert _*^2+ C_{k-1} \left\Vert z_{k} - x_\star \right\Vert ^2 \right) , \end{aligned}$$

we choose \(B_k = \frac{1}{2L}\) and \(C_k = \frac{\mu }{2}\), which leads to the choice of \(\gamma \) satisfying

$$\begin{aligned} \frac{2+\gamma }{2L} = \frac{\gamma ^2}{2(1+\gamma )\mu }. \end{aligned}$$
(10)

We get the desired result by plugging (9) and (10) in the above inequality. \(\square \)
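
For reference, (10) determines \(\gamma \) explicitly and is consistent with the value of \(\gamma \) used for SC-OGM in Appendix A: clearing denominators in (10) and writing \(\kappa = L/\mu \) gives

$$\begin{aligned} (2+\gamma )(1+\gamma )\mu = \gamma ^2 L \quad \Longleftrightarrow \quad (\kappa -1)\gamma ^2 - 3\gamma - 2 = 0 \quad \Longleftrightarrow \quad \gamma = \frac{\sqrt{8\kappa +1}+3}{2\kappa -2}, \end{aligned}$$

where we take the positive root.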

Appendix E: Asymptotic Characterization of \(\theta _k\)

Theorem 7

Let the positive sequence \(\{\theta _k\}_{k=0}^\infty \) satisfy \(\theta _0 = 1\) and \(\theta _{k+1}^2 - \theta _{k+1} - \theta _{k}^2=0\) for \(k = 0,1, \dots \). Then,

$$\begin{aligned} \theta _k = \frac{k+\zeta +1}{2} + \frac{\log k}{4} + o(1). \end{aligned}$$

Proof

Let \(\theta _k = \frac{k+2}{2} + c_k\log k \). The proof consists of the following 3 steps:

  1. If \(c_k < \frac{1}{4}\), then \(c_{k+1} < \frac{1}{4}\).

  2. \(c_k \rightarrow \frac{1}{4}\) as \(k \rightarrow \infty \).

  3. If \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + e_k\), then \(e_k\) is convergent.

First step If \(c_k < \frac{1}{4}\), then \(c_{k+1}<\frac{1}{4}\).

For convenience, let \(c_0=0\) with the convention \(c_0 \log 0 = 0\). Plugging \(\theta _k = \frac{k+2}{2} + c_k\log k\) into \(\theta _{k+1}^2 - \theta _{k+1} - \theta _{k}^2=0\), we have

$$\begin{aligned} \left( \frac{k+2}{2} + c_{k+1} \log (k+1) \right) ^2 = \left( \frac{k+2}{2} + c_{k}\log k \right) ^2 + \frac{1}{4}, \end{aligned}$$

so

$$\begin{aligned} \left( c_{k+1} \log (k+1) - c_k \log k \right) \left( k+2+c_{k+1}\log (k+1) + c_k\log k \right) = \frac{1}{4}. \end{aligned}$$

Assume \(c_{k+1}\ge 1/4\). Then

$$\begin{aligned} \frac{1}{4}&= \left( c_{k+1}\log (k+1) - c_k \log k \right) \left( k+2+c_{k+1}\log (k+1) + c_k \log k \right) \\&\ge \frac{1}{4}\log \left( 1+\frac{1}{k}\right) (k+2)\\&>\frac{1}{4}, \end{aligned}$$

which proves the first claim.

Second step \(c_k \rightarrow \frac{1}{4}\) as \(k \rightarrow \infty \).

Put \(d_k = \frac{1}{4}-c_k\); then \(0 < d_k \le \frac{1}{4}\) and

$$\begin{aligned} \frac{1}{4}&=\left( \frac{1}{4}\log \left( 1+\frac{1}{k}\right) -d_{k+1}\log (k+1) + d_k \log k \right) \\&\quad \left( k+2+\frac{1}{4}\log k(k+1) -d_{k+1}\log (k+1) - d_k \log k \right) \\&\le \left( \frac{1}{4}\log \left( 1+\frac{1}{k}\right) -d_{k+1}\log (k+1) + d_k \log k \right) \left( k+2+\frac{1}{2}\log (k+1) \right) \end{aligned}$$

Therefore

$$\begin{aligned} d_{k+1} \log (k+1) - d_k \log k \le \frac{1}{4} \log \left( 1+\frac{1}{k}\right) - \frac{1}{4}\frac{1}{k+2+\frac{1}{2}\log (k+1)}. \end{aligned}$$

By Taylor expansion,

$$\begin{aligned} d_{k+1} \log (k+1) - d_k \log k \le \frac{1}{4} \left( \frac{3+2\log k}{2k^2} + {\mathcal {O}}\left( \frac{1}{k^2}\right) \right) . \end{aligned}$$

Summing the above inequality from \(1\) to \(k\), the left-hand side telescopes and the right-hand side is bounded by a constant \(C\) since \(\sum _{k} \frac{\log k}{k^2} < \infty \). Therefore,

$$\begin{aligned} d_{k+1} \log (k+1) \le C \end{aligned}$$

so \(d_{k+1} \le \frac{C}{\log (k+1)}\) and hence \(d_k \rightarrow 0\) as \(k \rightarrow \infty \).

Third step If \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + e_k\), then \(e_k\) converges.

From the previous step, \(|e_k| < \frac{1}{6}\log k\) for all sufficiently large \(k\); we restrict attention to such \(k\) below. The recursion \(\theta _{k+1}^2 - \theta _{k+1} - \theta _{k}^2=0\) becomes

$$\begin{aligned} \left( \frac{k+2}{2} + \frac{1}{4}\log (k+1) + e_{k+1} \right) ^2 = \left( \frac{k+2}{2} + \frac{1}{4} \log k + e_{k} \right) ^2 + \frac{1}{4} \end{aligned}$$

Then,

$$\begin{aligned} \frac{1}{4}&=\left( \frac{1}{4}\log \left( 1+\frac{1}{k}\right) +e_{k+1} - e_k \right) \left( k+2+\frac{1}{4}\log k(k+1) + e_{k+1} + e_k \right) \\&\le \left( \frac{1}{4}\log \left( 1+\frac{1}{k}\right) +e_{k+1} - e_k \right) \left( k+2+\frac{5}{6}\log (k+1) \right) . \end{aligned}$$

So,

$$\begin{aligned} e_{k+1} - e_k \ge \frac{1}{4\left( k+2+ \frac{5}{6}\log (k+1) \right) } - \frac{1}{4}\log \left( 1+ \frac{1}{k} \right) = -\frac{\frac{5}{6}\log k + \frac{3}{2}}{k^2} + {\mathcal {O}}\left( \frac{1}{k^2}\right) . \end{aligned}$$

Since the right-hand side is summable, summing this inequality from a sufficiently large starting index up to \(k\) shows that \(e_{k+1} \ge D\) for some constant \(D\), so \(\{e_k\}\) is bounded below. Moreover,

$$\begin{aligned} \frac{1}{4}&=\left( \frac{1}{4}\log \left( 1+\frac{1}{k}\right) +e_{k+1} - e_k \right) \left( k+2+\frac{1}{4}\log k(k+1) + e_{k+1} + e_k \right) \\&\ge \left( \frac{1}{4}\log \left( 1+\frac{1}{k}\right) +e_{k+1} - e_k \right) \left( k+2\right) >\frac{1}{4} + (k+2)(e_{k+1} - e_k), \end{aligned}$$

which implies that \(e_{k+1} < e_k\). Since \(\{e_k\}_{k=0}^\infty \) is eventually decreasing and bounded below, it converges. \(\square \)
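
As an informal numerical check of Theorem 7 (the horizon and the printed indices are arbitrary assumptions), iterating \(\theta _{k+1} = \frac{1+\sqrt{1+4\theta _k^2}}{2}\) and printing the residual \(\theta _k - \frac{k+2}{2} - \frac{\log k}{4}\) shows it settling to a constant, consistent with the convergence of \(e_k\) established above.

```python
import math

# theta_0 = 1 and theta_{k+1}^2 - theta_{k+1} - theta_k^2 = 0,
# i.e. theta_{k+1} = (1 + sqrt(1 + 4*theta_k^2)) / 2.
theta = 1.0
for k in range(1, 10**6 + 1):
    theta = (1.0 + math.sqrt(1.0 + 4.0 * theta * theta)) / 2.0
    if k in (10**3, 10**4, 10**5, 10**6):
        # e_k = theta_k - (k+2)/2 - log(k)/4 should approach a constant
        print(k, theta - (k + 2) / 2 - math.log(k) / 4)
```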

Proof of equality in Section 2.1 We have

$$\begin{aligned} \frac{L\left\Vert x_0 - x_\star \right\Vert ^2}{2\theta _{k-1}^2}&= \frac{L\left\Vert x_0 - x_\star \right\Vert ^2}{2\left( \frac{k+\zeta }{2} + \frac{\log (k-1)}{4} + o(1)\right) ^2}\\&= \frac{2L\left\Vert x_0 - x_\star \right\Vert ^2}{(k+ \zeta )^2 \left( 1 + \frac{\log (k-1)}{2(k+\zeta )} + o(1/k)\right) ^2}\\&= \frac{2L\left\Vert x_0 - x_\star \right\Vert ^2}{(k+ \zeta )^2} \left( 1 - 2 \frac{\log (k-1)}{2(k+\zeta )} + o(1/k)\right) \\&= \frac{2L\left\Vert x_0 - x_\star \right\Vert ^2}{(k+\zeta )^2} - \frac{2L\left\Vert x_0 - x_\star \right\Vert ^2\log k}{(k+\zeta )^3} + o\left( \frac{1}{k^3}\right) , \end{aligned}$$

which verifies the equality in Section 2.1.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Park, C., Park, J. & Ryu, E.K. Factor-\(\sqrt{2}\) Acceleration of Accelerated Gradient Methods. Appl Math Optim 88, 77 (2023). https://doi.org/10.1007/s00245-023-10047-9
