Abstract
The optimized gradient method (OGM) provides a factor-\(\sqrt{2}\) speedup over Nesterov’s celebrated accelerated gradient method in the convex (but non-strongly convex) setup. However, this improved acceleration mechanism has not been well understood; prior analyses of OGM relied on a computer-assisted proof methodology, so the proofs were opaque to humans despite being verifiable and correct. In this work, we present a new analysis of OGM based on a Lyapunov function and linear coupling. These analyses are developed and presented without the assistance of computers and are understandable by humans. Furthermore, we generalize OGM’s acceleration mechanism and obtain a factor-\(\sqrt{2}\) speedup in other setups: acceleration with a simpler rational stepsize, the strongly convex setup, and the mirror descent setup.
References
Ahn, K., Sra, S.: From Nesterov’s estimate sequence to Riemannian acceleration. COLT (2020)
Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. STOC (2017)
Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. ICML (2016)
Allen-Zhu, Z., Orecchia, L.: Linear coupling: An ultimate unification of gradient and mirror descent. ITCS (2017)
Allen-Zhu, Z., Lee, Y.T., Orecchia, L.: Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. SODA (2016)
Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. ICML (2016)
Aujol, J., Dossal, C.: Optimal rate of convergence of an ODE associated to the fast gradient descent schemes for \(b> 0\). HAL Archives Ouvertes (2017)
Aujol, J.F., Dossal, C., Fort, G., Moulines, É.: Rates of convergence of perturbed FISTA-based algorithms. HAL Archives Ouvertes (2019)
Aujol, J.F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. SIAM J. Optim. 29(4), 3131–3153 (2019)
Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of the heavy-ball method for quasi-strongly convex optimization. SIAM J. Optim. 32(3), 1817–1842 (2022)
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Baes, M.: Estimate sequence methods: extensions and approximations. Tech. rep, Institute for Operations Research, ETH, Zürich, Switzerland (2009)
Bansal, N., Gupta, A.: Potential-function proofs for gradient methods. Theory Comput. 15(4), 1–32 (2019)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)
Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175 (2003)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
De Klerk, E., Glineur, F., Taylor, A.B.: Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation. SIAM J. Optim. 30(3), 2053–2082 (2020)
Dragomir, R.A., Taylor, A.B., d’Aspremont, A., Bolte, J.: Optimal complexity and certification of Bregman first-order methods. Math. Program. (2021)
Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)
Drori, Y., Taylor, A.B.: Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184(1), 183–220 (2020)
Drori, Y., Taylor, A.: On the oracle complexity of smooth strongly convex minimization. J. Complex. 68, 101590 (2022)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)
Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)
Gu, G., Yang, J.: Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems. SIAM J. Optim. 30(3), 1905–1921 (2020)
Kim, D.: Accelerated proximal point method for maximally monotone operators. Math. Program. 190(1–2), 57–87 (2021)
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)
Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)
Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. J. Optim. Theory Appl. 178(1), 240–263 (2018)
Kim, D., Fessler, J.A.: Another look at the fast iterative shrinkage/thresholding algorithm (FISTA). SIAM J. Optim. 28(1), 223–250 (2018)
Kim, D., Fessler, J.A.: Generalizing the optimized gradient method for smooth convex minimization. SIAM J. Optim. 28(2), 1920–1950 (2018)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Li, B., Coutiño, M., Giannakis, G.B.: Revisit of estimate sequence for accelerated gradient methods. ICASSP (2020)
Lieder, F.: On the convergence rate of the Halpern-iteration. Optim. Lett. 15(2), 405–418 (2021)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)
Nemirovsky, A.S.: On optimality of Krylov’s information when solving linear operator equations. J. Complex. 7(2), 121–130 (1991)
Nemirovsky, A.S.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(\cal{O} (1/k^2)\). Proc. USSR Acad. Sci. 269, 543–547 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Cham (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y., Stich, S.U.: Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM J. Optim. 27(1), 110–123 (2017)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Ryu, E.K., Yin, W.: Large-scale convex optimization via monotone operators. Draft (2021)
Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection. SIAM J. Optim. 30(3), 2251–2271 (2020)
Shi, B., Du, S.S., Su, W., Jordan, M.I.: Acceleration via symplectic discretization of high-resolution differential equations. NeurIPS (2019)
Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions. arXiv:1903.05671 (2019)
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. NeurIPS (2014)
Taylor, A.B., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. COLT (2019)
Taylor, A., Drori, Y.: An optimal gradient method for smooth strongly convex minimization. Math. Program. 199(1–2), 557–594 (2023)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)
Acknowledgements
JP and EKR were supported by the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02) and the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [NRF-2022R1C1C1010010]. We thank Gyumin Roh for reviewing the manuscript and providing valuable feedback. We thank Bryan Van Scoy and Suvrit Sra for the discussions regarding the triple momentum method and estimate sequences, respectively.
Funding
Funding was provided by the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02) and the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [NRF-2022R1C1C1010010].
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Method Reference
For reference, we restate all aforementioned methods. In all methods, we assume that f is an L-smooth function, that \(\{ \theta _k \}_{k=0}^\infty \) and \(\{ \varphi _k \}_{k=0}^\infty \) are sequences of positive scalars, and that \(x_0=y_0=z_0\).
OGM One form of OGM is
and an equivalent form with z-iterates is
for \(k=0,1,\dots \). The last-step modification on the secondary sequence can be written as
where \(k=0,1,\dots \).
OGM-simple OGM-simple is a simpler variant of OGM with \(\theta _k = \frac{k+2}{2}\) and \(\varphi _k = \frac{k+1 +\frac{1}{\sqrt{2}}}{\sqrt{2}}\). One form of OGM-simple is
and an equivalent form with z-iterates is
for \(k=0,1,\dots \). The last-step modification on the secondary sequence is written as
where \(k=0,1,\dots \).
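To make the recursion concrete, the following is a minimal numerical sketch, assuming the standard two-sequence OGM update of Kim and Fessler with the rational stepsize \(\theta _k = \frac{k+2}{2}\) substituted in; the quadratic test objective and the omission of the last-step modification are our simplifications.

```python
import numpy as np

# Illustrative sketch of OGM-simple, assuming the standard two-sequence OGM
# update of Kim and Fessler with the rational stepsize theta_k = (k+2)/2.
# (The last-step modification with phi_k is omitted in this sketch.)
A = np.diag([1.0, 10.0, 100.0])   # quadratic test problem; L = lambda_max(A)
L = 100.0
grad = lambda x: A @ x
f = lambda x: 0.5 * x @ A @ x

theta = lambda k: (k + 2) / 2
x = y = np.ones(3)                # x_0 = y_0
for k in range(300):
    y_new = x - grad(x) / L       # gradient step
    x = (y_new
         + (theta(k) - 1) / theta(k + 1) * (y_new - y)   # Nesterov momentum
         + theta(k) / theta(k + 1) * (y_new - x))        # extra OGM momentum
    y = y_new
print(f(y))                       # decays at the O(L/k^2) rate
```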
SC-OGM Here, we assume that f is a \(\mu \)-strongly convex function, that the condition number of f is \(\kappa = L/\mu \), and that \(\gamma = \frac{\sqrt{8\kappa +1}+3}{2\kappa -2}\). SC-OGM is written as
for \(k=0,1,\dots \).
LC-OGM LC-OGM (Linear Coupling OGM) is defined as
for \(k=0,1,\dots \), where \(V_z(y)\) is a Bregman divergence, \(\{\alpha _k\}_{k=1}^\infty \) and \(\{\tau _k\}_{k=1}^\infty \) are nonnegative sequences defined as \(\alpha _1 = \frac{2}{L}\), \(0 \le \alpha _{k+1}^2L -2\alpha _{k+1} \le \alpha _k^2L\), \(\tau _k = \frac{2}{\alpha _{k+1}L}\), and Q is a positive definite matrix defining \(\left\Vert x\right\Vert ^2 = x^T Q x \).
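For concreteness, we recall the standard definition of the Bregman divergence generated by a differentiable convex distance-generating function \(h\) (the symbol \(h\) is our notation):

$$ V_z(y) = h(y) - h(z) - \langle \nabla h(z),\, y - z \rangle . $$

In particular, for \(h(x) = \frac{1}{2}x^T Q x\), this gives \(V_z(y) = \frac{1}{2}(y-z)^T Q (y-z) = \frac{1}{2}\left\Vert y-z\right\Vert ^2\) in the norm defined by \(Q\).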
For the last-step modification, we define positive sequences \(\{\tilde{\alpha }_k\}_{k=1}^\infty \) and \(\{\tilde{\tau }_k\}_{k=1}^\infty \) by \(\tilde{\alpha }_1 = \frac{1}{L}\), \(0 \le \tilde{\alpha }_{k+1}^2L - \tilde{\alpha }_{k+1} \le \frac{1}{2}\alpha _k^2L\), and \(\tilde{\tau }_k = \frac{1}{\tilde{\alpha }_{k+1}L} \), and also define
for \(k=1,2,\dots \).
Unification of AGM and OGM Using LC-OGM, we can unify AGM and OGM as
for \(k=0,1,\dots \). This is equivalent to
LC-SC-OGM LC-SC-OGM (Linear Coupling Strongly Convex OGM) is
for \(k=0,1,\dots \), where Q is a positive definite matrix.
Appendix B: Co-coercivity Inequality in General Norm
Lemma 7
Let f be a closed convex proper function. Then,
and
Proof
By the definition of the conjugate function,
and
Therefore,
The statement with u follows from the same argument and the fact that \(f^{**} = f\). \(\square \)
Lemma 8
Consider a norm \(\Vert \cdot \Vert \) and its dual norm \(\Vert \cdot \Vert _*\). Then,
and
Proof
This follows from Lemma 7 with \(f(x) = \frac{1}{2}\left\Vert x\right\Vert ^2\) and \(\left( \frac{1}{2}\left\Vert \cdot \right\Vert ^2\right) ^* = \frac{1}{2}\left\Vert \cdot \right\Vert _*^2\). \(\square \)
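For instance (our illustration), combining the conjugacy relation \(\left( \frac{1}{2}\left\Vert \cdot \right\Vert ^2\right) ^* = \frac{1}{2}\left\Vert \cdot \right\Vert _*^2\) with the Fenchel–Young inequality \(\langle u, x\rangle \le f(x) + f^*(u)\) yields

$$ \langle u, x \rangle \le \frac{1}{2}\left\Vert x \right\Vert ^2 + \frac{1}{2}\left\Vert u \right\Vert _*^2 \qquad \text{for all } x, u ; $$

e.g., \(\langle u, x\rangle \le \frac{1}{2}\Vert x\Vert _1^2 + \frac{1}{2}\Vert u\Vert _\infty ^2\), since the dual of the \(\ell _1\) norm is the \(\ell _\infty \) norm.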
Lemma 9
Let
Then,
Proof
Let \(z= L (\texttt{Grad}(x) - x)\). By the definition of \(\texttt{Grad}(x)\) and Lemma 8, we have
\(\square \)
Lemma 10
Let \(f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a differentiable convex function such that
for all \(x,y\in {\mathbb {R}}^n\). Then
Proof
Since a differentiable convex function is continuously differentiable [45, Theorem 25.5],
\(\square \)
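For reference (our addition), continuous differentiability is what justifies the line-integral representation

$$ f(y) = f(x) + \int _0^1 \langle \nabla f(x + t(y-x)),\, y-x \rangle \, dt , $$

which is the standard ingredient in proofs of upper bounds of this type.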
Lemma 11
(Co-coercivity inequality with general norm) Let \(f: {\mathbb {R}}^n \rightarrow {\mathbb {R}}\) be a differentiable convex function such that
for all \(x,y\in {\mathbb {R}}^n\). Then
Proof
Set \(\phi (y) = f(y) - \langle \nabla f(x), y-x\rangle \). Then \(x \in {{\,\mathrm{arg\,min}\,}}\phi \). So by Lemma 9,
Substituting f back in \(\phi \) yields the co-coercivity inequality. \(\square \)
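As a sanity check (our addition), when \(\Vert \cdot \Vert \) is the Euclidean norm, which is self-dual, the conclusion of Lemma 11 should recover the classical co-coercivity inequality for an L-smooth convex f:

$$ f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{1}{2L} \left\Vert \nabla f(y) - \nabla f(x) \right\Vert _2^2 . $$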
Appendix C: Telescoping Sum Argument
Suppose we have established an inequality of the form
$$ a_i E_i + b_i F_i + G_i \le c_i E_{i-1} + d_i F_{i-1} $$
for \(i=1,2,\dots \), where \(E_i, F_i, G_i\) are nonnegative quantities and \(a_i\), \(b_i\), \(c_i\), and \(d_i\) are nonnegative scalars. Assume \(c_i \le a_{i-1}\) and \(d_i \le b_{i-1}\). By summing the inequalities for \(i=1, 2, \dots , k\), we obtain
$$ a_k E_k + b_k F_k + \sum _{i=1}^{k} G_i \le a_0 E_0 + b_0 F_0 . $$
However, note that the
$$ (a_{i-1} - c_i)\, E_{i-1} + (b_{i-1} - d_i)\, F_{i-1} $$
terms are wasted in the analysis. If one has the freedom to do so, it may be good to choose parameters so that \(c_i = a_{i-1}\) and \(d_i = b_{i-1}\) and \(E_i = 0\) for \(i=1,2,\dots \). Not having wasted terms may be an indication that the analysis is tight.
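The bookkeeping above is easy to check numerically. The following is a small sketch (our illustration); the random instance is generated so that every stated assumption holds, with the per-step inequality holding with equality.

```python
import random

# Numerical sanity check of the telescoping argument (illustrative random
# instance; E_i, F_i, G_i are built so the per-step inequality holds exactly).
random.seed(0)
k = 100
a = [random.uniform(1.0, 2.0) for _ in range(k + 1)]
b = [random.uniform(1.0, 2.0) for _ in range(k + 1)]
E, F, G_total = [1.0], [1.0], 0.0
for i in range(1, k + 1):
    c_i = random.uniform(0.0, a[i - 1])   # enforce c_i <= a_{i-1}
    d_i = random.uniform(0.0, b[i - 1])   # enforce d_i <= b_{i-1}
    budget = c_i * E[i - 1] + d_i * F[i - 1]
    # Split the budget into a_i E_i, b_i F_i, and G_i (all nonnegative),
    # so that a_i E_i + b_i F_i + G_i = c_i E_{i-1} + d_i F_{i-1}.
    cut1, cut2 = sorted(random.uniform(0.0, budget) for _ in range(2))
    E.append(cut1 / a[i])
    F.append((cut2 - cut1) / b[i])
    G_total += budget - cut2
lhs = a[k] * E[k] + b[k] * F[k] + G_total
rhs = a[0] * E[0] + b[0] * F[0]
print(f"{lhs:.6f} <= {rhs:.6f}: {lhs <= rhs + 1e-12}")
```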
Appendix D: SC-OGM via Linear Coupling
In this section, we analyze SC-OGM via linear coupling. We consider the linear coupling form
where \(\tau \) is a coupling coefficient to be determined. As an aside, we can view \(z_{k+1}\) as a mirror descent update of the form
which is similar to what was considered in [6].
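For reference (our addition), a generic mirror descent step with stepsize \(\alpha > 0\) and Bregman divergence \(V\) reads

$$ z_{k+1} = \mathop{\mathrm{arg\,min}}_{z} \left\{ \alpha \langle \nabla f(x_k),\, z \rangle + V_{z_k}(z) \right\} , $$

and the \(z_{k+1}\) update above is of this general type.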
Lemma 12
Assume (A1), (A2) and (A3). Then,
for \(k = 0,1, \dots \).
Proof
This proof follows steps similar to those of [6, Lemma 5.4].
From the definition of \(z_{k+1}\), we have
By the three-point identity,
Plugging in the definition of \(z_{k+1}\),
Combining the results above, we obtain
\(\square \)
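For reference (our addition), the three-point identity for Bregman divergences invoked above states that, for any points \(a, b, c\),

$$ V_a(c) = V_a(b) + V_b(c) + \langle \nabla h(b) - \nabla h(a),\, c - b \rangle , $$

where \(h\) is the distance-generating function; it follows by expanding each divergence from its definition.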
Lemma 13
(Coupling lemma in SC-OGM) Assume (A1), (A2) and (A3). Then
holds for \(k = 1, 2, \dots \).
Proof
We have
where the last inequality is an application of Lemma 12. Note that
To eliminate the \(\langle \nabla f(x_k), \cdot \rangle \) term, we choose \(\tau \) to satisfy
Plugging this in, the inequality above becomes
To obtain a telescoping form such as
we choose \(B_k = \frac{1}{2L}\) and \(C_k = \frac{\mu }{2}\), which leads to the choice of \(\gamma \) satisfying
We get the desired result by plugging (9) and (10) into the above inequality. \(\square \)
Appendix E: Asymptotic Characterization of \(\theta _k\)
Theorem 7
Let the positive sequence \(\{\theta _k\}_{k=0}^\infty \) satisfy \(\theta _0 = 1\) and \(\theta _{k+1}^2 - \theta _{k+1} - \theta _{k}^2=0\) for \(k = 0,1, \dots \). Then, \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + O(1)\); more precisely, \(\theta _k - \frac{k+2}{2} - \frac{\log k}{4}\) converges as \(k \rightarrow \infty \).
Proof
Let \(\theta _k = \frac{k+2}{2} + c_k\log k \). The proof consists of the following three steps:
1. If \(c_k < \frac{1}{4}\), then \(c_{k+1} < \frac{1}{4}\).
2. \(c_k \rightarrow \frac{1}{4}\) as \(k \rightarrow \infty \).
3. If \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + e_k\), then \(e_k\) is convergent.
First step If \(c_k < \frac{1}{4}\), then \(c_{k+1}<\frac{1}{4}\).
For convenience, set \(c_0=0\) with the convention \(c_0 \log 0 = 0\). Plugging \(\theta _k = \frac{k+2}{2} + c_k\log k\) into \(\theta _{k+1}^2 - \theta _{k+1} - \theta _{k}^2=0\), we have
so
Suppose, for contradiction, that \(c_{k+1}\ge 1/4\). Then
which is a contradiction; this proves the first claim.
Second step \(c_k \rightarrow \frac{1}{4}\) as \(k \rightarrow \infty \).
Put \(d_k = \frac{1}{4}-c_k\); then \(0 < d_k \le \frac{1}{4}\).
Therefore
By Taylor expansion,
Summing the above inequalities from \(1\) to \(k\),
so \(d_{k+1} < \frac{C}{\log (k+1)}\). In conclusion, as \(k \rightarrow \infty \), \(d_k \rightarrow 0\).
Third step If \(\theta _k = \frac{k+2}{2} + \frac{\log k}{4} + e_k\), then \(e_k\) converges.
From the previous step, we have \(|e_k| < \frac{1}{6}\log k\) for all sufficiently large k.
Then,
So,
Summing this inequality over \(k\), we get that \(e_{k+1} > D\) for some constant D. Moreover,
which indicates that \(e_{k+1} < e_k\). Since \(\{e_k\}_{k=0}^\infty \) is a decreasing sequence with a lower bound, it converges. \(\square \)
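Theorem 7 is also easy to check numerically. Below is a short sketch (our illustration) that iterates the recursion \(\theta _{k+1} = \frac{1+\sqrt{1+4\theta _k^2}}{2}\) and prints the residual \(\theta _k - \frac{k+2}{2} - \frac{\log k}{4}\), which is expected to approach a constant:

```python
import math

# Numerical check of Theorem 7: theta_k - (k+2)/2 - (log k)/4 should converge.
theta = 1.0                        # theta_0 = 1
for k in range(1, 10**6 + 1):
    # theta_k is the positive root of t^2 - t - theta_{k-1}^2 = 0.
    theta = (1.0 + math.sqrt(1.0 + 4.0 * theta * theta)) / 2.0
    if k in (10**2, 10**4, 10**6):
        print(k, theta - (k + 2) / 2 - math.log(k) / 4)
```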
Proof of equality in Section 2.1 We have
which verifies the equality in Section 2.1.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Park, C., Park, J. & Ryu, E.K. Factor-\(\sqrt{2}\) Acceleration of Accelerated Gradient Methods. Appl Math Optim 88, 77 (2023). https://doi.org/10.1007/s00245-023-10047-9