Abstract
Recently, semidefinite programming performance estimation has been employed as a powerful tool for the worst-case performance analysis of first-order methods. In this paper, we derive new non-ergodic convergence rates for the alternating direction method of multipliers (ADMM) by using performance estimation. We give examples which show the exactness of the given bounds. We also study the linear and R-linear convergence of ADMM in terms of the dual objective. We establish that, in the presence of strong convexity, ADMM enjoys a global linear convergence rate if and only if the dual objective satisfies the Polyak–Łojasiewicz (PŁ) inequality. In addition, we give an explicit formula for the linear convergence rate factor. Moreover, we study the R-linear convergence of ADMM under two scenarios.
1 Introduction
We consider the optimization problem
where \(f: {\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{\infty \}\) and \(g: {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{\infty \}\) are closed proper convex functions, \(0\ne A\in {\mathbb {R}}^{r\times n}\), \(0\ne B\in {\mathbb {R}}^{r\times m}\) and \(b\in {\mathbb {R}}^{r}\). Moreover, we assume that \((x^\star , z^\star )\) is an optimal solution of problem (1) and \(\lambda ^\star \) is a corresponding Lagrange multiplier. We denote the values of f and g at \(x^\star \) and \(z^\star \) by \(f^\star \) and \(g^\star \), respectively.
Problem (1) appears naturally (or after variable splitting) in many applications in statistics, machine learning and image processing, to name but a few [9, 23, 29, 42]. The most common method for solving problem (1) is the alternating direction method of multipliers (ADMM). ADMM is a dual-based approach that exploits the separable structure of the problem; it may be described as follows.
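The iteration can be sketched in code as follows (a minimal Python illustration, not the paper's Algorithm 1 verbatim: we assume quadratic \(f(x)=\tfrac{c_1}{2}\Vert x\Vert ^2\) and \(g(z)=\tfrac{c_2}{2}\Vert z\Vert ^2\) so that both subproblems have closed-form minimizers; the multiplier update matches \(\lambda ^k=\lambda ^{k-1}+t(Ax^k+Bz^k-b)\) used later in the paper).

```python
import numpy as np

def admm_quadratic(c1, c2, A, B, b, t, z0, lam0, iters=200):
    """One possible ADMM loop for min f(x)+g(z) s.t. Ax+Bz=b, specialized
    to f(x) = (c1/2)||x||^2 and g(z) = (c2/2)||z||^2 so that both
    subproblems have closed-form minimizers."""
    z, lam = z0.astype(float), lam0.astype(float)
    n, m = A.shape[1], B.shape[1]
    x = np.zeros(n)
    for _ in range(iters):
        # Step 1: x^k = argmin_x f(x) + <lam, Ax> + (t/2)||Ax + Bz - b||^2
        x = np.linalg.solve(c1 * np.eye(n) + t * A.T @ A,
                            -A.T @ lam - t * A.T @ (B @ z - b))
        # Step 2: z^k = argmin_z g(z) + <lam, Bz> + (t/2)||Ax + Bz - b||^2
        z = np.linalg.solve(c2 * np.eye(m) + t * B.T @ B,
                            -B.T @ lam - t * B.T @ (A @ x - b))
        # Step 3: multiplier update lam^k = lam^{k-1} + t(Ax^k + Bz^k - b)
        lam = lam + t * (A @ x + B @ z - b)
    return x, z, lam
```

For instance, with \(A=B=I\) and \(c_1=c_2=1\), the iterates converge to \(x=z=b/2\), the minimizer of \(\tfrac{1}{2}\Vert x\Vert ^2+\tfrac{1}{2}\Vert z\Vert ^2\) subject to \(x+z=b\).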
ADMM was first proposed in [14, 16] for solving nonlinear variational problems. We refer the interested reader to [17] for a historical review of ADMM. The popularity of ADMM is due to its amenability to parallel implementation, which allows it to handle large-scale problems [9, 22, 34, 45]. For example, it is used for solving inverse problems governed by partial differential equation forward models [32] and for distributed energy resource coordination [30], to mention but a few.
The convergence of ADMM has been investigated extensively in the literature, and many convergence results exist. However, different performance measures have been used for the computation of convergence rates; see [13, 18, 19, 24, 28, 29, 35, 44]. In this paper, we consider the dual objective value as a performance measure.
Throughout the paper, we assume that each subproblem in steps 1 and 2 of Algorithm 1 attains its minimum. The Lagrangian function of problem (1) may be written as
$$\begin{aligned} L(x, z, \lambda )=f(x)+g(z)+\langle \lambda , Ax+Bz-b\rangle , \end{aligned}$$
and the dual objective of problem (1) is defined as
$$\begin{aligned} D(\lambda )=\min _{x, z}\; f(x)+g(z)+\langle \lambda , Ax+Bz-b\rangle . \end{aligned}$$
We assume throughout the paper that strong duality holds for problem (1), that is,
$$\begin{aligned} \max _{\lambda \in {\mathbb {R}}^r}\; D(\lambda )=f^\star +g^\star . \end{aligned}$$
Note that we have strong duality when both functions f and g are real-valued. For extended convex functions, strong duality holds under some mild conditions; see e.g. [4, Chapter 15].
Some common performance measures for the analysis of ADMM are as follows:

- Objective value: \(\left| f(x^N)+g(z^N)-f^\star -g^\star \right| \);
- Primal and dual feasibility: \(\left\| Ax^N+Bz^N-b\right\| \) and \(\left\| A^TB(z^N-z^{N-1})\right\| \);
- Dual objective value: \(D(\lambda ^\star )-D(\lambda ^N)\);
- Distance between \((x^N, z^N, \lambda ^N)\) and a saddle point of problem (2).
Note that the mathematical expressions are written in a non-ergodic sense for convenience. Each measure is useful in monitoring the progress and convergence of ADMM. The objective value is the most commonly used performance measure for the analysis of algorithms in convex optimization [4, 5, 37]. As mentioned earlier, ADMM is a dual-based method, and it may be interpreted as a proximal method applied to the dual problem; see [5, 29] for further discussion and insights. Thus, a natural performance measure for ADMM is the dual objective value. In this study, we investigate the convergence rate of ADMM in terms of the dual objective value and feasibility. It is worth noting that most performance measures may be analyzed through the framework developed in Sect. 2.
Regarding dual objective value, the following convergence rate is known in the literature. This theorem holds for strongly convex functions f and g; recall that f is called strongly convex with modulus \(\mu \ge 0\) if the function \(f-\tfrac{\mu }{2}\Vert \cdot \Vert ^2\) is convex.
Theorem 1
[19, Theorem 1] Let f and g be strongly convex with moduli \(\mu _1>0\) and \(\mu _2>0\), respectively. If \(t\le \root 3 \of {\frac{\mu _1\mu _2^2}{\lambda _{\max } (A^TA)\lambda _{\max }^2 (B^TB)}}\), then
In this study we establish that Algorithm 1 has the convergence rate of \(O(\tfrac{1}{N})\) in terms of dual objective value without assuming the strong convexity of g. Under this setting, we also prove that Algorithm 1 has the convergence rate of \(O(\tfrac{1}{N})\) in terms of primal and dual residuals. Moreover, we show that the given bounds are exact. Furthermore, we study the linear and R-linear convergence.
1.1 Outline of our paper
Our paper is structured as follows. We present the semidefinite programming (SDP) performance estimation method in Sect. 2, and we develop performance estimation to handle dual-based methods including ADMM. In Sect. 3, we derive some new non-asymptotic convergence rates by using performance estimation for ADMM in terms of the dual function and the primal and dual residuals. Furthermore, we show that the given bounds are tight by providing some examples. In Sect. 4 we proceed with the study of the linear convergence of ADMM. We establish that ADMM enjoys linear convergence if and only if the dual function satisfies the PŁ inequality when the objective function is strongly convex. Furthermore, we investigate the relation between the PŁ inequality and common conditions used in the literature to prove linear convergence. Section 5 is devoted to R-linear convergence. We prove that ADMM converges R-linearly under two new scenarios which are weaker than the existing ones in the literature.
1.2 Terminology and notation
In this subsection we review some definitions and concepts from convex analysis. The interested reader is referred to the classical text by Rockafellar [41] for more information. The n-dimensional Euclidean space is denoted by \({\mathbb {R}}^n\). We use \(\langle \cdot , \cdot \rangle \) and \(\Vert \cdot \Vert \) to denote the Euclidean inner product and norm, respectively. The column vector \(e_i\) represents the i-th standard unit vector and I stands for the identity matrix. For a matrix A, \(A_{i, j}\) denotes its (i, j)-th entry, and \(A^T\) represents the transpose of A. The notation \(A\succeq 0\) means the matrix A is symmetric positive semidefinite. We use \(\lambda _{\max } (A)\) and \(\lambda _{\min } (A)\) to denote the largest and the smallest eigenvalue of symmetric matrix A, respectively. Moreover, the seminorm \(\Vert \cdot \Vert _A\) is defined as \(\Vert x\Vert _A=\Vert Ax\Vert \) for any \(A\in {\mathbb {R}}^{m\times n}\); see [26, Section 5.2] for more discussion.
Suppose that \(f:{\mathbb {R}}^n\rightarrow (-\infty , \infty ]\) is an extended convex function. The function f is called closed if its epigraph is closed, that is, \(\{(x, r): f(x)\le r\}\) is a closed subset of \({\mathbb {R}}^{n+1}\). The function f is said to be proper if there exists \(x\in {\mathbb {R}}^n\) with \(f(x)<\infty \). We denote the set of proper and closed convex functions on \({\mathbb {R}}^n\) by \({\mathcal {F}}_{0}({\mathbb {R}}^n)\). The subdifferential of f at x is denoted by \(\partial f(x)\) and defined as
$$\begin{aligned} \partial f(x)=\left\{ \xi \in {\mathbb {R}}^n: f(y)\ge f(x)+\langle \xi , y-x\rangle \ \ \forall y\in {\mathbb {R}}^n\right\} . \end{aligned}$$
We call a differentiable function f L-smooth if for any \(x_1, x_2\in {\mathbb {R}}^n\),
$$\begin{aligned} \Vert \nabla f(x_1)-\nabla f(x_2)\Vert \le L\Vert x_1-x_2\Vert . \end{aligned}$$
Definition 1
Let \(f:{\mathbb {R}}^n\rightarrow (-\infty , \infty ]\) be a closed proper function and let \(A\in {\mathbb {R}}^{m\times n}\). We say f is c-strongly convex relative to \(\Vert .\Vert _A\) if the function \(f-\tfrac{c}{2} \Vert . \Vert _A^2\) is convex.
In the rest of the section, we assume that \(A\in {\mathbb {R}}^{m\times n}\). It is seen that any \(\mu \)-strongly convex function is \(\tfrac{\mu }{\lambda _{\max }(A^TA)}\)-strongly convex relative to \(\Vert .\Vert _A\). The converse does not necessarily hold unless A has full column rank. Hence, strong convexity relative to \(\Vert .\Vert _A\) for a given matrix A is a weaker assumption than strong convexity. For further details on relative strong convexity, we refer the reader to [3, 33]. We denote the set of c-strongly convex functions relative to \(\Vert .\Vert _A\) on \({\mathbb {R}}^n\) by \({\mathcal {F}}_{c}^A({\mathbb {R}}^n)\). We denote the distance function to the set X by \(d_X(x):=\inf _{y\in X}\Vert y-x\Vert \).
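As a quick numerical sanity check of the first claim (a sketch under the assumption that f is the quadratic \(f(x)=\tfrac{1}{2}x^TQx\), for which c-strong convexity relative to \(\Vert .\Vert _A\) amounts to \(Q-cA^TA\succeq 0\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
M = rng.standard_normal((n, n))
Q = M @ M.T + 0.5 * np.eye(n)            # f(x) = 0.5 x^T Q x is strongly convex
mu = np.linalg.eigvalsh(Q).min()         # strong convexity modulus of f
A = rng.standard_normal((m, n))          # m < n, so A lacks full column rank
c = mu / np.linalg.eigvalsh(A.T @ A).max()
# f is c-strongly convex relative to ||.||_A  <=>  Q - c A^T A is PSD
min_eig = np.linalg.eigvalsh(Q - c * A.T @ A).min()
assert min_eig >= -1e-10
```

The check passes for any such Q and A, since \(cA^TA\preceq \tfrac{\mu }{\lambda _{\max }(A^TA)}\lambda _{\max }(A^TA)I=\mu I\preceq Q\).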
In the following sections we derive some new convergence rates for ADMM by using performance estimation. The main idea of performance estimation is based on interpolability. Let \({\mathcal {I}}\) be an index set and let \(\{(x^i; \xi ^i; f^i)\}_{i\in {\mathcal {I}}}\subseteq {\mathbb {R}}^n\times {\mathbb {R}}^n\times {\mathbb {R}}\). The set \(\{(x^i; \xi ^i; f^i)\}_{i\in {\mathcal {I}}}\) is called \({\mathcal {F}}^A_{c}\)-interpolable if there exists \(f\in {\mathcal {F}}^A_{c}({\mathbb {R}}^n)\) with
$$\begin{aligned} f(x^i)=f^i \ \ \text {and}\ \ \xi ^i\in \partial f(x^i), \quad i\in {\mathcal {I}}. \end{aligned}$$
The next theorem gives necessary and sufficient conditions for \({\mathcal {F}}_{c}^A\)-interpolability.
Theorem 2
Let \(c\in [0, \infty )\) and let \({\mathcal {I}}\) be an index set. The set \(\{(x^i; \xi ^i; f^i)\}_{i\in {\mathcal {I}}}\subseteq {\mathbb {R}}^n\times {\mathbb {R}}^n \times {\mathbb {R}}\) is \({\mathcal {F}}^A_{c}\)-interpolable if and only if for any \(i, j\in {\mathcal {I}}\), we have
$$\begin{aligned} f^i\ge f^j+\left\langle \xi ^j, x^i-x^j\right\rangle +\frac{c}{2}\left\| x^i-x^j\right\| _A^2. \end{aligned}$$
Moreover, \(\{(x^i; \xi ^i; f^i)\}_{i\in {\mathcal {I}}}\) is \({\mathcal {F}}_{0}\)-interpolable with an L-smooth interpolating function if and only if for any \(i, j\in {\mathcal {I}}\), we have
$$\begin{aligned} f^i\ge f^j+\left\langle \xi ^j, x^i-x^j\right\rangle +\frac{1}{2L}\left\| \xi ^i-\xi ^j\right\| ^2. \end{aligned}$$
Proof
The argument is analogous to that of [46, Theorem 4]. The set \(\{(x^i; \xi ^i; f^i)\}_{i\in {\mathcal {I}}}\) is \({\mathcal {F}}^A_{c}\)-interpolable if and only if the set \(\{(x^i; \xi ^i-cA^TAx^i; f^i-\tfrac{c}{2}\Vert x^i\Vert _A^2)\}_{i\in {\mathcal {I}}}\) is \({\mathcal {F}}_{0}\)-interpolable. By [46, Theorem 1], the latter holds if and only if
which implies inequality (4). The second part follows directly from [46, Theorem 4]. \(\square \)
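Condition (4) can be checked numerically on a concrete instance (a sketch assuming the inequality \(f^i\ge f^j+\langle \xi ^j, x^i-x^j\rangle +\tfrac{c}{2}\Vert x^i-x^j\Vert _A^2\) as the form of (4), with a quadratic f and \(\xi ^i=\nabla f(x^i)\)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)                 # f(x) = 0.5 x^T Q x, positive definite
A = rng.standard_normal((m, n))
c = np.linalg.eigvalsh(Q).min() / np.linalg.eigvalsh(A.T @ A).max()

X = rng.standard_normal((10, n))        # sample points; grad f(x) = Q x
gaps = []
for xi in X:
    for xj in X:
        lhs = 0.5 * xi @ Q @ xi
        rhs = (0.5 * xj @ Q @ xj + (Q @ xj) @ (xi - xj)
               + 0.5 * c * np.linalg.norm(A @ (xi - xj))**2)
        gaps.append(lhs - rhs)
min_gap = min(gaps)                     # the inequality must hold for every pair
assert min_gap >= -1e-9
```

For this quadratic, \(f^i-f^j-\langle \nabla f(x^j), x^i-x^j\rangle =\tfrac{1}{2}(x^i-x^j)^TQ(x^i-x^j)\), which dominates \(\tfrac{c}{2}\Vert x^i-x^j\Vert _A^2\) because \(Q\succeq cA^TA\).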
Note that any convex function is 0-strongly convex relative to \(\Vert .\Vert _A\). Let \(f\in {\mathcal {F}}_{0}({\mathbb {R}}^n)\). The conjugate function \(f^*:{\mathbb {R}}^n\rightarrow (-\infty , \infty ]\) is defined as \(f^*(y)=\sup _{x\in {\mathbb {R}}^n}\; \langle y, x\rangle -f(x)\). We have the following identity:
$$\begin{aligned} \partial f^*(y)={{\,\textrm{argmax}\,}}_{x\in {\mathbb {R}}^n}\ \langle y, x\rangle -f(x). \end{aligned}$$
Let \(f\in {\mathcal {F}}_{0}({\mathbb {R}}^n)\). Then f is \(\mu \)-strongly convex if and only if \(f^*\) is \(\tfrac{1}{\mu }\)-smooth. Moreover, \((f^*)^*=f\).
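A tiny numerical illustration of this conjugacy (assumed example \(f(x)=\tfrac{\mu }{2}x^2\), whose conjugate \(f^*(y)=\tfrac{y^2}{2\mu }\) is \(\tfrac{1}{\mu }\)-smooth):

```python
import numpy as np

mu = 2.0
xs = np.linspace(-50, 50, 200001)                 # dense grid standing in for sup over R
for y in [-1.0, 0.3, 2.5]:
    f_conj = np.max(y * xs - 0.5 * mu * xs**2)    # numerical conjugate f*(y)
    assert abs(f_conj - y**2 / (2 * mu)) < 1e-4   # matches y^2 / (2 mu)
```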
By using conjugate functions, the dual of problem (1) may be written as
$$\begin{aligned} \max _{\lambda \in {\mathbb {R}}^r}\ D(\lambda )=-f^*(-A^T\lambda )-g^*(-B^T\lambda )-\langle b, \lambda \rangle . \end{aligned}$$
By the optimality conditions for the dual problem, we get
$$\begin{aligned} Ax^\star +Bz^\star -b=0 \end{aligned}$$
for some \( x^\star \in \partial f^*(-A^T\lambda ^\star )\) and \(z^\star \in \partial g^*(-B^T\lambda ^\star )\). Equation (8) together with (6) implies that \(( x^\star , z^\star )\) is an optimal solution to problem (1).
The optimality conditions for the subproblems of Algorithm 1 may be written as
$$\begin{aligned} -A^T\left( \lambda ^{k-1}+t\left( Ax^k+Bz^{k-1}-b\right) \right)&\in \partial f(x^k),\\ -B^T\left( \lambda ^{k-1}+t\left( Ax^k+Bz^{k}-b\right) \right)&\in \partial g(z^k). \end{aligned}$$
As \(\lambda ^k=\lambda ^{k-1}+t(Ax^k+Bz^k-b)\), we get
$$\begin{aligned} -A^T\left( \lambda ^{k}+tB\left( z^{k-1}-z^k\right) \right) \in \partial f(x^k), \qquad -B^T\lambda ^{k}\in \partial g(z^k). \end{aligned}$$
So, \((x^k, z^k)\) is optimal for the dual objective at \(\lambda ^k\) if and only if \(A^TB\left( z^{k-1}-z^k\right) =0\). We call \(A^TB\left( z^{k-1}-z^k\right) \) the dual residual.
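These optimality conditions can be observed on a toy instance (a sketch with assumed quadratic \(f=\tfrac{c_1}{2}x^2\), \(g=\tfrac{c_2}{2}z^2\) and \(A=B=I\): after each z-update the condition \(-B^T\lambda ^k\in \partial g(z^k)\) holds exactly, i.e. \(-\lambda ^k=c_2z^k\), while the dual residual \(A^TB(z^{k-1}-z^k)\) vanishes only in the limit):

```python
import numpy as np

c1, c2, t = 1.0, 1.0, 0.5
b = np.array([1.0])
z, lam = np.array([0.0]), np.array([0.0])
residuals = []
for k in range(100):
    z_prev = z
    # closed-form x- and z-steps for f = (c1/2)x^2, g = (c2/2)z^2, A = B = I
    x = (-lam - t * (z - b)) / (c1 + t)
    z = (-lam - t * (x - b)) / (c2 + t)
    lam = lam + t * (x + z - b)
    # -B^T lam^k in dg(z^k) holds exactly right after the z-update
    assert np.allclose(-lam, c2 * z, atol=1e-12)
    residuals.append(abs(z_prev - z)[0])   # dual residual A^T B (z^{k-1}-z^k), scalar here
# the dual residual vanishes as the iterates approach a saddle point
assert residuals[-1] < 1e-8
```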
2 Performance estimation
In this section, we develop performance estimation for ADMM. The performance estimation method, introduced by Drori and Teboulle [12], is an SDP-based framework for the analysis of first-order methods. Since then, many scholars have employed this powerful tool to derive the worst-case convergence rates of different iterative methods; see [2, 27, 43, 46] and the references therein. Moreover, Gu and Yang [20] employed performance estimation to study the extension of the dual step length for ADMM. Note that while there are some similarities between our work and [20] in the use of performance estimation, the formulations and results are different.
The worst-case convergence rate of Algorithm 1 with respect to dual objective value may be cast as the following abstract optimization problem,
where \(f, g, A, B, b, z^0, \lambda ^0, x^\star , z^\star , \lambda ^\star \) are decision variables and \(N, t, c_1, c_2, \varDelta \) are given parameters. Note that problem (11) would be unbounded unless we impose some initial condition. We take the boundedness of \(\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2\left\| z^0-z^\star \right\| _B^2\) as the initial condition. The boundedness of \(t^{-1}\Vert \lambda ^0-\lambda ^\star \Vert ^2+t\left\| z^0-z^\star \right\| _B^2\) is commonly used for the convergence analysis of ADMM; see e.g. [9, 29]. We employ a positive multiple of this criterion for notational convenience, as t is a fixed positive constant in Algorithm 1. Moreover, we use this measure to establish R-linear convergence in terms of the dual objective; see Sect. 5 for more discussion.
Note that \(D(\lambda ^\star )=f^\star +g^\star \) and \((\tilde{x}, \tilde{z})\in {{\,\textrm{argmin}\,}}f(x)+g(z)+\langle \lambda ^N, Ax+Bz-b\rangle \) if and only if
$$\begin{aligned} \tilde{\xi }+A^T\lambda ^N=0, \qquad \tilde{\eta }+B^T\lambda ^N=0, \end{aligned}$$
for some \(\tilde{\xi }\in \partial f(\tilde{x})\) and \(\tilde{\eta }\in \partial g(\tilde{z})\). It is worth noting that a point \(\tilde{x}\) satisfying these conditions exists, as the function f is strongly convex relative to \(\Vert .\Vert _A\). In addition, one may take \(\tilde{z}=z^N\) by virtue of (10). For notational convenience, we introduce \(x^{N+1}=\tilde{x}\) and \(\xi ^{N+1}=\tilde{\xi }\). The reader should bear in mind that \(x^{N+1}\) is not generated by Algorithm 1. Therefore,
$$\begin{aligned} D(\lambda ^N)=f(x^{N+1})+g(z^N)+\left\langle \lambda ^N, Ax^{N+1}+Bz^N-b\right\rangle \end{aligned}$$
for some \(x^{N+1}\) with \(-A^T\lambda ^N\in \partial f(x^{N+1})\).
By using Theorem 2 to replace the conditions \(f\in {\mathcal {F}}^A_{c_1}({\mathbb {R}}^n)\), and \(g\in {\mathcal {F}}^B_{c_2}({\mathbb {R}}^m)\) by finite interpolation conditions, and by using the optimality conditions (9), problem (11) may be reformulated as a finite dimensional optimization problem, through the performance estimation technique:
In problem (13), \(A, B, \{x^k; \xi ^k; f^k\}_1^{N+1}, \{(x^\star ; \xi ^\star ; f^\star )\}, \{\lambda ^k\}_0^{N}, \{z^k; \eta ^k; g^k\}_0^{N},\) \(\{(z^\star ; \eta ^\star ; g^\star )\}, \lambda ^\star , b\) are decision variables. To handle problem (13), without loss of generality, we assume that the matrix \(\begin{pmatrix} A&B \end{pmatrix}\) has full row rank. Note that this assumption does not appear in our arguments in the following sections. In addition, we introduce some new variables. As problem (1) is invariant under translation of (x, z), we may assume without loss of generality that \(b=0\) and \((x^\star , z^\star )=(0, 0)\). In addition, due to the full row rank of the matrix \(\begin{pmatrix} A&B \end{pmatrix}\), we may assume that \(\lambda ^0=\begin{pmatrix} A&B \end{pmatrix} \begin{pmatrix} x^{\dagger } \\ z^{\dagger } \end{pmatrix}\) and \(\lambda ^\star =\begin{pmatrix} A&B \end{pmatrix} \begin{pmatrix} \bar{x} \\ \bar{z} \end{pmatrix}\) for some \(\bar{x}, x^{\dagger }, \bar{z}, z^{\dagger }\). So,
and \(D(\lambda ^\star )=f^\star +g^\star \).
By using equality constraints of problem (13) and the newly introduced variables, we have for \(k\in \{1,\ldots , N\}\)
Hence, problem (13) may be written as
In problem (15), \(A, B, \{x^k; f^k\}_1^{N+1}, \{z^k; g^k\}_1^{N}, x^{\dagger }, z^{\dagger }, \bar{x}, f^\star , \bar{z}, g^\star , z^0\) are decision variables. By using the Gram matrix method, problem (15) may be relaxed as a semidefinite program as follows. Let
By introducing matrix variable
problem (15) may be relaxed as the following SDP,
where the constant matrices \(L_{i, j}^f, L_{i, j}^g, L_o, L_0\) are determined according to the constraints of problem (15). In the following sections, we present some new convergence results that are derived by solving this kind of formulation.
3 Worst-case convergence rate
In this section, we provide new convergence rates for ADMM with respect to some performance measures. Before turning to the theorems, we present some lemmas.
Lemma 1
Let \(N\ge 4\) and \(t, c \in {\mathbb {R}}\). Let E(t, c) be \((N+1)\times (N+1)\) symmetric matrix given by
where
and k denotes the row number. If \(c>0\), then \(E(t, c)\succeq 0\) for every \(t\in [0, c]\).
Proof
As \(\{t: E(t, c )\succeq 0\}\) is a convex set, it suffices to prove the positive semidefiniteness of E(0, c) and E(c, c). Since E(0, c) is diagonally dominant, it is positive semidefinite. Now, we establish that the matrix \(K=E(1,1)\) is positive definite. To this end, we show that all leading principal minors of K are positive. To compute the leading principal minors, we perform the following elementary row operations on K:
- (i) Add the second row to the third row;
- (ii) Add the second row to the last row;
- (iii) Add the third row to the fourth row;
- (iv) For \(i=4:N-1\): add the \(i\)-th row to the \((i+1)\)-th row, and add \(\tfrac{3-i}{2i^2-3i-1}\) times the \(i\)-th row to the last row;
- (v) Add \(\frac{N-1}{3N-5}\) times the \(N\)-th row to the \((N+1)\)-th row.
It is seen that \(K_{k-1, k}+K_{k, k}=-K_{k+1, k}\) for \(2\le k\le N-1\). Hence, by performing these operations, we get an upper triangular matrix J with diagonal
It is seen that the first N diagonal elements of J are positive. We show that \(J_{N+1, N+1}\) is also positive. For \(i\ge 4\) we have
So,
which implies \(J_{N+1, N+1}>0\). Since we only add a multiple of the \(i\)-th row to the \(j\)-th row with \(i<j\), the leading principal minors of K and J coincide. Hence K is positive definite. As \( E(c, c )=c K\), one can infer the positive definiteness of E(c, c), and the proof is complete. \(\square \)
In the upcoming lemma, we establish a valid inequality for ADMM that will be utilized in all the subsequent results presented in this section.
Lemma 2
Let \(f\in {\mathcal {F}}^A_{c_1}({\mathbb {R}}^n)\), \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\) and \(x^\star =0\), \(z^\star =0\). Suppose that ADMM with the starting points \(\lambda ^0\) and \(z^0\) generates \(\{(x^k; z^k; \lambda ^k)\}\). If \(N\ge 4\) and \(v\in {\mathbb {R}}^r\), then
where
Proof
We establish the desired inequality by summing a series of valid inequalities. To simplify the notation, let \(f^k=f(x^k)\) and \(g^k=g(z^k)\) for \(k\in \{1, \dots , N\}\). Note that \(b=0\) because \(x^\star =0\) and \(z^\star =0\). By (4) and (9), we get the following inequality
As \(\lambda ^k = \lambda ^{k-1} + tAx^k + tBz^k\), the inequality can be expressed as
After performing some algebraic manipulations, we obtain
By using \(\lambda ^{N-1}=\lambda ^N-tAx^N-tBz^N\) and
we get
which implies the desired inequality. \(\square \)
We may now prove the main result of this section.
Theorem 3
Let \(f\in {\mathcal {F}}^A_{c_1}({\mathbb {R}}^n)\) and \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\) with \(c_1>0\). If \(t\le c_1\) and \(N\ge 4\), then
$$\begin{aligned} D(\lambda ^\star )-D(\lambda ^N)\le \frac{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2\left\| z^0-z^\star \right\| _B^2}{4Nt}. \end{aligned}$$
Proof
As discussed in Sect. 2, we may assume that \(x^\star =0\) and \(z^\star =0\). By (12), we have \(D(\lambda ^N)=f(\hat{x}^{N})+g(z^N)+\left\langle \lambda ^N, A\hat{x}^{N}+Bz^N\right\rangle \) for some \(\hat{x}^{N}\) with \(-A^T\lambda ^N\in \partial f(\hat{x}^{N})\). By employing (4) and (9), we obtain
By substituting v with \(A\hat{x}^{N}\) in inequality (18) and summing it with (20), we get the following inequality after performing some algebraic manipulations
where the positive semidefinite matrix \(E(t, c_1)\) is given in Lemma 1. As the inner product of positive semidefinite matrices is non-negative, inequality (21) implies that
and the proof is complete. \(\square \)
In comparison with Theorem 1, we obtain a convergence rate when only f is strongly convex, i.e. g need not be strongly convex. Also, the constant does not depend on \(\lambda ^1\). One important question concerning bound (19) is its tightness, that is, whether there is an optimization problem that attains the given convergence rate. It turns out that bound (19) is exact; the following example demonstrates this point.
Example 1
Suppose that \(c_1>0\), \(N\ge 4\) and \(t\in (0, c_1]\). Let \(f, g: {\mathbb {R}}\rightarrow {\mathbb {R}}\) be given as follows,
Consider the optimization problem
It is seen that \(A=B=I\) in this problem. Note that \((x^\star , z^\star )=(0, 0)\) with Lagrange multiplier \(\lambda ^\star =\tfrac{1}{2}\) is an optimal solution and the optimal value is zero. One can check that Algorithm 1 with initial point \(\lambda ^0=\tfrac{-1}{2}\) and \( z^0=0\) generates the following points,
At \(\lambda ^{N}\), we have \(D(\lambda ^N)=\tfrac{-1}{4Nt} =-\tfrac{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2 \left\| z^0-z^\star \right\| _B^2}{4Nt}\), which shows the tightness of bound (19).
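Away from this worst-case instance, the bound can also be probed numerically (a sketch on an assumed quadratic instance \(f=\tfrac{c_1}{2}\Vert x\Vert ^2\), \(g=\tfrac{c_2}{2}\Vert z\Vert ^2\), \(A=B=I\), for which the dual objective \(D(\lambda )=-\tfrac{\Vert \lambda \Vert ^2}{2c_1}-\tfrac{\Vert \lambda \Vert ^2}{2c_2}-\langle b, \lambda \rangle \) is available in closed form, and the dual gap after N iterations is compared with \(\tfrac{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2\Vert z^0-z^\star \Vert _B^2}{4Nt}\)):

```python
import numpy as np

c1, c2, t, N = 2.0, 1.0, 1.0, 20           # t <= c1 and N >= 4, as in Theorem 3
b = np.array([1.0, -2.0])

def D(lam):
    # dual objective for f = (c1/2)||x||^2, g = (c2/2)||z||^2, A = B = I
    return -lam @ lam / (2 * c1) - lam @ lam / (2 * c2) - b @ lam

lam_star = -b * c1 * c2 / (c1 + c2)         # maximizer of D
z_star = -lam_star / c2                     # optimal z, from -B^T lam* in dg(z*)
z, lam = np.zeros(2), np.zeros(2)
z0, lam0 = z.copy(), lam.copy()
for _ in range(N):                          # ADMM with closed-form subproblems
    x = (-lam - t * (z - b)) / (c1 + t)
    z = (-lam - t * (x - b)) / (c2 + t)
    lam = lam + t * (x + z - b)
bound = (np.linalg.norm(lam0 - lam_star)**2
         + t**2 * np.linalg.norm(z0 - z_star)**2) / (4 * N * t)
assert D(lam_star) - D(lam) <= bound + 1e-12
```

On this well-conditioned instance the dual gap is far below the bound, as expected: the bound is attained only at worst-case instances such as the one above.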
For dual-based methods, an important factor in the efficiency of an algorithm is the convergence rate of the primal and dual residuals. In what follows, we study this subject under the setting of Theorem 3. The next theorem gives a convergence rate in terms of the primal residual.
Theorem 4
Let \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) and \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\) with \(c_1>0\). If \(t\le c_1\) and \(N\ge 4\), then
$$\begin{aligned} \left\| Ax^N+Bz^N-b\right\| \le \frac{\sqrt{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2\left\| z^0-z^\star \right\| _B^2}}{tN}. \end{aligned}$$
Proof
The argument is similar to that used in the proof of Theorem 3. By setting \(v=Ax^{N}\) in (18), one can infer the following inequality
By employing (4) and (9), we have
By summing (23) and (24), we obtain
where the matrix \(D(t, c_1)\) is as follows,
and
As the matrix \(D(t, c_1)\) is positive semidefinite, see Appendix A, inequality (25) implies that
and the proof is complete. \(\square \)
The following example shows the exactness of bound (22).
Example 2
Let \(c_1>0\), \(N\ge 4\) and \(t\in (0, c_1]\). Consider the functions \(f, g: {\mathbb {R}}\rightarrow {\mathbb {R}}\) given by the following formulas,
We formulate the following optimization problem,
where \(A=B=I\). One can verify that \((x^\star , z^\star )=(0, 0)\) with Lagrange multiplier \(\lambda ^\star =\tfrac{1}{2}\) is an optimal solution. Algorithm 1 with initial point \(\lambda ^0=\tfrac{-1}{2}\) and \( z^0=0\) generates the following points,
At iteration N, we have \(\Vert Ax^N+Bz^N\Vert =\tfrac{1}{tN} =\tfrac{\sqrt{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2\left\| z^0-z^\star \right\| _B^2}}{tN}\), which shows the tightness of bound (22).
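The primal residual bound of Theorem 4 can be probed in the same way (a sketch on an assumed quadratic instance with \(A=B=I\); the residual after N iterations is compared with \(\sqrt{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2\Vert z^0-z^\star \Vert _B^2}/(tN)\)):

```python
import numpy as np

c1, c2, t, N = 2.0, 1.0, 0.5, 25           # t <= c1 and N >= 4
b = np.array([3.0])
lam_star = -b * c1 * c2 / (c1 + c2)         # dual optimum for this quadratic instance
z_star = -lam_star / c2
z, lam = np.zeros(1), np.zeros(1)
z0, lam0 = z.copy(), lam.copy()
for _ in range(N):
    x = (-lam - t * (z - b)) / (c1 + t)     # closed-form x-step
    z = (-lam - t * (x - b)) / (c2 + t)     # closed-form z-step
    lam = lam + t * (x + z - b)
primal_res = np.linalg.norm(x + z - b)      # ||Ax^N + Bz^N - b||
bound = np.sqrt(np.linalg.norm(lam0 - lam_star)**2
                + t**2 * np.linalg.norm(z0 - z_star)**2) / (t * N)
assert primal_res <= bound + 1e-12
```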
In what follows, we study the convergence rate of ADMM in terms of the dual residual. To this end, we investigate the convergence rate of \(\{B\left( z^{k-1}-z^k\right) \}\), as \(\left\| A^TB\left( z^{k-1}-z^k\right) \right\| \le \Vert A\Vert \left\| z^{k-1}-z^k\right\| _B\). The next theorem provides a convergence rate for this sequence.
Theorem 5
Let \(f\in {\mathcal {F}}^A_{c_1}({\mathbb {R}}^n)\) and \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\) with \(c_1>0\). If \(t\le c_1\) and \(N\ge 4\), then
$$\begin{aligned} \left\| z^{N-1}-z^N\right\| _B\le \frac{\sqrt{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2\left\| z^0-z^\star \right\| _B^2}}{(N-1)t}. \end{aligned}$$
Proof
Similar to the proof of Theorem 3, by setting \(v=Ax^{N}\) in (18) for \(N-1\) iterations, one can infer the following inequality
By summing (27) and (28), we obtain
where the matrix \(F(t, c_1)\) is as follows,
and
The rest of the proof proceeds analogously to the proof of Theorem 4. \(\square \)
The following example shows the tightness of this bound.
Example 3
Assume that \(c_1>0\), \(N\ge 4\) and \(t\in (0, c_1]\) are given, and \(f, g: {\mathbb {R}}\rightarrow {\mathbb {R}}\) are defined by,
Consider the optimization problem
where \(A=B=I\). The point \((x^\star , z^\star )=(0, 0)\) with Lagrange multiplier \(\lambda ^\star =\tfrac{1}{2}\) is an optimal solution. After performing N iterations of Algorithm 1 with \(\lambda ^0=\tfrac{-1}{2}\) and \( z^0=0\), we have
It can be seen that \(\left\| A^TB\left( z^N-z^{N-1}\right) \right\| =\tfrac{1}{(N-1)t}=\tfrac{\sqrt{\Vert \lambda ^0-\lambda ^\star \Vert ^2+t^2 \left\| z^0-z^\star \right\| _B^2}}{(N-1)t}\), which shows that the bound is tight.
Theorems 3, 4 and 5 address the case where f is strongly convex relative to \(\Vert .\Vert _A\) and g is merely convex. Based on numerical solutions of performance estimation problems such as (15), we conjecture that, under the assumptions of Theorem 3, if g is in addition \(c_2\)-strongly convex relative to \(\Vert .\Vert _B\), then Algorithm 1 enjoys the following convergence rates
We have verified these conjectured rates numerically for many specific values of the parameters. Nevertheless, we did not manage to guess a closed-form formula for the dual residual bound in this case.
4 Linear convergence of ADMM
In this section we study the linear convergence of ADMM. The linear convergence of ADMM has been addressed by several authors, and various sufficient conditions have been proposed; see [11, 21, 22, 25, 31, 38, 47]. Two common types of assumptions employed for proving the linear convergence of ADMM are the error bound property and L-smoothness. To the best of our knowledge, most authors have investigated the linear convergence of the sequence \(\{(x^k, z^k, \lambda ^k)\}\) to a saddle point, and there is no result in terms of the dual objective value for ADMM. In line with the previous section, we study linear convergence in terms of the dual objective value, and we derive some formulas for the linear convergence rate factor by using performance estimation. It is worth mentioning that the term "Q-linear convergence" is also used in the literature for this notion of linear convergence.
As mentioned earlier, the error bound property has been used for establishing linear convergence; see e.g. [21, 25, 31, 40, 47]. Let
$$\begin{aligned} D^a(\lambda )=\min _{x, z}\; f(x)+g(z)+\langle \lambda , Ax+Bz-b\rangle +\frac{a}{2}\left\| Ax+Bz-b\right\| ^2 \end{aligned}$$
stand for the augmented dual objective for a given \(a>0\), and let \(\varLambda ^\star \) denote the optimal solution set of the dual problem. Note that the function \(D^a\) is \(\tfrac{1}{a}\)-smooth on its domain without any strong convexity assumption; see [25, Lemma 2.2].
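The \(\tfrac{1}{a}\)-smoothness of \(D^a\) can be verified numerically for quadratic objectives (a sketch: writing \(w=(x, z)\), \(M=\begin{pmatrix} A&B \end{pmatrix}\) and assuming \(f+g=\tfrac{1}{2}w^TQw\) with \(Q\succ 0\), the Hessian of \(-D^a\) is \(M(Q+aM^TM)^{-1}M^T\), whose largest eigenvalue is at most \(\tfrac{1}{a}\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r, a = 3, 3, 2, 0.7
A = rng.standard_normal((r, n))
B = rng.standard_normal((r, m))
M = np.hstack([A, B])                    # constraint matrix (A  B)
P = rng.standard_normal((n + m, n + m))
Q = P @ P.T + 0.1 * np.eye(n + m)        # f + g = 0.5 w^T Q w, strongly convex
H = M @ np.linalg.solve(Q + a * M.T @ M, M.T)   # Hessian of -D^a
L_a = np.linalg.eigvalsh(H).max()
assert L_a <= 1.0 / a + 1e-9             # grad D^a is (1/a)-Lipschitz
```

The inequality holds because \(Q\succeq 0\) implies \(aM(Q+aM^TM)^{-1}M^T\preceq I\).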
Definition 2
The function \(D^a\) is said to satisfy the error bound property if
$$\begin{aligned} d_{\varLambda ^\star }(\lambda )\le \tau \left\| \nabla D^a(\lambda )\right\| \ \ \text {for every } \lambda \text { in the domain of } \nabla D^a, \end{aligned}$$
for some \(\tau >0\).
Hong et al. [25] established the linear convergence by employing the error bound property (30).
Recently, some scholars established the linear convergence of gradient methods for L-smooth convex functions by replacing strong convexity with some mild conditions; see [1, 7, 36] and references therein. Inspired by these results, we prove the linear convergence of ADMM by using the so-called PŁ inequality. It is worth noting that we employ the nonsmooth version of the PŁ inequality introduced in [6]. Concerning the differentiability of the dual objective, by (7), we have
$$\begin{aligned} b-A\partial f^*(-A^T\lambda )-B\partial g^*(-B^T\lambda )\subseteq \partial \left( -D\right) (\lambda ). \end{aligned}$$
Note that inclusion (31) holds as an equality under some mild conditions, see e.g. [4, Chapter 3].
Definition 3
The function D is said to satisfy the PŁ inequality if there exists an \(L_p>0\) such that for any \(\lambda \in {\mathbb {R}}^r\) we have
$$\begin{aligned} \tfrac{1}{2}\left\| \xi \right\| ^2\ge L_p\left( D(\lambda ^\star )-D(\lambda )\right) \quad \text {for every}\ \xi \in b-A\partial f^*(-A^T\lambda )-B\partial g^*(-B^T\lambda ). \end{aligned}$$
Note that if f and g are strongly convex, then \(-D\) is an L-smooth convex function with \(L\le \tfrac{\lambda _{\max } (A^TA)}{\mu _1}+\tfrac{\lambda _{\max }(B^TB)}{\mu _2}\). In this setting, we have \(L_p\le \tfrac{\lambda _{\max }(A^TA)}{\mu _1} +\tfrac{\lambda _{\max }(B^TB)}{\mu _2}\). This follows from the duality between smoothness and strong convexity and
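This smoothness claim can be checked numerically for quadratic f and g (a sketch: for \(f=\tfrac{\mu _1}{2}\Vert x\Vert ^2\) and \(g=\tfrac{\mu _2}{2}\Vert z\Vert ^2\) one has \(\nabla ^2(-D)=\tfrac{AA^T}{\mu _1}+\tfrac{BB^T}{\mu _2}\), and \(\lambda _{\max }(AA^T)=\lambda _{\max }(A^TA)\)):

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2 = 1.5, 0.8
r, n, m = 3, 4, 5
A = rng.standard_normal((r, n))
B = rng.standard_normal((r, m))
# -D(lam) = ||A^T lam||^2/(2 mu1) + ||B^T lam||^2/(2 mu2) + <b, lam>
H = A @ A.T / mu1 + B @ B.T / mu2        # Hessian of -D
L = np.linalg.eigvalsh(H).max()
bound = (np.linalg.eigvalsh(A.T @ A).max() / mu1
         + np.linalg.eigvalsh(B.T @ B).max() / mu2)
assert L <= bound + 1e-9                 # L <= lmax(A^T A)/mu1 + lmax(B^T B)/mu2
```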
In the next proposition, we show that conditions (30) and (32) are equivalent.
Proposition 1
Let \(L_a=\tfrac{1}{a}\) denote the Lipschitz constant of \(\nabla D^a\), where \(D^a\) is given in (29). Suppose that (31) holds as an equality.

- (i) If \(D^a\) satisfies the error bound (30), then D satisfies the PŁ inequality (32) with \(L_p=\tfrac{1}{L_a\tau ^2}\).
- (ii) If D satisfies the PŁ inequality (32), then \(D^a\) satisfies the error bound (30) with \(\tau =\tfrac{L_p}{1+aL_p}\).
Proof
First, we prove (i). Suppose \(\lambda \in {\mathbb {R}}^r\) and \(\xi \in b-A\partial f^*(-A^T\lambda )-B\partial g^*(-B^T\lambda )\). By identity (6), we have \(\xi =b-A\bar{x}-B\bar{z}\) for some \((\bar{x}, \bar{z})\in {{\,\textrm{argmin}\,}}f(x)+g(z)+\langle \lambda , Ax+Bz-b\rangle \). Due to the smoothness of \(D^a\) and (30), we get
where \(\lambda ^\star \in \varLambda ^\star \) with \(d_{\varLambda ^\star }=\Vert \nu -\lambda ^\star \Vert \). Set \(\bar{\nu }=\lambda -a(A\bar{x}+B\bar{z}-b)\). As we assume strong duality, we have \(D^a(\lambda ^\star )=D(\lambda ^\star )\). By the definitions of \(\bar{x}, \bar{z}\), we get
By [25, Lemma 2.1], we have \(\nabla D^a(\bar{\nu })=A\bar{x}+B\bar{z}-b\). This equality together with (33) implies
and the proof of (i) is complete.
Now we establish (ii). Let \(\lambda \) be in the domain of \(\nabla D^a\). By [25, Lemma 2.1], we have \(\nabla D^a(\lambda )=A\bar{x}+B\bar{z}-b\) for some \((\bar{x}, \bar{z})\in {{\,\textrm{argmin}\,}}f(x)+g(z)+\langle \lambda , Ax+Bz-b\rangle +\tfrac{a}{2}\Vert Ax+Bz-b\Vert ^2\), which implies that
Set \(\nu =\lambda +a(A\bar{x}+B\bar{z}-b)\). By (34), one can infer that \(D(\nu )=f(\bar{x})+g(\bar{z})+\langle \nu , A\bar{x}+B\bar{z}-b\rangle \). In addition, (6) implies that \(b-A\bar{x}-B\bar{z}\in b-A\partial f^*(-A^T\nu )-B\partial g^*(-B^T\nu )\). By the PŁ inequality, we have
where the equality follows from \(D(\nu )=D^a(\lambda )+\tfrac{a}{2}\left\| A\bar{x}+B\bar{z}-b\right\| ^2\) and \(D^a(\lambda ^\star )=D(\lambda ^\star )\). Hence,
This inequality shows that \(D^a\) satisfies the PŁ inequality. Since the PŁ inequality implies the error bound with the same constant (see [7]), the proof is complete. \(\square \)
In what follows, we employ performance estimation to derive a linear convergence rate for ADMM in terms of the dual objective when the PŁ inequality holds. To this end, we bound the ratio of the dual objective gaps at two consecutive iterations, \(\tfrac{D(\lambda ^\star ) -D(\lambda ^2)}{D(\lambda ^\star )-D(\lambda ^1)}\). The following optimization problem gives the worst-case convergence rate,
Analogous to our discussion in Sect. 2, we may assume without loss of generality \(b=0\), \(\lambda ^1=\begin{pmatrix} A&B \end{pmatrix} \begin{pmatrix} x^{\dagger } \\ z^{\dagger } \end{pmatrix}\) and \(\lambda ^\star =\begin{pmatrix} A&B \end{pmatrix} \begin{pmatrix} \bar{x} \\ \bar{z} \end{pmatrix}\) for some \(\bar{x}, x^{\dagger }, \bar{z}, z^{\dagger }\). In addition, we assume that \(\hat{x}^1\in {{\,\textrm{argmin}\,}}f(x)+\langle \lambda ^1, Ax\rangle \) and \(\hat{x}^2\in {{\,\textrm{argmin}\,}}f(x)+\langle \lambda ^2, Ax\rangle \). Hence,
and
Moreover, by (36) and (31), we get
On the other hand, \(\lambda ^2=\lambda ^1+tAx^2+tBz^2\). Therefore, by using Theorem 2, problem (35) may be relaxed as follows,
By deriving an upper bound for the optimal value of problem (37) in the next theorem, we establish the linear convergence of ADMM in the presence of the PŁ inequality.
Theorem 6
Let \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) and \(g\in {\mathcal {F}}^B_{c_2}({\mathbb {R}}^m)\) with \(c_1, c_2>0\), and let D satisfy the PŁ inequality with constant \(L_p\). Suppose that \(t\le \sqrt{c_1c_2}\).
- (i) If \(c_1\ge c_2\), then
  $$\begin{aligned} \frac{D(\lambda ^\star )-D(\lambda ^2)}{D(\lambda ^\star ) -D(\lambda ^1)}\le \frac{2c_1c_2-t^2}{2c_1c_2-t^2+L_pt \left( 4c_1c_2-c_2t-2t^2\right) }, \end{aligned}$$(38)
  in particular, if \(t=\sqrt{c_1c_2}\),
  $$\begin{aligned} \frac{D(\lambda ^\star )-D(\lambda ^2)}{D(\lambda ^\star ) -D(\lambda ^1)}\le \frac{1}{1+L_p\left( 2\sqrt{c_1c_2}-c_2\right) }. \end{aligned}$$
- (ii) If \(c_1< c_2\), then
  $$\begin{aligned}&\frac{D(\lambda ^\star )-D(\lambda ^2)}{D(\lambda ^\star ) -D(\lambda ^1)}\nonumber \\&\quad \le \frac{4 c_2^2-2c_2 \sqrt{c_1c_2}-t^2}{4 c_2^2-2c_2 \sqrt{c_1c_2}-t^2+L_pt\left( 8c_2^2+5c_2t-2\sqrt{c_1c_2} \left( 1+\tfrac{t}{c_1}\right) \left( 2c_2+t\right) \right) }. \end{aligned}$$(39)
Proof
The argument is based on weak duality. Indeed, by introducing suitable Lagrange multipliers, we establish that the given convergence rates are upper bounds for problem (37). First, we prove (i). Let \(\alpha \) denote the right-hand side of inequality (38). As \(2c_1c_2-t^2>0\) and \(4c_1c_2-c_2t-2t^2>0\), we have \(0<\alpha <1\). With some algebra, one can show that
Hence, we get
for any feasible point of problem (35), which completes the proof of the first part. For (ii), we proceed analogously to the proof of (i), but with different Lagrange multipliers. Let \(\beta \) denote the right-hand side of inequality (39), i.e.
It is seen that \(0<\beta <1\). By direct calculation, we have
The rest of the proof is similar to that of the former case. \(\square \)
We computed the bounds in Theorem 6 by selecting suitable Lagrange multipliers and solving the semidefinite formulation of problem (37) by hand; this formulation is formed analogously to problem (16). Note that the optimal value of problem (37) may be smaller than the bounds introduced in Theorem 6: our aim was to provide a concrete mathematical proof of linear convergence, and the linear convergence rate factor is not necessarily tight. Needless to say, the optimal value of problem (37) also does not necessarily give the tight convergence factor, as it is just a relaxation of problem (35).
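As a quick sanity check on the algebra in Theorem 6, one can verify numerically that the right-hand side of (38) lies in \((0,1)\) and reduces to the stated special case at \(t=\sqrt{c_1c_2}\). A minimal sketch (the parameter triples below are arbitrary test values, not taken from the paper):

```python
import math

def rate_general(c1, c2, t, Lp):
    """Right-hand side of bound (38), for c1 >= c2 and t <= sqrt(c1*c2)."""
    num = 2 * c1 * c2 - t * t
    return num / (num + Lp * t * (4 * c1 * c2 - c2 * t - 2 * t * t))

def rate_special(c1, c2, Lp):
    """Special case of (38) at t = sqrt(c1*c2)."""
    return 1.0 / (1.0 + Lp * (2 * math.sqrt(c1 * c2) - c2))

for (c1, c2, Lp) in [(1.0, 1.0, 0.5), (4.0, 1.0, 0.3), (2.0, 2.0, 1.0)]:
    t = math.sqrt(c1 * c2)
    r = rate_general(c1, c2, t, Lp)
    assert abs(r - rate_special(c1, c2, Lp)) < 1e-12  # (38) reduces to the special case
    assert 0.0 < r < 1.0                              # a genuine linear rate factor
```

For instance, with \(c_1=c_2=1\), \(L_p=\tfrac12\) and \(t=1\), both formulas evaluate to \(2/3\).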
Recently, the authors showed that the PŁ inequality is a necessary and sufficient condition for the linear convergence of the gradient method with constant step lengths for L-smooth functions; see [1, Theorem 5]. In what follows, we establish that the PŁ inequality is a necessary condition for the linear convergence of ADMM. First, we present a lemma that is used in the proof.
Lemma 3
Let \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) and \(g\in {\mathcal {F}}^B_{c_2}({\mathbb {R}}^m)\). Consider Algorithm 1. If \((\hat{x}^1, z^1)\in {{\,\textrm{argmin}\,}}f(x)+g(z)+\langle \lambda ^1, Ax+Bz-b\rangle \), then
Proof
Without loss of generality we assume that \(c_1=c_2=0\). By optimality conditions, we have
By using these inequalities, we get
Hence, we have
which completes the proof. \(\square \)
The next theorem establishes that the PŁ inequality is a necessary condition for the linear convergence of ADMM.
Theorem 7
Let \(f\in {\mathcal {F}}^A_{c_1}({\mathbb {R}}^n)\), \(g\in {\mathcal {F}}^B_{c_2}({\mathbb {R}}^m)\) and let (31) hold as equality. If Algorithm 1 is linearly convergent with respect to the dual objective value, then D satisfies the PŁ inequality.
Proof
Consider \(\lambda ^1\in {\mathbb {R}}^r\) and \(\xi \in b-A\partial f^*(-A^T\lambda ^1)-B\partial g^*(-B^T\lambda ^1)\). Hence, \(\xi =b-A\hat{x}^1-Bz^1\) for some \((\hat{x}^1, z^1)\in {{\,\textrm{argmin}\,}}f(x)+g(z)+\langle \lambda ^1, Ax+Bz-b\rangle \). If one sets \(z^0=z^1\) and \(\lambda ^0=\lambda ^1-t(A\hat{x}^1+Bz^1-b)\) in Algorithm 1, the algorithm may generate \(\lambda ^1\). As Algorithm 1 is linearly convergent, there exists \(\gamma \in [0, 1)\) with
So, we have
where the last inequality follows from the concavity of the function D. Since \(\lambda ^2-\lambda ^1=t(Ax^2+Bz^2-b)\), Lemma 3 implies that
so D satisfies the PŁ inequality. \(\square \)
Another assumption used in the literature for establishing linear convergence is L-smoothness; see for example [10, 11, 15, 38]. Deng et al. [11] show that the sequence \(\{(x^k, z^k, \lambda ^k)\}\) converges linearly to a saddle point under Scenarios 1 and 2 given in Table 1.
It is worth mentioning that Scenario 1 or Scenario 2 implies strong convexity of the dual objective function, and therefore the PŁ inequality; see [1]. Hence, Theorem 6 implies linear convergence in terms of the dual value under Scenario 1 or Scenario 2. Deng et al. [11] also studied linear convergence under Scenario 3, but only proved the linear convergence of the sequence \(\{(x^k, Bz^k, \lambda ^k)\}\). In the next section, we investigate R-linear convergence without assuming L-smoothness of f. Indeed, we establish R-linear convergence when f is strongly convex, g is L-smooth and B has full row rank.
Note that the PŁ inequality does not necessarily imply Scenario 1 or Scenario 2. Indeed, consider the following optimization problem,
where \(f(x)=\tfrac{1}{2}\Vert x\Vert ^2+\Vert x\Vert _1\) and \(g(z)=\tfrac{1}{2}\Vert z\Vert ^2+\Vert z\Vert _1\). With some algebra, one may show that \(D(\lambda )=\sum _{i=1}^{n} h(\lambda _i)\) with
Hence, the PŁ inequality holds for \(L_p=\tfrac{1}{2}\) while neither f nor g is L-smooth.
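The example above can also be checked numerically. Below is a minimal sketch of scalar ADMM for this f and g, assuming the constraint is \(x - z = 0\) (i.e. \(A = I\), \(B = -I\), \(b = 0\)); the closed-form updates and the dual value \(D(\lambda ) = -(|\lambda |-1)_+^2\) (so \(D(\lambda ^\star )=0\)) are our own derivation for this toy instance, not formulas from the paper:

```python
def soft(v, k):
    """Soft-thresholding: argmin_x 0.5*(x - v)**2 + k*|x| (scalar)."""
    return max(abs(v) - k, 0.0) * (1.0 if v > 0 else -1.0)

def dual_gap(lam):
    """D(lambda*) - D(lambda) = (|lambda| - 1)_+^2 for this instance (D* = 0)."""
    return max(abs(lam) - 1.0, 0.0) ** 2

def admm_toy(t=1.0, z=1.0, u=3.0, iters=25):
    """Scalar ADMM for min 0.5x^2+|x| + 0.5z^2+|z|  s.t.  x - z = 0."""
    gaps = []
    for _ in range(iters):
        x = soft(t * (z - u), 1.0) / (1.0 + t)   # x-update (closed form)
        z = soft(t * (x + u), 1.0) / (1.0 + t)   # z-update (closed form)
        u = u + x - z                            # scaled multiplier, lambda = t*u
        gaps.append(dual_gap(t * u))
    return gaps

gaps = admm_toy()
ratios = [b / a for a, b in zip(gaps, gaps[1:]) if a > 1e-14]
assert all(r < 1.0 for r in ratios)                         # dual gap decreases linearly
assert all(abs(r - ratios[0]) < 1e-9 for r in ratios)       # with an (here) constant factor
assert gaps[-1] < 1e-12                                     # convergence to the dual optimum
```

On this run the dual gap contracts by the exact factor \(1/4\) per iteration, comfortably below the factor \(2/3\) that Theorem 6 predicts for \(c_1=c_2=1\), \(L_p=\tfrac12\), \(t=1\).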
As mentioned earlier, the performance estimation problem including the PŁ inequality at a finite set of points is a relaxation for computing the worst-case convergence rate. Unlike Theorem 6, we could not manage to prove the linear convergence of the primal and dual residuals under the assumptions of Theorem 6 by employing performance estimation.
5 R-linear convergence of ADMM
This section examines the linear convergence of ADMM from the perspective of a weaker notion than the Q-linear convergence studied in Sect. 4. This concept is known as R-linear convergence, where R stands for root [39]. Recall that ADMM enjoys R-linear convergence in terms of the dual objective value if there exists a sequence \(\{s_k\}\subseteq {\mathbb {R}}_+\) such that
and \(s_k\) tends Q-linearly to zero. It is easily seen that Q-linear convergence implies R-linear convergence. For an extensive discussion of convergence rates, see [39, Section A.2] or [8, Section 1.5].
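The gap between the two notions can be made concrete with a small illustration (constructed for this note, not taken from the paper): a sequence of errors whose consecutive ratios exceed 1 infinitely often, so it is not Q-linear, yet which is dominated by a Q-linearly vanishing sequence \(s_k\) and is therefore R-linear:

```python
# e_k = 2^-k on even k and (2^-k)/10 on odd k: the ratios e_{k+1}/e_k
# alternate between 1/20 and 5, so e_k is NOT Q-linear; but e_k <= s_k = 2^-k,
# and s_{k+1}/s_k = 1/2, so e_k IS R-linear.
e = [(0.5 ** k) * (0.1 if k % 2 else 1.0) for k in range(40)]
s = [0.5 ** k for k in range(40)]

ratios = [b / a for a, b in zip(e, e[1:])]
assert any(r > 1.0 for r in ratios)             # Q-linear convergence fails
assert all(ek <= sk for ek, sk in zip(e, s))    # dominated by s_k ...
assert all(abs(b / a - 0.5) < 1e-12 for a, b in zip(s, s[1:]))  # ... which is Q-linear
```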
We investigate the R-linear convergence under the following scenarios:
-
(S1): \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) is L-smooth with \(c_1>0\) and A has full row rank;
-
(S2): \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) with \(c_1>0\), g is L-smooth and B has full row rank.
Under these scenarios, we could not manage to find a value of q within the range [0, 1) that satisfies the inequality:
As a result, we turn our attention towards studying the R-linear convergence.
Our technique for proving the R-linear convergence is based on establishing the linear convergence of the sequence \(\{V^k\}\) given by
Note that \(V^k\) is called the Lyapunov function for ADMM and it decreases in each iteration; see [9]. It is worth noting that Q-linear and R-linear convergence of ADMM have been studied under similar scenarios for some performance measures, see e.g. [10, 15, 38]. However, to the best of our knowledge, no existing results in the literature address the dual objective and \(V^k\) under Scenarios (S1) and (S2).
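The monotone decrease of the Lyapunov function can be observed on the scalar example from the end of Sect. 4. In Boyd et al. [9] the Lyapunov function takes the form \(V^k = \tfrac{1}{t}\Vert \lambda ^k-\lambda ^\star \Vert ^2 + t\Vert B(z^k-z^\star )\Vert ^2\); the sketch below assumes that form (with \(B=-1\), \(z^\star =0\), the constraint \(x-z=0\), and \(\lambda ^\star =1\) chosen among the optimal multipliers), all of which are our assumptions for this toy instance:

```python
def soft(v, k):
    """Soft-thresholding: argmin_x 0.5*(x - v)**2 + k*|x| (scalar)."""
    return max(abs(v) - k, 0.0) * (1.0 if v > 0 else -1.0)

def lyapunov_trace(t=1.0, z=1.0, u=3.0, lam_star=1.0, iters=20):
    """Trace of V^k = (1/t)(lam^k - lam*)^2 + t*z_k^2 along scalar ADMM
    for min 0.5x^2+|x| + 0.5z^2+|z|  s.t.  x - z = 0 (lam = t*u)."""
    vs = []
    for _ in range(iters):
        x = soft(t * (z - u), 1.0) / (1.0 + t)
        z = soft(t * (x + u), 1.0) / (1.0 + t)
        u = u + x - z
        vs.append((t * u - lam_star) ** 2 / t + t * z ** 2)
    return vs

vs = lyapunov_trace()
assert all(b <= a for a, b in zip(vs, vs[1:]))  # V^k is nonincreasing along the iterates
```

The decrease of \(V^k\) holds for general convex f and g [9]; here it happens to contract geometrically as well.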
First we consider the case that f is L-smooth and \(c_1\)-strongly convex relative to A. The following proposition establishes the linear convergence of \(\{V^k\}\).
Proposition 2
Let \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) be L-smooth with \(c_1>0\), \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\) and let A have full row rank. If \(t< \sqrt{\tfrac{c_1 L}{\lambda _{\min }(AA^T)}}\), then
where \(d=\tfrac{L}{\lambda _{\min }(AA^T)}\).
Proof
We may assume without loss of generality that \(x^\star , z^\star \) and b are zero; see our discussion in Sect. 2. By optimality conditions, we have
for some \(\eta ^k\in \partial g(z^{k+1})\) and \(\eta ^\star \in \partial g(z^\star )\). Let \(\alpha =\tfrac{2t}{c_1^2d^2+2c_1dt^2-4c_1^2t^2 +t^4}\). By Theorem 2, we get
As \(\Vert A^T\lambda \Vert ^2\ge \tfrac{L}{d}\Vert \lambda \Vert ^2\) and \(\lambda ^{k+1}=\lambda ^k+tAx^{k+1}+tBz^{k+1}\), we obtain the following inequality after performing some algebraic manipulations
The above inequality implies that
and the proof is complete. \(\square \)
Note that one can improve bound (42) under the assumptions of Proposition 2 and the \(\mu \)-strong convexity of f by employing the following known inequality
Indeed, we employed the given inequality, but we could not manage to obtain a closed-form formula for the convergence rate. The next theorem establishes the R-linear convergence of ADMM in terms of the dual objective value under the assumptions of Proposition 2.
Theorem 8
Let \(N\ge 4\) and let A have full row rank. Suppose that \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) is L-smooth with \(c_1>0\) and \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\). If \(t<\min \{c_1, \sqrt{\tfrac{c_1 L}{\lambda _{\min }(AA^T)}}\}\), then
where \(d=\tfrac{L}{\lambda _{\min }(AA^T)}\) and \(\rho =\tfrac{V^0}{16t}\left( 1-\tfrac{2c_1 t}{c_1d+2c_1 t+ t^2} \right) ^{-4}\).
Proof
By Theorem 3 and Proposition 2, one can infer the following inequalities,
which shows the desired inequality. \(\square \)
In the sequel, we investigate the R-linear convergence under the hypotheses of scenario (S2). The next proposition shows the linear convergence of \(\{V^k\}\).
Proposition 3
Let \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) with \(c_1>0\) and let \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\) be L-smooth. Suppose that B has full row rank and \(k\ge 1\). If \(t\le \min \{\tfrac{c_1}{2}, \tfrac{L}{2\lambda _{\min }(BB^T)}\}\), then
Proof
Analogous to the proof of Proposition 2, we assume that \(x^\star =0\), \(z^\star =0\) and \(b=0\). Due to the optimality conditions, we have
for some \(\xi ^{k+1}\in \partial f(x^{k+1})\) and \(\xi ^\star \in \partial f(x^\star )\). Suppose that \(d=\tfrac{L}{\lambda _{\min }(BB^T)}\) and \(\alpha =\tfrac{2dt}{d+t}\). By Theorem 2, we obtain
By employing \(\Vert B^T\lambda \Vert ^2\ge \tfrac{L}{d}\Vert \lambda \Vert ^2\) and \(\lambda ^{k+1}=\lambda ^k+tAx^{k+1}+tBz^{k+1}\), the aforementioned inequality can be expressed as follows after some algebraic manipulation,
Hence, we have
and the proof is complete. \(\square \)
As the sequence \(\{V^k\}\) is nonincreasing [9, Convergence Proof], we have \(V^1\le V^0\). Thus, by using Theorem 3 and Proposition 3, one can infer the following theorem.
Theorem 9
Let \(f\in {\mathcal {F}}_{c_1}^A({\mathbb {R}}^n)\) with \(c_1>0\) and let \(g\in {\mathcal {F}}_{0}({\mathbb {R}}^m)\) be L-smooth. Assume that \(N\ge 5\) and B has full row rank. If \(t<\min \{\tfrac{c_1}{2}, \tfrac{L}{2\lambda _{\min }(BB^T)} \}\), then
where \(\rho =\tfrac{V^0}{16t}\left( \tfrac{L}{L+t\lambda _{\min }(BB^T)} \right) ^{-10}\).
Along the same lines, one can infer the R-linear convergence in terms of the primal and dual residuals under the assumptions of Theorems 8 and 9. In this section, we proved the linear convergence of \(\{V^k\}\) under the two scenarios (S1) and (S2). By (7), it is readily seen that the function \(-D\) is strongly convex under the hypotheses of both scenarios; therefore, both scenarios imply the PŁ inequality. One may wonder whether the PŁ inequality and the strong convexity of f imply the linear convergence of \(\{V^k\}\). By using performance estimation, we could not establish such an implication.
As mentioned above, the function \(-D\) is \(\mu \)-strongly convex under both scenarios. Hence, the dual problem has a unique optimal solution, and one can infer the R-linear convergence of \(\lambda ^N\) by using Theorem 8 (resp. Theorem 9) and the known inequality \(\tfrac{\mu }{2}\Vert \lambda -\lambda ^\star \Vert ^2\le D(\lambda ^\star )-D(\lambda )\).
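Since \(-D\) is \(\mu \)-strongly convex with maximizer \(\lambda ^\star \), the strong-concavity bound combines with the R-linear rate of the dual gap as follows (here \(\rho \) and \(q\in [0,1)\) stand for the constant and rate factor supplied by Theorem 8, resp. Theorem 9; the labels are ours):

```latex
\[
\frac{\mu}{2}\,\|\lambda^N-\lambda^\star\|^2
\;\le\; D(\lambda^\star)-D(\lambda^N)
\;\le\; \rho\, q^{N}
\quad\Longrightarrow\quad
\|\lambda^N-\lambda^\star\| \;\le\; \sqrt{\tfrac{2\rho}{\mu}}\,\bigl(\sqrt{q}\bigr)^{N}.
\]
```

As \(\sqrt{q}\in [0,1)\), the multiplier sequence \(\{\lambda ^N\}\) converges R-linearly to \(\lambda ^\star \).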
6 Concluding remarks
In this paper, we developed a performance estimation framework to handle dual-based methods. Thanks to this framework, we obtained some tight convergence rates for ADMM. The framework may be exploited for the analysis of other variants of ADMM in the ergodic and non-ergodic sense. Moreover, similarly to [27], one can apply it to introduce and analyze new accelerated ADMM variants. Finally, most results hold for an arbitrary positive step length t, but we only managed to obtain closed-form formulas on certain intervals of step lengths.
References
Abbaszadehpeivasti, H., de Klerk, E., Zamani, M.: Conditions for linear convergence of the gradient method for non-convex optimization. Optim. Lett. 17(5), 1105–1125 (2023)
Abbaszadehpeivasti, H., de Klerk, E., Zamani, M.: On the rate of convergence of the difference-of-convex algorithm (DCA). J. Optim. Theory Appl. 1–22 (2023)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)
Beck, A.: First-Order Methods in Optimization. SIAM (2017)
Bertsekas, D.: Convex Optimization Algorithms. Athena Scientific (2015)
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017)
Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.A.: Numerical Optimization: Theoretical and Practical Aspects. Springer (2006)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 3(1), 1–122 (2011)
Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. Math. Oper. Res. 42(3), 783–805 (2017)
Deng, W., Yin, W.: On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66(3), 889–916 (2016)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1), 451–482 (2014)
Franca, G., Robinson, D., Vidal, R.: ADMM and accelerated ADMM as continuous dynamical systems. In: International Conference on Machine Learning, pp. 1559–1567. PMLR (2018)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)
Giselsson, P., Boyd, S.: Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Trans. Autom. Control 62(2), 532–544 (2016)
Glowinski, R., Marroco, A.: Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. ESAIM Math. Model. Numer. Anal. 9(R2), 41–76 (1975)
Glowinski, R., Osher, S.J., Yin, W.: Splitting Methods in Communication, Imaging, Science, and Engineering. Springer (2017)
Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program. 141(1), 349–382 (2013)
Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imag. Sci. 7(3), 1588–1623 (2014)
Gu, G., Yang, J.: On the dual step length of the alternating direction method of multipliers. arXiv preprint arXiv:2006.08309 (2020)
Han, D., Sun, D., Zhang, L.: Linear rate convergence of the alternating direction method of multipliers for convex composite programming. Math. Oper. Res. 43(2), 622–637 (2018)
Han, D.R.: A survey on some recent developments of alternating direction method of multipliers. J. Oper. Res. Soc. China 10(1), 1–52 (2022)
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical learning with sparsity. Monogr. Stat. Appl. Probab. 143, 143 (2015)
He, B., Yuan, X.: On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012)
Hong, M., Luo, Z.Q.: On the linear convergence of the alternating direction method of multipliers. Math. Program. 162(1), 165–199 (2017)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press (2012)
Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)
Li, H., Lin, Z.: Accelerated alternating direction method of multipliers: an optimal O(1/k) nonergodic analysis. J. Sci. Comput. 79(2), 671–699 (2019)
Lin, Z., Li, H., Fang, C.: Alternating Direction Method of Multipliers for Machine Learning. Springer (2022)
Liu, H., Shi, Y., Wang, Z., Ran, L., Lü, Q., Li, H.: A distributed algorithm based on relaxed ADMM for energy resources coordination. Int. J. Electr. Power Energy Syst. 135, 107482 (2022)
Liu, Y., Yuan, X., Zeng, S., Zhang, J.: Partial error bound conditions and the linear convergence rate of the alternating direction method of multipliers. SIAM J. Numer. Anal. 56(4), 2095–2123 (2018)
Lozenski, L., Villa, U.: Consensus ADMM for inverse problems governed by multiple PDE models. arXiv preprint arXiv:2104.13899 (2021)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)
Madani, R., Kalbat, A., Lavaei, J.: ADMM for sparse semidefinite programming with applications to optimal power flow problem. In: 2015 54th IEEE Conference on Decision and Control (CDC), pp. 5932–5939. IEEE (2015)
Monteiro, R.D., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim. 23(1), 475–507 (2013)
Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1), 69–107 (2019)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer (2003)
Nishihara, R., Lessard, L., Recht, B., Packard, A., Jordan, M.: A general analysis of the convergence of ADMM. In: International Conference on Machine Learning, pp. 343–352. PMLR (2015)
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)
Peña, J., Vera, J.C., Zuluaga, L.F.: Linear convergence of the Douglas–Rachford algorithm via a generic error bound condition. arXiv preprint arXiv:2111.06071 (2021)
Rockafellar, R.T.: Convex Analysis. Princeton University Press (1970)
Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992)
Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection. SIAM J. Optim. 30(3), 2251–2271 (2020)
Sabach, S., Teboulle, M.: Faster Lagrangian-based methods in convex optimization. SIAM J. Optim. 32(1), 204–227 (2022)
Stellato, B., Banjac, G., Goulart, P., Bemporad, A., Boyd, S.: OSQP: an operator splitting solver for quadratic programs. Math. Program. Comput. 12(4), 637–672 (2020)
Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Yuan, X., Zeng, S., Zhang, J.: Discerning the linear convergence of ADMM for structured convex optimization through the lens of variational analysis. J. Mach. Learn. Res. 21, 1–83 (2020)
Acknowledgements
The authors would like to thank the two anonymous referees for their valuable comments and suggestions, which helped to improve the paper considerably. In particular, one of the reviewers provided constructive feedback that greatly improved the presentation of the paper.
This work was supported by the Dutch Scientific Council (NWO) Grant OCENW.GROOT.2019.015, Optimization for and with Machine Learning (OPTIMAL).
A Appendix
Lemma 4
Let \(N\ge 4\) and \(t, c_1\in {\mathbb {R}}\). Let \(D(t, c_1)\) be \(N\times N\) symmetric matrix given in Theorem 4. If \(c_1>0\) is given, then
Proof
The argument proceeds in the same manner as in Lemma 1. Due to the convexity of \(\{t: D(t, c_1)\succeq 0\}\), it is sufficient to establish the positive semidefiniteness of \(D(0, c_1)\) and \(D(c_1, c_1)\). As \(D(0, c_1)\) is diagonally dominant, it is positive semidefinite. Next, we demonstrate the positive definiteness of the matrix \(K = D(1,1)\) by computing its leading principal minors. One can show that the claim holds for \(N=4\), so we investigate \(N\ge 5\). To this end, we perform the following elementary row operations on the matrix K:
-
(i)
Add the second row to the third row;
-
(ii)
Add the second row to the last row;
-
(iii)
Add the third row to the fourth row;
-
(iv)
For \(i=4:N-2\)
-
Add the \(i\)-th row to the \((i+1)\)-th row;
-
Add \(\tfrac{3-i}{2i^2-3i-1}\) times the \(i\)-th row to the last row;
-
-
(v)
Add \(\frac{2N^2-8N+9}{2N^2-7N+4}\) times the \((N-1)\)-th row to the \(N\)-th row.
By executing these operations, we transform K into an upper triangular matrix J with diagonal
It is seen that the first \(N-1\) diagonal elements of J are positive. We show that \(J_{N, N}\) is also positive. By using inequality (17), we get
for \(N\ge 5\), which implies \(J_{N, N}>0\). Hence, \( D(c_1, c_1)\succeq 0\) and the proof is complete. \(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Zamani, M., Abbaszadehpeivasti, H. & de Klerk, E. The exact worst-case convergence rate of the alternating direction method of multipliers. Math. Program. (2023). https://doi.org/10.1007/s10107-023-02037-0
Keywords
- Alternating direction method of multipliers (ADMM)
- Performance estimation
- Convergence rate
- PŁ inequality