Abstract
This paper extends the cyclic block proximal gradient method for block separable composite minimization by allowing for inexactly computed gradients and pre-conditioned proximal maps. The resulting algorithm, the inexact cyclic block proximal gradient (I-CBPG) method, shares the same convergence rate as its exactly computed analogue provided the allowable errors decrease sufficiently quickly or are pre-selected to be sufficiently small. We provide numerical experiments that showcase the practical computational advantage of I-CBPG for certain fixed tolerances of approximation error and for a dynamically decreasing error tolerance regime in particular. Our experimental results indicate that cyclic methods with dynamically decreasing error tolerance regimes can actually outpace their randomized siblings with fixed error tolerance regimes. We establish a tight relationship between inexact pre-conditioned proximal map evaluations and \(\delta \)-subgradients in our \((\delta ,B)\)-Second Prox theorem. This theorem forms the foundation of our convergence analysis and enables us to show that inexact gradient computations can be subsumed within a single unifying framework.
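To make the abstract's setting concrete, the following is a minimal sketch, in our own notation, of a cyclic block proximal gradient sweep with inexactly computed block gradients. It is not the paper's exact I-CBPG algorithm (the pre-conditioned proximal maps and error model are specified in the body); here the composite problem is a lasso-type example, the proximal map is exact soft-thresholding, and inexactness is modeled by an additive gradient perturbation of norm at most `eps`. All function names and parameter values are illustrative choices of ours.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def inexact_cbpg(A, b, lam, n_blocks, eps, iters, rng):
    """Sketch of a cyclic block proximal gradient method for
    min_x 0.5*||Ax - b||^2 + lam*||x||_1, where each block gradient
    is perturbed by an error of norm at most eps before the prox step."""
    m, n = A.shape
    x = np.zeros(n)
    blocks = np.array_split(np.arange(n), n_blocks)
    L = np.linalg.norm(A, 2) ** 2  # conservative global Lipschitz constant
    for _ in range(iters):
        for idx in blocks:  # one cyclic sweep over the blocks
            grad = A[:, idx].T @ (A @ x - b)
            noise = rng.standard_normal(idx.size)
            noise *= eps / max(np.linalg.norm(noise), 1e-12)
            x[idx] = soft_threshold(x[idx] - (grad + noise) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
x = inexact_cbpg(A, b, lam=0.1, n_blocks=4, eps=1e-3, iters=200, rng=rng)
obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.1 * np.abs(x).sum()
```

With a small fixed `eps`, the iterates settle into a neighborhood of the minimizer, mirroring the fixed-tolerance regime the abstract describes; shrinking `eps` over sweeps corresponds to the dynamically decreasing tolerance regime.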
Notes
Data and related code for these experiments will be made available upon reasonable request.
Additional information
Communicated by Edouard Pauwels.
Proof of Lemma 3.3
Proof
Fix \(k\ge 2\). We begin by dividing both sides of (27) by \(A_\ell A_{\ell +1}\), then rearranging and using the monotonicity of \(\{A_\ell \}_{\ell \ge 0}\) to simplify this to
$$\frac{1}{A_{\ell+1}} \ge \frac{1}{A_\ell} + \frac{1}{\gamma}\,\frac{A_{\ell+1}}{A_\ell} - \frac{\Delta_{\ell+1}}{A_\ell A_{\ell+1}}.$$
This rearrangement foreshadows the important roles of \(A_{\ell +1}/A_\ell \) and \(\Delta _{\ell +1}/(A_\ell A_{\ell +1})\). We consider two cases, divided according to the typical size of the ratio \(A_{\ell +1}/A_\ell \) for \(\ell + 1 \le k\). In the second case, when the values of \(A_\ell \) fall at a relatively slow rate over this range, we consider three subcases based on the behavior of \(\{\Delta _\ell \}_{\ell \ge 1}\) and the typical values of \(\frac{\Delta _{\ell +1}}{A_\ell A_{\ell +1}}\).
(i) For at least \(\lfloor k/2\rfloor \) values of \(0\le \ell \le k-1\), we have \(A_{\ell +1}/A_\ell \le 1/2\).
(ii) For at least \(\lfloor k/2\rfloor \) values of \(0\le \ell \le k-1\), we have \(1/2 < A_{\ell +1}/A_\ell \le 1\). In this case, we consider three subcases based on the values of \(\Delta _{\ell +1}/(A_\ell A_{\ell +1})\) and the sequence \(\{\Delta _\ell \}_{\ell \ge 1}\).
Case 1: For at least \(\lfloor k/2\rfloor \) values of \(0\le \ell \le k-1\), \(\frac{A_{\ell +1}}{A_\ell }\le \frac{1}{2}\).
This is the easy case. First, assume that k is even. Then, we have that \(A_{\ell +1}\le \frac{1}{2}A_\ell \) for at least k/2 values of \(0\le \ell \le k-1\), so
$$A_k \le \left(\frac{1}{2}\right)^{k/2} A_0,$$
since the \(A_\ell \) terms are decreasing. If \(k>2\) is odd, then \(k-1\) is even, so by the same logic
$$A_k \le \left(\frac{1}{2}\right)^{(k-1)/2} A_0.$$
Case 2: For at least \(\lfloor k/2\rfloor \) values of \(0\le \ell \le k-1\), \(\frac{1}{2} < \frac{A_{\ell +1}}{A_\ell } \le 1\).
We examine the following three subcases in turn:
(i) \(\Delta _{\ell } = \Delta \ge 0 \) for all \(\ell \).
(ii) The sequence \(\{\Delta _\ell \}_{\ell \ge 1}\) shrinks at the sublinear rate \({\mathcal {O}}(1/\ell ^2)\), and for at least \(\lfloor k/4\rfloor \) of the values for which \(\frac{1}{2}< \frac{A_{\ell +1}}{A_\ell }\le 1\), it also holds that \(\frac{\Delta _{\ell +1}}{A_\ell A_{\ell +1}} < \frac{1}{4\gamma }\).
(iii) The sequence \(\{\Delta _\ell \}_{\ell \ge 1}\) shrinks at the sublinear rate \({\mathcal {O}}(1/\ell ^2)\), and for at least \(\lfloor k/4\rfloor \) of the values for which \(\frac{1}{2}<\frac{A_{\ell +1}}{A_\ell }\le 1\), it also holds that \(\frac{\Delta _{\ell +1}}{A_\ell A_{\ell +1}} \ge \frac{1}{4\gamma }\).
Case 2, Subcase i: \(\Delta _\ell = \Delta \ge 0\) for all \(\ell \).
Assume for now that k is even. Define \(u=\sqrt{\Delta \gamma }\), and let \(\tilde{A}_{\ell }=A_{\ell }-u\). Then, the recurrence (27) implies that \( \frac{1}{\gamma }A_{\ell +1}^2\le A_\ell -A_{\ell +1}+\Delta \), which we may express as
$$\frac{1}{\gamma}\big(\tilde{A}_{\ell+1}+u\big)^2 \le \tilde{A}_\ell - \tilde{A}_{\ell+1} + \Delta.$$
Expanding the square on the left, using the definition of u (so that \(u^2/\gamma = \Delta \) cancels on both sides), and rearranging, we see
$$\frac{1}{\gamma}\tilde{A}_{\ell+1}^2 + \frac{2u}{\gamma}\tilde{A}_{\ell+1} \le \tilde{A}_\ell - \tilde{A}_{\ell+1}.$$
If \(\tilde{A}_{k} \le 0\), the result is immediate, so suppose that \(\tilde{A}_{k} > 0\), from which it follows that the earlier \(\tilde{A}_{\ell }\) terms are also positive. Then, for any \(\ell \) with \(0 \le \ell \le k-1\), we may divide the inequality above by the product \(\tilde{A}_{\ell +1}\tilde{A}_\ell \) to obtain
$$\frac{1}{\gamma}\,\frac{\tilde{A}_{\ell+1}}{\tilde{A}_\ell} + \frac{2u}{\gamma \tilde{A}_\ell} \le \frac{1}{\tilde{A}_{\ell+1}} - \frac{1}{\tilde{A}_\ell}.$$
Now, by hypothesis, for at least k/2 indices in the range \(0 \le \ell \le k -1\),
$$\frac{1}{\tilde{A}_{\ell+1}} - \frac{1}{\tilde{A}_\ell} \ge \frac{1}{2\gamma} + \frac{2u}{\gamma \tilde{A}_0}.$$
Iterating backward, one obtains
$$\frac{1}{\tilde{A}_k} \ge \frac{k}{2}\left(\frac{1}{2\gamma} + \frac{2u}{\gamma \tilde{A}_0}\right) = \frac{k(\tilde{A}_0 + 4u)}{4\gamma \tilde{A}_0},$$
which gives \(\tilde{A}_k \le 4\gamma \tilde{A}_0/[k (\tilde{A}_0+4u)]\). The result follows from noting that \(k-1\) is even if k is odd, so we may replace k with \(k-1\) above to obtain a generic bound.
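As a quick numerical sanity check on this subcase, the sketch below iterates the recurrence under the assumption that (27) holds with equality and constant error \(\Delta \), then compares \(\tilde{A}_k = A_k - u\) against the bound just derived. The parameter values are illustrative choices of ours, not from the paper.

```python
import math

def iterate_recurrence(a0, gamma, delta, k):
    """Iterate the equality version of recurrence (27):
       (1/gamma) * A_{l+1}^2 = A_l - A_{l+1} + delta,
       solving the quadratic for A_{l+1} at each step."""
    a = a0
    for _ in range(k):
        # Positive root of (1/gamma) x^2 + x - (a + delta) = 0.
        a = (-gamma + math.sqrt(gamma**2 + 4 * gamma * (a + delta))) / 2
    return a

gamma, a0, delta, k = 1.0, 1.0, 0.01, 50
u = math.sqrt(delta * gamma)   # noise floor sqrt(Delta * gamma)
a_tilde0 = a0 - u
ak = iterate_recurrence(a0, gamma, delta, k)

# Subcase i bound: A_k - u <= 4*gamma*(A_0 - u) / (k * ((A_0 - u) + 4u)).
bound = 4 * gamma * a_tilde0 / (k * (a_tilde0 + 4 * u))
print(ak - u, bound)
```

With constant \(\Delta \), the iterates decrease monotonically toward the floor \(u=\sqrt{\Delta \gamma }\) rather than to zero, which is exactly why the bound is stated for \(\tilde{A}_k\) rather than \(A_k\).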
Case 2, Subcase ii: The sequence \(\{\Delta _\ell \}_{\ell \ge 1}\) shrinks at the sublinear rate \({\mathcal {O}}(1/\ell ^2)\) and for at least \(\lfloor k/4\rfloor \) of the values for which \(\frac{1}{2} < \frac{A_{\ell +1}}{A_\ell } \le 1\), it also holds that \(\frac{\Delta _{\ell +1}}{A_\ell A_{\ell +1}}< \frac{1}{4\gamma }\).
Our reasoning follows the same idea as when \(\Delta _\ell =\Delta \ge 0\) for all \(\ell \ge 1\) (Case 2, Subcase i). First, assume that k is divisible by 4. We have for k/4 values of \(0\le \ell \le k-1\) that
$$\frac{1}{A_{\ell+1}} \ge \frac{1}{A_\ell} + \frac{1}{\gamma}\,\frac{A_{\ell+1}}{A_\ell} - \frac{\Delta_{\ell+1}}{A_\ell A_{\ell+1}} > \frac{1}{A_\ell} + \frac{1}{2\gamma} - \frac{1}{4\gamma} = \frac{1}{A_\ell} + \frac{1}{4\gamma}.$$
This inequality, iterated backward, plus monotonicity and non-negativity of the sequence \(\{A_\ell \}_{\ell \ge 0}\), implies that
$$\frac{1}{A_k} \ge \frac{k}{4}\cdot \frac{1}{4\gamma} = \frac{k}{16\gamma}.$$
Rearranging, we have that \(A_k \le 16\gamma / k\). If \(k > 4\) is not divisible by 4, then \(k-1\), \(k-2\), or \(k-3\) must be, so in the worst case \(A_k \le 16\gamma /(k-3)\).
Case 2, Subcase iii: The sequence \(\{\Delta _\ell \}_{\ell \ge 1}\) shrinks at the sublinear rate \({\mathcal {O}}(1/\ell ^2)\) and for at least \(\lfloor k/4\rfloor \) of the values for which \(\frac{1}{2} < \frac{A_{\ell +1}}{A_\ell } \le 1\), it also holds that \(\frac{\Delta _{\ell +1}}{A_\ell A_{\ell +1}}\ge \frac{1}{4\gamma }\).
First, suppose k is divisible by 4. Let \(\ell ^*\) denote the largest \(\ell \in \{0,\ldots ,k-1\}\) for which \(\frac{\Delta _{\ell +1}}{A_\ell A_{\ell +1}}\ge \frac{1}{4\gamma }\) holds. By hypothesis, \(\ell ^*\) must be at least as big as \(\frac{k}{4} - 1\), and \(\Delta _\ell \le D/\ell ^2\), so
$$\frac{A_k^2}{4\gamma} \le \frac{A_{\ell^*}A_{\ell^*+1}}{4\gamma} \le \Delta_{\ell^*+1} \le \frac{D}{(\ell^*+1)^2} \le \frac{16D}{k^2}.$$
Dividing by \(1/4\gamma \) and taking square roots, we have \(A_{k}\le \frac{8\sqrt{\gamma D}}{k}\). If \(k>4\) is not divisible by 4, then one of \(k-1\), \(k-2\), or \(k-3\) is, so at worst \(A_k\le \frac{8\sqrt{\gamma D}}{k-3}\).
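The decreasing-error regime of Subcases ii and iii can be sanity-checked numerically in the same spirit. The sketch below (again assuming equality in (27), with illustrative parameters of our choosing) iterates with \(\Delta _\ell = D/\ell ^2\) and compares \(A_k\) against the looser generic bound \(\max (16\gamma ,\, 8\sqrt{\gamma D})/(k-3)\), which collects the worst case over the two subcases.

```python
import math

def iterate_decreasing(a0, gamma, D, k):
    """Iterate (1/gamma) A_{l+1}^2 = A_l - A_{l+1} + Delta_{l+1}
       with sublinearly decreasing errors Delta_l = D / l**2."""
    a = a0
    for l in range(k):
        delta = D / (l + 1) ** 2
        a = (-gamma + math.sqrt(gamma**2 + 4 * gamma * (a + delta))) / 2
    return a

gamma, a0, D, k = 1.0, 1.0, 0.01, 20
ak = iterate_decreasing(a0, gamma, D, k)

# Worst case over Subcases ii and iii: A_k <= max(16*gamma, 8*sqrt(gamma*D)) / (k - 3).
bound = max(16 * gamma, 8 * math.sqrt(gamma * D)) / (k - 3)
print(ak, bound)
```

Unlike the constant-error case, here the iterates have no positive noise floor: because \(\Delta _\ell \to 0\), \(A_k\) itself decays at the \({\mathcal {O}}(1/k)\) rate rather than merely approaching \(\sqrt{\Delta \gamma }\).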
Having completed our analysis, we may now combine the results from Case 1 with the appropriate Subcase(s) of Case 2 to establish the results in Lemma 3.3.
\(\square \)
Farias Maia, L., Gutman, D.H. & Hughes, R.C. The Inexact Cyclic Block Proximal Gradient Method and Properties of Inexact Proximal Maps. J Optim Theory Appl (2024). https://doi.org/10.1007/s10957-024-02404-7