
Iterative regularization for low complexity regularizers

  • Published in: Numerische Mathematik

Abstract

Iterative regularization exploits the implicit bias of optimization algorithms to regularize ill-posed problems. Constructing algorithms with such built-in regularization mechanisms is a classic challenge in inverse problems, but also in modern machine learning, where it provides both a new perspective on the analysis of algorithms and significant speed-ups compared to explicit regularization. In this work, we propose and study the first iterative regularization procedure with explicit computational steps able to handle biases described by non-smooth and non-strongly convex functionals, which are prominent in low-complexity regularization. Our approach is based on a primal-dual algorithm, whose convergence and stability properties we analyze even in the case where the original problem is infeasible. The general results are illustrated in the special case of sparse recovery with the \(\ell _1\) penalty. Our theoretical results are complemented by experiments showing the computational benefits of our approach.

Notes

  1. This is an abuse of language since, when R is not strictly convex, \(D_R\) may not be a divergence, meaning that, in general, \(D_{R}(x,x')=0\) does not imply \(x'=x\).

  2. If initialized at 0 and provided \(\gamma < 2 / \left\| A \right\| _{\text {op}}^2\).

  3. https://github.com/mathurinm/libsvmdata.

References

  1. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Structured sparsity through convex optimization. Stat. Sci. 27(4), 450–468 (2012)

  2. Bachmayr, M., Burger, M.: Iterative total variation schemes for nonlinear inverse problems. Inverse Probl. 25(10), 105004 (2009)

  3. Bahraoui, M., Lemaire, B.: Convergence of diagonally stationary sequences in convex optimization. Set-Valued Anal. 2, 49–61 (1994)

  4. Barré, M., Taylor, A., Bach, F.: Principled analyses and design of first-order methods with inexact proximal operators. arXiv preprint arXiv:2006.06041 (2020)

  5. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)

  6. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  7. Benning, M., Burger, M.: Modern regularization methods for inverse problems. Acta Numer. 27, 1–111 (2018)

  8. Benning, M., Burger, M.: Modern regularization methods for inverse problems. Acta Numer. 27, 1–111 (2018)

  9. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  10. Bredies, K., Zhariy, M.: A discrepancy-based parameter adaptation and stopping rule for minimization algorithms aiming at Tikhonov-type regularization. Inverse Probl. 29(2), 025008 (2013)

  11. Brianzi, P., Di Benedetto, F., Estatico, C.: Preconditioned iterative regularization in Banach spaces. Comput. Optim. Appl. 54(2), 263–282 (2013)

  12. Burger, M., Osher, S.: Convergence rates of convex variational regularization. Inverse Probl. 20(5), 1411 (2004)

  13. Burger, M., Osher, S., Xu, J., Gilboa, G.: Nonlinear inverse scale space methods for image restoration. In: International Workshop on Variational, Geometric, and Level Set Methods in Computer Vision, pp. 25–36. Springer (2005)

  14. Burger, M., Gilboa, G., Osher, S., Xu, J.: Nonlinear inverse scale space methods. Commun. Math. Sci. 4(1), 179–212 (2006)

  15. Burger, M., Resmerita, E., He, L.: Error estimation for Bregman iterations and inverse scale space methods in image restoration. Computing 81(2–3), 109–135 (2007)

  16. Burger, M., Möller, M., Benning, M., Osher, S.: An adaptive inverse scale space method for compressed sensing. Math. Comput. 82(281), 269–299 (2013)

  17. Cai, J.-F., Osher, S., Shen, Z.: Convergence of the linearized Bregman iteration for \(\ell _1\)-norm minimization. Math. Comput. 78(268), 2127–2136 (2009)

  18. Cai, J.-F., Osher, S., Shen, Z.: Linearized Bregman iterations for compressed sensing. Math. Comput. 78(267), 1515–1536 (2009)

  19. Calatroni, L., Garrigos, G., Rosasco, L., Villa, S.: Accelerated iterative regularization via dual diagonal descent. SIAM J. Optim. 31(1), 754–784 (2021)

  20. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)

  21. Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)

  22. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  23. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)

  24. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)

  25. Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer (2011)

  26. Condat, L.: A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 158(2), 460–479 (2013)

  27. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. A J. Issued Courant Inst. Math. Sci. 57(11), 1413–1457 (2004)

  28. Deledalle, C., Vaiter, S., Peyré, G., Fadili, J.: Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection. SIAM J. Imaging Sci. 7(4), 2448–2487 (2014)

  29. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)

  30. Engl, H.W., Heinz, W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Springer, New York (1996)

  31. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  32. Fazel, M.: Matrix rank minimization with applications. Ph.D. thesis, Stanford University (2002)

  33. Figueiredo, M., Nowak, R.: Ordered weighted \(\ell _1\) regularized regression with strongly correlated covariates: theoretical aspects. In: AISTATS, pp. 930–938. PMLR (2016)

  34. Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Springer, New York (2013)

  35. Friedlander, M.P., Tseng, P.: Exact regularization of convex programs. SIAM J. Optim. 18(4), 1326–1350 (2008)

  36. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)

  37. Garrigos, G., Rosasco, L., Villa, S.: Iterative regularization via dual diagonal descent. J. Math. Imaging Vis. 60(2), 189–215 (2018)

  38. Ghorbani, B., Mei, S., Misiakiewicz, T., Montanari, A.: Linearized two-layers neural networks in high dimension. Ann. Stat. 49(2), 1029–1054 (2021)

  39. Goldenshluger, A., Pereverzev, S.: On adaptive inverse estimation of linear functionals in Hilbert scales. Bernoulli 9(5), 783–807 (2003)

  40. Grasmair, M., Scherzer, O., Haltmeier, M.: Necessary and sufficient conditions for linear convergence of l1-regularization. Commun. Pure Appl. Math. 64(2), 161–182 (2011)

  41. Gunasekar, S., Woodworth, B.E., Bhojanapalli, S., Neyshabur, B., Srebro, N.: Implicit regularization in matrix factorization. In: NeurIPS, pp. 6151–6159 (2017)

  42. Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 1832–1841 (2018)

  43. Hastie, T.J., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

  44. Huang, B., Ma, S., Goldfarb, D.: Accelerated linearized Bregman method. J. Sci. Comput. 54(2–3), 428–453 (2013)

  45. Iutzeler, F., Malick, J.: Nonsmoothness in machine learning: specific structure, proximal identification, and applications. Set-Valued Var. Anal. 28(4), 661–678 (2020)

  46. Kaltenbacher, B., Neubauer, A., Scherzer, O.: Iterative Regularization Methods for Nonlinear Ill-Posed Problems, vol. 6. Walter de Gruyter, Berlin (2008)

  47. Lanza, A., Morigi, S., Selesnick, I.W., Sgallari, F.: Sparsity-inducing nonconvex nonseparable regularization for convex image processing. SIAM J. Imaging Sci. 12(2), 1099–1134 (2019)

  48. Lorenz, D., Wenger, S., Schöpfer, F., Magnor, M.: A sparse Kaczmarz solver and a linearized Bregman method for online compressed sensing. In: 2014 IEEE international conference on image processing (ICIP), pp. 1347–1351. IEEE (2014)

  49. Lorenz, D.A., Schopfer, F., Wenger, S.: The linearized Bregman method via split feasibility problems: analysis and generalizations. SIAM J. Imaging Sci. 7(2), 1237–1262 (2014)

  50. Massias, M., Vaiter, S., Gramfort, A., Salmon, J.: Dual extrapolation for sparse generalized linear models. J. Mach. Learn. Res. (2020)

  51. Molinari, C., Peypouquet, J.: Lagrangian Penalization scheme with parallel forward-backward splitting. J. Optim. Theory Appl. 117, 413–447 (2018)

  52. Molinari, C., Peypouquet, J., Roldan, F.: Alternating forward-backward splitting for linearly constrained optimization problems. Optim. Lett. 14, 1071–1088 (2020)

  53. Molinari, C., Massias, M., Rosasco, L., Villa, S.: Iterative regularization for convex regularizers. In: International Conference on Artificial Intelligence and Statistics. PMLR, pp 1684–1692 (2021)

  54. Moreau, T., Massias, M., Gramfort, A., Ablin, P., Bannier, P.A., Charlier, B., Dagréou, M., Dupre la Tour, T., Durif, G., Dantas, C.F., Klopfenstein, Q.: Benchopt: reproducible, efficient and collaborative optimization benchmarks. NeurIPS 35, 25404–25421 (2022)

  55. Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity regularization with proximal methods. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 418–433. Springer (2010)

  56. Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: NeurIPS, pp. 451–459 (2011)

  57. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, New York (1983)

  58. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)

  59. Neubauer, A.: On Nesterov acceleration for Landweber iteration of linear ill-posed problems. J. Inverse Ill-posed Probl. 25(3), 381–390 (2017)

  60. Obozinski, G., Taskar, B., Jordan, M.I.: Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 20(2), 231–252 (2010)

  61. Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for total variation-based image restoration. SIAM Multiscale Model. Simul. 4, 460–489 (2005)

  62. Osher, S., Ruan, F., Xiong, J., Yao, Y., Yin, W.: Sparse recovery via differential inclusions. Appl. Comput. Harmon. Anal. 41(2), 436–469 (2016)

  63. Pagliana, N., Rosasco, L.: Implicit regularization of accelerated methods in Hilbert spaces. In: NeurIPS, pp. 14454–14464 (2019)

  64. Pock, T., Chambolle, A.: Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In: 2011 International Conference on Computer Vision, pp. 1762–1769 (2011)

  65. Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  66. Raskutti, G., Wainwright, M.J., Yu, B.: Early stopping and non-parametric regression: an optimal data-dependent stopping rule. J. Mach. Learn. Res. 15(1), 335–366 (2014)

  67. Rosasco, L., Villa, S.: Learning with incremental iterative regularization. In: NeurIPS, pp. 1630–1638 (2015)

  68. Rosasco, L., Santoro, M., Mosci, S., Verri, A., Villa, S.: A regularization approach to nonlinear variable selection. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Volume 9 of Proceedings of Machine Learning Research, pp. 653–660 (2010)

  69. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992)

  70. Salzo, S., Villa, S.: Inexact and accelerated proximal point algorithms. J. Convex Anal. 19(4), 1167–1192 (2012)

  71. Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: NeurIPS, pp. 1458–1466 (2011)

  72. Schopfer, F.: Exact regularization of polyhedral norms. SIAM J. Optim. 22(4), 1206–1223 (2012)

  73. Schöpfer, F., Lorenz, D.: Linear convergence of the randomized sparse Kaczmarz method. Math. Program. 173(1), 509–536 (2019)

  74. Schöpfer, F., Louis, A., Schuster, T.: Nonlinear iterative methods for linear ill-posed problems in Banach spaces. Inverse Probl. 22(1), 311 (2006)

  75. Simon, N., Friedman, J., Hastie, T.J., Tibshirani, R.: A sparse-group lasso. J. Comput. Graph. Stat. 22(2), 231–245 (2013). (ISSN 1061-8600)

  76. Steinwart, I., Christmann, A.: Support Vector Machines. Information Science and Statistics. Springer, New York (2008)

  77. Teboulle, M., Beck, A.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003)

  78. Tikhonov, A.N., Arsenin, V.Y.: Solutions of ill-posed problems. Wiley, New York, V. H. Winston & Sons, Washington (1977). Translated from the Russian, Preface by translation editor Fritz John, Scripta Series in Mathematics

  79. Vaiter, S., Peyré, G., Fadili, J.: Low complexity regularization of linear inverse problems. Sampling Theory, a Renaissance: Compressive Sensing and Other Developments, pp. 103–153 (2015)

  80. Vaškevičius, T., Kanade, V., Rebeschini, P.: Implicit regularization for optimal sparse recovery. In: NeurIPS, pp. 2968–2979 (2019)

  81. Vaškevičius, T., Kanade, V., Rebeschini, P.: The statistical complexity of early stopped mirror descent. In: NeurIPS, pp. 253–264 (2020)

  82. Villa, S., Salzo, S., Baldassarre, L., Verri, A.: Accelerated and inexact forward-backward algorithms. SIAM J. Optim. 23(3), 1607–1633 (2013)

  83. Villa, S., Matet, S., Vu, B.C., Rosasco, L.: Implicit regularization with strongly convex bias: stability and acceleration. Anal. Appl. (2022). https://doi.org/10.1142/S0219530522400139

  84. Vũ, B.C.: A splitting algorithm for dual monotone inclusions involving cocoercive operators. Adv. Comput. Math. 38(3), 667–681 (2013)

  85. Wei, Y., Yang, F., Wainwright, M.J.: Early stopping for kernel boosting algorithms: a general analysis with localized complexities. In: Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/a081cab429ff7a3b96e0a07319f1049e-Paper.pdf

  86. Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalized regression (2008)

  87. Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)

  88. Yin, W.: Analysis and generalizations of the linearized Bregman method. SIAM J. Imaging Sci. 3(4), 856–877 (2010)

  89. Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for l1-minimization with applications to compressed sensing. SIAM J. Imaging Sci. 1(1), 143–168 (2008)

  90. Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty (2010)

  91. Zhang, X., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM J. Imaging Sci. 3, 253–276 (2010)

  92. Zhang, X., Burger, M., Osher, S.: A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46, 20–46 (2011)

  93. Zhao, P., Rocha, G., Yu, B.: The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat. 37(6A), 3468–3497 (2016)

Acknowledgements

L.R. acknowledges the financial support of the European Research Council (Grant SLING 819789), the AFOSR project FA9550-18-1-7009 (European Office of Aerospace Research and Development), the EU H2020-MSCA-RISE project NoMADS-DLV-777826, and the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. S.V. and L.R. acknowledge the support of the AFOSR project FA8655-22-1-7034 and of the H2020-MSCA-ITN Project Trade-OPT 2019. S.V., L.R. and C.M. acknowledge the support of MIUR-PRIN 202244A7YL project Gradient Flows and Non-Smooth Geometric Structures with Applications to Optimization and Machine Learning. The research by C.M. and S.V. has been supported by the MIUR Excellence Department Project awarded to Dipartimento di Matematica, Università di Genova, CUP D33C23001110001. S.V. and C.M. are part of the INdAM research group “Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro applicazioni”. C.M. was supported by the Programma Operativo Nazionale (PON) “Ricerca e Innovazione” 2014-2020.

Author information

Corresponding author

Correspondence to Cesare Molinari.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Preliminary lemmas

Lemma 27

[71, Lemma 2] Assume that \((u_j)\) is a non-negative sequence, that \((S_j)\) is a non-decreasing sequence with \(S_0\ge u_0^2\), and that \(\lambda \ge 0\) is such that, for every \(j\in \mathbb {N}\),

$$\begin{aligned} u_j^2\le S_j+\lambda \sum _{i=1}^{j} u_i . \end{aligned}$$
(39)

Then, for every \(j\in \mathbb {N}\),

$$\begin{aligned} u_j\le \frac{\lambda j}{2}+\sqrt{S_j+\left( \frac{\lambda j}{2}\right) ^2} . \end{aligned}$$
(40)
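As a quick illustration (not part of the original argument), the bound (40) can be sanity-checked numerically: starting from an arbitrary non-negative sequence \((u_j)\), the construction below builds the smallest admissible non-decreasing \((S_j)\), so that hypothesis (39) holds by design; all numerical values are arbitrary.

```python
import numpy as np

# Synthetic sanity check of Lemma 27 (indices are 0-based, so the sum over
# i = 1..j has exactly j terms).
rng = np.random.default_rng(0)
lam = 0.3
u = rng.uniform(0.0, 2.0, size=200)            # non-negative sequence u_j
s = np.cumsum(u) - u[0]                        # s_j = sum_{i=1}^{j} u_i
S = np.maximum.accumulate(u**2 - lam * s)      # non-decreasing, S_j >= u_j^2 - lam*s_j, S_0 = u_0^2
j = np.arange(u.size)
bound = lam * j / 2 + np.sqrt(S + (lam * j / 2) ** 2)   # right-hand side of (40)
print("max of u_j minus the bound (expected <= 0):", np.max(u - bound))
```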

Lemma 28

(Descent lemma, [5], Thm 18.15 (iii)) Let \(f: \mathcal {X}\rightarrow \mathbb {R}\) be Fréchet differentiable with L-Lipschitz continuous gradient. Then, for every x and \(y\in \mathcal {X}\),

$$\begin{aligned} f(y)\le f(x)+\langle \nabla f(x),y-x\rangle +\frac{L}{2}\left\| y-x \right\| _{ }^2 . \end{aligned}$$
(41)
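For a quadratic \(f(x)=\tfrac{1}{2}\langle Qx,x\rangle \) with Q symmetric positive semidefinite, inequality (41) holds with L the largest eigenvalue of Q, since the gap equals \(\tfrac{1}{2}\langle (L\,\mathrm {Id}-Q)(y-x),y-x\rangle \ge 0\). A minimal numerical check of this special case (illustrative only):

```python
import numpy as np

# Descent lemma (41) for f(x) = 0.5 * x^T Q x, whose gradient is Q x and whose
# Lipschitz constant is the largest eigenvalue of Q.
rng = np.random.default_rng(0)
d = 10
B = rng.standard_normal((d, d))
Q = B.T @ B                                     # symmetric positive semidefinite
L = np.linalg.eigvalsh(Q).max()

f = lambda z: 0.5 * z @ Q @ z
x, y = rng.standard_normal(d), rng.standard_normal(d)
gap = f(x) + (Q @ x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2 - f(y)
print("descent lemma gap (expected >= 0):", gap)
```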

Lemma 29

Let \(\mathcal {Z}\) denote \(\mathcal {X}\) or \(\mathcal {Y}\) and U denote T or \(\Sigma \) accordingly. Let \(f \in \Gamma _0(\mathcal {Z})\) and \(\varepsilon \ge 0\). It follows easily from the definition of the \(\varepsilon \)-subdifferential that if \(a, b \in \mathcal {Z}\) satisfy

$$\begin{aligned} U^{-1}\left( a - b\right) \in \partial _{\varepsilon } f(b), \end{aligned}$$
(42)

then, for every \(c \in \mathcal {Z}\),

$$\begin{aligned} f(b) - f(c) + \frac{1}{2}\left\| b - c \right\| ^2_U + \frac{1}{2}\left\| a - b \right\| ^2_U - \frac{1}{2}\left\| c - a \right\| ^2_U \le \varepsilon . \end{aligned}$$
(43)

1.1 Primal-dual estimates

Lemma 30

(One step estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Then, for any \(z=(x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\), with \(V(z):= \frac{1}{2}\left\| x \right\| _T^2 + \frac{1}{2} \left\| y \right\| _\Sigma ^2\),

$$\begin{aligned} \begin{aligned}&V(z_{k + 1} - z) - V(z_k - z) + \frac{1-\tau _M L}{2\tau _M}\left\| x_{k+1}-x_k \right\| _{ }^2+ \frac{1}{2}\left\| y_{k+1}-y_k \right\| _{ \Sigma }^2\\&\quad + \left[ \mathcal {L}^{\delta }(x_{k + 1}, y) - \mathcal {L}^{\delta }(x, y_{k + 1})\right] + \langle y_{k + 1} - {\tilde{y}}_k, A\left( x - x_{k + 1}\right) \rangle \le \varepsilon _{k+1} . \end{aligned} \end{aligned}$$
(44)

Proof

Let \((x, y) \in \mathcal {X}\times \mathcal {Y}\). Applying Lemma 29 to the definition of \(x_{k+1}\) yields

$$\begin{aligned} \begin{aligned}&\frac{1}{2} \left\| x_{k + 1} - x \right\| _T^2 - \frac{1}{2} \left\| x_k - x \right\| _T^2 + \frac{1}{2} \left\| x_{k + 1} - x_k \right\| _T^2 + \left[ R(x_{k + 1})-R(x)\right] \\&\quad + \langle {\tilde{y}}_k, A\left( x_{k + 1} - x\right) \rangle + \langle \nabla F(x_k), x_{k + 1} - x\rangle \ \le \ \varepsilon _{k+1} . \end{aligned} \end{aligned}$$
(45)

For the dual update, similarly,

$$\begin{aligned} \begin{aligned} \frac{1}{2} \left\| y_{k + 1} - y \right\| _{\Sigma }^2 - \frac{1}{2} \left\| y_k - y \right\| _{\Sigma }^2 + \frac{1}{2} \left\| y_{k + 1} - y_k \right\| _{\Sigma }^2 + \langle y_{k + 1} - y, b^\delta - Ax_{k + 1}\rangle \le 0 . \end{aligned}\nonumber \\ \end{aligned}$$
(46)

Recall that \(z:= (x, y)\) and the definition of V. Sum Eqs. (45) and (46):

$$\begin{aligned} \begin{aligned}&V(z_{k + 1} - z) - V(z_k - z) + V(z_{k + 1} - z_k) + \left[ R\left( x_{k + 1}\right) -R(x)\right] \\&\quad + \langle {\tilde{y}}_k, A\left( x_{k + 1} - x\right) \rangle + \langle y_{k + 1} - y, b^\delta - Ax_{k + 1}\rangle + \langle \nabla F(x_k), x_{k + 1} - x\rangle \le \varepsilon _{k+1} . \end{aligned} \end{aligned}$$
(47)

From Lemma 28,

$$\begin{aligned} F(x_{k+1})\le F(x_k)+ \langle \nabla F(x_k), x_{k + 1} - x_k\rangle +\frac{L}{2}\left\| x_{k+1}-x_k \right\| _{ }^2 , \end{aligned}$$

while from the convexity of F,

$$\begin{aligned} F(x_k)+ \langle \nabla F(x_k), x - x_k\rangle \le F(x) . \end{aligned}$$

Summing the last two inequalities, one obtains the three-point descent lemma:

$$\begin{aligned} F(x_{k+1})\le F(x)+ \langle \nabla F(x_k), x_{k + 1} - x\rangle +\frac{L}{2}\left\| x_{k+1}-x_k \right\| _{ }^2 . \end{aligned}$$
(48)

Summing Eqs. (47) and (48),

$$\begin{aligned} \begin{aligned}&V(z_{k + 1} - z) - V(z_k - z) + V(z_{k+1} - z_k)\\&\qquad + \left[ R+F\right] (x_{k+1})- \left[ R+F\right] (x) + \langle {\tilde{y}}_k, A\left( x_{k + 1} - x\right) \rangle + \langle y_{k + 1} - y, b^\delta - Ax_{k + 1}\rangle \\&\quad \le \frac{L}{2}\left\| x_{k+1}-x_k \right\| _{ }^2 + \varepsilon _{k+1} . \end{aligned} \end{aligned}$$

Now compute

$$\begin{aligned}{} & {} \left[ R+F\right] (x_{k+1})- \left[ R+F\right] (x) + \langle {\tilde{y}}_k, A\left( x_{k + 1} - x\right) \rangle \\{} & {} \qquad + \langle y_{k + 1} - y, b^\delta - Ax_{k + 1}\rangle \\{} & {} \quad = \left[ \mathcal {L}^{\delta }(x_{k + 1}, y) - \mathcal {L}^{\delta }(x, y_{k + 1})\right] -\langle y, Ax_{k + 1} - b^\delta \rangle + \langle y_{k + 1}, Ax - b^\delta \rangle \\{} & {} \qquad + \langle {\tilde{y}}_k, A\left( x_{k + 1} - x\right) \rangle + \langle y_{k + 1} - y, b^\delta -Ax_{k + 1}\rangle \\{} & {} \quad = \left[ \mathcal {L}^{\delta }(x_{k + 1}, y) - \mathcal {L}^{\delta }(x, y_{k + 1})\right] - \langle y_{k + 1}-y, b^\delta \rangle - \langle y, Ax_{k + 1}\rangle + \langle y_{k + 1}, A x \rangle \\{} & {} \qquad + \langle {\tilde{y}}_k, A x_{k + 1}\rangle - \langle {\tilde{y}}_k, A x\rangle + \langle y_{k + 1} - y, b^\delta \rangle - \langle y_{k + 1} - y, Ax_{k + 1}\rangle \\{} & {} \quad = \left[ \mathcal {L}^{\delta }(x_{k + 1}, y) - \mathcal {L}^{\delta }(x, y_{k + 1})\right] \\{} & {} \qquad - \langle y, Ax_{k + 1}\rangle + \langle y_{k + 1}, A x\rangle + \langle {\tilde{y}}_k, Ax_{k + 1}\rangle \\{} & {} \qquad - \langle {\tilde{y}}_k, A x\rangle - \langle y_{k + 1}, Ax_{k + 1}\rangle + \langle y, Ax_{k + 1}\rangle \\{} & {} \quad = \left[ \mathcal {L}^{\delta }(x_{k + 1}, y) - \mathcal {L}^{\delta }(x, y_{k + 1})\right] + \langle y_{k + 1} - {\tilde{y}}_k, A\left( x - x_{k + 1}\right) \rangle . \end{aligned}$$

Notice that

$$\begin{aligned} \frac{1}{2\tau _M}\left\| x_{k+1}-x_k \right\| _{ }^2 \le \frac{1}{2}\left\| x_{k+1}-x_k \right\| _{ T }^2 . \end{aligned}$$
(49)

Finally,

$$\begin{aligned} \begin{aligned}&V(z_{k + 1} - z) - V(z_k - z) + \frac{1-\tau _M L}{2\tau _M}\left\| x_{k+1}-x_k \right\| _{ }^2+ \frac{1}{2}\left\| y_{k+1}-y_k \right\| _{ \Sigma }^2\\&\quad + \left[ \mathcal {L}^{\delta }(x_{k + 1}, y) - \mathcal {L}^{\delta }(x, y_{k + 1})\right] + \langle y_{k + 1} - {\tilde{y}}_k, A\left( x - x_{k + 1}\right) \rangle \le \varepsilon _{k+1} . \end{aligned} \end{aligned}$$

\(\square \)

Lemma 31

(First cumulative estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Define \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2)\). Then, for any \((x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned}&\tfrac{1-\tau _M\sigma _M\left\| A \right\| _{ }^2}{2\tau _M}\left\| x_{k} - x \right\| ^2 + \frac{1}{2}\left\| y_{k} - y \right\| _{ \Sigma }^2 + \sum _{j=1}^{k}\left[ \mathcal {L}^{\delta }(x_j,y) - \mathcal {L}^{\delta }(x, y_j)\right] \\&\qquad +\frac{\omega }{2\tau _M}\sum _{j=1}^{k}\left\| x_j - x_{j - 1} \right\| ^2 \\&\quad \le V(z_0 - z) + \sum _{j=1}^{k}\varepsilon _{j}. \end{aligned} \end{aligned}$$
(50)

Proof

We start from the inequality in Lemma 30, with the index k replaced by j. Recalling that \({\tilde{y}}_j:= 2 y_j - y_{j - 1}\), we get

$$\begin{aligned} \begin{aligned}&V(z_{j + 1} - z) - V(z_j - z) + \frac{1-\tau _M L}{2\tau _M}\left\| x_{j+1}-x_j \right\| _{ }^2+ \frac{1}{2}\left\| y_{j+1}-y_j \right\| _{ \Sigma }^2 \\&\qquad + \left[ \mathcal {L}^{\delta }(x_{j+ 1}, y) - \mathcal {L}^{\delta }(x, y_{j + 1})\right] \\&\quad \le \varepsilon _{j+1}- \langle y_{j + 1} - \left( 2y_j - y_{j - 1}\right) , A\left( x - x_{j + 1}\right) \rangle \\&\quad = \varepsilon _{j+1}- \langle y_{j+ 1} - y_j, A\left( x - x_{j + 1}\right) \rangle +\langle y_j-y_{j - 1}, A\left( x - x_{j + 1}\right) \rangle \\&\quad = \varepsilon _{j+1}- \langle y_{j+ 1} - y_j, A\left( x - x_{j + 1}\right) \rangle +\langle y_j-y_{j - 1}, A\left( x - x_j\right) \rangle \\&\qquad +\langle y_j-y_{j - 1}, A\left( x_j - x_{j + 1}\right) \rangle . \end{aligned} \end{aligned}$$

Now focus on the term

$$\begin{aligned} \langle y_j-y_{j - 1}, A\left( x_j - x_{j + 1}\right) \rangle&= \langle \Sigma ^{\frac{1}{2}} \Sigma ^{-\frac{1}{2}} \left( y_j-y_{j - 1}\right) , A\left( x_j - x_{j + 1}\right) \rangle \nonumber \\&= \langle \Sigma ^{-\frac{1}{2}}\left( y_j-y_{j - 1}\right) , \Sigma ^{\frac{1}{2}}A\left( x_j - x_{j + 1}\right) \rangle \nonumber \\&\le \left\| \Sigma ^{-\frac{1}{2}}\left( y_j-y_{j - 1}\right) \right\| \left\| \Sigma ^{\frac{1}{2}}A\left( x_j - x_{j + 1}\right) \right\| \nonumber \\&\le \frac{1}{2}\left\| \Sigma ^{-\frac{1}{2}}\left( y_j-y_{j - 1}\right) \right\| ^2+\frac{1}{2} \left\| \Sigma ^{\frac{1}{2}}A\left( x_j - x_{j + 1}\right) \right\| ^2\nonumber \\&\le \frac{1}{2}\left\| y_j-y_{j - 1} \right\| _{ \Sigma }^2+\frac{\sigma _M\left\| A \right\| _{ }^2}{2} \left\| x_{j + 1}-x_j \right\| ^2 , \end{aligned}$$
(51)

where we used Cauchy-Schwarz and Young inequalities. Then, using the definition of \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2)\), we have

$$\begin{aligned} \begin{aligned}&V(z_{j + 1} - z) - V(z_j - z) + \left[ \mathcal {L}(x_{j+ 1}, y) - \mathcal {L}(x, y_{j + 1})\right] \\&\qquad + \frac{\omega }{2\tau _M}\left\| x_{j+1}-x_j \right\| _{ }^2+ \frac{1}{2}\left\| y_{j+1}-y_j \right\| _{ \Sigma }^2 -\frac{1}{2} \left\| y_j-y_{j-1} \right\| _{ \Sigma }^2\\&\quad \le \varepsilon _{j+1}- \langle y_{j + 1} - y_j, A\left( x - x_{j + 1}\right) \rangle +\langle y_j-y_{j- 1}, A\left( x - x_j\right) \rangle . \end{aligned} \end{aligned}$$
(52)

Setting \(y_{-1}=y_0\) and summing Eq. (52) from \(j=0\) to \(j=k-1\), we obtain:

$$\begin{aligned}{} & {} V(z_{k} - z) - V(z_0 - z) + \sum _{j=0}^{k-1}\left[ \mathcal {L}^{\delta }(x_{j + 1}, y) - \mathcal {L}^{\delta }(x, y_{j + 1})\right] \\{} & {} \qquad + \frac{\omega }{2\tau _M}\sum _{j=0}^{k-1}\left\| x_{j + 1} - x_j \right\| ^2 + \frac{1}{2} \left\| y_k - y_{k - 1} \right\| _{ \Sigma }^2 \\{} & {} \quad \le \sum _{j=0}^{k-1}\varepsilon _{j+1}- \langle y_{k} - y_{k-1}, A\left( x - x_{k}\right) \rangle \\{} & {} \quad \le \frac{1}{2} \left\| y_{k} - y_{k-1} \right\| _{ \Sigma }^2 +\frac{\sigma _M \left\| A \right\| ^2}{2} \left\| x_{k} - x \right\| ^2 + \sum _{j=1}^{k}\varepsilon _{j}, \end{aligned}$$

where in the last inequality we again used the Cauchy-Schwarz and Young inequalities as before. Rearranging, we obtain the claim. \(\square \)

Lemma 32

(Second cumulative estimate) Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by iterations (12) under Assumption 7. Given \(\xi >0\) and \(\eta >0\), define \(\theta := \xi - \tau _M(\xi L+\sigma _M \left\| A \right\| ^2)\) and \(\rho :=\sigma _m(\eta -1)-\sigma _M\xi \eta \). Then, for any \(z = (x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\),

$$\begin{aligned}{} & {} V(z_{k} - z) + \frac{\theta }{2\tau _M\xi }\sum _{j=1}^{k}\left\| x_{j} - x_{j-1} \right\| ^2 +\frac{\rho }{2\eta }\sum _{j=1}^{k}\left\| A x_{j}- Ax \right\| ^2\nonumber \\{} & {} \qquad +\sum _{j=1}^{k}\left[ \mathcal {L}^{\delta }(x_{j }, y) - \mathcal {L}^{\delta }(x, y_{j})\right] \nonumber \\{} & {} \quad \le V(z_0 - z) + \sum _{j=1}^{k} \varepsilon _j + \frac{\sigma _m \left( \eta -1\right) k}{2}\left\| Ax-b^\delta \right\| ^2 . \end{aligned}$$
(53)

Proof

In a similar fashion as in the previous proof, we start again from the main inequality in Lemma 30, switching the index from k to j. Since \({\tilde{y}}_j = y_j + (y_j - y_{j - 1}) = y_j + \Sigma (A x_j - b^\delta )\) and \(y_{j+1} - y_j = \Sigma (A x_{j+1} - b^\delta )\), we get

$$\begin{aligned} \begin{aligned}&V(z_{j + 1} -z) - V(z_j - z) + \frac{1-\tau _M L}{2 \tau _M} \left\| x_{j+1} - x_j \right\| ^2 + \frac{1}{2}\left\| \Sigma \left( A x_{j+1} - b^\delta \right) \right\| _{ \Sigma }^2 \\&\qquad + \left[ \mathcal {L}^{\delta }(x_{j+ 1}, y) - \mathcal {L}^{\delta }(x, y_{j + 1})\right] \\&\quad \le \varepsilon _{j+1}+\langle y_{j+1} - y_j - \Sigma \left( A x_{j} - b^\delta \right) , A x_{j+1} - A x \rangle \\&\quad = \varepsilon _{j+1}+\langle \Sigma A\left( x_{j+1} - x_j\right) , A x_{j+1} - Ax\rangle . \end{aligned} \end{aligned}$$

Now estimate

$$\begin{aligned} \begin{aligned} \frac{1}{2}\left\| \Sigma \left( A x_{j+1} - b^\delta \right) \right\| _{ \Sigma }^2&=\frac{1}{2} \langle \Sigma \left( A x_{j+1} - b^\delta \right) , A x_{j+1} - b^\delta \rangle \\&\ge \frac{\sigma _m}{2} \left\| A x_{j+1} - b^\delta \right\| _{ }^2\\&= \frac{\sigma _m}{2}\left\| A x_{j+1} - Ax \right\| ^2 \\&\quad + \frac{\sigma _m}{2}\left\| Ax-b^\delta \right\| ^2 + \sigma _m \langle A x_{j+1} - Ax, Ax-b^\delta \rangle . \end{aligned} \end{aligned}$$

So,

$$\begin{aligned}{} & {} V(z_{j + 1} - z) - V(z_j - z) + \frac{1-\tau _M L}{2\tau _M} \left\| x_{j+1} - x_j \right\| ^2 + \frac{\sigma _m}{2}\left\| A x_{j+1} - Ax \right\| ^2 \\{} & {} \qquad + \left[ \mathcal {L}^{\delta }(x_{j+1},y) - \mathcal {L}^{\delta }(x,y_{j+1})\right] \\{} & {} \quad \le \varepsilon _{j+1}+\langle \Sigma A\left( x_{j+1} - x_j\right) , A x_{j+1} - Ax\rangle + \sigma _m \langle A x_{j+1} - Ax, b^\delta -Ax \rangle \\{} & {} \qquad - \frac{\sigma _m}{2}\left\| Ax-b^\delta \right\| ^2\\{} & {} \quad \le \varepsilon _{j+1}+ \frac{\sigma _M\left\| A \right\| ^2}{2\xi }\left\| x_{j+1} - x_j \right\| ^2 +\frac{\xi \sigma _M}{2}\left\| A x_{j+1}- Ax \right\| ^2 \\{} & {} \qquad - \frac{\sigma _m}{2}\left\| Ax - b^\delta \right\| ^2\\{} & {} \qquad +\frac{\sigma _m}{2\eta }\left\| A x_{j+1}- Ax \right\| ^2 + \frac{\sigma _m \eta }{2}\left\| Ax-b^\delta \right\| ^2 . \end{aligned}$$

In the last inequality we used the Cauchy-Schwarz inequality three times and the Young inequality twice, with parameters \(\xi >0\) and \(\eta >0\). Then, rearranging and recalling the definition of \(\theta := \xi - \tau _M(\xi L+\sigma _M \left\| A \right\| ^2)\), we obtain

$$\begin{aligned} \begin{aligned}&V(z_{j + 1} - z) - V(z_j - z) + \frac{\theta }{2\tau _M \xi }\left\| x_{j+1} - x_j \right\| ^2 \\&\quad + \frac{\sigma _m(\eta -1)-\sigma _M\xi \eta }{2\eta }\left\| A x_{j+1}- Ax \right\| ^2 \\&\quad + \left[ \mathcal {L}^{\delta }(x_{j+1},y) - \mathcal {L}^{\delta }\left( x,y_{j+1}\right) \right] \le \varepsilon _{j+1}+\frac{\sigma _m \left( \eta -1\right) }{2}\left\| Ax-b^\delta \right\| ^2 . \end{aligned} \end{aligned}$$

Summing-up the latter from \(j=0\) to \(j=k-1\), we get

$$\begin{aligned} \begin{aligned}&V(z_{k} - z) - V(z_0 - z) +\frac{\theta }{2\tau _M\xi }\sum _{j=0}^{k-1}\left\| x_{j + 1} - x_j \right\| ^2 \\&\quad +\frac{\sigma _m(\eta -1)-\sigma _M\xi \eta }{2\eta }\sum _{j=0}^{k-1}\left\| A x_{j+1}- Ax \right\| ^2\\&\quad + \sum _{j=0}^{k-1}\left[ \mathcal {L}^{\delta }(x_{j + 1}, y) - \mathcal {L}^{\delta }(x, y_{j + 1})\right] \ \ \le \ \ \sum _{j=0}^{k-1}\varepsilon _{j+1}+\frac{\sigma _m \left( \eta -1\right) k}{2}\left\| Ax-b^\delta \right\| ^2 . \end{aligned} \end{aligned}$$

By trivial manipulations, we get the claim. \(\square \)

Proofs of main results

1.1 Proof of Proposition 10

Proposition 10

Assume that Assumptions 4 and 5 hold. Let \((x_k, y_k)\) be the sequence generated by iterations (12) applied to \(b^{\delta } = {b}^\star \) under Assumptions 7 and 8. Let also \(\varepsilon _{k}=0\) for every \(k\in \mathbb {N}\). Then \((x_k, y_k)\) weakly converges to a pair in \(\mathcal {S}^\star \). In particular, \((x_k)\) weakly converges to a point in \(\mathcal {P}^{\star }\).

Proof

Up to a change of initialization and an index shift, the steps of algorithm (12) with \(\varepsilon _k = 0\) correspond to

$$\begin{aligned} {\left\{ \begin{array}{ll} y_{k + 1} = y_k + \Sigma \left( Ax_{k} - {b}^\star \right) \\ x_{k+1} = {{\,\textrm{prox}\,}}^{T}_R(x_k - T\nabla F(x_k) - TA^*(2 y_{k+1} - y_{k})) . \end{array}\right. } \end{aligned}$$
(54)

We now show that the previous iterations correspond to Algorithm 3.2 in [26], setting \(\sigma =\tau =1\) and applying it in the metrics defined by the preconditioning operators; namely, in the primal and dual spaces \((\mathcal {X}, \ \langle T^{-1} \cdot , \cdot \rangle )\) and \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\), respectively. Comparing problem (15) with (1) in [26], their notation in our setting reads as \(F=F,\ G=R,\ H=\iota _{\left\{ {b}^\star \right\} }\) and \(K=A\). The Fenchel conjugate of H in \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) is

$$\begin{aligned} \begin{aligned} H^{\star }(y)&=\sup _{z\in \mathcal {Y}} \left\{ \langle \Sigma z, y \rangle -\iota _{\left\{ {b}^\star \right\} }(z)\right\} =\langle \Sigma {b}^\star , y \rangle \end{aligned} \end{aligned}$$
(55)

and its proximal-point operator, again in \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\), is

$$\begin{aligned} \begin{aligned} {{\,\textrm{prox}\,}}_{H^{\star }}(y)&=\mathop {\textrm{argmin}}\limits _{z\in \mathcal {Y}} \left\{ \langle \Sigma {b}^\star , z \rangle +\frac{1}{2}\langle \Sigma (z-y),z-y\rangle \right\} =y-{b}^\star . \end{aligned} \end{aligned}$$
(56)

The gradient of F in \((\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\) is denoted by \(\nabla _{T} F(x)\) and satisfies, for x and v in \(\mathcal {X}\),

$$\begin{aligned} \langle T^{-1}\nabla _{T} F(x), v \rangle =\langle \nabla F(x), v \rangle . \end{aligned}$$

It is easy to see that one has \(\nabla _T F(x)=T \nabla F(x)\).

The adjoint operator of \(K: \ (\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\rightarrow (\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) satisfies, for every \((x,y)\in \mathcal {X}\times \mathcal {Y}\),

$$\begin{aligned} \begin{aligned} \langle T^{-1}K^*y,x \rangle&=\langle \Sigma Kx, y \rangle =\langle \Sigma Ax, y \rangle = \langle x, A^*\Sigma y \rangle , \end{aligned} \end{aligned}$$
(57)

implying that \(T^{-1}K^*=A^*\Sigma \) and so that \(K^*=TA^*\Sigma \). Then Algorithm 3.2 in [26] (with \(\sigma =\tau =1\), \(\rho _k=1\) for every \(k\in \mathbb {N}\) and no errors involved) is:

$$\begin{aligned} {\left\{ \begin{array}{ll} \bar{y}_{k + 1} = {{\,\textrm{prox}\,}}_{H^{\star }}(\bar{y}_k+K\bar{x}_k)\\ \bar{x}_{k+1} = {{\,\textrm{prox}\,}}_R(\bar{x}_k - \nabla _T F(\bar{x}_k) - K^*(2 \bar{y}_{k+1} - \bar{y}_{k})) , \end{array}\right. } \end{aligned}$$

and becomes, applied to our setting in the spaces \((\mathcal {X}, \ \langle T^{-1} \cdot , \cdot \rangle )\) and \((\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\),

$$\begin{aligned} {\left\{ \begin{array}{ll} \bar{y}_{k + 1} =\bar{y}_k + A\bar{x}_k-{b}^\star \\ \bar{x}_{k+1} = \mathop {\textrm{argmin}}\limits _{x\in \mathcal {X}} \left\{ R(x)+\frac{1}{2}\left\| x-\left[ \bar{x}_k - T\nabla F(\bar{x}_k) - TA^*\Sigma (2 \bar{y}_{k+1} - \bar{y}_{k})\right] \right\| _{ T^{-1} }^2\right\} . \end{array}\right. } \end{aligned}$$

Define the variable \(\bar{z}_k=\Sigma \bar{y}_k\) and multiply the first line by \(\Sigma \). Then,

$$\begin{aligned} {\left\{ \begin{array}{ll} \bar{z}_{k + 1} =\bar{z}_k + \Sigma \left( A\bar{x}_k-{b}^\star \right) \\ \bar{x}_{k+1} = {{\,\textrm{prox}\,}}_R^T\left( \bar{x}_k - T\nabla F(\bar{x}_k) - TA^*(2 \bar{z}_{k+1} - \bar{z}_{k})\right) . \end{array}\right. } \end{aligned}$$

Comparing the previous scheme with (54), we see that they are indeed the same algorithm. To conclude, we want to use Theorem 3.1 in [26], which ensures the weak convergence of the sequence generated by the algorithm to a saddle-point. It remains to check that, under our assumptions, the hypotheses of the above result are indeed satisfied; namely, that

$$\begin{aligned} 1-\left\| K \right\| _{ }^2-\frac{L_T}{2} \ge 0 , \end{aligned}$$
(58)

where \(\left\| K \right\| _{ }\) represents the operator norm of \(K: \ (\mathcal {X}, \ \langle T^{-1}\cdot , \cdot \rangle )\rightarrow (\mathcal {Y}, \ \langle \Sigma \cdot , \cdot \rangle )\) and \(L_T\) is the Lipschitz constant of \(\nabla _T F\). Notice that

$$\begin{aligned} \left\| K \right\| _{ }^2=\sup _{x\in \mathcal {X}} \frac{\langle \Sigma Ax,Ax\rangle }{\langle T^{-1}x,x\rangle }\le \sigma _M\tau _M\left\| A \right\| _{ }^2 . \end{aligned}$$

Moreover, \(L_T\le \tau _M L\). Indeed, for every x and \(x'\in \mathcal {X}\),

$$\begin{aligned} \left\| \nabla _T F(x')-\nabla _T F(x) \right\| _{ }=\left\| T \nabla F(x')-T \nabla F(x) \right\| _{ }\le \tau _M \left\| \nabla F(x')- \nabla F(x) \right\| _{ } . \end{aligned}$$

Then, by Assumption 8 and the previous considerations,

$$\begin{aligned} 0\le 1-\tau _M(L + \sigma _M\left\| A \right\| _{ }^2)\le 1-L_T-\left\| K \right\| _{ }^2\le 1-\frac{L_T}{2}-\left\| K \right\| _{ }^2 . \end{aligned}$$

In particular, (58) is satisfied and the claim is proved. \(\square \)
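For concreteness, here is a minimal numerical sketch of the exact-data iterations (54) in the special case \(F=0\), \(R=\left\| \cdot \right\| _1\), with scalar step sizes \(T=\tau \,\mathrm {Id}\) and \(\Sigma =\sigma \,\mathrm {Id}\) such that \(\tau \sigma \left\| A \right\| ^2<1\); in this case \({{\,\textrm{prox}\,}}^{T}_R\) is soft-thresholding with threshold \(\tau \). The random instance, the dimensions and the iteration count are arbitrary choices made only for illustration.

```python
import numpy as np

# Sketch of iterations (54) for basis pursuit: F = 0, R = ||.||_1, exact data.
rng = np.random.default_rng(0)
n, d, s = 20, 60, 4
A = rng.standard_normal((n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
b_star = A @ x_true                                    # exact datum b*

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
tau = sigma = 0.9 / np.linalg.norm(A, 2)               # tau * sigma * ||A||^2 < 1

x, y = np.zeros(d), np.zeros(n)
for _ in range(5000):
    y_new = y + sigma * (A @ x - b_star)               # dual update
    x = soft(x - tau * (A.T @ (2 * y_new - y)), tau)   # primal (preconditioned prox) update
    y = y_new

print("feasibility ||A x - b*||:", np.linalg.norm(A @ x - b_star))
print("l1 norms (iterate, planted vector):", np.abs(x).sum(), np.abs(x_true).sum())
```

With exact data and no errors, Proposition 10 predicts weak convergence of the primal iterate to a point of \(\mathcal {P}^{\star }\), here a minimal-\(\ell _1\) solution of \(Ax={b}^\star \); on well-conditioned random instances such as the one above, the two printed \(\ell _1\) norms are typically close.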

1.2 Proof of Proposition 11

Proposition 11

Let \(({x}^\star , {y}^\star ) \in \mathcal {S}^\star \) and \((x, y) \in \mathcal {X}\times \mathcal {Y}\) such that \(\mathcal {L}^{\star }(x, {y}^\star ) - \mathcal {L}^{\star }({x}^\star , y) = 0\) and \(A x = {b}^\star \). Then \((x, {y}^\star ) \in \mathcal {S}^\star \).

Proof

For simplicity, denote \(J:=R+F\). First notice that, for our problem, the Lagrangian gap is equal to the Bregman divergence. Indeed, using \(-A^* {y}^\star \in \partial J({x}^\star )\) and \(A {x}^\star = {b}^\star \):

$$\begin{aligned} \mathcal {L}^{\star }(x, {y}^\star ) - \mathcal {L}^{\star }({x}^\star , y)&= J(x) - J({x}^\star ) + \langle {y}^\star , A x - {b}^\star \rangle - \langle y, A {x}^\star - {b}^\star \rangle \nonumber \\&= J(x) - J({x}^\star ) + \langle A^* {y}^\star , x - {x}^\star \rangle = D_J^{-A^* {y}^\star }(x, {x}^\star ), \end{aligned}$$
(59)

We then show that if \(v \in \partial J({x}^\star )\) and \(D_{J}^{v}(x, {x}^\star ) = 0\), then \(v \in \partial J(x)\). Indeed, \(J(x) - J({x}^\star ) - \langle v, x - {x}^\star \rangle = 0\) and so, for all \(x' \in \mathcal {X}\),

$$\begin{aligned}{} & {} J(x') \ge J({x}^\star ) + \langle v, x' - {x}^\star \rangle = J(x) - \langle v, x - {x}^\star \rangle + \langle v, x' - {x}^\star \rangle \nonumber \\{} & {} \quad = J(x) + \langle v, x' - x \rangle . \end{aligned}$$
(60)

Proposition 11 follows by taking \(v = -A^* {y}^\star \): then \(-A^*{y}^\star \in \partial J(x)\) and, since \(Ax={b}^\star \) by assumption, \((x, {y}^\star )\) is a saddle point, that is, \((x, {y}^\star ) \in \mathcal {S}^\star \). \(\square \)

1.3 Proof of Theorem 13

Theorem 13

Let Assumptions 4 and 5 hold and \(({x}^\star , {y}^\star ) \in \mathcal {S}^\star \) be a saddle-point of the exact problem. Let \((x_k, y_k)\) be generated by (12) under Assumptions 7 and 8 with inexact data \(b^\delta \) such that \(\left\| b^\delta - {b}^\star \right\| \le \delta \) and error \(\vert \varepsilon _k \vert \le C_0 \delta \) in the proximal operator for all \(k \in \mathbb {N}\). Denote by \((\hat{x}_k, \hat{y}_k)\) the averaged iterates \((\tfrac{1}{k}\sum _{j=1}^{k}x_j, \tfrac{1}{k} \sum _{j=1}^{k} y_j)\). Then there exist constants \(C_1, C_2\), \(C_3\) and \(C_4\) such that, for every \(k \in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} \mathcal {L}^{\star }(\hat{x}_k,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , \hat{y}_k) \ \le \ \frac{C_1}{k} +C_2 \delta + C_3 \delta ^{3/2} k^{1/2} + C_4 \delta ^2 k . \end{aligned} \end{aligned}$$
(19)

Let also Assumption 9 hold. Then there exist constants \(C_5, C_6\), \(C_7\), \(C_8\) and \(C_9\) such that, for every \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} \left\| A \hat{x}_{k} - {b}^\star \right\| ^2 \ {}&\le \frac{C_5}{k} + C_6 \delta + C_7 \delta ^{3/2} k^{1/2} + C_8 \delta ^2 k + C_9 \delta ^2 . \end{aligned} \end{aligned}$$
(20)

Proof

Recall that we denote \(z = (x, y) \in \mathcal {X}\times \mathcal {Y}\) a primal-dual pair, and define

$$\begin{aligned} V(z):= \frac{1}{2}\left\| x \right\| _{ T }^2+\frac{1}{2}\left\| y \right\| _{ \Sigma }^2 . \end{aligned}$$
(61)

Use Lemma 31 at \(x={x}^\star \) and \(y={y}^\star \), to get

$$\begin{aligned} \begin{aligned}&\tfrac{1-\tau _M\sigma _M\left\| A \right\| _{ }^2}{2\tau _M}\left\| x_{k} - {x}^\star \right\| ^2 + \tfrac{1}{2}\left\| y_{k} - {y}^\star \right\| _{ \Sigma }^2 \\&\qquad + \sum _{j=1}^{k} {[}\mathcal {L}^\delta (x_j, {y}^\star ) - \mathcal {L}^\delta ({x}^\star , y_j)] +\tfrac{\omega }{2\tau _M}\sum _{j=1}^{k}\left\| x_j - x_{j - 1} \right\| ^2 \\&\quad \le V(z_0 - {z}^\star )+\sum _{j=1}^{k}\varepsilon _{j} . \end{aligned} \end{aligned}$$
(62)

Notice that

$$\begin{aligned} \mathcal {L}^\delta (x_{j}, {y}^\star ) - \mathcal {L}^\delta ({x}^\star , y_{j}) = \mathcal {L}^{\star }(x_{j}, {y}^\star ) - \mathcal {L}^{\star }({x}^\star , y_{j}) + \langle y_{j} - {y}^\star , b^\delta -{b}^\star \rangle . \end{aligned}$$
(63)

Then,

$$\begin{aligned}&\tfrac{1-\tau _M\sigma _M\left\| A \right\| _{ }^2}{2\tau _M}\left\| x_{k} - {x}^\star \right\| ^2 + \frac{1}{2}\left\| y_{k} - {y}^\star \right\| _{ \Sigma }^2 + \sum _{j=1}^{k}\left[ \mathcal {L}^{\star }(x_j,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y_j)\right] \nonumber \\&\qquad +\tfrac{\omega }{2\tau _M}\sum _{j=1}^{k}\left\| x_j - x_{j - 1} \right\| ^2 \nonumber \\&\quad \le V(z_0 - {z}^\star )+\sum _{j=1}^{k}\varepsilon _{j}+ \delta \sum _{j=1}^{k} \left\| y_{j} - {y}^\star \right\| . \end{aligned}$$
(64)

Recall that \(\mathcal {L}^{\star }(x,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y)\ge 0 \) for every \(\left( x,y\right) \in \mathcal {X}\times \mathcal {Y}\). Moreover, \(\omega \ge 0\) by Assumption 8 and so \(1-\tau _M\sigma _M\left\| A \right\| _{ }^2 \ge 0\). Then, for every \(j\in \mathbb {N}\), we have that

$$\begin{aligned} \left\| y_j - {y}^\star \right\| _{ \Sigma }^2 \ \le \ 2 V(z_0 - {z}^\star ) + 2\sum _{i=1}^{j}\varepsilon _{i} + 2\delta \sum _{i=1}^{j}\left\| y_i - {y}^\star \right\| \end{aligned}$$
(65)

and so

$$\begin{aligned} \left\| y_j - {y}^\star \right\| _{ }^2 \ \le \ 2\sigma _M \left[ V(z_0 - {z}^\star ) +\sum _{i=1}^{j}\varepsilon _{i}\right] + 2\delta \sigma _M \sum _{i=1}^{j}\left\| y_i - {y}^\star \right\| . \end{aligned}$$
(66)

Apply Lemma 27 to Eq. (66) with \(u_j=\left\| y_j - {y}^\star \right\| \), \(S_j=2 \sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{j}\varepsilon _{i} \right] \) and \(\lambda =2\delta \sigma _M\). We get, for \(1\le j\le k\),

$$\begin{aligned} \begin{aligned} \left\| y_j - {y}^\star \right\|&\le \delta \sigma _M j + \sqrt{2\sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{j}\varepsilon _{i} \right] +\left( \delta \sigma _M j\right) ^2}\\&\le 2\delta \sigma _M k + \sqrt{2\sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{k}\varepsilon _{i} \right] } . \end{aligned} \end{aligned}$$
(67)

Insert the latter in Eq. (64), to obtain

$$\begin{aligned} \begin{aligned}&\sum _{j=1}^{k}\left[ \mathcal {L}^{\star }(x_j,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y_j)\right] \\&\quad \le V(z_0 - {z}^\star )+\sum _{j=1}^{k}\varepsilon _{j} + \delta \sum _{j=1}^{k} \left( 2\delta \sigma _M k + \sqrt{2\sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{k}\varepsilon _{i} \right] }\right) \\&\quad = V(z_0 - {z}^\star ) + \sum _{j=1}^{k}\varepsilon _{j}+\delta k \sqrt{2\sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{k}\varepsilon _{i} \right] } + 2 \delta ^2 \sigma _M k^2 \\&\quad \le V(z_0 - {z}^\star ) + C_0 k \delta + \delta k \left( \sqrt{2\sigma _M V(z_0 - {z}^\star )} + \sqrt{2\sigma _M C_0 k \delta } \right) + 2 \delta ^2 \sigma _M k^2 , \end{aligned} \end{aligned}$$

where the last line uses \(\sum _{i=1}^{k}\varepsilon _{i} \le C_0 k \delta \) and \(\sqrt{a + b} \le \sqrt{a} + \sqrt{b}\). Dividing by k and applying Jensen’s inequality to the averaged iterates, we get the first claim.

For the second result, apply Lemma 32 at \(x={x}^\star \) and \(y={y}^\star \):

$$\begin{aligned} \begin{aligned}&V(z_{k} - {z}^\star ) + \frac{\theta }{2\tau _M\xi } \sum _{j=1}^{k}\left\| x_{j} - x_{j-1} \right\| ^2 + \frac{\rho }{2\eta }\sum _{j=1}^{k}\left\| A x_{j} - A{x}^\star \right\| ^2\\&\quad + \sum _{j=1}^{k}\left[ \mathcal {L}^\delta (x_{j }, {y}^\star ) - \mathcal {L}^\delta ({x}^\star , y_{j})\right] \\&\quad \le V(z_0 - {z}^\star )+\sum _{j=1}^{k} \varepsilon _j + \frac{\sigma _m \left( \eta -1\right) k}{2}\left\| A{x}^\star -b^\delta \right\| ^2 . \end{aligned} \end{aligned}$$
(68)

Using Eqs. (63) and (67), we have

$$\begin{aligned} V(z_{k} - {z}^\star )&+\frac{\theta }{2\tau _M\xi }\sum _{j=1}^{k}\left\| x_{j} - x_{j-1} \right\| ^2 +\frac{\rho }{2\eta }\sum _{j=1}^{k}\left\| A x_{j}- {b}^\star \right\| ^2\nonumber \\&\qquad +\sum _{j=1}^{k}\left[ \mathcal {L}^{\star }(x_{j }, {y}^\star ) - \mathcal {L}^{\star }({x}^\star , y_{j})\right] \nonumber \\&\quad \le V(z_0 - {z}^\star )+\sum _{j=1}^{k} \varepsilon _j +\sum _{j=1}^{k}\langle y_j-{y}^\star ,{b}^\star -b^\delta \rangle + \frac{\sigma _m \left( \eta -1\right) k}{2}\left\| {b}^\star -b^\delta \right\| ^2\nonumber \\&\quad \le V(z_0 - {z}^\star )+\sum _{j=1}^{k} \varepsilon _j +\delta \sum _{j=1}^{k}\left\| y_j-{y}^\star \right\| + \frac{\sigma _m \left( \eta -1\right) k}{2}\delta ^2\nonumber \\&\quad \le V(z_0 - {z}^\star )+\sum _{j=1}^{k} \varepsilon _j+2\sigma _M\delta ^2 k^2 + \delta k \sqrt{2\sigma _M \left[ V(z_0 - {z}^\star )+\sum _{i=1}^{k}\varepsilon _{i} \right] }\nonumber \\&\qquad +\frac{\sigma _m \left( \eta -1\right) k}{2}\delta ^2 . \end{aligned}$$
(69)

Recall that \(\theta \ge 0\) and that \(\mathcal {L}^{\star }(x,{y}^\star ) - \mathcal {L}^{\star }({x}^\star , y)\ge 0 \) for every \(\left( x,y\right) \in \mathcal {X}\times \mathcal {Y}\). By Jensen’s inequality, rearranging the terms and using \(\sum _{i=1}^k \varepsilon _i \le C_0 k \delta \), we get the claim. The exact values of the constants of Theorem 13 are therefore:

$$\begin{aligned} C_1= & {} V(z_0 - {z}^\star ) , \nonumber \\ C_2= & {} C_0 + \sqrt{2\sigma _M V(z_0 - {z}^\star )} , \nonumber \\ C_3= & {} \sqrt{2 \sigma _M C_0} , \nonumber \\ C_4= & {} 2 \sigma _M , \nonumber \\ C_5= & {} \frac{2\eta }{\rho } C_1 , \nonumber \\ C_6= & {} \frac{2\eta }{\rho } C_2 , \nonumber \\ C_7= & {} \frac{2\eta }{\rho } C_3 , \nonumber \\ C_8= & {} \frac{2\eta }{\rho } C_4 , \nonumber \\ C_9= & {} \frac{\eta \sigma _m(\eta - 1)}{\rho } , \end{aligned}$$
(70)

\(\square \)
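As an informal side remark on Theorem 13 (a back-of-the-envelope computation, not part of the proof above): if the number of iterations is tied to the noise level as \(k\approx c/\delta \) for some constant \(c>0\), then every term in the bound (19) is of order \(\delta \),

$$\begin{aligned} \frac{C_1}{k} +C_2 \delta + C_3 \delta ^{3/2} k^{1/2} + C_4 \delta ^2 k \ \approx \ \left( \frac{C_1}{c} + C_2 + C_3\sqrt{c} + C_4\, c\right) \delta , \end{aligned}$$

so the Lagrangian gap at the averaged iterates behaves like \(O(\delta )\), and the same choice in (20) gives \(\left\| A \hat{x}_{k} - {b}^\star \right\| = O(\sqrt{\delta })\). This balancing is only meant to illustrate how the constants interact with an early-stopping choice of k.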

1.4 Example of divergence in absence of noisy solution (see Remark 15)

We present an example in which the exact primal problem has a solution, but the noisy one does not, and the averaged primal iterates generated by Algorithm (12) indeed diverge. First note that, if the function R has bounded domain, the primal iterates remain bounded. So, to exhibit a case of divergence of the primal iterates, we consider a function R with full domain: set \(R (\cdot )= \frac{1}{2} \left\| \cdot \right\| ^2\) (and \(F=0\)). The exact problem is then

$$\begin{aligned} \min _{x \in \mathcal {X}}\ \frac{1}{2} \left\| x \right\| ^2 \quad \text {s.t.} \quad Ax = b^{\star } . \end{aligned}$$
(71)

Now consider a noisy datum \(b^{\delta }\) such that \(Ax=b^\delta \) does not have a solution. If the associated normal equation, namely \(A^*Ax=A^*b^{\delta }\), is feasible, then in Sect. 7 we prove not only boundedness of the iterates but also convergence to a normal solution. On the contrary, to get divergence of the iterates, here we consider a classic scenario in which the perturbation of the exact data generates an infeasible constraint, even for the associated normal equation. We recall that this may happen only in the infinite dimensional setting since, when the range of A is finite dimensional, it is also closed and a solution to the normal equation always exists. As a prototype of an ill-posed problem, let \(\mathcal {X}=\mathcal {Y}=\ell ^2\) and let A be defined by, for every \(x\in \ell ^2\) and for every \(i\in \mathbb {N}\),

$$\begin{aligned} (Ax)^i = a^ix^i , \end{aligned}$$

where, for every \(i\in \mathbb {N}\), \(a^i\in (0,M)\) for a fixed constant \(M>0\) and \(\inf _{i\in \mathbb {N}} a^i =0\). Note that \(A:\ell ^2\rightarrow \ell ^2\) is well-defined, linear, continuous, self-adjoint and compact. Let \(b^\star \) be in the range of A and denote by \(x^\star \) the unique solution to \(\mathcal {P}^{\star }\) defined in Eq. (71); namely, \((x^\star )^i:= (b^\star )^i/a^i\) for every \(i\in \mathbb {N}\). In particular, the \((b^\star )^i\) are such that \(x^\star \) belongs to \(\ell ^2\). Let also \(b^{\delta }\in \ell ^2\) with \(\Vert b^\delta -b^\star \Vert \le \delta \), but such that the noisy equation does not have a normal solution. Defining, for every \(i\in \mathbb {N}\),

$$\begin{aligned} (x^\delta )^i:= (b^\delta )^i/a^i, \end{aligned}$$
(72)

the previous requirement means that \(x^\delta \) does not belong to \(\ell ^2\). For an explicit example, consider \(a^i=1/i\), \((b^\star )^i=1/i^2\) and \((b^\delta )^i=(b^\star )^i+C/i\), with \(C=\delta / \sqrt{\sum _{j=1}^{+\infty } 1/j^2}\).

Apply the algorithm with step-sizes \(\sigma >0\) and \(\tau >0\) such that \(\sigma \tau < 1/\left\| A \right\| _{ }^2\), and notice that this implies \(\sigma \tau < 1/(a^i)^2\) for every \(i\in \mathbb {N}\). As \(a^i>0\) for every \(i\in \mathbb {N}\), the coordinates of the averaged sequence \((\hat{x}_k^i)\) converge to the solution of the following (one-dimensional) optimization problem:

$$\begin{aligned} \mathcal {P}^i&:= \mathop {\textrm{argmin}}\limits _{x^i\in \mathbb {R}} \left\{ \frac{1}{2}(x^i)^2: \ \ a^ix^i=(b^\delta )^i\right\} = \left\{ \frac{(b^\delta )^i}{a^i} \right\} . \end{aligned}$$

Hence, for the primal-dual algorithm, if \(x^{\delta }\notin \ell ^2\), then \((\hat{x}_k)\) diverges. Indeed, suppose by contradiction that \((\hat{x}_k)\) is bounded. Then it admits a weakly convergent subsequence, whose weak limit belongs to \(\ell ^2\); on the other hand, coordinate-wise convergence forces this limit to coincide with \(x^{\delta }\), which is not in \(\ell ^2\), a contradiction.

Note that the problem considered in this example can also be treated by the Landweber method, and it is well known that the iterates generated by that method, while different from those of the primal-dual algorithm, also diverge.
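A numerical illustration of this construction, truncating \(\ell ^2\) to the first n coordinates (all parameter values are arbitrary): the coordinate-wise limit \(x^\delta \) has entries \(1/i+C\), so its truncated norm grows like \(C\sqrt{n}\), which is the quantitative reason why no bounded limit can exist.

```python
import numpy as np

# Finite truncation of the divergence example: a^i = 1/i, (b*)^i = 1/i^2,
# (b^delta)^i = (b*)^i + C/i, so x^delta has entries 1/i + C.
delta = 0.1
C = delta / np.sqrt(np.pi**2 / 6)              # C = delta / sqrt(sum_j 1/j^2)
for n in (100, 10_000, 1_000_000):
    i = np.arange(1, n + 1, dtype=float)
    x_delta = 1.0 / i + C                      # (b^delta)^i / a^i
    print(f"n = {n:8d}   norm of x^delta on the first n coordinates = {np.linalg.norm(x_delta):8.2f}")
```

Since the averaged primal iterates converge coordinate-wise to \(x^\delta \), their norms eventually exceed each of these truncated values, which is the divergence stated above.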

Sparse recovery

1.1 Proof of Proposition 16

Proposition 16

Fix a primal-dual solution \(\left( {x}^\star , {y}^\star \right) \in \mathcal {S}^\star \). Let the extended support be \(\Gamma := \{i \in \mathbb {N}: | \left( A^*{y}^\star \right) _i | = 1 \}\) and the saturation gap be \(m:= \sup \left\{ | \left( A^*{y}^\star \right) _i |: | \left( A^*{y}^\star \right) _i | < 1 \right\} \). Then \(\Gamma \) is finite, and \(m < 1\). Moreover, for every \(x\in \mathcal {X}\), with \(\Gamma _C:= \mathbb {N}{\setminus } \Gamma \),

$$\begin{aligned} \begin{aligned} D^{-A^*{y}^\star }(x,{x}^\star )&\ge (1-m) \sum _{i\in \Gamma _C}^{} | x_i |. \end{aligned} \end{aligned}$$
(24)

Proof

Recall that \({x}^\star , {y}^\star \) is a primal-dual solution, hence \(-A^* {y}^\star \in \partial \left\| {x}^\star \right\| _1\). For every \(i \in \mathbb {N}\) we have that \(\left[ \partial \left\| \cdot \right\| _{ 1 }\right] _i({x}^\star ) \subseteq \left[ -1,1\right] \) and so \(| \left( A^*{y}^\star \right) _i |\le 1\). Recall that \(\Gamma _C:=\mathbb {N}{\setminus } \Gamma \). As \(A^*{y}^\star \) belongs to \(\mathcal {X}=\ell ^2(\mathbb {N}; \mathbb {R})\), we have

$$\begin{aligned} \sum _{i\in \mathbb {N}}^{} | \left( A^*{y}^\star \right) _i |^2<+\infty . \end{aligned}$$
(73)

Indeed, \(m\le 1\) by definition and, by Eq. (73), the coefficients \(| \left( A^*{y}^\star \right) _i |\) converge to 0; hence only finitely many of them can be equal to 1 (so that \(\Gamma \) is finite) and they cannot accumulate at 1 (so that \(m<1\)). We also have that

$$\begin{aligned} D^{-A^*{y}^\star }(x,{x}^\star )= & {} \sum _{i\in \mathbb {N}}\left[ | x_i |-| {x}^\star _i |+\left( A^*{y}^\star \right) _i\left( x_i-{x}^\star _i\right) \right] \\= & {} \sum _{i\in \mathbb {N}}\left[ | x_i |+\left( A^*{y}^\star \right) _i x_i\right] \\\ge & {} \sum _{i\in \Gamma }\left[ | x_i |-\underbrace{| \left( A^*{y}^\star \right) _i |}_{=1}| x_i |\right] +\sum _{i\in \Gamma _C}\left[ | x_i |-\underbrace{| \left( A^*{y}^\star \right) _i |}_{\le m} | x_i | \right] \\\ge & {} (1-m) \sum _{i\in \Gamma _C}^{} | x_i |. \end{aligned}$$

\(\square \)
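A small numerical check of (24) (illustrative only): the vector v below plays the role of \(-A^*{y}^\star \) and is simply chosen inside \(\partial \left\| {x}^\star \right\| _1\), that is, we assume that such a dual certificate is realized by some pair \((A, {y}^\star )\); the bound depends on \((A,{y}^\star )\) only through v.

```python
import numpy as np

# Check D^v(x, x*) >= (1 - m) * sum_{i not in Gamma} |x_i| for v in the
# subdifferential of ||.||_1 at x* (v stands for -A* y*, an assumed certificate).
rng = np.random.default_rng(0)
d, s = 30, 5
x_star = np.zeros(d)
supp = rng.choice(d, s, replace=False)
x_star[supp] = rng.standard_normal(s)

v = rng.uniform(-0.8, 0.8, size=d)             # strictly inside (-1, 1) off the support
v[supp] = np.sign(x_star[supp])                # equal to +-1 on the support
gamma = np.abs(v) >= 1.0 - 1e-12               # extended support Gamma
m = np.max(np.abs(v[~gamma]))                  # saturation gap

worst = np.inf
for _ in range(1000):
    x = rng.standard_normal(d) * rng.integers(0, 2, size=d)     # random test points
    breg = np.abs(x).sum() - np.abs(x_star).sum() - v @ (x - x_star)
    bound = (1.0 - m) * np.abs(x[~gamma]).sum()
    worst = min(worst, breg - bound)
print("min over trials of D^v(x, x*) - bound (expected >= 0):", worst)
```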

1.2 Tikhonov regularization: Lasso

For Tikhonov regularization, the results in terms of Bregman divergence and feasibility are the following.

Lemma 33

([40], Lemma 3.5) Let \(A{x}^\star ={b}^\star \), \(-A^*{y}^\star \in \partial \left\| \cdot \right\| _1({x}^\star )\) and, for \(\alpha >0\),

$$\begin{aligned} x_\alpha \in \mathop {\textrm{argmin}}\limits _{x\in \mathcal {X}} \left\{ \left\| Ax - b^\delta \right\| ^2 + \alpha \left\| x \right\| _1 \right\} . \end{aligned}$$
(74)

Then it holds that

$$\begin{aligned} \left\| Ax_{\alpha }-{b}^\star \right\| _{ } \le \delta +\alpha \left\| {y}^\star \right\| _{ } \quad \quad \text {and} \quad \quad D^{-A^*{y}^\star }(x_{\alpha },{x}^\star )\le \frac{\left( \delta +\alpha \left\| {y}^\star \right\| _{ }/2\right) ^2}{\alpha }. \end{aligned}$$

The previous bounds, combined with Assumption 17 and the last inequality in Lemma 18, lead naturally to the following corollary.

Corollary 34

([40], Theorem 5.6) Suppose Assumption 17 holds. Then, for \(x_\alpha \) defined as in Lemma 33 and \(C:=\alpha /\delta \),

$$\begin{aligned}&\left\| Ax_\alpha -{b}^\star \right\| _{ } \le \left( 1+CW_s\right) \delta \quad \quad \text {and} \\&\left\| x_{\alpha }-{x}^\star \right\| _{ } \le Q_s \left( 1+CW_s\right) \delta + \frac{1+Q_s\left\| A \right\| _{ }}{1-M_s} \frac{\left( 1+CW_s/2\right) ^2}{C} \delta . \end{aligned}$$
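
As an illustration of how \(x_\alpha \) in Eq. (74) can be computed in practice, the following is a minimal proximal-gradient (soft-thresholding) sketch on synthetic data; it is not the experimental code of the paper, and the dimensions, the noise level and the choice \(\alpha =\delta \) (i.e. \(C=1\)) are arbitrary.

```python
import numpy as np

# Approximate x_alpha of Eq. (74), a minimizer of ||A x - b_delta||^2 + alpha ||x||_1,
# by proximal gradient, then report the data-fit quantity appearing in Lemma 33.
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(A, b, alpha, iters=5000):
    step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - step * 2 * A.T @ (A @ x - b), step * alpha)
    return x

rng = np.random.default_rng(0)
m_, n_, delta = 40, 100, 1e-2
A = rng.standard_normal((m_, n_)) / np.sqrt(m_)
x_star = np.zeros(n_); x_star[:5] = 1.0            # sparse exact solution, A x_star = b_star
b_star = A @ x_star
e = rng.standard_normal(m_); e *= delta / np.linalg.norm(e)
b_delta = b_star + e                               # noisy data with ||b_delta - b_star|| = delta

x_alpha = lasso(A, b_delta, alpha=delta)
print("||A x_alpha - b_star|| =", np.linalg.norm(A @ x_alpha - b_star))
```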

Proofs of Sect. 7

1.1 Proof of Corollary 21

Corollary 21

Let Assumption 4 hold. Let \((x_k,y_k)\) be the sequence generated by Eq. (12) with data b under Assumption 7, Assumption 8 and summable error (\((\varepsilon _k)\in \ell ^1\)). Denote by \((\hat{x}_k,\hat{y}_k)\) the averaged iterates. Then, every weak cluster point of \((\hat{x}_k,\hat{y}_k)\) belongs to \(\mathcal {S}\). In particular, if \(\mathcal {S}=\emptyset \), then the primal-dual sequence \((\hat{x}_k,\hat{y}_k)\) diverges: \(\left\| (\hat{x}_k,\hat{y}_k) \right\| _{ }\rightarrow +\infty \).

Proof

From Lemma 31, for any \((x, y)\in \mathcal {X}\times \mathcal {Y}\) and for any \(k\in \mathbb {N}\), we have

$$\begin{aligned} \begin{aligned}&\frac{1-\tau _M\sigma _M\left\| A \right\| _{ }^2}{2\tau _M}\left\| x_{k} - x \right\| ^2 + \frac{1}{2\sigma }\left\| y_{k} - y \right\| _{ \Sigma }^2 + \sum _{j=1}^{k}\left[ \mathcal {L}(x_j,y) - \mathcal {L}(x, y_j)\right] \\&\quad +\frac{\omega }{2\tau _M}\sum _{j=1}^{k}\left\| x_j - x_{j - 1} \right\| ^2 \\&\quad \le V(z_0 - z)+\sum _{j=1}^{k}\varepsilon _{j}, \end{aligned} \end{aligned}$$
(75)

where \(\omega := 1 - \tau _M(L+\sigma _M\left\| A \right\| ^2) \ge 0\) by Assumption 8. Using Jensen’s inequality, we get

$$\begin{aligned} \begin{aligned}&\mathcal {L}(\hat{x}_k,y) - \mathcal {L}(x, \hat{y}_k) \le \frac{1}{k} \left[ V(z_0 - z)+\sum _{j=1}^{+\infty }\varepsilon _{j}\right] . \end{aligned} \end{aligned}$$
(76)
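
In more detail: since the Lagrangian is convex in the primal variable and affine in the dual one, Jensen's inequality applied to the averages \(\hat{x}_k\) and \(\hat{y}_k\) (the running means of the first k iterates) gives

$$\begin{aligned} \mathcal {L}(\hat{x}_k,y)\le \frac{1}{k}\sum _{j=1}^{k}\mathcal {L}(x_j,y) \qquad \text {and}\qquad \mathcal {L}(x,\hat{y}_k)= \frac{1}{k}\sum _{j=1}^{k}\mathcal {L}(x,y_j), \end{aligned}$$

and Eq. (76) follows from Eq. (75) by dropping the nonnegative terms on its left-hand side and bounding the finite sum of the errors by the full series.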

Let \(\left( x_{\infty },y_{\infty }\right) \) be a weak cluster point of \((\hat{x}_k,\hat{y}_k)\); namely, there exists a subsequence \((\hat{x}_{k_j},\hat{y}_{k_j})\subseteq (\hat{x}_k,\hat{y}_k)\) such that \((\hat{x}_{k_j},\hat{y}_{k_j})\rightharpoonup (x_{\infty },y_{\infty })\). By weak lower-semicontinuity of R and F, for every \((x, y)\in \mathcal {X}\times \mathcal {Y}\),

$$\begin{aligned} \begin{aligned} \mathcal {L}(x_{\infty },y) - \mathcal {L}(x, y_\infty )&\le \liminf _{j} \left[ \mathcal {L}(\hat{x}_{k_j},y) - \mathcal {L}(x, \hat{y}_{k_j})\right] \\&\le \liminf _{j} \frac{1}{k_j} \left[ V(z_0 - z)+\sum _{l=1}^{+\infty }\varepsilon _{l}\right] =0. \end{aligned} \end{aligned}$$
(77)

Thus \((x_{\infty },y_{\infty })\) is a saddle-point for the Lagrangian.

Now suppose that the set of saddle-points of \(\mathcal {L}\) is empty. Assume also, for contradiction, that \((\hat{x}_k,\hat{y}_k)\) does not diverge. Then it admits a bounded subsequence, which in turn admits a weakly convergent subsequence. By the first part of the proof, the weak limit is a saddle-point, which contradicts the assumption. \(\square \)

1.2 Proof of Lemma 22

Lemma 22

Let Assumption 4 hold. Assume that \(\tilde{\mathcal {C}}\ne \emptyset \). Let \(\left( x_k\right) \) be the primal sequence generated by algorithm (35); namely, with \(T=\tau {{\,\textrm{Id}\,}}\), \(\Sigma = \sigma {{\,\textrm{Id}\,}}\), \(\varepsilon _k=0\) for every \(k\in \mathbb {N}\) and \(y_0=y_{-1}+\sigma (Ax_0-b)\). Then, there exists a primal sequence \(\left( u_k\right) \) generated by the same procedure but applied to problem \(\tilde{\mathcal {P}}\) (as stated in (36)) such that \(x_k=u_k\) for every \(k\in \mathbb {N}\).

Proof

As \(\tilde{\mathcal {C}}\ne \emptyset \), there exists \(x^b\in \mathcal {X}\) such that \(A^*Ax^b=A^*b\). First consider the algorithm in (35). Note that, for every \(k\in \mathbb {N}\), \(\tilde{y}_k=y_k+\sigma \left( Ax_k-b\right) \); multiplying the last step by \(A^*\), we get, for every \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} x_{k+1}&={{\,\textrm{prox}\,}}^{}_{\tau R} (x_k- \tau \nabla F(x_k)-\tau A^*y_{k}-\sigma \tau A^*A(x_k-x^b))\\ A^*y_{k+1}&=A^*y_{k}+\sigma A^*A(x_{k+1}-x^b). \end{aligned} \end{aligned}$$

Recall that \(S:=(A^*A)^{\frac{1}{2}}\) and introduce \(p_k:=A^*y_k\). Then the primal sequence \(\left( x_k\right) \) is equivalently defined by the following recursion: given \(x_0\) and \(p_{0}=A^*y_{0}\), for every \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} x_{k+1}&={{\,\textrm{prox}\,}}^{}_{\tau R} (x_k- \tau \nabla F(x_k)-\tau p_{k}-\sigma \tau S^2(x_k-x^b))\\ p_{k+1}&=p_{k}+\sigma S^2(x_{k+1}-x^b). \end{aligned} \end{aligned}$$
(78)

As \(A^*y_{-1}\) belongs to \(R(A^*)\) and \(R(A^*)=R(S)\) ([30], Prop 2.18), there exists \(v_{-1}\) such that \(Sv_{-1}=A^*y_{-1}\). Now consider the primal-dual algorithm applied to problem (36) starting at \(u_0=x_0\), \(v_{-1}\) and \(v_0=v_{-1}+\sigma (Su_0-Sx^b)\). It reads as: for every \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} \tilde{v}_k&=2v_k-v_{k-1}\\ u_{k+1}&={{\,\textrm{prox}\,}}_{\tau R} (u_k-\tau \nabla F(u_k)-\tau S\tilde{v}_k)\\ v_{k+1}&=v_k+\sigma (Su_{k+1}-Sx^b). \end{aligned} \end{aligned}$$

Then, noticing that \(\tilde{v}_k=v_k+\sigma \left( Su_k-Sx^b\right) \) and multiplying the last step by S,

$$\begin{aligned} \begin{aligned} u_{k+1}&={{\,\textrm{prox}\,}}_{\tau R} (u_k-\tau \nabla F(u_k)-\tau Sv_{k}-\sigma \tau S^2(u_k-x^b))\\ Sv_{k+1}&=Sv_{k}+\sigma S^2(u_{k+1}-x^b) . \end{aligned} \end{aligned}$$

Define the change of variable \(q_k:=Sv_k\), so that \(q_{-1}=Sv_{-1}=A^*y_{-1}\) and

$$\begin{aligned} q_0=Sv_0=S\left( v_{-1}+\sigma (Su_0-Sx^b)\right) =A^*(y_{-1}+\sigma (Ax_0-b))=A^*y_0=p_0. \end{aligned}$$

Then the primal sequence \(\left( u_k\right) \) is equivalently defined by the following recursion: for every \(k\in \mathbb {N}\),

$$\begin{aligned} \begin{aligned} u_{k+1}&={{\,\textrm{prox}\,}}_{\tau R} (u_k-\tau \nabla F(u_k)-\tau q_{k}-\sigma \tau S^2(u_k-x^b))\\ q_{k+1}&=q_{k}+\sigma S^2(u_{k+1}-x^b) . \end{aligned} \end{aligned}$$
(79)

Comparing Eq. (78) with Eq. (79): the two recursions coincide and, since \((u_0,q_0)=(x_0,p_0)\), they start from the same point; by induction, \(x_k=u_k\) for every \(k\in \mathbb {N}\), which is the claim.

\(\square \)
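
The identity \(x_k=u_k\) can also be checked numerically. The following sketch (illustrative only, with arbitrary data and the specific choice \(R=\left\| \cdot \right\| _1\), \(F=0\)) runs algorithm (35) on \((A,b)\) and the same primal-dual iteration on \((S, Sx^b)\), computing \(S=(A^*A)^{1/2}\) by an eigendecomposition, and verifies that the primal iterates coincide.

```python
import numpy as np

# Numerical check of Lemma 22: the primal iterates of algorithm (35) on (A, b)
# coincide with those of the same primal-dual iteration applied to (S, S x^b).
rng = np.random.default_rng(0)
m_, n_ = 40, 20
A = rng.standard_normal((m_, n_))
b = rng.standard_normal(m_)

w, V = np.linalg.eigh(A.T @ A)                 # A^T A is symmetric positive semidefinite
S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T   # S = (A^* A)^{1/2}
xb = np.linalg.lstsq(A, b, rcond=None)[0]      # some x^b with A^* A x^b = A^* b

tau = sigma = 0.9 / np.linalg.norm(A, 2)       # so that tau * sigma * ||A||^2 < 1
def prox_l1(z, t):                             # prox of t ||.||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# algorithm (35) on (A, b): y_{-1} = 0 and y_0 = y_{-1} + sigma (A x_0 - b)
x = np.zeros(n_); y_old = np.zeros(m_); y = y_old + sigma * (A @ x - b)
# same algorithm on (S, S x^b): v_{-1} = 0, so that S v_{-1} = A^* y_{-1} = 0
u = np.zeros(n_); v_old = np.zeros(n_); v = v_old + sigma * S @ (u - xb)

for _ in range(100):
    x_new = prox_l1(x - tau * A.T @ (2 * y - y_old), tau)
    y_old, y = y, y + sigma * (A @ x_new - b)
    x = x_new

    u_new = prox_l1(u - tau * S @ (2 * v - v_old), tau)
    v_old, v = v, v + sigma * S @ (u_new - xb)
    u = u_new

    assert np.allclose(x, u)                   # primal iterates coincide, as in Lemma 22
print("max deviation:", np.max(np.abs(x - u)))
```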

1.3 Proof of Theorem 23

Theorem 23

Let Assumption 4 hold. Assume that \(\tilde{\mathcal {P}}\) (as stated in (32)) admits a saddle-point; namely, that there exists a pair \((\tilde{x},\tilde{v})\in \mathcal {X}\times \mathcal {X}\) such that

$$\begin{aligned} {\left\{ \begin{array}{ll} -A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x}) , \\ A^*A\tilde{x}=A^*b . \end{array}\right. } \end{aligned}$$
(37)

Let \((x_k,y_k)\) be the sequence generated by Eq. (35), namely with initialization \(y_0=y_{-1}+\sigma (Ax_0-b)\), and under Assumption 8. Denote by \((\hat{x}_k)\) the averaged primal iterates. Then there exists \(\tilde{x}_{\infty }\in \tilde{\mathcal {P}}\) such that \(\hat{x}_k\rightharpoonup \tilde{x}_{\infty }\). Moreover, if \(\mathcal {P}= \emptyset \), then \(\hat{y}_k\) diverges.

Proof

From Lemma 22, we know that the sequence \(\left( \hat{x}_k\right) \) generated by Eq. (35) coincides with the primal iterates of a sequence \(\left( \hat{u}_k, \hat{v}_k\right) \) generated by the same algorithm on problem (36). Notice that \(\left\| S \right\| = \Vert (A^*A)^\frac{1}{2} \Vert = \left\| A \right\| \) and so, if Assumption 8 holds, the analogue also holds for problem (36): namely, \(1 - \tau (L + \sigma \left\| S \right\| ^2) \ge 0\). The same is true for Assumption 5. Indeed, defining \(\bar{v}=S\tilde{v}\), we have \(-S\bar{v}=-A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x})\). Moreover, we have already seen that \(A^*Ax=A^*b\) if and only if \(Sx=Sx^b\), where \(x^b\) is any vector in \(\mathcal {X}\) such that \(A^*Ax^b=A^*b\). Then \(S\tilde{x}=Sx^b\) and \((\tilde{x},\bar{v})\) is a saddle-point for (36). So, by Proposition 10, we know that the averaged primal-dual sequence \((\hat{u}_k,\hat{v}_k)\) weakly converges to a saddle-point for (36). In particular, there exists \(\tilde{x}_{\infty }\in \tilde{\mathcal {P}}\) such that \(\hat{u}_k\rightharpoonup \tilde{x}_{\infty }\), and so the same holds for \((\hat{x}_k)\). For the second claim, by assumption we have that \(\mathcal {P}=\emptyset \), which implies that \(\mathcal {S}=\emptyset \). All the assumptions of Corollary 21 are verified, so \((\hat{x}_k, \hat{y}_k)\) diverges. As \(\left( \hat{x}_k\right) \) is weakly convergent, and so bounded, we conclude that \(\left( \hat{y}_k\right) \) must diverge. \(\square \)

1.4 Proof of Theorem 24

Theorem 24

Let Assumption 4 hold and suppose that there exists a pair \((\tilde{x},\tilde{v})\in \mathcal {X}\times \mathcal {X}\) such that

$$\begin{aligned} {\left\{ \begin{array}{ll} -A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x}) , \\ A^*A\tilde{x}=A^*b^{\star } \end{array}\right. } \end{aligned}$$
(38)

(namely, a saddle-point for the normal exact problem \(\tilde{\mathcal {P}}^{\star }\)). Let \(b^{\delta }\in \mathcal {Y}\) be noisy data such that \(\left\| b^{\delta }-b^{\star } \right\| _{ }\le \delta \) for some \(\delta \ge 0\). Moreover, suppose that \(\tilde{\mathcal {C}}^\delta \ne \emptyset \); namely, that there exists \(x^{\delta }\in \mathcal {X}\) such that \(A^*Ax^{\delta }=A^*b^{\delta }\). Let Assumption 8 and Assumption 9 hold, and let \((x_k,y_k)\) be the sequence generated by the algorithm in Eq. (35) on the noisy data \(b^{\delta }\); namely, with the initialization \(y_0=y_{-1}+\sigma (Ax_0-b^{\delta })\),

$$\begin{aligned} {\left\{ \begin{array}{ll} \tilde{y}_{k} = 2 y_k - y_{k - 1} , \\ x_{k+1} = {{\,\textrm{prox}\,}}^{}_{\tau R}(x_k - \tau \nabla F(x_k) - \tau A^*{\tilde{y}}_{k}) ,\\ y_{k + 1} = y_k + \sigma \left( Ax_{k + 1} - b^{\delta }\right) . \end{array}\right. } \end{aligned}$$

Denote by \((\hat{x}_k)\) the averaged primal iterates. Then,

$$\begin{aligned} D^{-A^*A\tilde{v}}(\hat{x}_k,\tilde{x}) \le \frac{C_1}{k} +C_2 \delta + C_4 \delta ^2 k \end{aligned}$$

and

$$\begin{aligned} \left\| A^*A\hat{x}_k-A^*b^{\star } \right\| _{ }^2\le \left\| S \right\| _{ }^2\left[ \frac{C_5}{k} + C_6 \delta + C_8 \delta ^2 k + C_9 \delta ^2 \right] , \end{aligned}$$

where the constants involved in the bounds are specified in the proof.

Proof

From the assumption \(\tilde{\mathcal {C}}^{\delta }\ne \emptyset \) and Lemma 22, we know that the sequence \(\left( \hat{x}_k\right) \) coincides with the primal iterates of a sequence \(\left( \hat{u}_k, \hat{v}_k\right) \) generated by the same algorithm on problem

$$\begin{aligned} \tilde{\mathcal {P}}^{\delta }=\mathop {\textrm{argmin}}\limits _{x\in \mathcal {X}} \left\{ R(x)+F(x): \ \ Sx=Sx^{\delta } \right\} , \end{aligned}$$
(80)

where \(x^{\delta }\) is any vector in \(\mathcal {X}\) such that \(A^*Ax^{\delta }=A^*b^{\delta }\). As in the proof of the previous theorem, notice that \(\left\| S \right\| = \left\| A \right\| \) and so, as Assumption 8 and Assumption 9 hold by hypothesis, the analogue also holds for problem (80): namely, \(1 - \tau (L + \sigma \left\| S \right\| ^2) \ge 0\), \(\xi - \tau (\xi L+ \sigma \left\| S \right\| ^2)\ge 0\) and \(\sigma (\eta - 1) - \sigma \xi \eta >0\). The same is true for Assumption 5. Indeed, define \(\bar{v}=S\tilde{v}\). Then, from Eq. (38), \(-S\bar{v}=-A^*A\tilde{v}\in \partial R(\tilde{x})+\nabla F(\tilde{x})\) and \((\tilde{x},\bar{v})\) is a saddle-point for

$$\begin{aligned} \tilde{\mathcal {P}}^{\star }=\mathop {\textrm{argmin}}\limits _{x\in \mathcal {X}} \left\{ R(x)+F(x): \ \ Sx=S\tilde{x} \right\} . \end{aligned}$$
(81)

In particular, we can apply Theorem 13 to \(\left( \hat{u}_k, \hat{v}_k\right) \), the averaged primal-dual sequence generated on the noisy problem in (80), with respect to \((\tilde{x},\bar{v})\), a saddle-point for the exact problem in (81), to get that

$$\begin{aligned} D^{-S\bar{v}}(\hat{u}_k,\tilde{x}) \le \frac{C_1}{k} +C_2 \tilde{\delta } + C_4 (\tilde{\delta })^2 k \end{aligned}$$

and

$$\begin{aligned} \left\| S\hat{u}_k-S\tilde{x} \right\| _{ }^2\le \frac{C_5}{k} + C_6 \tilde{\delta }+ C_8 (\tilde{\delta })^2 k + C_9 (\tilde{\delta })^2. \end{aligned}$$

The constants in the previous bounds are the same as in (70) with \(z_0=(u_0,v_0)\), \(z^{\star }=(\tilde{x},\bar{v})\), \(C_3=C_7=0\) (because \(C_0=0\) as we suppose \(\varepsilon _k=0\) for every \(k\in \mathbb {N}\)), \(\sigma _m=\sigma _M=\sigma \) and

$$\begin{aligned} \tilde{\delta }:=\Vert Sx^{\delta }-S\tilde{x}\Vert . \end{aligned}$$

From Lemma 22, we recall also that \(u_0=x_0\) and \(v_0=v_{-1}+\sigma (Su_0-Sx^{\delta })\), where \(v_{-1}\) is any element in \(\mathcal {X}\) such that \(Sv_{-1}=A^*y_{-1}\) (\(v_{-1}\) exists due to \(R(A^*)=R(S)\)). Now it remains to show that \(\tilde{\delta }\le \delta \). Denote by \((\mu _i, f_i,g_i )_{i\in \mathbb {N}} \subseteq \mathbb {R}_+ \times \mathcal {X}\times \mathcal {Y}\) the singular value decomposition of the operator A. First, notice that \(S^2(x^{\delta }-\tilde{x})=A^*(b^{\delta }-b^{\star })\) and so that, for every \(i\in \mathbb {N}\),

$$\begin{aligned} \mu _i^2 \langle x^{\delta }-\tilde{x},f_i\rangle =\mu _i \langle b^{\delta }-b^{\star }, g_i \rangle . \end{aligned}$$

Then, for every \(i\in \mathbb {N}\) such that \(\mu _i\ne 0\), \(\mu _i \langle x^{\delta }-\tilde{x},f_i\rangle =\langle b^{\delta }-b^{\star }, g_i \rangle \) and so

$$\begin{aligned} \tilde{\delta }^2&= \Vert Sx^{\delta }-S\tilde{x}\Vert ^2 =\sum _{i\in \mathbb {N}} \left( \mu _i \langle x^{\delta }-\tilde{x},f_i\rangle \right) ^2 =\sum _{\mu _i\ne 0} \left( \mu _i \langle x^{\delta }-\tilde{x},f_i\rangle \right) ^2 \\&= \sum _{\mu _i\ne 0} \left( \langle b^{\delta }-b^{\star }, g_i \rangle \right) ^2 \le \sum _{i\in \mathbb {N}} \left( \langle b^{\delta }-b^{\star }, g_i \rangle \right) ^2 = \Vert b^{\delta }-b^{\star }\Vert ^2 \le \delta ^2. \end{aligned}$$

We conclude the claim simply by noticing that

$$\begin{aligned} D^{-A^*A\tilde{v}}(\hat{x}_k,\tilde{x})=D^{-S\bar{v}}(\hat{u}_k,\tilde{x}) \end{aligned}$$

and

$$\begin{aligned} \left\| A^*A\hat{x}_k-A^*b^{\star } \right\| _{ }=\left\| S^2\hat{u}_k-S^2 \tilde{x} \right\| _{ }\le \left\| S \right\| _{ }\left\| S\hat{u}_k-S\tilde{x} \right\| _{ }. \end{aligned}$$

\(\square \)

A dual view on the implicit bias of gradient descent on least squares

Here we provide an interesting view on why the “implicit” bias of gradient descent on least squares is not so implicit. Recall that these iterations,

$$\begin{aligned} x_{k+1} = x_k - \gamma A^* (A x_k - b) , \end{aligned}$$
(82)

converge, for \(\gamma < 2 / \left\| A \right\| ^2_{\textrm{op}}\), to the minimal Euclidean norm solution of \(Ax = b\):

$$\begin{aligned} \min _{x \in \mathcal {X}} \frac{1}{2} \left\| x \right\| ^2 \quad \text {s.t.} \quad Ax = b , \end{aligned}$$
(83)

provided that Problem (83) is feasible and \(x_0=0\).

It turns out that the iterations (82) correspond, up to multiplication by \(-A^*\), to the iterates of gradient descent applied to the dual of (83), namely:

$$\begin{aligned} \min _{y \in \mathcal {Y}} \frac{1}{2} \left\| A^* y \right\| ^2 + \langle b, y \rangle , \quad \text {and} \quad y_{k+1} = y_k - \gamma (A A^* y_k + b) . \end{aligned}$$
(84)
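
For completeness, (84) is obtained from the Lagrangian of (83): for every \(y\in \mathcal {Y}\),

$$\begin{aligned} \min _{x \in \mathcal {X}} \left\{ \frac{1}{2} \left\| x \right\| ^2 + \langle y, Ax - b \rangle \right\} = -\frac{1}{2} \left\| A^* y \right\| ^2 - \langle b, y \rangle , \end{aligned}$$

the minimum being attained at \(x=-A^*y\); maximizing the right-hand side over y and changing sign yields the minimization problem in (84).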

By setting \(x_{k+1} = -A^* y_{k+1}\) (consistently with \(x_0=0\) when \(y_0=0\)), one recovers the iterates of gradient descent on least squares (82). Therefore the “implicit bias” of gradient descent on least squares is not so implicit: its iterates \(x_k\) are, up to multiplication by \(-A^*\), the iterates \(y_k\) of gradient descent on Problem (84), which is itself the dual of Problem (83), in which the bias appears explicitly.
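
A minimal numerical sketch of this correspondence (synthetic data, arbitrary sizes; not the paper's experimental code) is the following:

```python
import numpy as np

# Gradient descent (82) on least squares started at 0, and gradient descent (84)
# on the dual: at every step x_k = -A^* y_k, and the primal iterates converge to
# the minimal Euclidean norm solution of A x = b.
rng = np.random.default_rng(0)
m_, n_ = 20, 50                                # underdetermined, so A x = b is feasible
A = rng.standard_normal((m_, n_))
b = rng.standard_normal(m_)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2        # satisfies gamma < 2 / ||A||_op^2

x = np.zeros(n_)                               # primal iterates of (82), x_0 = 0
y = np.zeros(m_)                               # dual iterates of (84),   y_0 = 0
for _ in range(5000):
    x = x - gamma * A.T @ (A @ x - b)
    y = y - gamma * (A @ A.T @ y + b)
    assert np.allclose(x, -A.T @ y)            # the two iterations coincide up to -A^*

x_min_norm = np.linalg.pinv(A) @ b             # minimal norm solution of (83)
print("distance to minimal norm solution:", np.linalg.norm(x - x_min_norm))
```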
