Statistical properties of BayesCG under the Krylov prior

Abstract

We analyse the calibration of BayesCG under the Krylov prior. BayesCG is a probabilistic numerical extension of the Conjugate Gradient (CG) method for solving systems of linear equations with a real symmetric positive definite coefficient matrix. In addition to the CG solution, BayesCG returns a posterior distribution over the solution. In this context, a posterior distribution is said to be ‘calibrated’ if the CG error is well described, in a precise distributional sense, by the posterior spread. Since it is known that BayesCG is not calibrated, we introduce two related weaker notions of calibration, whose departures from exact calibration can be quantified. Numerical experiments confirm that, under low-rank approximate Krylov posteriors, BayesCG is only slightly optimistic, exhibits the characteristics of a calibrated solver, and is computationally competitive with CG.

References

  1. Cockayne, J., Oates, C.J., Ipsen, I.C.F., Girolami, M.: A Bayesian conjugate gradient method (with discussion). Bayesian Anal. 14(3), 937–1012 (2019). https://doi.org/10.1214/19-BA1145. Includes 6 discussions and a rejoinder from the authors

  2. Reid, T.W., Ipsen, I.C.F., Cockayne, J., Oates, C.J.: BayesCG as an uncertainty aware version of CG. arXiv:2008.03225 (2022)

  3. Cockayne, J., Oates, C.J., Sullivan, T.J., Girolami, M.: Bayesian probabilistic numerical methods. SIAM Rev. 61(4), 756–789 (2019). https://doi.org/10.1137/17M1139357

  4. Hennig, P., Osborne, M.A., Girolami, M.: Probabilistic numerics and uncertainty in computations. Proc. R. Soc. A. 471(2179), 20150142–17 (2015)

  5. Oates, C.J., Sullivan, T.J.: A modern retrospective on probabilistic numerics. Stat. Comput. 29(6), 1335–1351 (2019). https://doi.org/10.1007/s11222-019-09902-z

  6. Cockayne, J., Ipsen, I.C.F., Oates, C.J., Reid, T.W.: Probabilistic iterative methods for linear systems. J. Mach. Learn. Res. 22(232), 1–34 (2021)

  7. Hart, J., van Bloemen Waanders, B., Herzog, R.: Hyperdifferential sensitivity analysis of uncertain parameters in PDE-constrained optimization. Int. J. Uncertain. Quantif. 10(3), 225–248 (2020). https://doi.org/10.1615/Int.J.UncertaintyQuantification.2020032480

  8. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering, p. 664. Springer, New York (2006)

  9. Petra, N., Zhu, H., Stadler, G., Hughes, T.J.R., Ghattas, O.: An inexact Gauss-Newton method for inversion of basal sliding and rheology parameters in a nonlinear Stokes ice sheet model. J. Glaciol. 58(211), 889–903 (2012). https://doi.org/10.3189/2012JoG11J182

  10. Saibaba, A.K., Hart, J., van Bloemen Waanders, B.: Randomized algorithms for generalized singular value decomposition with application to sensitivity analysis. Numer. Linear Algebra Appl. 28(4), 2364–27 (2021). https://doi.org/10.1002/nla.2364

  11. Bartels, S., Cockayne, J., Ipsen, I.C.F., Hennig, P.: Probabilistic linear solvers: a unifying view. Stat. Comput. 29(6), 1249–1263 (2019). https://doi.org/10.1007/s11222-019-09897-7

  12. Fanaskov, V.: Uncertainty calibration for probabilistic projection methods. Stat. Comput. 31(5), 56–17 (2021). https://doi.org/10.1007/s11222-021-10031-9

  13. Hennig, P.: Probabilistic interpretation of linear solvers. SIAM J. Optim. 25(1), 234–260 (2015). https://doi.org/10.1137/140955501

  14. Wenger, J., Hennig, P.: Probabilistic linear solvers for machine learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6731–6742. Curran Associates Inc, Red Hook (2020)

  15. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409–436 (1952). https://doi.org/10.6028/jres.049.044

  16. Higham, N.J.: Functions of Matrices. Theory and Computation. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2008). https://doi.org/10.1137/1.9780898717778

  17. Cockayne, J., Oates, C.J., Ipsen, I.C.F., Girolami, M.: Supplementary material for ‘A Bayesian conjugate-gradient method’. Bayesian Anal. (2019). https://doi.org/10.1214/19-BA1145SUPP

  18. Liesen, J., Strakos, Z.: Krylov Subspace Methods: Principles and Analysis. Oxford University Press, Oxford (2013)

  19. Berljafa, M., Güttel, S.: Generalized rational Krylov decompositions with an application to rational approximation. SIAM J. Matrix Anal. Appl. 36(2), 894–916 (2015). https://doi.org/10.1137/140998081

  20. Strakoš, Z., Tichý, P.: On error estimation in the conjugate gradient method and why it works in finite precision computations. Electron. Trans. Numer. Anal. 13, 56–80 (2002)

  21. Kressner, D., Latz, J., Massei, S., Ullmann, E.: Certified and fast computations with shallow covariance kernels. arXiv:2001.09187 (2020)

  22. Villani, C.: Optimal Transport, Old and New. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 338, p. 973. Springer, Berlin (2009). https://doi.org/10.1007/978-3-540-71050-9

  23. Gelbrich, M.: On a formula for the \(L^2\) Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachr. 147, 185–203 (1990). https://doi.org/10.1002/mana.19901470121

  24. Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer, New York (1985). https://doi.org/10.1007/978-1-4757-4286-2

  25. Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer 19, 451–559 (2010). https://doi.org/10.1017/S0962492910000061

  26. Cockayne, J., Graham, M.M., Oates, C.J., Sullivan, T.J.: Testing whether a Learning Procedure is Calibrated. arXiv:2012.12670 (2021)

  27. Ross, S.M.: Introduction to Probability Models, 9th edn. Academic Press Inc, Boston (2007)

  28. Kaltenbach, H.-M.: A Concise Guide to Statistics. Springer Briefs in Statistics, p. 111. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-23502-3

  29. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, vol. 112, 2nd edn. Springer, New York (2021). https://doi.org/10.1007/978-1-0716-1418-1

  30. BCSSTK14: BCS Structural Engineering Matrices (linear equations) Roof of the Omni Coliseum, Atlanta. https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc2/bcsstk14.html

  31. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. The Johns Hopkins University Press, Baltimore (2013)

  32. Hairer, E., Nørsett, S.P., Wanner, G.: Solving Ordinary Differential Equations. I, 2nd edn. Springer Series in Computational Mathematics, vol. 8. Springer, Berlin (1993)

  33. Golub, G.H., Meurant, G.: Matrices, moments and quadrature. II. How to compute the norm of the error in iterative methods. BIT 37(3), 687–705 (1997). https://doi.org/10.1007/BF02510247

  34. Meurant, G., Tichý, P.: Approximating the extreme Ritz values and upper bounds for the \(A\)-norm of the error in CG. Numer. Algorithms 82(3), 937–968 (2019). https://doi.org/10.1007/s11075-018-0634-8

  35. Strakoš, Z.: On the real convergence rate of the conjugate gradient method. Linear Algebra Appl. 154(156), 535–549 (1991). https://doi.org/10.1016/0024-3795(91)90393-B

  36. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)

  37. Muirhead, R.J.: Aspects of Multivariate Statistical Theory. Wiley, New York (1982)

  38. Ouellette, D.V.: Schur complements and statistics. Linear Algebra Appl. 36, 187–295 (1981). https://doi.org/10.1016/0024-3795(81)90232-9

  39. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)

  40. Campbell, S.L., Meyer, C.D.: Generalized Inverses of Linear Transformations. Classics in Applied Mathematics, vol. 56. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2009). https://doi.org/10.1137/1.9780898719048.ch0

  41. Mathai, A.M., Provost, S.B.: Quadratic Forms in Random Variables: Theory and Applications. Marcel Dekker Inc, New York (1992)

  42. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2003)

  43. Giraud, L., Langou, J., Rozložník, M.: The loss of orthogonality in the Gram–Schmidt orthogonalization process. Comput. Math. Appl. 50(7), 1069–1075 (2005). https://doi.org/10.1016/j.camwa.2005.08.009

  44. Giraud, L., Langou, J., Rozložník, M., van den Eshof, J.: Rounding error analysis of the classical Gram–Schmidt orthogonalization process. Numer. Math. 101(1), 87–100 (2005). https://doi.org/10.1007/s00211-005-0615-4

Acknowledgements

We thank the reviewers for their helpful comments and questions, which led to the addition of Sect. 6. We are also deeply indebted to Jonathan Wenger for his very careful reading of our paper and many perceptive suggestions that greatly improved the paper and widened its outlook through the addition of Sects. 5.5 and 5.6.

Funding

The work was supported in part by NSF Grant DMS-1745654 (TWR, ICFI), NSF Grant DMS-1760374 and DOE Grant DE-SC0022085 (ICFI), and the Lloyd’s Register Foundation Programme on Data Centric Engineering at the Alan Turing Institute (CJO).

Corresponding author

Correspondence to Ilse C. F. Ipsen.

Appendices

Appendix A: Auxiliary results

We present auxiliary results required for proofs in other sections.

The stability of Gaussian distributions implies that an affine transformation of a Gaussian random variable remains Gaussian.

Lemma A.1

(Stability of Gaussian Distributions [37, Section 1.2]) Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}},\varvec{\Sigma })\) be a Gaussian random variable with mean \({\textbf{x}}\in {\mathbb {R}}^n\) and covariance \(\varvec{\Sigma }\in {\mathbb {R}}^{n\times n}\). If \({\textbf{y}}\in {\mathbb {R}}^n\) and \({\textbf{F}}\in {\mathbb {R}}^{n\times n}\), then

$$\begin{aligned} Z= {\textbf{y}}+{\textbf{F}}X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{y}}+{\textbf{F}}{\textbf{x}}, {\textbf{F}}\varvec{\Sigma }{\textbf{F}}^T). \end{aligned}$$
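As an informal illustration (not part of the original text), Lemma A.1 can be checked with a small Monte Carlo experiment in NumPy; all matrices and dimensions below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n = 4

x = rng.standard_normal(n)                        # mean of X
L = rng.standard_normal((n, n)); Sigma = L @ L.T  # covariance of X
y = rng.standard_normal(n)                        # shift
F = rng.standard_normal((n, n))                   # linear transformation

X = rng.multivariate_normal(x, Sigma, size=200_000)
Z = y + X @ F.T                                   # Z = y + F X for each sample

# Both deviations should be small (pure Monte Carlo error)
print(np.abs(Z.mean(axis=0) - (y + F @ x)).max())
print(np.abs(np.cov(Z, rowvar=False) - F @ Sigma @ F.T).max())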

The conjugacy of Gaussian distributions implies that the distribution of a Gaussian random variable conditioned on information that linearly depends on the random variable is a Gaussian distribution.

Lemma A.2

(Conjugacy of Gaussian Distributions [38, Section 6.1], [25, Corollary 6.21]) Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}},\varvec{\Sigma }_x)\) and \(Y\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{y}},\varvec{\Sigma }_y)\). The jointly Gaussian random variable \(\begin{bmatrix}X^T&Y^T\end{bmatrix}^T\) has the distribution

$$\begin{aligned} \begin{bmatrix} X\\ Y\end{bmatrix} \sim {{\,\mathrm{{\mathcal {N}}}\,}}\left( \begin{bmatrix} {\textbf{x}}\\ {\textbf{y}}\end{bmatrix},\begin{bmatrix}\varvec{\Sigma }_x &{} \varvec{\Sigma }_{xy} \\ \varvec{\Sigma }_{xy}^T &{} \varvec{\Sigma }_y\end{bmatrix} \right) , \end{aligned}$$

where \(\varvec{\Sigma }_{xy} \equiv {{\,\textrm{Cov}\,}}(X,Y) = {{\,\mathrm{{\mathbb {E}}}\,}}[(X-{\textbf{x}})(Y-{\textbf{y}})^T]\) and the conditional distribution of \(X\) given \(Y\) is

$$\begin{aligned} (X\mid Y) \sim {{\,\mathrm{{\mathcal {N}}}\,}}(\overbrace{{\textbf{x}}+ \varvec{\Sigma }_{xy}\varvec{\Sigma }_y^\dagger (Y- {\textbf{y}})}^{\text {mean}}, \ \overbrace{\varvec{\Sigma }_x - \varvec{\Sigma }_{xy}\varvec{\Sigma }_y^\dagger \varvec{\Sigma }_{xy}^T}^{\text {covariance}}). \end{aligned}$$
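A minimal NumPy sketch of the conditioning step in Lemma A.2; the block matrices below are illustrative stand-ins, not quantities from the paper.

import numpy as np

rng = np.random.default_rng(1)
n = 3

# Illustrative joint covariance [[Sigma_x, Sigma_xy], [Sigma_xy^T, Sigma_y]]
L = rng.standard_normal((2 * n, 2 * n))
joint = L @ L.T
Sigma_x, Sigma_xy, Sigma_y = joint[:n, :n], joint[:n, n:], joint[n:, n:]
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def condition_on(Y_obs):
    """Mean and covariance of (X | Y = Y_obs), with a pseudoinverse for Sigma_y."""
    Sy_pinv = np.linalg.pinv(Sigma_y)
    mean = x + Sigma_xy @ Sy_pinv @ (Y_obs - y)
    cov = Sigma_x - Sigma_xy @ Sy_pinv @ Sigma_xy.T
    return mean, cov

cond_mean, cond_cov = condition_on(y + 0.1 * rng.standard_normal(n))

The BayesCG posteriors in Theorem 2.1 arise from this type of conditioning on information that depends linearly on the solution.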

We show how to transform a \({\textbf{B}}\)-orthogonal matrix into an orthogonal matrix.

Lemma A.3

Let \({\textbf{B}}\in {\mathbb {R}}^{n\times n}\) be symmetric positive definite, and let \({\textbf{H}}\in {\mathbb {R}}^{n\times n}\) be a \({\textbf{B}}\)-orthogonal matrix with \({\textbf{H}}^T{\textbf{B}}{\textbf{H}}= {\textbf{H}}{\textbf{B}}{\textbf{H}}^T = {\textbf{I}}\). Then

$$\begin{aligned} {\textbf{U}}\equiv {\textbf{B}}^{1/2}{\textbf{H}}\end{aligned}$$

is an orthogonal matrix with \({\textbf{U}}^T{\textbf{U}}= {\textbf{U}}{\textbf{U}}^T={\textbf{I}}\).

Proof

The symmetry of \({\textbf{B}}\) and the \({\textbf{B}}\)-orthogonality of \({\textbf{H}}\) imply

$$\begin{aligned} {\textbf{U}}^T{\textbf{U}}= {\textbf{H}}^T{\textbf{B}}{\textbf{H}}= {\textbf{I}}. \end{aligned}$$

From the orthonormality of the columns of \({\textbf{U}}\), and the fact that \({\textbf{U}}\) is square, it follows that \({\textbf{U}}\) is an orthogonal matrix [39, Definition 2.1.3]. \(\square \)
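A small NumPy check of Lemma A.3 (the matrices are arbitrary; \({\textbf{B}}^{1/2}\) denotes the symmetric square root, computed here by an eigendecomposition, and a \({\textbf{B}}\)-orthogonal \({\textbf{H}}\) is built as \({\textbf{B}}^{-1/2}\) times an orthogonal matrix):

import numpy as np

rng = np.random.default_rng(2)
n = 5

C = rng.standard_normal((n, n))
B = C @ C.T + n * np.eye(n)                   # symmetric positive definite
w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T        # symmetric square root B^{1/2}
B_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = B_half_inv @ Q                            # one way to obtain a B-orthogonal H
print(np.allclose(H.T @ B @ H, np.eye(n)))    # B-orthogonality

U = B_half @ H                                # Lemma A.3
print(np.allclose(U.T @ U, np.eye(n)), np.allclose(U @ U.T, np.eye(n)))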

Definition A.4

[39, Section 7.3] The thin singular value decomposition of the rank-p matrix \({\textbf{G}}\in {\mathbb {R}}^{m\times n}\) is

$$\begin{aligned} {\textbf{G}}= {\textbf{U}}{\textbf{D}}{\textbf{W}}^T, \end{aligned}$$

where \({\textbf{U}}\in {\mathbb {R}}^{m\times p}\) and \({\textbf{W}}\in {\mathbb {R}}^{n\times p}\) are matrices with orthonormal columns and \({\textbf{D}}\in {\mathbb {R}}^{p\times p}\) is a diagonal matrix with positive diagonal elements. The Moore–Penrose inverse of \({\textbf{G}}\) is

$$\begin{aligned} {\textbf{G}}^\dagger = {\textbf{W}}{\textbf{D}}^{-1}{\textbf{U}}^T. \end{aligned}$$

If a matrix has full column-rank or full row-rank, then its Moore–Penrose inverse can be expressed in terms of the matrix itself. Furthermore, the Moore–Penrose inverse of a product is equal to the product of the Moore–Penrose inverses, provided the first matrix has full column-rank and the second matrix has full row-rank.

Lemma A.5

[40, Corollary 1.4.2] Let \({\textbf{G}}\in {\mathbb {R}}^{m\times n}\) and \({\textbf{J}}\in {\mathbb {R}}^{n\times p}\) have full column and row rank respectively, so \({{\,\textrm{rank}\,}}({\textbf{G}})={{\,\textrm{rank}\,}}({\textbf{J}})=n\). The Moore–Penrose inverses of \({\textbf{G}}\) and \({\textbf{J}}\) are

$$\begin{aligned} {\textbf{G}}^\dagger = ({\textbf{G}}^T{\textbf{G}})^{-1}{\textbf{G}}^T \quad \text {and} \quad {\textbf{J}}^\dagger = {\textbf{J}}^T({\textbf{J}}{\textbf{J}}^T)^{-1} \end{aligned}$$

respectively, and the Moore–Penrose inverse of the product equals

$$\begin{aligned} ({\textbf{G}}{\textbf{J}})^\dagger = {\textbf{J}}^\dagger {\textbf{G}}^\dagger . \end{aligned}$$
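The formulas in Lemma A.5 are easy to check numerically; the dimensions below are arbitrary, and with probability one the random \({\textbf{G}}\) and \({\textbf{J}}\) have full column and row rank, respectively.

import numpy as np

rng = np.random.default_rng(3)
m, n, p = 6, 4, 7

G = rng.standard_normal((m, n))      # full column rank (almost surely)
J = rng.standard_normal((n, p))      # full row rank (almost surely)

G_pinv = np.linalg.inv(G.T @ G) @ G.T        # (G^T G)^{-1} G^T
J_pinv = J.T @ np.linalg.inv(J @ J.T)        # J^T (J J^T)^{-1}

print(np.allclose(G_pinv, np.linalg.pinv(G)))
print(np.allclose(J_pinv, np.linalg.pinv(J)))
print(np.allclose(np.linalg.pinv(G @ J), J_pinv @ G_pinv))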

Below is an explicit expression for the mean of a quadratic form of Gaussians.

Lemma A.6

[41, Sections 3.2b.1–3.2b.3] Let \(Z\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_z,\varvec{\Sigma }_z)\) be a Gaussian random variable in \({\mathbb {R}}^n\), and \({\textbf{B}}\in {\mathbb {R}}^{n\times n}\) be symmetric positive definite. The mean of \(Z^T{\textbf{B}}Z\) is

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}[Z^T{\textbf{B}}Z]&= {{\,\textrm{trace}\,}}({\textbf{B}}\varvec{\Sigma }_z) + {\textbf{x}}_z^T{\textbf{B}}{\textbf{x}}_z. \end{aligned}$$
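A Monte Carlo check of Lemma A.6 (arbitrary test matrices; the discrepancy below is pure sampling error):

import numpy as np

rng = np.random.default_rng(4)
n = 4

C = rng.standard_normal((n, n)); B = C @ C.T + np.eye(n)   # symmetric positive definite
L = rng.standard_normal((n, n)); Sigma_z = L @ L.T
x_z = rng.standard_normal(n)

Z = rng.multivariate_normal(x_z, Sigma_z, size=500_000)
empirical = np.einsum('ij,jk,ik->i', Z, B, Z).mean()        # mean of Z^T B Z over samples
exact = np.trace(B @ Sigma_z) + x_z @ B @ x_z
print(empirical, exact)                                     # agree up to Monte Carlo error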

We show that the squared Euclidean norm of a Gaussian random variable with an orthogonal projector as its covariance matrix is distributed according to a chi-squared distribution.

Lemma A.7

Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{P}})\) be a Gaussian random variable in \({\mathbb {R}}^n\). If the covariance matrix \({\textbf{P}}\) is an orthogonal projector, that is, if \({\textbf{P}}^2 = {\textbf{P}}\) and \({\textbf{P}}= {\textbf{P}}^T\), then

$$\begin{aligned} \Vert X\Vert _2^2= (X^TX) \sim \chi _p^2, \end{aligned}$$

where \(p = {{\,\textrm{rank}\,}}({\textbf{P}})\).

Proof

We express the projector in terms of orthonormal matrices and then use the invariance of the 2-norm under orthogonal matrices and the stability of Gaussians.

Since \({\textbf{P}}\) is an orthogonal projector, there exists \({\textbf{U}}_1\in {\mathbb {R}}^{n\times p}\) such that \({\textbf{U}}_1{\textbf{U}}_1^T = {\textbf{P}}\) and \({\textbf{U}}_1^T{\textbf{U}}_1 = {\textbf{I}}_p\). Choose \({\textbf{U}}_2\in {\mathbb {R}}^{n\times (n-p)}\) so that \({\textbf{U}}= \begin{bmatrix} {\textbf{U}}_1&{\textbf{U}}_2 \end{bmatrix}\) is an orthogonal matrix. Thus,

$$\begin{aligned} X^TX= X^T{\textbf{U}}{\textbf{U}}^TX= X^T{\textbf{U}}_1{\textbf{U}}_1^TX+ X^T{\textbf{U}}_2{\textbf{U}}_2^TX. \end{aligned}$$
(A1)

Lemma A.1 implies that \(Y= {\textbf{U}}_1^TX\) is distributed according to a Gaussian distribution with mean \({\textbf{0}}\) and covariance \({\textbf{U}}_1^T{\textbf{P}}{\textbf{U}}_1 = {\textbf{U}}_1^T{\textbf{U}}_1{\textbf{U}}_1^T{\textbf{U}}_1 = {\textbf{I}}_p\). Similarly, \(Z= {\textbf{U}}_2^TX\) is distributed according to a Gaussian distribution with mean \({\textbf{0}}\) and covariance \({\textbf{U}}_2^T{\textbf{P}}{\textbf{U}}_2 = {\textbf{U}}_2^T{\textbf{U}}_1{\textbf{U}}_1^T{\textbf{U}}_2 = {\textbf{0}}\), thus \(Z= {\textbf{0}}\).

Substituting \(Y\) and \(Z\) into (A1) gives \(X^TX= Y^TY+ {\textbf{0}}^T{\textbf{0}}\). From \(Y\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}}_p)\) follows \((X^TX)\sim \chi _p^2\). \(\square \)
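The conclusion of Lemma A.7 can also be observed numerically: with \({\textbf{P}}= {\textbf{U}}_1{\textbf{U}}_1^T\) of rank p, a sample from \({{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{P}})\) can be generated as \({\textbf{P}}g\) with \(g\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}}_n)\), and the squared norms should have the mean p and variance 2p of a chi-squared distribution with p degrees of freedom. The construction below is an illustration, not part of the paper.

import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3

U1, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
P = U1 @ U1.T                                       # orthogonal projector of rank p

g = rng.standard_normal((200_000, n))
X = g @ P.T                                         # X = P g  ~  N(0, P P^T) = N(0, P)
sq_norms = (X ** 2).sum(axis=1)

print(sq_norms.mean(), p)       # chi^2_p mean
print(sq_norms.var(), 2 * p)    # chi^2_p variance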

Lemma A.8

If \({\textbf{A}}\in {\mathbb {R}}^{n\times n}\) is symmetric positive definite, and \(M\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_\mu ,\varvec{\Sigma }_\mu )\) and \(N\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_\nu ,\varvec{\Sigma }_\nu )\) are independent random variables in \({\mathbb {R}}^n\), then

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}[\Vert M-N\Vert _{\textbf{A}}^2] = \Vert {\textbf{x}}_\mu -{\textbf{x}}_\nu \Vert _{\textbf{A}}^2 + {{\,\textrm{trace}\,}}({\textbf{A}}\varvec{\Sigma }_\mu ) + {{\,\textrm{trace}\,}}({\textbf{A}}\varvec{\Sigma }_\nu ). \end{aligned}$$

Proof

The random variable \(M-N\) has mean \({{\,\mathrm{{\mathbb {E}}}\,}}[M-N] = {\textbf{x}}_\mu -{\textbf{x}}_\nu \), and covariance

$$\begin{aligned} \varvec{\Sigma }_{M-N}&\equiv {{\,\textrm{Cov}\,}}(M-N,M-N) \\&= {{\,\textrm{Cov}\,}}(M,M) + {{\,\textrm{Cov}\,}}(N,N) -{{\,\textrm{Cov}\,}}(M,N) - {{\,\textrm{Cov}\,}}(N,M) \\&= {{\,\textrm{Cov}\,}}(M,M) + {{\,\textrm{Cov}\,}}(N,N) = \varvec{\Sigma }_{\mu }+\varvec{\Sigma }_{\nu }, \end{aligned}$$

where the covariances \({{\,\textrm{Cov}\,}}(M,N) ={{\,\textrm{Cov}\,}}(N,M) = 0\) because \(M\) and \(N\) are independent. Now apply Lemma A.6 to \(M-N\). \(\square \)
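The same Monte Carlo device as for Lemma A.6 verifies Lemma A.8; the matrices below are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(6)
n = 4

C = rng.standard_normal((n, n)); A = C @ C.T + np.eye(n)   # symmetric positive definite
Lm = rng.standard_normal((n, n)); Sigma_mu = Lm @ Lm.T
Ln = rng.standard_normal((n, n)); Sigma_nu = Ln @ Ln.T
x_mu, x_nu = rng.standard_normal(n), rng.standard_normal(n)

M = rng.multivariate_normal(x_mu, Sigma_mu, size=400_000)
N = rng.multivariate_normal(x_nu, Sigma_nu, size=400_000)   # independent of M
D = M - N
empirical = np.einsum('ij,jk,ik->i', D, A, D).mean()        # E ||M - N||_A^2
exact = (x_mu - x_nu) @ A @ (x_mu - x_nu) + np.trace(A @ Sigma_mu) + np.trace(A @ Sigma_nu)
print(empirical, exact)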

Appendix B: Algorithms

We present algorithms for the modified Lanczos method (Sect. B.1), BayesCG with random search directions (Sect. B.2), BayesCG with covariances in factored form (Sect. B.3), and BayesCG under the Krylov prior (Sect. B.4).

1.1 Modified Lanczos method

The Lanczos method [42, Algorithm 6.15] produces an orthonormal basis for the Krylov space \({\mathcal {K}}_g({\textbf{A}},{\textbf{v}}_1)\), while the modified version in Algorithm B.1 produces an \({\textbf{A}}\)-orthonormal basis.

Algorithm B.1: Modified Lanczos Method

Algorithm B.1 reorthogonalizes the basis vectors \({\textbf{v}}_i\) with classical Gram–Schmidt applied twice (Lines 10 and 11). This reorthogonalization technique can be implemented efficiently and produces vectors that are orthogonal to machine precision [43, 44].
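For concreteness, a minimal NumPy sketch of the idea behind Algorithm B.1 follows. It is an illustrative reading of the method (\({\textbf{A}}\)-orthonormalization of a Krylov basis with classical Gram–Schmidt applied twice), not the authors' exact pseudocode, and the helper name modified_lanczos is our own.

import numpy as np

def modified_lanczos(A, v, g):
    """Return V whose columns are an A-orthonormal basis of K_g(A, v).

    Illustrative sketch only: full classical Gram-Schmidt in the A-inner
    product, applied twice for reorthogonalization; assumes g does not
    exceed the grade of v with respect to A."""
    n = v.shape[0]
    V = np.zeros((n, g))
    V[:, 0] = v / np.sqrt(v @ A @ v)            # normalize in the A-norm
    for j in range(1, g):
        w = A @ V[:, j - 1]                     # expand the Krylov space
        for _ in range(2):                      # classical Gram-Schmidt, twice
            w = w - V[:, :j] @ (V[:, :j].T @ (A @ w))
        V[:, j] = w / np.sqrt(w @ A @ w)
    return V

# Illustrative check of A-orthonormality on a random SPD matrix
rng = np.random.default_rng(7)
n = 20
C = rng.standard_normal((n, n)); A = C @ C.T + n * np.eye(n)
V = modified_lanczos(A, rng.standard_normal(n), 8)
print(np.allclose(V.T @ A @ V, np.eye(8)))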

1.2 BayesCG with random search directions

The version of BayesCG in Algorithm B.2 is designed to be calibrated because the search directions do not depend on \({\textbf{x}}_*\), hence the posteriors do not depend on \({\textbf{x}}_*\) either [6, Section 1.1].

After sampling an initial random search direction \({\textbf{s}}_1\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}})\), Algorithm B.2 computes an \({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}\)-orthonormal basis for the Krylov space \({\mathcal {K}}_m({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}},{\textbf{s}}_1)\) with Algorithm B.1. Then Algorithm B.2 computes the BayesCG posteriors directly with (2) and (3) from Theorem 2.1. The numerical experiments in Sect. 5 run Algorithm B.2 with the inverse prior \(\mu _0 = {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{A}}^{-1})\).

Algorithm B.2: BayesCG with random search directions
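A hedged sketch of Algorithm B.2 is given below, reusing the modified_lanczos helper from the Sect. B.1 sketch. The posterior formulas (2) and (3) of Theorem 2.1 are not reproduced in this appendix, so the code assumes the standard BayesCG updates from [1], in which the Gram matrix \({\textbf{S}}_m^T{\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}{\textbf{S}}_m\) reduces to the identity for \({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}\)-orthonormal search directions.

import numpy as np

def bayescg_random_directions(A, b, x0, Sigma0, m, rng):
    """Illustrative sketch of BayesCG with random search directions.

    Assumes the standard BayesCG posterior mean and covariance from [1];
    requires the modified_lanczos helper from the Sect. B.1 sketch."""
    r0 = b - A @ x0
    s1 = rng.standard_normal(b.shape[0])        # random initial search direction
    M = A @ Sigma0 @ A
    S = modified_lanczos(M, s1, m)              # A Sigma0 A-orthonormal directions
    Sigma0AS = Sigma0 @ A @ S
    x_m = x0 + Sigma0AS @ (S.T @ r0)            # posterior mean
    Sigma_m = Sigma0 - Sigma0AS @ Sigma0AS.T    # posterior covariance
    return x_m, Sigma_m

# Example with the inverse prior Sigma0 = A^{-1}, as in the experiments of Sect. 5
rng = np.random.default_rng(8)
n = 20
C = rng.standard_normal((n, n)); A = C @ C.T + n * np.eye(n)
b = rng.standard_normal(n)
x_m, Sigma_m = bayescg_random_directions(A, b, np.zeros(n), np.linalg.inv(A), 5, rng)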

1.3 BayesCG with covariances in factored form

Algorithm B.3 takes as input a general prior covariance \(\varvec{\Sigma }_0\) in factored form, and subsequently maintains the posterior covariances \(\varvec{\Sigma }_m\) in factored form as well. Theorem B.1 presents the correctness proof for Algorithm B.3.

Theorem B.1

Under the conditions of Theorem 2.1, if \(\varvec{\Sigma }_0 = {\textbf{F}}_0{\textbf{F}}_0^T\) for \({\textbf{F}}_0\in {\mathbb {R}}^{n\times \ell }\) and some \(m \le \ell \le n\), then \(\varvec{\Sigma }_m = {\textbf{F}}_m{\textbf{F}}_m^T\) with

$$\begin{aligned} {\textbf{F}}_m = {\textbf{F}}_0\left( {\textbf{I}}- {\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m({\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0{\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m)^{-1}{\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0\right) \in {\mathbb {R}}^{n\times \ell },\qquad 1\le m\le n. \end{aligned}$$

Proof

Fix m. Substituting \(\varvec{\Sigma }_0 = {\textbf{F}}_0{\textbf{F}}_0^T\) into (3) and factoring out \({\textbf{F}}_0\) on the left and \({\textbf{F}}_0^T\) on the right gives \(\varvec{\Sigma }_m = {\textbf{F}}_0{\textbf{P}}{\textbf{F}}_0^T\) where

$$\begin{aligned} {\textbf{P}}&\equiv {\textbf{I}}- {\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m({\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0{\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m)^{-1}{\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0\\&=({\textbf{I}}- {\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T) \qquad \text {where}\quad {\textbf{Q}}\equiv {\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m. \end{aligned}$$

Next we show that \({\textbf{P}}\) is a projector,

$$\begin{aligned} {\textbf{P}}^2&= {\textbf{I}}- 2{\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T + {\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T{\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T \\&= {\textbf{I}}- {\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T = {\textbf{P}}. \end{aligned}$$

Since \({\textbf{P}}\) is also symmetric, \({\textbf{P}}= {\textbf{P}}{\textbf{P}}^T\), hence \(\varvec{\Sigma }_m = {\textbf{F}}_0{\textbf{P}}{\textbf{F}}_0^T = ({\textbf{F}}_0{\textbf{P}})({\textbf{F}}_0{\textbf{P}})^T = {\textbf{F}}_m{\textbf{F}}_m^T\). \(\square \)
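A quick numerical check of Theorem B.1 (all matrices are arbitrary stand-ins; \(\varvec{\Sigma }_m\) is formed from the unfactored posterior covariance, which by the proof equals \(\varvec{\Sigma }_0 - \varvec{\Sigma }_0{\textbf{A}}{\textbf{S}}_m({\textbf{S}}_m^T{\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}{\textbf{S}}_m)^{-1}{\textbf{S}}_m^T{\textbf{A}}\varvec{\Sigma }_0\)):

import numpy as np

rng = np.random.default_rng(9)
n, ell, m = 6, 5, 3                              # m <= ell <= n

C = rng.standard_normal((n, n)); A = C @ C.T + np.eye(n)   # symmetric positive definite
F0 = rng.standard_normal((n, ell)); Sigma0 = F0 @ F0.T
S = rng.standard_normal((n, m))                  # stand-in search directions S_m

Q = F0.T @ A @ S                                 # the Q of the proof
P = np.eye(ell) - Q @ np.linalg.solve(Q.T @ Q, Q.T)
Fm = F0 @ P                                      # factor from Theorem B.1

Gram = S.T @ A @ Sigma0 @ A @ S
Sigma_m = Sigma0 - Sigma0 @ A @ S @ np.linalg.inv(Gram) @ S.T @ A @ Sigma0
print(np.allclose(Fm @ Fm.T, Sigma_m))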

Algorithm B.3: BayesCG with covariances in factored form

1.4 BayesCG under the Krylov prior

We present algorithms for BayesCG under full Krylov posteriors (Sect. B.4.1) and under approximate Krylov posteriors (Sect. B.4.2).

1.4.1 Full Krylov posteriors

Algorithm B.4 computes the following: a matrix \({\textbf{V}}\) whose columns are an \({\textbf{A}}\)-orthonormal basis for \({\mathcal {K}}_g({\textbf{A}},{\textbf{r}}_0)\); the diagonal matrix \(\varvec{\Phi }\) in (5); and the posterior mean \({\textbf{x}}_m\) in (26). The output consists of the posterior mean \({\textbf{x}}_m\), and the factors \({\textbf{V}}_{m+1:g}\) and \(\varvec{\Phi }_{m+1:g}\) for the posterior covariance.

Algorithm B.4: BayesCG under the Krylov prior with full posteriors

1.4.2 Approximate Krylov posteriors

Algorithm B.5 computes rank-d approximate Krylov posteriors in two main steps: (i) posterior mean and iterates \({\textbf{x}}_m\) in Lines 5–14; and (ii) factorization of the posterior covariance \({\widehat{\varvec{\Gamma }}}_m\) in Lines 16–26.

Algorithm B.5: BayesCG under the Krylov prior [2, Algorithm 3.1]

Cite this article

Reid, T.W., Ipsen, I.C.F., Cockayne, J. et al. Statistical properties of BayesCG under the Krylov prior. Numer. Math. 155, 239–288 (2023). https://doi.org/10.1007/s00211-023-01375-7
