Statistical properties of BayesCG under the Krylov prior

Abstract

We analyse the calibration of BayesCG under the Krylov prior. BayesCG is a probabilistic numerical extension of the Conjugate Gradient (CG) method for solving systems of linear equations with a real symmetric positive definite coefficient matrix. In addition to the CG solution, BayesCG returns a posterior distribution over the solution. In this context, a posterior distribution is said to be ‘calibrated’ if the CG error is well described, in a precise distributional sense, by the posterior spread. Since it is known that BayesCG is not calibrated, we introduce two related weaker notions of calibration, whose departures from exact calibration can be quantified. Numerical experiments confirm that, under low-rank approximate Krylov posteriors, BayesCG is only slightly optimistic, exhibits the characteristics of a calibrated solver, and is computationally competitive with CG.

References

  1. Cockayne, J., Oates, C.J., Ipsen, I.C.F., Girolami, M.: A Bayesian conjugate gradient method (with discussion). Bayesian Anal. 14(3), 937–1012 (2019). https://doi.org/10.1214/19-BA1145. Includes 6 discussions and a rejoinder from the authors

  2. Reid, T.W., Ipsen, I.C.F., Cockayne, J., Oates, C.J.: BayesCG as an uncertainty aware version of CG. arXiv:2008.03225 (2022)

  3. Cockayne, J., Oates, C.J., Sullivan, T.J., Girolami, M.: Bayesian probabilistic numerical methods. SIAM Rev. 61(4), 756–789 (2019). https://doi.org/10.1137/17M1139357

  4. Hennig, P., Osborne, M.A., Girolami, M.: Probabilistic numerics and uncertainty in computations. Proc. R. Soc. A. 471(2179), 20150142–17 (2015)

  5. Oates, C.J., Sullivan, T.J.: A modern retrospective on probabilistic numerics. Stat. Comput. 29(6), 1335–1351 (2019). https://doi.org/10.1007/s11222-019-09902-z

  6. Cockayne, J., Ipsen, I.C.F., Oates, C.J., Reid, T.W.: Probabilistic iterative methods for linear systems. J. Mach. Learn. Res. 22(232), 1–34 (2021)

  7. Hart, J., van Bloemen Waanders, B., Herzog, R.: Hyperdifferential sensitivity analysis of uncertain parameters in PDE-constrained optimization. Int. J. Uncertain. Quantif. 10(3), 225–248 (2020). https://doi.org/10.1615/Int.J.UncertaintyQuantification.2020032480

  8. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering, p. 664. Springer, New York (2006)

  9. Petra, N., Zhu, H., Stadler, G., Hughes, T.J.R., Ghattas, O.: An inexact Gauss-Newton method for inversion of basal sliding and rheology parameters in a nonlinear Stokes ice sheet model. J. Glaciol. 58(211), 889–903 (2012). https://doi.org/10.3189/2012JoG11J182

  10. Saibaba, A.K., Hart, J., van Bloemen Waanders, B.: Randomized algorithms for generalized singular value decomposition with application to sensitivity analysis. Numer. Linear Algebra Appl. 28(4), 2364–27 (2021). https://doi.org/10.1002/nla.2364

  11. Bartels, S., Cockayne, J., Ipsen, I.C.F., Hennig, P.: Probabilistic linear solvers: a unifying view. Stat. Comput. 29(6), 1249–1263 (2019). https://doi.org/10.1007/s11222-019-09897-7

  12. Fanaskov, V.: Uncertainty calibration for probabilistic projection methods. Stat. Comput. 31(5), 56–17 (2021). https://doi.org/10.1007/s11222-021-10031-9

  13. Hennig, P.: Probabilistic interpretation of linear solvers. SIAM J. Optim. 25(1), 234–260 (2015). https://doi.org/10.1137/140955501

  14. Wenger, J., Hennig, P.: Probabilistic linear solvers for machine learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6731–6742. Curran Associates Inc, Red Hook (2020)

  15. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409–436 (1952). https://doi.org/10.6028/jres.049.044

  16. Higham, N.J.: Functions of Matrices. Theory and Computation. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2008). https://doi.org/10.1137/1.9780898717778

  17. Cockayne, J., Oates, C.J., Ipsen, I.C.F., Girolami, M.: Supplementary material for ‘A Bayesian conjugate-gradient method’. Bayesian Anal. (2019). https://doi.org/10.1214/19-BA1145SUPP

  18. Liesen, J., Strakos, Z.: Krylov Subspace Methods: Principles and Analysis. Oxford University Press, Oxford (2013)

  19. Berljafa, M., Güttel, S.: Generalized rational Krylov decompositions with an application to rational approximation. SIAM J. Matrix Anal. Appl. 36(2), 894–916 (2015). https://doi.org/10.1137/140998081

  20. Strakoš, Z., Tichý, P.: On error estimation in the conjugate gradient method and why it works in finite precision computations. Electron. Trans. Numer. Anal. 13, 56–80 (2002)

  21. Kressner, D., Latz, J., Massei, S., Ullmann, E.: Certified and fast computations with shallow covariance kernels. arXiv:2001.09187 (2020)

  22. Villani, C.: Optimal Transport, Old and New. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 338, p. 973. Springer, Berlin (2009). https://doi.org/10.1007/978-3-540-71050-9

  23. Gelbrich, M.: On a formula for the \(L^2\) Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachr. 147, 185–203 (1990). https://doi.org/10.1002/mana.19901470121

  24. Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer, New York (1985). https://doi.org/10.1007/978-1-4757-4286-2

  25. Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer 19, 451–559 (2010). https://doi.org/10.1017/S0962492910000061

  26. Cockayne, J., Graham, M.M., Oates, C.J., Sullivan, T.J.: Testing whether a Learning Procedure is Calibrated. arXiv:2012.12670 (2021)

  27. Ross, S.M.: Introduction to Probability Models, 9th edn. Academic Press Inc, Boston (2007)

  28. Kaltenbach, H.-M.: A Concise Guide to Statistics. Springer Briefs in Statistics, p. 111. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-23502-3

  29. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, vol. 112, 2nd edn. Springer, New York (2021). https://doi.org/10.1007/978-1-0716-1418-1

  30. BCSSTK14: BCS Structural Engineering Matrices (linear equations) Roof of the Omni Coliseum, Atlanta. https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc2/bcsstk14.html

  31. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. The Johns Hopkins University Press, Baltimore (2013)

  32. Hairer, E., Nørsett, S.P., Wanner, G.: Solving Ordinary Differential Equations. I, 2nd edn. Springer Series in Computational Mathematics, vol. 8. Springer, Berlin (1993)

  33. Golub, G.H., Meurant, G.: Matrices, moments and quadrature. II. How to compute the norm of the error in iterative methods. BIT 37(3), 687–705 (1997). https://doi.org/10.1007/BF02510247

  34. Meurant, G., Tichý, P.: Approximating the extreme Ritz values and upper bounds for the \(A\)-norm of the error in CG. Numer. Algorithms 82(3), 937–968 (2019). https://doi.org/10.1007/s11075-018-0634-8

  35. Strakoš, Z.: On the real convergence rate of the conjugate gradient method. Linear Algebra Appl. 154(156), 535–549 (1991). https://doi.org/10.1016/0024-3795(91)90393-B

  36. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)

  37. Muirhead, R.J.: Aspects of Multivariate Statistical Theory. Wiley, New York (1982)

  38. Ouellette, D.V.: Schur complements and statistics. Linear Algebra Appl. 36, 187–295 (1981). https://doi.org/10.1016/0024-3795(81)90232-9

  39. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)

  40. Campbell, S.L., Meyer, C.D.: Generalized Inverses of Linear Transformations. Classics in Applied Mathematics, vol. 56. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2009). https://doi.org/10.1137/1.9780898719048.ch0

  41. Mathai, A.M., Provost, S.B.: Quadratic Forms in Random Variables: Theory and Applications. Marcel Dekker Inc, New York (1992)

  42. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2003)

  43. Giraud, L., Langou, J., Rozložník, M.: The loss of orthogonality in the Gram–Schmidt orthogonalization process. Comput. Math. Appl. 50(7), 1069–1075 (2005). https://doi.org/10.1016/j.camwa.2005.08.009

  44. Giraud, L., Langou, J., Rozložník, M., van den Eshof, J.: Rounding error analysis of the classical Gram–Schmidt orthogonalization process. Numer. Math. 101(1), 87–100 (2005). https://doi.org/10.1007/s00211-005-0615-4

Acknowledgements

We thank the reviewers for their helpful comments and questions, which led to the addition of Sect. 6. We are also deeply indebted to Jonathan Wenger for his very careful reading of our paper and many perceptive suggestions that greatly improved the paper and widened its outlook through the addition of Sects. 5.5 and 5.6.

Funding

The work was supported in part by NSF Grant DMS-1745654 (TWR, ICFI), NSF Grant DMS-1760374 and DOE Grant DE-SC0022085 (ICFI), and the Lloyd’s Register Foundation Programme on Data Centric Engineering at the Alan Turing Institute (CJO).

Corresponding author

Correspondence to Ilse C. F. Ipsen.

Appendices

Appendix A: Auxiliary results

We present auxiliary results required for proofs in other sections.

The stability of Gaussian distributions implies that an affine transformation of a Gaussian random variable remains Gaussian.

Lemma A.1

(Stability of Gaussian Distributions [37, Section 1.2]) Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}},\varvec{\Sigma })\) be a Gaussian random variable with mean \({\textbf{x}}\in {\mathbb {R}}^n\) and covariance \(\varvec{\Sigma }\in {\mathbb {R}}^{n\times n}\). If \({\textbf{y}}\in {\mathbb {R}}^n\) and \({\textbf{F}}\in {\mathbb {R}}^{n\times n}\), then

$$\begin{aligned} Z= {\textbf{y}}+{\textbf{F}}X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{y}}+{\textbf{F}}{\textbf{x}}, {\textbf{F}}\varvec{\Sigma }{\textbf{F}}^T). \end{aligned}$$
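As an informal illustration (not part of the original text), Lemma A.1 can be checked with a small Monte Carlo experiment in NumPy; all matrices and dimensions below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n = 4

x = rng.standard_normal(n)                        # mean of X
L = rng.standard_normal((n, n)); Sigma = L @ L.T  # covariance of X
y = rng.standard_normal(n)                        # shift
F = rng.standard_normal((n, n))                   # linear transformation

X = rng.multivariate_normal(x, Sigma, size=200_000)
Z = y + X @ F.T                                   # Z = y + F X for each sample

# Both deviations should be small (pure Monte Carlo error)
print(np.abs(Z.mean(axis=0) - (y + F @ x)).max())
print(np.abs(np.cov(Z, rowvar=False) - F @ Sigma @ F.T).max())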

The conjugacy of Gaussian distributions implies that the distribution of a Gaussian random variable conditioned on information that linearly depends on the random variable is a Gaussian distribution.

Lemma A.2

(Conjugacy of Gaussian Distributions [38, Section 6.1], [25, Corollary 6.21]) Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}},\varvec{\Sigma }_x)\) and \(Y\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{y}},\varvec{\Sigma }_y)\). The jointly Gaussian random variable \(\begin{bmatrix}X^T&Y^T\end{bmatrix}^T\) has the distribution

$$\begin{aligned} \begin{bmatrix} X\\ Y\end{bmatrix} \sim {{\,\mathrm{{\mathcal {N}}}\,}}\left( \begin{bmatrix} {\textbf{x}}\\ {\textbf{y}}\end{bmatrix},\begin{bmatrix}\varvec{\Sigma }_x &{} \varvec{\Sigma }_{xy} \\ \varvec{\Sigma }_{xy}^T &{} \varvec{\Sigma }_y\end{bmatrix} \right) , \end{aligned}$$

where \(\varvec{\Sigma }_{xy} \equiv {{\,\textrm{Cov}\,}}(X,Y) = {{\,\mathrm{{\mathbb {E}}}\,}}[(X-{\textbf{x}})(Y-{\textbf{y}})^T]\) and the conditional distribution of \(X\) given \(Y\) is

$$\begin{aligned} (X\mid Y) \sim {{\,\mathrm{{\mathcal {N}}}\,}}(\overbrace{{\textbf{x}}+ \varvec{\Sigma }_{xy}\varvec{\Sigma }_y^\dagger (Y- {\textbf{y}})}^{\text {mean}}, \ \overbrace{\varvec{\Sigma }_x - \varvec{\Sigma }_{xy}\varvec{\Sigma }_y^\dagger \varvec{\Sigma }_{xy}^T}^{\text {covariance}}). \end{aligned}$$
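A minimal NumPy sketch of the conditioning step in Lemma A.2; the block matrices below are illustrative stand-ins, not quantities from the paper.

import numpy as np

rng = np.random.default_rng(1)
n = 3

# Illustrative joint covariance [[Sigma_x, Sigma_xy], [Sigma_xy^T, Sigma_y]]
L = rng.standard_normal((2 * n, 2 * n))
joint = L @ L.T
Sigma_x, Sigma_xy, Sigma_y = joint[:n, :n], joint[:n, n:], joint[n:, n:]
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def condition_on(Y_obs):
    """Mean and covariance of (X | Y = Y_obs), with a pseudoinverse for Sigma_y."""
    Sy_pinv = np.linalg.pinv(Sigma_y)
    mean = x + Sigma_xy @ Sy_pinv @ (Y_obs - y)
    cov = Sigma_x - Sigma_xy @ Sy_pinv @ Sigma_xy.T
    return mean, cov

cond_mean, cond_cov = condition_on(y + 0.1 * rng.standard_normal(n))

The BayesCG posteriors in Theorem 2.1 arise from this type of conditioning on information that depends linearly on the solution.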

We show how to transform a \({\textbf{B}}\)-orthogonal matrix into an orthogonal matrix.

Lemma A.3

Let \({\textbf{B}}\in {\mathbb {R}}^{n\times n}\) be symmetric positive definite, and let \({\textbf{H}}\in {\mathbb {R}}^{n\times n}\) be a \({\textbf{B}}\)-orthogonal matrix with \({\textbf{H}}^T{\textbf{B}}{\textbf{H}}= {\textbf{H}}{\textbf{B}}{\textbf{H}}^T = {\textbf{I}}\). Then

$$\begin{aligned} {\textbf{U}}\equiv {\textbf{B}}^{1/2}{\textbf{H}}\end{aligned}$$

is an orthogonal matrix with \({\textbf{U}}^T{\textbf{U}}= {\textbf{U}}{\textbf{U}}^T={\textbf{I}}\).

Proof

The symmetry of \({\textbf{B}}\) and the \({\textbf{B}}\)-orthogonality of \({\textbf{H}}\) imply

$$\begin{aligned} {\textbf{U}}^T{\textbf{U}}= {\textbf{H}}^T{\textbf{B}}{\textbf{H}}= {\textbf{I}}. \end{aligned}$$

From the orthonormality of the columns of \({\textbf{U}}\), and the fact that \({\textbf{U}}\) is square, it follows that \({\textbf{U}}\) is an orthogonal matrix [39, Definition 2.1.3]. \(\square \)
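A small NumPy check of Lemma A.3 (the matrices are arbitrary; \({\textbf{B}}^{1/2}\) denotes the symmetric square root, computed here by an eigendecomposition, and a \({\textbf{B}}\)-orthogonal \({\textbf{H}}\) is built as \({\textbf{B}}^{-1/2}\) times an orthogonal matrix):

import numpy as np

rng = np.random.default_rng(2)
n = 5

C = rng.standard_normal((n, n))
B = C @ C.T + n * np.eye(n)                   # symmetric positive definite
w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T        # symmetric square root B^{1/2}
B_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = B_half_inv @ Q                            # one way to obtain a B-orthogonal H
print(np.allclose(H.T @ B @ H, np.eye(n)))    # B-orthogonality

U = B_half @ H                                # Lemma A.3
print(np.allclose(U.T @ U, np.eye(n)), np.allclose(U @ U.T, np.eye(n)))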

Definition A.4

[39, Section 7.3] The thin singular value decomposition of the rank-p matrix \({\textbf{G}}\in {\mathbb {R}}^{m\times n}\) is

$$\begin{aligned} {\textbf{G}}= {\textbf{U}}{\textbf{D}}{\textbf{W}}^T, \end{aligned}$$

where \({\textbf{U}}\in {\mathbb {R}}^{m\times p}\) and \({\textbf{W}}\in {\mathbb {R}}^{n\times p}\) are matrices with orthonormal columns and \({\textbf{D}}\in {\mathbb {R}}^{p\times p}\) is a diagonal matrix with positive diagonal elements. The Moore–Penrose inverse of \({\textbf{G}}\) is

$$\begin{aligned} {\textbf{G}}^\dagger = {\textbf{W}}{\textbf{D}}^{-1}{\textbf{U}}^T. \end{aligned}$$

If a matrix has full column-rank or full row-rank, then its Moore–Penrose inverse can be expressed in terms of the matrix itself. Furthermore, the Moore–Penrose inverse of a product is equal to the product of the Moore–Penrose inverses, provided the first matrix has full column-rank and the second matrix has full row-rank.

Lemma A.5

[40, Corollary 1.4.2] Let \({\textbf{G}}\in {\mathbb {R}}^{m\times n}\) and \({\textbf{J}}\in {\mathbb {R}}^{n\times p}\) have full column and row rank respectively, so \({{\,\textrm{rank}\,}}({\textbf{G}})={{\,\textrm{rank}\,}}({\textbf{J}})=n\). The Moore–Penrose inverses of \({\textbf{G}}\) and \({\textbf{J}}\) are

$$\begin{aligned} {\textbf{G}}^\dagger = ({\textbf{G}}^T{\textbf{G}})^{-1}{\textbf{G}}^T \quad \text {and} \quad {\textbf{J}}^\dagger = {\textbf{J}}^T({\textbf{J}}{\textbf{J}}^T)^{-1} \end{aligned}$$

respectively, and the Moore–Penrose inverse of the product equals

$$\begin{aligned} ({\textbf{G}}{\textbf{J}})^\dagger = {\textbf{J}}^\dagger {\textbf{G}}^\dagger . \end{aligned}$$
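The formulas in Lemma A.5 are easy to check numerically; the dimensions below are arbitrary, and with probability one the random \({\textbf{G}}\) and \({\textbf{J}}\) have full column and row rank, respectively.

import numpy as np

rng = np.random.default_rng(3)
m, n, p = 6, 4, 7

G = rng.standard_normal((m, n))      # full column rank (almost surely)
J = rng.standard_normal((n, p))      # full row rank (almost surely)

G_pinv = np.linalg.inv(G.T @ G) @ G.T        # (G^T G)^{-1} G^T
J_pinv = J.T @ np.linalg.inv(J @ J.T)        # J^T (J J^T)^{-1}

print(np.allclose(G_pinv, np.linalg.pinv(G)))
print(np.allclose(J_pinv, np.linalg.pinv(J)))
print(np.allclose(np.linalg.pinv(G @ J), J_pinv @ G_pinv))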

Below is an explicit expression for the mean of a quadratic form of Gaussians.

Lemma A.6

[41, Sections 3.2b.1–3.2b.3] Let \(Z\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_z,\varvec{\Sigma }_z)\) be a Gaussian random variable in \({\mathbb {R}}^n\), and \({\textbf{B}}\in {\mathbb {R}}^{n\times n}\) be symmetric positive definite. The mean of \(Z^T{\textbf{B}}Z\) is

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}[Z^T{\textbf{B}}Z]&= {{\,\textrm{trace}\,}}({\textbf{B}}\varvec{\Sigma }_z) + {\textbf{x}}_z^T{\textbf{B}}{\textbf{x}}_z. \end{aligned}$$
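A Monte Carlo check of Lemma A.6 (arbitrary test matrices; the discrepancy below is pure sampling error):

import numpy as np

rng = np.random.default_rng(4)
n = 4

C = rng.standard_normal((n, n)); B = C @ C.T + np.eye(n)   # symmetric positive definite
L = rng.standard_normal((n, n)); Sigma_z = L @ L.T
x_z = rng.standard_normal(n)

Z = rng.multivariate_normal(x_z, Sigma_z, size=500_000)
empirical = np.einsum('ij,jk,ik->i', Z, B, Z).mean()        # mean of Z^T B Z over samples
exact = np.trace(B @ Sigma_z) + x_z @ B @ x_z
print(empirical, exact)                                     # agree up to Monte Carlo error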

We show that the squared Euclidean norm of a Gaussian random variable with an orthogonal projector as its covariance matrix is distributed according to a chi-squared distribution.

Lemma A.7

Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{P}})\) be a Gaussian random variable in \({\mathbb {R}}^n\). If the covariance matrix \({\textbf{P}}\) is an orthogonal projector, that is, if \({\textbf{P}}^2 = {\textbf{P}}\) and \({\textbf{P}}= {\textbf{P}}^T\), then

$$\begin{aligned} \Vert X\Vert _2^2= (X^TX) \sim \chi _p^2, \end{aligned}$$

where \(p = {{\,\textrm{rank}\,}}({\textbf{P}})\).

Proof

We express the projector in terms of orthonormal matrices and then use the invariance of the 2-norm under orthogonal matrices and the stability of Gaussians.

Since \({\textbf{P}}\) is an orthogonal projector, there exists \({\textbf{U}}_1\in {\mathbb {R}}^{n\times p}\) such that \({\textbf{U}}_1{\textbf{U}}_1^T = {\textbf{P}}\) and \({\textbf{U}}_1^T{\textbf{U}}_1 = {\textbf{I}}_p\). Choose \({\textbf{U}}_2\in {\mathbb {R}}^{n\times (n-p)}\) so that \({\textbf{U}}= \begin{bmatrix} {\textbf{U}}_1&{\textbf{U}}_2 \end{bmatrix}\) is an orthogonal matrix. Thus,

$$\begin{aligned} X^TX= X^T{\textbf{U}}{\textbf{U}}^TX= X^T{\textbf{U}}_1{\textbf{U}}_1^TX+ X^T{\textbf{U}}_2{\textbf{U}}_2^TX. \end{aligned}$$
(A1)

Lemma A.1 implies that \(Y= {\textbf{U}}_1^TX\) is distributed according to a Gaussian distribution with mean \({\textbf{0}}\) and covariance \({\textbf{U}}_1^T{\textbf{P}}{\textbf{U}}_1 = {\textbf{U}}_1^T{\textbf{U}}_1{\textbf{U}}_1^T{\textbf{U}}_1 = {\textbf{I}}_p\). Similarly, \(Z= {\textbf{U}}_2^TX\) is distributed according to a Gaussian distribution with mean \({\textbf{0}}\) and covariance \({\textbf{U}}_2^T{\textbf{P}}{\textbf{U}}_2 = {\textbf{U}}_2^T{\textbf{U}}_1{\textbf{U}}_1^T{\textbf{U}}_2 = {\textbf{0}}\), thus \(Z= {\textbf{0}}\).

Substituting \(Y\) and \(Z\) into (A1) gives \(X^TX= Y^TY+ {\textbf{0}}^T{\textbf{0}}\). From \(Y\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}}_p)\) follows \((X^TX)\sim \chi _p^2\). \(\square \)
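The conclusion of Lemma A.7 can also be observed numerically: with \({\textbf{P}}= {\textbf{U}}_1{\textbf{U}}_1^T\) of rank p, a sample from \({{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{P}})\) can be generated as \({\textbf{P}}g\) with \(g\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}}_n)\), and the squared norms should have the mean p and variance 2p of a chi-squared distribution with p degrees of freedom. The construction below is an illustration, not part of the paper.

import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3

U1, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
P = U1 @ U1.T                                       # orthogonal projector of rank p

g = rng.standard_normal((200_000, n))
X = g @ P.T                                         # X = P g  ~  N(0, P P^T) = N(0, P)
sq_norms = (X ** 2).sum(axis=1)

print(sq_norms.mean(), p)       # chi^2_p mean
print(sq_norms.var(), 2 * p)    # chi^2_p variance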

Lemma A.8

If \({\textbf{A}}\in {\mathbb {R}}^{n\times n}\) is symmetric positive definite, and \(M\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_\mu ,\varvec{\Sigma }_\mu )\) and \(N\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_\nu ,\varvec{\Sigma }_\nu )\) are independent random variables in \({\mathbb {R}}^n\), then

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}[\Vert M-N\Vert _{\textbf{A}}^2] = \Vert {\textbf{x}}_\mu -{\textbf{x}}_\nu \Vert _{\textbf{A}}^2 + {{\,\textrm{trace}\,}}({\textbf{A}}\varvec{\Sigma }_\mu ) + {{\,\textrm{trace}\,}}({\textbf{A}}\varvec{\Sigma }_\nu ). \end{aligned}$$

Proof

The random variable \(M-N\) has mean \({{\,\mathrm{{\mathbb {E}}}\,}}[M-N] = {\textbf{x}}_\mu -{\textbf{x}}_\nu \), and covariance

$$\begin{aligned} \varvec{\Sigma }_{M-N}&\equiv {{\,\textrm{Cov}\,}}(M-N,M-N) \\&= {{\,\textrm{Cov}\,}}(M,M) + {{\,\textrm{Cov}\,}}(N,N) -{{\,\textrm{Cov}\,}}(M,N) - {{\,\textrm{Cov}\,}}(N,M) \\&= {{\,\textrm{Cov}\,}}(M,M) + {{\,\textrm{Cov}\,}}(N,N) = \varvec{\Sigma }_{\mu }+\varvec{\Sigma }_{\nu }, \end{aligned}$$

where the covariances \({{\,\textrm{Cov}\,}}(M,N) ={{\,\textrm{Cov}\,}}(N,M) = 0\) because \(M\) and \(N\) are independent. Now apply Lemma A.6 to \(M-N\). \(\square \)
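The same Monte Carlo device as for Lemma A.6 verifies Lemma A.8; the matrices below are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(6)
n = 4

C = rng.standard_normal((n, n)); A = C @ C.T + np.eye(n)   # symmetric positive definite
Lm = rng.standard_normal((n, n)); Sigma_mu = Lm @ Lm.T
Ln = rng.standard_normal((n, n)); Sigma_nu = Ln @ Ln.T
x_mu, x_nu = rng.standard_normal(n), rng.standard_normal(n)

M = rng.multivariate_normal(x_mu, Sigma_mu, size=400_000)
N = rng.multivariate_normal(x_nu, Sigma_nu, size=400_000)   # independent of M
D = M - N
empirical = np.einsum('ij,jk,ik->i', D, A, D).mean()        # E ||M - N||_A^2
exact = (x_mu - x_nu) @ A @ (x_mu - x_nu) + np.trace(A @ Sigma_mu) + np.trace(A @ Sigma_nu)
print(empirical, exact)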

Appendix B: Algorithms

We present algorithms for the modified Lanczos method (Sect. B.1), BayesCG with random search directions (Sect. B.2), BayesCG with covariances in factored form (Sect. B.3), and BayesCG under the Krylov prior (Sect. B.4).

1.1 Modified Lanczos method

The Lanczos method [42, Algorithm 6.15] produces an orthonormal basis for the Krylov space \({\mathcal {K}}_g({\textbf{A}},{\textbf{v}}_1)\), while the modified version in Algorithm B.1 produces an \({\textbf{A}}\)-orthonormal basis.

Algorithm B.1: Modified Lanczos Method

Algorithm B.1 reorthogonalizes the basis vectors \({\textbf{v}}_i\) with classical Gram–Schmidt applied twice (Lines 10 and 11). This reorthogonalization technique can be implemented efficiently and produces vectors that are orthogonal to machine precision [43, 44].
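For concreteness, a minimal NumPy sketch of the idea behind Algorithm B.1 follows. It is an illustrative reading of the method (\({\textbf{A}}\)-orthonormalization of a Krylov basis with classical Gram–Schmidt applied twice), not the authors' exact pseudocode, and the helper name modified_lanczos is our own.

import numpy as np

def modified_lanczos(A, v, g):
    """Return V whose columns are an A-orthonormal basis of K_g(A, v).

    Illustrative sketch only: full classical Gram-Schmidt in the A-inner
    product, applied twice for reorthogonalization; assumes g does not
    exceed the grade of v with respect to A."""
    n = v.shape[0]
    V = np.zeros((n, g))
    V[:, 0] = v / np.sqrt(v @ A @ v)            # normalize in the A-norm
    for j in range(1, g):
        w = A @ V[:, j - 1]                     # expand the Krylov space
        for _ in range(2):                      # classical Gram-Schmidt, twice
            w = w - V[:, :j] @ (V[:, :j].T @ (A @ w))
        V[:, j] = w / np.sqrt(w @ A @ w)
    return V

# Illustrative check of A-orthonormality on a random SPD matrix
rng = np.random.default_rng(7)
n = 20
C = rng.standard_normal((n, n)); A = C @ C.T + n * np.eye(n)
V = modified_lanczos(A, rng.standard_normal(n), 8)
print(np.allclose(V.T @ A @ V, np.eye(8)))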

1.2 BayesCG with random search directions

The version of BayesCG in Algorithm B.2 is designed to be calibrated because the search directions do not depend on \({\textbf{x}}_*\), hence the posteriors do not depend on \({\textbf{x}}_*\) either [6, Section 1.1].

After sampling an initial random search direction \({\textbf{s}}_1\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}})\), Algorithm B.2 computes an \({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}\)-orthonormal basis for the Krylov space \({\mathcal {K}}_m({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}},{\textbf{s}}_1)\) with Algorithm B.1. Then Algorithm B.2 computes the BayesCG posteriors directly with (2) and (3) from Theorem 2.1. The numerical experiments in Sect. 5 run Algorithm B.2 with the inverse prior \(\mu _0 = {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{A}}^{-1})\).

Algorithm B.2: BayesCG with random search directions
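A hedged sketch of Algorithm B.2 is given below, reusing the modified_lanczos helper from the Sect. B.1 sketch. The posterior formulas (2) and (3) of Theorem 2.1 are not reproduced in this appendix, so the code assumes the standard BayesCG updates from [1], in which the Gram matrix \({\textbf{S}}_m^T{\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}{\textbf{S}}_m\) reduces to the identity for \({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}\)-orthonormal search directions.

import numpy as np

def bayescg_random_directions(A, b, x0, Sigma0, m, rng):
    """Illustrative sketch of BayesCG with random search directions.

    Assumes the standard BayesCG posterior mean and covariance from [1];
    requires the modified_lanczos helper from the Sect. B.1 sketch."""
    r0 = b - A @ x0
    s1 = rng.standard_normal(b.shape[0])        # random initial search direction
    M = A @ Sigma0 @ A
    S = modified_lanczos(M, s1, m)              # A Sigma0 A-orthonormal directions
    Sigma0AS = Sigma0 @ A @ S
    x_m = x0 + Sigma0AS @ (S.T @ r0)            # posterior mean
    Sigma_m = Sigma0 - Sigma0AS @ Sigma0AS.T    # posterior covariance
    return x_m, Sigma_m

# Example with the inverse prior Sigma0 = A^{-1}, as in the experiments of Sect. 5
rng = np.random.default_rng(8)
n = 20
C = rng.standard_normal((n, n)); A = C @ C.T + n * np.eye(n)
b = rng.standard_normal(n)
x_m, Sigma_m = bayescg_random_directions(A, b, np.zeros(n), np.linalg.inv(A), 5, rng)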

1.3 BayesCG with covariances in factored form

Algorithm B.3 takes as input a general prior covariance \(\varvec{\Sigma }_0\) in factored form, and subsequently maintains the posterior covariances \(\varvec{\Sigma }_m\) in factored form as well. Theorem B.1 presents the correctness proof for Algorithm B.3.

Theorem B.1

Under the conditions of Theorem 2.1, if \(\varvec{\Sigma }_0 = {\textbf{F}}_0{\textbf{F}}_0^T\) for \({\textbf{F}}_0\in {\mathbb {R}}^{n\times \ell }\) and some \(m \le \ell \le n\), then \(\varvec{\Sigma }_m = {\textbf{F}}_m{\textbf{F}}_m^T\) with

$$\begin{aligned} {\textbf{F}}_m = {\textbf{F}}_0\left( {\textbf{I}}- {\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m({\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0{\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m)^{-1}{\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0\right) \in {\mathbb {R}}^{n\times \ell },\qquad 1\le m\le n. \end{aligned}$$

Proof

Fix m. Substituting \(\varvec{\Sigma }_0 = {\textbf{F}}_0{\textbf{F}}_0^T\) into (3) and factoring out \({\textbf{F}}_0\) on the left and \({\textbf{F}}_0^T\) on the right gives \(\varvec{\Sigma }_m = {\textbf{F}}_0{\textbf{P}}{\textbf{F}}_0^T\) where

$$\begin{aligned} {\textbf{P}}&\equiv {\textbf{I}}- {\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m({\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0{\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m)^{-1}{\textbf{S}}_m^T{\textbf{A}}{\textbf{F}}_0\\&=({\textbf{I}}- {\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T) \qquad \text {where}\quad {\textbf{Q}}\equiv {\textbf{F}}_0^T{\textbf{A}}{\textbf{S}}_m. \end{aligned}$$

Next we show that \({\textbf{P}}\) is a projector,

$$\begin{aligned} {\textbf{P}}^2&= {\textbf{I}}- 2{\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T + {\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T{\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T \\&= {\textbf{I}}- {\textbf{Q}}({\textbf{Q}}^T{\textbf{Q}})^{-1}{\textbf{Q}}^T = {\textbf{P}}. \end{aligned}$$

Since \({\textbf{P}}\) is also symmetric, \({\textbf{P}}= {\textbf{P}}{\textbf{P}}^T\), hence \(\varvec{\Sigma }_m = {\textbf{F}}_0{\textbf{P}}{\textbf{F}}_0^T = ({\textbf{F}}_0{\textbf{P}})({\textbf{F}}_0{\textbf{P}})^T = {\textbf{F}}_m{\textbf{F}}_m^T\). \(\square \)
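A quick numerical check of Theorem B.1 (all matrices are arbitrary stand-ins; \(\varvec{\Sigma }_m\) is formed from the unfactored posterior covariance, which by the proof equals \(\varvec{\Sigma }_0 - \varvec{\Sigma }_0{\textbf{A}}{\textbf{S}}_m({\textbf{S}}_m^T{\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}{\textbf{S}}_m)^{-1}{\textbf{S}}_m^T{\textbf{A}}\varvec{\Sigma }_0\)):

import numpy as np

rng = np.random.default_rng(9)
n, ell, m = 6, 5, 3                              # m <= ell <= n

C = rng.standard_normal((n, n)); A = C @ C.T + np.eye(n)   # symmetric positive definite
F0 = rng.standard_normal((n, ell)); Sigma0 = F0 @ F0.T
S = rng.standard_normal((n, m))                  # stand-in search directions S_m

Q = F0.T @ A @ S                                 # the Q of the proof
P = np.eye(ell) - Q @ np.linalg.solve(Q.T @ Q, Q.T)
Fm = F0 @ P                                      # factor from Theorem B.1

Gram = S.T @ A @ Sigma0 @ A @ S
Sigma_m = Sigma0 - Sigma0 @ A @ S @ np.linalg.inv(Gram) @ S.T @ A @ Sigma0
print(np.allclose(Fm @ Fm.T, Sigma_m))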

Algorithm B.3: BayesCG with covariances in factored form

1.4 BayesCG under the Krylov prior

We present algorithms for BayesCG under full Krylov posteriors (Sect. B.4.1) and under approximate Krylov posteriors (Sect. B.4.2).

1.4.1 Full Krylov posteriors

Algorithm B.4 computes the following: a matrix \({\textbf{V}}\) whose columns are an \({\textbf{A}}\)-orthonormal basis for \({\mathcal {K}}_g({\textbf{A}},{\textbf{r}}_0)\); the diagonal matrix \(\varvec{\Phi }\) in (5); and the posterior mean \({\textbf{x}}_m\) in (26). The output consists of the posterior mean \({\textbf{x}}_m\), and the factors \({\textbf{V}}_{m+1:g}\) and \(\varvec{\Phi }_{m+1:g}\) for the posterior covariance.

Algorithm B.4: BayesCG under the Krylov prior with full posteriors

1.4.2 Approximate Krylov posteriors

Algorithm B.5 computes rank-d approximate Krylov posteriors in two main steps: (i) posterior mean and iterates \({\textbf{x}}_m\) in Lines 5–14; and (ii) factorization of the posterior covariance \({\widehat{\varvec{\Gamma }}}_m\) in Lines 16–26.

Algorithm B.5: BayesCG under the Krylov prior [2, Algorithm 3.1]

Cite this article

Reid, T.W., Ipsen, I.C.F., Cockayne, J. et al. Statistical properties of BayesCG under the Krylov prior. Numer. Math. 155, 239–288 (2023). https://doi.org/10.1007/s00211-023-01375-7
