Multilevel dimension-independent likelihood-informed MCMC for large-scale inverse problems

Tiangang Cui; Gianluca Detommaso; Robert Scheichl

doi:10.1088/1361-6420/ad1e2c

1. Introduction

Inverse problems aim to estimate unknown parameters of mathematical models from noisy and indirect observations. The unknown parameters, often represented as functions, are related to the observed data through a forward model, such as a differential equation, that maps realisations of parameters to observables. For ill-posed inverse problems, there may exist many feasible realisations of parameters that are consistent with the observed data, and small perturbations in the data may lead to large perturbations in unregularised parameter estimates. The Bayesian approach [25, 36, 37] casts the solution of inverse problems as the posterior probability distribution of the model parameters conditioned on the data. This offers a natural way to integrate the forward model and the data together with prior knowledge and a stochastic description of measurement and/or model errors to remove the ill-posedness and to quantify uncertainties in parameters and parameter-dependent predictions. As a result, parameter estimations, model predictions, and associated uncertainty quantifications can be issued in the form of marginal distributions or expectations of some quantities of interest (QoI) over the posterior. Due to the typically high parameter dimensions and the high computational cost of the forward models, characterising the posterior and computing posterior expectations are in general computationally challenging tasks. Integrating multilevel Markov chain Monte Carlo (MCMC) [14, 22], likelihood-informed parameter reduction [11, 35, 40] and dimension-independent MCMC [4, 8, 10, 33], we present here an integrated framework to significantly accelerate the computation of posterior expectations for large-scale inverse problems.

In inverse problems, unknown parameters are often cast as functions, and hence the Bayesian inference has to be carried out over typically high-dimensional discretisations of the parameters that resolve the spatial and/or temporal variability of the underlying problem sufficiently. Examples are the permeability field of a porous medium [9, 14, 21, 23] or Brownian forcing of a stochastic ordinary differential equation [3]. In those settings, efficient MCMC methods have been developed to sample the posterior and compute posterior expectations with convergence rates that are independent of the discretised parameter dimension; these include (preconditioned) Crank–Nicolson (pCN) methods [4, 8, 19] that establish the foundation for designing and analysing MCMC algorithms in a function space setting, stochastic Newton methods [28, 30] that utilise Hessian information to accelerate the convergence, as well as operator-weighted methods [10, 26, 33] that generalise pCN methods using (potentially location-dependent) operators to adapt to the geometry of the posterior.

Discretisation also arises in the numerical solution of the forward model, e.g. finite-element discretisation of PDEs. As many degrees of freedom are needed, it can be computationally demanding to accurately resolve the forward model, which is required to simulate the posterior density. A natural way to reduce the computational cost is to utilise a hierarchy of forward models related to a sequence of grid discretisations, ranging from computationally cheaper and less accurate coarse models to more costly but more accurate fine models. Corresponding to this hierarchy of models, the parameters can also be represented by a sequence of discretised functions with increasing dimensions. This yields a hierarchy of posterior distributions. By allocating different numbers of MCMC simulations to sample posteriors across different levels and by combining all those sample-based posterior estimations using a telescoping sum [15], the multilevel MCMC [14, 22] provides accelerated and unbiased estimates of posterior expectations.

We present a non-trivial integration of the dimension-independent likelihood-informed (DILI) MCMC [10] and the multilevel MCMC in [14] to explore the hierarchy of posterior distributions. This integration offers several advantages: First, DILI-MCMC employs an intrinsic likelihood-informed subspace (LIS) [11]—which involves a number of forward and adjoint model simulations—to design accelerated operator-weighted proposals. By exploiting the multilevel structure of the discretised parameters and discretised forward models, we design a Rayleigh-Ritz procedure to significantly reduce the computational effort in building a hierarchical LIS and operating with DILI proposals. Second, the resulting DILI-MCMC can drastically improve the sampling efficiency of MCMC at each level, and hence reduce the integration error of multilevel Monte Carlo for a fixed CPU time budget. Numerical results confirm the improved computational efficiency of the proposed multilevel DILI approach.

We note that the DILI proposal has been used before in the multilevel sequential Monte Carlo (SMC) setting [2], but in a very different way. We use derivative information of the likelihood to recursively construct the LIS via matrix–free eigenvalue solves, whereas [2] uses multilevel SMC to estimate the full-rank empirical posterior covariance matrix and then builds the LIS from this posterior covariance matrix. Moreover, we construct DILI proposals by exploiting the structure of the hierarchical LIS to couple Markov chains across levels, whereas [2] employs the original DILI proposal in the mutation step of SMC to improve mixing.

The paper is structured as follows. Section 2 introduces the framework of Bayesian inverse problems and MCMC sampling while section 3 discusses the general framework of multilevel MCMC. The Rayleigh–Ritz procedure for the recursive construction of the hierarchical LIS is presented in section 4. The coupled DILI proposals that can exploit the hierarchical LIS are introduced in section 5. Section 6 provides numerical experiments to demonstrate the efficacy of the resulting MLDILI method, while finally, in section 7, we provide some concluding remarks.

2. Background

In this section, we review the Bayesian formulation of inverse problems, the DILI MCMC approach, posterior discretisation, as well as the bias-variance decomposition for MCMC algorithms.

2.1. Bayesian inference framework

Suppose the parameter of interest is some function u in a separable Hilbert space $\mathcal{H}(\Omega)$ defined over a given bounded domain $\Omega \subset \mathbb{R}^d$ . We introduce a prior probability measure µ₀ to represent the a priori information about the function u. The inner product on $\mathcal{H}$ is denoted by $\langle \cdot \, , \cdot \rangle_\mathcal{H}$ , with associated norm denoted by $\| \cdot \|_\mathcal{H}$ . For brevity, where misinterpretation is not possible, we will drop the subscript $\mathcal{H}$ . We assume that the prior measure is Gaussian with mean $m_0 \in \mathcal{H}$ and a self-adjoint, positive definite covariance operator $\Gamma_{\text{pr}}$ that is trace-class, so that the prior provides a full probability measure on $\mathcal{H}$ .

Given observed data ${\boldsymbol{y}}\in\mathbb{R}^d$ and the forward model $F:\mathcal{H} \to \mathbb{R}^d$ , we define the likelihood function $\mathcal{L}({\boldsymbol{y}} | u)$ of y given u. Denoting the posterior probability measure by µ_y, the posterior distribution on any infinitesimal volume $du \subseteq \mathcal{H}$ is given by

$\begin{align} \mu_{y}\left(\textrm{d}u\right)\propto \mathcal{L}\left({\boldsymbol{y}}|u\right)\mu_0\left(\textrm{d}u\right)\,. \end{align} \tag{ 1 }$

Making the simplifying assumption that the observational noise is additive and Gaussian with zero mean and covariance matrix $\Gamma_{\text{obs}}$ , the observation model has the form

$\begin{align} {\boldsymbol{y}} = F\left(u\right)+{\boldsymbol{e}},\quad {\boldsymbol{e}}\sim\mathcal{N}\left(0,\Gamma_{\text{obs}}\right)\,, \end{align} \tag{ 2 }$

and it follows immediately that the likelihood function satisfies

$\begin{align} \mathcal{L}\left({\boldsymbol{y}}|u\right) \propto\exp\left(-\eta\left(u;{\boldsymbol{y}}\right)\right)\,, \end{align} \tag{ 3 }$

where $\eta({\boldsymbol{y}};u)$ is the data-misfit functional defined by

$\begin{align} \eta\left(u; {\boldsymbol{y}}\right)\equiv\frac{1}{2} \left({\boldsymbol{y}}-F\left(u\right)\right)^\top \Gamma_{\text{obs}}^{-1} \left({\boldsymbol{y}}-F\left(u\right)\right)\,. \end{align} \tag{ 4 }$

Assumption 2.1. We assume that the forward model $F: \mathcal{H} \rightarrow \mathbb{R}^d$ satisfies:

(i)
For all ε > 0, there exists a constant $K(\varepsilon) \gt 0$ such that, for all $u \in \mathcal{H}$ ,
$\begin{align*} | F\left(u\right) | \unicode{x2A7D} \exp\left( K\left(\varepsilon\right) + \varepsilon\|u\|_\mathcal{H}^2 \right) . \end{align*}$
(ii)
For any $u \in \mathcal{H}$ , there exists a bounded linear operator $\mathit{J}(u): \mathcal{H} \rightarrow \mathbb{R}^d$ such that
$\begin{align*} \lim_{\delta u \rightarrow 0} \frac{|F\left(u+\delta u\right) - F\left(u\right) - \mathrm{J}\left(u\right) \delta u|}{\|\delta u\|_\mathcal{H}} = 0, \quad \forall \delta u \in \mathcal{H}. \end{align*}$
In particular, this also implies the Lipschitz continuity of F.

Given observations ${\boldsymbol{y}} \in \mathbb{R}^d$ and a forward model that satisfies assumption 2.1, [36] shows that the resulting data-misfit function is sufficiently bounded and locally Lipschitz, and thus the posterior measure is dominated by the prior measure. The second condition states that the forward model is first-order Fréchet differentiable, and hence the Gauss–Newton approximation of the Hessian of the data-misfit functional is bounded.

Suppose we have some QoI that is a functional of the parameter u denoted by $Q:\mathcal{H}\to\mathbb{R}^{q}$ , e.g. flow rate. Then, posterior-based model predictions can be formulated as expectations of that QoI over the posterior. We will denote them by

$\begin{align*} \mathbb{E}_{\mu_y}\left[ Q \right] \equiv \mathbb{E}_{U\sim \mu_y} \left[ Q\left(U\right) \right]. \end{align*}$

MCMC methods construct a Markov chain of correlated random variables $U^{(1)}, \ldots, U^{(N)}$ for which the posterior is the invariant distribution. Then, one can estimate expected QoI(s) using Monte Carlo integration:

$\begin{align} \mathbb{E}_{\mu_y}\left[ Q \right] \approx \frac{1}{N} {\textstyle \sum_{i = 1}^{N}} Q\left(U^{\left(i\right)}\right). \end{align} \tag{ 5 }$

2.2. Dimension-independent likelihood-informed MCMC on function space

The Metropolis–Hastings (MH) algorithm [20, 29] provides a general framework to design transition kernels that have the posterior as their invariant distribution to generate a Markov chain of random variables that targets the posterior.

Definition 2.2 (Metropolis-Hastings Kernel). Given the current state $U^{(k)} = u^\ast$ , a candidate state uʹ can be drawn from a proposal distribution $q(u^\ast,\cdot)$ . We define a pair of probability measures

$\begin{align} \nu\left(\textrm{d}u^\ast,\textrm{d}u^{{\prime}}\right) & = q\left(u^\ast,\textrm{d}u^{{\prime}}\right) \mu_y\left(\textrm{d}u^\ast\right) \nonumber\\ \nu^\bot\left(\textrm{d}u^\ast,\textrm{d}u^{{\prime}}\right) & = q\left(u^{{\prime}},\textrm{d}u^\ast\right) \mu_y\left(\textrm{d}u^{{\prime}}\right). \end{align} \tag{ 6 }$

Then, the next state of the Markov chain is set to $U^{(k+1)} = u^{{\prime}}$ with probability

$\begin{align} \alpha\left(u^\ast,u^{{\prime}}\right) = \min \left\{1, \frac{\textrm{d}\nu^\bot}{\textrm{d}\nu}\left(u^\ast,u^{{\prime}}\right) \right\} , \end{align} \tag{ 7 }$

and to $U^{(k+1)} = u^\ast$ otherwise.

MH algorithms require the absolute continuity condition $\nu^\bot \ll \nu$ to define a valid transition kernel with non-zero acceptance probability as the dimension goes to infinity [39]. We will refer to a MH algorithm as well-defined or dimension-independent if this absolute continuity condition holds. For probability measures over function spaces in the setting considered here, the sequence of papers [4, 8, 18, 19, 36] provide a viable way to construct well-defined MH algorithms using a pCN discretisation of a particular Langevin SDE. The pCN proposal has the form

$\begin{align} u^{{\prime}} = a \left(u^\ast - m_0\right) + m_0 - \gamma \left(1-a\right) \Gamma_{\mathrm{pr}} \nabla_{u}\eta\left(u^\ast;{\boldsymbol{y}}\right) + \sqrt{1 - a^2} \Gamma_{\mathrm{pr}}^{\frac12} \xi, \end{align} \tag{ 8 }$

where $\xi \sim \mathcal{N}(0, \mathrm{I})$ and $\gamma \in \{0, 1\}$ is a tuning parameter to switch between Langevin (γ = 1) and Ornstein–Uhlenbeck proposal (γ = 0). It is required that $a \in (-1,1)$ . The pCN proposal (8) satisfies the desired absolute continuity condition and the acceptance probability does not go to zero as the discretisation of u is refined. In addition, [18, 33] establish that, under mild conditions, the spectral gaps of the MCMC transition kernels defined by (generalised) pCN proposals do exist, and that they are independent of the dimension of the discretised parameters. Thus, the statistical efficiency of pCN proposals is also dimension-independent.

The pCN proposal (8) scales uniformly in all directions with respect to the norm induced by the prior covariance. Since the posterior necessarily contracts the prior along parameter directions that are informed by the likelihood, the Markov chain produced by the standard pCN proposal decorrelates more quickly in the likelihood-informed parameter subspace than in the orthogonal complement, which is prior-dominated [10, 26]. Thus, proposed moves of pCN can be effectively too small in prior-dominated directions, resulting in poor mixing.

The DILI MCMC [10] provides a systematic way to design proposals that adapt to the anisotropic structure of the posterior while retaining dimension-independent performance. It considers operator-weighted proposals in the form of

$\begin{align} u^{{\prime}} = \Gamma_{\mathrm{pr}}^{\frac12}\mathit{A}\Gamma_{\mathrm{pr}}^{-\frac12} \left(u^\ast-m_0\right) + m_0 - \Gamma_{\mathrm{pr}}^{\frac12} \mathit{G} \Gamma_{\mathrm{pr}}^{\frac12} \nabla_{u}\eta\left(u^\ast;{\boldsymbol{y}}\right) + \Gamma_{\mathrm{pr}}^{\frac12}\mathit{B} \xi, \end{align} \tag{ 9 }$

where $\mathit{A}$ , $\mathit{B}$ , and $\mathit{G}$ are bounded, self-adjoint operators on $\textrm{Im}(\Gamma_{\mathrm{pr}}^{-1/2})$ that satisfy certain properties to be discussed below. In this paper, we set $\mathit{G}$ to zero throughout and thus consider only non-Langevin type proposals. By applying a whitening transform

$\begin{align} v = \Gamma_{\mathrm{pr}}^{-\frac12} \left(u - m_0\right) \end{align} \tag{ 10 }$

to the parameter u and by denoting (in a slight abuse of notation) the associated data-misfit functional again by $\eta(v;y) \leftarrow \eta(\Gamma_{\mathrm{pr}}^{1/2} v + m_0 ; y)$ , the proposal (9) simplifies to

$\begin{align} v^{{\prime}} = \mathit{A} v^\ast + \mathit{B} \xi . \end{align} \tag{ 11 }$

The following theorem provides sufficient conditions for constructing the operators $\mathit{A}$ and $\mathit{B}$ such that the proposal (11) yields a well-defined MH algorithm, as well as a formula for the acceptance probability.

Theorem 2.3. Suppose that the posterior measure µ_y is equivalent to the prior measure µ₀ and that the self-adjoint operators $\mathit{A}$ and $\mathit{B}$ commute, that is, they can be defined by a common set of eigenfunctions $\{\psi_i \in \mathrm{Im}(\Gamma_{\mathrm{pr}}^{-1/2}) : i \in \mathbb{N}\}$ with corresponding eigenvalues $\{a_i\}_{i = 1}^{\infty}$ and $\{b_i\}_{i = 1}^{\infty}$ , respectively. Suppose further that

$\begin{align*} \left\{a_i\right\}_{i = 1}^{\infty},\ \left\{b_i\right\}_{i = 1}^{\infty} \subset \mathbb{R} \backslash \left\{0\right\} \quad \text{and} \quad \sum_{i = 1}^\infty \left( a_i^2 + b_i^2 - 1 \right)^2 < \infty. \end{align*}$

Then, the proposal (11) delivers a well-defined MCMC algorithm and the acceptance probability is given by

$\begin{align*} \alpha\left(v^\ast, v^{{\prime}}\right) = \min \left\{1, \frac{\exp\left({- \eta\left(v^{{\prime}\,}; y\right) - \frac{1}{2} \langle v^{{\prime}\,}, \mathit{B}^{-2} \left(\mathit{A}^2 + \mathit{B}^2 - \mathit{I}^{}\right) v^{{\prime}\,} \rangle} \right)}{\exp\left({- \eta\left(v^\ast; y\right) - \frac{1}{2} \langle v^\ast, \mathit{B}^{-2} \left(\mathit{A}^2 + \mathit{B}^2 - \mathit{I}^{}\right) v^\ast \rangle} \right)} \right\}. \end{align*}$

Proof. The above assumptions are simplified versions of those in theorem 3.1 of [10]. The acceptance probability directly follows from corollary 3.5 of [10]. □

The DILI proposal (11) enables different scalings in the proposal moves along different parameter directions. By choosing appropriate eigenfunctions $\{\psi_i\}_{i = 1}^{\infty}$ and eigenvalues $\{a_i, b_i\}_{i = 1}^{\infty}$ , it can capture the geometry of the posterior, and thus can potentially improve the mixing of the resulting Markov chain.

The LIS [11, 12] provides a viable way to construct such operators $\mathit{A}$ and $\mathit{B}$ . It is spanned by the leading eigenfunctions of the eigenvalue problem

$\begin{align} \mathbb{E}_{V \sim \mu^\ast} \left[ \mathit{H}\left(V\right) \right] \psi_i = \lambda_i \psi_i, \end{align} \tag{ 12 }$

where $\mathit{H}(v)$ is some information metric of the likelihood function (with respect to the transformed parameter v), for example, the Hessian of the data-misfit functional or the Fisher information, and $\mu^\ast$ is some reference measure, for example, the posterior or the Laplace approximation of the posterior. In the LIS, spanned by $\{\psi_1, \ldots, \psi_r\}$ , the posterior may significantly differ from the prior. Thus, we prescribe inhomogeneous eigenvalues $\{a_i\}_{i = 1}^r$ and $\{b_i\}_{i = 1}^r$ to ensure that the proposal follows the possibly relatively tight geometry of the posterior. In the complement of the LIS, where the posterior does not differ significantly from the prior, we can use the original pCN proposal and set $\{a_i\}_{i\gt r}$ and $\{b_i\}_{i \gt r}$ to some constant values $a_\perp$ and $b_\perp$ , respectively. Further details on the computation of the LIS basis and the choice of eigenvalues will be discussed in the multilevel context in later sections.

2.3. Posterior discretisation and bias-variance decomposition

When the forward model involves a partial/ordinary differential Equation and the parameter is defined as a spatial/temporal stochastic process, it is necessary in practice to discretise the parameter and the forward model using appropriate numerical methods.

A common way to discretise the parameter is the Karhunen–Loéve expansion, which also serves the purpose of the whitening transform. Given the prior mean $m_0(x)$ and the prior covariance $\Gamma_{\mathrm{pr}}$ , we express the unknown parameter u as the linear combination of the first R eigenfunctions $\{\phi_1, \ldots, \phi_{R}\}$ of the eigenvalue problem $\Gamma_{\mathrm{pr}} \phi_j = \omega_j \phi_j$ , such that

$\begin{align} \textstyle u_R\left(x\right) = m_0\left(x\right) + \sum_{j = 1}^R \sqrt{\omega_j} \,\phi_j\left(x\right)\, v_j, \quad x \in \Omega. \end{align} \tag{ 13 }$

The discretised prior $p_R(\boldsymbol v)$ associated with the random coefficients $\boldsymbol v = [v_1, \ldots, v_R]^\top$ is Gaussian with zero mean and covariance equal to the R × R identity matrix $\mathit{I}_R$ . In this context, the selection of the truncation dimension R is typically based on the rate of decay of the eigenvalues ω_j, for example such that $\sum_{j = 1}^R \omega_j / \sum_{j = 1}^\infty \omega_j \unicode{x2A7E} \tau$ with a threshold $\tau \in (0,1)$ close to one. In this way, the truncated representation encapsulates a specific percentage of the total prior variance.

We discretise the forward model using a numerical method, such as finite elements or finite differences, with M degrees of freedom, which yields a discretised forward model $F_{R,M}$ mapping from the discretised coefficients $\boldsymbol v$ to the observables. In this way, the posterior measure (1) can be discretised, leading to the finite-dimensional density

$\begin{align} \pi_{R,M}\left(\boldsymbol v | {\boldsymbol{y}}\right) \propto \exp\left( -\eta_{R,M}\left(\boldsymbol v; {\boldsymbol{y}}\right) \right)\, p_R\left(\boldsymbol v\right), \end{align} \tag{ 14 }$

where

$\begin{align*} \eta_{R,M}\left(\boldsymbol v; {\boldsymbol{y}}\right) = \frac{1}{2} \left({\boldsymbol{y}}-F_{R,M}\left(\boldsymbol v\right)\right)^\top \Gamma_{\text{obs}}^{-1} \left({\boldsymbol{y}}-F_{R,M}\left(\boldsymbol v\right)\right) \end{align*}$

is the discretised data-misfit function. Correspondingly, we also define the discretised QoI $Q_{R,M}(\boldsymbol v)$ , which maps the discretise coefficient vector $\boldsymbol v$ to the discretised QoI.

The discretised parameters and forward models can be indexed by the discretisation level. We consider a hierarchy of L + 1 levels of discretised parameter spaces with dimensions $R_0 \!\unicode{x2A7D}\! R_1 \!\unicode{x2A7D}\! \ldots \!\unicode{x2A7D}\! R_L$ and a hierarchy of discretised forward models with $M_0 \!\unicode{x2A7D}\! M_1 \!\unicode{x2A7D}\! \ldots\!\unicode{x2A7D}\! M_L$ degrees of freedom. Discretised parameter, forward model and QoI on level $\ell$ are denoted by

$\begin{align*} \boldsymbol v_\ell = \left[v_1, \ldots, v_{R_\ell}\right]^\top, \quad F_\ell\left(\boldsymbol v_\ell\right) \equiv F_{R_\ell,M_\ell}\left(\boldsymbol v_\ell\right) \ \ \text{and} \ \ Q_\ell\left(\boldsymbol v_\ell\right) \equiv Q_{R_\ell,M_\ell}\left(\boldsymbol v_\ell\right), \end{align*}$

respectively. Thus, the discretised data-misfit function, prior and posterior on level $\ell$ are

$\begin{align} \eta_\ell\left(\boldsymbol v_\ell; {\boldsymbol{y}}\right) \!\equiv\! \eta_{R_\ell,M_\ell}\left(\boldsymbol v_\ell; {\boldsymbol{y}}\right), \;\; p_\ell\left(\boldsymbol v_\ell\right) \!\equiv\! p_{R_\ell}\left(\boldsymbol v_\ell\right) , \; \textrm{and} \;\; \pi_\ell\left(\boldsymbol v_\ell | {\boldsymbol{y}}\right) \!\equiv\! \pi_{R_\ell,M_\ell}\left(\boldsymbol v_\ell | {\boldsymbol{y}}\right), \end{align} \tag{ 15 }$

respectively, with the associated posterior expectation $\mathbb{E}_{\pi_\ell}\big[Q_\ell\big] \equiv \mathbb{E}_{ {\boldsymbol V}_{\!\!\ell} \sim \pi_\ell} \big[ Q_\ell^{}( {\boldsymbol V}_{\!\!\ell} ^{}) \big]$ .

Assumption 2.4.

(i)
The bias of the posterior expectation on level $\ell$ can be bounded in terms of the number of degrees of freedom of the forward model such that
$\begin{align} \big|\mathbb{E}_{\mu_y}\left[Q\right] - \mathbb{E}_{\pi_\ell}\left[Q_\ell\right] \big| = \mathcal{O}\left(M_\ell^{-\vartheta_\textrm{b}}\right), \end{align} \tag{ 16 }$
for some constant $\vartheta_\textrm{b} \gt0$ .
(ii)
For the computational cost of carrying out one step of MCMC (including a forward model simulation) it is assumed that there exists a constant $\vartheta_\textrm{c} \gt 0$ such that
$\begin{align} C_\ell = \mathcal{O}\left(M_\ell^{\vartheta_\textrm{c}}\right). \end{align} \tag{ 17 }$

Implicitly, condition (i) in assumption 2.4 also assumes that $R_\ell$ is sufficiently large such that on level $\ell$ the bias due to parameter approximation is dominated by the error due to the forward model approximation. This condition can be verified for certain classes of model problems. For instance, for finite element methods applied to elliptic PDEs (which is the model problem used in the numerical experiments of this work), the convergence analysis in [14, section 4.2] shows that the discretisation error satisfies that $|\mathbb{E}_{\mu_y}[Q] - \mathbb{E}_{\pi_{\ell}}[Q_{\ell}]| = \mathcal{O}(M_\ell^{-\vartheta_\textrm{b}} + R_{\ell}^{-\vartheta_\textrm{b}^{{\prime}}})$ for some constants $\vartheta_\textrm{b}, \vartheta_\textrm{b}^{{\prime}}\gt0$ . Thus, by choosing $R_\ell = M_\ell^{\vartheta_\textrm{b}/\vartheta_\textrm{b}^{{\prime}}}$ , the two error contributions are balanced. The constant ϑ_c in Condition (ii) of assumption 2.4 depends on the underlying linear solver and/or numerical integrator, so that a theoretical upper bound on ϑ_c is often known.

Consider discretisation level L and let $\{\boldsymbol V_{\!\!L}^{(j)}\}_{j = 1}^{N_{\text{MC}}}$ be a Markov chain produced by a MCMC algorithm converging in distribution to π_L. An estimate for the expectation $\mathbb{E}_{\pi_L}\big[Q_L\big]$ is

$\begin{align} Y^{\,\text{MC}} \equiv \frac{1}{N_{\text{MC}}} {\textstyle \sum_{j = 1}^{N_{\text{MC}}} } Q_L^{} \left(\boldsymbol V_{\!\!L}^{\left(j\right)}\right) \ \approx \ \mathbb{E}_{\pi_L}\left[Q_L\right]\,. \end{align} \tag{ 18 }$

The focus of this work is the asymptotic performance of algorithms, and hence the initialization bias of MCMC and the computational cost due to burn-in are not discussed. The mean-squared-error (MSE) of the Monte Carlo estimator (18) allows a bias-variance decomposition of the form

$\begin{align} \text{MSE}\left(Y^{\,\text{MC}}\right) = \underbrace{\Big| \mathbb{E}_{\mu_y}\left[Q\right] - \mathbb{E}_{\pi_L}\left[Q_L\right] \Big|^2}_\textrm{Square of Bias} + \underbrace{\mathrm{Var}_{\pi_L}\left(Q_L\right) \Big/ N^{\,\text{eff}}_{\text{MC}}}_{\mathrm{Var}\left(Y^{\,\text{MC}}\right)}, \end{align} \tag{ 19 }$

where $N^{\,\text{eff}}_{\text{MC}}$ is the effective sample size of the Markov chain $\{\boldsymbol V_{\!\!L}^{(j)}\}_{j = 1}^{N_{\text{MC}}}$ . This effective sample size is proportional to the total sample size, i.e. $N_{\text{MC}}^{\,\text{eff}} = N_{\text{MC}}/ \tau_{\text{MC}}$ , where $\tau_{\text{MC}} \unicode{x2A7E} 1$ is the integrated autocorrelation time (IACT) of the Markov chain. The work of [18] shows that for pCN-type algorithms the IACT $\tau_{\text{MC}}$ is dimension-independent given a local Lipschitz assumption, which is often satisfied for inverse problems governed by elliptic PDEs.

Choosing $N_{\text{MC}}$ such that the two terms in (19) of the MCMC estimator are balanced and using assumption 2.4, the total computational cost to achieve $\text{MSE}(Y^{\,\text{MC}}) \lt \varepsilon^2$ is

$\begin{align} C^{\text{MC}} = \mathcal{O}\left(\tau_{\text{MC}}\,\varepsilon^{-2-\vartheta_\textrm{c}/\vartheta_\textrm{b} }\right)\,. \end{align} \tag{ 20 }$

Thus, one of the key aims in accelerating MCMC sampling is to reduce $\tau_{\text{MC}}$ , which can be achieved, e.g. via DILI MCMC proposals. In addition, the multilevel method will allow us to improve the asymptotic rate of growth of the cost of the standard MCMC estimator in (20) with respect to ε, as well as to further reduce $\tau_{\text{MC}}$ on the higher levels. These two things are achieved in multilevel MCMC, by using coarse level samples as proposals on the higher levels and by dealing with the high numerical correlation between subsequent MCMC samples produced by standard proposal mechanisms on the coarsest level (level zero). Thus, most samples are drawn on the computationally least costly level zero, as well as shifting most of the work for removing the initialization bias to level zero, all contributing to the practical advatages of the multilevel algorithms compared to their single-level counterparts.

3. Multilevel MCMC

By exploiting the hierarchy of posteriors, the rate of the computational cost in (20) can be reduced significantly using the multilevel idea in [14]. We expand the posterior expectation in the telescoping sum

$\begin{align} \textstyle \mathbb{E}_{\pi_L}\left[Q_L^{}\right] = \mathbb{E}_{\pi_0}\left[Q_0^{}\right] + \sum_{\ell = 1}^L\left(\mathbb{E}_{\pi_\ell}\left[Q_\ell^{}\right]-\mathbb{E}_{\pi_{\ell\text{-}1}}\left[Q_{\ell\text{-}1}^{}\right]\right)\,. \end{align} \tag{ 21 }$

For level zero, the sample set $\{\! \boldsymbol V_{\!\!0}^{(0,\,j)} \!\}_{j = 1}^{N_0}$ is assumed to be drawn via some MCMC method that converges to $\pi_0(\,\cdot\,|{\boldsymbol{y}})$ and the first term in the telescoping sum (21) is estimated via

$\begin{align*} Y_0 \equiv \frac{1}{N_0}{\textstyle\sum_{j = 1}^{N_0}} D_0^{\left(j\right)} \ \approx \ \mathbb{E}_{\pi_0}\left[Q_0\right], \quad \text{where} \quad D_0^{\left(j\right)} = Q^{}_0\left(\boldsymbol V_{\!\!0}^{\left(0,\,j\right)}\right). \end{align*}$

Since the two expectations in the difference $\mathbb{E}_{\pi_\ell}[Q_\ell^{}]-\mathbb{E}_{\pi_{\ell\text{-}1}}[Q_{\ell\text{-}1}^{}]$ are with respect to different discretisations of the posterior, special treatment is required for $\ell \gt 0$ . Let $\Delta_{\ell, \ell\text{-}1}(\boldsymbol v_{\ell}, \boldsymbol v_{\ell\text{-}1})$ be the joint density of $\boldsymbol v_{\ell}$ and $\boldsymbol v_{\ell\text{-}1}$ such that

$\begin{align} \quad \int \!\!\Delta_{\ell, \ell\text{-}1}\left(\boldsymbol v_{\ell}, \boldsymbol v_{\ell\text{-}1}\right) \,\textrm{d}\boldsymbol v_{\ell\text{-}1} \! = \! \pi_\ell\left(\boldsymbol v_{\ell}| {\boldsymbol{y}}\right) \ \ \textrm{and } \ \int\!\! \Delta_{\ell, \ell\text{-}1}\left(\boldsymbol v_{\ell}, \boldsymbol v_{\ell\text{-}1}\right) \, \textrm{d} \boldsymbol v_{\ell} \! = \! \pi_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}| {\boldsymbol{y}}\right), \end{align} \tag{ 22 }$

that is, the posteriors $\pi_\ell(\boldsymbol v_{\ell}| {\boldsymbol{y}})$ and $\pi_{\ell\text{-}1}(\boldsymbol v_{\ell\text{-}1}| {\boldsymbol{y}})$ are the two marginals. Then, the difference between expectations can be expressed as

$\begin{align} \mathbb{E}_{\pi_\ell}\left[Q^{}_\ell\right]-\mathbb{E}_{\pi_{\ell\text{-}1}}\left[Q^{}_{\ell\text{-}1}\right] = \mathbb{E}_{\Delta_{\ell, \ell\text{-}1}} \left[ D_\ell \right] , \quad \textrm{where} \quad D_\ell = Q_\ell\left( {\boldsymbol V}_{\!\!\ell} \right) - Q_{\ell\text{-}1}\left( {\boldsymbol V}_{\!\!\ell\text{-}1} \right) \end{align} \tag{ 23 }$

and $( {\boldsymbol V}_{\!\!\ell} , {\boldsymbol V}_{\!\!\ell\text{-}1} ) \sim \Delta_{\ell, \ell\text{-}1}(\cdot, \cdot)$ . The construction of the joint density and the associated sampling procedure will be critical to reduce the computational complexity.

Suppose the samples $\big\{\big( {\boldsymbol V}_{\!\!\ell} ^{(\ell, j)}, {\boldsymbol V}_{\!\!\ell\text{-}1} ^{(\ell, j)}\big)\big\}_{j = 1}^{N_\ell}$ form a Markov chain that converges in distribution to $\Delta_{\ell, \ell\text{-}1}(\cdot, \cdot)$ and

$\begin{align*} D_\ell^{\left(j\right)} = Q^{}_\ell\left( {\boldsymbol V}_{\!\!\ell} ^{\left(\ell,j\right)}\right)-Q_{\ell\text{-}1}^{}\left( {\boldsymbol V}_{\!\!\ell\text{-}1} ^{\left(\ell,j\right)}\right). \end{align*}$

Then, the remaining terms in (21), for $\ell = 1, \ldots,L$ , are estimated by

$\begin{align*} Y_\ell \equiv \frac{1}{N_\ell}{\textstyle\sum_{j = 1}^{N_\ell}} D_\ell^{\left(j\right)} \ \approx \ \mathbb{E}_{\pi_\ell}\left[Q^{}_\ell\right]-\mathbb{E}_{\pi_{\ell\text{-}1}}\left[Q^{}_{\ell\text{-}1}\right] \end{align*}$

and the multilevel MCMC estimator for $\mathbb{E}_{\pi_L}[Q_L]$ is defined by

$\begin{align} \quad\;\; \textstyle \mathbb{E}_{\pi_L} \left[ Q_L^{} \right] \approx Y^{\,\text{ML}} &\equiv {\textstyle\sum_{\ell = 0}^L} Y_\ell\,, \end{align} \tag{ 24 }$

The mean square error of this estimator can again be decomposed as follows:

$\begin{align} \text{MSE}\left(Y^{\,\text{ML}}\right) & \equiv \underbrace{\big|\mathbb{E}_{\mu_y}\left[Q\right] - \mathbb{E}_{\pi_{L}}\left[Q_L\right]\big|^2}_\textrm{Square of Bias} + \underbrace{\textstyle \sum_{\ell = 0}^L \left( \mathrm{Var}\left(Y_\ell\right) + \, \sum_{k \neq \ell}^L \mathrm{Cov}\left( Y_\ell, Y_k \right) \right)}_{\mathrm{Var}\left(Y^{\,\text{ML}}\right)}. \end{align} \tag{ 25 }$

3.1. Variance management

For optimal efficiency, we now choose the numbers of samples $N_\ell$ , $\ell = 0,\ldots,N$ , such as to minimise $\mathrm{Var}(Y^{\,\text{ML}})$ for fixed computational effort. This includes the within-level variance $\mathrm{Var}(Y_\ell)$ and the cross-level variance $\mathrm{Cov}( Y_\ell, Y_k )$ for $k \neq \ell$ . We will provide justifications on managing these variances using the following assumptions.

Remark 3.1. Suppose the effective sample sizes are proportional to the total sample sizes, i.e. $N_\ell^{\,\text{eff}} = N_\ell \big/ \tau_{\ell}$ , for all $\ell$ , where $\tau_{\ell} \unicode{x2A7E} 1$ is the IACT of the Markov chain $D_\ell^{(j)}$ . Then, the within-level variance has the form

$\begin{align} \mathrm{Var}\left(Y_\ell\right) = \frac{1}{N_\ell^{\,\text{eff}}} \mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}\left(D_\ell\right) = \frac{\tau_{\ell}}{N_\ell} \mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}\left(D_\ell\right) , \end{align} \tag{ 26 }$

where we set $\mathrm{Var}_{\Delta_{0,-1}}(D_0) = \mathrm{Var}_{\pi_0}(Q_0)$ and have

$\begin{align*} \mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}\left(D_\ell\right) = \mathrm{Var}_{\pi_{\ell}}\left(Q_\ell\right) + \mathrm{Var}_{\pi_{\ell\text{-}1}}\left(Q_{\ell\text{-}1}\right) - 2\textrm{Cov}_{\Delta_{\ell,\ell\text{-}1}}\left(Q_\ell, Q_{\ell\text{-}1}\right) \unicode{x2A7E} 0, \;\; \forall \ell > 0, \end{align*}$

by the Cauchy–Schwarz inequality. Thus, to reduce $\mathrm{Var}(Y_\ell)$ , the joint density should be constructed in such a way that $\textrm{Cov}_{\Delta_{\ell,\ell\text{-}1}}(Q_\ell, Q_{\ell\text{-}1})$ is positive and (if possible) maximised. In addition, the MCMC simulation should be made statistically efficient in the sense that $\tau_{\ell}$ is as close to one as possible.

Assumption 3.2. The variance $\mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}(D_\ell)$ converges to zero as $M_\ell \rightarrow \infty$ and

$\begin{align} \mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}\left(D_\ell\right) = \mathcal{O}\left(M_\ell^{-\vartheta_\textrm{v} }\right)\,, \end{align} \tag{ 27 }$

for some constant $\vartheta_\textrm{v} \gt 0$ .

Proposition 3.3. Suppose that there exists an r < 1 such that

$\begin{align} \frac{\mathrm{Cov}\left( Y_\ell, Y_k \right)}{\max\left\{\mathrm{Var}\left(Y_\ell\right), \mathrm{Var}\left(Y_k\right) \right\}} \unicode{x2A7D} r^{|k-l|}, \quad \text{for all} \ \ k \neq \ell, \end{align} \tag{ 28 }$

i.e. the cross-level covariance is insignificant compared to the within-level variance. Then

$\begin{align} \mathrm{Var}\left(Y^{\,\text{ML}}\right) = \sum_{\ell = 0}^L \left( \mathrm{Var}\left(Y_\ell\right) + \, \sum_{k \neq \ell}^L \mathrm{Cov}\left( Y_\ell, Y_k \right) \right) \unicode{x2A7D} \frac{1+r}{1-r} \sum_{\ell = 0}^L \mathrm{Var}\left(Y_\ell\right) \,. \end{align} \tag{ 29 }$

Proof. Without loss of generality, we can assume the variances $\{\mathrm{Var}(Y_\ell) \}_{\ell = 0}^L$ are ordered as $\mathrm{Var}(Y_\ell) \unicode{x2A7E} \mathrm{Var}(Y_k)$ for $\ell \lt k$ . Then we have the bound

$\begin{align*} \textstyle \mathrm{Var}\left(Y^{\,\text{ML}}\right) & = \sum_{\ell = 0}^L \left( \mathrm{Var}\left(Y_\ell\right) + 2 \,\sum_{k > \ell}^L \mathrm{Cov}\left( Y_\ell, Y_k \right) \right) \\ & \unicode{x2A7D} \sum_{\ell = 0}^L \mathrm{Var}\left(Y_\ell\right) \left( 1 + 2 \, \sum_{k > \ell}^\infty \frac{\mathrm{Cov}\left( Y_\ell, Y_k \right)}{\mathrm{Var}\left(Y_\ell\right)} \right) \\ & \unicode{x2A7D} \sum_{\ell = 0}^L \mathrm{Var}\left(Y_\ell\right) \left( 1 + 2 \, \sum_{k > \ell}^\infty r ^{\left(k-\ell\right)} \right) \ = \ \frac{1+r}{1-r} \sum_{\ell = 0}^L \mathrm{Var}\left(Y_\ell\right). \end{align*}$

□

Using proposition 3.3 and (26), the variance of the multilevel estimator satisfies

$\begin{align*} \mathrm{Var}\left(Y^{\,\text{ML}}\right) = \mathcal{O} \left( {\textstyle \sum_{\ell = 0}^L} \frac{\tau_{\ell} }{N_\ell } \mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}\left(D_\ell\right) \right). \end{align*}$

The total computational cost is $C^{\text{ML}} = \sum_{\ell = 0}^L N_\ell\,C_\ell$ . This way, for a fixed variance, the computational cost is minimised by choosing the sample size

$\begin{align} \textstyle N_\ell \propto \sqrt{\tau_{\ell}\,\mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}\left(D_\ell\right) \big/ C_\ell }, \end{align} \tag{ 30 }$

which leads to a total computational cost that satisfies

$\begin{align} C^{\text{ML}} \propto {\sum_{\ell = 0}^L} \sqrt{\tau_{\ell}\,C_\ell\,\mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}\left(D_\ell\right) }. \end{align} \tag{ 31 }$

Theorem 3.4. Suppose assumptions 2.4, 3.2 and (28) in proposition 3.3 hold. For the multilevel MCMC estimator to satisfy $\text{MSE}(Y^{\,\text{ML}}) \lt \varepsilon^2$ , the multilevel MCMC with $N_\ell$ chosen as in (30) requires an overall computational cost

$\begin{align} C^{\mathrm{ML}} = \,\begin{cases} \mathcal{O}\left(\varepsilon^{-2}\right) &\mbox{if } \vartheta_\textrm{v}>\vartheta_\textrm{c}\\ \mathcal{O}\left(\varepsilon^{-2}|\log\varepsilon|^2\right) &\mbox{if } \vartheta_\textrm{v} = \vartheta_\textrm{c} \\ \mathcal{O}\left(\varepsilon^{-2-\left(\vartheta_\textrm{c} - \vartheta_\textrm{v}\right)/\vartheta_\textrm{b} }\right) &\mbox{if } \vartheta_\textrm{v} < \vartheta_\textrm{c} \end{cases}\,. \end{align} \tag{ 32 }$

Proof. Follows directly from the multilevel Monte Carlo complexity theorems in [7, 15]. □

It is difficult to rigorously verify assumption (28) in proposition 3.3, but it is often observed that the cross-level variances $\mathrm{Cov}( Y_\ell, Y_k )$ rapidly decay to zero in practice, as the Markov chains used for computing $Y_\ell$ and Y_k with $\ell \neq k$ are statistically independent. For example, in [24] independent Markov chains are constructed and in [14] a subsampling strategy of the coarser chains is employed to ensure independence. Nevertheless, the bound on the computational complexity of multilevel MCMC is reduced under assumption (28) compared to that presented in [14], which has an extra $|\log\varepsilon|$ factor. For any positive values of $\vartheta_\textrm{b},\vartheta_\textrm{v},\vartheta_\textrm{c}$ , the multilevel MCMC approach asymptotically requires less computational effort than single-level MCMC. To choose optimal numbers of samples on the various levels, estimates of the IACTs $\tau_\ell$ , the variances $\mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}(D_\ell)$ , and the computational costs $C_\ell$ are needed. Such quantities may not be known a priori, but they can all be obtained and adaptively improved (on the fly) as the simulation progresses.

3.2. Notations

To map vectors and matrices across adjacent levels of discretisation we define the following notation. Given the canonical basis $(\hat {\boldsymbol{e}}_1, \hat{\boldsymbol{e}}_2, \dots, \hat{\boldsymbol{e}}_{R_\ell})$ of the parameter space at level $\ell$ , where $\hat{\boldsymbol{e}}_j \in \mathbb{R}^{R_\ell}$ , we define the basis matrices $\Theta_{\ell, c} \equiv (\hat{\boldsymbol{e}}_1, \hat{\boldsymbol{e}}_2,\dots, \hat{\boldsymbol{e}}_{R_{\ell\text{-}1}})$ and $\Theta_{\ell,\, f} \equiv (\hat {\boldsymbol{e}}_{R_{\ell\text{-}1}+1}, \dots, \hat {\boldsymbol{e}}_{R_\ell})$ , which correspond to the parameter coefficients 'active' at level $\ell\text{-}1$ and the additional coefficients. Here the subscripts c and f denote the coefficients that are 'active' on the coarse level and the coefficients that are 'active' only on the fine level, respectively. We can split the parameter $\boldsymbol v_\ell$ into two components

$\begin{align} \boldsymbol v_\ell = \begin{bmatrix} \boldsymbol v_{\ell, c} \\ \boldsymbol v_{\ell,\, f}\end{bmatrix}, \;\; \textrm{where}\;\; \boldsymbol v_{\ell, c} = \Theta_{\ell, c}^\top \boldsymbol v_\ell \;\;\textrm{and}\;\; \boldsymbol v_{\ell,\, f} = \Theta_{\ell,\, f}^\top \boldsymbol v_\ell, \end{align} \tag{ 33 }$

which correspond to the coefficients on the previous level $\ell\text{-}1$ and the additional coefficients. Given a matrix $\mathit{A}_\ell \in \mathbb{R}^{R_\ell \times R_\ell}$ , we partition the matrix as

$\begin{align} \mathit{A}_\ell = \begin{bmatrix} \mathit{A}_{\ell,cc} &\mathit{A}_{\ell,cf}\\ \mathit{A}_{\ell,fc} &\mathit{A}_{\ell,\, f\,f} \end{bmatrix}, \end{align} \tag{ 34 }$

where $\mathit{A}_{\ell,cc} \equiv \Theta_{\ell, c}^\top \mathit{A}_\ell \Theta_{\ell, c}$ and $\mathit{A}_{\ell,\, f\,f}$ , $\mathit{A}_{\ell,fc}$ and $\mathit{A}_{\ell,cf}$ are defined analogously. The matrices $\Theta_{\ell, c}$ and $\Theta_{\ell,\, f}$ are never constructed explicitly. Operations with those matrices only involve the selection of the corresponding rows or columns of the matrix or vector.

4. Multilevel LIS

We aim to employ the DILI method (cf section 2.2) as the proposal mechanism for multilevel MCMC. Since the computation of the LIS basis used by the DILI proposal can be costly, here we develop a Rayleigh–Ritz procedure to recursively compute new, multilevel LISs using the model hierarchy. The resulting hierarchical LIS basis can be used to generalise DILI proposals to the multilevel setting and to improve the efficiency of multilevel MCMC sampling. In section 4.1, we define the concept of LIS in the multilevel context. In sections 4.2 and 4.3, we present the recursive construction of the multilevel LIS using the Rayleigh–Ritz procedure.

4.1. Setup

For each level $\ell \in \{0, 1, \ldots, L\}$ , we denote the linearisation of the forward model $F_\ell$ at a given parameter $\boldsymbol v_\ell$ by

$\begin{align*} \mathit{J}_\ell\left(\boldsymbol v_\ell\right) = \nabla_{\boldsymbol v_\ell} F_\ell\left(\boldsymbol v_\ell\right). \end{align*}$

This yields the Gauss–Newton approximation of the Hessian of the data-misfit functional at $\boldsymbol v_\ell$ (hereafter referred to as the Gauss–Newton Hessian) in the form of

$\begin{align} \mathit{H}_\ell\left(\boldsymbol v_\ell\right) = \mathit{J}_\ell\left(\boldsymbol v_\ell\right)^\top \Gamma_{\mathrm{obs}}^{-1} \, \mathit{J}_\ell\left(\boldsymbol v_\ell\right). \end{align} \tag{ 35 }$

The Gauss–Newton Hessian in (35) corresponds to the Fisher information matrix of the likelihood with additive Gaussian noise. It is commonly used in statistics to measure the local sensitivity of the parameter-to-likelihood map. The leading eigenvectors of $\mathit{H}_\ell(\boldsymbol v_\ell)$ (corresponding to the largest eigenvalues) indicate parameter directions along which the likelihood function varies rapidly.

However, to extract the global sensitivity of the parameter-to-likelihood map from the local sensitivity information contained in the Gauss–Newton Hessian, it is necessary to compute the expectation of $\mathit{H}_\ell(\boldsymbol v_\ell)$ with respect to some reference distribution $p_\ell^\ast(\boldsymbol v_\ell)$ , i.e.

$\begin{align} \mathbb{E}_{{\boldsymbol V}_{\!\!\ell} ^{} \sim p_\ell^\ast} \left[ \mathit{H}_\ell\left( {\boldsymbol V}_{\!\!\ell} \right) \right]. \end{align} \tag{ 36 }$

Finally, this is approximated using the sample average with $K_\ell$ random samples drawn from the reference distribution, which yields

$\begin{align} \mathbb{E}_{{\boldsymbol V}_{\!\!\ell} ^{} \sim p_\ell^\ast} \left[ \mathit{H}_\ell\left( {\boldsymbol V}_{\!\!\ell} \right) \right] \approx \widehat{\mathit{H}}_\ell \equiv \frac{1}{K_\ell} {\sum_{k = 1}^{K_\ell} } \mathit{H}_\ell\left(\boldsymbol v_\ell^{\left(k\right)}\right), \;\, \textrm{where} \;\; \boldsymbol v_\ell^{\left(k\right)} \sim p_\ell^\ast\left(\cdot\right). \end{align} \tag{ 37 }$

Note that the matrix $\widehat{\mathit{H}}_\ell$ is symmetric and positive semidefinite. Different choices of the reference distribution, such as the prior or the posterior, lead to different ways to construct the LIS and different performance characteristics.

Remark 4.1. Following the discussion in [12, 40], using the posterior as the reference leads to sharp approximation properties [13, 40] compared to other choices. However, the posterior exploration relies on MCMC sampling, and thus this choice requires adaptively estimating LIS during the MCMC sampling. The Laplace approximation to the posterior provides a reasonable alternative in a wide range of problems where the posterior is unimodal. We use the Laplace approximation as the reference distribution in this work.

The choice of the reference distribution can have an impact on the quality of the LIS basis and on the IACT of the Markov chains produced by DILI MCMC, but it does not affect the convergence of MCMC, as DILI samples the full parameter space and only uses the LIS to reduce the IACT and thus to accelerate posterior sampling.

It is often computationally infeasible to explicitly form the Gauss–Newton Hessian matrix (35). However, all we need are matrix-vector-products with the Gauss–Newton Hessian matrix. This requires only applications of the linearised forward model $\mathit{J}_\ell(\boldsymbol v_\ell)$ and its adjoint $\mathit{J}_\ell(\boldsymbol v_\ell)^\top$ , which are well-established operations in the PDE-constraint optimisation literature. We refer the readers to recent applications in Bayesian inverse problems for further details, e.g. [5, 28, 30].

4.2. Base level LIS

At the base level, we use the samples $\{\boldsymbol v_0^{(k)}\}_{k = 1}^{K_0}$ drawn from the reference $p_0^\ast(\cdot)$ to construct the sample-averaged Gauss–Newton Hessian, $\widehat{\mathit{H}}_0$ . Then, we use the Rayleigh quotient $\langle \boldsymbol \phi, \widehat{\mathit{H}}_\ell \, \boldsymbol \phi \rangle \,/\, \langle \boldsymbol \phi, \boldsymbol \phi \rangle$ to measure the (quadratic) change in the parameter-to-likelihood map along a parameter direction φ . Hence, the LIS can be identified via a sequence of optimisation problems of the form

$\begin{align} \boldsymbol{\psi}_{0,k+1} = \textrm{arg max}_{\| \boldsymbol \phi \| = 1} \langle \boldsymbol \phi, \widehat{\mathit{H}}_0 \, \boldsymbol \phi \rangle, \ \textrm{ subject to } \ \langle \boldsymbol \phi, \boldsymbol{\psi}_{0,i} \rangle = 0, \quad \textrm{for} i = 1, \ldots, k, \end{align} \tag{ 38 }$

where $\boldsymbol{\psi}_{0,1}$ is the solution to the unconstrained optimisation problem. The sequence of optimisation problems in (38) is equivalent to finding the leading eigenvectors of $\widehat{\mathit{H}}_0$ .

Definition 4.2 (Base level LIS). Given the sample–averaged Gauss–Newton Hessian $\widehat{\mathit{H}}_0$ on level 0 and a threshold ϱ > 0, we solve the eigenproblem

$\begin{align} \widehat{\mathit{H}}_0 \, \boldsymbol{\psi}_{0,i} = \lambda_{0,i} \boldsymbol{\psi}_{0,i}, \end{align} \tag{ 39 }$

and then use the r₀ leading eigenvectors with eigenvalues $\lambda_{0,i} \gt \varrho$ , for $i = 1, \ldots, r_0$ , to define the LIS basis $\Psi_{0, r_0} = [\boldsymbol{\psi}_{0,1} , \boldsymbol{\psi}_{0,2} \ldots, \boldsymbol{\psi}_{0,r_0} ]$ , which spans an r₀-dimensional subspace in $\mathbb{R}^R_0$ .

The eigenvalues in (39) provide empirical sensitivity measures of the likelihood function relative to the prior (which here is i.i.d. Gaussian) along corresponding eigenvectors [11, 40]. Eigenvectors corresponding to eigenvalues less than 1 can be interpreted as parameter directions where the likelihood is dominated by the prior. Thus, we typically choose a value less than one for the truncation threshold, i.e. ϱ < 1.

4.3. LIS enrichment

Because the computational cost of a matrix vector product with the Gauss–Newton Hessian scales at least linearly with the degrees of freedom $M_\ell$ of the forward model on level $\ell$ , constructing the LIS can be computationally costly. We present a new approach to accelerate the LIS construction by employing a recursive LIS enrichment using the hierarchy of forward models and parameter discretisations. The resulting hierarchy of LISs will be used to reduce the computational complexity of constructing and operating with the resulting DILI proposals.

We reuse the LIS bases computed on the coarser levels by 'lifting' them and then recursively enrich them at each new level using a Rayleigh–Ritz procedure, rather than recomputing the entire basis from scratch on each level. Ideally, the subspace added on each level will have decreasing dimension, as the model and parameter approximations were assumed to converge with $\ell \to \infty$ and thus no longer provide additional information for the parameter inference.

Definition 4.3 (Lifted LIS basis). Suppose we have an orthogonal LIS basis $\Psi_{\ell\text{-}1, r} \in \mathbb{R}^{R_{\ell\text{-}1} \times r_{\ell\text{-}1}}$ on level $\ell\text{-}1$ . We lift $\Psi_{\ell\text{-}1, r}$ from the coarse parameter space $\mathbb{R}^{R_{\ell\text{-}1}}$ to the fine parameter space $\mathbb{R}^{R_\ell}$ using the basis matrix $\Theta_{\ell, c}$ defined in section 3.2. The lifted LIS basis vectors are collected in the matrix

$\begin{align} \Psi_{\ell, c} = \Theta_{\ell, c} \, \Psi_{\ell\text{-}1, r}. \end{align} \tag{ 40 }$

Proposition 4.4. The lifted LIS basis matrix $\Psi_{\ell, c}$ has orthonormal columns that span an $r_{\ell\text{-}1}$ -dimensional subspace in $\mathbb{R}^{R_\ell}$ , i.e. $\Psi_{\ell, c}^\top \Psi_{\ell, c}^{} = \mathit{I}_{r_{\ell\text{-}1}}$ .

Proof. The proof directly follows as the matrix $\Theta_{\ell, c}$ has orthonormal columns. □

Given $K_\ell$ samples $\{\boldsymbol v_\ell^{(k)}\}_{k = 1}^{K_\ell}$ from the reference distribution $p_\ell^\ast(\cdot)$ , let $\widehat{\mathit{H}}_\ell$ be the resulting sample-averaged Gauss–Newton Hessian. To enrich the lifted LIS basis $\Psi_{\ell, c}$ we now identify likelihood-sensitive parameter directions in the null space $\textrm{null}(\Psi_{\ell, c})$ by recursively optimising the Rayleigh quotient in the orthogonal complement of $\textrm{range}(\Psi_{\ell, c})$ , i.e.

$\begin{align} \;\;\;\boldsymbol{\psi}_{\ell,k+1}& = \textrm{arg max}_{\| \boldsymbol \phi \| = 1} \langle \boldsymbol \phi, \widehat{\mathit{H}}_\ell \, \boldsymbol \phi \rangle,\nonumber\\ \textrm{subject to } \Pi_{\ell, c} \phi& = 0 \textrm{ and } \langle \boldsymbol \phi, \boldsymbol{\psi}_{\ell,i} \rangle = 0, \textrm{ for } i = 1, \ldots, k, \nonumber \end{align} \tag{ 41 }$

where $\Pi_{\ell, c} = \Psi_{\ell, c}^{} \Psi_{\ell, c}^\top$ is an orthogonal projector. This optimisation problem can be solved as an eigenvalue problem using the Rayleigh–Ritz procedure [34].

Theorem 4.5. The optimisation problem (41) is equivalent to finding the leading eigenvectors of the projected eigenproblem

$\begin{align} \left(\mathit{I}_{R_\ell} - \Pi_{\ell, c} \right) \, \widehat{\mathit{H}}_\ell\, \boldsymbol{\psi}_{\ell,i} = \gamma_{\ell,i} \boldsymbol{\psi}_{\ell,i}, \quad \|\boldsymbol{\psi}_{\ell,i}\| = 1. \end{align} \tag{ 42 }$

Proof. This result follows from the properties of orthogonal projectors and of the stationary points of the Rayleigh quotient. Here, we sketch the proof as follows. The constraint $\Pi_{\ell, c} \boldsymbol \phi = 0$ implies $\boldsymbol \phi = (\mathit{I}_{R_\ell} - \Pi_{\ell, c} ) \boldsymbol \phi$ , since $(\mathit{I}_{R_\ell} - \Pi_{\ell, c} )$ is also an orthogonal projector. Hence, the optimisation problem becomes

$\begin{align*} \boldsymbol{\psi}_{\ell,k+1} \!=\! \textrm{arg max}_{\| \boldsymbol \phi \| = 1} \langle \boldsymbol \phi, \left(\mathit{I}_{R_\ell} - \Pi_{\ell, c} \right) \widehat{\mathit{H}}_\ell \left(\mathit{I}_{R_\ell} - \Pi_{\ell, c} \right) \boldsymbol \phi \rangle, \ \ \text{subject to} \ \ \langle \boldsymbol \phi, \boldsymbol{\psi}_{\ell,i} \rangle = 0, \ i = 1, \ldots, k. \end{align*}$

The solutions (for $k = 1, 2, \ldots$ ) to these optimisation problems are given by the leading eigenvectors of the eigenproblem

$\begin{align*} \left(\mathit{I}_{R_\ell} - \Pi_{\ell, c} \right) \, \widehat{\mathit{H}}_\ell\, \left(\mathit{I}_{R_\ell} - \Pi_{\ell, c} \right)\, \boldsymbol{\psi}_{\ell,i} = \gamma_{\ell,i} \boldsymbol{\psi}_{\ell,i}. \end{align*}$

However, since $\boldsymbol{\psi}_{\ell,i} \in \textrm{range} \big(\mathit{I}_{R_\ell} - \Pi_{\ell, c} \big)$ this is equivalent to

$\begin{align*} \left(\mathit{I}_{R_\ell} - \Pi_{\ell, c} \right) \, \widehat{\mathit{H}}_\ell\, \boldsymbol{\psi}_{\ell,i} = \gamma_{\ell,i} \boldsymbol{\psi}_{\ell,i}. \end{align*}$

□

Definition 4.6 (LIS enrichment on level $\ell$ ). The leading $s_\ell$ (normalised) eigenvectors of the eigenproblem (42) with eigenvalues $\gamma_{\ell,i} \gt \varrho$ are denoted by

$\begin{align} \Psi_{\ell,\, f} = \left[\boldsymbol{\psi}_{\ell,1}, \dots, \boldsymbol{\psi}_{\ell,s_\ell}\right]. \end{align} \tag{ 43 }$

They are added to the lifted LIS basis from level $\ell\text{-}1$ to form the enriched LIS basis

$\begin{align} \Psi_{\ell, r} = \left[\Psi_{\ell, c}, \Psi_{\ell,\, f}\right] \end{align} \tag{ 44 }$

on level $\ell$ , where the basis vectors in (43) denote the auxiliary 'fine scale' directions added on level $\ell$ . By construction, all the LIS basis vectors at level $\ell$ are mutually orthogonal. That is, $\Psi_{\ell, r}^\top \Psi_{\ell, r}^{} = \mathit{I}_{r_\ell}$ . We also have $r_\ell = r_{\ell\text{-}1} + s_\ell$ .

By construction, the LIS basis $\Psi_{\ell, r}$ is block upper triangular and can be recursively defined as

$\begin{align} \Psi_{\ell, r} = \left[\Psi_{\ell, c}, \Psi_{\ell,\, f}\right] = \begin{bmatrix} \Psi_{\ell\text{-}1,r} &\mathit{Z}_{\ell,c}\\ 0 &\mathit{Z}_{\ell,f} \end{bmatrix}, \end{align} \tag{ 45 }$

where $\mathit{Z}_{\ell,c} = \Theta_{\ell, c}^\top \Psi_{\ell,\, f} \in \mathbb{R}^{R_{\ell\text{-}1} \times s_{\ell} }$ , $\mathit{Z}_{\ell,f} = \Theta_{\ell,\, f}^\top \Psi_{\ell,\, f} \in \mathbb{R}^{(R_{\ell} - R_{\ell\text{-}1}) \times s_{\ell} }$ , and $\Psi_{\ell\text{-}1, r} \in \mathbb{R}^{R_{\ell\text{-}1} \times r_{\ell\text{-}1}}$ . We have $s_\ell = r_{\ell} - r_{\ell\text{-}1}$ and define $s_0 = r_0$ for consistency. The hierarchical LIS reduces the computational cost of operating with the LIS basis and the associated storage cost. This is critical for building efficient multilevel DILI proposals that will be discussed later. In addition, the recursive LIS enrichment is computationally more efficient, since the amount of costly PDE solves on the finer levels will be significantly reduced. In appendix A, we develop heuristics to demonstrate the reduction factors of the hierarchical construction of LIS basis in terms of the storage and the number of matrix vector products.

5. Multilevel DILI MCMC

To compute the multilevel MCMC estimator, we need to construct Markov chains $\smash{\{{\boldsymbol V}_{\!\!\ell\text{-}1} ^{(\ell, j)} \}}$ and $\{{\boldsymbol V}_{\!\!\ell} ^{(\ell, j)} \}$ for adjacent levels $\ell\text{-}1$ and $\ell$ with invariant densities $\smash{\pi_{\ell\text{-}1}(\boldsymbol v_{\ell\text{-}1} | {\boldsymbol{y}})}$ and $\pi_{\ell}(\boldsymbol v_{\ell}| {\boldsymbol{y}})$ , respectively. As discussed in remark 3.1, it is crucial that the QoIs produced by the two Markov chains $\smash{\{{\boldsymbol V}_{\!\!\ell\text{-}1} ^{(\ell, j)} \}}$ and $\{{\boldsymbol V}_{\!\!\ell} ^{(\ell, j)} \}$ are positively correlated, i.e. $\textrm{Cov}_{\Delta_{\ell,\ell\text{-}1}}(Q_\ell, Q_{\ell\text{-}1}) \gt 0$ , so that the within-level variance $\mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}(D_\ell)$ is reduced. Here, we design a computationally efficient way in section 5.1 to couple DILI proposals within the original MLMCMC [14], we introduce the computational framework in section 5.2, and then provide an alternative sampling strategy in section 5.3 that is more suitable for a parallel implementation.

5.1. Coupled DILI proposal

Let ${\boldsymbol V}_{\!\!\ell\text{-}1} ^{(\ell,j)} = \boldsymbol v_{\ell\text{-}1}^\ast$ and ${\boldsymbol V}_{\!\!\ell} ^{(\ell,j)} = \boldsymbol v_\ell^\ast$ be the jth states of the Markov chains at levels $\ell\text{-}1$ and $\ell$ , respectively. The state at level $\ell$ has the form ${\boldsymbol v_\ell^\ast = (\boldsymbol v_{\ell, c}^\ast, \boldsymbol v_{\ell,\, f}^\ast)}$ , corresponding to the coarse part of the parameters (shared with level $\ell\text{-}1$ ) and the refined part, respectively. The two Markov chains are called coupled at the jth state if $\boldsymbol v_{\ell, c}^\ast = \boldsymbol v_{\ell\text{-}1}^\ast$ . Thus, assuming the two chains to be coupled at the jth state, we first present the general form of the multilevel MCMC for generating the next pair of coupled states, and then design the hierarchical DILI proposal within this general framework.

Following [14], we assume that we can generate independent posterior samples $\mathcal{V}_{\ell\text{-}1} = \{\boldsymbol v_{\ell\text{-}1}^{(i)} \}_{i = 1}^{N_{\ell}}$ on level $\ell\text{-}1$ . In practice, this is achieved (approximatively) by sub-sampling a Markov chain that targets the level $\ell\text{-}1$ posterior with a sub-sampling rate that depends on the sample autocorrelation [14, section 3]. In other words, coupled posterior samples from $\pi(\boldsymbol v_{\ell\text{-}1} | {\boldsymbol{y}})$ and $\pi(\boldsymbol v_{\ell} | {\boldsymbol{y}})$ are generated by using the posterior $\pi(\boldsymbol v_{\ell\text{-}1} | {\boldsymbol{y}})$ on level $\ell\text{-}1$ as the proposal distribution for the Markov chain on level $\ell$ , thus reducing the within-level variance $\mathrm{Var}_{\Delta_{\ell,\ell\text{-}1}}(D_\ell)$ .

The proposed candidate $\boldsymbol v^{{\prime}}_{\ell\text{-}1} \sim \pi_{\ell\text{-}1}(\cdot | {\boldsymbol{y}})$ is assumed to be independent of the current state $\boldsymbol v_{\ell\text{-}1}^\ast$ . To sample from the refined posterior $\pi_{\ell}(\boldsymbol v_{\ell}| {\boldsymbol{y}})$ , we then consider the factorised proposal

$\begin{align} q \left( \boldsymbol v_{\ell}^{{\prime}} \,\big|\, \boldsymbol v_{\ell}^\ast \right) = q \left( \boldsymbol v_{\ell,c}^{{\prime}}\,, \,\boldsymbol v_{\ell,f}^{{\prime}} \,\big|\, \boldsymbol v_{\ell}^\ast \right) = \pi_{\ell\text{-}1}^{} \left(\boldsymbol v_{\ell, c}^{{\prime}} | {\boldsymbol{y}} \right) \, q \left( \boldsymbol v_{\ell,f}^{{\prime}} \,|\, \boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell,c}^{{\prime}} \right) , \end{align} \tag{ 46 }$

where the coarse part $\boldsymbol v_{\ell, c}^{{\prime}}$ of the proposal is set to be the (independent) proposal $\boldsymbol v_{\ell\text{-}1}^{{\prime}}$ from level $\ell\text{-}1$ . The proposal candidate $\boldsymbol v^{{\prime}}_\ell$ conditioned on $\boldsymbol v_\ell^\ast$ can then be expressed as

$\begin{align} & \boldsymbol v^{{\prime}}_{\ell,c} = \boldsymbol v^{{\prime}}_{\ell\text{-}1} & \left( \textrm{copy from level } \ell\text{-}1 \textrm{ proposal}\right) , \end{align} \tag{ 47 }$

$\begin{align} & \boldsymbol v_{\ell,f}^{{\prime}} \sim q \left( \, \cdot \,|\, \boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell,c}^{{\prime}} \right) & \left(\textrm{conditional proposal}\right). \end{align} \tag{ 48 }$

Based on the factorised proposal (46), the acceptance probability for the chain targeting the level $\ell$ posterior $\pi^{}_\ell\big(\boldsymbol v^{}_{\ell} \,|\, {\boldsymbol{y}}\big)$ is of the form

$\begin{align} \alpha^\textrm{ML}_\ell\left(\boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell}^{{\prime}} \right) = \min\left\{1, \frac{\pi_\ell\left(\boldsymbol v_{\ell}^{{\prime}} | {\boldsymbol{y}}\right)\,\pi_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^\ast| {\boldsymbol{y}}\right)}{\pi_\ell\left(\boldsymbol v_{\ell}^\ast | {\boldsymbol{y}}\right)\,\pi_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^{\prime}| {\boldsymbol{y}}\right)} \frac{q \left( \boldsymbol v_{\ell,\, f}^\ast | \boldsymbol v_{\ell}^{\prime}, \boldsymbol v_{\ell\text{-}1}^\ast \right)}{q \left( \boldsymbol v_{\ell,f}^{\prime} | \boldsymbol v_{\ell}^\ast,\boldsymbol v_{\ell\text{-}1}^{\prime} \right)} \right\}. \end{align} \tag{ 49 }$

Figure 1 shows a schematic of the coupling strategy. The double arrows represent the coupling of the two MCMC states, as well as the coupling of the two proposal candidates across levels. The dashed arrows represent the proposal and acceptance/rejection steps. The top half represents the Markov chain on level $\ell\text{-}1$ . The bottom half represents the Markov chain on level $\ell$ . Since all the proposal candidates are coupled, all states that follow the acceptance of a proposal candidate on level $\ell$ are also coupled with the corresponding state on level $\ell\text{-}1$ .

**Figure 1.** This diagram illustrates the coupling strategy with double arrows representing the coupling of two MCMC states as well as the coupling of two proposal candidates across levels. The dashed arrows represent the proposal and the accept/reject steps. At each iteration, the proposal candidates are coupled by construction, while the samples $\boldsymbol v_{\ell}^{(\ell, j)}$ and $\boldsymbol v_{\ell\text{-}1}^{(\ell, j)}$ on two adjacent levels are only coupled whenever the proposal on level $\ell$ is accepted, i.e. for all cases $j, j+1, j+2$ here.
Download figure:
Standard image High-resolution image

5.1.1. DILI proposal.

Then, we design the DILI proposal using the hierarchical LIS introduced in section 4. Recall that the discretised DILI proposal (11) is

$\begin{align} \boldsymbol v^{{\prime}}_\ell = \mathit{A}_\ell^{} \boldsymbol v_\ell^\ast + \mathit{B}_\ell^{} \boldsymbol\xi_\ell, \quad \textrm{where } \boldsymbol\xi_\ell \sim \mathcal{N}\left(0, \mathit{I}_{R_\ell}\right), \end{align} \tag{ 50 }$

as it was introduced in [10]. Suppose we have a LIS basis $\Psi_{\ell, r} \in \mathbb{R}^{R_\ell \times r_\ell}$ . By treating the likelihood-informed parameter directions and the prior-dominated directions separately, we can construct the matrices $\mathit{A}_\ell$ and $\mathit{B}_\ell$ as

$\begin{align} \mathit{A}_\ell & = \Psi_{\ell,r} \, \mathit{A}_{\ell,r} \, {\Psi_{\ell,r}^\top} + a_\perp \left( \mathit{I}_{R_\ell} - \Pi_{\ell}\right) \in \mathbb{R}^{R_ \ell\times R_\ell}, \end{align} \tag{ 51 }$

$\begin{align} \mathit{B}_\ell^2 & = \Psi_{\ell,r} \, \mathit{B}_{\ell,r}^2 \, {\Psi_{\ell,r}^\top} + b_\perp^{2} \left( \mathit{I}_{R_\ell} - \Pi_{\ell}\right)\in \mathbb{R}^{R_ \ell\times R_\ell}, \end{align} \tag{ 52 }$

where $A_{\ell,r},B_{\ell,r}\in\mathbb{R}^{r_ \ell\times r_\ell}$ , $a_\perp$ and $b_\perp\in\mathbb{R}$ and $\Pi_{\ell} = \Psi_{\ell, r}^{} \Psi_{\ell, r}^\top$ are rank- $r_\ell$ orthogonal projectors.

Corollary 5.1. In the proposal (50), suppose that $\mathit{A}_{\ell, r}, \mathit{B}_{\ell, r}\in \mathbb{R}^{r_\ell \times r_\ell}$ are non-singular matrices satisfying $\mathit{A}_{\ell,r}^2 + \mathit{B}_{\ell,r}^2 = \mathit{I}_{\ell,r}$ , and $a_\perp$ and $b_\perp$ are scalars satisfying $a_\perp^2 + b_\perp^2 = 1$ . Then, the corresponding proposal distribution $q(\boldsymbol v_{\ell}^{{\prime}}|\boldsymbol v_{\ell}^*)$ satisfies the conditions of theorem 2.3 and has the prior as its invariant measure, i.e. this proposal has acceptance probability one if we use it to sample the prior. The acceptance probability as samples from $\pi_\ell\big(\boldsymbol v_{\ell} | {\boldsymbol{y}}\big)$ is

$\begin{align} \alpha \left(\boldsymbol v_\ell^\ast, \boldsymbol v^{\prime}_\ell\right) = \min \left\{1, \exp\left[ \eta_\ell^{} \left(\boldsymbol v_\ell^\ast; {\boldsymbol{y}} \right) - \eta_\ell^{} \left(\boldsymbol v^{\prime}_\ell; {\boldsymbol{y}} \right) \right] \right\}. \end{align} \tag{ 53 }$

Proof. Given $\mathit{A}_{\ell,r}^2 + \mathit{B}_{\ell,r}^2 = \mathit{I}_{\ell,r}$ , the symmetric matrices $\mathit{A}_{\ell, r}$ and $\mathit{B}_{\ell, r}$ can be simultaneously diagonalised under some orthogonal transformation. Thus, the operators $\mathit{A}_\ell$ and $\mathit{B}_\ell$ can be simultaneously diagonalised, where the eigenspectrum of $\mathit{A}_\ell$ consists of the eigenvalues of $\mathit{A}_{\ell, r}$ and $a_\perp$ , and the same applies to $\mathit{B}_\ell$ . This way, it is easy to check that the proposal distribution $q(\boldsymbol v_{\ell}^{{\prime}}|\boldsymbol v_{\ell}^*)$ has the prior as invariant measure and that the conditions of theorem 2.3 are satisfied. The form of the acceptance probability to sample from $\pi_\ell\big(\boldsymbol v_{\ell} | {\boldsymbol{y}}\big)$ directly follows from the acceptance probability defined in theorem 2.3. □

We use the empirical posterior covariance, commonly used in adaptive MCMC [16, 17, 31] to construct matrices $\mathit{A}_{\ell, r}$ and $\mathit{B}_{\ell, r}$ for our DILI proposal (50). On each level, the empirical covariance matrix $\Sigma_{\ell, r}\in \mathbb{R}^{r_\ell \times r_\ell}$ is estimated from past posterior samples projected onto the LIS. Given a jump size $\Delta t$ , we can then define the matrices $\mathit{A}_{\ell, r}^{}$ and $\mathit{B}_{\ell, r}^{2}$ by

$\begin{align*} \textstyle \mathit{A}_{\ell, r} & = \textstyle \left(2 \mathit{I}_{r_\ell} \!+\! \Delta t \Sigma_{\ell, r}\right)^{-1}\left(2 \mathit{I}_{r_\ell} \!-\! \Delta t \Sigma_{\ell, r}\right) = \mathit{I}_{r_\ell} - 2 \left( \mathit{I}_{r_\ell} \!+\! \frac{\Delta t}{2}\Sigma_{\ell, r} \!\right)^{\!-1} \left(\frac{\Delta t}{2}\Sigma_{\ell, r} \right), \\ \mathit{B}_{\ell, r}^2 & = \textstyle \mathit{I}_{r_\ell}^{} - \mathit{A}_{\ell, r}^2 = 4 \, \left( 2\,\mathit{I}_{r_\ell} + \left(\frac{\Delta t}{2} \Sigma_{\ell, r} \right)^{-1} + \frac{\Delta t}{2} \Sigma_{\ell, r} \right)^{-1}, \end{align*}$

respectively. The operators $\mathit{A}_{\ell, r}$ and $\mathit{B}_{\ell, r}$ satisfy $\mathit{A}_{\ell, r}^2 + \mathit{B}_{\ell, r}^2 = \mathit{I}_{\ell, r}^{}$ by construction.

By estimating the empirical covariance within the subspace, common conditions such as the diminishing adaptation [1, 32] for the convergence of adaptive MCMC can be easily satisfied. In addition, we adopt a finite adaptation strategy in our numerical implementation, in which only the samples generated post adaptation are used for estimating QoIs.

5.1.2. Conditional DILI proposal.

On level 0, the vanilla DILI proposal (cf [10]) can be used to sample the Markov chain with invariant distribution $\pi_{0}(\boldsymbol v_{0}| {\boldsymbol{y}})$ . On level $\ell$ , to simulate coupled Markov chains using the proposal mechanism defined in (46)–(48), a key step is to use DILI to generate the fine components $\boldsymbol v_{\ell,\, f}^{{\prime}}$ of the proposal candidate and thus to fix the conditional probability $q(\boldsymbol v_{\ell,f}^{{\prime}} | \boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell,c}^{{\prime}} )$ . Defining the precision matrix

$\begin{align} \mathit{P}_\ell^{} = \mathit{B}_\ell^{-2} = \Psi_{\ell, r}^{} \, \mathit{B}_{\ell,r}^{-2} \,\Psi_{\ell, r}^\top + b_\perp^{-2} \left( \mathit{I}_{R_\ell}^{} - \Pi_{\ell}^{}\right), \end{align} \tag{ 54 }$

the DILI proposal (50) can be split as follows:

$\begin{align} \begin{bmatrix} \boldsymbol v^{{\prime}}_{\ell, c} \\ \boldsymbol v^{{\prime}}_{\ell,\, f} \end{bmatrix} = \mathrm{A}_\ell^{} \boldsymbol v_\ell^\ast + \begin{bmatrix} \boldsymbol r_{\ell, c} \\ \boldsymbol r_{\ell,\, f} \end{bmatrix}, \quad \begin{bmatrix} \boldsymbol r_{\ell, c} \\ \boldsymbol r_{\ell,\, f} \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} \mathit{P}_{\ell,cc} &\mathit{P}_{\ell,cf}\\ \mathit{P}_{\ell,fc} &\mathit{P}_{\ell,\, f\,f} \end{bmatrix}^{-1}\right), \end{align} \tag{ 55 }$

where the partitions of the vectors and of the matrix $\mathit{P}_\ell$ correspond to the parameter coordinates shared with level $\ell\text{-}1$ and the refined parameter coordinates on level $\ell$ .

To draw candidate samples from the factorised proposal distribution $\pi_{\ell\text{-}1}^{} \big(\boldsymbol v_{\ell, c}^{{\prime}} | {\boldsymbol{y}} \big)\, q ( \boldsymbol v_{\ell,f}^{{\prime}} | \boldsymbol v_{\ell}^\ast,$ $\boldsymbol v_{\ell,c}^{{\prime}} )$ defined in (46) we use the procedure outlined in algorithm 1, which employs the DILI proposal in the form of (55) for the conditional distribution $q( \boldsymbol v_{\ell,f}^{{\prime}} | \boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell,c}^{{\prime}} )$ .

Corollary 5.2. Using the above procedure to draw candidates from the factorised proposal distribution $\pi_{\ell\text{-}1}^{} \big(\boldsymbol v_{\ell, c}^{{\prime}} | {\boldsymbol{y}} \big) q \big( \boldsymbol v_{\ell,f}^{{\prime}} | \boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell,c}^{{\prime}} \big)$ , the acceptance probability to sample from the posterior distribution $\pi_\ell(\boldsymbol v_{\ell} | {\boldsymbol{y}})$ is

$\begin{align} \alpha^\mathrm{ML}_\ell \left(\boldsymbol v_\ell^\ast, \boldsymbol v^{\prime}_\ell\right) = \min \left\{1, \exp\!\left[ \left(\eta_\ell^{} \left(\boldsymbol v_\ell^\ast; {\boldsymbol{y}}\right) \!-\! \eta_{\ell\text{-}1}^{} \left(\boldsymbol v_{\ell\text{-}1}^\ast; {\boldsymbol{y}}\right) \right) \!-\! \left(\eta_\ell^{} \left(\boldsymbol v_\ell^{{\prime}}; {\boldsymbol{y}}\right) \!-\! \eta_{\ell\text{-}1}^{} \left(\boldsymbol v_{\ell\text{-}1}^{{\prime}}; {\boldsymbol{y}}\right) \right) \right] \right\}. \end{align}$

Proof. See appendix B. □

Algorithm 1. Conditional DILI proposal.
Input: A proposal $\boldsymbol v^{{\prime}}_{\ell,c}$ drawn from $\pi_{\ell\text{-}1}^{} (\boldsymbol v_{\ell, c}^{{\prime}} \| {\boldsymbol{y}} )$ using a sub-sampled Markov chain.
Output: A joint, candidate proposal $\boldsymbol v_\ell^{\prime} = (\boldsymbol v_{\ell,c}^{\prime}, \boldsymbol v_{\ell,f}^{\prime})$ on the fine level based on (55).
1. procedure Conditional DILI proposal
2. With $\boldsymbol v^{{\prime}}_{\ell,c}$ and $\boldsymbol v_{\ell}^\ast$ known, compute the 'residual' $\boldsymbol r_{\ell, c} = \boldsymbol v^{{\prime}}_{\ell, c} - \Theta_{\ell, c}^\top\,\mathit{A}_\ell^{} \, \boldsymbol v_\ell^\ast$ using (55).
3. Draw a random variable $\boldsymbol r_{\ell,\, f}$ conditioned on $\boldsymbol r_{\ell, c}$ such that jointly $(\boldsymbol r_{\ell, c}, \boldsymbol r_{\ell,\, f}) \sim \mathcal{N}(0, \mathit{P}_\ell^{-1})$ .
Due to (55), the fine-level components of the proposed candidate $\boldsymbol v^{{\prime}}_{\ell,f}$ then satisfy
$\begin{equation} \boldsymbol v^{{\prime}}_{\ell,\,f} = \Theta_{\ell,\, f}^\top\, \mathit{A}_\ell^{}\, \boldsymbol v_\ell^\ast + \boldsymbol r_{\ell,\, f}^{} , \quad \boldsymbol r_{\ell,\, f}^{} \sim \mathcal{N}\left(\!-\!\mathit{P}_{\ell,\, f\,f}^{-1}\mathit{P}_{\ell,\, fc}^{}\boldsymbol r_{\ell,c}^{}, \mathit{P}_{\ell,\, f\,f}^{-1} \right) . \end{equation} \tag{ 56 }$
4. end procedure

5.1.3. Generating conditional samples.

The computational cost of the coupling procedure is dictated by the multiplication with $\mathit{A}_\ell$ in Step 2 and the generation of conditional proposal samples in Step 3. The multiplication with $\mathit{A}_\ell$ has a computational complexity of $\textstyle \mathcal{O}( \sum_{j = 0}^{\ell} R_{j} s_{j})$ using the low-rank representation (51) and the upper-triangular hierarchical LIS basis in (45), which has the form

$\begin{align*} \Psi_{\ell, r} = \left[\Psi_{\ell, c}, \Psi_{\ell,\, f}\right] = \begin{bmatrix} \Psi_{\ell\text{-}1,r} &\mathrm{Z}_{\ell,c}\\ 0 &\mathrm{Z}_{\ell,f} \end{bmatrix}. \end{align*}$

We can also exploit the hierarchical LIS to reduce the computational cost of generating conditional proposal samples. As shown in equation (54), given the LIS basis $\Psi_{\ell,r}$ , the precision matrix $\mathit{P}_\ell$ is dictated by the matrix $\mathit{B}_{\ell,r}^{-2}$ , which has the block form

$\begin{align} \mathit{B}_{\ell,r}^{-2} = \begin{bmatrix} \Xi_{\ell,cc} &\Xi_{\ell,cf}\\ \Xi_{\ell,fc} &\Xi_{\ell,\, f\,f} \end{bmatrix}, \end{align} \tag{ 57 }$

corresponding to the splitting of the enriched LIS basis into $\Psi_{\ell, c}$ and $\Psi_{\ell,\, f}$ . Generating conditional proposal samples only involves the blocks $\mathit{P}_{\ell,\, f\,f}$ and $\mathit{P}_{\ell,\, fc}$ in the matrix $\mathit{P}_\ell$ , i.e.

$\begin{align} \mathit{P}_{\ell,\, f\,f} & = \mathit{Z}_{\ell,f} \, \left(\Xi_{\ell,\, f\,f} - b_\perp^{-2}\,\mathit{I} \,\right) \, \mathit{Z}_{\ell,f}^\top + b_\perp^{-2}\,\mathit{I}_{\ell,f}\,, \end{align} \tag{ 58 }$

$\begin{align} \mathit{P}_{\ell,fc} & = \mathit{Z}_{\ell,f} \Xi_{\ell,fc} \Psi_{\ell\text{-}1,r}^\top + \mathit{Z}_{\ell,f} \Xi_{\ell,\, f\,f} \mathit{Z}_{\ell,c}^\top - b_\perp^{-2} \mathit{Z}_{\ell,f}\mathit{Z}_{\ell,c}^\top\,, \end{align} \tag{ 59 }$

which in turn only require the blocks $\Xi_{\ell,fc} \in \mathbb{R}^{s_{\ell} \times r_{\ell\text{-}1}}$ and $\Xi_{\ell,\, f\,f} \in \mathbb{R}^{s_{\ell} \times s_{\ell} }$ in the matrix $\mathit{B}_{\ell,r}^{-2}$ .

We derive low-rank operations to avoid the direct inversion or factorisation of the matrices $\mathit{P}_{\ell,\, f\,f}$ and $\mathit{P}_{\ell,\, fc}$ in the generation of conditional samples and to reduce the computational cost. Suppose the block $\mathit{Z}_{\ell,f} \in \mathbb{R}^{(R_\ell - R_{\ell\text{-}1}) \times s_\ell}$ has the thin QR factorisation

$\begin{align} \mathit{Z}_{\ell,f} = \mathit{U}_{\ell} \mathit{T}_\ell, \end{align} \tag{ 60 }$

where $\mathit{U}_\ell$ has orthonormal columns and $\mathit{T}_\ell$ is upper triangular. Then the matrix $\mathit{P}_{\ell,\, f\,f}$ can be expressed as

$\begin{align*} \mathit{P}_{\ell,\, f\,f} = b_\perp^{-2} \left( \mathit{U}_{\ell} \left( \mathit{T}_\ell \left(b_\perp^{2}\,\Xi_{\ell,\, f\,f} - \mathit{I} \right) \mathit{T}_\ell^\top \right) \mathit{U}_{\ell}^\top + \,\mathit{I}_{\ell,f} \right). \end{align*}$

Computing the $s_{\ell} \times s_{\ell}$ eigendecomposition

$\begin{align} \mathit{T}_\ell \left(b_\perp^{2}\,\Xi_{\ell,\, f\,f} - \mathit{I} \right) \mathit{T}_\ell^\top = \mathit{W}_\ell \mathit{D}_\ell \mathit{W}_\ell^\top, \end{align} \tag{ 61 }$

where $\mathit{W}_\ell$ and $\mathit{D}_\ell$ are respectively orthogonal and diagonal matrices, we have

$\begin{align*} \mathit{P}_{\ell,\, f\,f} = b_\perp^{-2} \left( \Phi_\ell \, \mathit{D}_\ell \, \Phi_\ell^\top + \,\mathit{I}_{\ell,f} \right), \quad \text{with} \quad \Phi_\ell : = \mathit{U}_{\ell} \mathit{W}_\ell . \end{align*}$

Note that $\Phi_\ell \in \mathbb{R}^{(R_\ell - R_{\ell\text{-}1}) \times s_\ell}$ has orthonormal columns, so that

$\begin{align} \mathit{P}_{\ell,\, f\,f}^{-1} \mathit{P}_{\ell,fc} & = b_\perp^2 \, \Phi_\ell \underbrace{\left( \left(\mathit{D}_\ell + \mathit{I}\right)^{-1} \mathit{W}_\ell^\top \mathit{T}_\ell \right)}_{s_{\ell} \times s_{\ell} } \underbrace{\left( \Xi_{\ell,fc} \Psi_{\ell\text{-}1,r}^\top + \Xi_{\ell,\, f\,f} \mathit{Z}_{\ell,c}^\top - b_\perp^{-2} \mathit{Z}_{\ell,c}^\top \right)}_{s_{\ell} \times R_{\ell\text{-}1} }, \end{align} \tag{ 62 }$

$\begin{align} \mathit{P}_{\ell,\, f\,f}^{-\frac12} & = b_\perp \, \left( \Phi_\ell \underbrace{\left( \left(\mathit{D}_\ell + \mathit{I}\right)^{-\frac12} - \mathit{I} \right)}_{s_{\ell} \times s_{\ell} } \Phi_\ell^\top + \mathit{I}_{\ell,f} \right) . \end{align} \tag{ 63 }$

Using these representations of the matrices $\smash{ \mathit{P}_{\ell,\, f\,f}^{-1} \mathit{P}_{\ell,fc} }$ and $\smash{ \mathit{P}_{\ell,\, f\,f}^{-1/2}}$ , the conditional Gaussian in (56) can be simulated efficiently using

$\begin{align} \boldsymbol r_{\ell,\, f}^{} | \boldsymbol r_{\ell, c}^{} = - \mathit{P}_{\ell,\, f\,f}^{-1}\,\mathit{P}_{\ell,\, fc}^{}\,\boldsymbol r_{\ell,c}^{} + \mathit{P}_{\ell,\, f\,f}^{-\frac12} \xi, \textit{ where }\xi \sim \mathcal{N}\left(0,\, \mathit{I}_{\left(R_{\ell} - R_{\ell\text{-}1}\right)}\right). \end{align} \tag{ 64 }$

The associated computational cost is $\mathcal{O}(R_{\ell} s_{\ell} )$ .

5.2. Final MLDILI algorithm

Here, we assemble all the elements of the multilevel DILI method defined in the previous sections in algorithmic form. For the base level ( $\ell = 0$ ), the LIS construction and the DILI–MCMC sampling are presented in algorithm 2. The recursive LIS construction and the coupled DILI–MCMC are presented in algorithm 3.

Algorithm 2. Base level algorithm.
Input: A set of samples $\mathcal{W}_0 = \{\boldsymbol v_0^{(k)} \}_{k = 1}^{K_0}$ drawn from the base level reference $p_0^\ast(\cdot)$ , the number of MCMC iterations N₀, and an initial MCMC state $\boldsymbol V_0^{(0)}$ .
Output: A LIS basis $\Psi_{0, r}$ and a Markov chain of posterior samples $\mathcal{V}_{0} = \{\boldsymbol V_0^{(j)} \}_{j = 1}^{N_0}$ .
1. procedure Base level LIS and MCMC
2. Use $\mathcal{W}_0$ to solve the eigenproblem in (39) to obtain the base level LIS basis $\Psi_{0, r}$ .
3. Estimate the empirical covariance matrix $\Sigma_{0, r}$ from the samples in $\mathcal{W}_0$ and define theoperators $\mathit{A}_0$ and $\mathit{B}_0$ as in (51) and (52).
4. for $j = 1, \ldots, N_0$ do
5. Propose a candidate $\boldsymbol v_0^{\prime}$ using the base level proposal in (50).
6. Compute the acceptance probability $\alpha(\boldsymbol V_0^{(j -1 )},\boldsymbol v_0^{\prime})$ defined in (53).
7. With probability $\alpha(\boldsymbol V_0^{(j -1 )},\boldsymbol v_0^{\prime})$ , set $\boldsymbol V_0^{(j)} = \boldsymbol v_0^{\prime}$ , otherwise set $\boldsymbol V_0^{(j)} = \boldsymbol V_0^{(j-1)}$ .
8. end for
9. end procedure
Note: Optionally, $\Sigma_{0, r}$ , $\mathit{A}_0$ and $\mathit{B}_0$ can be adaptively updated within the MCMC after a pre-fixednumber of iterations, cf [1, 17].

Algorithm 3. Level– $\ell$ algorithm.
Input: A set of samples $\mathcal{W}_{\ell} = \{ \boldsymbol v_\ell^{(k)} \}_{k = 1}^{K_\ell}$ from the level– $\ell$ reference $p_\ell^\ast(\cdot)$ , the number of MCMC iterations $N_\ell$ , a set of MCMC samples $\mathcal{V}_{\ell\text{-}1} = \{ \boldsymbol v_{\ell\text{-}1}^{(j)} \}_{j = 1}^{N_\ell\text{-}1}$ on level $\ell\text{-}1$ and an initial MCMC state $\boldsymbol V_{\ell}^{(0)}$ .
Output: A LIS basis $\Psi_{\ell, r}$ and a Markov chain of posterior samples $\mathcal{V}_{\ell} = \{\boldsymbol V_\ell^{(j)} \}_{j = 1}^{N_\ell}$ .
1. procedure Level– $\ell$ LIS and MCMC
2. Lift previous LIS basis, $\Psi_{\ell, c} = \Theta_{\ell, c} \, \Psi_{\ell\text{-}1, r}$ .
3. Use $\mathcal{W}_\ell$ to solve the eigenproblem in (42) to obtain the auxiliary LIS vectors $\Psi_{\ell,\, f}$ .
4. Estimate the empirical covariance matrix $\Sigma_{\ell, r}$ from the samples in $\mathcal{W}_\ell$ and define theoperators $\mathit{A}_\ell$ and $\mathit{B}_\ell$ as in (51) and (52).
5. Compute the matrices $\mathit{P}_{\ell,\, f\,f}^{-1}\,\mathit{P}_{\ell,\, fc}^{}$ and $\mathit{P}_{\ell,\, f\,f}^{-\frac12}$ as in (62) and (63).
6. for $j = 1, \ldots, N_\ell$ do
7. Propose a candidate $\boldsymbol v_\ell^{\prime} = (\boldsymbol v_{\ell,c}^{\prime}, \boldsymbol v_{\ell,f}^{\prime})$ using Algorithm 1, which needs $\mathcal{V}_{\ell\text{-}1}$ .
8. Compute the acceptance probability $\alpha_\ell^\textrm{ML}(\boldsymbol V_\ell^{(j -1 )},\boldsymbol v_\ell^{\prime})$ defined in corollary 5.2.
9. With probability $\alpha_\ell^\textrm{ML}(\boldsymbol V_\ell^{(j -1 )},\boldsymbol v_\ell^{\prime})$ , set $\boldsymbol V_\ell^{(j)} = \boldsymbol v_\ell^{\prime}$ , otherwise set $\boldsymbol V_\ell^{(j)} = \boldsymbol V_\ell^{(j-1)}$ .
10. end for
11. end procedure
Note: Optionally, $\Sigma_{\ell, r}$ , $\mathit{A}_\ell$ , $\mathit{B}_\ell$ , and the matrices $\mathit{P}_{\ell,\, f\,f}^{-1}\,\mathit{P}_{\ell,\, fc}^{}$ and $\mathit{P}_{\ell,\, f\,f}^{-\frac12}$ can be adaptively updated withinMCMC after a pre-fixed number of iterations.

In both algorithms, we need to use both the LIS basis $\Psi_{\ell, r}$ and an empirical covariance matrix $\Sigma_{\ell, r}$ projected onto the LIS to define operators $\mathit{A}_\ell$ and $\mathit{B}_\ell$ in the DILI proposal. Computing the LIS basis needs some reference distribution $p_\ell^\ast(\cdot)$ . We employ the Laplace approximation to the posterior (e.g. [28, 30]). This way, all the samples from $p_\ell^\ast(\cdot)$ can be generated in parallel and prior to the DILI–MCMC simulation. The empirical covariance $\Sigma_{\ell, r}$ can be estimated using either samples drawn from the reference distribution (before the start of MCMC) or adaptively using posterior samples generated in MCMC. The latter option is the classical adaptive MCMC method [17]. The adaptation of $\Sigma_{\ell, r}$ is optional in algorithms 2 and 3. Similar to the adaptation of the covariance, the LIS basis can also be adaptively updated using newly generated posterior samples during MCMC simulations. The implementation details for the adaptation of the LIS can be found in algorithm 1 of [10].

5.3. Pooling strategy

Finally, we present an alternative proposal strategy that fully exploits the power of multilevel MCMC but reduces the dependencies of samples on different levels for a better parallel performance. In this pooling strategy, we simulate coupled multilevel Markov chains level-by-level.

Given a set of posterior samples $\mathcal{V}_{\ell\text{-}1} = \{\boldsymbol v_{\ell\text{-}1}^{(i)} \}_{i = 1}^{N_{\ell\text{-}1}}$ on level $\ell\text{-}1$ with $\boldsymbol v_{\ell\text{-}1}^{(i)} \sim \pi_{\ell\text{-}1}(\cdot | {\boldsymbol{y}})$ , we again generate $N_\ell$ samples on level $\ell$ using the multilevel proposal mechanism (46)–(48) with conditional DILI proposals as described in algorithm 1. However, here the inputs to algorithm 1, i.e. the proposals $\boldsymbol v^{{\prime}}_{\ell,c}$ , are drawn uniformly at random (with replacement) from the set $\mathcal{V}_{\ell\text{-}1}$ , in contrast to using proposals $\boldsymbol v^{{\prime}}_{\ell\text{-}1} \sim \pi_{\ell\text{-}1}(\cdot | {\boldsymbol{y}})$ from a sub-sampled Markov chain on level $\ell\text{-}1$ , as discussed in section 5.1 above. Thus, in this pooling strategy the empirical distribution of the samples in $\mathcal{V}_{\ell\text{-}1}$ is used as an approximation of $\pi_{\ell\text{-}1}(\cdot | {\boldsymbol{y}})$ .

Due to variance reduction from level to level in the multilevel MCMC algorithm (cf equation (30)) and the excellent mixing of our MLDILI algorithm, the effective sample size of $\mathcal{V}_{\ell\text{-}1}$ will in general be significantly larger than the number of samples $N_\ell$ in the sample set $\mathcal{V}_{\ell} = \{\boldsymbol v_{\ell}^{(i)} \}_{i = 1}^{N_{\ell}}$ that we plan to generate at level $\ell$ . Thus, after some burn-in phase the set $\mathcal{V}_{\ell\text{-}1}$ will contain (approximately) independent samples from the coarse level posterior which are needed in the construction of the Markov chain on level $\ell$ in algorithm 3 (Line 7).

With the pooling strategy, it is possible to run multiple Markov chains at the coarse level and form the pool using the union of coarse level samples. It parallelises much more easily and also provides flexibility if the user decides to run further refined levels to improve the discretisation accuracy—one can simply reuse the pool of previously computed samples before the refinement as the coarse level proposal. Despite the practical usefulness, we note that the formal proof of convergence of the pooling strategy remains unclear and will need to be addressed in future research.

6. Numerical experiments

In this section, the algorithms are tested on a model problem involving an elliptic PDE with random coefficients described in section 6.1. Numerical comparisons are then given in section 6.2.

6.1. Setup

We consider an elliptic PDE in a domain $\Omega = [0,1]^2$ with boundary $\partial \Omega$ , which models, e.g. the pressure distribution $p(\boldsymbol x)$ of a stationary fluid in a porous medium described by a spatially heterogeneous permeability field $k(\boldsymbol x)$ . Here, $\boldsymbol x\in \Omega$ denotes the spatial coordinate and $\boldsymbol n(\boldsymbol x)$ denotes the outward normal vector along the boundary.

The goal is to recover the permeability field from pressure observations. We assume that the permeability field follows a log–normal prior, and thus we denote the permeability field by $k(\boldsymbol x) = \exp(u(\boldsymbol x))$ , where $u(\boldsymbol x)$ is a random function equipped with a Gaussian process prior. In this setting, the pressure $p(\boldsymbol x)$ depends implicitly on the (random) realisation of $u(\boldsymbol x)$ .

For a given realisation $u(\boldsymbol x)$ , the pressure satisfies the elliptic PDE

$\begin{align} -\nabla\cdot\left(e^{u\left(\boldsymbol x\right)}\nabla p\left(\boldsymbol x\right)\right) = 0, \quad \boldsymbol x\in \Omega. \end{align} \tag{ 65 }$

On the left and right boundaries, we specify Dirichlet boundary conditions, while on the top and bottom we assume homogeneous Neumann boundary conditions:

$\begin{align} \begin{cases} ~~p\left(\boldsymbol x\right) = 0, &\ \ \text{for} \ \ \boldsymbol x\in\partial \Omega_{\text{left}}\,, \\ ~~p\left(\boldsymbol x\right) = 1, &\ \ \text{for} \ \ \boldsymbol x\in\partial \Omega_{\text{right}} \ \ \text{and} \\ ~~e^{u\left(\boldsymbol x\right)} \nabla p\left(\boldsymbol x\right)\cdot \boldsymbol n\left(\boldsymbol x\right) = 0, &\ \ \text{for} \ \ \boldsymbol x\in\left\{\partial \Omega_{\text{top}}, \partial \Omega_{\text{bottom}}\right\}. \end{cases} \end{align} \tag{ 66 }$

As the quantity of interest, we define the outflow through the left vertical boundary, i.e.

$\begin{align} Q\left(u\right) = -\int_0^1 e^{u\left(\boldsymbol x\right)}\frac{\partial p\left(\boldsymbol x\right)}{\partial x_1}\Big|_{x_1 = 0}\,\textrm{d}x_2\, = - \int_\Omega e^{u\left(\boldsymbol x\right)} \nabla p\left(\boldsymbol x\right)\cdot\nabla \varphi\left(\boldsymbol x\right)\,\textrm{d}\boldsymbol x\,, \end{align} \tag{ 67 }$

where $\varphi(\boldsymbol x)$ is a linear function taking value one on $\partial \Omega_{\text{left}}$ and zero on $\partial \Omega_{\text{right}}$ , as suggested in [38].

The Gaussian process prior for $u(\boldsymbol x)$ is defined by the exponential kernel $k(\boldsymbol x, \boldsymbol x^{{\prime}}) = \exp(-5|\boldsymbol x - \boldsymbol x^{{\prime}}|)$ . Figure 2(left) displays the true (synthetic) permeability field in $\log_{10}$ scale. Noisy observations of the pressure field are collected from 71 sensors located as in figure 2(right), with a signal-to-noise ratio 50. A likelihood function can then be defined as in (3), which, together with the prior, characterises the posterior distribution in (1).

**Figure 2.** Setup of elliptic inverse problem. Top left: 'true' permeability field used for generating the synthetic data set. Top right: observation sensors (red dots) and pressure field corresponding to 'true' permeability field. Bottom row: realisations of the permeability drawn from the prior.
Download figure:
Standard image High-resolution image

In practice, (65)–(67) has to be solved numerically. We use standard, piecewise bilinear finite elements (FEs) on a hierarchy of nested Cartesian grids with mesh size $h_\ell = \tfrac{1}{20} \times 2^{-\ell}$ , for $\ell = 0,1,2,3$ . Furthermore, we approximate the unknown function $u(\boldsymbol x)$ by truncated Karhunen-Loève expansions with $R_\ell = 50 + 100 \times 2^\ell$ random modes, respectively.

6.2. Comparisons

Let us now test and compare our algorithms on the model problem described above. First, we proceed as in section 4 to build a LIS at every level, using both the non-recursive and recursive constructions. Table 1 summarises the number of basis functions obtained in each case with truncation threshold $\rho = 10^{-2}$ , as well as the storage reduction factor given by the recursive procedure at each level.

Table 1. LIS dimensions: results of non-recursive construction (single-level LIS for each $\ell$ ) reported in first row; for the recursive construction, the number of vectors added on the current level and the total LIS dimension are given in the second and third row, respectively; the fourth row displays the storage reduction factor for the recursive procedure at each level.

Level	0	1	2	3
Non-recursive	80	91	97	100
Recursive (added on level $\ell$ )	80	21	19	12
Recursive (total)	80	101	120	132
Storage reduction factor	1	0.74	0.60	0.43

Because the recursive LIS construction recycles LIS bases from previous levels and enriches them with a number of auxiliary LIS vectors on each level, it is expected that the total number of basis functions obtained by the enriching procedure at each level is slightly higher than the direct (spectral) LIS on the same level. However, in the recursive construction, the dimension of the auxiliary set of vectors is expected to decrease as the level increases, requiring less storage and less computational effort on finer levels, since the posterior distributions were assumed to converge with $\ell \to \infty$ . For problems with parametrisations where the parameter dimension increases more rapidly with the discretisation level—e.g. using the same FE grid to discretise the prior covariance, the setting used in the original DILI paper [10]—we expect the reduction factor to be even smaller.

In the comparison of sampling performances, we denote by MLpCN the MLMCMC algorithm using the pCN proposal for the additional parameters on each level (as in [14]). The MLMCMC algorithm using the recursive LIS and the coupled DILI proposals, as summarised in algorithms 2 and 3, is denoted by MLDILI. The IACTs of Markov chains constructed by MLpCN and MLDILI are reported in table 2. The IACTs for two functionals are reported for each algorithm. In the 'refined parameters' case, at every level $\ell$ we report the average IACTs of the refined parameters $\boldsymbol v_{\ell,f}$ . This quantifies how well the algorithm performs in exploring the posterior distribution. In the second case, we consider the IACT of the level- $\ell$ corrections of the quantity of interest $D_\ell = Q_\ell( {\boldsymbol V}_{\!\!\ell} ) - Q_{\ell\text{-}1}( {\boldsymbol V}_{\!\!\ell\text{-}1} )$ .

Table 2. Comparison of IACTs of Markov chains generated by MLDILI and MLpCN. This table reports the IACTs of the refined parameters and the level- $\ell$ correction of the quantity of interest $D_\ell = Q_\ell( {\boldsymbol V}_{\!\!\ell} ) - Q_{\ell\text{-}1}( {\boldsymbol V}_{\!\!\ell\text{-}1} )$ .

	Refined parameters		$D_\ell$
Level	MLDILI	MLpCN	MLDILI	MLpCN
0	34	4300	9.0	4100
1	11	45	4.6	4.9
2	3.6	48	2.4	2.8
3	2.0	24	1.8	1.9

In the 'refined parameters' case, we observe a significant improvement for MLDILI over MLpCN: the coupled DILI proposal is able to reduce the IACT at every level compared to that obtained by MLpCN. At the base level, DILI is able to reduce the IACT by two orders of magnitude compared to that of pCN. This suggests that coarse parameter modes are very informed by the data, and thus utilising the DILI proposal is highly beneficial. In the case of the quantity of interest, we observe an even more impressive improvement at the base level (a factor of 456!), while the IACTs of MLDILI and MLpCN on the finer levels are comparable. This suggests that the posterior distribution of the chosen quantity of interest (the integrated flux over the boundary) is not affected strongly by the high frequency parameter modes on the finer levels. Nevertheless, in both cases, using DILI provides a huge acceleration compared to pCN. Figure 3 compares the IACTs of DILI and pCN on level 0, for both the first parameter component and the quantity of interest.

**Figure 3.** Autocorrelation functions of the chains $\{\big(\boldsymbol V^{(j)}_{0}\big)_1 \}$ and $\{Q_0(\boldsymbol V^{(j)}_0)\}$ on the coarsest level (Blue: DILI. Red: pCN).
Download figure:
Standard image High-resolution image

**Figure 3.** Autocorrelation functions of the chains $\{\big(\boldsymbol V^{(j)}_{0}\big)_1 \}$ and $\{Q_0(\boldsymbol V^{(j)}_0)\}$ on the coarsest level (Blue: DILI. Red: pCN).
Download figure:
Standard image High-resolution image

The IACTs for the level- $\ell$ corrections of the quantity of interest in table 2 suggest that using a mixed strategy—in which one employs the LIS and DILI only at the coarsest level and uses pCN in refined levels—is also a reasonable approach in cases where the important likelihood-informed directions that have any influence on the quantity of interest are already well enough identified in the base-level LIS. We refer to this as the MLmixed strategy.

We compare the computational performance of the three multilevel algorithms (MLDILI, MLpCN, MLmixed) with the two single level algorithms using DILI and pCN proposals. The finite element model and all MCMC algorithms are implemented in MATLAB; we use sparse Cholesky factorisation [6] to solve the finite element systems and ARPACK [27] to solve the eigenproblems. All simulations are carried out on a workstation equipped with 28 cores (two Intel Xeon E5-2680 CPUs). The performance of MLmixed is only estimated using the IACTs and the actual computing times measured in the MLDILI and MLpCN runs.

The computational complexities of the five algorithms for approximating $\mathbb{E}_{\pi}[Q]$ on (discretisation) levels $L = 1, 2$ and 3 with Q defined in (67) are compared in figure 4(right). In the multilevel estimators, the coarsest level is always $\ell = 0$ , so that the number of levels is $2,3$ and 4, respectively. The sampling error tolerance on each level is adapted to the corresponding bias error due to finite element discretisation and parameter truncation, such that the squared bias is equal to the variance of the estimator. The bias errors were estimated beforehand to be $9\times 10^{-3}$ , $4\times 10^{-3}$ , and $2\times 10^{-3}$ on levels $L = 1, 2$ and 3, leading to a total error of $1.27\times 10^{-2}$ , $5.7\times 10^{-3}$ , $2.8\times 10^{-3}$ , respectively. Those bias estimates are plotted in figure 4(left) together with estimates of $\mathrm{Var}_{\pi_\ell}(Q_\ell)$ and $\mathrm{Var}_{\Delta_{\ell, \ell\text{-}1}}( Q_\ell - Q_{\ell\text{-}1} )$ , which suggest that $\theta_b \approx 0.5$ and $\theta_v \approx 0.5$ in assumptions 2.4(i) and 3.2. This agrees with the theoretical results in [14]. The cost per sample is dominated by the sparse Cholesky factorisation on each level and scales roughly like $\mathcal{O}(M_\ell^{1.2})$ , so that $\theta_c \approx 1.2$ in assumption 2.4(ii). Optimally scaling multigrid solvers exist for this model problem, but for the FE problem sizes considered here they are more costly in absolute terms. Moreover, we can also exploit the fact that the adjoint problem is identical to the forward problem here, so that the Cholesky factors can be reused for the adjoint solves required in the LIS construction.

**Figure 4.** Left: the variance $\mathrm{Var}_{\pi_\ell}(Q_\ell)$ (blue) and the bias $\epsilon_\ell = \big| \mathbb{E}_{\mu_y}\big[Q\big] - \mathbb{E}_{\pi_\ell}\big[Q_\ell\big] \big|$ (yellow) at each level, and the cross-level variances $\mathrm{Var}_{\Delta_{\ell, \ell\text{-}1}}( D_{\ell\text{-}1} ) = \mathrm{Var}_{\Delta_{\ell, \ell\text{-}1}}( Q_\ell - Q_{\ell\text{-}1} )$ (red) used for estimating the CPU time for various MCMC methods. Right: total CPU time (in seconds) for various methods to achieve different total error tolerances. The LISs are constructed by recycling Cholesky factors. The dotted line represents a CPU day.
Download figure:
Standard image High-resolution image

**Figure 4.** Left: the variance $\mathrm{Var}_{\pi_\ell}(Q_\ell)$ (blue) and the bias $\epsilon_\ell = \big| \mathbb{E}_{\mu_y}\big[Q\big] - \mathbb{E}_{\pi_\ell}\big[Q_\ell\big] \big|$ (yellow) at each level, and the cross-level variances $\mathrm{Var}_{\Delta_{\ell, \ell\text{-}1}}( D_{\ell\text{-}1} ) = \mathrm{Var}_{\Delta_{\ell, \ell\text{-}1}}( Q_\ell - Q_{\ell\text{-}1} )$ (red) used for estimating the CPU time for various MCMC methods. Right: total CPU time (in seconds) for various methods to achieve different total error tolerances. The LISs are constructed by recycling Cholesky factors. The dotted line represents a CPU day.
Download figure:
Standard image High-resolution image

Let us now discuss the results. Single level pCN becomes impractical in this example, since the data is very informative and leads to an extremely low effective sample size. Some of this bad statistical efficiency is inherited by MLpCN, at least in absolute terms, due to the poor effective sample size on level 0. Asymptotically this effect disappears and the rate of growth of the cost is smallest for MLpCN with an observed assymptotic cost of about $\mathcal{O}(\epsilon^{-2.3})$ . As observed in [14], this is better than the theoretically predicted asymptotic rate and is likely a pre-asymptotic effect due to the high cost on level 0. Unsurprisingly, given the low IACTs reported in table 2, the methods based on DILI proposals all perform significantly better. MLDILI and MLmixed perform almost identically, since the corresponding IACTs on all levels are very similar. They are consistently better than single-level DILI and the asymptotic rate of growth of the cost is also better, $\mathcal{O}(\epsilon^{-3.4})$ versus $\mathcal{O}(\epsilon^{-4.1})$ . Both rates are consistent with the theoretically predicted rates in theorem 3.4, given the estimates for $\theta_b, \theta_v, \theta_c$ above. For the highest accuracies, MLDILI is almost 4 times faster than DILI, and due to the better asymptotic behaviour this reduction factor will grow as ε → 0. For grid level L = 4, even MLpCN is expected to outperform single-level DILI, but the computational costs of the estimators for higher accuracies are starting to become impractical even using the multilevel acceleration, as the dashed line representing one CPU day in figure 4(right) indicates.

The dominating cost in solving the eigenproblems (39) and (42) is the Cholesky factorisation. As mentioned above, sparse direct solvers are used to solve the stationary forward model and we are able to recycle the Cholesky factors from the forward solve to compute the actions of the adjoint model in (39) and (42) for each sample. As a result, the computational cost of building the LIS is negligible compared to that of the MCMC simulation here (for both the single level and the recursive construction). This also explains why MLmixed performs almost identically to MLDILI.

However, in many other applications this is not possible due to the high storage cost or when the adjoint is different. Each action of the adjoint problem typically has a comparable cost to solving the forward model in the stationary case. It can even be more expensive than solving the forward model in time-dependent problems. To provide a thorough comparison in that case, we also report the total CPU time of all the estimators in figure 5 when the LIS setup cost is included. Here, we compute both the single level LIS and the recursive LIS without storing the Cholesky factors, to mimic the behaviour in the general, large-scale case. In this setup, we observe that a significant amount of computing effort is spent on building the LIS, and thus MLmixed and MLDILI significantly outperform the single level DILI for all error thresholds. MLmixed is more than 4 times faster than DILI even for the largest error threshold of $1.27 \times 10^{-2}$ . The construction of the single-level LIS requires two times more CPU time than performing the actual MCMC simulation in that case. In comparison, a significant number of adjoint model solves can be saved by the recursive LIS construction. Furthermore, we do expect that the computational cost for constructing the recursive LIS will stop increasing, since the dimension of the auxiliary LIS will eventually be zero at higher levels. Overall, for large–scale problems where the adjoint cannot be cheaply computed by recycling the forward model simulation, the recursive LIS construction, and hence the MLDILI, is clearly more computationally efficient than the single level DILI.

**Figure 5.** Left: total CPU time (in seconds) for the single level and recursive constructions of the LISs at level $2, 3$ and 4. Right: total CPU time (in seconds) for various methods to achieve different error tolerances. The LISs are constructed without recycling Cholesky factors. The dotted line represents a CPU day.
Download figure:
Standard image High-resolution image

**Figure 5.** Left: total CPU time (in seconds) for the single level and recursive constructions of the LISs at level $2, 3$ and 4. Right: total CPU time (in seconds) for various methods to achieve different error tolerances. The LISs are constructed without recycling Cholesky factors. The dotted line represents a CPU day.
Download figure:
Standard image High-resolution image

7. Conclusion

We integrate the DILI MCMC from [10] into the multilevel MCMC framework in [14] to improve the computational efficiency of estimating the expectation of functionals of interests over posterior measures. Several novel elements are introduced in this integration. We first design a Rayleigh–Ritz procedure to recursively construct likelihood informed subspaces that exploit the hierarchy of model discretisations. The resulting hierarchical LIS needs lower computational effort to construct and has lower operation cost compared to the original LIS proposed in [11]. Then, we present a new pooling strategy to couple Markov chains on consecutive levels. This enables more flexible parallelisation and management of computing resources. Finally, we design new coupled DILI proposals by exploiting the hierarchical LIS, so that the DILI proposal can be applied in the multilevel MCMC setting. We also demonstrate the efficacy of our integrated approach on a model inverse problem governed by an elliptic PDE.

Acknowledgment

T C acknowledges support from the Australian Research Council under the Grant DP210103092. G D was supported by the EPSRC Centre for Doctoral Training in Statistical Applied Mathematics at Bath (EP/L015684/1). R S acknowledges support by the Deutsche Forschungsgemeinschaft (German Research Foundation) under Germany's Excellence Strategy EXC 2181/1 – 390900948 (the Heidelberg STRUCTURES Excellence Cluster).

Data availability statement

No new data were created or analysed in this study.

Appendix A: Computational complexity of hierarchical LIS

Here we develop heuristics—under the following set of restrictive assumptions—to compare the complexities of the construction of the hierarchical LIS and of the single-level LIS, constructed directly on level L.

Assumption A.1.

(i)
The parameter dimensions satisfy $R_\ell = R_0 e ^ {\beta_\textrm{p} \ell}$ for some $\beta_\textrm{p} \gt 0$ .
(ii)
The number of auxiliary LIS basis vectors satisfies $s_\ell \unicode{x2A7D} s_0 \, e ^ {- \beta_\textrm{r} \ell}$ for some $\beta_\textrm{r} \gt 0$ .
(iii)
The degrees of freedom in the forward model satisfy $M_\ell = M_0 \, e ^ {\beta_\textrm{m} \ell}$ for some $\beta_\textrm{m} \gt 0$ .
(iv)
The computational cost of a matrix vector product with one sample of the Gauss–Newton Hessian $\mathit{H}_\ell(\boldsymbol v_\ell^{(k)})$ is proportional to one evaluation of the forward model and thus $\mathcal{O}(M_\ell^{\vartheta_\textrm{c}})$ (cf assumption 2.4).
(v)
The number of samples to compute the sample-averaged Gauss–Newton Hessian is the same on all levels, i.e. $K_\ell = K$ independent of $\ell$ .
(vi)
For the single-level LIS constructed on level L, we assume that the LIS dimension satisfies $r^\mathrm{single}_L \unicode{x2A7E} c\,r_0$ for some constant c > 0.

The storage cost of the hierarchical LIS basis and the storage cost of the single-level LIS basis on level L are, respectively,

$\begin{align*} \zeta_\textrm{multi} = {\textstyle \sum_{l = 0}^{L}}\, R_\ell\,s_\ell, \quad \textrm{and} \quad \zeta_\textrm{single} = \, R_L\, r^\textrm{single}_L\,. \end{align*}$

The floating point operations for one matrix vector product with the hierarchical LIS basis and with the single-level LIS basis are $O\big( \zeta_\textrm{ multi} \big)$ and $O\big( \zeta_\textrm{single} \big)$ , respectively, with the same hidden constant.

Corollary A.2. The reduction factor of storing and operating with the hierarchical LIS basis (as opposed to the standard single-level LIS on level L) satisfies the upper bound

$\begin{align} \frac{\zeta_\mathrm{multi}}{\zeta_\mathrm{single} } \unicode{x2A7D} \frac{1}{c}\, \min\left( L+1, \frac{1}{1 - e^{- | \beta_\textrm{p} - \beta_\textrm{r} |} }\right) \,e^{- \min\left(\beta_\textrm{p}, \beta_\textrm{r}\right) L} \,. \end{align} \tag{ 68 }$

Proof. Using assumption A.1, the required storage for the hierarchical and for the single-level LIS bases can be bounded by

$\begin{align*} \zeta_\textrm{multi} = {\textstyle \sum_{l = 0}^{L}}\, R_\ell\,s_\ell \unicode{x2A7D} R_0 \, s_0 {\textstyle \sum_{l = 0}^{L}} \, e^{\left(\beta_\textrm{p} - \beta_\textrm{r}\right) \ell} \ \ \ \text{and} \ \ \ \zeta_\textrm{single} = \, R_L\, r_L \unicode{x2A7E} c\, R_0 \, s_0\, e ^ {\beta_\textrm{p} L}. \end{align*}$

Thus, the reduction factor satisfies

$\begin{align} \frac{\zeta_\textrm{multi}}{\zeta_\textrm{single} } \unicode{x2A7D} \frac{1}{c} \, e^{-\beta_\textrm{p} L}\left({\textstyle \sum_{l = 0}^{L}} \, e ^ {\left(\beta_\textrm{p} - \beta_\textrm{r}\right) \ell}\right). \end{align} \tag{ 69 }$

We first consider the case $\beta_\textrm{p} \neq \beta_\textrm{r}$ . Using the property of geometric series, we have

$\begin{align*} {\textstyle \sum_{l = 0}^{L}} \, e^{\left(\beta_\textrm{p} - \beta_\textrm{r}\right) \ell} = \frac{1 - e^{\left(\beta_\textrm{p} - \beta_\textrm{r}\right) \left(L+1\right)} }{1 - e^{\left(\beta_\textrm{p} - \beta_\textrm{r} \right) }}. \end{align*}$

For the case $\beta_\textrm{p} \lt \beta_\textrm{r}$ , the reduction factor satisfies

$\begin{align} \frac{\zeta_\textrm{multi}}{\zeta_\textrm{single} } \unicode{x2A7D} \frac{1}{c} \, e^{-\beta_\textrm{p} L} \frac {1 - e^{\left(\beta_\textrm{p} - \beta_\textrm{r}\right) \left(L+1\right)} }{1 - e^{\left(\beta_\textrm{p} - \beta_\textrm{r} \right) }}, \end{align} \tag{ 70 }$

whereas for $\beta_\textrm{p} \gt \beta_\textrm{r}$ , the reduction factor satisfies

$\begin{align} \frac{\zeta_\textrm{multi}}{\zeta_\textrm{single} } \unicode{x2A7D} \frac{1}{c} \, e^{- \beta_\textrm{p} L} \frac{1 - e^{\left(\beta_\textrm{p} - \beta_\textrm{r}\right) \left(L+1\right)} }{1 - e^{\left(\beta_\textrm{p} - \beta_\textrm{r} \right) }} = \frac{1}{c} \, e^{-\beta_\textrm{r} L} \frac{1 - e^{\left(\beta_\textrm{r} - \beta_\textrm{p}\right) \left(L+1\right)} } {1 - e^{\left(\beta_\textrm{r} - \beta_\textrm{p} \right) }}. \end{align} \tag{ 71 }$

In both cases, the reduction factor can be expressed as

$\begin{align} \frac{\zeta_\textrm{multi}}{\zeta_\textrm{single} } \unicode{x2A7D} \frac{1}{c} \, e^{-\min\left(\beta_\textrm{p}, \beta_\textrm{r}\right) L} \frac {1 - a^{L+1} }{1 - a}, \end{align} \tag{ 72 }$

where $a = e^{- | \beta_\textrm{p} - \beta_\textrm{r} |} \in (0, 1)$ . Using induction, one can easily show that

$\begin{align} \frac{1 - a^{L+1} }{1 - a} \unicode{x2A7D} \min\left( L+1, \frac{1}{1 - a} \right) , \quad \forall L \unicode{x2A7E} 0, \forall a \in \left(0, 1\right), \end{align} \tag{ 73 }$

which completes the proof for $\beta_\textrm{p} \neq \beta_\textrm{r}$ .

For $\beta_\textrm{p} = \beta_\textrm{r} = \min(\beta_\textrm{p}, \beta_\textrm{ r})$ the result of corollary A.2 follows directly from (69) since in that case $\sum_{l = 0}^{L} \, e ^ {(\beta_\textrm{p} - \beta_\textrm{r}) \ell } = L+1$ . □

Using a similar derivation, we can also obtain the reduction factor for constructing the hierarchical LIS basis. The number of matrix vector products (with the sample-averaged Gauss–Newton Hessian $\widehat{\mathit{H}}_0$ ) in the construction of the base level LIS via the eigenproblems (39) is linear in the number of leading eigenvectors obtained, i.e. $\mathcal{O}(s_0)$ . The same holds for the number of matrix vector products with $\widehat{\mathit{H}}_\ell$ in the construction of the auxiliary LIS vectors in the recursive enrichment solving the eigenproblems in (42). Thus, the overall computational complexities for constructing the hierarchical LIS basis is

$\begin{align*} \chi_\textrm{multi} = \mathcal{O}\left( {\textstyle K \sum_{l = 0}^{L}} \, s_\ell \, M_\ell^{\vartheta_\textrm{c}} \right). \end{align*}$

Similarly, the construction of the single level LIS on level L is

$\begin{align*} \chi_\textrm{single} = \mathcal{O}\left(K r^\textrm{single}_L \, M_L^{\vartheta_\textrm{c}} \right), \end{align*}$

where the prefactors are the same. The following corollary can be proved in the same way as corollary A.2, since we have assumed that $M_\ell^{\vartheta_\textrm{c}} = M_0^ {\vartheta_\textrm{c}} \, e ^ {\beta_\textrm{m} \vartheta_\textrm{c} \ell}$ .

Corollary A.3. The reduction factor of building the hierarchical LIS basis (as opposed to the standard single-level LIS basis on level L) satisfies the upper bound

$\begin{align} \frac{\chi_\mathrm{multi}}{\chi_\mathrm{single} } \unicode{x2A7D} \frac{1}{c}\, \min\left( L+1, \frac{1}{1 - e^{- | \beta_\textrm{m}\vartheta_\textrm{c} - \beta_\textrm{r} |} }\right) \, e^{- \min\left(\beta_\textrm{m}\vartheta_\textrm{c} \,,\, \beta_\textrm{r}\right) L}\,. \end{align} \tag{ 74 }$

Appendix B: Proof of corollary 5.2

Due to the acceptance probability (49), we have

$\begin{align*} \beta^{}_\ell\left(\boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell}^{{\prime}} \right) & = \min\left\{1, \frac{\pi^{}_\ell\left(\boldsymbol v_{\ell}^{{\prime}} \,|\, {\boldsymbol{y}}\right)\,\pi^{}_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^\ast\,|\, {\boldsymbol{y}}\right)}{\pi^{}_\ell\left(\boldsymbol v_{\ell}^\ast \,|\, {\boldsymbol{y}}\right)\,\pi^{}_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^{\prime}\,|\, {\boldsymbol{y}}\right)} \, \frac{q \left( \boldsymbol v_{\ell,\, f}^\ast \,|\, \boldsymbol v_{\ell}^{\prime}, \boldsymbol v_{\ell\text{-}1}^\ast \right)}{q \left( \boldsymbol v_{\ell,f}^{\prime} \,|\, \boldsymbol v_{\ell}^\ast,\boldsymbol v_{\ell\text{-}1}^{\prime} \right)} \right\}, \end{align*}$

where, by definition,

$\begin{align*} \frac{\pi^{}_\ell\left(\boldsymbol v_{\ell}^{{\prime}} \,|\, {\boldsymbol{y}}\right)\,\pi^{}_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^\ast\,|\, {\boldsymbol{y}}\right)}{\pi^{}_\ell\left(\boldsymbol v_{\ell}^\ast \,|\, {\boldsymbol{y}}\right)\,\pi^{}_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^{\prime}\,|\, {\boldsymbol{y}}\right)} = \frac{p^{}_\ell\left(\boldsymbol v_{\ell}^{{\prime}}\right)\,p^{}_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^\ast\right)}{p^{}_\ell\left(\boldsymbol v_{\ell}^\ast\right)\,p^{}_{\ell\text{-}1}\left(\boldsymbol v_{\ell\text{-}1}^{\prime}\right)} \, \frac{\exp\left( \!\!-\! \eta^{}_\ell \left(\boldsymbol v_\ell^{{\prime}}; {\boldsymbol{y}}\right) \!+\! \eta^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell\text{-}1}^{{\prime}}; {\boldsymbol{y}}\right) \right)}{\exp\left( \!\!-\! \eta^{}_\ell \left(\boldsymbol v_\ell^\ast; {\boldsymbol{y}}\right) \!+\! \eta^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell\text{-}1}^\ast; {\boldsymbol{y}}\right) \right)}\,, \end{align*}$

such that we can write

$\begin{align} \beta^{}_\ell\left(\boldsymbol v_{\ell}^\ast, \boldsymbol v_{\ell}^{{\prime}} \right) \! = \! \min\!\left\{\! 1, \underbrace{\frac{p^{}_\ell \left(\boldsymbol v_{\ell}^{{\prime}}\right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell\text{-}1}^\ast\right)\,q \left( \boldsymbol v_{\ell,\, f}^\ast | \boldsymbol v_{\ell}^{\prime}, \boldsymbol v_{\ell\text{-}1}^\ast \right)}{p^{}_\ell \left(\boldsymbol v_{\ell}^\ast\right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell\text{-}1}^{\prime}\right)\,q \left( \boldsymbol v_{\ell,f}^{\prime} | \boldsymbol v_{\ell}^\ast,\boldsymbol v_{\ell\text{-}1}^{\prime} \right)} }_{\unicode{x2460}} \! \underbrace{\frac{\exp\left( \!\!-\! \eta^{}_\ell \left(\boldsymbol v_\ell^{{\prime}}; {\boldsymbol{y}}\right) \!+\! \eta^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell\text{-}1}^{{\prime}}; {\boldsymbol{y}}\right) \right)}{\exp\left( \!\!-\! \eta^{}_\ell \left(\boldsymbol v_\ell^\ast; {\boldsymbol{y}}\right) \!+\! \eta^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell\text{-}1}^\ast; {\boldsymbol{y}}\right) \right)} }_{\unicode{x2461}} \!\!\right\} . \end{align} \tag{ 75 }$

The level $\ell$ parameter vectors can be split as $\boldsymbol v_{\ell}^{{\prime}} = (\boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}})$ and $\boldsymbol v_{\ell}^\ast = (\boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast)$ . and we have $\boldsymbol v_{\ell,c}^{{\prime}} = \boldsymbol v_{\ell\text{-}1}^{{\prime}}$ and $\boldsymbol v_{\ell,c}^\ast = \boldsymbol v_{\ell\text{-}1}^\ast$ by construction in the coupling procedure. Thus,

$\begin{align} \unicode{x2460} = \frac{p^{}_\ell \left(\boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}}\right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^\ast\right)\,q \left( \boldsymbol v_{\ell,\, f}^\ast | \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}}, \boldsymbol v_{\ell,c}^\ast \right)}{p^{}_\ell \left(\boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast\right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^{{\prime}}\right)\,q \left( \boldsymbol v_{\ell,f}^{\prime} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast,\boldsymbol v_{\ell,c}^{{\prime}} \right)}. \end{align} \tag{ 76 }$

The density of the conditional DILI proposal $q \big( \boldsymbol v_{\ell,f}^{\prime} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast,\boldsymbol v_{\ell,c}^{{\prime}} \big)$ is defined as

$\begin{align} q \left( \boldsymbol v_{\ell,f}^{\prime} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast,\boldsymbol v_{\ell,c}^{{\prime}} \right) = \frac{q \left( \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \right)}{q \left( \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \right)}, \end{align} \tag{ 77 }$

that is the ratio between the DILI proposal density and the marginal DILI proposal density, which takes the form

$\begin{align} q \left( \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \right) \equiv \int q \left( \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \right) d\boldsymbol v_{\ell,f}^{{\prime}}. \end{align} \tag{ 78 }$

Due to corollary 5.1, the DILI proposal $q(\boldsymbol v_{\ell}^{{\prime}} | \boldsymbol v_{\ell}^\ast)$ has the prior distribution $p_\ell(\boldsymbol v_\ell)$ as invariant measure, i.e.

$\begin{align} p^{}_\ell \left(\boldsymbol v_{\ell}^\ast\right) q \left( \boldsymbol v_{\ell}^{{\prime}} | \boldsymbol v_{\ell}^\ast \right) = p^{}_\ell \left(\boldsymbol v_{\ell}^{{\prime}} \right). \end{align} \tag{ 79 }$

Hence, if $\boldsymbol v_{\ell}^\ast = (\boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast)$ is drawn from the prior $p_\ell(\boldsymbol v_{\ell})$ , then the proposal candidate $\boldsymbol v_{\ell}^{{\prime}} = (\boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}})$ also follows the prior $p_\ell(\boldsymbol v_{\ell})$ . Furthermore, if $\boldsymbol v_{\ell}^\ast$ is drawn from $p_\ell(\boldsymbol v_{\ell})$ , then the marginal DILI proposal $q \big( \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \big)$ generates candidates with coarse components that follow the marginal prior

$\begin{equation*} \int p^{}_\ell\left(\boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}}\right) d\boldsymbol v_{\ell,f}^{{\prime}}, \end{equation*}$

which for our particular choice of parametrisation is the same as the prior $p_{\ell\text{-}1} \big(\boldsymbol v_{\ell,c}^{{\prime}}\big)$ on level $\ell\text{-}1$ , that is, $p^{}_\ell \big(\boldsymbol v_{\ell}^\ast\big) q \big( \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell}^\ast \big) = p^{}_{\ell\text{-}1} \big(\boldsymbol v_{\ell,c}^{{\prime}} \big)$ . Using this identity and substituting (77) into (76), the ratio $\unicode{x2460}$ can be simplified to

$\begin{align*} \unicode{x2460} & = \frac{p^{}_\ell \left(\boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}}\right)\,q \left( \boldsymbol v_{\ell,\, f}^\ast, \boldsymbol v_{\ell,c}^\ast | \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}}\right)\,q \left( \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^\ast\right)} {p^{}_\ell \left(\boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast\right)\,q \left( \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \right)\,q \left( \boldsymbol v_{\ell,c}^\ast | \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}} \right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^{{\prime}}\right)} \\ & = \frac{p^{}_\ell \left(\boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast\right)\,q \left( \boldsymbol v_{\ell,c}^{{\prime}} | \boldsymbol v_{\ell,f}^\ast, \boldsymbol v_{\ell,c}^\ast \right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^\ast\right)} {p^{}_\ell \left( \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}} \right)\,q \left( \boldsymbol v_{\ell,c}^\ast | \boldsymbol v_{\ell,f}^{{\prime}}, \boldsymbol v_{\ell,c}^{{\prime}} \right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^{{\prime}}\right)} = \frac{p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^{{\prime}} \right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^\ast\right)} {p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^\ast \right)\,p^{}_{\ell\text{-}1} \left(\boldsymbol v_{\ell,c}^{{\prime}}\right)} = 1. \end{align*}$

The result then follows immediately from (75).

Multilevel dimension-independent likelihood-informed MCMC for large-scale inverse problems

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Background

2.1. Bayesian inference framework

2.2. Dimension-independent likelihood-informed MCMC on function space

2.3. Posterior discretisation and bias-variance decomposition

3. Multilevel MCMC

3.1. Variance management

3.2. Notations

4. Multilevel LIS

4.1. Setup

4.2. Base level LIS

4.3. LIS enrichment

5. Multilevel DILI MCMC

5.1. Coupled DILI proposal

5.1.1. DILI proposal.

5.1.2. Conditional DILI proposal.

5.1.3. Generating conditional samples.

5.2. Final MLDILI algorithm

5.3. Pooling strategy

6. Numerical experiments

6.1. Setup

6.2. Comparisons

7. Conclusion

Acknowledgment

Data availability statement

Appendix A: Computational complexity of hierarchical LIS

Appendix B: Proof of corollary 5.2

Multilevel dimension-independent likelihood-informed MCMC for large-scale inverse problems

Article metrics

Submit

Share this article

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Background

2.1. Bayesian inference framework

2.2. Dimension-independent likelihood-informed MCMC on function space

2.3. Posterior discretisation and bias-variance decomposition

3. Multilevel MCMC

3.1. Variance management

3.2. Notations

4. Multilevel LIS

4.1. Setup

4.2. Base level LIS

4.3. LIS enrichment

5. Multilevel DILI MCMC

5.1. Coupled DILI proposal

5.1.1. DILI proposal.

5.1.2. Conditional DILI proposal.

5.1.3. Generating conditional samples.

5.2. Final MLDILI algorithm

5.3. Pooling strategy

6. Numerical experiments

6.1. Setup

6.2. Comparisons

7. Conclusion

Acknowledgment

Data availability statement

Appendix A: Computational complexity of hierarchical LIS

Appendix B: Proof of corollary 5.2