1 Introduction

This paper is concerned with the efficient numerical solution of matrix equations that arise from implicit time integration of large systems of ordinary differential equations (ODEs). More specifically, we treat generalized Sylvester equations of the form

$$\begin{aligned} AXB_2^T + MXB_1^T=F, \end{aligned}$$
(1)

where \(F\in {\mathbb {R}}^{n\times n_t}\) and the columns of the unknown matrix \(X\in {\mathbb {R}}^{n\times n_t}\) contain the approximate solution of the ODE at \(n_t\) time steps with constant step size. The matrices \(A,M\in {\mathbb {R}}^{n\times n}\) are large and sparse, as they arise, for example, from the finite difference/element spatial discretization of a time-dependent partial differential equation (PDE). The matrices \(B_1,B_2\in {\mathbb {R}}^{n_t\times n_t}\) are banded Toeplitz matrices and contain the coefficients of the time integrator. We assume throughout this work that M and \(B_2\) are invertible.

Vectorizing (1) and using the Kronecker product \(\otimes \), we obtain the equivalent \(nn_t\times nn_t\) linear system

$$\begin{aligned} \left( B_2\otimes A +B_1\otimes M\right) {\textbf{x}}={\textbf{f}}, \end{aligned}$$
(2)

where \({\textbf{x}} = \textrm{vec}(X)\) and \({\textbf{f}} = \textrm{vec}(F)\). While the theoretical properties of generalized Sylvester equations are well understood and various numerical solution strategies have been developed [6, 46], the structure of the matrix coefficients in (1) comes with particular challenges that are addressed in this work.

Illustrative example. To illustrate (1) and motivate our developments, we consider the ODE

$$\begin{aligned} {\left\{ \begin{array}{ll}M\dot{{\textbf{u}}}(t)+A{\textbf{u}}(t)= {\textbf{f}}(t),&{}t\in (0, T],\\ {\textbf{u}}(0)={\textbf{x}}_0,\end{array}\right. } \end{aligned}$$
(3)

with \({\textbf{u}}(t)\in {\mathbb {R}}^n\). When such an ODE arises from the finite element discretization of a linear time-dependent PDE, A and M correspond to the stiffness matrix and the mass matrix, respectively, and M is symmetric positive definite. When using a finite difference discretization, we have \(M = I_n\). The implicit Euler method with step size \(\Delta t = T / n_t\) applied to (3) requires the successive solution of the linear systems

$$\begin{aligned} (\Delta t^{-1}M+ A){\textbf{x}}_k = {\textbf{f}}_k + \Delta t^{-1}M {\textbf{x}}_{k-1}, \qquad k=1,\dots , n_t, \end{aligned}$$
(4)

where \({\textbf{x}}_k \approx {\textbf{u}}(k\cdot \Delta t)\). Stacking these equations yields an \(nn_t\times nn_t\) linear system of the form

$$\begin{aligned} \begin{bmatrix} \Delta t^{-1}M+ A\\ -\Delta t^{-1}M&{}\Delta t^{-1}M+ A\\ &{}\ddots &{}\ddots \\ &{}&{}-\Delta t^{-1}M&{}\Delta t^{-1}M+ A \end{bmatrix}\begin{bmatrix} {\textbf{x}}_1\\ {\textbf{x}}_2\\ \vdots \\ {\textbf{x}}_{n_t} \end{bmatrix}=\begin{bmatrix}{\textbf{f}}_1+ \Delta t^{-1}M{\textbf{x}}_0\\ {\textbf{f}}_2\\ \vdots \\ {\textbf{f}}_{n_t} \end{bmatrix}. \end{aligned}$$

The system matrix can be rewritten in the form \(B_2\otimes A +B_1\otimes M\) from (2) by setting

$$\begin{aligned} B_1= \frac{1}{\Delta t}\begin{bmatrix} 1&{}\\ -1&{}\quad 1\\ &{}\ddots &{}\quad \ddots \\ &{}&{}\quad -1&{}\quad 1 \end{bmatrix},\qquad B_2=I_{n_t}. \end{aligned}$$
(5)

In turn, the linear system can be rephrased as a matrix equation of the form (1).
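As an illustration, the following Python sketch assembles the ingredients of (1) for the implicit Euler discretization above; it is a minimal example under the stated setting, with `f` a hypothetical callable returning the load vector \({\textbf{f}}(t)\) and `x0` the initial value.

```python
import numpy as np
import scipy.sparse as sp

def implicit_euler_all_at_once(A, M, f, x0, T, nt):
    """Assemble B1, B2 and the right-hand side F of (1) for implicit Euler.
    The initial condition is folded into the first column of F, cf. (4)."""
    dt = T / nt
    B1 = sp.diags([np.ones(nt), -np.ones(nt - 1)], [0, -1]) / dt   # matrix in (5)
    B2 = sp.identity(nt)                                           # B2 = I
    F = np.column_stack([f(k * dt) for k in range(1, nt + 1)])     # f_1, ..., f_{nt}
    F[:, 0] += (M @ x0) / dt                                       # + dt^{-1} M x_0
    return B1.tocsr(), B2.tocsr(), F
```

With these factors, the generalized Sylvester equation \(AXB_2^T+MXB_1^T=F\) reproduces the stacked linear system above column by column.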

Taking a closer look at (2), we identify the factors that determine this particular structure. The Kronecker structure of the coefficient matrix is due to the time-independence of the coefficients of the linear ODE (3). The matrices \(B_1,B_2\) are Toeplitz because of the use of a constant step size. Indeed, this property is maintained when replacing implicit Euler by implicit multistep methods [27] such as BDF (backward differentiation formula). When using Runge–Kutta integrators [27], we still get banded Toeplitz matrices \(B_1,B_2\) but – as we will see in Sect. 4 – we have to embed A, M into larger matrices in order to arrive at a matrix equation of the form (1).

Existing work. Classical time integration approaches for ODEs proceed by solving the discretized equations, like (4), sequentially: first for \(k =1\), then for \(k = 2\), and so on. This limits the potential for parallelizing the implementation of the integrator. The Parareal algorithm [22, 33] avoids this limitation by combining, iteratively, coarse (cheap) time steps that are performed sequentially with fine (expensive) time steps that are performed in parallel. There have been many developments and modifications of Parareal during the last two decades; we refer to [15, 21] for overviews and a historical perspective. Our work relates to the so-called ParaDiag algorithms [20], which proceed somewhat differently from Parareal by explicitly exploiting the structure of the matrix Eq. (1). In passing, we mention that linear ODEs with constant coefficients, like the model problem (3), can be turned into an independent set of linear systems by combining an integral transform, like the Laplace transform, with quadrature for the numerical evaluation of the inverse transform; see [15, Sec. 5.5] and the references therein.

The basic idea of ParaDiag is simple: If \(B_2^{-1} B_1\) can be diagonalized by a similarity transformation then (1) decouples into \(n_t\) linear systems, which can be solved in parallel. The pitfall of this approach is that the most straightforward choice of time steps, constant time steps, leads to non-diagonalizable matrices; indeed, the matrix \(B_2^{-1} B_1\) in (5) is one big Jordan block. When choosing a geometrically increasing sequence of time steps, such as \(\Delta t_j= \Delta t_1 \tau ^{j-1}\) for some \(\tau >1\), the matrix \(B_2^{-1}B_1\) has mutually distinct eigenvalues and thus becomes diagonalizable [35]. However, even when using a second order method in time, e.g., the Crank-Nicolson method, the condition number of the transformation matrix grows rapidly as \(n_t\) increases, which severely limits the number of time steps [18, 19].

A second class of ParaDiag algorithms allows for uniform time steps and perturbs \(B_1, B_2\) to enforce diagonalizability. More specifically, the perturbed system matrix takes the form

$$\begin{aligned} P_{\alpha }=C_2^{(\alpha )}\otimes A+C_1^{(\alpha )}\otimes M, \end{aligned}$$
(6)

where \(C_1^{(\alpha )},C_2^{(\alpha )}\) are Strang-type \(\alpha \)-circulant matrices [9, 38, 48] constructed from the Toeplitz matrices \(B_1,B_2\). This matrix can be used within a stationary iteration \(P_{\alpha }{\textbf{x}}^{(j)}=(P_{\alpha }-B_2\otimes A - B_1\otimes M){\textbf{x}}^{(j-1)}+{\textbf{f}}\) [34] or as a preconditioner for GMRES [10, 37] applied to (2). In the solution of linear systems with \(P_{\alpha }\) one takes advantage of the fact that the matrices \(C_1^{(\alpha )},C_2^{(\alpha )}\) are simultaneously diagonalized by a scaled Fourier transform:

$$\begin{aligned} \Omega ^*D_{\alpha }C_1^{(\alpha )}D_{\alpha }^{-1}\Omega&= \textrm{diag}(\lambda _{1,1},\dots , \lambda _{1,n_t}), \\ \Omega ^*D_{\alpha }C_2^{(\alpha )}D_{\alpha }^{-1}\Omega&= \textrm{diag}(\lambda _{2,1},\dots , \lambda _{2,n_t}), \end{aligned}$$

where \(\Omega \in {\mathbb {C}}^{n_t\times n_t}\) is the discrete Fourier transform matrix and \(D_{\alpha }=\textrm{diag}(1, \alpha ^{\frac{1}{n_t}},\dots , \alpha ^{\frac{n_t-1}{n_t}})\). In turn, \( P_{\alpha }\) is block diagonalized:

$$\begin{aligned} (\Omega ^*D_{\alpha }\otimes I_n)P_{\alpha }(D_{\alpha }^{-1}\Omega \otimes I_n)= \begin{bmatrix}\lambda _{1,1}M+\lambda _{2,1}A\\ &{}\ddots \\ &{}&{}\lambda _{1,n_t}M+\lambda _{2,n_t}A\end{bmatrix}. \end{aligned}$$

Using the fast Fourier transform (FFT), this allows us to compute the solution of a linear system with \(P_{\alpha }\) by combining FFT with the (embarrassingly parallel) solution of \(n_t\) linear systems with the matrices \(\lambda _{1,j}M +\lambda _{2,j}A\), \(j = 1,\ldots ,n_t\). This procedure, rephrased for the matrix equation corresponding to \(P_{\alpha }{\textbf{x}}={\textbf{f}}\), is summarized in Algorithm 1. Note that, even though the original ODE (3) features real matrices, Algorithm 1 requires the solution of complex linear systems, which is more expensive (often by a factor 2–4) than solving real linear systems. This disadvantage of having to pass to complex arithmetic is shared by all diagonalization based methods discussed in this work.

(Algorithm 1: solution of \(P_{\alpha }{\textbf{x}}={\textbf{f}}\), i.e., of \(AX(C_2^{(\alpha )})^T+MX(C_1^{(\alpha )})^T=F\), via FFT-based block diagonalization; pseudocode figure.)
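To make Algorithm 1 concrete, the following is a minimal dense-matrix sketch in Python/NumPy (an illustration under simplifying assumptions, not the implementation used later). It assumes that c1 and c2 are the first columns of \(C_1^{(\alpha )},C_2^{(\alpha )}\) (which, for Strang-type \(\alpha \)-circulants, coincide with the first columns of \(B_1,B_2\)), and it uses the fact that \(D_{\alpha }C_i^{(\alpha )}D_{\alpha }^{-1}\) is an ordinary circulant, so that its eigenvalues are the FFT of its scaled first column.

```python
import numpy as np

def paradiag_alpha_solve(A, M, c1, c2, F, alpha):
    """Sketch of Algorithm 1: solve A X (C2^(alpha))^T + M X (C1^(alpha))^T = F.
    A, M are dense n x n arrays; c1, c2 hold the first columns of the
    alpha-circulants. The returned X is complex (real up to roundoff when
    alpha and the data are real)."""
    n, nt = F.shape
    alpha = complex(alpha)                       # principal branch for fractional powers
    d = alpha ** (np.arange(nt) / nt)            # diagonal of D_alpha
    lam1 = np.fft.fft(d * c1)                    # eigenvalues lambda_{1,j}
    lam2 = np.fft.fft(d * c2)                    # eigenvalues lambda_{2,j}
    G = np.fft.fft(F * d, axis=1)                # scale columns by D_alpha, DFT in time
    W = np.empty((n, nt), dtype=complex)
    for j in range(nt):                          # embarrassingly parallel solves
        W[:, j] = np.linalg.solve(lam2[j] * A + lam1[j] * M, G[:, j])
    return np.fft.ifft(W, axis=1) / d            # inverse DFT in time, undo D_alpha
```

One call costs an FFT/IFFT pair in the time direction plus \(n_t\) independent complex linear system solves, in line with the complexity discussed later.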

The use of the preconditioner \(P_\alpha \) can dramatically accelerate the convergence of an iterative solver for (2). This has been observed for parabolic problems in [37] when using \(P_{\alpha }\) with \(\alpha =\pm 1\) as a preconditioner in GMRES. In [20], wave propagation problems (integrated in time with a leap-frog finite difference scheme) have been successfully treated by choosing a relatively small value for \(0<\alpha <1\). Note, however, that \(\alpha \) cannot be chosen arbitrarily small because otherwise the condition number of the matrix \(D_\alpha \) (given by \(\alpha ^{(1-n_t)/n_t} \approx 1/\alpha \)) explodes and, in turn, roundoff error impedes convergence. In practice, this means that the choice of \(\alpha \) needs to strike a compromise, and no choice of \(\alpha \) yields immediate convergence. In turn, several iterations are needed to attain good accuracy, which adds significant computational overhead and communication cost because every iteration calls Algorithm 1 once.
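The following sketch shows how such a preconditioned iteration could be wired up with SciPy's GMRES, reusing the paradiag_alpha_solve sketch following Algorithm 1; the function and parameter names are illustrative, and dense A, M are assumed as in that sketch.

```python
import numpy as np
import scipy.sparse.linalg as spla

def solve_with_paradiag_gmres(A, M, B1, B2, c1, c2, F, alpha=1.0):
    """Sketch: GMRES on the all-at-once system (2), preconditioned by one call
    to the alpha-circulant solver per iteration."""
    n, nt = F.shape
    def matvec(x):
        X = x.reshape(n, nt, order='F')
        Y = A @ (B2 @ X.T).T + M @ (B1 @ X.T).T      # (B2 (x) A + B1 (x) M) vec(X)
        return Y.reshape(-1, order='F')
    def apply_prec(r):
        R = r.reshape(n, nt, order='F')
        return paradiag_alpha_solve(A, M, c1, c2, R, alpha).real.reshape(-1, order='F')
    op = spla.LinearOperator((n * nt, n * nt), matvec=matvec, dtype=float)
    P = spla.LinearOperator((n * nt, n * nt), matvec=apply_prec, dtype=float)
    x, info = spla.gmres(op, F.reshape(-1, order='F'), M=P)
    return x.reshape(n, nt, order='F'), info
```

Each GMRES iteration thus incurs one call to Algorithm 1, which is precisely the overhead that the two methods proposed below aim to avoid.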

New contributions. In this paper, we propose two novel strategies that avoid the overhead incurred by iterative procedures for solving (2).

Our first method, presented in Sect. 2, exploits the trivial observation that \(C_1^{(\alpha )} - B_1\) and \(C_2^{(\alpha )} - B_2\) have very low rank. Thus, the Sylvester equation addressed by Algorithm 1 can be viewed as a low-rank modification of the original Eq. (1), which allows us to apply the low-rank update from [30] for linear matrix equations. This update corrects the output of Algorithm 1 by solving a Sylvester equation with the same coefficients as (1) but with a different right-hand side that has low rank. The main novelty of Sect. 2 is in the analysis of the low-rank structure of the solution to this equation and the development of efficient algorithms.

Our second method, presented in Sect. 3, takes an interpolation point of view: Whenever we solve the linear system \(P_{\alpha }{\textbf{x}}={\textbf{f}}\) we are actually evaluating a vector-valued function \({\textbf{x}}(\alpha )\), depending on the complex parameter \(\alpha \), such that \({\textbf{x}}(0)\) corresponds to the solution of (2). More precisely, this function can be efficiently and accurately evaluated on a scaled unit circle. We suggest constructing an approximation for \({\textbf{x}}(0)\) via a linear combination of \({\textbf{x}}(\alpha )\) evaluated at all dth order roots of a complex number, for some small value of d. This strategy is highly parallelizable as it requires the solution of d completely independent linear systems with coefficient matrices of the form \(P_{\alpha }\). Moreover, it also applies to cases where the matrices \(\delta B_j^{(\alpha )}\) are not low-rank, for example, when a time fractional differential operator is involved.

Section 4 extends our framework to implicit Runge–Kutta time integrators.

Section 5 is dedicated to a range of numerical experiments that highlight the advantages of our method for a broad range of situations.

2 Low-rank update approach

For a lower triangular banded Toeplitz matrix B, Strang’s \(\alpha \)-circulant preconditioner [38] is obtained from reflecting the lower band to the top right corner and multiplying it by \(\alpha \):

$$\begin{aligned} B=\begin{bmatrix} b_0 \\ \vdots &{} \ddots \\ b_q &{} \ddots &{} \ddots \\ &{} \ddots &{} \ddots &{} \ddots \\ &{} &{} b_q &{} \cdots &{} b_0 \end{bmatrix} \quad \longrightarrow \quad C^{(\alpha )}=B+\alpha \begin{bmatrix} &{} &{} b_q &{} \cdots &{} b_1\\ &{} &{} &{} \ddots &{} \vdots \\ &{} &{} &{} &{} b_q\\ &{} &{} &{} &{} \\ &{} &{} &{} &{} \end{bmatrix}. \end{aligned}$$

In particular, this implies that the rank of this modification is bounded by the lower bandwidth of B. Applied to the time stepping matrices \(B_1\), \(B_2\), we obtain that

$$\begin{aligned} C_1^{(\alpha )}=B_1+\delta B_1^{(\alpha )}, \qquad C_2^{(\alpha )}=B_2+\delta B_2^{(\alpha )} \end{aligned}$$

for certain low-rank matrices \(\delta B_1^{(\alpha )},\delta B_2^{(\alpha )}\). For instance, for the implicit Euler method (see (5)) we have \(\delta B_1^{(\alpha )}=-\frac{\alpha }{\Delta t}{\textbf{e}}_1 {\textbf{e}}_{n_t}^T\), where \({\textbf{e}}_j\) denotes the jth unit vector of length \(n_t\), and \(\delta B_2^{(\alpha )}=0\). Inspired by [30], we write the solution of (1) as \(X=X_0+\delta X\), where

$$\begin{aligned} AX_0(C_2^{(\alpha )})^T+MX_0(C_1^{(\alpha )})^T&= F, \end{aligned}$$
(7)
$$\begin{aligned} A\delta XB_2^T+M\delta XB_1^T&=AX_0(\delta B_2^{(\alpha )})^T + MX_0(\delta B_1^{(\alpha )})^T. \end{aligned}$$
(8)

After Eq. (7) is solved via Algorithm 1, the correction Eq. (8) becomes a generalized Sylvester equation in \(\delta X\) that has the same coefficients as (1) but a different right-hand side of rank at most \(\hbox {rank}(\delta B_1^{(\alpha )})+\hbox {rank}(\delta B_2^{(\alpha )})\). This remarkable property enables us to make use of low-rank solvers for linear matrix equations (see [46] for an overview) in order to approximate \(\delta X\). Note that the choice of \(\alpha \ne 0\) has no influence on this strategy; for this reason we always choose \(\alpha =1\) when using the low-rank update approach. The resulting procedure is summarized in Algorithm 2.

(Algorithm 2: solve (7) with Algorithm 1 for \(\alpha =1\), solve the correction Eq. (8) with a low-rank Sylvester solver, and return \(X=X_0+\delta X\); pseudocode figure.)
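For illustration, the following dense-matrix sketch mirrors Algorithm 2, using SciPy's dense Sylvester solver in place of RKSM for the correction Eq. (8)/(9); it is meant only to expose the splitting \(X=X_0+\delta X\), not as a large-scale solver, and it reuses the paradiag_alpha_solve sketch following Algorithm 1.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def lowrank_update_solve(A, M, B1, B2, dB1, dB2, F):
    """Dense sketch of Algorithm 2: dB1, dB2 are the modifications
    C_i^(1) - B_i; all matrices are dense arrays."""
    X0 = paradiag_alpha_solve(A, M, B1[:, 0], B2[:, 0], F, alpha=1.0).real  # Eq. (7)
    R = A @ X0 @ dB2.T + M @ X0 @ dB1.T                  # low-rank rhs of (8)
    C = np.linalg.solve(M, np.linalg.solve(B2, R.T).T)   # M^{-1} R B2^{-T}, rhs of (9)
    At = np.linalg.solve(M, A)                           # M^{-1} A
    Bt = np.linalg.solve(B2, B1)                         # B2^{-1} B1
    dX = solve_sylvester(At, Bt.T, C)                    # At dX + dX Bt^T = C
    return X0 + dX
```

In the large-scale setting, the dense Sylvester solve is replaced by RKSM applied to the rank-k right-hand side, as described in Sect. 2.1.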

Remark 1

It is not uncommon that the right-hand side matrix F of (1) has (numerically) low rank because, for example, the inhomogeneity \({\textbf{f}}\) of the ODE (3) is constant or varies smoothly with respect to t. In such a situation, a low-rank solver can be applied directly to (1) and the following approach has been suggested by Palitta [39]: A block Krylov subspace method is used for reducing the spatial dimension, combined with an application of the Sherman–Morrison–Woodbury formula to the resulting reduced equation. Our low-rank approach has the advantage that it is agnostic to properties of F, at the expense of having to solve (7). In principle, Palitta’s method can be applied to solve the correction Eq. (8) but we found it more efficient for large \(n_t\) to use a two-sided approach that reduces the temporal dimension as well.

2.1 Solution of correction equation

In the following, we discuss the application of low-rank solvers to the correction Eq. (8). In view of the (assumed) invertibility of M and \(B_2\) we replace (8) by the following Sylvester matrix equation:

$$\begin{aligned} {\widetilde{A}}\delta X+\delta X{\widetilde{B}}^T=C, \end{aligned}$$
(9)

where

$$\begin{aligned} {\widetilde{A}}= M^{-1}A, \quad {\widetilde{B}}= B_2^{-1}B_1, \quad C= M^{-1}(AX_0(\delta B_2^{(1)})^T + MX_0(\delta B_1^{(1)})^T)B_2^{-T} \end{aligned}$$

and \(\hbox {rank}(C)\le k:=\hbox {rank}(\delta B_1^{(1)})+\hbox {rank}(\delta B_2^{(1)})\ll \min \{n,n_t\}\). For given factorizations \(\delta B_1^{(1)}=U_1V_1^*\) and \(\delta B_2^{(1)}=U_2V_2^*\), a low-rank factorization \(C=UV^*\), with \(U\in {\mathbb {R}}^{n\times k}\) and \(V\in {\mathbb {R}}^{n_t\times k}\), is obtained by setting

$$\begin{aligned} U = [M^{-1}AX_0 V_2, \ X_0V_1], \qquad V=[B_2^{-1}U_2,\ B_2^{-1}U_1]. \end{aligned}$$
(10)

Alternatively, when M is symmetric positive definite and its Cholesky factorization \(M=LL^T\) is available, a possible symmetry of A is preserved by considering

$$\begin{aligned} A_L\,\delta X_L+\delta X_L{\widetilde{B}}^T=C_L, \end{aligned}$$
(11)

with \(A_L=L^{-1}AL^{-T}\), \(\delta X_L=L^T\delta X\), \({\widetilde{B}}=B_2^{-1}B_1\), and \(C_L=L^{-1}\big (AX_0(\delta B_2^{(1)})^T + MX_0(\delta B_1^{(1)})^T\big )B_2^{-T}\). In analogy to (10), an explicit rank-k factorization of \(C_L\) can be derived. In the rest of this section we focus on analyzing and solving (9) but we remark that our findings extend to (11) with minor modifications.

Various numerical methods have been developed to treat large-scale Sylvester equations with low-rank right-hand side, including the Alternating Direction Implicit method (ADI) [14, 40] and the Rational Krylov Subspace Method (RKSM) [5, 26, 42]. These methods compute approximate solutions of the form \(\widetilde{\delta X}= WYZ^*\), where W and Z are bases of (block) rational Krylov subspaces generated with the matrices \({\widetilde{A}},{\widetilde{B}}\) and starting block vectors U and V, respectively. More precisely, given two families \(\mathbf \xi =\{\xi _1,\dots ,\xi _{\ell }\}\), \(\mathbf \psi =\{\psi _1,\dots ,\psi _{\ell }\}\subset {\mathbb {C}}\) of so-called shift parameters, W and Z are chosen as bases of the following subspaces:

$$\begin{aligned} {\mathcal {R}}{\mathcal {K}}({\widetilde{A}},U,\mathbf \xi )&= \textrm{span}\big ((\xi _1I-{\widetilde{A}})^{-1}U,\dots , (\xi _{\ell }I-{\widetilde{A}})^{-1}U\big ),\nonumber \\ {\mathcal {R}}{\mathcal {K}}({\widetilde{B}},V,\mathbf \psi )&= \textrm{span}\big ((\psi _1I-{\widetilde{B}})^{-1}V,\dots , (\psi _{\ell }I-{\widetilde{B}})^{-1}V\big ). \end{aligned}$$
(12)

When shifts are repeated, the matrix power is increased. For example, when a single shift \(\xi _1\) is repeated \(\ell \) times, one uses \(\textrm{span}\big ((\xi _1I-{\widetilde{A}})^{-1}U,\dots , (\xi _{1}I-{\widetilde{A}})^{-\ell }U\big )\). We refer to the relevant literature [5, 13, 46] for a complete description of the computation of the factors W, Y, Z in these methods.

Let us comment on some implementation aspects of RKSM, which will be the method of choice in our implementation of Algorithm 2, line 4. In RKSM, the middle factor Y in the approximation \(\widetilde{\delta X}= WYZ^*\) is obtained by solving a compressed matrix equation, which does not contribute significantly to the overall computational time as long as the dimensions of the Krylov subspaces remain modest. Computing W and Z requires solving shifted linear systems with the matrices \({\widetilde{A}},{\widetilde{B}}\); these operations can be performed without forming \(M^{-1}A\) and \(B_2^{-1}B_1\) explicitly, by means of the relations

$$\begin{aligned} (\xi _jI-{\widetilde{A}})^{-1}{\textbf{v}}= (\xi _jM-A)^{-1}M{\textbf{v}},\qquad (\psi _jI-{\widetilde{B}})^{-1}{\textbf{v}}= (\psi _jB_2-B_1)^{-1}B_2{\textbf{v}}. \end{aligned}$$

In particular, we can still leverage properties like sparsity in A and M, or the bandedness of \(B_1,B_2\), when solving shifted linear systems with \({\widetilde{A}},{\widetilde{B}}\). Our implementation of RKSM relies on the Matlab toolbox rk-toolbox [7] for generating the factors W, Z. Finally and importantly, the selection of the shift parameters \(\mathbf \xi ,\mathbf \psi \) strongly affects the convergence of RKSM and we discuss suitable choices in the following section.
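As a minimal illustration of how the bases in (12) can be generated with the relations above (and not as a replacement for rk-toolbox), the following sketch builds a non-incremental basis of the block rational Krylov space for \({\widetilde{A}}=M^{-1}A\), assuming sparse A, M and pairwise distinct shifts.

```python
import numpy as np
from scipy.sparse.linalg import splu

def shifted_solve(A, M, xi):
    """Return a function applying (xi*I - M^{-1}A)^{-1} as (xi*M - A)^{-1} (M R),
    reusing one sparse LU factorization per shift."""
    lu = splu(((xi + 0j) * M - A).tocsc())
    return lambda R: lu.solve(np.asarray(M @ R, dtype=complex))

def rational_krylov_basis(A, M, U, shifts):
    """Naive sketch of an orthonormal basis of RK(Atilde, U, shifts) in (12)."""
    blocks = [shifted_solve(A, M, xi)(U) for xi in shifts]
    Q, _ = np.linalg.qr(np.hstack(blocks))
    return Q
```

An analogous routine with \(B_2,B_1\) in place of M, A produces the basis Z; practical implementations build the bases incrementally with reorthogonalization, as done in rk-toolbox.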

2.2 Low-rank approximability of the correction equation

In this section, we analyze the numerical low-rank structure of the solution \(\delta X\) to the correction Eq. (9), which also yields suitable choices of shift parameters in RKSM discussed above.

By the Eckart-Young-Mirsky theorem, \(\delta X\) admits good low-rank approximations if its singular values \(\sigma _j(\delta X)\) decay rapidly. Several works, including [1, 2, 41, 43], have established singular value decay bounds for solutions of Sylvester matrix equations. In particular, the framework presented in [4] applies to (9) and shows that [4, P. 323]

$$\begin{aligned} \sigma _{1+kj}(\delta X) \le \min _{r(z)\in {\mathcal {R}}_{j,j}} \Vert r({\widetilde{A}})\Vert _2 \Vert r(-{\widetilde{B}})^{-1}\Vert _2 \Vert \delta X\Vert _2, \end{aligned}$$
(13)

where \({\mathcal {R}}_{j,j}\) denotes the set of rational functions having numerator and denominator of degree at most j. Choosing a good candidate for the rational function r is difficult, especially when \({\widetilde{A}}\) and/or \({\widetilde{B}}\) are non-normal. To circumvent this difficulty, we loosen the bound (13) by considering the numerical range \({\mathcal {W}}({\widetilde{A}}) = \{ x^* {\widetilde{A}} x:\Vert x\Vert _2 = 1\} \subset {\mathbb {C}}\). A result by Crouzeix and Palencia [11] states that

$$\begin{aligned} \Vert r({\widetilde{A}})\Vert _2 \le (1+\sqrt{2}) \max _{z \in {\mathcal {W}}({\widetilde{A}})} |r(z)| \le (1+\sqrt{2}) \max _{z \in E} |r(z)|, \end{aligned}$$

where the constant \(1+\sqrt{2}\) can be omitted when \({\widetilde{A}}\) is normal and the set \(E\supseteq {\mathcal {W}}({\widetilde{A}})\) is chosen to feature a simple geometry, for which the resulting rational approximation problem admits explicit solutions. Similarly, one has

$$\begin{aligned} \Vert r(-{\widetilde{B}})^{-1}\Vert _2 \le (1+\sqrt{2}) \max _{z \in F} |r(z)|^{-1} = (1+\sqrt{2}) \big ( \min _{z \in F} |r(z)| \big )^{-1} \end{aligned}$$

for any set \(F \subset {\mathbb {C}}\) with \(F\supseteq {\mathcal {W}}(-{\widetilde{B}})\). Inserting these inequalities into (13), we arrive at

$$\begin{aligned} \frac{\sigma _{1+kj}(\delta X)}{\Vert \delta X\Vert _2}\le (1+\sqrt{2})^a Z_j(E, F), \qquad Z_j(E, F):=\min _{r(z)\in {\mathcal {R}}_{j,j}}\frac{\max _{z\in E} |r(z)|}{\min _{z\in F} |r(z)|}, \end{aligned}$$

where the exponent a satisfies \(a = 0\) if both \({\widetilde{A}},{\widetilde{B}}\) are normal, \(a = 1\) if one of the two matrices is normal, and \(a = 2\) otherwise.

The quantity \(Z_j(E,F)\) and the related optimization problem are known in the literature as the Zolotarev number and Zolotarev problem for the sets E and F [4, 52]. Studying the Zolotarev problem is not only important from a theoretical but also from a computational point of view because identifying an extremal rational function \(r_j^*(z)\) for \(Z_j(E,F)\) allows for the fast solution of (9). Indeed, running j steps of ADI or RKSM with shift parameters given by the zeros and poles of \(r_j^*(z)\) ensures a convergence rate not slower than the decay rate of \(Z_j(E,F)\) [3, 32, 50]. A major complication of this construction is the need for choosing the sets EF. On the one hand, it is desirable to choose EF such that they enclose the numerical ranges of \({\widetilde{A}},{\widetilde{B}}\) as tightly as possible. On the other hand, explicit solutions for \(Z_j(E,F)\) are known only for a few selected geometries of E and F, including: two disjoint real intervals [52], the two connected components of the complement of an annulus [25], two disjoint discs [47]. Therefore, one usually encloses the numerical range by a disc or an interval (in the case of Hermitian matrices). In Sects. 2.2.1 and 2.2.2 below, we discuss two geometries in more detail: the case of two discs and the case of a disc and an interval. These enclosures will be used for generating the shift parameters in Algorithm 2 for most of our numerical examples.

2.2.1 Solution of the Zolotarev problem for two disjoint discs

Letting \({\mathbb {B}}(x,\rho )\) denote the closed disc with center \(x\in {\mathbb {C}}\) and radius \(\rho >0\), we consider the Zolotarev problem for

$$\begin{aligned} E={\mathbb {B}}(x_E, \rho _E),\qquad F={\mathbb {B}}(x_F, \rho _F),\qquad |x_E-x_F|> \rho _E+\rho _F. \end{aligned}$$

Starke [47] addressed the case \(x_E, x_F \in {\mathbb {R}}\) by explicitly constructing a rational function \(r_1^*(z)=\frac{z-q^*}{z-p^*}\) with \(q^*\) in the interior of E, \(p^*\) in the interior of F, and \(|r_1^{*}(z)|\) being constant on \(\partial E\) and on \(\partial F\). In view of the (generalized) near-circularity criterion [47, Theorem 3.1], this implies the optimality of the rational function \([r_1^*(z)]^j\) for \(Z_j(E,F)\), \(j\ge 1\).

The general case of two disjoint discs (with \(x_E,x_F \in {\mathbb {C}}\) not necessarily on the real line) can be derived from Starke’s result by relying on the invariance property of Zolotarev problems under Moebius transformations; see also [44, Example VIII.4.2]. For completeness and convenience of the reader, we provide the complete transformation that maps E, F to the two connected components of the complement of an annulus. More precisely, a Moebius map \(\Phi \) is constructed such that \(\Phi (E)\) is a closed disc centered at the origin and \(\Phi (F)\) is the complement of a larger open disc centered at the origin. We compose \(\Phi \) from three simpler Moebius maps \(\Phi _j\), \(j=1,2,3\), as follows (see also Fig. 1):

  • \(\Phi _1(z)= \frac{z-x_E}{\rho _E}\) maps E to the unit disc while F remains a disjoint disc,

  • \(\Phi _2(z)= z^{-1}\) maps \(\Phi _1(E)\) to the exterior of the unit disc and \(\Phi _1(F)\) to a disc enclosed by the unit circle,

  • \(\Phi _3(z)=\frac{z-\alpha }{z-\beta }\) maps \(\Phi _2(\Phi _1(F))\) to the exterior of a disc with center 0 and \(\Phi _2(\Phi _1(E))\) to a disc centered at 0 and enclosed by \(\partial \Phi _3(\Phi _2(\Phi _1(F)))\).

Fig. 1: Action of the Moebius map \(\Phi =\Phi _3\circ \Phi _2\circ \Phi _1\) on the sets E (blue) and F (red)

The parameters \(\alpha , \beta \in {\mathbb {C}}\) are chosen as the so-called common inverse points for the circles \(\partial \Phi _2(\Phi _1(E))=\partial {\mathbb {B}}(0,1)\) and \(\partial \Phi _2(\Phi _1(F))\) [28, Section 4.2]. Setting \( \Phi _2(\Phi _1(F))=:{\mathbb {B}}({\widetilde{x}}_F, \widetilde{\rho }_F)\), the values of \(\alpha ,\beta \) follow from the two equations

$$\begin{aligned} \alpha \overline{\beta }=1, \qquad (\alpha - {\widetilde{x}}_F)\overline{(\beta - {\widetilde{x}}_F)}=\widetilde{\rho }_F^2. \end{aligned}$$

Clearly, one solution is given by

$$\begin{aligned} \beta =\overline{\alpha }^{-1}, \qquad \alpha =\bigg [|{\widetilde{x}}_F|^2+1-\widetilde{\rho }_F^2+\sqrt{(|{\widetilde{x}}_F|^2+1-\widetilde{\rho }_F^2)^2-4|{\widetilde{x}}_F|^2}\bigg ]/(2\overline{{\widetilde{x}}_F}). \end{aligned}$$
(14)

Observe that \(\Phi _1(F)\) is a disc with center \(\Phi _1(x_F)\ne 0\) and radius \(\rho _F/\rho _E\). In particular, \(\Phi _1(F)\) attains its maximum and minimum modulus at the elements

$$\begin{aligned} x_1=\Phi _1(x_F) + \frac{\Phi _1(x_F)}{|\Phi _1(x_F)|}\frac{\rho _F}{\rho _E},\qquad x_2=\Phi _1(x_F) - \frac{\Phi _1(x_F)}{|\Phi _1(x_F)|}\frac{\rho _F}{\rho _E}. \end{aligned}$$

Now, \(\Phi _2(z)=z^{-1}\) maps \(x_1\) and \(x_2\) into the minimum and maximum modulus elements of \({\mathbb {B}}({\widetilde{x}}_F, \widetilde{\rho }_F)\), respectively. This implies that \({\widetilde{x}}_F,\widetilde{\rho }_F\) take the following values:

$$\begin{aligned} {\widetilde{x}}_F=\frac{\Phi _2(x_1) + \Phi _2(x_2)}{2},\qquad \widetilde{\rho }_F=\frac{|\Phi _2(x_1) - \Phi _2(x_2)|}{2}. \end{aligned}$$

Together with (14), this allows us to compute \(\alpha ,\beta \) explicitly and evaluate \(\Phi \). The complete expressions for \(\Phi \) and \(\Phi ^{-1}\) read as follows:

$$\begin{aligned} \Phi (z)= \frac{\rho _E-\alpha (z-x_E)}{\rho _E-\overline{\alpha }^{-1}(z-x_E)},\qquad \Phi ^{-1}(z)= \frac{(\overline{\alpha }^{-1}x_E+\rho _E)z-\alpha x_E-\rho _E}{\overline{\alpha }^{-1}z-\alpha }. \end{aligned}$$

By the near-circularity criterion [47, Theorem 3.1], an extremal rational function for the complement of an annulus centered at 0 is \(z^{j}\), having the only pole \(\infty \) and the only zero at 0. Thus, an optimal rational function for the original problem is given by \(r_j^*(z)=\left( \frac{z-\Phi ^{-1}(0)}{z-\Phi ^{-1}(\infty )}\right) ^{j}\). The pole \(p^*\) and zero \(q^*\) are given by

$$\begin{aligned} p^*=\Phi ^{-1}(\infty )= x_E+\frac{\rho _E}{\overline{\alpha }^{-1}},\qquad q^*=\Phi ^{-1}(0)= x_E+\frac{\rho _E}{\alpha }. \end{aligned}$$
(15)

Since \(\Phi \) maps the boundaries of E and F to the boundaries of the annulus, it follows that the Zolotarev numbers are given by

$$\begin{aligned} Z_j(E,F)=\eta ^j, \qquad \eta = \frac{|\Phi (x_E+\rho _E)|}{|\Phi (x_F+\rho _F)|}=\left| \frac{(1-\alpha )(\rho _E-\overline{\alpha }^{-1}(x_F+\rho _F-x_E))}{(1-\overline{\alpha }^{-1}) (\rho _E-\alpha (x_F+\rho _F-x_E))}\right| <1. \end{aligned}$$
(16)
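The construction of this subsection translates directly into a short routine returning the optimal zero \(q^*\), pole \(p^*\), and rate \(\eta \); the following Python sketch implements (14)–(16) (variable names are ours).

```python
import numpy as np

def zolotarev_two_discs(xE, rE, xF, rF):
    """Optimal shift pair and convergence rate for the Zolotarev problem on two
    disjoint discs E = B(xE, rE), F = B(xF, rF); see (14)-(16)."""
    p1 = (xF - xE) / rE                                  # Phi_1(x_F)
    x1 = p1 + p1 / abs(p1) * rF / rE                     # extremal points of Phi_1(F)
    x2 = p1 - p1 / abs(p1) * rF / rE
    xFt = (1 / x1 + 1 / x2) / 2                          # centre of Phi_2(Phi_1(F))
    rFt = abs(1 / x1 - 1 / x2) / 2                       # its radius
    s = abs(xFt) ** 2 + 1 - rFt ** 2
    alpha = (s + np.sqrt(s ** 2 - 4 * abs(xFt) ** 2 + 0j)) / (2 * np.conj(xFt))  # (14)
    q_star = xE + rE / alpha                             # zero, inside E, see (15)
    p_star = xE + rE * np.conj(alpha)                    # pole, inside F, see (15)
    num = (1 - alpha) * (rE - (xF + rF - xE) / np.conj(alpha))
    den = (1 - 1 / np.conj(alpha)) * (rE - alpha * (xF + rF - xE))
    eta = abs(num / den)                                 # rate in (16)
    return q_star, p_star, eta
```

For instance, for \(E={\mathbb {B}}(0,1)\) and \(F={\mathbb {B}}(4,1)\) this returns \(\eta \approx 0.072\), and repeating the single shift pair \((q^*,p^*)\) in ADI or RKSM yields the rate \(\eta ^j\).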

2.2.2 Quasi-optimal solution of the Zolotarev problem for a disc and an interval

Let us now consider the situation of a disc and a disjoint interval:

$$\begin{aligned} E={\mathbb {B}}(x_E,\rho _E),\qquad F =[a,b],\qquad x_E\in {\mathbb {R}},\quad x_E+\rho _E<a\text { or } x_E-\rho _E>b. \end{aligned}$$

Similarly to the previous section, we build a Moebius transformation that recasts \(Z_j(E,F)\) as a Zolotarev problem for which a (quasi-optimal) solution is known. More specifically, we let \(\Theta =\Theta _2\circ \Theta _1\) where (see also Fig. 2):

  • \(\Theta _1(z)=\Phi _1(z)\) maps E to the unit disc while F remains a real disjoint interval,

  • \(\Theta _2(z)=\frac{z+1}{z-1}\) maps \(\Theta _1(E)\) to the closed left half of the complex plane and \(\Theta _1(F)\) to an interval \([{\tilde{a}}, {\tilde{b}}]\) of positive real numbers.

A straightforward calculation yields explicit expressions for \(\Theta \) and its inverse:

$$\begin{aligned} \Theta (z)=\frac{z-x_E+\rho _E}{z-x_E-\rho _E},\qquad \Theta ^{-1}(z)=\frac{(x_E+\rho _E)z-x_E+\rho _E}{z-1}. \end{aligned}$$
Fig. 2: Action of the Moebius map \(\Theta =\Theta _2\circ \Theta _1\) on the sets E (blue) and F (red)

By the maximum modulus principle, it suffices to consider the boundary of \(\Theta (E)\) and thus the transformed Zolotarev problem takes the form

$$\begin{aligned} Z_j(\partial \Theta (E), \Theta (F))=Z_j({\textbf{i}}{\mathbb {R}}, [{\tilde{a}}, {\tilde{b}}])=\min _{r(z)\in {\mathcal {R}}_{j,j}}\frac{\max _{z\in {\textbf{i}}{\mathbb {R}}}|r(z)|}{\min _{z\in [{\tilde{a}}, {\tilde{b}}]}|r(z)|}. \end{aligned}$$
(17)

Let \({\tilde{r}}_j\) denote the extremal rational function for the Zolotarev problem \(Z_j( [-{\tilde{b}}, -{\tilde{a}}], [{\tilde{a}}, {\tilde{b}}])\), that is, the imaginary axis is replaced by \([-{\tilde{b}}, -{\tilde{a}}]\). This function has been extensively studied in the literature; see [4, Sec. 3.1] and the references therein. In particular, it holds that

$$\begin{aligned} {\tilde{r}}_j(z)=\prod _{i=1}^j \frac{z+\tilde{\psi }_{j,i}}{z-\tilde{\psi }_{j,i}}, \end{aligned}$$

where the poles \(\tilde{\psi }_{j,i} \in [{\tilde{a}},{\tilde{b}}]\), \(i=1,\dots , j\), can be expressed in closed form via elliptic integrals. Note that \({\tilde{r}}_j(-z) = 1/{\tilde{r}}_j(z)\) and the maximum absolute value of \({\tilde{r}}_j\) is 1, which is assumed throughout the imaginary axis. By [4, Corollary 3.2], it holds that \(Z_j( [-{\tilde{b}}, -{\tilde{a}}], [{\tilde{a}}, {\tilde{b}}]) \le 4 \eta ^{-2j}\) with \(\eta =\exp \left( \dfrac{\pi ^2}{2\log (4{\tilde{b}}/{\tilde{a}})}\right) \). Inserting \({\tilde{r}}_j\) into (17) gives the following upper bound for \(Z_j({\textbf{i}}{\mathbb {R}}, [{\tilde{a}}, {\tilde{b}}])\):

$$\begin{aligned} \delta _j:= \frac{\max _{z\in {\textbf{i}}{\mathbb {R}}}|{\tilde{r}}_j(z)|}{\min _{z\in [{\tilde{a}}, {\tilde{b}}]}|{\tilde{r}}_j(z)|} = \frac{1}{\min _{z\in [{\tilde{a}}, {\tilde{b}}]}|{\tilde{r}}_j(z)|} = \sqrt{Z_j( [-{\tilde{b}}, -{\tilde{a}}], [{\tilde{a}}, {\tilde{b}}])} \le 2 \eta ^{-j}. \end{aligned}$$
(18)

It has been shown in [12, 31] that \({\tilde{r}}_j\) satisfies the necessary optimality conditions for (17) and that it is within a factor 2 of the optimal solution: \(\delta _j \le 2 Z_j({\textbf{i}}{\mathbb {R}}, [{\tilde{a}}, {\tilde{b}}])\). To the best of our knowledge, it is still an open question whether \({\tilde{r}}_j\) is, in fact, an extremal rational function for (17). In summary, we have

$$\begin{aligned} Z_j(E,F)\le 2 \exp \left( \frac{-j\pi ^2}{2\log (4{\tilde{b}}/{\tilde{a}})}\right) \end{aligned}$$
(19)

with a quasi-optimal solution for \(Z_j(E,F)\) given by

$$\begin{aligned} {\widehat{r}}_j(z)=\prod _{i=1}^j\frac{z-\Theta ^{-1}(-\tilde{\psi }_{j,i})}{z-\Theta ^{-1}(\tilde{\psi }_{j,i})}. \end{aligned}$$
(20)

The poles of the rational function (20) are not nested, that is, the poles of \({\widehat{r}}_j(z)\) are different from the ones of \({\widehat{r}}_{j+1}(z)\). This is a disadvantage when using the poles of \({\widehat{r}}_j(z)\) in an iterative method (such as ADI or RKSM) where usually the number of steps (the parameter j) is not known a priori. Choosing j adaptively would require restarting the method from scratch whenever j is changed. In such a situation it is preferable to use a sequence of rational functions with nested poles. The method of equidistributed sequences (EDS) [12, Section 4] allows the generation of a nested sequence of poles \(\xi _i\), \(i=1,2,\dots \), such that the rational functions \(\prod _{i=1}^j\tfrac{z+\xi _i}{z-\xi _i}\) have asymptotically optimal convergence rates for \(Z_j([-{\tilde{b}}, -{\tilde{a}}],[{\tilde{a}},{\tilde{b}}])\) as j increases, see also [36, Section 3.5]. In analogy with (20), the nested asymptotically optimal zeros and poles for \(Z_j(E,F)\) are given by \(\Theta ^{-1}(-\xi _i)\) and \(\Theta ^{-1}(\xi _i)\), respectively. In the following example and more generally in Sect. 5, we demonstrate that the shift parameters computed by means of EDS yield convergence rates very close to the ones obtained with the optimal solution of \(Z_j([-{\tilde{b}}, -{\tilde{a}}],[{\tilde{a}},{\tilde{b}}])\).

Example 1

Let us consider the following 1D heat equation from [16, Section 7.1]:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_t= u_{xx} +f(x,t),&{}x \in \Omega := [0,1], t \in [0,1]\\ u\equiv 0,&{}\text {on }\partial \Omega ,\\ u(x,0)=4x(1-x),&{}\text {at } t=0,\\ f(x,t)=h\max \{1-|c(t)-x|/w,0\},&{} c(t)=\frac{1}{2} +(\frac{1}{2}-w)\sin (2\pi t), \end{array}\right. } \end{aligned}$$
(21)

with \(w=0.05\) and \(h=100\). Discretizing in space with second order central differences and in time with the implicit Euler method on a uniform \(n\times n_t\) grid yields a matrix equation of the form \(AX+XB_1^T=F\), where \(A \in {\mathbb {R}}^{n\times n}\) and \(B_1 \in {\mathbb {R}}^{n_t\times n_t}\) are the matrices of centered and backward differences with respect to the x and t variables, respectively. For this example we set \(n=n_t=1000\) and rescale the equation by multiplying both sides with \(\Delta t\). We focus on solving the correction equation

$$\begin{aligned} A\delta X+\delta XB_1^T= -X_0{\textbf{e}}_n{\textbf{e}}_1^T, \end{aligned}$$
(22)

where \(X_0\) satisfies \(AX_0+X_0(C_1^{(1)})^T= F\). For this purpose, we use RKSM based on one of the rational functions constructed above. The poles and zeros of the rational function determine the shifts in the rational Krylov subspaces for A and \(B_1\), respectively, see (12). We consider the following choices:

2DISCS:

single shifts \(p^*\) and \(q^*\) resulting from the Zolotarev problems with 2 discs, see (15);

ZOL-DI:

optimal shifts resulting from the Zolotarev problem with a disc and an interval, see (20);

EDS:

nested, asymptotically optimal shifts resulting from the Zolotarev problem with a disc and an interval; the computation of equidistributed shifts is described in [36, Section 3.5].

EK:

alternating shifts 0 and \(\infty \), corresponding to the extended Krylov method [45].

The numerical ranges that determine the sets E, F are known analytically: \({\mathcal {W}}(-B_1)=\{z:\ |z+1|\le \cos (\pi /(n_t+1))\}\) and \({\mathcal {W}}(A)=[\lambda _{\min },\lambda _{\max }]\) with \(\lambda _{\min }=(2 - 2 \cos (\pi / (n+1)))\frac{(n+1)^2}{n_t}\) and \(\lambda _{\max }=(2 - 2 \cos (n \pi / (n+1)))\frac{(n+1)^2}{n_t}\). We can directly use these sets for generating the shift parameters in EDS and ZOL-DI, while 2DISCS uses the (suboptimal) inclusion \({\mathcal {W}}(A)\subset \{z:\ |z-(\lambda _{\max }+\lambda _{\min })/2|\le (\lambda _{\max }-\lambda _{\min })/2\}\).
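For reproducibility of this example, a possible NumPy setup of the rescaled discretization (an illustrative sketch, with our own grid conventions) is:

```python
import numpy as np
import scipy.sparse as sp

def heat_1d_setup(n, nt, T=1.0, w=0.05, h=100.0):
    """Sketch of the discretization of (21), rescaled by dt: returns A, B1, F
    of A X + X B1^T = F and the spectral bounds of A used for the shifts."""
    dx, dt = 1.0 / (n + 1), T / nt
    A = dt * (n + 1) ** 2 * sp.diags(
        [2 * np.ones(n), -np.ones(n - 1), -np.ones(n - 1)], [0, -1, 1])
    B1 = sp.diags([np.ones(nt), -np.ones(nt - 1)], [0, -1])     # dt times (5)
    x = np.linspace(dx, 1 - dx, n)                              # interior nodes
    t = dt * np.arange(1, nt + 1)
    c = 0.5 + (0.5 - w) * np.sin(2 * np.pi * t)
    F = dt * h * np.maximum(1 - np.abs(c[None, :] - x[:, None]) / w, 0)
    F[:, 0] += 4 * x * (1 - x)                                  # initial condition
    lam_min = (2 - 2 * np.cos(np.pi / (n + 1))) * (n + 1) ** 2 * dt
    lam_max = (2 - 2 * np.cos(n * np.pi / (n + 1))) * (n + 1) ** 2 * dt
    return A.tocsr(), B1.tocsr(), F, (lam_min, lam_max)
```

The interval \([\lambda _{\min },\lambda _{\max }]\) and the disc \(\{z:\ |z+1|\le \cos (\pi /(n_t+1))\}\) then feed the shift computations of ZOL-DI, EDS, and 2DISCS described above.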

Letting \(\widetilde{\delta X}\) denote the approximate solution produced by one of the choices above, Fig. 3 reports the relative error history \(\Vert \widetilde{\delta X}-\delta X\Vert _2/\Vert \delta X\Vert _2\), where \(\delta X\) is a benchmark solution computed via the lyap command of Matlab, as the dimension of the rational Krylov subspaces increases. In addition, we report the relative error associated with the truncated singular value decomposition of \(\delta X\) to show the best error attainable. The results are shown in Fig. 3 together with the decay bounds (16) (Decay 2DISCS) and (19) (Decay ZOL-DI). The observed error decay clearly highlights the advantage of choosing shifts via the Zolotarev problem with a disc and an interval, either optimally (ZOL-DI) or asymptotically optimally (EDS).

Fig. 3: Relative error convergence histories of RKSM for (9), in the case of the 1D heat equation (Eq. (22)), for various choices of the shift parameters

2.3 Cost analysis of Algorithm 2

In order to compare with our second approach in the next section, we briefly discuss the cost of Algorithm 2 as well as the potential for carrying out (embarrassingly) parallel computations. Let \(\mathcal {C}_{\textrm{sys}}\) denote the maximum cost of solving a linear system with a linear combination of the matrices A and M, for a constant number of right-hand sides; note that \({\mathcal {C}}_{\textrm{sys}}\) is also an upper bound for the cost of solving with A or M individually. The cost of a matrix–vector operation with the banded matrices \(B_1,B_2\) is \({\mathcal {O}}(n_t)\), which is negligible. We may assume \(k = {\mathcal {O}}(1)\), which holds for all time discretization schemes discussed in this work. We assume that RKSM converges after \(\ell \) iterations to fixed accuracy and that the shift parameters are all different, as it happens, for instance, in EDS. If A is symmetric positive definite and its condition number grows polynomially with n (as, e.g., for finite difference or finite element discretizations of elliptic operators) then the bound (19) predicts \(\ell = {\mathcal {O}}(\log n)\). RKSM needs to solve \({\mathcal {O}}( \ell )\) linear systems at cost \(\mathcal {C}_{\textrm{sys}}\) each for generating the left factor W of the approximate solution \(\delta X \approx WYZ^*\). The (re)orthogonalization of the bases W, Z accounts for another \({\mathcal {O}}((n+n_t)\ell ^2 )\) operations. As we expect \(\ell \) to be small, the cost for computing Y as the solution of the compressed equation is negligible. When \(M\ne I_n\), computing the low-rank factorization at line 3 requires an additional constant number of linear system solves. By adding the cost of Algorithm 1 called in line 2 of Algorithm 2 we obtain the overall complexity

$$\begin{aligned} {\mathcal {O}}\left( (n_t+\ell )\mathcal {C}_{\textrm{sys}}+ (n+n_t)\ell ^2+nn_t\log (n_t)\right) . \end{aligned}$$

In terms of parallelization, note that the linear system solves of Algorithm 1 are embarrassingly parallel on up to \(n_t\) cores. In this situation, the cost reduces to \( {\mathcal {O}}(\ell \mathcal {C}_{\textrm{sys}}+ (n+n_t)\ell ^2+nn_t\log (n_t)) \) for each of the \(n_t\) cores. A further reduction can be obtained by making use of parallel implementations of RKSM [8] and FFT.

3 An evaluation interpolation approach

As pointed out in the introduction, existing ParaDiag algorithms replace (1) with

$$\begin{aligned} AX(C_2^{(\alpha )})^T+MX(C_1^{(\alpha )})^T=F \end{aligned}$$
(23)

for fixed \(\alpha \), where \(C_1^{(\alpha )},C_2^{(\alpha )}\) are \(\alpha \)-circulant matrices. Equation (23) can be solved efficiently with Algorithm 1, and this solver serves as a preconditioner in a Krylov method or as the building block of a stationary iteration.

In this section we propose a new method that approximates the solution of (1) by combining solutions of the matrix Eq. (23) for different values of \(\alpha \). Given \(\rho >0\), we let \({\mathcal {X}}_{\rho }\) denote the matrix-valued function that maps \(z\in {\mathbb {C}}\) to the solution \(X = {\mathcal {X}}_{\rho }(z)\) of (23) for \(\alpha = \rho z\). In particular \({\mathcal {X}}_{\rho }(0)\) is the desired solution of (1). In the following, we assume that \({\mathcal {X}}_\rho (z)\) is analytic in the disc \(\{z:\ |z|<\widehat{\rho }\}\) for some \(\widehat{\rho }>1\); see Example 2 below for an illustration of this assumption.

Our approach proceeds by interpolating \({\mathcal {X}}_\rho (z)\) at the dth roots of unity for some integer d, \(\omega _j=\exp (2\pi {\textbf{i}} j/d)\) for \(j=0,\dots ,d-1\), with the interpolation data \({\mathcal {X}}_\rho (\omega _j)\) computed by Algorithm 1. It is well known (see, e.g., [49]) that the coefficients of the interpolating polynomial \(\widetilde{{\mathcal {X}}}^{(d)}(z)=\sum _{j=0}^{d-1}{\widetilde{X}}_j^{(d)} z^j\) can be obtained from the interpolation data by applying the inverse Fourier transform. In particular, we obtain

$$\begin{aligned} {\mathcal {X}}_{\rho }(0) \approx \widetilde{{\mathcal {X}}}^{(d)}(0) = {\widetilde{X}}_0^{(d)} = \big ({\mathcal {X}}_\rho (\omega _0) + \cdots + {\mathcal {X}}_\rho (\omega _{d-1})\big ) / d. \end{aligned}$$

The described procedure is summarized in Algorithm 3. The analysis of the approximation error for such a trigonometric interpolation has a well established theory. The following result, retrieved by applying Theorem 12.1 from [49] to each entry, shows that \({\widetilde{X}}_0^{(d)}\) converges exponentially fast to the solution of (1) as d increases with the convergence rate depending on the radius of analyticity \(\widehat{\rho }\).

Theorem 1

Let \({\mathcal {X}}_\rho (z)\) be analytic on \(\{z:\ |z|<\widehat{\rho }\}\) with \(\widehat{\rho }>1\) and choose \(R\in (1,\widehat{\rho })\). Then the following holds for the entries of the approximation \({\widetilde{X}}_0^{(d)}\) returned by Algorithm 3:

$$\begin{aligned} |({\mathcal {X}}_\rho (0))_{ij} -({\widetilde{X}}_0^{(d)})_{ij}|\le \frac{\max _{|z|=R}|({\mathcal {X}}_\rho (z))_{ij}|}{R^d-1}, \qquad i=1,\dots n,\quad j=1,\dots ,n_t. \end{aligned}$$
(Algorithm 3: compute \({\mathcal {X}}_\rho (\omega _j)\) with Algorithm 1 for \(\alpha =\rho \omega _j\), \(j=0,\dots ,d-1\), and return their average \({\widetilde{X}}_0^{(d)}\); pseudocode figure.)
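A minimal, illustrative sketch of Algorithm 3, reusing paradiag_alpha_solve from the sketch following Algorithm 1 (the \(\alpha \)-circulants share the first columns c1, c2 of \(B_1,B_2\) for every \(\alpha \)):

```python
import numpy as np

def ev_int_solve(A, M, c1, c2, F, rho=5e-4, d=2):
    """Sketch of Algorithm 3: trigonometric interpolation of X_rho(z) at the
    d-th roots of unity, evaluated at 0 (i.e., the mean of the d solutions)."""
    X = np.zeros(F.shape, dtype=complex)
    for j in range(d):                            # d independent (parallel) solves
        alpha = rho * np.exp(2j * np.pi * j / d)
        X += paradiag_alpha_solve(A, M, c1, c2, F, alpha) / d
    return X.real
```

The adaptive doubling strategy mentioned below can reuse these d evaluations when moving from d to 2d interpolation points.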

We remark that it is possible to implement an adaptive doubling strategy for the choice of the parameter d. Whenever the accuracy of \({\widetilde{X}}_0^{(d)}\) is unsatisfactory, we can consider the computation of \({\widetilde{X}}_0^{(2d)}\). Since a dth root of unity is also a 2dth root of unity, this has the advantage that half of the evaluations of \({\mathcal {X}}_\rho (z)\) that we need in Algorithm 3 have already been computed at the previous iteration. As stopping criterion, one can verify that the norm of the difference between two consecutive approximations is smaller than a certain threshold. However, in all numerical experiments considered in this work we observe that the choices \(d=2,3\) already provide sufficient accuracy and there is little need for an adaptive choice of d.

Example 2

Let us consider the matrix equation \(AX+XB_1^T=F\) arising from the 1D heat equation of Example 1 with \(n=n_t\). Then \(C_1^{(\rho z)} = B_1 - \rho z {\textbf{e}}_1 {\textbf{e}}_{n}^T\) with the matrix \(B_1\) from (5) multiplied with \(\Delta t\). The eigendecomposition of \(C_1^{(\rho z)}\) is given by

$$\begin{aligned} C_1^{(\rho z)}=\big (D_{\rho z}^{-1}\Omega \big )\,\textrm{diag}\big (1-(\rho z)^{\frac{1}{n}},\,1-(\rho z)^{\frac{1}{n}}\zeta ,\,\dots ,\,1-(\rho z)^{\frac{1}{n}}\zeta ^{n-1}\big )\,\big (D_{\rho z}^{-1}\Omega \big )^{-1}, \end{aligned}$$

where \(\zeta =\exp (2\pi {\textbf{i}} /n)\) and \(\Omega \) is the discrete Fourier transform matrix. In particular, the eigenvalues of \(C_1^{(\rho z)}\) are contained in a disc with center 1 and radius \(|\rho z|^{1/n}\). Since A is symmetric positive definite, the spectra of A and \(-C_1^{(\rho z)}\) stay separated as long as \(|\rho z|^{1/n} < 1 + \lambda _{\min }(A)\), that is, \(|z| < \rho ^{-1}(1+\lambda _{\min }(A))^n\). This implies that \({\mathcal {X}}_\rho (z)\) is well-defined and, as a rational function of z, also analytic on the open disc with radius \(\rho ^{-1}(1+\lambda _{\min }(A))^n\), and hence the assumptions of Theorem 1 can be satisfied as long as \(\rho \le 1\). Assuming \(|\rho z|\le 1\), the 2-norm condition number of the eigenvector matrix for \(C_1^{(\rho z)}\) is bounded by \(1/|\rho z|\) and, in turn, \(\Vert {\mathcal {X}}_\rho (z)\Vert _F \le |\rho z \lambda _{\min } |^{-1} \Vert F \Vert _F\). Choosing \(R = 1/\rho \) in Theorem 1, we thus obtain

$$\begin{aligned} \Vert X_0 - {\widetilde{X}}_0^{(d)}\Vert _F \le \frac{1}{\lambda _{\min }( \rho ^{-d}-1)} \Vert F\Vert _F. \end{aligned}$$

Hence, we expect to obtain very fast convergence with decreasing \(\rho \) and/or increasing d. Note that \(\rho \) cannot be chosen too small because otherwise roundoff error, due to the ill-conditioned eigenvector matrix of \(C_1^{(\rho \omega _j)}\), starts interfering with the interpolation error. Both statements are confirmed by numerical experiments. Table 1 shows the relative residual \(\textrm{Res}:=\Vert A{\widetilde{X}}_0+ {\widetilde{X}}_0B_1^T-F\Vert _F/ \Vert F\Vert _F\) for the matrix \({\widetilde{X}}_0\) computed by Algorithm 3 with different values of \(n,\rho ,d\). The results indicate that excellent accuracy can already be achieved for \(d=2\) when choosing \(\rho \) properly.

Table 1 Relative residual of the approximate solution computed by Algorithm 3 with different choices of the parameters \(\rho ,d\) and for increasing size n of the problem

3.1 Cost analysis of Algorithm 3

In this section we maintain the assumptions and notation of Sect. 2.3. In particular, \(\mathcal {C}_{\textrm{sys}}\) denotes the cost of solving linear systems with a linear combination of the matrices A and M. The asymptotic complexity of Algorithm 3 is dominated by the d calls to Algorithm 1, that is,

$$\begin{aligned} {\mathcal {O}}(dn_t\mathcal {C}_{\textrm{sys}} + dnn_t\log (n_t)). \end{aligned}$$

As the calls to Algorithm 1 are embarrassingly parallel, the cost of Algorithm 3 reduces to

$$\begin{aligned} {\mathcal {O}}(\mathcal {C}_{\textrm{sys}}+ nn_t\log (n_t)). \end{aligned}$$

for each core if \(dn_t\) cores are available. Compared with the discussion in Sect. 2.3, this indicates that Algorithm 3 is better suited than Algorithm 2 in a massively parallel environment.

4 All-at-once Runge–Kutta formulation

In this section, we explain how the approaches presented in this work extend when the time discretization is performed by implicit Runge–Kutta methods. For this purpose, let us consider the differential problem (3) and assume, without loss of generality, that \(M=I\). The discretization by a Runge–Kutta method with s stages yields

$$\begin{aligned} {\left\{ \begin{array}{ll} {\textbf{x}}_{j+1}={\textbf{x}}_j+\Delta t\sum _{i=1}^sb_i {\textbf{k}}_i^{(j)},&{}j=0,\dots n_t-1,\\ {\textbf{k}}_i^{(j)}= A{\textbf{x}}_j+\Delta t\sum _{h=1}^sg_{ih}A {\textbf{k}}_h^{(j)} + {\textbf{f}}(t_j+c_i\Delta t),&{}i=1,\dots ,s, \end{array}\right. } \end{aligned}$$

where the coefficients \(b_i\) and \(c_i\) are the weights and the nodes of the Butcher tableau, respectively. By considering the Runge–Kutta matrix \(G=(g_{ih})\in {\mathbb {R}}^{s\times s}\) and setting \(K^{(j)}:=[{\textbf{k}}^{(j)}_1|\dots | {\textbf{k}}^{(j)}_s]\in {\mathbb {R}}^{n\times s}\), we can rewrite the second equation as

$$\begin{aligned} {\textbf{k}}_i^{(j)}&= A{\textbf{x}}_j+\Delta tAK^{(j)}G^T{\textbf{e}}_i + {\textbf{f}}(t_j+c_i\Delta t)\nonumber \\&\Rightarrow K^{(j)}= A{\textbf{x}}_j{\textbf{e}}^T+\Delta tAK^{(j)}G^T + F^{(j)}, \end{aligned}$$
(24)

where \({\textbf{e}}\in {\mathbb {R}}^s\) is the vector of all ones, \({\textbf{e}}_i\in {\mathbb {R}}^s\) is the ith unit vector, and \(F^{(j)}:=[f(t_j+c_1\Delta t)|\dots |f(t_j+c_s\Delta t)]\in {\mathbb {R}}^{n\times s}\). Vectorizing (24) leads to

$$\begin{aligned} \underbrace{(I-\Delta t G\otimes A)}_{H}\textrm{vec}(K^{(j)})-({\textbf{e}}\otimes A){\textbf{x}}_j= \textrm{vec}(F^{(j)}). \end{aligned}$$

Therefore, we get an all-at-once linear system \({\mathcal {A}}{\textbf{x}}={\textbf{f}}\) of the form

$$\begin{aligned} \begin{bmatrix} H \\ -{\textbf{b}}\otimes I &{} I \\ &{} -{\textbf{e}}\otimes A &{} H \\ &{} -I &{} -{\textbf{b}}\otimes I &{} I\\ &{} &{} &{} \ddots &{} \ddots \\ &{} &{} &{} -{\textbf{e}}\otimes A &{} H\\ &{} &{} &{} -I &{} -{\textbf{b}}\otimes I &{} I \end{bmatrix} \begin{bmatrix} \textrm{vec}(K^{(0)})\\ {\textbf{x}}_1\\ \textrm{vec}(K^{(1)})\\ {\textbf{x}}_2\\ \vdots \\ \textrm{vec}(K^{(n_t-1)})\\ {\textbf{x}}_{n_t} \end{bmatrix} = \begin{bmatrix} \textrm{vec}(F^{(0)})+({\textbf{e}}\otimes A){\textbf{x}}_0\\ {\textbf{x}}_0\\ \textrm{vec}(F^{(1)})\\ {\textbf{0}}\\ \vdots \\ \textrm{vec}(F^{(n_t-1)})\\ {\textbf{0}} \end{bmatrix}, \end{aligned}$$
(25)

where \({\textbf{b}}:=\Delta t[b_1,\dots ,b_s]\) is a row vector. Note that we can rewrite the \(2\times 2\) block matrix on the main diagonal as

$$\begin{aligned} \begin{bmatrix} H&{}\\ -{\textbf{b}}\otimes I&{}\quad I \end{bmatrix}= \begin{bmatrix} -\Delta t G&{} \\ &{} \end{bmatrix} \otimes A + \begin{bmatrix} I&{}\\ -{\textbf{b}}&{}\quad 1 \end{bmatrix}\otimes I\in {\mathbb {R}}^{n(s+1)\times n(s+1)}. \end{aligned}$$

Finally, by letting \({\widehat{M}}:=\begin{bmatrix} &{} {\textbf{e}}\\ &{} \end{bmatrix}\otimes A+\begin{bmatrix} &{}\\ &{}\quad 1 \end{bmatrix}\otimes I\in {\mathbb {R}}^{n(s+1)\times n(s+1)}\), we can write the coefficient matrix of the linear system (25) as \({\mathcal {A}} =I\otimes {\widehat{A}} + B_1\otimes {\widehat{M}}\) where

$$\begin{aligned} B_1=\begin{bmatrix} 1\\ -1&{}\quad \ddots \\ &{}\quad \ddots &{}\quad \ddots \\ &{}&{}\quad -1&{}\quad 1 \end{bmatrix},\qquad {\widehat{A}}=\begin{bmatrix} -\Delta t G&{} -{\textbf{e}}\\ &{} \end{bmatrix} \otimes A + \begin{bmatrix} I&{}\\ -{\textbf{b}}&{} \end{bmatrix}\otimes I. \end{aligned}$$
(26)

This implies that (25) is equivalent to a generalized Sylvester equation. In view of ParaDiag techniques, it is natural to consider the matrix \({\mathcal {A}}^{(\alpha )}\), obtained by replacing \(B_1\) in (26) with an \(\alpha \)-cyclic matrix \(C_1^{(\alpha )}\). This leads to the following three extensions of the ParaDiag framework to Runge–Kutta methods.

Preconditioned Krylov method. Employ \({\mathcal {A}}^{(\alpha )}\) as a preconditioner for GMRES.

Evaluation interpolation approach. Employ a solver for linear systems with matrices \({\mathcal {A}}^{(\rho \omega _j)}\) as building block for Algorithm 3.

Low-rank updates. Solve the linear system \({\mathcal {A}}^{(1)}{\textbf{x}}={\textbf{f}}\) and compute the solution of the corresponding update equation, that has right-hand side of rank 1, with RKSM.

In the next sections we describe the details of the solver for linear systems with the matrix \({\mathcal {A}}^{(\alpha )}\) and of the update equation that has to be addressed with RKSM.
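Under the identification of \({\widehat{M}}\) given above, the two block coefficients can be assembled explicitly; the following sketch (our own helper, for sparse A) builds \({\widehat{A}}\) and \({\widehat{M}}\) from the Butcher data, so that \({\mathcal {A}}^{(\alpha )}=I\otimes {\widehat{A}}+C_1^{(\alpha )}\otimes {\widehat{M}}\) can be formed or applied blockwise.

```python
import numpy as np
import scipy.sparse as sp

def rk_hatA_hatM(A, G, b, dt):
    """Assemble hatA and hatM of (26) for a Runge-Kutta method with matrix G
    and weights b (without the dt factor); A is the (sparse) spatial matrix."""
    s, n = G.shape[0], A.shape[0]
    e = np.ones((s, 1))
    TA = np.block([[-dt * G, -e], [np.zeros((1, s)), np.zeros((1, 1))]])
    TI = np.block([[np.eye(s), np.zeros((s, 1))],
                   [-dt * np.atleast_2d(b), np.zeros((1, 1))]])
    hatA = sp.kron(TA, A) + sp.kron(TI, sp.identity(n))
    SA = np.zeros((s + 1, s + 1)); SA[:s, s] = 1.0        # [[0, e], [0, 0]]
    SI = np.zeros((s + 1, s + 1)); SI[s, s] = 1.0         # [[0, 0], [0, 1]]
    hatM = sp.kron(SA, A) + sp.kron(SI, sp.identity(n))
    return hatA.tocsr(), hatM.tocsr()
```

The matrix \({\mathcal {A}}^{(\alpha )}\) is then never formed explicitly; only systems with \(\lambda _j{\widehat{M}}+{\widehat{A}}\) are solved, as described next.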

4.1 Solving linear systems with \({\mathcal {A}}^{(\alpha )}\)

After the diagonalization of the \(\alpha \)-cyclic matrix \(C_1^{(\alpha )}\), solving linear systems with the matrix \({\mathcal {A}}^{(\alpha )}\) requires solving, possibly in parallel, \(n_t\) linear systems with a coefficient matrix of the form

$$\begin{aligned}\lambda _j{\widehat{M}}+{\widehat{A}}=\begin{bmatrix} -\Delta t G&{}\quad (\lambda _j-1){\textbf{e}}\\ &{} \end{bmatrix} \otimes A + \begin{bmatrix} I&{}\\ -{\textbf{b}}&{}\quad \lambda _j \end{bmatrix}\otimes I. \end{aligned}$$

Each of these linear systems corresponds to a (generalized) Sylvester equation of the form

$$\begin{aligned} AXN_1+XN_2= F, \end{aligned}$$
(27)

with \(N_i\in {\mathbb {C}}^{(s+1)\times (s+1)}\), \(i=1,2\). Because s, the number of Runge–Kutta stages, is usually small, we can solve (27) efficiently by computing the generalized Schur decomposition [24] of the pair \((N_1,N_2)\) via the QZ algorithm [29]. This yields unitary matrices \(Q,Z\in {\mathbb {C}}^{(s+1)\times (s+1)}\) such that

$$\begin{aligned} N_1=QT_1Z^*, \qquad N_2=QT_2Z^*, \end{aligned}$$

where \(T_1,T_2\in {\mathbb {C}}^{(s+1)\times (s+1)}\) are upper triangular. The correspondingly transformed Eq. (27) takes the form

$$\begin{aligned} AYT_1+YT_2= {\widetilde{F}}, \qquad {\widetilde{F}}=FZ, \qquad Y=XQ, \end{aligned}$$

which can be solved by forward substitution after rewriting the equation in block-wise form:

$$\begin{aligned} A&\begin{bmatrix}Y_1&Y_2\end{bmatrix}\begin{bmatrix}T_1^{(11)}&{} T_1^{(12)}\\ {} &{} T_1^{(22)}\end{bmatrix}+\begin{bmatrix}Y_1&Y_2\end{bmatrix}\begin{bmatrix}T_2^{(11)}&{} T_2^{(12)}\\ {} &{} T_2^{(22)}\end{bmatrix}=\begin{bmatrix}{\widetilde{F}}_1&{\widetilde{F}}_2\end{bmatrix}\\&\Longrightarrow {\left\{ \begin{array}{ll} AY_1T_1^{(11)}+ Y_1T_2^{(11)}={\widetilde{F}}_1\\ AY_2T_1^{(22)}+ Y_2T_2^{(22)}={\widetilde{F}}_2-AY_1T_1^{(12)}-Y_1T_2^{(12)} \end{array}\right. }. \end{aligned}$$

Assuming that \(T_1^{(11)}, T_2^{(11)}\) are scalar quantities, we can first retrieve \(Y_1\in {\mathbb {C}}^n\) by solving a linear system with \(T_1^{(11)}A+T_2^{(11)} I\) and then compute \(Y_2\in {\mathbb {C}}^{n\times s}\) by applying the same procedure recursively. The solution X of (27) is obtained as \(YQ^*\).
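A compact sketch of this QZ-based solver, using SciPy's qz and a dense A for simplicity (the column-by-column forward substitution generalizes the two-block recursion above):

```python
import numpy as np
from scipy.linalg import qz

def solve_small_sylvester(A, N1, N2, F):
    """Sketch: solve A X N1 + X N2 = F with N1, N2 of small size s+1, via the
    generalized Schur form N1 = Q T1 Z^*, N2 = Q T2 Z^* and forward substitution."""
    T1, T2, Q, Z = qz(N1, N2, output='complex')
    Ft = F @ Z
    m = N1.shape[0]
    Y = np.zeros((A.shape[0], m), dtype=complex)
    for k in range(m):
        rhs = Ft[:, k].copy()
        for i in range(k):                       # subtract already computed columns
            rhs -= T1[i, k] * (A @ Y[:, i]) + T2[i, k] * Y[:, i]
        Y[:, k] = np.linalg.solve(T1[k, k] * A + T2[k, k] * np.eye(A.shape[0]), rhs)
    return Y @ Q.conj().T                        # X = Y Q^*
```

Each column requires one shifted solve with A, so a single time step amounts to \(s+1\) complex solves with matrices of the form \(T_1^{(kk)}A+T_2^{(kk)}I\).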

4.2 Solving the update equation

In the case of Runge–Kutta methods, the low-rank update approach from Sect. 2 requires us to solve the generalized matrix equation

$$\begin{aligned} {\widehat{A}}\delta X +{\widehat{M}}\delta XB_1^T= {\widehat{M}}X_0{\textbf{e}}_{n_t}{\textbf{e}}_1^T, \end{aligned}$$
(28)

where \(X_0\in {\mathbb {C}}^{n(s+1)\times n_t}\) is the (reshaped) solution of the linear system \({\mathcal {A}}^{(1)}{\textbf{x}}={\textbf{f}}\) described in the previous section. Equation (28) has a right-hand side of rank 1 but it is not immediate to recast it as a Sylvester equation of the form (9) because the coefficient \({\widehat{M}}\) is not invertible. Given a shift parameter \(\sigma \in {\mathbb {C}}\), we add and subtract the quantity \(\sigma {\widehat{A}}\delta XB_1^T\) in (28), obtaining

$$\begin{aligned} {\widehat{A}}\delta X(I-\sigma B_1^T) +({\widehat{M}}+\sigma {\widehat{A}})\delta XB_1^T={\widehat{M}}X_0{\textbf{e}}_{n_t}{\textbf{e}}_1^T. \end{aligned}$$

Hence, if \(\sigma \) is chosen such that both \({\widehat{M}}+\sigma {\widehat{A}}\) and \(I-\sigma B_1^T\) are invertible, we can write an equation of the form \({\widetilde{A}} \delta X+\delta X{\widetilde{B}}^T = {\textbf{u}}{\textbf{v}}^*\) equivalent to (28) by setting:

$$\begin{aligned} {\widetilde{A}}&= ({\widehat{M}}+\sigma {\widehat{A}})^{-1}{\widehat{A}},&{\widetilde{B}}&=(I-\sigma B_1)^{-1}B_1,\\ {\textbf{u}}&=({\widehat{M}}+\sigma {\widehat{A}})^{-1}{\widehat{M}}X_0{\textbf{e}}_{n_t},&{\textbf{v}}&= (I-\sigma B_1)^{-1}{\textbf{e}}_1. \end{aligned}$$

This suggests to use the matrices \({\widetilde{A}},{\widetilde{B}}\) and the vectors \({\textbf{u}}, {\textbf{v}}\) to generate Krylov subspaces for the Eq. (28). Once again, there is no need to form explicitly \({\widetilde{A}},{\widetilde{B}}\) for performing matrix–vector products and solving shifted linear systems; for these operations we rely on the relations:

$$\begin{aligned} \left[ ({\widehat{M}}+\sigma {\widehat{A}})^{-1}{\widehat{A}}-zI\right] ^{-1}{\textbf{w}}&= \left[ (1-z\sigma ){\widehat{A}}-z{\widehat{M}}\right] ^{-1}({\widehat{M}}+\sigma {\widehat{A}}){\textbf{w}},\\ \left[ (I-\sigma B_1)^{-1}B_1-zI\right] ^{-1}{\textbf{w}}&=\left[ (1+z\sigma )B_1-zI\right] ^{-1}(I-\sigma B_1){\textbf{w}}, \end{aligned}$$

which only require manipulating the matrices \({\widehat{A}}, {\widehat{M}}, B_1,I-\sigma B_1\), all of which are sparse when A is sparse.
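The relations above can be wrapped into solver handles that feed an RKSM or extended Krylov implementation; the following sketch (hypothetical helper names, with \(\sigma \) chosen arbitrarily for illustration) also forms the rank-1 factors \({\textbf{u}},{\textbf{v}}\).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def rk_update_operators(hatA, hatM, B1, X0_last_col, sigma=0.5):
    """Sketch: shifted solves with Atilde = (hatM + sigma*hatA)^{-1} hatA and
    Btilde = (I - sigma*B1)^{-1} B1 via the relations above, plus u, v."""
    nt = B1.shape[0]
    I_t = sp.identity(nt, format='csc')
    u = spla.spsolve((hatM + sigma * hatA).tocsc(), hatM @ X0_last_col)
    v = spla.spsolve((I_t - sigma * B1).tocsc(), np.eye(nt)[:, 0])
    def solve_shifted_A(z, w):
        # [(hatM + sigma*hatA)^{-1} hatA - z I]^{-1} w
        return spla.spsolve(((1 - z * sigma) * hatA - z * hatM).tocsc(),
                            (hatM + sigma * hatA) @ w)
    def solve_shifted_B(z, w):
        # [(I - sigma*B1)^{-1} B1 - z I]^{-1} w
        return spla.spsolve(((1 + z * sigma) * B1 - z * I_t).tocsc(),
                            (I_t - sigma * B1) @ w)
    return u, v, solve_shifted_A, solve_shifted_B
```

Only the sparse matrices \({\widehat{A}},{\widehat{M}},B_1\) are touched, in line with the remark above.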

In order to study the low-rank approximability of the solution to (28) and to provide suitable a priori choices for the shift parameters in RKSM, one would need to study the spectral properties of \({\widetilde{A}},{\widetilde{B}}\). The matrix \({\widetilde{B}}\) is lower triangular Toeplitz and we can explicitly compute its entries:

$$\begin{aligned} {\widetilde{B}}=(1-\sigma )^{-1} \begin{bmatrix} 1\\ \frac{1}{\sigma -1}&{}\quad \ddots \\ \frac{\sigma }{(\sigma -1)^2}&{}\quad \ddots &{}\quad \ddots \\ \vdots &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots \\ \frac{\sigma ^{n_t-2}}{(\sigma -1)^{n_t-1}}&{}\quad \dots &{}\quad \frac{\sigma }{(\sigma -1)^2}&{}\quad \frac{1}{\sigma -1}&{}\quad 1 \end{bmatrix} \end{aligned}$$

In particular, it has the single eigenvalue \((1-\sigma )^{-1}\), with multiplicity \(n_t\). However, the matrix \({\widetilde{A}}\) appears hard to analyze in general, even assuming strong properties such as A symmetric positive definite. For this reason, we postpone the analysis of (28) to future work and we propose the use of the extended Krylov method (see Example 1) or adaptive choices of the shift parameters [13] when solving the update equation with RKSM.

5 Numerical results

In this section, we compare the performance of Algorithm 2 and Algorithm 3 with that of the preconditioned GMRES method considered in [20, 37], on different examples. More specifically, we consider the following list of methods:

  • PGMRES: GMRES preconditioned by the solver of the matrix equation with an \(\alpha \)-circulant coefficient. If not stated otherwise, \(\alpha =1\) and the threshold for stopping GMRES is set to \(10^{-8}\),

  • Ev-Int: Algorithm 3. If not stated otherwise, we choose \(d=2\) and \(\rho =5\cdot 10^{-4}\).

  • 2DISCS, ZOL-DI(4), EDS, EK: Algorithm 2 with different choices of poles; see Example 1 for details. Note that the method ZOL-DI is not competitive, in terms of computational time, as it requires recomputing the bases of the Krylov subspaces from scratch at every iteration. For this reason we consider its variant ZOL-DI(4), which consists of cyclically repeating the 4 shifts corresponding to the Zolotarev problem with rational functions of degree (4, 4). We have chosen 4 shifts because preliminary experiments indicated that this choice offers good performance across a broader range of examples.

RKSM called at line 4 of Algorithm 2 is stopped when the relative residual of the update Eq. (9) is below the threshold \(10^{-8}\). The methods EK, 2DISCS, and ZOL-DI(4) rely on precomputed factorizations of the, possibly shifted, matrices \({\widetilde{A}}\) and \({\widetilde{B}}\) when generating the bases of the rational Krylov subspaces. If the matrices A and M in (1) are sparse, a sparse direct solver is exploited whenever the solution of a linear system involving a linear combination of A and M is required; e.g., at line 4 of Algorithm 3.

For each method we report the computational time required and the relative residual \(\textrm{Res}:= \Vert A{\widehat{X}}B_2^T+M{\widehat{X}}B_1^T-F\Vert _F/\Vert F\Vert _F\) in order to assess the accuracy of the corresponding approximate solution \({\widehat{X}}\). The best timings of each case study are highlighted in bold and reported in Tables 2, 3, 4, 5, 6, 7, 8. For the various implementations of Algorithm 2 we also report the percentage of the CPU time that is spent for solving (9), labeled with \(\mathrm {T_{sylv}}\), the dimension (Dim) of the rational Krylov subspace generated, and the rank (rk) of the approximate solution to (9) after recompression.

Except for Sect. 5.7, all experiments have been performed on a laptop with a dual-core Intel Core i7-7500U 2.70 GHz CPU, 256 KB of level 2 cache, and 16 GB of RAM. The algorithms are implemented in MATLAB and tested under MATLAB R2020a, with MKL BLAS version 2019.0.3 utilizing both cores.

5.1 1D heat equation

We start by testing all the algorithms detailed at the beginning of this section on the matrix equation \(AX+XB_1^T=F\) from Example 1 for \(n=1089, 4225\) and \(n_t=2^8,\dots , 2^{11}\).
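Here \(B_2=I_{n_t}\), and we assume that \(B_1\) is the implicit Euler factor from (5); a minimal sketch of its assembly, with illustrative values of \(n_t\) and \(T\), is:

```matlab
% Sketch: assembling B1 from (5), assuming nt implicit Euler steps of size
% dt = T/nt (illustrative values below).
nt = 2^8; T = 1; dt = T/nt;
e  = ones(nt,1);
B1 = spdiags([e, -e], [0, -1], nt, nt) / dt;   % lower bidiagonal Toeplitz
```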

The results are reported in Table 2. As expected, ZOL-DI(4) and EDS generate Krylov subspaces of significantly lower dimension than those generated by EK and 2DISCS, while achieving comparable accuracy. Looking at the computational times, we see that, for all methods, the cost depends almost linearly on the size n and on the number of time steps \(n_t\). The methods with the most favorable timings are Ev-Int and ZOL-DI(4). The former is preferable when \(\mathrm {T_{sylv}}>50 \%\), which happens for the smaller instances of the equation. For all four implementations of Algorithm 2, the fraction of the time spent on solving the low-rank Sylvester equation decreases as the size of the problem increases. Since the advantage of using ZOL-DI(4) grows as \(\mathrm {T_{sylv}}\) decreases, the best scenario for ZOL-DI(4) occurs for \(n=4225\) and \(n_t=2048\), where it is about \(2\times \) faster than Ev-Int and \(5\times \) faster than PGMRES.

Table 2 1D heat equation; performances of the various methods
Table 3 Convection diffusion; performances of the various methods

5.2 Convection diffusion

This example is concerned with the convection–diffusion equation from [37, Section 6.2], which models the temperature distribution in a cavity with an external wall maintained at a constant (hot) temperature and a wind that induces a recirculating flow. The problem reads as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_t-\epsilon \Delta u + {\textbf{w}}\cdot \nabla u=0,&{}S\times (0,1),\ S = (-1,1)^2,\\ u(x, y, 0)=\chi _{\{x=1\}},\\ u(x, y, t)=\chi _{\{x=1\}},&{} \partial S\times (0,1), \end{array}\right. } \end{aligned}$$

where \(\epsilon = 0.005\), \(\chi _{\{x=1\}}\) is the indicator function of the set \(\{x=1\}\), and \({\textbf{w}} = (2y(1-x^2), -2x(1-y^2))\). The discretization uses Q1 finite elements over the domain S, backward Euler time-stepping, and streamline-upwind Petrov–Galerkin (SUPG) stabilization. The matrix equation corresponding to the all-at-once linear system takes the form \(AX+MXB_1^T=F\), where both A and M are sparse. To estimate the numerical range of \(M^{-1}A\), we compute the largest and smallest real parts, \(r_{\max },r_{\min }\), of its eigenvalues by means of the MATLAB command eigs. We remark that \(r_{\max },r_{\min }>0\) and we denote \(c:=(r_{\max }+r_{\min })/2\). Then, 2DISCS employs \(\{z:\ |z-c|\le (r_{\max }-r_{\min })/2 \}\) as an estimate for \({\mathcal {W}}(M^{-1}A)\), while ZOL-DI(4) and EDS make use of the real interval \([r_{\min },r_{\max }]\).

The performances of the numerical methods are reported in Table 3. Although we used the looser tolerance \(10^{-6}\) to stop PGMRES, the latter needs 17 iterations to converge, which makes the other approaches significantly faster. Among the methods based on low-rank updates, ZOL-DI(4) and EDS generate the Krylov subspaces of smallest dimension while maintaining good accuracy. This feature makes EDS faster than EK, although the latter has a cheaper iteration cost. Similar comments to the ones made in the previous example apply to the comparison between \(\texttt {ZOL-DI(4)}\) and \(\texttt {Ev-Int}\).
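The spectral estimates \(r_{\max },r_{\min }\) used above can be obtained with one call of eigs per endpoint; a minimal sketch, assuming A and M are the assembled sparse matrices, reads:

```matlab
% Sketch: extreme real parts of the eigenvalues of M\A via eigs, used to build
% the disc for 2DISCS and the interval for ZOL-DI(4) and EDS.
r_max = real(eigs(A, M, 1, 'largestreal'));
r_min = real(eigs(A, M, 1, 'smallestreal'));
c = (r_max + r_min)/2;   % center of the disc employed by 2DISCS
r = (r_max - r_min)/2;   % radius of the disc employed by 2DISCS
```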

5.3 Fractional space diffusion

We consider the fractional diffusion example from [51, Section 5]:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_t= a_1\frac{\partial ^{\gamma _1} u^+}{\partial x^{\gamma _1}}+b_1\frac{\partial ^{\gamma _1} u^-}{\partial x^{\gamma _1}}+a_2\frac{\partial ^{\gamma _2} u^+}{\partial y^{\gamma _2}}+b_2\frac{\partial ^{\gamma _2} u^-}{\partial y^{\gamma _2}}+f(x,y,t), &{} S \times (0,5),\ S = (0,1)^2,\\ u(x,y, 0)= 0, &{} (x,y)\in S,\\ u(x,y,t)=0,&{} (x,y,t)\in \partial S\times (0,5), \end{array}\right. } \end{aligned}$$

where \(\frac{\partial ^{\gamma _j}u^\pm }{\partial x^{\gamma _j}}\) denotes the right- and left-looking Riemann–Liouville derivatives; the diffusion coefficients are set as \(a_1 = 1, a_2 = 0.5, b_1 = 0.2, b_2 = 1\), the differential orders are \(\gamma _1=1.75\) and \(\gamma _2=1.5\), and the source term is \(f(x,y,t)= 10\sin (3xyt)\). The space discretization employs the second-order weighted and shifted Grünwald difference formula, combined with uniform backward Euler time stepping. The all-at-once linear system matrix takes the form \(I\otimes A+B_1\otimes I\), where the matrix A has the structure

$$\begin{aligned} A = I \otimes T_1+T_2\otimes I \end{aligned}$$

with \(T_1,T_2\) dense Toeplitz matrices; see [51] for a detailed description. In particular, matrix–vector products with A are performed efficiently by relying on the FFT and on matrix equation techniques. Moreover, the minimum and maximum real parts of the eigenvalues of A are easily obtained once estimates of the same quantities for \(T_1,T_2\) have been computed. When running \(\texttt {2DISCS}\), \(\texttt {ZOL-DI(4)}\), and \(\texttt {EDS}\), we approximate \({\mathcal {W}}(A)\) as in the previous example.
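The FFT-based matrix–vector products mentioned above rely on the standard circulant embedding of a Toeplitz matrix; a minimal sketch (toepmv is a hypothetical helper, not the routine used in [51]) is:

```matlab
% Sketch: fast Toeplitz matrix-vector product via circulant embedding and FFT.
% c and r are the first column and first row of T (column vectors, c(1)==r(1)).
function y = toepmv(c, r, x)
    n = length(c);
    a = [c; 0; r(end:-1:2)];                  % first column of a 2n x 2n circulant containing T
    y = ifft(fft(a) .* fft([x; zeros(n,1)]));
    y = real(y(1:n));                         % discard imaginary round-off (real data)
end
```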

The results reported in Table 4 show that the methods based on low-rank updates struggle on this example, because they have to generate rather large subspaces in order to achieve the desired accuracy. This is caused by the fact that the solution of the update equation (8) has a slightly higher rank than in the previous examples and the approximations of \({\mathcal {W}}(A)\) are not tight; see Fig. 4. Nevertheless, ZOL-DI(4) achieves the best timings for \(n=4225\). Ev-Int maintains its usual behavior and turns out to be the preferable method for \(n=1089\).

Fig. 4 Fractional diffusion example with \(n=1089\); eigenvalues of the matrix A (cyan crosses), numerical range of the matrix A (disc enclosed by the green line) and its approximations used for 2DISCS (disc enclosed by the blue line) and for ZOL-DI(4) (red line segment)

Table 4 Fractional space diffusion; performances of the various methods

5.4 Fractional time diffusion

We now consider an example where Algorithm 2 is not applicable due to the nonlocality of the time differential operator. We modify Example 2 by replacing the first-order time derivative with the Caputo time derivative of order \(\gamma \in (0,1)\):

$$\begin{aligned} {\left\{ \begin{array}{ll} \frac{\partial ^\gamma u}{\partial t^\gamma }= u_{xx} +f(x,t),&{}S:= [0,1]\times [0, 1],\\ u\equiv 0,&{}\text {on }\partial S,\\ u(x,0)=4x(1-x),&{}\text {at } t=0,\\ f(x,t)=h\max \{1-|c(t)-x|/w,0\},&{} c(t)=\frac{1}{2} +(\frac{1}{2}-w)\sin (2\pi t), \end{array}\right. } \end{aligned}$$

and we keep the same parameters \(w\), \(h\) as in the integer-order case. Discretizing the Caputo time derivative with the unshifted Grünwald–Letnikov formula, we get a matrix equation \(AX+XB_1^T=F\) where \(B_1\) is lower triangular Toeplitz and takes the form:

$$\begin{aligned} B_1=\begin{bmatrix} g_{\gamma ,0}\\ g_{\gamma , 1}&{}g_{\gamma ,0}\\ \vdots &{}\ddots &{}\ddots \\ g_{\gamma , n_t-1}&{}\dots &{}g_{\gamma , 1}&{}g_{\gamma , 0} \end{bmatrix}, \qquad {\left\{ \begin{array}{ll} g_{\gamma ,0}=1,\\ g_{\gamma ,j}= \left( 1-\frac{\gamma +1}{j}\right) g_{\gamma , j-1}. \end{array}\right. } \end{aligned}$$
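The weights \(g_{\gamma ,j}\) and the matrix \(B_1\) are cheap to assemble from this recursion; a minimal sketch, with illustrative values of \(\gamma \) and \(n_t\), reads:

```matlab
% Sketch: Gruenwald-Letnikov weights and the lower triangular Toeplitz B1;
% gamma and nt below are illustrative values.
gamma = 0.3; nt = 6;
g = zeros(nt,1); g(1) = 1;                    % g(j+1) stores g_{gamma,j}
for j = 1:nt-1
    g(j+1) = (1 - (gamma+1)/j) * g(j);
end
B1 = toeplitz(g, [g(1), zeros(1,nt-1)]);      % lower triangular Toeplitz
```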

In particular, the usual way of constructing a circulant matrix from \(B_1\) requires applying a matrix \(\delta B_1^{(\alpha )}\) of rank \(n_t-1\). For this reason, we restrict ourselves to PGMRES and Ev-Int for solving the all-at-once problem. The results reported in Table 5 refer to the case \(\gamma = 0.3\) and show that Ev-Int significantly outperforms PGMRES, despite the low number of iterations needed by the latter to converge.

Table 5 Fractional time differentiation; performances of PGMRES and Ev-Int

5.5 Wave equation

We consider the linear wave equation

$$\begin{aligned} {\left\{ \begin{array}{ll} u_{tt}-\Delta u=(1+2\pi ^2)e^t\sin (\pi x)\sin (\pi y),&{}\text {on } S\times (0, T),\\ u(x,y,t)=0, &{}\text {on } \partial S \times (0, T),\\ u(\cdot ,\cdot ,0) = u_0(x,y),\quad u_t(\cdot ,\cdot ,0) = u_1(x,y), &{} \text {in }S, \end{array}\right. } \end{aligned}$$

with \(S =(0, 1)^2\), \(T=1\), \(u_0(x,y)=u_1(x,y)=\sin (\pi x)\sin (\pi y)\). As described in [20, Section 3], by discretizing with finite differences in space and with the implicit leap-frog scheme in time, we get the all-at-once linear system \((B_1\otimes I+B_2\otimes A){\textbf{x}}={\textbf{f}}\), where \(A= \textrm{trid}(-1, 2, -1)\cdot (n+1)^2\) and

$$\begin{aligned} B_1=\frac{1}{\Delta t^2}\begin{bmatrix} 1\\ -2&{}\quad 1\\ 1&{}\quad -2&{}\quad 1\\ &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots \\ &{}&{}\quad 1&{}\quad -2&{}\quad 1 \end{bmatrix}\in {\mathbb {R}}^{n_t\times n_t},\qquad B_2=\frac{1}{2}\begin{bmatrix} 1\\ 0&{}\quad 1\\ 1&{}\quad 0&{}\quad 1\\ &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots \\ &{}&{}\quad 1&{}\quad 0&{}\quad 1 \end{bmatrix}\in {\mathbb {R}}^{n_t\times n_t}. \end{aligned}$$
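Both banded Toeplitz factors can be assembled with spdiags; a short sketch, with illustrative values of \(n_t\) and \(\Delta t = T/n_t\), is:

```matlab
% Sketch: time-stepping matrices of the implicit leap-frog scheme;
% nt and T below are illustrative values.
nt = 256; T = 1; dt = T/nt;
e  = ones(nt,1);
B1 = spdiags([e, -2*e, e], [-2, -1, 0], nt, nt) / dt^2;
B2 = spdiags([e, 0*e, e], [-2, -1, 0], nt, nt) / 2;
```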

We note that, for this example, all methods based on low-rank updates generate quite small Krylov subspaces, whose dimensions do not grow as the parameters n and \(n_t\) vary. As a consequence, EK is the cheapest in terms of computational time, and we only report its performance. We remark that, in order to obtain comparable accuracy, we have considered \(d=3\) roots of unity when running Ev-Int. The preconditioner employed in PGMRES uses the parameter \(\alpha =0.1\), as suggested in [20], and the tolerance for stopping the GMRES iteration has been set to \(10^{-7}\). The results are shown in Table 6. Comparing \(\texttt {PGMRES}\), \(\texttt {Ev-Int}\), and \(\texttt {EK}\) leads to conclusions similar to those for the 1D heat equation.

Table 6 Wave equation; performances of the various methods

5.6 Runge–Kutta example

We present a numerical test on the all-at-once Runge–Kutta formulation introduced in Sect. 4. Let us start from the finite difference semidiscretization of the following differential problem:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_t=u_{xx},&{}(x,t)\in (0,1)^2,\\ u(x, 0) = \sin (\pi x), \end{array}\right. } \quad \Rightarrow \quad {\left\{ \begin{array}{ll} \dot{U}(t)=AU(t),&{}A=\textrm{trid}(1,-2,1)\cdot (n+1)^2, \\ U(0) = {\textbf{x}}_0,&{}({\textbf{x}}_0)_j=\sin (j\pi /(n+1)). \end{array}\right. } \end{aligned}$$
(29)

We consider solving the all-at-once formulation (25) corresponding to the following four-stage (\(s=4\)), third-order, diagonally implicit Runge–Kutta (DIRK) method:

$$\begin{aligned} \begin{array}{c|cccc} 1/2&{}1/2\\ 2/3&{}1/6&{}1/2\\ 1/2&{}-1/2&{}1/2&{}1/2\\ 1&{}3/2&{}-3/2&{}1/2&{}1/2\\ \hline &{}3/2&{}-3/2&{}1/2&{}1/2 \end{array}. \end{aligned}$$
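As a sanity check (not part of the benchmark), one can verify numerically that this tableau satisfies the classical third-order conditions; in the sketch below, Ark, b, and c denote the Runge–Kutta matrix, the weights, and the nodes:

```matlab
% Sketch: third-order conditions b'*1 = 1, b'*c = 1/2, b'*c.^2 = 1/3,
% b'*(Ark*c) = 1/6 for the DIRK tableau above.
Ark = [ 1/2    0    0    0;
        1/6   1/2   0    0;
       -1/2   1/2  1/2   0;
        3/2  -3/2  1/2  1/2];
b = [3/2; -3/2; 1/2; 1/2];
c = sum(Ark, 2);                               % c = [1/2; 2/3; 1/2; 1]
disp([sum(b), b.'*c, b.'*c.^2, b.'*(Ark*c)])   % 1, 1/2, 1/3, 1/6
```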

More specifically, we solve (25) by applying the methods PGMRES, Ev-Int, and EK, adapted as described in Sect. 4. For this experiment, the parameter \(\rho \) used by Ev-Int is set to the value 0.05. As shift parameter for the update equation (28) solved by EK, we have chosen \(\sigma = -2\). The performances of the three methods are reported in Table 7; as observed in the previous tests, the approaches based on low-rank updates and evaluation-interpolation outperform PGMRES in both speed and accuracy. As the number of time steps increases, the cost of the update equation becomes relatively cheap, making EK faster than Ev-Int. The relative residuals obtained with EK and Ev-Int are comparable.

Problem (29) admits the explicit solution \(U(t)=\exp (tA){\textbf{x}}_0\), so that we can compare approximate solutions against a benchmark solution computed by means of the expm MATLAB function. Figure 5 reports the relative error \(\Vert \widehat{{\textbf{x}}}_t-\exp (tA){\textbf{x}}_0\Vert _2/\Vert \exp (tA){\textbf{x}}_0\Vert _2\) of the approximants \(\widehat{{\textbf{x}}}_t\) returned by EK, for \(n=500\) and different time step sizes; the plot is in agreement with the order 3 of the Runge–Kutta scheme.
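A minimal sketch of this comparison at the k-th time step (xhat_k denotes the computed approximation of \(U(k\Delta t)\) extracted from the all-at-once solution; the names are illustrative) reads:

```matlab
% Sketch: relative error at time t = k*dt against the expm benchmark;
% A and x0 are defined in (29), while k, dt, xhat_k are assumed given.
uref = expm(full(k*dt*A)) * x0;
err  = norm(xhat_k - uref) / norm(uref);
```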

Table 7 Performances of the Runge–Kutta schemes for different values of \(n=250,500\) and \(n_t=250,500,1000,2000\)
Fig. 5 Relative error norm of EK for \(n=500\) and \(n_t=250,500,1000,2000\); the error decreases with order 3 with respect to the time step size

5.7 Preliminary experiments on parallelism

We conclude by testing the parallel scaling of the proposed algorithms in a multi-core computational environment. The experiments in this section have been run on a server with two Intel(R) Xeon(R) E5-2650v4 CPUs, each with 12 cores and 24 threads running at 2.20 GHz, using MATLAB R2022b with the Intel(R) Math Kernel Library Version 11.3.1. The tests have been run using the SLURM scheduler, allocating 12 cores and 250 GB of RAM. Our implementation exploits parallelism for the solution of the block diagonal linear system generated by fast_diag_solve (lines 4-6 of Algorithm 1). This parallel routine is used as a preconditioner in PGMRES, is called at line 2 of Algorithm 2, and is called at line 4 of Algorithm 3 (Ev-Int). Finally, we also parallelize the for loop in Ev-Int (lines 2-5 of Algorithm 3).

As a case study, we consider the fractional diffusion example of Sect. 5.3 with \(n=4225\) and \(n_t=8192\), and we measure the execution times of the procedures PGMRES, Ev-Int, and ZOL-DI as the number of cores p ranges over the set \(\{1,2,4,8,12\}\). The tolerance for stopping GMRES has been set to \(10^{-6}\) in order to obtain residual norms comparable with those of the other two methods. To evaluate the efficiency of the implementation, we also compute the strong scaling efficiency (SE) with p cores, defined as

$$\begin{aligned} \textrm{SE}(p):=100 \cdot \frac{\textrm{Time}(\text {1 core})}{p\cdot \textrm{Time}(\text {{ p} cores})}. \end{aligned}$$
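For instance, given the measured wall-clock times, the efficiencies reported in Table 8 follow from the sketch below (the timing vector t contains placeholder values, not data from Table 8):

```matlab
% Sketch: strong scaling efficiency from measured times; t(i) holds the
% wall-clock time with cores(i) cores (placeholder values, not from Table 8).
cores = [1 2 4 8 12];
t     = [100 52 28 16 12];            % illustrative timings in seconds
SE    = 100 * t(1) ./ (cores .* t);   % in percent
```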

The results reported in Table 8 clearly show that Ev-Int is the method that gains the most from parallelism, maintaining more than 70% strong scaling efficiency up to 12 cores. PGMRES also leverages multiple cores, to the extent of becoming faster than ZOL-DI when \(p=12\). On the contrary, our current implementation of ZOL-DI does not exploit more than one core for solving the update equation and, consequently, its parallel efficiency drops as soon as the cost of (8) becomes dominant. Adapting the parallel implementation of RKSM [8] is nontrivial and beyond the scope of this work, but it could potentially address the scaling issue of ZOL-DI.

Table 8 Fractional space diffusion with \(n=4225\) and \(n_t=8192\); parallel efficiency of the procedures as the number of cores p increases

6 Conclusions

We have proposed a tensorized Krylov subspace method and an interpolation method that replace the iterative refinement phase required by ParaDiag algorithms with uniform time steps. In the case of multistep methods, a theoretical analysis of the approximation properties of the Krylov subspace method has been provided. We have presented several numerical tests demonstrating that our approaches can significantly outperform block-circulant preconditioned GMRES iterations for solving linear initial value problems. Let us emphasize that the focus of this work has been on theoretical and algorithmic aspects; we have not carried out an extensive study of the parallel implementation of our algorithms, which will be the subject of future work. From the analysis in Sects. 2.3 and 3.1, we expect the interpolation method (Algorithm 3) to perform significantly better in a massively parallel environment.

We remark that Algorithms 2 and 3 can directly be used in place of existing ParaDiag algorithms for solving certain nonlinear ODEs of the form \(M\dot{U}+f(U)=0\) [17] and for computing the so-called coarse grid correction of the Parareal algorithm [22, 33].

In Sect. 4, we have discussed how to extend the methodology to implicit Runge–Kutta time integration, but some questions remain for future work. In particular, a careful analysis of the low-rank approximability of the solution of Eq. (28) and of the choice of the shift parameter \(\sigma \) would be desirable.