1 Introduction

Complex networks have become an essential tool in many scientific areas [36, 69, 70] for studying the interaction between different elements within a system. From these networks, it is possible, for instance, to identify the most important elements in the system, such as the key proteins in a protein interaction network [35], the keystone species in an ecological network [59], vulnerable infrastructures [4] or the least resilient nodes in a transportation network [9].

Measures of node importance are referred to as node centralities, and many metrics have been proposed over the years [36, 70]. Here, we consider the family of centrality measures defined in terms of matrix functions [38], which classify the nodes according to how well they can spread information to the other nodes in the network. Both the Katz centrality [60] and the subgraph centrality [39] belong to this family. In most cases, the node centrality is computed from the matrix resolvent \((\textbf{I} - \gamma \textbf{A})^{-1}\) or the exponential \(\exp {(\textbf{A})}\), but other functions [11, 17] can be used as well. Here \(\textbf{A} \in \mathbb {R}^{n \times n}\) denotes the network’s adjacency matrix. In this paper, we focus on the subgraph centrality [39] (i.e., \(\mathop {\textrm{diag}}\limits (f(\textbf{A}))\)) and the total communicability [15] (i.e., \(f(\textbf{A})\textbf{1}\)), where \(f(\textbf{A})\) is a given matrix function.

Although these centrality measures have been successfully used in many problems [17, 38, 39, 41], computing a matrix function for large networks can be very demanding with current numerical methods. Direct methods, such as expm [5, 53] or the Schur-Parlett algorithm [26], have a computational cost of \(O(n^3)\) and yield a full dense matrix \(f(\textbf{A})\), and hence are only feasible for small matrices. Methods based on Gaussian quadrature rules [14, 42, 46] can estimate the diagonal entries of \(f(\textbf{A})\) without evaluating the full matrix function, but are prone to numerical breakdown when \(\textbf{A}\) is sparse and large (which is often the case with real networks), and thus are often employed to determine only the most important nodes in the network. Krylov-based methods [3, 33, 51, 52] can efficiently compute \(f(\textbf{A})\textbf{v}\), for \(\textbf{v} \in \mathbb {R}^{n}\), provided that \(\textbf{A}\) is well conditioned. Otherwise, their convergence can be very slow or even stagnate, since a general and well-established procedure to precondition the matrix \(\textbf{A}\) does not exist. Rational Krylov methods are often more resilient to the condition number and provide a better approximation to \(f(\textbf{A})\textbf{v}\) than polynomial ones, but require solving a linear system for each vector of the basis. Moreover, the stopping criterion for these methods remains an open issue [52].

Monte Carlo methods [18, 28, 29, 43, 58] provide an alternative way to calculate matrix functions and were originally developed for solving linear systems. In essence, these methods construct a discrete-time Markov chain whose underlying random paths evolve through the different indices of the matrix, which can be formally understood as a Monte Carlo sampling of the corresponding Neumann series. Their convergence has been rigorously established in [18, 30, 58]. Recently, [1, 2, 48] extended these methods to the evaluation of the matrix exponential and Mittag-Leffler functions.

Another strategy is to construct a random sketch (i.e., a probabilistic representation of the matrix) and then use it to approximate the desired operation. This is a basic idea in contemporary numerical linear algebra [64, 66]. Some recent studies have shown that a polynomial Krylov method can be accelerated using randomization techniques [25, 49]. In this paper, we propose a new stochastic algorithm that randomly samples full rows and columns of the matrix as a way to approximate the target function using the corresponding power series expansion. Through an extensive set of numerical experiments, we show that our approach converges much faster than the original Monte Carlo method and that it is particularly effective for estimating the subgraph centrality and total communicability of large networks. We also compare our method against other classical and randomized methods considering very large matrices.

The paper is organized as follows. Section 2 presents a brief overview of the centrality measures defined in terms of matrix functions. Section 3 describes our randomized algorithm and how it can be implemented efficiently. In Sect. 4, we evaluate the performance and accuracy of our method by running several benchmarks with both synthetic and real networks. We also compare our method against several other algorithms. In the last section, we conclude our paper and present some future work.

2 Background

In this section, we introduce some definitions and ideas from graph theory that will be used throughout the paper.

2.1 Graph Definitions

A graph or network \(G = (V, E)\) is composed of a set \(V = \{1, 2, \dots , n\}\) of nodes (or vertices) and a set \(E = \{(u, v): u, v \in V\}\) of edges between them [70]. A graph is undirected if the edges are bidirectional and directed if the edges are unidirectional. A walk of length k over the graph G is a sequence of vertices \(v_1, v_2, \dots , v_{k + 1}\) such that \((v_i, v_{i + 1}) \in E\) for \(i = 1, 2, \ldots , k\). A closed walk is a walk that starts and ends at the same vertex, i.e., \(v_1 = v_{k + 1}\). An edge from a node to itself is called a loop. A subgraph of graph G is a graph created from a subset of nodes and edges of G. The degree of a node is defined as the number of edges entering or exiting the node. In directed graphs, the in-degree counts the number of incoming edges and the out-degree the number of outgoing edges.

A graph G can be represented through an adjacency matrix \(\textbf{A} \in \mathbb {R}^{n \times n}\):

$$\begin{aligned} \textbf{A} = (a_{ij}); \qquad a_{ij} = \begin{cases} w_{ij}, & \quad \text {if edge } (i, j) \text { exists in graph } G, \\ 0, & \quad \text {otherwise,} \end{cases} \end{aligned}$$
(1)

where \(w_{ij}\) is the weight of the edge \((i, j)\). In this paper, we focus our attention on graphs that are undirected and unweighted (i.e., \(w_{ij} = 1\) for all edges \((i, j)\)) and do not contain loops or have multiple edges between nodes. Consequently, all matrices in this paper will be symmetric, binary, and with zeros along the diagonal. Notwithstanding, it is worth mentioning that our method is general and can be applied to other classes of matrices.

2.2 Centrality Measures

There are many node centrality measures, and the simplest one must be the degree centrality [44]. The degree of a node provides a rough estimation of its importance, yet it fails to take into consideration the connectivity of the immediate neighbours with the rest of the network. Instead, let us consider the flow of information in the network. An important node must be part of routes where the information can flow through, and thus, be able to spread information very quickly to the rest of the network. These information routes are represented as walks over the network. This is the main idea behind walk-based centralities and was formalized by Estrada and Higham [38].

In graph theory, it is well known that the entry \((\textbf{A}^k)_{ij}\) counts the number of walks of length \(k \ge 1\) over graph G that start at node i and end at node j. Then, the entry \(f_{ij}\) of the matrix function \(f(\textbf{A})\) defined as

$$\begin{aligned} f(\textbf{A}) = \sum _{k = 0}^\infty {\zeta _k \textbf{A}^k} \end{aligned}$$
(2)

measures how easily the information can travel from node i to node j. Each power \(\textbf{A}^k\) is scaled by a coefficient \(\zeta _k\), such that \(\zeta _k \ge \zeta _{k + 1} \ge 0\) and \(\zeta _k \rightarrow 0\) as \(k \rightarrow \infty \), in order to penalize the contribution of longer walks and ensure the convergence of the series. The two most common choices for \(f(\textbf{A})\) are the matrix exponential \(e^{\textbf{A}}\) and the resolvent \((\textbf{I} - \gamma \textbf{A})^{-1}\) [38], but other matrix functions can be used as well [10, 11, 17].

From the power series (2), Estrada defined the f-subgraph centrality [38, 39] as the diagonal of \(f(\textbf{A})\), that is \(fSC(i) = (f(\textbf{A}))_{ii}\), which measures the importance of node i based on its participation in all subgraphs of the network. The sum of the subgraph centrality over all nodes has become known as the Estrada Index [27, 34], which was first introduced to quantify the degree of folding in protein chains [34], and was later extended to characterize the global structure of general complex networks [37, 40].

Later, Benzi and Klymko [15, 17] introduced the concept of f-total communicability based on the row sum of \(f(\textbf{A})\), ranking the nodes according to how well they can communicate with the rest of the network. Formally, the f-total communicability is expressed as

$$\begin{aligned} fTC(i) = (f(\textbf{A}) \textbf{1})_i, \end{aligned}$$
(3)

where \(\textbf{1}\) is a vector of length n with all entries set to 1. If we consider the matrix resolvent \(f(\textbf{A}) = (\textbf{I} - \gamma \textbf{A})^{-1}\), the total communicability of a node corresponds to the well-known Katz’s Centrality [55, 60].

In the context of network science, it is common to introduce a weighting parameter \(\gamma \in (0, 1)\) and work with the parametric matrix function \(f(\gamma \textbf{A})\). The parameter \(\gamma \) can be interpreted as the inverse temperature and accounts for external disturbances on the network [41]. Furthermore, the value of \(\gamma \) is often chosen in such a way that the terms \(\zeta _k (\gamma \textbf{A})^k\) in (2) are monotonically decreasing in order to preserve the notion that the information travels faster to nearby nodes compared to those that are farther away.
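To make these definitions concrete, the following minimal sketch (not part of the paper's implementation, which is written in C++) computes the f-subgraph centrality and f-total communicability of a small toy graph with the matrix exponential, i.e., \(\zeta _k = \gamma ^k / k!\) in (2). The adjacency matrix and the value of \(\gamma \) below are only illustrative.

```python
# Minimal sketch: subgraph centrality and total communicability via the matrix exponential.
import numpy as np
from scipy.linalg import expm

# Toy 4-node undirected, unweighted graph (illustrative example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

gamma = 0.5                      # inverse-temperature weighting parameter
F = expm(gamma * A)              # f(gamma*A) with f = exp, i.e. zeta_k = gamma^k / k! in Eq. (2)

fSC = np.diag(F)                 # f-subgraph centrality, fSC(i) = (f(A))_ii
fTC = F @ np.ones(A.shape[0])    # f-total communicability, Eq. (3)

print("fSC:", fSC)
print("fTC:", fTC)
print("Estrada index:", fSC.sum())   # sum of the subgraph centralities
```

For large sparse networks, forming the dense matrix \(f(\gamma \textbf{A})\) in this way is precisely what becomes infeasible, which motivates the randomized approach presented in the next section.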

3 Randomized Algorithm for Matrix Functions

Algorithm 1
figure a

A Monte Carlo method adapted from [18] for computing a matrix \(\textbf{F}\) as an approximation of \(f(\textbf{A})\). \(N_s\) represents the number of random walks for approximating each row and \(W_c\) is the weight cutoff.

Ulam and von Neumann [43] were the first to propose a Monte Carlo method for computing the matrix inverse as a way to solve linear systems, and it was later refined in [18, 28, 30]. It consists of generating random walks over the matrix \(\textbf{A}\) to approximate each power in the corresponding series (2). Starting from a row \(\ell _0\), a random walk consists of a multiplicative random variable \(W^{(k)}\) and a sequence of states \(\ell _0, \ell _1, \ldots , \ell _k\), which are obtained by randomly jumping from one row to the next. At each step k, the algorithm updates \(W^{(k)}\) and adds the result to the entry \((f(\textbf{A}))_{\ell _0 \ell _k}\) as an approximation of the k-th term in the series (2).

The full procedure is described in Algorithm 1. The SelectNextState routine randomly selects an entry in row \(\ell _k\) to determine which row to jump to in the next step of the random walk. The probability of choosing an entry j is equal to \(\mathbb {P}(\ell _{k + 1} = j \mid \ell _k = i) = t_{ij}\), where \(t_{ij}\) is an entry of a transition probability matrix \(\textbf{T}\).
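For illustration, the sketch below mimics the classical scheme in the spirit of Algorithm 1 (the actual implementation in this paper is written in C++ with OpenMP; the function and parameter names here are ours). Each walk starts at a fixed row, jumps between states according to \(\textbf{T}\), and adds \(\zeta _k W^{(k)}\) to a single entry of the approximation at every step. The matrix passed in should already include the weighting parameter (e.g., \(\gamma \textbf{A}\)), so that the walk weights decay and the cutoff is eventually reached.

```python
# Rough sketch (not the authors' code) of the classical row-by-row Monte Carlo scheme.
import numpy as np

def mc_matrix_function(A, zeta, n_walks=1000, w_cutoff=1e-6, max_steps=100, rng=None):
    """Estimate F ~ sum_k zeta(k) A^k; each walk updates one entry of F at a time."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    F = np.zeros((n, n))
    row_sums = np.abs(A).sum(axis=1)             # used for t_ij = |a_ij| / sum_k |a_ik|
    for i in range(n):                           # n_walks walks per starting row
        for _ in range(n_walks):
            state, weight, k = i, 1.0, 0
            F[i, state] += zeta(k) * weight / n_walks        # k = 0 term (A^0 = I)
            while k < max_steps and abs(weight) > w_cutoff and row_sums[state] > 0:
                probs = np.abs(A[state]) / row_sums[state]
                nxt = rng.choice(n, p=probs)                 # SelectNextState
                weight *= A[state, nxt] / probs[nxt]         # multiplicative weight update
                state, k = nxt, k + 1
                F[i, state] += zeta(k) * weight / n_walks
    return F
```

For the exponential \(e^{\gamma \textbf{A}}\) one would call it, for instance, with zeta = lambda k: 1/math.factorial(k) and the scaled matrix gamma*A; the max_steps guard is a safety cap of ours, not part of the original scheme.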

The main limitation of this method is that each random walk only updates a single entry of \(f(\textbf{A})\) at a time, requiring a large number of walks just to estimate a single row with reasonable accuracy. Therefore, our objective is to modify this algorithm such that it samples entire rows and columns of \(\textbf{A}\) when approximating each term in the series (2), drastically reducing the number of walks necessary to achieve the desired precision.

3.1 Mathematical Description of the Method

In the following, we discuss how to extend the randomized matrix product algorithm proposed in [31, 32] to compute an arbitrary matrix power.

Lemma 1

Let \(R_i\) and \(C_j\) denote the i-th row and j-th column of \(\textbf{A} \in \mathbb {R}^{n \times n}\). The matrix power \(\textbf{A}^p\) with \(p \in \mathbb {N}\) and \(p \ge 2\) can be evaluated as

$$\begin{aligned} \textbf{A}^p = \sum _{i_2 = 1}^n{\sum _{i_3 = 1}^n{\dots \sum _{i_p = 1}^n{C_{i_2} a_{i_2 i_3} a_{i_3 i_4} \dots a_{i_{p - 1} i_p} R_{i_p}}}}. \end{aligned}$$
(4)

Proof

Recall that the matrix product \(\textbf{AB}\) with \(\textbf{A}, \textbf{B} \in \mathbb {R}^{n \times n}\) can be expressed as a sum of outer products [32]:

$$\begin{aligned} \textbf{AB} = \sum _{k = 1}^{n} C_k B_k, \end{aligned}$$

where \(C_k\) is the k-th column of \(\textbf{A}\) and \(B_k\) is the k-th row of \(\textbf{B}\). Therefore, a power p of a square matrix \(\textbf{A}\) can be written as

$$\begin{aligned} \textbf{A}^p = \sum _{k = 1}^n{C_k R_k^{(p - 1)}}, \end{aligned}$$
(5)

where \(R_k^{(p - 1)}\) is the k-th row of \(\textbf{A}^{p - 1}\). For a single row, (5) is reduced to

$$\begin{aligned} R^{(p)}_i = \sum _{k = 1}^n{a_{ik} R_k^{(p - 1)}}. \end{aligned}$$
(6)

Recursively applying (6) for the powers \(p, p - 1, \dots , 2\) and then substituting the expansion in (5) leads to the expression in (4). \(\square \)
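As a quick numerical sanity check of Lemma 1, the following toy snippet (assumed sizes: a random \(5 \times 5\) matrix and \(p = 3\)) compares the sum of rank-one matrices in (4) against the exact matrix power.

```python
# Brute-force verification of Eq. (4) for a small random matrix.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
A = rng.standard_normal((n, n))

# Sum of rank-one matrices C_{i2} a_{i2 i3} ... a_{i_{p-1} i_p} R_{ip}
S = np.zeros((n, n))
for idx in itertools.product(range(n), repeat=p - 1):        # indices i_2, ..., i_p
    coeff = np.prod([A[idx[j], idx[j + 1]] for j in range(p - 2)])
    S += coeff * np.outer(A[:, idx[0]], A[idx[-1], :])       # C_{i_2} (...) R_{i_p}

print(np.allclose(S, np.linalg.matrix_power(A, p)))          # True
```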

Corollary 1

Let \(f(\textbf{A}): \mathbb {R}^{n \times n} \rightarrow \mathbb {R}^{n \times n}\) be a matrix function defined by the power series

$$\begin{aligned} f(\textbf{A})=\zeta _0\textbf{I} + \zeta _1\textbf{A}+\sum _{k=2}^{\infty } {\zeta _k \textbf{A}^k}=\textbf{H} + \textbf{U}, \end{aligned}$$
(7)

where \(\textbf{H} = \zeta _0\textbf{I} + \zeta _1\textbf{A}\). The matrix \(\textbf{U}\) can be written as the following sum of rank-one matrices:

$$\begin{aligned} \textbf{U}= \sum _{k = 2}^{\infty }{ \sum _{i_2 = 1}^n{\sum _{i_3 = 1}^n{\dots \sum _{i_k = 1}^n{\zeta _k C_{i_2} a_{i_2 i_3} a_{i_3 i_4} \dots a_{i_{k - 1} i_k} R_{i_k}}}}}. \end{aligned}$$
(8)

The multiple sums appearing in the definition of the matrix \(\textbf{U}\) in (8) can be exploited in practice to construct a probabilistic algorithm similar to [18, 28], which was originally devised for computing the matrix inverse. In fact, the formal procedure is analogous, but instead of generating random scalar variables, the goal here is to generate random rank-one matrices governed by the following Markov chain.

Definition 1

Let \(\{X_k:k\ge 0\}\) be a Markov chain taking values in the state space \(S~=~\{1,2,\dots ,n\}\) with initial distribution and transition probability matrix given by

$$\begin{aligned} (i)&\quad \mathbb {P}(X_0 = \ell _0) = p_{\ell _0}=\frac{\Vert C_{\ell _0}\Vert _2}{\sum _{k = 1}^n{\Vert C_k\Vert _2}}, \end{aligned}$$
(9)
$$\begin{aligned} (ii)&\quad \textbf{T} = (t_{ij}); \quad t_{ij} = \mathbb {P}(\ell _{k + 1} = j \mid \ell _k = i) = \frac{|a_{ij}|}{\sum _{k = 1}^n{|a_{ik}|}}. \end{aligned}$$
(10)

Here, the index \(\ell _m\) denotes the state reached by the Markov chain after m steps. We use the initial state \(\ell _0\) of the Markov chain to randomly choose a column of the matrix \(\textbf{A}\), \(C_{\ell _0}\). After k steps of the Markov chain, the state is \(\ell _k\), which is used to select the corresponding row of the matrix \(\textbf{A}\), \(R_{\ell _k}\). During this random evolution of the Markov chain, different states are visited according to the corresponding transition probabilities, and along this process a suitable multiplicative random variable \(W^{(k)}\) is updated accordingly. Finally, the matrix \(\textbf{U}\) can be computed through the expected value of a given functional, as proved formally by the following lemma.
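A possible sketch of the two sampling distributions in Definition 1, assuming the matrix is stored in a SciPy sparse format (the helper names are ours, not from the paper):

```python
# Initial distribution (Eq. (9)) and transition matrix (Eq. (10)) for a sparse matrix A.
import numpy as np
import scipy.sparse as sp

def initial_distribution(A):
    """p_j = ||C_j||_2 / sum_k ||C_k||_2, Eq. (9)."""
    A = sp.csr_matrix(A)
    col_norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=0)).ravel())
    return col_norms / col_norms.sum()

def transition_matrix(A):
    """Row-stochastic T with t_ij = |a_ij| / sum_k |a_ik|, Eq. (10)."""
    absA = abs(sp.csr_matrix(A))
    row_sums = np.asarray(absA.sum(axis=1), dtype=float).ravel()
    inv = np.divide(1.0, row_sums, out=np.zeros_like(row_sums), where=row_sums > 0)
    return sp.diags(inv) @ absA
```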

Lemma 2

Let \(\textbf{Z}(\omega )\) be a realization of a random matrix at a point \(\omega \) of the discrete sample space, defined as

$$\begin{aligned} \textbf{Z}(\omega ) = \sum _{k = 0}^\infty {\zeta _{k+2} \,C_{X_0(\omega )}\, W^{(k)} R_{X_k(\omega )}}. \end{aligned}$$
(11)

Here \(W^{(k)}\) is a multiplicative random variable defined recursively as

$$\begin{aligned} W^{(0)} = \frac{1}{p_{\ell _0}} \qquad W^{(k)} = W^{(k - 1)} \frac{a_{\ell _{k - 1} \ell _k}}{t_{\ell _{k - 1} \ell _k}}. \end{aligned}$$
(12)

Then, it holds that

$$\begin{aligned} \textbf{U} = \mathbb {E}[\textbf{Z}]. \end{aligned}$$
(13)

Proof

Note that \(\textbf{Z}(\omega )\) is obtained from Eq. (11) as a sum of independent random matrices \(\textbf{Y}^{(k)}(\omega )\):

$$\begin{aligned} \textbf{Z} (\omega ) = \sum _{k = 0}^\infty {\zeta _{k+2} \textbf{Y}^{(k)} (\omega )}, \end{aligned}$$
(14)

where

$$\begin{aligned} \textbf{Y}^{(k)}(\omega ) = C_{X_0(\omega )} W^{(k)} R_{X_k(\omega )} = C_{X_0(\omega )} \frac{a_{\ell _0 \ell _1} a_{\ell _1 \ell _2} \dots a_{\ell _{k - 1} \ell _k}}{p_{\ell _0} t_{\ell _0 \ell _1} t_{\ell _1 \ell _2} \dots t_{\ell _{k - 1} \ell _k}} R_{X_k(\omega )}. \end{aligned}$$

Let \(P^{(k)}(\omega )\) be the probability of the event \(\omega \), consisting of a transition from \(\ell _0\) to \(\ell _k\) after k steps. This probability turns out to be \(p_{\ell _0} t_{\ell _0 \ell _1} t_{\ell _1 \ell _2} \dots t_{\ell _{k - 1} \ell _k}\). Therefore, the expected value of the random matrix \(\textbf{Y}^{(k)}(\omega )\) is given by

$$\begin{aligned} \mathbb {E}[\textbf{Y}^{(k)}]&= \sum _\omega {P^{(k)}(\omega ) \textbf{Y}^{(k)}(\omega )} \\&= \sum _\omega {C_{X_0(\omega )} a_{\ell _0 \ell _1} a_{\ell _1 \ell _2} \dots a_{\ell _{k - 1} \ell _k} R_{X_k(\omega )}}. \end{aligned}$$

Note that every event \(\omega \) can be described by the values of \(k+1\) integer indices, each running from 1 to n, so that

$$\begin{aligned} \mathbb {E}[\textbf{Y}^{(k)}]&= \sum _{i_0 = 1}^n{\sum _{i_1 = 1}^n{\dots \sum _{i_{k} = 1}^n{C_{i_0} a_{i_0 i_1} a_{i_1 i_2} \dots a_{i_{k-1} i_{k}} R_{i_{k}}}}}. \end{aligned}$$

Therefore, from (4) we conclude that \(\mathbb {E}[\textbf{Y}^{(k)}]=\textbf{A}^{k+2}\). Finally, after summing all the contributions coming from any number of steps, using (14) and the linearity of the expected value operator, we obtain

$$\begin{aligned} \mathbb {E}[\textbf{Z}] = \sum _{k = 0}^\infty {\zeta _{k+2} \mathbb {E}[\textbf{Y}^{(k)}]} = \sum _{k = 0}^\infty {\zeta _{k+2} \textbf{A}^{k+2}} = \textbf{U}. \end{aligned}$$
(15)

\(\square \)

3.2 Practical Implementation of the Probabilistic Method

To transform Lemma 2 into a practical algorithm, we must first select a finite sample size \(N_s\) and then compute the expected value \(\mathbb {E}[\textbf{Z}]\) in (13) as the corresponding arithmetic mean. Additionally, each random walk must terminate after a finite number m of steps. Mathematically, this is equivalent to considering only the first m terms of the power series expansion. Since some random walks may retain important information about the matrix for more steps than others, and it is very difficult to determine a priori the number of steps required to achieve a specific precision, we adopt the following termination criterion: a random walk ends when its associated weight falls below a relative threshold \(W_{c}\) [18]. In other words, a random walk terminates at step m when

$$\begin{aligned} W^{(m)} \le W_{c} W^{(0)}, \end{aligned}$$
(16)

where \(W^{(m)}\) is the weight after m steps and \(W^{(0)}\) is the weight at the initial step of the random walk. Formally, the infinite series in (15) is truncated as

$$\begin{aligned} f(\textbf{A}) \approx \textbf{H} + \hat{\textbf{U}} \end{aligned}$$
(17)

with

$$\begin{aligned} \hat{\textbf{U}} = \frac{1}{N_s} \sum _{s = 1}^{N_s}{\sum _{k = 0}^m{\zeta _{k + 2} C_{X_0(s)} \Omega _{\ell _0 \ell _k}^{(k)} R_{X_k(s)}}} \end{aligned}$$
(18)

where \(\Omega _{ij}^{(k)}\) corresponds to the weight \(W^{(k)}\) of the random walk that began at state i and arrived at state j after k steps. Here s indicates a realization of a random walk.

Computing \(\hat{\textbf{U}}\) directly from (18) is ill-advised due to the large number of outer products, which are also very difficult to parallelise efficiently. Instead, let us rearrange (18) into a more suitable form. The random walks with the same starting column can be grouped as

$$\begin{aligned} \hat{\textbf{U}} = \sum _{i = 1}^{n}{C_i \left( \frac{1}{N_i} \sum _{s = 1}^{N_i}{\sum _{k = 0}^m{\zeta _{k + 2} \Omega _{i\, \ell _k}^{(k)} R_{X_k(s)}}} \right) }, \end{aligned}$$
(19)

where \(N_i\) is the number of random walks that began at column i. Assuming that \(N_s \gg 1\), the value of \(N_i\) can be estimated a priori as \(N_i \approx p_i N_s\) with \(p_i = \mathbb {P}(X_0 = i)\) as defined in (9).

Let \(\nu _{ij}\) denote a visit to the state j at the step k of a random walk that started at state i. For each visit \(\nu _{ij}\), the weight of the walk is added to the entry \(q_{ij}\) of matrix \(\textbf{Q}\), defined as

$$\begin{aligned} q_{ij} = \frac{1}{N_{i}} \sum _{\nu _{ij}}{\zeta _{k + 2} \Omega _{ij}^{(k)}}. \end{aligned}$$
(20)

Then, we can rewrite (18) as

$$\begin{aligned} \hat{\textbf{U}} = \sum _{i = 1}^{n}{\sum _{j = 1}^{n} C_i q_{ij} R_{j}} = \textbf{A}\textbf{Q} \textbf{A}. \end{aligned}$$
(21)
Algorithm 2
figure b

A probabilistic algorithm for computing the matrix \(\textbf{F}\) as an approximation of \(f(\textbf{A})\). \(N_s\) represents the total number of random walks and \(W_c\) the weight cutoff.

Algorithm 2 describes the procedure for approximating \(f(\textbf{A})\) based on Eqs. (17) and (21). Assuming that the matrix \(\textbf{A}\) is sparse with \(N_{nz}\) nonzero entries, Algorithm 2 requires \(O(N_s m)\) operations to construct the matrix \(\textbf{Q}\) and \(O(n N_{nz})\) to calculate the final product \(\textbf{AQA}\), for a total computational cost of \(O(N_s m + n N_{nz})\). It also uses an additional \(O(n^2)\) space in memory to store the matrix \(\textbf{Q}\). It is possible to reduce the memory consumption by dividing the rows of \(\textbf{Q}\) into blocks in such a way that only one block is computed at a time. At the end of each block, the program updates the matrix \(f(\textbf{A})\) and reuses the memory allocation for the next block.
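The following sketch illustrates Algorithm 2 on a dense matrix for readability; the actual implementation stores \(\textbf{A}\) in a sparse format and is parallelized with OpenMP, and the function and parameter names below are ours. The coefficients are passed as a function zeta, e.g. zeta = lambda k: 1/math.factorial(k) together with the scaled matrix gamma*A for \(e^{\gamma \textbf{A}}\); the max_steps guard is a safety cap of ours.

```python
# Sketch of Algorithm 2 (RandFunm): F ~ H + A @ Q @ A, Eqs. (17), (20) and (21).
import numpy as np

def rand_funm(A, zeta, n_walks=10**5, w_cutoff=1e-6, max_steps=100, rng=None):
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)               # pass the already-scaled matrix, e.g. gamma*A
    n = A.shape[0]
    p = np.linalg.norm(A, axis=0)
    p = p / p.sum()                              # initial distribution, Eq. (9)
    row_sums = np.abs(A).sum(axis=1)
    Q = np.zeros((n, n))
    counts = np.zeros(n)                         # N_i: number of walks started at column i
    for i in rng.choice(n, size=n_walks, p=p):
        counts[i] += 1
        state, w, k = i, 1.0, 0
        Q[i, state] += zeta(2) * w               # k = 0 contribution to Eq. (20)
        while k < max_steps and abs(w) > w_cutoff and row_sums[state] > 0:
            probs = np.abs(A[state]) / row_sums[state]   # transition probabilities, Eq. (10)
            nxt = rng.choice(n, p=probs)
            w *= A[state, nxt] / probs[nxt]              # multiplicative weight, Eq. (12)
            state, k = nxt, k + 1
            Q[i, state] += zeta(k + 2) * w
    Q /= np.maximum(counts, 1)[:, None]          # 1/N_i normalization of Eq. (20)
    H = zeta(0) * np.eye(n) + zeta(1) * A
    return H + A @ Q @ A                         # Eq. (17) with U approximated by A Q A
```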

3.3 Diagonal of the Matrix Function

Algorithm 2 can be conveniently modified to compute only the diagonal of the matrix function. Let \(Q_k\) denote the k-th row of matrix \(\textbf{Q}\) defined in (20). Then, the diagonal of \(f(\textbf{A})\) is approximated by vector \(\textbf{d} = (d_i)\) as follows

$$\begin{aligned} d_{i} = \textbf{e}_{i}^\intercal \, f(\textbf{A}) \, \textbf{e}_i = \zeta _0 + \zeta _1 a_{ii} + \sum _{k = 1}^n a_{ik} \, \langle Q_k, C_i \rangle , \end{aligned}$$
(22)

where \(\langle \cdot , \cdot \rangle \) denotes the inner product and \(\textbf{e}_i\) a vector from the canonical basis. Essentially, the program updates the value of \(d_i\) immediately after the computation of row \(Q_k\). In this way, only a single row of \(\textbf{Q}\) needs to reside in memory at a time. Naturally, if \(a_{ik} \sim 0\), the program can skip the calculation of \(Q_k\) in order to save computational resources. Note that multiple entries can be computed at the same time by reusing \(Q_k\) and then selecting the appropriate entry \(a_{ik}\) and column \(C_i\).

In terms of computational cost, the computation of all diagonal entries requires \(O(N_s m + N_{nz} \bar{n}_c)\) floating-point operations, where \(\bar{n}_c\) denotes the average number of nonzero entries per column, and consumes an additional \(O(n)\) space in memory. The algorithm described in this section will be referred to as RandFunmDiag.
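As an illustration only, and assuming the matrix \(\textbf{Q}\) has already been assembled as in the previous sketch, the diagonal in (22) can be accumulated row by row without ever forming the product \(\textbf{AQA}\); in the actual RandFunmDiag the rows of \(\textbf{Q}\) are generated and discarded on the fly.

```python
# Sketch of Eq. (22): diagonal of f(A) from the rows of Q, without the full A @ Q @ A product.
import numpy as np

def diag_from_Q(A, Q, zeta):
    A = np.asarray(A, dtype=float)
    d = zeta(0) + zeta(1) * np.diag(A)           # contribution of H = zeta_0 I + zeta_1 A
    for k in range(A.shape[0]):
        nz = np.nonzero(A[:, k])[0]              # only entries i with a_ik != 0 use row Q_k
        if nz.size:
            d[nz] += A[nz, k] * (Q[k] @ A[:, nz])    # a_ik * <Q_k, C_i> for each such i
    return d
```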

3.4 Action of the Matrix Function Over a Vector

Let \(\textbf{v}\) be a real vector in \(\mathbb {R}^{n}\); our goal is to develop another algorithm based on Lemma 2 for computing \(f(\textbf{A}) \, \textbf{v}\). First, let us multiply the truncated series (17) by the vector \(\textbf{v}\):

$$\begin{aligned} f(\textbf{A})\textbf{v} \approx \textbf{h} + {\hat{\textbf{u}}} \end{aligned}$$
(23)

where

$$\begin{aligned} {\hat{\textbf{u}}} = \frac{1}{N_s} \sum _{s = 1}^{N_s}{\sum _{k = 0}^m{\zeta _{k + 2} C_{X_0(s)} \Omega _{\ell _0 \ell _k}^{(k)} r_{X_k(s)}}} \end{aligned}$$
(24)

with \(r_i = \langle R_i, \textbf{v} \rangle \) and \(\textbf{h} = \textbf{H}\textbf{v}\). Rearranging the series such that the random walks with the same starting column are grouped:

$$\begin{aligned} {\hat{\textbf{u}}} = \sum _{i = 1}^{n}{C_i \left( \frac{1}{N_i} \sum _{s = 1}^{N_i}{\sum _{k = 0}^m{\zeta _{k + 2} \Omega _{i \, \ell _k}^{(k)} r_{X_k(s)}}} \right) }. \end{aligned}$$

Then, the action of the matrix function \(f(\textbf{A})\) over the vector \(\textbf{v}\) can be approximated as

$$\begin{aligned} f(\textbf{A})\textbf{v} \approx \textbf{h} + \textbf{A}\textbf{q} \end{aligned}$$
(25)

with

$$\begin{aligned} \textbf{q} = (q_i); \qquad \textbf{q}_i = \frac{1}{N_i} \sum _{s = 1}^{N_i}{\sum _{k = 0}^m{\zeta _{k + 2} \Omega _{i \, \ell _k}^{(k)} r_{X_k(s)}}}. \end{aligned}$$
Algorithm 3
figure c

A probabilistic algorithm for computing \(\textbf{y}\) as an approximation of \(f(\textbf{A})\textbf{v}\). \(N_s\) represents the total number of random walks and \(W_c\) the weight cutoff threshold.

Algorithm 3 describes the procedure for approximating \(f(\textbf{A})\textbf{v}\) based on Eq. (25) and the definition of the vector \(\textbf{q}\). It has a time complexity of \(O(N_s m + N_{nz})\) and requires an additional \(O(n)\) space in memory to store the vector \(\textbf{q}\).
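A minimal sketch of Algorithm 3, analogous to the previous ones (dense storage for readability; the names and the max_steps cap are ours): the only differences are that the rows \(R_{\ell _k}\) are replaced by the scalars \(r_{\ell _k} = \langle R_{\ell _k}, \textbf{v} \rangle \) and that the accumulator is a vector \(\textbf{q}\) instead of the matrix \(\textbf{Q}\).

```python
# Sketch of Algorithm 3 (RandFunmAction): f(A) v ~ h + A q, Eq. (25).
import numpy as np

def rand_funm_action(A, v, zeta, n_walks=10**5, w_cutoff=1e-6, max_steps=100, rng=None):
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)               # pass the already-scaled matrix, e.g. gamma*A
    v = np.asarray(v, dtype=float)
    n = A.shape[0]
    p = np.linalg.norm(A, axis=0)
    p = p / p.sum()                              # initial distribution, Eq. (9)
    row_sums = np.abs(A).sum(axis=1)
    r = A @ v                                    # r_i = <R_i, v>
    q, counts = np.zeros(n), np.zeros(n)
    for i in rng.choice(n, size=n_walks, p=p):
        counts[i] += 1
        state, w, k = i, 1.0, 0
        q[i] += zeta(2) * w * r[state]
        while k < max_steps and abs(w) > w_cutoff and row_sums[state] > 0:
            probs = np.abs(A[state]) / row_sums[state]
            nxt = rng.choice(n, p=probs)
            w *= A[state, nxt] / probs[nxt]      # multiplicative weight, Eq. (12)
            state, k = nxt, k + 1
            q[i] += zeta(k + 2) * w * r[state]
    q /= np.maximum(counts, 1)                   # 1/N_i normalization
    h = zeta(0) * v + zeta(1) * (A @ v)          # h = (zeta_0 I + zeta_1 A) v
    return h + A @ q                             # Eq. (25)
```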

3.5 Convergence of the Method and Numerical Errors

We now prove the convergence of the Monte Carlo method described by Eqs. (17) and (21) through the following theorem:

Theorem 1

Let m be any positive integer. For each \(k\in \{0,\ldots ,m\}\), let \((\varvec{\xi }^{(k)}(s))_{s \ge 1}\) be a collection of i.i.d. vector-valued random variables in \(\mathbb {R}^{n^2}\) defined as \(\varvec{\xi }^{(k)}(s)=\zeta _{k + 2} \Omega _{\ell _0 \ell _k}^{(k)} {{\,\textrm{vec}\,}}(C_{X_0(s)} R_{X_k(s)})\), and let \(V(s)=\sum _{k = 0}^m \varvec{\xi }^{(k)}(s)\). Let \(\psi = \sum _{s = 1}^{N_s} V(s)\), \(\mu =\mathbb {E}[V(s)]\), \(\alpha =\max _{i}\{(\sum _{j = 1}^n |a_{i j}|)^2\}<1\), and \(|\zeta _{k + 1}|<|\zeta _{k}|\) for k sufficiently large. Then

$$\begin{aligned} \lim _{N_s\rightarrow \infty } \frac{\psi -N_s \mu }{\sqrt{N_s}} \end{aligned}$$
(26)

converges in distribution to a random vector distributed according to a multivariate normal distribution \(N[0,\mathbf {\Theta }]\). Here \(\mathbf {\Theta }=(\theta _{ij})\) is the covariance matrix, with \(\theta _{ij}=Cov(v_{i}(s),v_{j}(s))\), and \(v_i(s)\) the i-th component of the random vector V(s).

Proof

This proof is based on the proof of Theorem 3.4 in [58], conveniently modified for the current numerical method. Assuming that all random walks are independently generated, then

$$\begin{aligned} \textrm{Var}(v_i(s))=\sum _{k = 0}^m \textrm{Var}(\xi _i^{(k)}(s))\le \sum _{k = 0}^m \mathbb {E}[(\xi _i^{(k)}(s))^2]\le \sum _{k = 0}^\infty \mathbb {E}[(\xi _i^{(k)}(s))^2]. \end{aligned}$$
(27)

Here, \(\xi _i^{(k)}(s)\) denotes the i-th component of the random vector \(\varvec{\xi }^{(k)}(s)\). To compute the expected value \(\mathbb {E}[(\xi _i^{(k)}(s))^2]\), we have to enumerate all the different transitions that occurred between a given initial state \(\ell _0\) and an arbitrary final state \(\ell _k\) of the Markov chain in k steps, along with the corresponding probabilities. This yields,

$$\begin{aligned} \mathbb {E}[(\xi _i^{(k)}(s))^2]=\zeta _{k + 2}^2 \sum _{i_1 = 1}^n\cdots \sum _{i_{k} = 1}^n t_{\ell _0 i_1} t_{i_1 i_2} \cdots t_{i_{k-1} i_{k}}\frac{1}{p_{\ell _0}^2} (\Omega _{\ell _0 i_k}^{(k)})^2 (g_i^{(i_k)})^2 \end{aligned}$$
(28)

where \(\textbf{g}^{(i_k)}={{\,\textrm{vec}\,}}(C_{\ell _0} R_{i_k})\) is the vector obtained after vectorizing the matrix \(C_{\ell _0} R_{i_k}\) with \(g_i^{(i_k)}\) as the i-th component. From Eq. (12), it follows that

$$\begin{aligned} \mathbb {E}[(\xi _i^{(k)}(s))^2]&= \zeta _{k + 2}^2 \sum _{i_1 = 1}^n\sum _{i_2 = 1}^n\cdots \sum _{i_{k} = 1}^n\frac{1}{p_{\ell _0}^2} \frac{(a_{\ell _0 i_1} a_{i_1 i_2} \cdots a_{i_{k-1} i_{k}})^2}{t_{\ell _0 i_1} t_{i_1 i_2} \cdots t_{i_{k-1} i_{k}}} (g_i^{(i_k)})^2 \nonumber \\&=\frac{\zeta _{k + 2}^2}{p_{\ell _0}^2}\sum _{i_1 = 1}^n \frac{a_{\ell _0 i_1}^2}{t_{\ell _0 i_{1}}} \sum _{i_2 = 1}^n \frac{a_{i_{1} i_2}^2}{t_{i_{1} i_{2}}} \cdots \sum _{i_{k} = 1}^n \frac{a_{i_{k-1} i_k}^2}{t_{i_{k-1} i_{k}}}(g_i^{(i_k)})^2. \end{aligned}$$
(29)

Note that \(\sum _{j = 1}^n \frac{a_{i j}^2}{t_{i j}}=(\sum _{j = 1}^n |a_{i j}|)^2\) from Eq. (10), then it holds

$$\begin{aligned} \mathbb {E}[(\xi _i^{(k)}(s))^2] \le \frac{\zeta _{k + 2}^2}{p_{\ell _0}^2} \alpha ^k \beta . \end{aligned}$$
(30)

Here \(\beta =\max _{j}\{(g_i^{(j)})^2\}\). Since \(\zeta _{k+2}\le 1,\forall k\) for any matrix function of interest, from Eq. (27), we have

$$\begin{aligned} \textrm{Var}(v_i(s))\le \sum _{k = 0}^\infty \mathbb {E}[(\xi _i^{(k)}(s))^2]\le \frac{\beta }{p_{\ell _0}^2} \sum _{k = 0}^\infty |\zeta _{k + 2}| \alpha ^k \le \frac{\beta }{\alpha ^2 p_{\ell _0}^2} \sum _{j = 0}^\infty |\zeta _{j}| \alpha ^j. \end{aligned}$$
(31)

Since \(\lim _{k\rightarrow \infty } \alpha \frac{|\zeta _{k + 1}|}{|\zeta _{k}|}< 1\), it holds that the series \(\sum _{j = 0}^\infty |\zeta _{j}| \alpha ^j\) converges and therefore the variance \(\textrm{Var}(v_i(s))\) is bounded.

Note, however, that for the specific case of the exponential function, \(|\zeta _{k + 1}|/|\zeta _{k}|\!=\!1/(k\!+\!1)\), and therefore any value of \(\alpha \) is allowed to ensure convergence of the series. In contrast, for the inverse function \(\alpha < 1\) is strictly mandatory.

Since the variance is finite, the Central Limit Theorem for vector-valued random variables (see [56] e.g.) guarantees that

$$\begin{aligned} \frac{\psi -N_s \mu }{\sqrt{N_s}} \rightarrow ^{d} N[0,\mathbf {\Theta }]. \end{aligned}$$
(32)

\(\square \)

Therefore, for a finite sample size \(N_s\), replacing the expected value in (13) by the arithmetic mean introduces an error which is statistical in nature and known to be distributed according to a normal distribution. From Theorem 1, the standard error of this mean \(\varepsilon \) can be readily estimated as \(\sigma ^2/\sqrt{N_s}\), where \(\sigma \) denotes the corresponding standard deviation.

4 Numerical Examples

Table 1 Numerical examples for evaluating the algorithms. Here, \(k = 10^3\), \(M = 10^6\) and \(B = 10^9\)

To illustrate the applicability of our method, we compute the subgraph centrality and total communicability of several complex networks using the matrix exponential \(e^{\gamma \textbf{A}}\) with \(\gamma \in [0, 1]\). Due to the random nature of the Monte Carlo algorithms, all results reported in this section correspond to the mean value of 10 runs of the algorithm using different random seeds.

All numerical simulations were executed on a commodity server with an AMD EPYC 9554P 64C 3.75 GHz and 256 GB of RAM, running Fedora 38. All randomized algorithms were implemented in C++ using OpenMP. The code was compiled with AMD AOCC v4.0 with the -O3 and -march=znver4 flags. Our implementation uses the PCG64DXSM [71] random number generator.

The algorithms were tested using two synthetic graphs—smallworld and kronecker—as well as a set of networks extracted from real-world data, which are described in Table 1. Note that, before calculating the subgraph centrality and total communicability of directed graphs, their adjacency matrix \(\textbf{A}\) must be symmetrized as

$$\begin{aligned} \textbf{B} = \begin{bmatrix} 0 & \quad \textbf{A} \\ \textbf{A}^\intercal & \quad 0 \end{bmatrix} \end{aligned}$$
(33)

in order to split apart the outgoing and incoming edges of the graph [17]. We also remove all duplicated edges, loops and disconnected nodes from the kronecker graph generated by the Graph500 code [47].
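For reference, the symmetrization in (33) can be written in a couple of lines with SciPy (a sketch with our own helper name):

```python
# Sketch of Eq. (33) for a directed graph stored as a sparse matrix.
import scipy.sparse as sp

def symmetrize(A):
    """Return B = [[0, A], [A^T, 0]], separating outgoing and incoming edges."""
    A = sp.csr_matrix(A)
    return sp.bmat([[None, A], [A.T, None]], format="csr")
```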

4.1 Numerical Errors and Accuracy

Fig. 1
figure 1

Relative \(\ell _\infty \) error as a function of the number of random walks \(N_s\) for \(W_c = 10^{-6}\). The degree \(\gamma \) of the polynomial of the fitting curve is indicated as \(\varepsilon _r \sim N_s^\gamma \)

Figure 1 shows the relative \(\ell _\infty \) error of our method as a function of \(N_s\). Recall from Sect. 3.2 that the algorithm assigns a priori \(N_j \sim \Vert C_j \Vert _2\) random walks to each node j of the graph. Therefore, if \(N_s\) is too low, very few random walks will be assigned to a node i with a low norm, such that its centrality score is basically approximated by just the first two terms in the series, i.e., \(SC(i) \approx 1 + a_{ii}\) and \(TC(i) \approx 1 + (\textbf{A} \textbf{1})_i\). For this reason, the relative errors are highly dependent on the structure of the graph. In particular, the nodes in the smallworld network have very similar probabilities, and therefore it can happen that for \(N_s < n\) some nodes will have no chance of being randomly chosen. Only when the sample size is sufficiently large can the algorithm properly estimate the centrality of every node. In this scenario, the numerical error of the algorithm scales as \(O(N_s^{-0.5})\), similar to other probabilistic methods. This relation is confirmed numerically by the trend lines in Fig. 1, which have a slope of approximately \(-0.5\) in logarithmic scale. Table 2 shows the relative \(\ell _\infty \) error for a fixed number of samples. The table also shows the standard measurement error obtained after several independent runs.

Table 2 Relative \(\ell _\infty \) error for \(N_s = 10^{8}\) and \(W_c = 10^{-6}\) as well as the standard measurement error obtained after several independent runs
Fig. 2
figure 2

Relative \(\ell _\infty \) error as a function of the weight cutoff \(W_c\) for \(N_s = 10^{8}\)

In Fig. 2 we show the relative \(\ell _\infty \) error as a function of \(W_c\). The value of \(W_c\) controls the length of the random walks in terms of the number of steps, which is related to the number of terms of the power series expansion. When the value of \(W_c\) is large, the algorithm stops the random walk generation too early, leading to large errors. On the other hand, when the value of \(W_c\) is small, the algorithm continues generating the random walk for more steps than necessary. Note that this increases the computational cost without necessarily improving the accuracy of the method. In fact, we also have to consider the statistical error, which depends on the number of generated random walks. According to Fig. 2, the optimal value of \(W_c\) is around \(10^{-6}\) for these networks.

Fig. 3
figure 3

Relative \(\ell _\infty \) error as a function of the number of nodes n of the graph considering \(W_c = 10^{-6}\) and \(N_s = 10^{8}\). The degree \(\gamma \) of the polynomial of the fitting curve is indicated as \(\varepsilon _r \sim n^\gamma \)

Figure 3 shows how the relative \(\ell _\infty \) error grows with the size of the smallworld and kronecker networks. The smallworld network starts as a ring lattice with n nodes, each connected to 10 neighbours. The algorithm then rewires each edge in the lattice with a \(10\%\) chance, i.e., the edge \((i, j)\) is replaced by \((i, k)\) where k is chosen at random from all the possible nodes in the network such that there are no loops or duplicated edges. As a result, the random walks have very similar weights \(W^{(k)}\) independently of the sequence of nodes visited. In other words, the norm of the covariance matrix \(\sigma ^2 = \Vert \mathbf {\Theta }\Vert _\infty \) is very low. Considering that \(N_s\) is fixed, there are fewer random walks to estimate the centrality score of each node as the graph size increases, which reduces the precision of the algorithm by a factor of \(\sqrt{n}\), as shown in Fig. 3b, c. This is in line with the theoretical results from Sect. 3.5, where \(\varepsilon _r \sim \sigma ^2 N_s^{-0.5}\).

In contrast, the nodes in the kronecker graph are organized hierarchically. At the top level, there is a single node acting as the central hub for the entire graph. As we move down the hierarchy, there are more nodes per level, but they have fewer connections. The number of levels in the hierarchy as well as the number of nodes and connections at each level are determined by the size of the graph. Therefore, the weight of the random walks can vary greatly depending on which nodes are visited and their position in the hierarchy. Larger graphs have a higher covariance norm \(\sigma ^2\) due to a wider degree difference between nodes.

4.2 Comparison with Other Methods

There are a few algorithms available in the literature for computing the matrix exponential. Perhaps the best-known scheme is the expm routine from MATLAB [5, 53, 54]. The method first scales the matrix \(\textbf{A}\) by a power of 2 to reduce its norm to order 1, calculates a Padé approximant of the matrix exponential, and then repeatedly squares the result to undo the initial scaling. For a generic \(n \times n\) matrix, expm requires \(O(n^3)\) arithmetic operations and an additional \(O(n^2)\) space in memory. Note that expm calculates the entire matrix \(e^{\gamma \textbf{A}}\).

To rank the nodes using the subgraph centrality, we only need to calculate the diagonal entries of \(e^{\gamma \textbf{A}}\), not the complete matrix. Methods for estimating the individual entries of the matrix function have been proposed by Golub, Meurant and others [14, 42, 46] and are based on Gaussian quadrature rules and the Lanczos algorithm. They require \(O(n)\) operations to determine each diagonal entry, resulting in a total cost of \(O(n^2)\) to calculate the subgraph centrality for all nodes. In practice, this method may suffer from numerical breakdowns when \(\textbf{A}\) is large and sparse [12, 42]. For this reason, it is often restricted to estimating only the k most important nodes in the graph.

Likewise, the total communicability only requires the action of \(f(\textbf{A})\) over a vector, setting \(\textbf{v} = \textbf{1}\), which can be computed efficiently using either a polynomial or rational Krylov method [3, 33, 51, 52]. These methods consist of generating a Krylov basis from the input matrix and then evaluating the function on the projected matrix through some direct method, such as expm, as sketched below. Assuming a sparse matrix with \(N_{nz}\) nonzeros and a Krylov basis with m vectors, the computational cost is \(O(m N_{nz})\). In particular, we compared our method against the restarted polynomial Krylov method [3, 33] from the funm_kryl toolbox [50].
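For context, the sketch below shows a generic (unrestarted) polynomial Krylov approximation of \(f(\textbf{A})\textbf{v}\) of this kind; it is not the restarted funm_kryl algorithm itself, and the function name and default basis size are ours.

```python
# Generic polynomial Krylov (Arnoldi/FOM-type) sketch for f(A) v.
import numpy as np
from scipy.linalg import expm

def krylov_fA_v(A, v, m=30, f=expm):
    """Approximate f(A) v ~ ||v|| * V_m @ f(H_m) @ e_1 with an m-dimensional Arnoldi basis."""
    n = len(v)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(v)
    V[:, 0] = v / beta
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                   # modified Gram-Schmidt orthogonalization
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:                  # happy breakdown: exact answer in this subspace
            m = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    Fm = f(H[:m, :m])                            # evaluate f on the small projected matrix
    return beta * V[:, :m] @ Fm[:, 0]
```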

While writing this article, Güttel and Schweitzer published a paper [49] proposing two new randomized algorithms—sFOM and sGMRES—for estimating \(f(\textbf{A})\textbf{v}\). Here, we focus on sFOM since sGMRES works best with Stieltjes functions and requires a numerical quadrature rule for all other functions. sFOM first creates a random sketch of the matrix \(\textbf{A}\) and then uses an incomplete Arnoldi decomposition to generate a non-orthogonal Krylov basis from this sketch. However, the basis may be ill-conditioned, and thus, to stabilize the method, a thin QR decomposition (also called whitening [67]) of the basis must be computed before evaluating the matrix function on the projected matrix. The computational cost of sFOM is \(O(N_{nz} m \log {m} + m^3)\).

Another preprint, by Cortinovis, Kressner and Nakatsukasa [25], was also published recently, proposing a different randomization strategy for the Krylov method. They propose an Arnoldi-like decomposition to iteratively build the non-orthogonal Krylov basis \(\textbf{V}_m\) using only random sketches of the basis vectors. Again, the method may apply whitening [67] to improve the condition number of the basis. Afterwards, the program solves a least-squares problem to obtain the projected matrix. We will denote this algorithm as rand_kryl; it has an overall computational cost of \(O(N_{nz} m^2 + m^3)\).

MC denotes the Monte Carlo method adapted from [18], as described in Algorithm 1. Similar to our randomized algorithm, the length of the random walks depends on the weight cutoff \(W_c\). It has a computational cost of \(O(m N_s)\), where m denotes the average number of steps in a random walk, and does not require additional space in memory. This method can be modified to calculate \(f(\textbf{A})\textbf{v}\) or \(f(\textbf{A})_{ii}\) instead of the full matrix function. Note that for the latter, the method still computes the full \(f(\textbf{A})\), but discards all the off-diagonal entries.

Fig. 4
figure 4

Comparison between different algorithms when calculating the subgraph centrality for real networks and \(\gamma = 10^{-3}\). Here, we consider the centrality scores generated by our algorithm with \(N_s = 10^{11}\) as reference

Figure 4 compares the serial execution time and accuracy of the different methods when computing the subgraph centrality. The graphs are sorted according to the number of nodes. The similarity between two node rankings is measured using the Pearson correlation coefficient [13]. Here, \(cc_1\) denotes the correlation coefficient computed over the top \(1\%\) of nodes between the reference list and the ranking obtained by the algorithm. Note that, if two or more nodes have similar centrality scores, numerical deviations can alter their ranking order, lowering the correlation coefficient. Nevertheless, these nodes have similar importance within the network, and thus, their order in the ranking may not be relevant to understanding the dynamics of the graph.

Although expm reaches machine precision, it cannot be used for large graphs due to its hefty computational cost and cubic scaling. In fact, the cond-mat graph with 40k nodes is already too large for expm and cannot be processed in a reasonable amount of time. MC requires a very large number of random walks to estimate the subgraph centrality as it only updates a single entry of the matrix exponential at a time, and it is more likely for this entry to be outside the diagonal if the matrix is very large. For this reason, the accuracy of MC is quite poor even with a large number of random walks. Both RandFunm and RandFunmDiag have the same accuracy and correlation since the core algorithm is the same. However, RandFunmDiag does not require the computation of the full matrix product at the end of the algorithm, resulting in a speedup between 3 and 24 over RandFunm. Moreover, the full matrix exponential of the twitch and stanford graphs is too large to be represented in memory, and thus, their subgraph centrality can only be calculated by RandFunmDiag. The randomized algorithms also show a very high correlation for the top-\(1\%\) nodes in the ranking compared with the reference list.

Fig. 5
figure 5

Comparison between different algorithms when computing the total communicability for real networks and \(\gamma = 10^{-5}\). Here, we consider the centrality scores generated by expmv [6] as reference

Figure 5 compares RandFunmAction against all Krylov-based methods in terms of serial execution time and accuracy. The tolerance of funm_kryl was set to \(10^{-8}\) and the size of the Krylov basis of sFOM and rand_kryl was set to 4, such that all algorithms have comparable precision. We choose a small value of \(\gamma \) to avoid overflow, as we are working with large positive matrices, and consequently, all algorithms converge very fast to the target precision.

The stopping criterion of funm_kryl is well known to be pessimistic [52], resulting in much higher precision than the target at the cost of longer execution times. In some networks (twitch, orkut and twitter), sFOM and rand_kryl outperformed funm_kryl mainly due to the smaller basis, while in others (uk-2005), the additional cost associated with the sketching, whitening, least-squares QR and other operations leads to significant slowdowns. Considering the accuracy difference, the randomization in the Krylov method does not seem to be very effective when the basis is relatively small. RandFunmAction shows the best performance among all algorithms, in particular for the twitter network, being \(3.8\times \) faster than funm_kryl, while sFOM and rand_kryl are \(2.7\times \) and \(1.7\times \) faster, respectively.

For complex networks, it suffices for the algorithm to be accurate enough to differentiate the centrality scores of the nodes in the graph; there is no benefit in having higher accuracy. Indeed, the ranking produced by RandFunmAction has a correlation greater than 0.95 for the top \(1\%\) nodes despite having a lower accuracy than the others. The only exception is the twitter network. As a massive social network, there is no clear structure or hierarchy in the graph, such that many nodes have similar centrality scores and small numerical variations can drastically change the rank order. In comparison, uk-2005 is an equally large web graph that follows a clearer structure with well-defined hubs and authorities, and thus, it is less susceptible to noise. For this reason, the correlation for the twitter network is much lower than for the other graphs and requires greater precision to differentiate the nodes. Note that the top-\(1\%\) of the ranking contains almost 1 million nodes in both the twitter and uk-2005 networks, with a wide range of centrality scores. If we consider only the top-\(0.1\%\), the correlation of RandFunmAction for the twitter network increases to 0.79.

Fig. 6
figure 6

Elapsed time for computing the subgraph centrality as a function of the number of nodes n for a fixed accuracy \(\varepsilon _r\) and \(\gamma = 10^{-3}\)

Fig. 7
figure 7

Elapsed time for computing the total communicability as a function of the number of nodes n for a fixed accuracy \(\varepsilon _r\) and \(\gamma = 10^{-5}\)

Figures 6 and 7 show the elapsed serial time as a function of the number of nodes for the kronecker and smallworld networks. The computational cost of the expm algorithm is of the order of \(O(n^3)\), regardless of the sparsity or the distribution of the nonzeros of the matrix, since it was originally proposed for dense matrices. The MC algorithm was too slow for the prescribed accuracy, and thus, it was not included in the graph. When \(N_{nz} \sim n\), the computational cost of the RandFunm and RandFunmDiag algorithms becomes of order \(O(n^2)\). This is because the cost of computing the matrix product is higher than that of generating the random walks. Note that this cost is similar to other algorithms [42] for estimating the diagonal entries of the matrix. However, the main advantage here is the smaller proportionality constant and the capability to compute the subgraph centrality for sparse and large matrices without worrying about a numerical breakdown [42]. Again, the RandFunmDiag algorithm is significantly faster than the RandFunm algorithm as it only requires a partial evaluation of the matrix product at the end of the algorithm. All Krylov-based methods scale linearly with n as they rely on matrix–vector multiplications. Similar to the other Monte Carlo methods, RandFunmAction spends more time computing the matrix–vector product than generating the random walks, and thus also scales linearly with n.

4.3 Single Entry

Table 3 Relative error for calculating the total communicability of a single node \(i = \mathop {\mathrm {arg\,max}}\limits _j \sum _k |a_{jk}|\) considering \(N_s = 10^8\), \(W_c = 10^{-6}\) and \(\gamma = 10^{-5}\)

One of the main advantages of Monte Carlo algorithms is the ability to calculate a single entry of the solution without requiring the computation of the full solution. Table 3 shows the relative error when calculating the total communicability of the node with the highest degree. Both RandFunmAction and MC were modified to calculate a single entry of the solution as efficiently as possible. Due to the ability to sample entire rows and columns, RandFunmAction produces a much better approximation for \((f(\textbf{A})\textbf{v})_i\) than MC for the same number of random walks, independently of the network.

4.4 Parallel Performance

Fig. 8
figure 8

Strong scaling for the RandFunmDiag algorithm considering \(W_c = 10^{-8}\) and \(\gamma = 10^{-3}\)

Fig. 9
figure 9

Strong scaling for the RandFunmAction algorithm considering \(W_c = 10^{-8}\) and \(\gamma = 10^{-5}\)

Parallelizing our randomized algorithm is fairly straightforward. Multiple rows of \(\textbf{Q}\) can be computed at the same time in the RandFunmDiag algorithm, yet the diagonal entries must be updated atomically to avoid data races. Likewise, the RandFunmAction algorithm can compute the vector \(\textbf{q}\) as well as the final product \(\textbf{A} \textbf{q}\) completely in parallel.

Figures 8 and 9 show the strong scaling for the parallel implementation of the RandFunmDiag and RandFunmAction algorithms, respectively. The scalability of both algorithms is excellent, attaining more than \(85\%\) in efficiency for most networks when using 64 cores. In particular, the parallel code was able to achieve near-perfect scaling for the smallworld network due to its low degree per node and an almost uniform structure. This leads to a more efficient usage of the cache as well as an even distribution of load across the processors.

In most networks, the random walks are not distributed equally across the nodes, such that some rows of \(\textbf{Q}\) and entries of \(\textbf{q}\) take longer to compute than others. To address this load imbalance, the program dynamically distributes the vector \(\textbf{q}\) and the matrix \(\textbf{Q}\) over the CPU cores. This solution was very effective for most networks, significantly improving the performance of the program. Yet, the CPU may still be underutilized at the end of the code if the graph is very unbalanced. This is the case for directed graphs due to the symmetrization of the adjacency matrix shown in (33).

Another limiting factor is the latency and bandwidth of the main memory. Most operations with large and sparse matrices are well known to be memory-bound as they cannot utilize the cache hierarchy effectively, while requiring additional logic and memory accesses to handle the sparse storage format. In fact, the funm_kryl algorithm shows no benefit when running in a multithreaded environment since it relies on sparse matrix–vector products and the majority of the code is written in MATLAB. In contrast, our randomized algorithm only needs to compute two sparse matrix products: one at the beginning and another at the end of the algorithm. This still affects the scalability of the method when working with massive networks, such as twitter and uk-2005. Even under these conditions, the program was able to obtain significant speedups when using 64 cores, achieving \(60\%\) efficiency for twitter and \(70\%\) for uk-2005.

4.5 Katz Centrality

One of the most well-known centrality measures based on matrix functions is Katz’s Centrality (KC) [55, 60]. It is defined as \((\textbf{I} - \gamma \textbf{A}) \textbf{x} = \textbf{1}\) with \(KC(i) = x_i\) as the centrality score of node i. Here, \(\gamma \) is called the attenuation factor and should lie between 0 and \(\rho (\textbf{A})^{-1}\), where \(\rho (\textbf{A})\) is the spectral radius of \(\textbf{A}\) [60]. Different information can be extracted from the network by changing the value of \(\gamma \) [16]. For instance, as \(\gamma \) tends to \(\rho (\textbf{A})^{-1}\), Katz’s Centrality approaches the eigenvector centrality [22, 23]. As \(\gamma \) tends to 0, it converges to the degree centrality.

There are several ways to solve the linear system \((\textbf{I} - \gamma \textbf{A}) \textbf{x} = \textbf{1}\). Direct solvers, such as MUMPS [7, 8] or Intel Pardiso [74], first compute an \(\textbf{LU}\) or similar factorization and then solve the linear system using backward/forward substitution. However, the factorization is a very costly procedure, scaling with \(O(n^3)\), while also requiring additional space to store the matrices \(\textbf{L}\) and \(\textbf{U}\). On the other hand, sparse iterative solvers, such as Conjugate Gradient, GMRES, and BiCGSTAB, can converge very quickly to the solution, especially when a good preconditioner is available. Last, but not least, Monte Carlo methods solve the linear system through the truncated series

$$\begin{aligned} (\textbf{I} - \gamma \textbf{A})^{-1}\textbf{v} \approx \sum _{k = 0}^m{(\gamma \textbf{A})^k\textbf{v}} \end{aligned}$$

with \(\textbf{v} = \textbf{1}\). However, this series only converges when \(\rho (\gamma \textbf{A}) < 1\). Moreover, if \(\rho (\gamma \textbf{A})\) is close to 1, the convergence will be very slow, and thus many terms of the expansion must be computed to reach a reasonable accuracy.
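For reference, when \(\gamma < \rho (\textbf{A})^{-1}\) and \(\textbf{A}\) is symmetric, the system matrix \(\textbf{I} - \gamma \textbf{A}\) is symmetric positive definite, so Katz’s Centrality can be obtained with a few CG iterations. The following minimal SciPy sketch (our own helper; the attenuation value 0.85 is only illustrative) uses the Gershgorin bound in place of the spectral radius, as done in the experiments below.

```python
# Sketch: Katz's Centrality by solving (I - gamma*A) x = 1 with Conjugate Gradient.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def katz_centrality(A, alpha=0.85):
    A = sp.csr_matrix(A).astype(float)
    n = A.shape[0]
    gamma = alpha / abs(A).sum(axis=1).max()     # Gershgorin bound: rho(A) <= max row sum
    M = sp.identity(n, format="csr") - gamma * A # SPD for symmetric A and gamma < 1/rho(A)
    x, info = cg(M, np.ones(n))                  # info == 0 indicates convergence
    return x
```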

Fig. 10
figure 10

Comparison among the different algorithms when computing the Katz centrality for \(\gamma = 0.85 \, (\max _i{\sum _k |a_{ik}|})^{-1}\) and \(N_s = 10^9\). The execution time is measured using all 64 cores

Figure 10 compares RandFunmAction against the original Monte Carlo method (MC) and a simple Conjugate Gradient algorithm (CG). Here, the error \(\varepsilon \) of the CG is equal to the residual norm \(\Vert \textbf{1} - (\textbf{I} - \gamma \textbf{A})\hat{\textbf{x}}\Vert _2\), while for RandFunmAction and MC, it corresponds to the \(\ell _\infty \) error using the results from the CG as reference. Note that we avoid the costly computation of the eigenvalue by leveraging Gershgorin’s theorem [45], i.e., \(\rho (\gamma \textbf{A}) \le \max _i{\sum _k |\gamma a_{ik}|}\). Again, RandFunmAction is faster and more accurate than MC for the same number of random walks \(N_s\), yet it is still not as good as the sparse iterative solver. Even without preconditioning, CG converges extremely quickly to the solution due to the small value of \(\gamma \). It is possible to enhance the performance of our method by combining it with a Richardson iteration as shown in [18]. This is left for future work.

We want to emphasize that Monte Carlo methods are better suited to evaluating matrix functions other than the matrix inverse, for which classical methods are either too expensive or not even applicable.

5 Conclusion

This paper proposes a novel stochastic algorithm that randomly samples rows and columns of the matrix to approximate the different powers of the corresponding power series expansion. It can evaluate any matrix function by using the corresponding coefficients of the series. The algorithm can be conveniently modified to compute either \(f(\textbf{A}) \textbf{v}\) or the diagonal of \(f(\textbf{A})\) without the need to compute the entire matrix function. As a way to test the applicability of our method, we computed the subgraph centrality and total communicability of several large networks using the matrix exponential. Within this context, the stochastic algorithm has proven to be particularly effective, outperforming the competing methods. Our method is also highly scalable in a multithreaded environment, showing remarkable efficiency when using up to 64 cores.

In this paper, we primarily focused on the analysis of complex networks, as it bears a very close relation to the method itself, but the algorithm can be applied to any scientific problem that can be expressed in terms of matrix functions, providing a quick way to estimate the solution with reasonable accuracy.