1 Introduction

The Laplacian paradigm has emerged as one of the cornerstones of modern algorithmic graph theory. Integrating techniques from combinatorial optimization with powerful machinery from numerical linear algebra, it was originally pioneered by Spielman and Teng [1] who established the first nearly-linear time solvers for a (linear) Laplacian system. Thereafter, there has been a considerable amount of interest in providing simpler and more efficient solvers [2,3,4]. Indeed, this framework has led to some state of the art algorithms for a wide range of fundamental graph-theoretic problems; e.g., see [5,6,7,8,9,10], and references therein. In the distributed setting, a major breakthrough was recently made by Forster et al. [11]. In particular, the authors developed a distributed algorithm that solves any Laplacian system on an n-node graph after \(n^{o(1)} (\sqrt{n} + D) \log (1/\varepsilon )\) rounds of the standard \({{\,\textrm{CONGEST}\,}}\) model, where D represents the hop-diameter of the underlying network and \(\varepsilon > 0\) is the error of the solver. Moreover, they showed that their algorithm is existentially optimal, up to the \(n^{o(1)}\) factor, establishing a lower bound of \(\widetilde{\Omega }(\sqrt{n} +D)\) rounds via a reduction from the \(s-t\) connectivity problem [12].

This existential lower bound in the \({{\,\textrm{CONGEST}\,}}\) model of distributed computing should hardly come as any surprise. Indeed, it is well-known by now that a remarkably wide range of global optimization problems, including minimum spanning tree (MST), minimum cut (Min-Cut), maximum flow, and single-source shortest paths (SSSP), require \(\widetilde{\Omega }(\sqrt{n} + D)\) roundsFootnote 1 [12,13,14]. The same limitation generally applies to any non-trivial approximation and even under randomization. Nonetheless, these lower bounds are constructed on some pathological graph instances that arguably do not occur in practice. This begs the question: Can we obtain more refined performance guarantees based on the underlying topology of the communication network? The framework of low-congestion shortcuts, introduced by Ghaffari and Haeupler [15], demonstrated that bypassing the notorious \(\Omega (\sqrt{n})\) lower bound is possible: MST and Min-Cut on planar graphs can be solved in \(\widetilde{O}(D)\) rounds. This is crucial, given that in many graphs of practical significance the diameter is remarkably small; e.g., \(D = {{\,\textrm{polylog}\,}}(n)\) (as is folklore, this holds for most social networks), implying exponential improvements over generic algorithms used for general graphs. In the context of the distributed Laplacian paradigm, we raise the following question:

Is there a faster distributed Laplacian solver under “non-worst-case” families of graphs in the \({{\,\textrm{CONGEST}\,}}\) model?

The only known technique in distributed computing for designing algorithms that go below the \(\sqrt{n}\)-bound is the low-congestion shortcut framework of Ghaffari and Haeupler [15], and the large ecosystem of tools built around it [16,17,18,19,20,21,22]. However, the “\(\rho \)-congested minor” primitive introduced and extensively used in the novel distributed Laplacian solver [11] is out of reach from the current set of tools available in the low-congestion shortcut framework. We address this issue by introducing an analogous primitive called \(\rho \)-congested part-wise aggregation, which greatly simplifies the interface used by Forster et al. [11]. We then extend the low-congestion shortcut framework with new techniques that enables it to near-optimally solve this primitive: we provide both an algorithm that utilizes the very recent hop-constrained expander decompositions for shortcut construction [22] to solve the primitive in general graphs with a linear dependence on \(\rho \), as well as a very simple algorithm with a quadratic \(\rho \)-dependence for bounded-treewidth graphs. Finally, we settle our original question in the positive by establishing that our new primitive can be readily used to accelerate the distributed Laplacian solver for non-worst-case topologies.

Specifically, we show that our new techniques are sufficient to lift the existentially optimal algorithm [11] to a universally optimal algorithm—modulo \(n^{o(1)}\) factor inherent in the prior approach—for distributedly solving a Laplacian system, meaning that, for any topology, our algorithm is essentially as fast as possible. In other words, for any graph, our algorithm almost matches the best possible (correct) algorithm for that graph. This result is unconditional in essentially all settings of interest (see Theorem 2 for details), but relies on conjectured improvements of current state-of-the-art constructions of low-congestion shortcuts to achieve unqualified universal optimality [18]—like all other results in the area.

Furthermore, another concrete way of bypassing the \(\widetilde{\Omega }(\sqrt{n} + D)\) lower bound, besides investigating non-worst-case families of graphs, is by enhancing the local communication network with a limited amount of global power. Indeed, research concerning hybrid networks was recently initiated in the realm of distributed algorithms [23], although networks combining different communication modes have already found numerous applications in real-life computing systems; as such, hybrid networks have been intensely studied in other areas of distributed computing (see [24,25,26], and references therein). In this paper, we will enhance the standard \({{\,\textrm{CONGEST}\,}}\) model with the recently introduced node-capacitated clique (henceforth \({{\,\mathrm{\textsc {NCC}}\,}}\)) [27]. The latter model enables all-to-all communication, but with severe capacity restrictions for every node. The integration of these models will be referred to as the \({{\,\textrm{HYBRID}\,}}\) model for the rest of this work. This leads to the following central question:

Is there a faster distributed Laplacian solver in the \({{\,\textrm{HYBRID}\,}}\) model?

Our paper essentially settles this question by showing the same \(\rho \)-congested part-wise aggregation primitive can be efficiently solved in \(\widetilde{O}(\rho )\) rounds of \({{\,\mathrm{\textsc {NCC}}\,}}\), implying an almost optimal \(n^{o(1)}\)-round distributed algorithm for solving Laplacian systems in the \({{\,\textrm{HYBRID}\,}}\) model. A conceptual contribution of our approach is that we treat both \({{\,\textrm{CONGEST}\,}}\), Supported-\({{\,\textrm{CONGEST}\,}}\), and \({{\,\textrm{HYBRID}\,}}\) in a unified way through the lens of the low-congestion shortcut framework, by designing our algorithm using high-level primitives and leaving the model-specific translations to the framework itself. A similar unified view of PRAM (i.e., parallel) and \({{\,\textrm{CONGEST}\,}}\) (i.e., distributed) graph algorithms through the same lens has led to very recent breakthroughs on long-standing open problems for both of these settings [28].

1.1 Overview of our contributions and techniques

The unifying thread and the main technical ingredient of our (almost) universally optimal distributed Laplacian solvers is a new fundamental communication primitive referred to as the congested part-wise aggregation problem. Specifically, we develop near-optimal algorithms for solving this problem in the (Supported-)CONGEST and the NCC model (Sect. 3), and then we utilize this primitive to develop almost universally optimal Laplacian solvers.

1.1.1 The congested part-wise aggregation problem

To introduce the congested part-wise aggregation problem, let us first give some basic background. The aforementioned Ghaffari-Haeupler framework of low-congestion shortcuts revolves around the so-called part-wise aggregation problem posed as follows: “The graph is partitioned into disjoint and individually-connected parts, and we need to compute some simple aggregate function for each part, e.g., the minimum of the values held by the nodes in a given part” [15] (see Definition 1 for a formal definition). Importantly, it has been shown that this primitive can be solved efficiently in structured topologies and that many problems (including the MST, shortest path, min-cut, etc.) reduce to a small number of calls to a part-wise aggregation oracle, leading to universally optimal algorithms. Unfortunately, it is not clear how to reduce solving a Laplacian system to (a small number of) part-wise aggregation calls; in this paper, we primarily address this issue.

Our first technical contribution is to extend the framework of low-congestion shortcuts by studying a more general primitive: one that incorporates congestion (of the input parts) into the underlying part-wise aggregation instance. More precisely, unlike the standard part-wise aggregation problem, we allow each node to participate in up to \(\rho \in {\mathbb {Z}}_{\ge 1}\) aggregation parts (see Definition 6). We later show that efficient solutions to this primitive leads to efficient distributed Laplacian solvers.

We first remark that a natural strategy for solving congested part-wise aggregation instances does not work: congested instances cannot, in general, be directly reduced to a “small” collection of 1-congested instances, thereby necessitating a more refined approach. To this end, our approach is based on “lifting” the underlying communication network \(\overline{G}\) into its \(\rho \)-layered version \(\widehat{G}_{O(\rho )}\): every edge is replaced with a matching and every node with a \(\rho \)-clique. The importance of this transformation is that, as we show in Lemma 8, the \(\rho \)-congested part-wise aggregation problem can be reduced to a 1-congested instance on the \(\rho \)-layered graph (Sect. 3.1.1). This is first established under the assumption that individual parts correspond to simple paths, and then we extend our results to general parts by following the approach of Haeupler et al. [18]. In light of this reduction, we next focus on solving the 1-congested part-wise aggregation instance on the layered graph.

As a warm-up, we treat graphs with bounded treewidth \({{\,\textrm{tw}\,}}(G)\) (Definition 5). It is known that on a graph G with treewidth \({{\,\textrm{tw}\,}}(G)\), a 1-congested part-wise aggregation instance can be solved in \(\widetilde{O}({{\,\textrm{tw}\,}}(G) D)\) rounds of CONGEST [17]. Keeping this in mind, we first show that the treewidth of the \(\rho \)-layered graph \(\widehat{G}_{\rho }\) can only increase by a factor of \(\rho \) compared to the original graph (Lemma 13). Hence, we can solve 1-congested instances in \(\widehat{G}_{O(\rho )}\) in \(\widetilde{O}(\rho {{\,\textrm{tw}\,}}(\overline{G}) D)\) rounds (when the underlying network is \(\widehat{G}_{O(\rho )}\)), which in turn allows us to solve \(\rho \)-congested instances on \(\overline{G}\) in \(\widetilde{O}(\rho ^2 {{\,\textrm{tw}\,}}(G) D)\) time in G (another \(\rho \) factor is necessary to simulate \(\widehat{G}_{O(\rho )}\) in \(\overline{G}\)). This positive result poses a natural question: can we achieve similar results on graphs with bounded minor density \(\delta (G)\) (Definition 4)? However, the answer to this question is negative: minor density can blow up even for a 2-layered planar graph (see Observation 2), making such a result impossible.

Then, we look at arbitrary graphs G: it is known that 1-congested part-wise aggregation instances can be solved in a number of rounds that is controlled by \({{\,\textrm{SQ}\,}}(G)\), where \({{\,\textrm{SQ}\,}}(G)\) is the shortcut quality of G (a certain graph parameter we formalize in Definition 3). Specifically, it can be solved in \(\widetilde{O}({{\,\textrm{SQ}\,}}(G))\) rounds when the topology is known in advanceFootnote 2 [18] and \({{\,\textrm{poly}\,}}({{\,\textrm{SQ}\,}}(G)) \cdot n^{o(1)}\) in general CONGEST [22]. The shortcut quality parameter is significant since many distributed problems (including the MST, shortest path, Min-Cut, and—as we show later—Laplacian solving) require \(\widetilde{\Omega }({{\,\textrm{SQ}\,}}(G))\) rounds in CONGEST to be solved on G [18]. Therefore, algorithms that have an upper bound close to \({{\,\textrm{SQ}\,}}(G)\) are universally optimal.

With the end goal of solving the 1-congested part-wise aggregations on layered graphs \(\widehat{G}_\rho \) in time controlled by \({{\,\textrm{SQ}\,}}(G)\), our main result establishes that the shortcut quality of the \(\rho \)-layered graph does not increase (modulo polylogarithmic factors) as compared to the original graph (Theorem 15). This has a plethora of important consequences: (1) when \({{\,\textrm{SQ}\,}}(G) \le n^{o(1)}\), we can unconditionally solve \(\rho \)-congested part-wise aggregation instances in \(\rho \cdot n^{o(1)}\) CONGEST rounds using the state-of-the-art shortcut construction [22], and (2) when the topology of G is known (i.e., Supported-CONGEST), there exists a distributed algorithm that solves any \(\rho \)-congested part-wise aggregation problem in \(\rho \cdot \widetilde{O}({{\,\textrm{SQ}\,}}(G))\) rounds [18]. As a consequence of our general result, the shortcut quality of any 2-layered planar graph is \(\widetilde{O}(D)\) since the shortcut quality of a planar graph is \(\widetilde{O}(D)\) [15]. This is perhaps the most natural example of a graph whose minor density is very far from the shortcut quality; the only other example documented in the literature so far is that of expander graphs.

Our proof proceeds by employing alternative characterizations of the shortcut quality in terms of certain communication tasks. Specifically, shortcut quality can be shown to be equal (modulo polylogarithmic factors) to the following two-player max-min game: the first (max) player chooses k sources and k sinks in the graph such that we can find k node-disjoint paths matching the sources with the sinks; then the second (min) player finds the smallest so-called quality Q such that there exist k paths matching the sources with the sinks with the path lengths being at most Q and each edge of the underlying graph supporting at most Q of second player’s paths. This characterization allows us to compare the shortcut quality of \(\widehat{G}_{\rho }\) with \(\overline{G}\) as follows: take the worst-case (first player’s) set of sources and sinks in \(\widehat{G}_\rho \). Project them to \(\overline{G}\) and note they have node congestion \(\rho \) (due to the construction of \(\widehat{G}_{\rho }\)). Then, we show we can decompose (i.e., partition) these set of sources and sinks into \(\widetilde{O}(\rho )\) pairs of sub-sources and sub-sinks that are node-disjointly connectable in G. However, each such set enjoys paths of quality \({{\,\textrm{SQ}\,}}(G)\), hence embedding each such pair in a separate layer of \(\widehat{G}_{\rho }\) shows that the shortcut quality of \({{\,\textrm{SQ}\,}}(\widehat{G}_\rho )\) is at most \(\widetilde{O}({{\,\textrm{SQ}\,}}(\overline{G}))\). Although this general approach improves over our result for treewidth-bounded graphs we previously described, our approach for the latter class of graphs is substantially simpler and more suited in practice.

1.1.2 Almost universally optimal Laplacian solvers

First, we note that any distributed Laplacian solver that always correctly outputs an answer on a fixed graph G must take at least \(\tilde{\Omega }({{\,\textrm{SQ}\,}}(G))\) rounds, giving us a lower bound to compare ourselves with. Our refined lower bound uses the hardness result recently shown by Haeupler et al. [18] for the spanning connected subgraph problem, applicable for any (i.e., non-worst-case) graph G. Specifically, we show that a Laplacian solver can be leveraged to solve the spanning connected subgraph problem, thereby substantially strengthening the lower bound due to Forster et al. [11].

Theorem 1

Consider a graph \(\overline{G}\) with shortcut quality \({{\,\textrm{SQ}\,}}(\overline{G})\). Then, solving a Laplacian system on \(\overline{G}\) with \(\varepsilon \le \frac{1}{2}\) requires \(\widetilde{\Omega }({{\,\textrm{SQ}\,}}(\overline{G}))\) rounds in both \({{\,\textrm{CONGEST}\,}}\) and Supported-\({{\,\textrm{CONGEST}\,}}\) models.

On the upper-bound side, we utilize the congested part-wise aggregation primitive to improve and refine the Laplacian solver of Forster et al. [11], leading to a substantial improvement in the round complexity under structured network topologies.

Theorem 2

Consider any n-node graph G with shortcut quality \({{\,\textrm{SQ}\,}}(G)\) and hop-diameter D. There exists a distributed Laplacian solver with error \(\varepsilon > 0\) with the following guarantees:

  • In the Supported-\({{\,\textrm{CONGEST}\,}}\) model, it requires \(n^{o(1)} {{\,\textrm{SQ}\,}}(G) \log (1/\varepsilon )\) rounds.

  • In the \({{\,\textrm{CONGEST}\,}}\) model, it requires \(n^{o(1)} {{\,\textrm{poly}\,}}({{\,\textrm{SQ}\,}}(G)) \log (1/\varepsilon )\) rounds.

  • In the \({{\,\textrm{CONGEST}\,}}\) model on graphs with minor density \(\delta \), it requires \(n^{o(1)} \delta D \log (1/\varepsilon )\) rounds.

We note that the above algorithm is almost (up to inherent \(n^{o(1)}\) factors) universally optimality for most settings of interest. Since it is (almost) matching the \({{\,\textrm{SQ}\,}}(G)\)-lower-bound, it is unconditionally universally optimal when the topology is known in advance (i.e., Supported-CONGEST). Furthermore, in standard CONGEST, we give almost universally optimal \(D n^{o(1)} \log (1/\varepsilon )\)-round algorithms for topologies that include planar graphs, \(n^{o(1)}\)-genus graphs, \(n^{o(1)}\)-treewidth graphs, excluded-minor graphs, since all of them are graphs with minor density \(\delta (G) = n^{o(1)}\). Furthermore, for the realistic case of \(D \le n^{o(1)}\), it holds for most networks of interest that \({{\,\textrm{SQ}\,}}(G) \le n^{o(1)}\) (e.g., expanders, hop-constrained expanders, as well as all classes mentioned earlier), for which we get \(n^{o(1)} \log (1/\varepsilon )\)-round solvers. We stress that the conjectured improvements of the state-of-the-art of almost-optimal low-congestion shortcut constructions would immediately lift our results to be unconditionally universally optimal in CONGEST; this issue is orthogonal and not within the scope of this paper. It is also worth pointing out that both our algorithms for solving congested part-wise aggregations and the building blocks of the Laplacian solver of Forster et al. [11] are in general randomized, and so all of our guarantees apply with high probability.

Furthermore, in \({{\,\textrm{HYBRID}\,}}\) we obtain an almost optimal complexity in general graphs:

Theorem 3

Consider any n-node graph. There exists a distributed Laplacian solver in the \({{\,\textrm{HYBRID}\,}}\) model with round complexity \(n^{o(1)} \log (1/\varepsilon )\), where \(\varepsilon > 0\) is the error of the solver.

This implies a remarkably fast subroutine for solving a Laplacian system in \({{\,\textrm{HYBRID}\,}}\) under arbitrary topologies. As a result, we corroborate the observation that a very limited amount of global power can lead to substantially faster algorithms for certain optimization problems, supplementing a recent line of work [23, 29,30,31,32,33,34,35]. Furthermore, our framework based on the congested part-wise aggregation problem allows for a unifying treatment of both (Supported-)\({{\,\textrm{CONGEST}\,}}\) and \({{\,\textrm{HYBRID}\,}}\), and we consider this to be an important conceptual contribution of our work. Indeed, as we previously explained, both of our accelerated Laplacian solvers rely on faster algorithms for solving the congested part-wise aggregation problem. In particular, for (Supported-)\({{\,\textrm{CONGEST}\,}}\) we have already described our approach in detail, while in the \({{\,\textrm{HYBRID}\,}}\) model we employ certain communication primitives developed in [27] for dealing with congestion in part-wise aggregations. A byproduct of our results is that the framework of low-congestion shortcuts interacts particularly well with the \({{\,\textrm{HYBRID}\,}}\) model, as was also observed in prior work [36].

1.2 Further related work

Our main reference point is the recent Laplacian solver of Forster et al. [11] with existentially almost-optimal complexity of \(n^{o(1)}(\sqrt{n} + D) \log (1/\varepsilon )\) rounds, where \(\varepsilon > 0\) represents the error of the solver. Specifically, they devised several new ideas and techniques to circumvent certain issues which mostly relate to the bandwidth restrictions of the \({{\,\textrm{CONGEST}\,}}\) model; these building blocks, as well as the resulting Laplacian solver are revisited in our work to refine the performance of the solver. We are not aware of any previous research addressing this problem in the distributed context. On the other hand, the Laplacian paradigm has attracted a considerable amount of interest in the community of parallel algorithms [37, 38].

Research concerning hybrid communication networks in distributed algorithms was recently initiated by Augustine et al. [23]. Specifically, they investigated the power of a model which integrates the standard \({{\,\textrm{LOCAL}\,}}\) model [39] with the recently introduced node-capacitated clique (\({{\,\mathrm{\textsc {NCC}}\,}}\)) [27], focusing mostly on distance computation tasks. Several of their results were subsequently improved and strengthened in subsequent works [30, 32] under the same model of computation. In our work we consider a substantially weaker model, imposing a severe limitation on the communication over the “local edges”. This particular variant has been already studied in some recent works for a variety of fundamental problems [31, 33].

The \({{\,\mathrm{\textsc {NCC}}\,}}\) model, which captures the global network in all hybrid models studied thus far, was introduced by Augustine et al. [27] partly to address the unrealistic power of the congested clique (\({{\,\textrm{CLIQUE}\,}}\)) [40]. In the latter model each node can communicate concurrently and independently with all other nodes by \(O(\log n)\)-bit messages. In contrast, the \({{\,\mathrm{\textsc {NCC}}\,}}\) model allows communication with \(O(\log n)\) (arbitrary) nodes per round. As a result, in the \({{\,\textrm{HYBRID}\,}}\) model and under a sparse local network, only \(\widetilde{\Theta }(n)\) bits can be exchanged overall per round, whereas \({{\,\textrm{CLIQUE}\,}}\) allows for the exchange of up to \(\widetilde{\Theta }(n^2)\) (distinct) bits. As evidence for the power of \({{\,\textrm{CLIQUE}\,}}\) we note that even slightly super-constant lower bounds would give new lower bounds in circuit complexity, as implied by a simulation argument due to Drucker et al. [41]. Finally, we remark a subsequent work that leverages tools from the Laplacian paradigm in the broadcast variant of the congested clique [42].

2 Preliminaries

General notation. We denote with \([k]:= \{ 1, 2, \ldots , k \}\). Graphs throughout this paper are undirected. The nodes and the edges of a given graph G are denoted as V(G) and E(G), respectively. We also use \(n:= |V(G) |\) for brevity. The graphs are often weighted, in which case we assume (as is standard) that for all \(e \in E(G), \varvec{w}(e) \in \{1, 2, \dots , {{\,\textrm{poly}\,}}(n)\}\). We will denote the hop-diameter of a graph G with D(G) (the hop-diameter ignores weights). Moreover, we use \(A \uplus B\) to denote the multiset union, i.e., each element is repeated according to its multiplicity; this operation corresponds to disjoint unions when \(A \cap B = \emptyset \).

Communication models. The communication network consists of a set of \(\overline{n}\) entities with \([\overline{n}]:= \{1, 2, \dots , \overline{n}\}\) being the set of their IDs, and a local communication topology given by a graph \(\overline{G}\).Footnote 3 We define \(D:= D(\overline{G})\) to be the (hop-)diameter of the underlying network. At the beginning, each node knows its own unique \(O(\log \overline{n})\)-bit identifier as well as the weights of the incident edges. Communication occurs in synchronous rounds, and in every round nodes have unlimited computational power to process the information they possess. We will consider models with both local and global communication modes.

The local communication mode will be modeled with the CONGEST model [43] and Supported-CONGEST model [44], for which in each round every node can exchange an \(O(\log \overline{n})\)-bit message with each of its neighbors in \(\overline{G}\) via the local edges. In the (standard) \({{\,\textrm{CONGEST}\,}}\) model, each node \(v \in V(\overline{G})\) initially only knows the identifiers of each node in v’s own neighborhood, but has no further knowledge about the topology of the graph. On the other hand, in the Supported-CONGEST model, all nodes know the entire topology of \(\overline{G}\) upfront, but not the input.

The global communication mode will be modeled using NCC [27], for which in each round every node can exchange \(O(\log \overline{n})\)-bit messages with \(O(\log \overline{n})\) arbitrary nodes via global edges. If the capacity of some channel is exceeded, i.e., too many messages are sent to the same node, it will only receive an arbitrary (potentially adversarially selected) subset of the information based on the capacity of the network; the rest of the messages are dropped. In this context, we will let \({{\,\textrm{HYBRID}\,}}\) be the integration of \({{\,\textrm{CONGEST}\,}}\) and \({{\,\mathrm{\textsc {NCC}}\,}}\) (i.e., nodes have both a local and a global communication mode at their disposal).

At this point, it is worth pointing out that different models are suitable for different applications. The standard CONGEST model is appropriate in local communication networks under severely congested edges, in contrast to LOCAL, another popular model which captures local networks with edges of essentially unlimited capacity. On the other hand, NCC was introduced to model networks with global capabilities, but under severe restrictions on the amount of communication possible in each round; NCC has been put forward as a more realistic counterpart to \({{\,\textrm{CLIQUE}\,}}\), which we described earlier. Correspondingly, \({{\,\textrm{HYBRID}\,}}\) is more appropriate to model networks with multiple communication modes. For an additional motivation for each of the above models, we refer to the papers that introduced them, and references therein.

The performance of a distributed algorithm will be measured in terms of its round complexity—the number of rounds required so that every node knows its part of the output. For randomized algorithms it will suffice to reach the desired state with high probability.Footnote 4 We will assume throughout this work that nodes have access to a common source of randomness; this comes without any essential loss of generality in our setting [45]. When talking about a distributed algorithm for a specific problem (e.g., Laplacian solving, part-wise aggregation, etc.) we assume the input is appropriately distributedly stored (i.e., each node will know its own part) and, upon termination, it will be required that the output is appropriately distributedly stored. The appropriate way to distributedly store the input and output will be explained in the problem definition.

Low-congestion shortcuts. A recurring scenario in distributed algorithms for global problems (e.g. MST) boils down to solving the following part-wise aggregation problem:

Definition 1

(Part-Wise Aggregation Problem) Consider an n-node graph G whose node set V(G) is partitioned into k (disjoint) parts \(P_1 \uplus \dots \uplus P_k \subseteq V(G)\) such that each induced subgraph \(G[P_i]\) is connected. In the part-wise aggregation problem, each node \(v \in V\) is given its part-ID (if any) and an \(O(\log n)\)-bit value \(\varvec{x}(v)\) as input. The goal is that, for every part \(P_i\), all nodes in \(P_i\) learn the part-wise aggregate \(\bigoplus _{w \in P_i} \varvec{x}(w)\), where \(\bigoplus \) is an arbitrary pre-defined aggregation function.

Throughout this paper, we will assume that the aggregation function \(\bigoplus \) is commutative and associative (e.g. min, sum, logical-AND), although this is not strictly needed (e.g., see [21]). To solve such problems, Ghaffari and Haeupler [15] introduced a natural combinatorial graph structure that they refer to as low-congestion shortcuts.

Definition 2

(Low-Congestion Shortcuts) Consider a graph G whose node set V(G) is partitioned into k (disjoint) parts \(P_1 \uplus \dots \uplus P_k \subseteq V(G)\) such that each induced subgraph \(G[P_i]\) is connected. A collection of subgraphs \(H_1, \dots , H_k\) is a shortcut of G with congestion c and dilation d if the following properties hold: (i) the (hop) diameter of each subgraph \(G[P_i] \cup H_i\) is at most d, and (ii) every edge is included in at most c many of the subgraphs \(H_i\). The quantity \(Q = c + d\) will be referred to as the quality of the shortcut.

Importantly, a shortcut of quality Q allows us to solve the part-wise aggregation problem in \(\widetilde{O}(Q)\) rounds of \({{\,\textrm{CONGEST}\,}}\), as formalized below.

Proposition 4

Suppose that \(P_1, \ldots , P_k\) is any part-wise aggregation instance in a communication network \(\overline{G}\). Given a shortcut of quality Q, we can solve with high probability the part-wise aggregation problem in \(\widetilde{O}(Q)\) \({{\,\textrm{CONGEST}\,}}\) rounds.

Proof

Consider only one part \(P_i\) in isolation over the network \(G[P_i] + H_i\). First, we claim that there exists a simple deterministic algorithm that computes the AND-aggregate (where each node \(v \in P_i\) has a input bit \(\varvec{x}(v)\)) in O(d) rounds, where each edge is used to send at most O(1) messages. Concretely, any node whose input is 0 will forward its input to all neighbors and deactivate itself. Any node which hears about the existence of an input-0 will forward this to all of its neighbors and deactivate itself. After O(d) rounds, either all nodes have heard about the existence of a 0 or they can conclude all inputs are 1.

We continue considering only one part \(P_i\) in isolation. The next step is to elect a leader of \(P_i\) by finding the node with the smallest ID in \(P_i\); then, (1) iterate from the most significant bit of the ID to the least significant bit of the ID; (2) compute the AND-aggregate of the current bit of all the nodes’ IDs; (3) if the AND-aggregate is 0, all nodes whose current bit of the ID is 1 will drop out.

Putting these together we have a way of computing the aggregate of a part \(P_i\) in isolation in \(\widetilde{O}(d)\) rounds with each edge carrying \(\widetilde{O}(1)\) messages: First, we elect a leader of \(P_i\). Then, the leader initiates the computation of a spanning BFS tree of \(\overline{G}[P_i] + H_i\) by broadcasting from itself to all other nodes, and each node forwards the message to all neighbors; the neighbor from which it obtains the message first is the parent in the tree. Moreover, by performing a convergecast over the BFS tree, one can easily compute the aggregate in O(d) rounds for a single part \(P_i\).

Finally, we have to run the algorithms on all the parts \(\{ P_i \}_i\) simultaneously. However, this might incur congestion issues on some edges since algorithms associated with multiple parts want to send a message through the same edge in the same round. By definition of the congestion c, at most c messages need to be passed over any one edge e. Our job is to schedule the algorithms such that, indeed, all of them complete in \(\widetilde{O}(c)\) rounds. To this end, we choose a uniformly random \(\textrm{delay}(i)\) between 0 and \(\widetilde{O}(c)\) for each part \(P_i\). Then, we start the algorithm on \(P_i\) at round \(\textrm{delay}(i)\)—this technique is known as the randomized delay [45]. Randomly delaying all algorithms makes the expected number of messages crossing a given edge in a fixed round \(\Theta (1)\). By a Chernoff bound, this number is bounded by \(\widetilde{O}(1)\) with high probability. Therefore, by simulating each round of the algorithm using \(\widetilde{O}(1)\) rounds of communication (where each round of communication carries at most a single message across an edge), we can schedule the algorithms on all parts simultaneously [45]. In turn, this allows us to complete all of the aggregates in \(\widetilde{O}(d+c) = \widetilde{O}(Q)\) rounds. \(\square \)

We recall that we are operating under the assumption that nodes share a common source of randomness, which is used in the above proof.

Shortcut quality and construction of shortcuts. Shortcut quality, introduced below, is a fundamental graph parameter that has been proven to characterize the complexity of many important problems in distributed computing.

Definition 3

Given a graph \(G = (V, E)\), we define the shortcut quality \({{\,\textrm{SQ}\,}}(G)\) of G as the optimal (smallest) shortcut quality of the worst-case partition of V into disjoint and connected parts \(P_1 \uplus P_2 \uplus \ldots \uplus P_k \subseteq V\).

For fundamental problems such as MST, SSSP, and Min-Cut any correct algorithm requires \(\widetilde{\Omega }({{\,\textrm{SQ}\,}}(\overline{G}))\) rounds on any network \(\overline{G}\), even if we allow randomized solutions and (non-trivial) approximation factors. In fact, this limitation holds even when the network topology \(\overline{G}\) is known to all nodes in advance [18]. We remark that \(\widetilde{\Omega }(D(\overline{G})) \le {{\,\textrm{SQ}\,}}(\overline{G}) \le O(D(\overline{G}) + \sqrt{\overline{n}})\), and the upper bound is known to be tight in certain (pathological) worst-case graph instances [15].

Moreover, assuming fast distributed algorithms for constructing shortcuts of quality competitive with \({{\,\textrm{SQ}\,}}(\overline{G})\), all of the aforementioned problems can be solved in \(\widetilde{O}( {{\,\textrm{SQ}\,}}(\overline{G}))\) rounds [15, 20, 21]. However, the key issue here is the algorithmic construction of the shortcuts upon which the above papers rely. While there has been a lot of recent progress in this regard, current algorithms are quite complicated and have sub-optimal guarantees. We recall below these state-of-the-art \({{\,\textrm{SQ}\,}}(\overline{G})\)-competitive construction results.

Theorem 5

There exists a distributed algorithm that, given any part-wise aggregation instance on any \(\overline{n}\)-node graph \(\overline{G}\), computes with high probability a shortcut with the following guarantees:

  • In \({{\,\textrm{CONGEST}\,}}\), the shortcut has quality \({{\,\textrm{poly}\,}}\left( {{\,\textrm{SQ}\,}}(\overline{G})\right) \cdot \overline{n}^{o(1)}\) and the algorithm terminates in \({{\,\textrm{poly}\,}}\left( {{\,\textrm{SQ}\,}}(\overline{G})\right) \cdot \overline{n}^{o(1)}\) rounds [22].

  • In Supported-CONGEST, the shortcut has quality \(\widetilde{O}({{\,\textrm{SQ}\,}}(\overline{G}))\) and the algorithm terminates in \(\widetilde{O}({{\,\textrm{SQ}\,}}(\overline{G}))\) rounds [18].

Universal optimality. A distributed algorithm is said to be \(\alpha \)-universally optimal if, on every network graph \(\overline{G}\), it is \(\alpha \)-competitive with the fastest correct algorithm on \(\overline{G}\) [18]. Even the existence of such algorithms is not at all clear as it would seem possible that vastly different algorithms are required to leverage the structure of different networks. Nevertheless, a remarkable consequence of Theorem 5 is that in Supported-CONGEST we can design \(\widetilde{O}(1)\)-universally optimal algorithms for many fundamental optimization problems. Moreover, efficient shortcut construction is the only obstacle towards achieving these results in the full generality of \({{\,\textrm{CONGEST}\,}}\), which is an orthogonal issue and out of scope for this paper. Still, the aforementioned results are sufficient to design \(\overline{n}^{o(1)}\)-universally optimal algorithms on graphs that have shortcut quality \({{\,\textrm{SQ}\,}}(\overline{G}) = \overline{n}^{o(1)}\).

Graphs excluding dense minors. It turns out that the crucial issue of efficient shortcut construction can be resolved with a near-optimal, simple, and even deterministic algorithm for the rich class of graphs with bounded minor density. Formally, let us first recall the following definition.

Definition 4

(Minor Density) The minor density \(\delta (G)\) of a graph G is defined as

$$\begin{aligned} \delta (G) = \max \left\{ \frac{ |E' |}{|V'|} : H = (V', E') \text { is a minor of}\ G \right\} . \end{aligned}$$

Any family of graphs closed under taking minors (such as planar graphs) has a constant minor density. For such graphs, Ghaffari and Haeupler [19] established an efficient shortcut construction:

Theorem 6

([19]) Any graph G with hop-diameter D and minor density \(\delta (G)\) admits shortcuts of quality \(\widetilde{O}(\delta D)\), which can be constructed with high probability in \(\widetilde{O}(\delta D)\) rounds of \({{\,\textrm{CONGEST}\,}}\).

Some of our results apply for communication networks with bounded treewidth, so let us recall the following definition.

Definition 5

(Tree Decomposition and Treewidth) A tree decomposition of a graph G is a tree T with tree-nodes \(X_1, \dots , X_k\), where each \(X_i\) is a subset of V(G) satisfying the following properties:

  1. 1.

    \(V = \bigcup _{i=1}^k X_i\);

  2. 2.

    For any node \(u \in V(G)\), the tree-nodes containing u form a connected subtree of T;

  3. 3.

    For every edge \(\{u, v\} \in E(G)\), there exists a tree-node \(X_i\) which contains both u and v.

The width w of the tree decomposition is defined as \(w:= \max _{i \in [k]} |X_i |- 1\). Moreover, the treewidth \({{\,\textrm{tw}\,}}(G)\) of G is defined as the minimum of the width among all possible tree decompositions of G.

Bounded-treewidth graphs inherit all of the nice properties guaranteed by Theorem 6, as implied by the following well-known fact.

Fact 7

For any graph G, \(\delta (G) \le {{\,\textrm{tw}\,}}(G)\).

The Laplacian matrix. Consider a weighted undirected graph \(G = (V, E, \varvec{w} > 0)\). The Laplacian of the graph G is defined as

$$\begin{aligned} {\mathcal {L}}(G)_{u, v} ={\left\{ \begin{array}{ll} \sum _{\{u, z \} \in E} \varvec{w}(u, z) &{} \text {If}\ u = v, \\ - \varvec{w}(u, v) &{} \textrm{otherwise}. \end{array}\right. } \end{aligned}$$

The Laplacian matrix of a graph is (i) symmetric (\({\mathcal {L}}(G)^T = {\mathcal {L}}(G)\)); (ii) positive semi-definite (\(\varvec{x}^T {\mathcal {L}}(G) \varvec{x} \ge 0\) for any \(\varvec{x}\)); and (iii) weakly diagonally dominant (\({\mathcal {L}}(G)_{u, u} \ge \sum _{v \ne u} |{\mathcal {L}}(G)_{u, v} |\)).

Further notation. Consider two positive semi-definite matrices \({\textbf{A}}, {\textbf{B}} \in {\mathbb {R}}^{n \times n}\). For a vector \(\varvec{x} \in {\mathbb {R}}^n\) we define \(\Vert \varvec{x}\Vert _{{\textbf{A}}}:= \sqrt{\varvec{x}^T {\textbf{A}} \varvec{x}}\) (Mahalanobis norm). We will write \({\textbf{A}} \approx _{\varepsilon }{\textbf{B}}\) if \(\exp (-\varepsilon ) {\textbf{A}} \preceq {\textbf{B}} \preceq \exp (\varepsilon ) {\textbf{A}}\), where \({\textbf{A}} \preceq {\textbf{B}}\) if and only if the matrix \({\textbf{B}} - {\textbf{A}}\) is positive semi-definite. For an edge \(e = \{u, v\}\), we will let \(\varvec{b}(e):= \mathbb {1}_{u} - \mathbb {1}_{v}\), where \(\mathbb {1}_{u} \in {\mathbb {R}}^n\) represents the characteristic vector of node u. For a graph G with resistances \(\varvec{r}(e)\), we define the leverage scores as \({{\,\textrm{lev}\,}}_G(e):= \varvec{r}(e)^{-1} \varvec{b}^T(e) {\mathcal {L}}(G)^{\dagger } \varvec{b}(e)\). Note that \(0 \le {{\,\textrm{lev}\,}}_G(e) \le 1\).

Fig. 1
figure 1

A 2-congested part-wise aggregation problem on a \(6 \times 6\) grid (the instance immediately extends to an \(\sqrt{\overline{n}} \times \sqrt{\overline{n}}\) topology). Different colors highlight different parts of the instance

3 Congested part-wise aggregations

This section is concerned with a congested generalization of the standard part-wise aggregation problem (Definition 1), formally introduced below.

Definition 6

(Congested Part-Wise Aggregation Problem) Consider an n-node graph G with a collection of k subsets of nodes \(P_1, \ldots , P_k \subseteq V(G)\) called parts such that each induced subgraph \(G[P_i]\) is connected and each node \(v \in V(G)\) is contained in at most \(\rho \in {\mathbb {Z}}_{\ge 1}\) many parts, i.e., \(\forall v \in V(G)\ \ |\{ i: P_i \ni v \} |\le \rho \). In the \(\rho \)-congested part-wise aggregation problem, each node v is given the following as input: for each part \(P_i \ni v\) node v knows the part-ID i and an \(O(\log n)\)-bit part-specific value \(\varvec{x}_i(v)\). The goal is that, for each part \(P_i\), all nodes in \(P_i\) learn the part-wise aggregate \(\bigoplus _{w \in P_i} \varvec{x}_i(w)\), where \(\bigoplus \) is a pre-defined aggregation function.

This congested generalization of the standard part-wise aggregation problem that we study in this section turns out to be a central ingredient in our refined Laplacian solver; this is further explained in Sect. 4. The remainder of this section is organized as follows. In Sect. 3.1 we establish near-optimal algorithms for solving congested part-wise aggregations in \({{\,\textrm{CONGEST}\,}}\), which is also the main focus of this section. We conclude by pointing out the construction for \({{\,\mathrm{\textsc {NCC}}\,}}\) in Sect. 3.2.

3.1 Solving congested instances in the CONGEST model

The first natural strategy for solving the \(\rho \)-congested part-wise aggregation problem of Definition 6 is through a reduction to \({{\,\textrm{poly}\,}}(\rho )\) 1-congested instances. However, this approach immediately fails even if we allow \(\rho = 2\). Indeed, there exist congested part-wise aggregation instances for which every two (distinct) parts share a common node, even when \(\rho = 2\), leading to the following observation.

Observation 1

For an infinite family of values \(\overline{n}\), there exists an \(\overline{n}\)-node planar graph \(\overline{G}\) and a 2-congested part-wise aggregation instance \({\mathcal {I}}\) with \(k = \Theta (\sqrt{\overline{n}})\) parts such that reducing \({\mathcal {I}}\) to the union of \(k'\) 1-congested part-wise aggregation instances on \(\overline{G}\) requires \(k' = \Omega (\sqrt{\overline{n}})\).

Such a pattern is illustrated in Fig. 1. Indeed, in that 2-congested part-wise aggregation instance every two distinct parts share a common node. As a result, directly employing a 1-congested part-wise aggregation oracle is of little use since it would introduce an overhead depending on the number of parts. In light of this, we develop a more refined approach that leverages what we refer to as the layered graph.

3.1.1 The layered graph

Here we introduce the layered graph \(\widehat{G}_{\rho }\), associated with the underlying graph \(\overline{G}\). Then, we reduce any \(\rho \)-congested part-wise aggregation on \(\overline{G}\) to a 1-congested instance on \(\widehat{G}_{O(\rho )}\).

The layered graph. Consider an underlying network \(\overline{G}\) and some \(\rho \in {\mathbb {Z}}_{\ge 1}\), corresponding to the congestion parameter in Definition 6. The layered graph \(\widehat{G}_\rho \) is constructed in the following way. First, we let \(\widehat{G}_\rho \) be a disjoint union of \(\rho \) copies of \(\overline{G}\) (called layers), namely \(\overline{G}_1, \overline{G}_2, \ldots , \overline{G}_{\rho }\). Each node \(v \in V(\overline{G})\) is associated with its copies \(v_1, v_2, \ldots , v_\rho \in V(\widehat{G}_\rho )\). We also add an edge between each two copies that originate from the same node (i.e., we add a clique to \(\widehat{G}_\rho \) on the set of copies associated with the same node \(v \in V(\overline{G})\)); this construction is illustrated in Fig. 2. The layered graph induces a natural projection operation \(\pi : V(\widehat{G}_{\rho }) \rightarrow V(\overline{G})\) which maps a copy \(v_i\) to its original node \(v=\pi (v_i)\). Furthermore, we often talk about simulating \(\widehat{G}_{\rho }\) in \(\overline{G}\), by which we mean that each node v simulates—learns all the inputs and can generate all outputs—for its copies \(v_1, \ldots , v_\rho \). Throughout this paper, we will assume that \(\rho = {{\,\textrm{poly}\,}}(\overline{n})\) so that any \(O(\log n)\)-bit message on \(\widehat{G}_{\rho }\) can be sent within O(1) rounds in \(\overline{G}\); this also keeps the \(\widetilde{O}\)-notation well-defined.

The main goal of this section is to establish that the \(\rho \)-congested part-wise aggregation problem on \(\overline{G}\) can be reduced to a 1-congested instance on \(\widehat{G}_{O(\rho )}\), as formalized below.

Lemma 8

(Unrestricted Congested Part-Wise Aggregation) Let \(\overline{G}\) be an \(\overline{n}\)-node graph and let \({\mathbb {Z}}_{\ge 1} \ni \rho \le {{\,\textrm{poly}\,}}(\overline{n})\). Suppose that any (1-congested) part-wise aggregation on \(\widehat{G}_{O(\rho )}\) can be solved with a \(\tau \)-round CONGEST algorithm on \(\widehat{G}_{O(\rho )}\). Then, there exists an \(\widetilde{O}(\rho \cdot \tau )\)-round CONGEST algorithm on \(\overline{G}\) that solves any \(\rho \)-congested part-wise aggregation instance on \(\overline{G}\).

Fig. 2
figure 2

An example of a transformation from \(\overline{G}\) to the layered graph \(\widehat{G}_{\rho }\) with \(\rho = 3\). We have highlighted with different colors different layers of the graph

Towards establishing this reduction, we first point out that any \({{\,\textrm{CONGEST}\,}}\) algorithm on \(\widehat{G}_{\rho }\) can be simulated with only a \(\rho \) multiplicative overhead in the round complexity.

Lemma 9

(Simulating \(\widehat{G}_{\rho }\) in \(\overline{G}\)) For any \(\overline{G}\) and any \({\mathbb {Z}}_{\ge 1} \ni \rho \le {{\,\textrm{poly}\,}}(\overline{n})\), we can simulate any \(\tau \)-round CONGEST algorithm on \(\widehat{G}_\rho \) with a \((\rho \cdot \tau )\)-round CONGEST algorithm on \(\overline{G}\).

Proof

Let us consider one round of communication in \(\widehat{G}_\rho \). Each node v will simulate (learn all messages coming into) its copies \(v_1, \ldots , v_\rho \in V(\widehat{G}_\rho )\). Therefore, in each round node \(v \in V(\overline{G})\) needs to learn all messages sent to v’s copies \(v_1, \ldots , v_\rho \in V(\widehat{G}_\rho )\) from their neighbors in \(\widehat{G}_{\rho }\). Note that, by definition, v already knows the messages sent between any two copies \(v_i\) and \(v_j\). Hence, in a single round v can learn all messages sent to any fixed \(v_i\). As a result, \(\rho \) rounds of communication in \(\overline{G}\) suffice to simulate a single round in \(\widehat{G}_{\rho }\). \(\square \)

Furthermore, we will use a folklore result showing how to color a (multi)graph of maximum degree \(\Delta \) in \(O(\Delta )\) colors in \(O(\log n)\) rounds of CONGEST.

Fact 10

(Folklore, [46]) Given a (multi)graph G with n nodes and maximum degree \(\Delta \le {{\,\textrm{poly}\,}}(n)\), there exists a randomized CONGEST algorithm that colors the edges of G with \(O(\Delta )\) colors and completes in \(O(\log n)\) rounds, with high probability. The coloring is proper, i.e., two edges that share an endpoint are assigned a different color.

By multigraph here we simply mean that there can be multiple parallel edges between the same pair of nodes, and every such edge can carry an independent message per round. For completeness, we provide the simple proof below.

Proof of Fact 10

A simple edge-coloring algorithm presented by Johansson [46] works by choosing a color uniformly at random from the set \(\{1, \ldots , O(\Delta )\}\) for each edge. Each edge will, with constant probability, choose a color not used by its neighbors. Then, this color stays fixed and the edge drops out. Hence, after \(O(\log n)\) iterations the edges will be properly colored. Implementation-wise, we can assume there is an additional node in the middle of each edge which represents that edge (this only makes the problem harder). Each edge randomly chooses and sends its color to its endpoints which, in turn, inform on whether there is a conflict. Then, the edges send back to its endpoints whether it dropped out. This iteration is then repeated until we reach a proper coloring. \(\square \)

Using this lemma, we first prove a version of our main reduction (Lemma 8), but with the slight twist that we restrict each part of the \(\rho \)-congested part-wise aggregation problem to be a simple path.

Lemma 11

(Path-Restricted Congested Part-Wise Aggregation) Let \(\overline{G}\) be an \(\overline{n}\)-node graph and let \({\mathbb {Z}}_{\ge 1} \ni \rho \le {{\,\textrm{poly}\,}}(\overline{n})\). Suppose that there exists a \(\tau \)-round CONGEST algorithm solving the (1-congested) part-wise aggregation on \(\widehat{G}_{O(\rho )}\). Then, there exists an \(\widetilde{O}(\rho \cdot \tau )\)-round CONGEST algorithm on \(\overline{G}\) that solves any \(\rho \)-congested part-wise aggregation instance on \(\overline{G}\) when each part is restricted to be a simple pathFootnote 5 (nodes are not repeated in simple paths).

Fig. 3
figure 3

The layered graph \(\widehat{G}_{\rho }\) of a \(3 \times 3\) grid with every node having congestion \(\rho = 2\) (left), and a minor of \(\widehat{G}_{\rho }\) induced by the connected components \(\{C_1, C_2, C_3, R_1, R_2, R_3\}\) (right)

Proof

Let \({\mathcal {P}}= \{ P_1, P_2, \ldots , P_k \}\) be subsets of nodes in \(\overline{G}\) comprising the parts of some \(\rho \)-congested part-wise aggregation on \(\overline{G}\). We will construct paths \({\mathcal {P}}' = \{ P'_1, P'_2, \ldots , P'_k \}\) in \(\widehat{G}_{O(\rho )}\) in a way that solving a part-wise aggregation on \({\mathcal {P}}'\) corresponds to solving a \(\rho \)-congested part-wise aggregation on \({\mathcal {P}}\).

Let \(E_i\) be the set of edges of \(\overline{G}\) comprising the simple path traversing all the nodes in \(P_i\), and consider the graph \(G':= (V(\overline{G}), \biguplus _{i=1}^k E_i)\). First, we observe that the degree of any node in \(v \in V(G') = V(\overline{G})\) is at most \(2 \rho \) since at most \(\rho \) many parts contain v and each part contributes at most 2 to the degree (since \(P_i\) is a simple path). Furthermore, we can simulate any \(\psi \)-round CONGEST algorithm on \(G'\) with a \((\psi \cdot \rho )\)-round CONGEST algorithm on \(\overline{G}\) as each edge \(e \in E(\overline{G})\) appears at most \(\rho \) times in \(E(G')\) due to the part-wise aggregation instance being at most \(\rho \)-congested. Therefore, using Fact 10 we can distributedly color the edges of \(G'\) into at most \(O(\rho )\) colors in \(O(\log n)\) CONGEST rounds on \(G'\), which translates to \(\widetilde{O}(\rho )\) CONGEST rounds on \(\overline{G}\). Suppose that the algorithm assigns a color \(\varvec{c}(e) \in \{ 1, \ldots , O(\rho ) \}\) to each edge \(e \in \biguplus _i E_i\).

We now construct \(P'_i \subseteq \widehat{G}_{O(\rho )}\) as follows: consider each edge \(\{u, v\} \in E_i\) and add both \(u_{\varvec{c}(\{u, v\})}, v_{\varvec{c}(\{u, v\})} \in V(\widehat{G}_{O(\rho )})\) to \(P'_i\) (i.e., the \(\varvec{c}(\{u,v\})\)-th copy of both u and v). By construction, \(P'_i\) induces a connected subgraph and the projection \(P'_i\) to \(\overline{G}\) is exactly \(P_i\). Next, we invoke the (1-congested) part-wise aggregation \(\tau \)-round algorithm for \(\{{\mathcal {P}}'_1, \ldots , {\mathcal {P}}'_k \}\) on \(\widehat{G}_{\rho }\), which can be converted to an \(\widetilde{O}(\tau \cdot \rho )\)-round algorithm on \(\overline{G}\) (Lemma 9). Thus, we obtain an \(\widetilde{O}(\tau \cdot \rho )\)-round CONGEST algorithm on \(\overline{G}\) which solves any path-restricted \(\rho \)-congested part-wise aggregation problem. \(\square \)

Finally, our reduction claimed in Lemma 8 follows via [18, Lemma 7.2], as we formalize below.

Proof of Lemma 8

Armed with Lemma 11, the claim follows by leveraging [18, Lemma 7.2 in the Full Version]. For completeness, their result states the following: Suppose we are given a collection of part-wise aggregation parts \(\{P_i\}_i\) in any graph H. Then, one can solve the corresponding part-wise aggregate problem by reducing it to \(\widetilde{O}(1)\)-many (1-congested) part-wise aggregations between disjoint parts restricted to be simple paths \({\mathcal {P}}'_i = \{ P'_{i, j} \}_{j=1}^{\widetilde{O}(1)}\). Here, \(P'_{i, j}\) is a collection of node-disjoint simple paths restricted to \(P_i\). Note that, for each j, \(\bigcup _{i} P'_{i,j}\) is a collection of node-disjoint simple paths (since \(P'_{i,j} \subseteq P_i\)). Hence, it is sufficient to solve part-wise aggregation on a collection of node-disjoint simple paths on H.

For our proof, we use the above result for \(H = \overline{G}\). Moreover, \(\bigcup _{i} P'_{i,j}\) is \(\rho \)-congested since at most \(\rho \) parts \(P_i\) use any node v, and within each such \(P_i\), every oracle call uses the node v at most once (since they are disjoint). By Lemma 11, the part-wise aggregation problem on \(\rho \)-congested node-disjoint simple paths is exactly handled in \(\widetilde{O}(\rho \cdot \tau )\) CONGEST rounds. Therefore, the original problem can also be solved in \(\widetilde{O}(\rho \cdot \tau )\) CONGEST rounds, as required. \(\square \)

3.1.2 Treewidth-bounded graphs

Here we leverage the reduction we established in Lemma 8 to obtain a simple algorithm for solving the congested part-wise aggregation problem in treewidth-bounded graphs. The crucial observation is that the treewidth of the layered graph can only grow by a factor of \(\rho \) compared to the treewidth of the underlying graph, as we show below.

Claim 12

\(D(\widehat{G}_{\rho }) \le D(\overline{G}) + 1\).

Proof

First, consider any two nodes \(u_i, v_j \in V(\widehat{G}_{\rho })\) such that \(\pi (u_i) \ne \pi (v_j)\), with \(i, j \in [\rho ]\). By construction of the layered graph \(\overline{G}_i\), there exists a path of length at most \(D(\overline{G})\) in the i-th layer of \(\widehat{G}_{\rho }\) between \(u_i\) to \(v_i\). Thus, it follows that the (hop) distance between \(u_i\) and \(v_j\) is at most \(D(\overline{G})\) given that \(v_j\) and \(v_i\), with \(i \ne j\), are adjacent—the copies form a clique in the layered graph. This also implies that the distance between any two nodes \(u_i\) and \(u_j\), with \(\pi (u_i) = \pi (u_j)\), is 1, concluding the proof. \(\square \)

Lemma 13

If the treewidth of \(\overline{G}\) is \({{\,\textrm{tw}\,}}(\overline{G})\), then \({{\,\textrm{tw}\,}}(\widehat{G}_{\rho }) \le \rho {{\,\textrm{tw}\,}}(\overline{G}) + \rho - 1\).

Proof

Consider a tree decomposition (in the sense of Definition 5) of \(\overline{G}\) into tree-nodes \(\{X_j\}_{j=1}^k\) such that the width of the decomposition satisfies \(w = {{\,\textrm{tw}\,}}(\overline{G})\). We will show that there exists a tree decomposition on the graph \(\widehat{G}_{\rho }\) with width at most \(\rho (w + 1) - 1\), which in turn will imply that \({{\,\textrm{tw}\,}}(\widehat{G}_{\rho }) \le \rho ( w + 1) -1 = \rho ({{\,\textrm{tw}\,}}(\overline{G}) +1) - 1\). Indeed, consider the following sets:

$$\begin{aligned} \widehat{X}_j := \{ u_i : u \in X_j, i \in [\rho ] \}, \end{aligned}$$

for all \(j \in [k]\). In words, each node \(V(\overline{G}) \ni u \in X_j\) is replaced by all of its copies \(u_i\) in \(\widehat{X}_j\). Observe that, by construction, \(|\widehat{X}_j |= \rho |X_j |\). Thus, it suffices to show that the collection of sets \(\{\widehat{X}_j \}_{j=1}^k\) forms a legitimate tree decomposition. First, since \(V(\overline{G}) \subseteq \bigcup _j X_j\), it follows that \(V(\widehat{G}_{\rho }) \subseteq \bigcup \widehat{X}_j\). Moreover, consider any two sets \(\widehat{X}_j, \widehat{X}_{\ell }\), both containing a node \(u_i \in V(\widehat{G}_{\rho })\) for some \(i \in [\rho ]\). Then, we know that all the tree-nodes in the (unique) path between \(X_j\) and \(X_{\ell }\) based on the original tree decomposition include u since \(X_j\) and \(X_{\ell }\) both include u and \(\{ X_j \}\) is a tree decomposition of \(\overline{G}\). In turn, this implies that all the tree-nodes in the path between \(\widehat{X}_j\) and \(\widehat{X}_{\ell }\) also contain \(u_i\). Thus, the tree-nodes containing \(u_i\) form a connected subtree. Finally, we know that for every edge \(\{u, v\} \in E(\overline{G})\) there exists a subset \(X_j\) such that \(u, v \in X_j\). Hence, we can infer that for every edge in \(E(\widehat{G}_{\rho })\) there is a tree-node \(\widehat{X}_j\) which includes both incident endpoints. As a result, we have constructed a tree decomposition in \(\widehat{G}_{\rho }\) with width \(\max _{j \in [k]} |\widehat{X}_j |- 1 \le \rho (w + 1) - 1\). \(\square \)

Combining this guarantee with Fact 7, Lemma 8, Theorem 6, and Claim 12, we obtain the following immediate consequence.

Corollary 14

Let \(\overline{G}\) be an \(\overline{n}\)-node communication network of diameter at most D and treewidth \({{\,\textrm{tw}\,}}(\overline{G})\). Then, we can solve with high probability any \(\rho \)-congested part-wise aggregation problem in \(\overline{G}\) within \(\widetilde{O}(\rho ^2 \cdot {{\,\textrm{tw}\,}}(\overline{G}) \cdot D )\) rounds of \({{\,\textrm{CONGEST}\,}}\).

Minor density in the layered graph. In light of Lemma 13, a natural question is whether an analogous bound holds with respect to the minor density of the underlying graph; i.e., whether \(\delta (\widehat{G}_{\rho }) = {{\,\textrm{poly}\,}}(\rho ) \delta ({G})\). Unfortunately, this is not possible, as illustrated in Fig. 3.

Observation 2

There exists an n-node graph G with minor density \(\delta ({G}) = \widetilde{O}(1)\), but its 2-layered version \(\widehat{G}_{2}\) has minor density \(\delta (\widehat{G}_{2}) = \Omega (\sqrt{{n}})\).

3.1.3 General graphs

We conclude with our main result of Sect. 3.1: a near-optimal distributed algorithm for solving the \(\rho \)-congested part-wise aggregation problem in general graphs. In light of our reduction in Lemma 8, the technical crux is to control the degradation in the shortcut quality incurred by the transformation into the layered graph. Surprisingly, we show that the shortcut quality of \(\widehat{G}_{\rho }\) does not increase by more than a polylogarithmic factor even when the number of layers is polynomial:

Theorem 15

For any \(\overline{n}\)-node graph \(\overline{G}\) and any \({\mathbb {Z}}_{\ge 1} \ni \rho \le {{\,\textrm{poly}\,}}(\overline{n})\), we have that \({{\,\textrm{SQ}\,}}(\widehat{G}_{\rho }) =\widetilde{O}({{\,\textrm{SQ}\,}}(\overline{G}))\).

This theorem improves over our previous result for treewidth-bounded graphs (Lemma 13) since the latter guarantee inevitably induces a linear factor of \(\rho \) in the shortcut quality of \(\widehat{G}_{\rho }\); in contrast, Theorem 15 guarantees merely a polylogarithmic in \(\rho \) degradation in the shortcut quality. While this will not affect the asymptotic performance of the Laplacian solver, this improvement might prove to be important for future applications. Assuming that we have shown Theorem 15, we can then utilize the efficient shortcut constructions given in Theorem 5 to solve \(\rho \)-congested part-wise aggregations on any graph.

Corollary 16

There exists a randomized distributed algorithm that, for any \(\overline{n}\)-node graph \(\overline{G}\) and \(\rho \in {\mathbb {Z}}_{\ge 1} \le {{\,\textrm{poly}\,}}(\overline{n})\), solves with high probability any \(\rho \)-congested part-wise aggregation instance on \(\overline{G}\) with the following guarantees:

  • In the \({{\,\textrm{CONGEST}\,}}\), the algorithm terminates in at most \(\rho \cdot {{\,\textrm{poly}\,}}\left( {{\,\textrm{SQ}\,}}(\overline{G}) \right) \cdot \overline{n}^{o(1)}\) rounds.

  • In the \({{\,\textrm{CONGEST}\,}}\) model on graphs with minor density \(\delta \), it requires \(\widetilde{O}(\rho \cdot \delta \cdot D)\) rounds.

  • In the Supported-\({{\,\textrm{CONGEST}\,}}\), the algorithm terminates in \(\widetilde{O}(\rho \cdot {{\,\textrm{SQ}\,}}(\overline{G}))\) rounds.

The rest of this subsection is dedicated to the proof of Theorem 15. First, to argue about the shortcut quality of the layered graph, we need to develop several generalized notions of node connectivity.

Pair node connectivity. Given a (multi)set of source-sink pairs \({\mathcal {P}}= \{ (s_i, t_i) \}_{i=1}^k\) in G, we say that \({\mathcal {P}}\) has pair node connectivity \(\rho \) if there exist paths \(P_1, \ldots , P_k\), with \(s_i\) and \(t_i\) being the endpoints of each \(P_i\), such that every node \(v \in V(G)\) is contained in at most \(\rho \) many paths, i.e., for all v we have \(|\{ i: V(P_i) \ni v \}|\le \rho \). If \({\mathcal {P}}\) has pair node connectivity 1 we say that the pairs in \({\mathcal {P}}\) are node-disjointly connectable.

Any-to-any node connectivity. Suppose that we are given multisets of k sources \(S = \{ s_1, \ldots , s_k \}\) and k sinks \(T = \{ t_1 \ldots , t_k \}\). We say that (ST) have any-to-any node connectivity \(\rho \) if there is a permutation \(\pi : \{ 1, \ldots , k \} \rightarrow \{ 1, \ldots , k \}\) such that the pairs \(\{ (s_i, t_{\pi (i)}) \}_{i=1}^k\) have pair node connectivity \(\rho \). If (ST) have any-to-any node connectivity 1 we say that the multisets (ST) are any-to-any node-disjointly connectable.

The following decomposition lemma states that two sets with any-to-any node connectivity \(\rho \) can be decomposed into \(\widetilde{O}(\rho )\) many pairs of subsets that are any-to-any node-disjointly connectable.

Lemma 17

Given a graph G, suppose we are given any two multisets of nodes \(S \subseteq V(G)\) and \(T \subseteq V(G)\) of size \(k:= |S|= |T |\) that have any-to-any node connectivity \(\rho \). Then, we can partition \(S = S_1 \uplus S_2 \uplus \ldots \uplus S_{O(\rho \log k)}\) and \(T = T_1 \uplus T_2 \uplus \ldots T_{O(\rho \log k)}\) such that \(|S_i |= |T_i |\) and \((S_i, T_i)\) are any-to-any node-disjointly connectable.

Proof

Suppose that each edge in G has infinite capacity while each node in G has unit capacity. Then, let us connect a super-source s to each node \(x \in S\) with a unit-capacity edge, and a super-sink t to each node \(x \in T\) with a unit capacity edge. By assumption, we know that there exists a flow f over E(G) which sends k units of flow from s to t with edge congestion 1 and node congestion at most \(\rho \). Therefore, the flow \(f / \rho \) sending \(k / \rho \) units of flow from s to t is a feasible solution of the maximum flow linear program with node constraints (i.e., it satisfies both edge and node capacity constraints). Since that linear program is integral (i.e., has an integrality gap of 1), there exists an integral flow \(f'\) which sends at least \(k / \rho \) units of flow and satisfies both node and edge capacity restrictions. In other words, there exist at least \(k / \rho \) node disjoint paths (with the exception of the endpoints) between s and t. Let \(S_1 \subseteq S\) (\(T_1 \subseteq T\)) be the set of nodes on these paths immediately following the super-source (just before the super-sink, respectively). Clearly, by construction, \((S_1, T_1)\) are any-to-any node-disjointly connectable. Finally, we define \(S' \leftarrow S {\setminus } S_1, T' \leftarrow T {\setminus } T_1\) and proceed iteratively as above (producing \(S_2, T_2\) instead of \(S_1, T_1\)). In each step, the size of \(S'\) and \(T'\) decreases by at least a multiplicative factor of \(1 - 1/\rho \). Hence, \(O(\rho \log k)\) steps suffice so that \(S' = T' = \emptyset \). \(\square \)

Next, we introduce two communication tasks that will be useful for characterizing the shortcut quality.

Multiple-unicast problem. Suppose that we are given k source-sink pairs \({\mathcal {P}}= \{ (s_i, t_i) \}_{i=1}^k\). The goal is to find the smallest possible completion time \(\tau \) such that there are k paths \(P_1, \ldots , P_k\) for which (1) the endpoints of each \(P_i\) are exactly \(s_i\) and \(t_i\); (2) the dilation is \(\tau \), i.e., each path \(P_i\) has at most \(\tau \) hops; and (3) the congestion is \(\tau \), i.e., each edge \(e \in E(G)\) is contained in at most \(\tau \) many paths.

Any-to-any-cast problem. Suppose we are given k sources \(S = \{ s_1, \ldots , s_k \}\) and k sinks \(T = \{ t_1 \ldots , t_k \}\). The goal is to find the smallest \(\tau \) so that there is a permutation \(\pi : \{ 1, \ldots , k \} \rightarrow \{ 1, \ldots , k \}\) for which the multiple-unicast problem on \(\{ ( s_i, t_{\pi (i)} ) \}_{i=1}^k\) has at most \(\tau \) completion time.

Finally, we now recall (a reinterpretation of) a result characterizing shortcut quality from [18, 47]. Shortcut quality was originally defined as the smallest completion-time of the worst-case generalized (with respect to parts) multiple-unicast (i.e., multi-commodity) problem over a pair node-disjointly connectable instance (Definition 3). Using recent network coding gap results, we can equivalently express shortcut quality as the smallest completion-time of the worst-case any-to-any-cast (i.e., single-commodity) problem over sources and sinks that are any-to-any node-disjointly connectable. The formal statement follows.

Theorem 18

([18, 47]) Consider any graph G and let \(\tau \) be the worst-case completion time of any-to-any-cast problems taken over all any-to-any node-disjointly connectable sets \((S \subseteq V(G), T \subseteq V(G))\). Then, \(\tau = \widetilde{\Theta }({{\,\textrm{SQ}\,}}(G))\).

Proof

It was proven in [18, Lemma 2.8 in the Full Version] that \({{\,\textrm{SQ}\,}}(G)\) is, up to \(\widetilde{\Theta }(1)\) factors, equal to the completion time C of some multiple-unicast instance with respect to some source-sink pairs \({\mathcal {P}}:= \{ (s_i, t_i) \}_{i=1}^k\) that are pair node-disjointly connectable. We note that, since sources and sinks are disjoint, it follows that \(k = {{\,\textrm{poly}\,}}(n)\) and \(O(\log k) = O(\log n)\). Furthermore, Haeupler et al. [47] proved that there exists a sub-instance \({\mathcal {P}}' = \{ (s'_i, t'_i)_{i=1}^{k'} \} \subseteq {\mathcal {P}}\) such that \({{\,\textrm{SQ}\,}}(G)\) is (up to \(\widetilde{\Theta }(1)\) factors) equal to the completion time \(\tau \) of the any-to-any-cast problem with respect to \((\{ s'_i \}_{i=1}^{k'}, \{ t'_i \}_{i=1}^{k'})\). One side of the claim is clear: for any sub-instance \({\mathcal {P}}' \subseteq {\mathcal {P}}\) we have that \(\tau \le C\). The other direction is harder and we sketch its proof here using the terminology by Haeupler et al. [47]. By definition and strong duality, \(\textrm{Cut}_{{\mathcal {P}}}(2C) = \textrm{ConcurrentFlow}_{{\mathcal {P}}}(2C) \le 1\). Furthermore, \(\textrm{Cut}_{{\mathcal {P}}}(C/10) = \textrm{Cut}_{{\mathcal {P}}}(2C)/20 \le 1/10\). Hence, by [18, Lemma 2.6] there is a sub-instance \({\mathcal {P}}' \subseteq {\mathcal {P}}\) with a moving cut of distance \(\tau := \widetilde{\Omega }(C)\) and capacity less than \(|{\mathcal {P}}' |\). Therefore, this proves that the completion time of any-to-any-cast problem on \({\mathcal {P}}'\) is at least \(\tau \). With this in mind, we have that \(\widetilde{\Omega }({{\,\textrm{SQ}\,}}(G)) = \widetilde{\Omega }(C) = \tau \le C = \widetilde{\Theta }({{\,\textrm{SQ}\,}}(G))\).

Finally, since \({\mathcal {P}}= \{ (s_i, t_i) \}_{i=1}^k\) is pair node-disjointly connectable, it follows from the definition that the sub-instance \((\{ s'_i \}_{i=1}^{k'}, \{ t'_i \}_{i=1}^{k'})\) is any-to-any node-disjointly connectable. Therefore, \((\{ s'_i \}_{i=1}^{k'}, \{ t'_i \}_{i=1}^{k'})\) satisfies the constraints of this result and has completion-time \(\tau = \widetilde{\Theta }({{\,\textrm{SQ}\,}}(G))\), as required. It is also clear that, by shortcut quality, any any-to-any node-disjointly connectable instance has completion time at most \({{\,\textrm{SQ}\,}}(G)\) using the node-disjoint paths that witness the any-to-any node-disjointness as parts of the shortcut, making \((\{ s'_i \}_{i=1}^{k'}, \{ t'_i \}_{i=1}^{k'})\) the worst-case such instance (modulo polylogarithmic factors). \(\square \)

Finally, combining all of the previous ingredients, we are ready to show Theorem 15.

Proof of Theorem 15

Let \(S \subseteq V(\widehat{G}_{\rho })\) and \(T \subseteq V(\widehat{G}_{\rho })\) be any-to-any node-disjointly connectable sets such that the completion time of any-to-any-cast between S and T is \(\widetilde{\Theta }({{\,\textrm{SQ}\,}}(\widehat{G}_{\rho }))\) (Theorem 18). Let \(k:= |S |= |T |\), and suppose that \(S':= \biguplus _{s \in S} \{\pi (s) \} \subseteq V(\overline{G})\) and \(T':= \biguplus _{t \in T} \{\pi (t) \} \subseteq V(\overline{G})\) are the multisets induced by projecting S and T to \(\overline{G}\), respectively. By construction of \(\widehat{G}_{\rho }\), \(S'\) and \(T'\) have any-to-any node connectivity \(\rho \); to see this, consider the witness paths disjointly connecting them in \(\widehat{G}_{\rho }\) and project them to \(\overline{G}\). Therefore, we can partition \(S' = S'_1 \uplus \ldots \uplus S'_{O(\rho \log k)}\) and \(T' = T'_1 \uplus \ldots \uplus T'_{O(\log k)}\) such that \(|S'_i |= |T'_i |\) and \((S'_i, T'_i)\) are any-to-any node-disjointly connectable in \(\overline{G}\) (Lemma 17).

By definition of shortcut quality, for each \(i \in \{1, \ldots , O(\rho \log k)\}\) there exists a set of paths \((P^i_j)_{j=1}^{|S'_i |}\) in \(\overline{G}\) between \(S'_i\) and \(T'_i\) of quality (i.e., both congestion and dilation) at most \({{\,\textrm{SQ}\,}}(\overline{G})\). Then, we inject the first \(O(\log k)\) collections of paths \((P^1_j)_j, (P^2_j)_j, \ldots , (P_j^{O(\log k)})_j\) to the first layer \(\overline{G}_1\) of \(\widehat{G}_{\rho }\); the second \(O(\log k)\) collections to the second layer \(\overline{G}_2\), and so on, until we finally inject the last \(O(\log k)\) collections to the last layer \(\overline{G}_\rho \). Note that only the paths on the same layer interact, so both the congestion and dilation after injecting all paths into \(\widehat{G}_{\rho }\) is \(O({{\,\textrm{SQ}\,}}(\overline{G}) \log k)\). Hence, the same applies for the shortcut quality. Finally, to solve the any-to-any-cast problem on S and T one might need to add an between-layer edge at the beginning and at the end since each injected path is restricted to some adversarially chosen layer. However, this only increases the congestion and dilation by O(1). Hence, the completion time of any-to-any-cast between S and T is \(\widetilde{O}({{\,\textrm{SQ}\,}}(\overline{G}))\), implying that \({{\,\textrm{SQ}\,}}(\widehat{G}_{\rho }) =\widetilde{O}({{\,\textrm{SQ}\,}}(\overline{G}))\). \(\square \)

3.2 The \({{\,\mathrm{\textsc {NCC}}\,}}\) model

We next turn our attention to the \({{\,\mathrm{\textsc {NCC}}\,}}\) model. In particular, we observe that the \(\rho \)-congested part-wise aggregation problem admits a solution in \({{\,\textrm{poly}\,}}(\rho , \log \overline{n})\) rounds of \({{\,\mathrm{\textsc {NCC}}\,}}\):

Lemma 19

Let \(\overline{G}\) be an \(\overline{n}\)-node graph. Then, we can solve with high probability any \(\rho \)-congested part-wise aggregation problem on \(\overline{G}\) after \(O(\rho + \log \overline{n})\) rounds of \({{\,\mathrm{\textsc {NCC}}\,}}\).

This lemma is established after appropriately translating the communication primitives established for \({{\,\mathrm{\textsc {NCC}}\,}}\) by Augustine et al. [27]. In particular, let us first describe one of their key communication primitives.

The aggregation problem. In the aggregation problem, as defined by Augustine et al. [27], we are given a distributive function and a set of aggregation parts \(\{ P_1, \dots , P_k \}\), with \(P_i \subseteq V(\overline{G})\) for all i. Every aggregation part is associated with some target node \(t_i \in P_i\).Footnote 6 Assuming that every node holds exactly one input value for each aggregation part of which it is a member, the goal is to let all the target nodes learn the aggregate values with respect to the associated aggregation parts. This setting allows a node to be part of multiple groups, and in particular, we let \(\ell \) be the local load: the number of groups a given node may be included in—or an upper bound thereof. In addition, if \(L = \sum _{i=1}^k |P_i |\) represents the global load of the aggregation problem, [27, Theorem 2.3] established the following result.

Lemma 20

([27]) There exists an aggregation algorithm which solves with high probability the aggregation problem in \(O(L/\overline{n} + \ell /\log \overline{n} + \log \overline{n})\) rounds of \({{\,\mathrm{\textsc {NCC}}\,}}\).

In the context of the \(\rho \)-congested part-wise aggregation problem (Definition 6), it is clear that \(\ell \le \rho \) and \(L \le \rho \overline{n}\). Thus, we are now ready to establish Lemma 19.

Proof of Lemma 19

We first employ the communication protocol of Lemma 20 so that after \(O(\rho + \log \overline{n})\) rounds of \({{\,\mathrm{\textsc {NCC}}\,}}\) each target node learns with high probability the aggregate values with respect to the associated aggregation parts. Next, we can reverse in time the previous communication pattern, but instead using the aggregate values as determined by the target nodes. As a result, every node will know with high probability the aggregate value for each of its aggregation parts after \(O(\rho + \log \overline{n})\) rounds of \({{\,\mathrm{\textsc {NCC}}\,}}\). \(\square \)

4 Almost universally optimal Laplacians

In this section, we relate the congested part-wise aggregation problem we studied in the previous section with the Laplacian solver of Forster et al. [11]. To present a unifying analysis for both \({{\,\textrm{CONGEST}\,}}\) and \({{\,\textrm{HYBRID}\,}}\), as well as for future applications and extensions, we analyze the distributed Laplacian solver under the following hypothesis.

Assumption 1

Consider a model of computation which incorporates \({{\,\textrm{CONGEST}\,}}\). We assume that we can solve with high probability any \(\rho \)-congested part-wise aggregation problem in \(Q(\rho ) = O(\rho ^c {\mathcal {Q}})\) rounds, for some universal constant \(c \ge 1\).

The important connection between the congested part-wise aggregation problem (Definition 6) and the distributed Laplacian solver of Forster et al. [11] revolves around the concept of a low-congestion minor, a central component in the work of Forster et al. [11].

Definition 7

([11]) A graph G is a minor of \(\overline{G}\) if the following properties hold:

  1. 1.

    For every node \(u^G \in V(G)\) there exists:

    1. (i)

      A subset of nodes of \(\overline{G}\), which is termed as a super-node, \(S^{G \rightarrow \overline{G}}(u^G)\), with a leader node \(\ell (u^G) \in S^{G \rightarrow \overline{G}}(u^G)\);

    2. (ii)

      A connected subgraph of \(\overline{G}\) on \(S^{G \rightarrow \overline{G}}(u^G)\), for which we maintain a spanning tree \(T^{G \rightarrow \overline{G}}(u^G)\).

  2. 2.

    There exists a mapping of the edges of G onto edges of \(\overline{G}\), or self-loops, such that for any \(\{u^G, v^G\} \in E(G)\), the mapped edge \(\{u, v\}\) satisfies \(u \in S^{G \rightarrow \overline{G}}(u^G)\) and \(v \in S^{G \rightarrow \overline{G}}(v^G)\).

Moreover, we say that this minor G has congestion \(\rho \), or G is a \(\rho \)-minor, if:

  1. 1.

    Every node \(u \in \overline{G}\) is contained in at most \(\rho \) super-nodes \(S^{G \rightarrow \overline{G}}(u^G)\), for some \(u^G \in V(G)\);

  2. 2.

    Every edge of \(\overline{G}\) appears as the image of an edge of G or in one of the trees connecting super-nodes (i.e., \(T^{G \rightarrow \overline{G}}(u^G)\) for some \(u^G\)) at most \(\rho \) times.

Finally, we say that G is \(\rho \)-minor distributed over \(\overline{G}\) if every \(u \in V(\overline{G})\) stores:

  1. 1.

    All \(u^{G} \in V(G)\) for which \(u \in S^{G \rightarrow \overline{G}}(u^G)\);

  2. 2.

    For every edge e incident to u, (i) all the nodes \(u^G\) for which \(e \in T^{G \rightarrow \overline{G}}(u^G)\), and (ii) all edges \(e^{G}\) that map to it.

We remark that the basis of Definition 7 was the earlier concept of a distributed cluster graph [48]. Now the upshot is that the congested part-wise aggregation problem we introduced is the central ingredient that allows performing certain “local” operations on a graph \(\rho \)-minor distributed into the underlying communication network. Indeed, the following lemma is a direct consequence of Definition 7.

Lemma 21

Let \(G = (V, E)\) be an n-node graph \(\rho \)-minor distributed into an \(\overline{n}\)-node communication network \(\overline{G} = (\overline{V}, \overline{E})\) for which Assumption 1 holds for some \(Q = Q(\rho )\). Then, we can perform with high probability the following operations simultaneously for all \(u^G \in V(G)\), within \(O(Q(\rho ))\) rounds:

  1. 1.

    Every leader \(\ell (u^G)\) sends an \(O(\log \overline{n})\)-bit message to all the nodes in \(S^{G \rightarrow \overline{G}}(u^G)\);

  2. 2.

    All the nodes in \(S^{G \rightarrow \overline{G}}(u^G)\) compute an aggregation function on \(O(\log \overline{n})\)-bit inputs.

Armed with this connection, our next crucial observation is that the performance of the Laplacian solver of Forster et al. [11] can be parameterized in terms of the complexity of the congested part-wise aggregation problem. Indeed, we revisit and refine the main building blocks of their solver in Section A, leading to our main result below.

Theorem 22

Consider a weighted \(\overline{n}\)-node graph \(\overline{G}\) for which Assumption 1 holds for some \(Q(\rho ) = O(\rho ^c {\mathcal {Q}})\), where c is a universal constant and \({\mathcal {Q}} = {\mathcal {Q}}(\overline{G})\) is some parameter. Then, we can solve any Laplacian system after \(\overline{n}^{o(1)} {\mathcal {Q}} \log (1/\varepsilon )\) rounds.

Combining this theorem with Corollary 16 and Lemma 19 yields the following immediate consequences.

Lower bound in Supported-\({{\,\textrm{CONGEST}\,}}\). Finally, we complement our positive results with an almost-matching lower bound on any graph \(\overline{G}\), applicable even under the Supported-\({{\,\textrm{CONGEST}\,}}\) model, thereby establishing universal optimality up to an \(\overline{n}^{o(1)}\) factor. Our reduction leverages the refined hardness result established by Haeupler et al. [18] for the spanning connected subgraph problem [12]. In this problem a subgraph \(\overline{H}\) of \(\overline{G}\) is specified with nodes knowing all of the incident edges belonging to \(\overline{H}\). The goal is to let every node learn whether \(\overline{H}\) is connected and spans the entire network.

Theorem 23

([18]) Let \({\mathcal {A}}\) be any algorithm which is always correct with probabilityFootnote 7 at least \(\frac{2}{3}\) for the spanning connected subgraph problem, and \(T(\overline{G}) =\max _{{\mathcal {I}}} T_{{\mathcal {A}}}({\mathcal {I}}; \overline{G})\) be the worst-case round-complexity of \({\mathcal {A}}\) under \(\overline{G}\). Then, \(T(\overline{G}) = \widetilde{\Omega }({{\,\textrm{SQ}\,}}(\overline{G}))\).

In this context, we show that a Laplacian solver can be leveraged to solve the spanning connected subgraph problem, leading to the following lower bound.

Proof

First of all, as pointed out in [11, Theorem 2], it suffices to establish the lower bound for a high-precision solver, i.e. for a sufficiently small \(\varepsilon = 1/{{\,\textrm{poly}\,}}(\overline{n})\). Indeed, a low-accuracy solver (\(\varepsilon \le \frac{1}{2}\)) can always be “boosted” with only an \(O(\log \overline{n})\) overhead in the overall complexity.

In this context, let \(\overline{H}\) be the input to the spanning connected subgraph problem. We construct a resistor network \(H'\) so that \(\varvec{r}(e) = 1\) if \(e \in E(\overline{H})\), and \(\varvec{r}(e) = \overline{n}^4 \) for every edge \(e \notin E(\overline{H})\). Moreover, let us select arbitrarily a node \(v \in V(\overline{G})\). The key idea of the proof is to consider as input to the Laplacian solver a vector \(\varvec{b} \in {\mathbb {R}}^{\overline{n}}\) such that \(\varvec{b}(u) = -1\) for all \(u \in V(\overline{G}) {\setminus } \{ v\}\), while \(\varvec{b}(v) = \overline{n}-1\).

To analyze the output of that Laplacian system, we first analyze the simpler Laplacian system with input a vector \(\varvec{\chi }_{v, u} \in {\mathbb {R}}^{\overline{n}}\) for which the coordinate corresponding to node v is 1; the coordinate corresponding to node u is \(-1\); and any other coordinate is set to 0. We recall the following well-known facts.

Fact 24

Let \(\varvec{\phi } = {\mathcal {L}}(H')^\dagger \varvec{\chi }_{v, u}\). Then, for any node \(w \in V(\overline{G})\) it holds that \(\varvec{\phi }(v) \ge \varvec{\phi }(w) \ge \varvec{\phi }(u)\).

In the statement above, we use \({\mathcal {L}}(H')^\dagger \) to denote the Moore-Penrose pseudo-inverse of matrix \({\mathcal {L}}(H')\).

Fact 25

Let \(\varvec{\phi } = {\mathcal {L}}(H')^\dagger \varvec{\chi }_{v, u}\). Then, the \(v-u\) effective resistance is such that \({{\,\textrm{res}\,}}_{H'}(v, u) = \varvec{\phi }(v) - \varvec{\phi }(u)\).

As argued by Forster et al. [11], the output of the Laplacian with input \(\varvec{\chi }_{v, u}\) and a sufficiently small error \(\varepsilon = 1/{{\,\textrm{poly}\,}}(\overline{n})\) can be used to determine whether v and u are connected. Indeed, the following arguments have been extracted from their lower bound.

Claim 26

If u and v are connected in \(\overline{H}\) it follows that \({{\,\textrm{res}\,}}_{H'}(v, u) \le \overline{n}-1\).

Proof

It is well-known that the effective resistances satisfy the triangle inequality. Moreover, given that v and u are connected in \(\overline{H}\), it follows that there exists a path of length at most \(\overline{n} - 1\) in \(H'\) so that every edge has resistance 1 (by construction of the resistor network \(H'\)). As a result, the triangle inequality implies that \({{\,\textrm{res}\,}}_{H'}(v, u) \le \overline{n} - 1\). \(\square \)

Claim 27

If v and u are not connected in \(\overline{H}\) it follows that \({{\,\textrm{res}\,}}_{H'}(v, u) \ge \overline{n}^2\).

Proof

Suppose that \(e_1, \dots , e_k\) are the edges leaving the connected component of v in \(\overline{H}\), for some \(k \le \overline{n}^2\). Then, the Nash-Williams inequality implies that

$$\begin{aligned} {{\,\textrm{res}\,}}_{H'}(v, u) \ge \frac{1}{\sum _{i=1}^k \frac{1}{\varvec{r}(e_i)}} \ge \overline{n}^2, \end{aligned}$$

by construction of the resistor network. \(\square \)

The next step of the proof is to incorporate in the analysis the error of the solver. To this end, let \(\varvec{\phi }'\) be an \(\varepsilon \)-approximate solution to the linear system \({\mathcal {L}}(H') \varvec{\phi } = \varvec{\chi }_{v, u}\) in the sense that

$$\begin{aligned} \Vert \varvec{\phi }' - {\mathcal {L}}(H')^\dagger \varvec{\chi }_{v, u} \Vert _{{\mathcal {L}}(H')} \le \varepsilon \Vert \varvec{\chi }_{v, u}\Vert _{{\mathcal {L}}(H')^\dagger } = \varepsilon \sqrt{{{\,\textrm{res}\,}}_{H'}(v, u)}. \end{aligned}$$

Moreover, since the Laplacian matrix has integer resistances up to range \({{\,\textrm{poly}\,}}(\overline{n})\), it follows that for any \(\varvec{x}\), \(\Vert \varvec{x} \Vert _\infty \le {{\,\textrm{poly}\,}}(\overline{n}) \Vert \varvec{x}\Vert _{{\mathcal {L}}}\). Thus, by setting \(\varepsilon = 1/{{\,\textrm{poly}\,}}(\overline{n})\) to be sufficiently small, we have that

$$\begin{aligned} {{\,\textrm{res}\,}}_{H'}(v, u) - \frac{1}{\overline{n}} \le \varvec{\phi }'(v) - \varvec{\phi }'(u) \le {{\,\textrm{res}\,}}_{H'}(v, u) + \frac{1}{\overline{n}}. \end{aligned}$$

Now we will use these bounds to argue about the initial Laplacian system with input vector \(\varvec{b}\). By linearity, a solution of the Laplacian system with input \(\varvec{b}\) can be expressed as the sum of solutions of Laplacians with input \(\varvec{\chi }_{v, u}\) over all \(u \in V(\overline{G}) {\setminus } \{ v \}\). Next, we let \(\varvec{\phi } ={\mathcal {L}}(H')^\dagger \varvec{b}\), and \(\varvec{\phi }'\) be the output of the Laplacian solver for a sufficiently small \(\varepsilon =1/{{\,\textrm{poly}\,}}(\overline{n})\). Our analysis distinguishes between the following cases.

Case I. Suppose that \(\overline{H}\) is connected. In turn, this implies that v is connected with any node \(u \in V(\overline{G})\). As a result, it follows from Fact 24, Fact 25 and Claim 26 that for any node u,

$$\begin{aligned} \varvec{\phi }'(v) - \varvec{\phi }'(u) \le (\overline{n} - 1)^2 + 1. \end{aligned}$$
(1)

Case II. In the contrary case, there must be node u such that v and u are disconnected on \(\overline{H}\). By Claim 27 and Fact 24 this yields that

$$\begin{aligned} \varvec{\phi }'(v) - \varvec{\phi }'(u) \ge \overline{n}^2 - 1. \end{aligned}$$
(2)

Thus, (1) and (2) imply that the output \(\varvec{\phi }'\) of the Laplacian solver contains enough information to determine whether \(\overline{H}\) is connected or not since \(\overline{n}^2-1>(\overline{n}-1)^2 + 1\) for any \(\overline{n}\ge 2\).

To leverage this in the \({{\,\textrm{CONGEST}\,}}\) model we proceed as follows. First, node v sends to every other node in the graph its own part of the output from the Laplacian solver. This step can be clearly completed after \(D(\overline{G})\) rounds. Then, each node u inspects whether the value \(\varvec{\phi }'(v) - \varvec{\phi }'(u)\) is larger than \(\overline{n}^2 - 1\). In that case, node u can transmit this information to the entire network; this step is easily seen to be implementable in \(D(\overline{G})\) rounds. As a result, assuming that \({{\,\textrm{SQ}\,}}(\overline{G}) \ge 3 D(\overline{G})\), the proof follows immediately from Theorem 23. But the contrary case is also immediate since on any topology solving a Laplacian system trivially requires \(\Omega (D(\overline{G}))\) rounds. This completes the proof. \(\square \)

5 Conclusions

In this paper, we have established almost universally optimal Laplacian solvers for both the (Supported-)\({{\,\textrm{CONGEST}\,}}\) and the \({{\,\textrm{HYBRID}\,}}\) model. One of our main technical contributions was to introduce and study a congested generalization of the standard part-wise aggregation problem, which we believe may find further applications beyond the Laplacian paradigm in the future. For example, one candidate problem would be to refine the distributed algorithm for max-flow due to Ghaffari et al. [48]. We also hope that our accelerated Laplacian solvers will be used as a basic primitive for obtaining improved distributed algorithms for other fundamental optimization problems as well.