1 Introduction

This paper revisits the fundamental problem of computing the QR decomposition: Given a matrix \(\textbf{A}\in \mathbb {R}^{m\times n}\), its (thin) QR decomposition expresses it as the product of an orthogonal matrix \(\textbf{Q}\in \mathbb {R}^{m\times n}\) and an upper-triangular matrix \(\textbf{R}\in \mathbb {R}^{n\times n}\) [30]. This decomposition originated seven decades ago in works by Rutishauser [48] and Francis [21]. There are three classical algorithms for QR decomposition: Gram-Schmidt and its modified version [28, 52], Householder [31], and Givens rotations [24].

QR decomposition lies at the core of many linear algebra techniques and their machine learning applications [26, 53, 55] such as the matrix (pseudo) inverse and the least squares used in the closed-form solution for linear regression. In particular, the upper-triangular matrix \(\textbf{R}\) shares the same singular values with \(\textbf{A}\); its singular value decomposition can be used for the principal component analysis of \(\textbf{A}\); it constitutes a Cholesky decomposition of \(\textbf{A}^{\textbf{T}}\textbf{A}\), which is used for training (non)linear regression models using gradient descent [51]; the product of its diagonal entries equals (ignoring the sign) the determinant of \(\textbf{A}\), the product of the eigenvalues of \(\textbf{A}\), and the product of the singular values of \(\textbf{A}\).

In the classical linear algebra setting, the input matrix \(\textbf{A}\) is fully materialized and the process that constructs \(\textbf{A}\) is irrelevant. Our database setting is different: \(\textbf{A}\) is not materialized and instead defined symbolically by the join of the relations in the input database. Our goal is to compute the QR decomposition of \(\textbf{A}\) without explicitly constructing \(\textbf{A}\). This is desirable in case \(\textbf{A}\) is larger than the input database. By pushing the decomposition past the join down to the input relations, the runtime improvement is proportional to the size gap between the materialized join output and the input database. This database setting has been used before for learning over relational data [40]. Joins are used to construct the training dataset from all available data sources and are unselective by design (the more labelled samples available for training, the better). This is not only the case for many-to-many joins; even key-foreign key joins may lead to an output with a large number of values, where data functionally determined by a key in a relation is duplicated as many times as this key appears in the join output. By avoiding the materialization of this large data matrix and by pushing the learning task past the joins, learning can be much faster than over the materialized matrix [40]. Prior instances of this setting include pushing sum aggregates past joins [34, 59] and the computation of query probability in probabilistic databases [41].

This article introduces FiGaRo, an algorithm for computing the upper-triangular matrix \(\textbf{R}\) in the QR decomposition of the matrix \(\textbf{A}\) defined by the natural join of the relations in the input database. The acronym FiGaRo stands for Factorized Givens Rotations with the letters a and i swapped.

FiGaRo enjoys several desirable properties.

  1. FiGaRo's execution is equivalent to a sequence of Givens rotations over the join output. Yet instead of effecting the rotations individually as in the classical Givens QR decomposition, it performs fast block transformations whose effects are the same as for long sequences of rotations.

  2. FiGaRo takes time linear in the input data size and independent of the size of the (potentially much larger) join output for acyclic joins. It achieves this by pushing the computation past the joins.

  3. Its transformations can be applied independently and in parallel to disjoint blocks of rows and also to different columns. This sets it apart from inherently sequential mainstream methods for QR decomposition of materialized matrices such as Gram-Schmidt.

  4. FiGaRo can outperform the LAPACK library Intel MKL in both runtime performance and accuracy by a factor proportional to the gap between the join output and input sizes, which is up to three orders of magnitude in our experiments (Sect. 9). We show this to be the case both for key-foreign key joins over two real-world databases and for many-to-many joins of one real-world and one synthetic database. We considered input matrices with 2M-125M rows (84M-150M rows in the join matrix) that are both narrow (up to 50 data columns) and wide (thousands of data columns). The choice of the join tree can significantly influence the performance (up to 47x). FiGaRo is more accurate than MKL as it introduces far fewer rounding errors in case the join output is larger than the input database.

In Sect. 8, we further extend FiGaRo to compute: the orthogonal matrix \(\textbf{Q}\) in the QR decomposition of \(\textbf{A}\), the singular value decomposition (SVD) of \(\textbf{A}\), and the principal component analysis (PCA) of \(\textbf{A}\). These computations rely essentially on the fast and accurate computation of the upper-triangular matrix \(\textbf{R}\), are done without materializing \(\textbf{A}\), can benefit from pushing dot products past joins so that they are computed once and reused several times, and are amenable to partial computation in case only some output vectors are needed, such as the top-k principal components. We show experimentally that these optimizations yield runtime and accuracy advantages of FiGaRo over Intel MKL.

For QR decomposition, we designed an accuracy experiment of independent interest (Appendix A). The accuracy of an algorithm for QR decomposition is commonly assessed by how close the computed matrix \(\textbf{Q}\) is to an orthogonal matrix. We introduce an alternative approach that allows for a fragment \(\textbf{R}_\text {fixed}\) of the upper-triangular matrix \(\textbf{R}\) to be chosen arbitrarily and to serve as ground truth. Two relations are defined based on \(\textbf{R}_\text {fixed}\). We then compute the QR decomposition of the Cartesian product of the two relations and check how close \(\textbf{R}_\text {fixed}\) is to the corresponding fragment of the computed upper-triangular matrix.

Our work complements prior work on linear algebra computation powered by database engines [7, 18, 37, 49, 61] and on languages that unify linear algebra and relational algebra [16, 22, 32]. No prior work considered the interaction of QR decomposition with database joins. This requires a redesign of the decomposition algorithm from first principles. FiGaRo is the first approach to take advantage of the structure and sparsity of relational data to improve the performance and accuracy of QR decomposition. The sparsity typically accounts for blocks of zeroes in the data. In our work, sparsity is more general as it accounts for blocks of arbitrary values that are repeated many times in the join output. FiGaRo avoids such repetitions: It computes once over a block of values and reuses the computed result instead of recomputing it at every repetition.

We introduce FiGaRo in several steps. We first explain how it works on the materialized join (Sect. 4) and then on the unmaterialized join modelled on any of its join trees (Sect. 6). FiGaRo requires the computation of a batch of group-by counts over the join. This can be done in only two passes over the input data (Sect. 5). To obtain the desired upper-triangular matrix, FiGaRo's output is further post-processed (Sect. 7).

This article extends a conference paper [43] with three new contributions: the extension of FiGaRo to also compute the orthogonal matrix \(\textbf{Q}\) in the QR decomposition of \(\textbf{A}\), an SVD of \(\textbf{A}\), and the PCA of \(\textbf{A}\) (Sect. 8); the implementation of this extension and its benchmarking against mainstream approaches based on Intel MKL for both runtime performance and accuracy (Sect. 9); and extended versions of the introductory example (Sect. 1.1) and of the related work (Sect. 10).

1.1 Givens rotations on the Cartesian product

We next showcase the main ideas behind FiGaRo and start with introducing a (special case of) Givens rotation. A \(2 \times 2\) Givens rotation matrix is a matrix \(\textbf{G} = \begin{bmatrix} \cos \theta &{} -\sin \theta \\ \sin \theta &{} \cos \theta \end{bmatrix}\) for some angle \(\theta \) (see Def. 1 for the definition of the general \(d \times d\) case). If \(\theta \) is selected appropriately, applying a Givens rotation introduces zeros in matrices. Consider a matrix \(\textbf{B} = \begin{bmatrix} a \\ b \end{bmatrix}\), where a and b are real numbers with \(b \ne 0\). We can visualize \(\textbf{B}\) as a vector in the two-dimensional Cartesian coordinate system. The matrix multiplication \(\textbf{G} \textbf{B}\) represents the application of the rotation \(\textbf{G}\) to \(\textbf{B}\): Its effect is to rotate the vector \(\textbf{B}\) counter-clockwise through the angle \(\theta \) about the origin. We can choose \(\theta \) such that the rotated vector lies on the x-axis, i.e., has 0 as its second component: If we choose \(\theta \) such that \(\sin \theta = - \frac{{{\,\textrm{sign}\,}}(a)b}{\sqrt{a^2 + b^2}}\) and \(\cos \theta = \frac{|a|}{\sqrt{a^2 + b^2}}\) [4], then \(\textbf{G} \textbf{B} = \begin{bmatrix} r \\ 0 \end{bmatrix}\), where \(r = {{\,\textrm{sign}\,}}(a)\sqrt{a^2 + b^2}\).
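To make the choice of \(\theta \) concrete, the following small Python/NumPy sketch (ours, not code from the paper) computes \(\cos \theta \), \(\sin \theta \), and r for a given pair (a, b) and checks that the rotation indeed zeroes the second component.

```python
import numpy as np

def givens_2x2(a, b):
    """Return (cos, sin, r) such that [[cos, -sin], [sin, cos]] @ [a, b] = [r, 0],
    using the choice of angle from Sect. 1.1 (assumes a != 0 and b != 0)."""
    h = np.hypot(a, b)                       # sqrt(a^2 + b^2), computed stably
    return abs(a) / h, -np.sign(a) * b / h, np.sign(a) * h

a, b = 3.0, 4.0
cos, sin, r = givens_2x2(a, b)
G = np.array([[cos, -sin], [sin, cos]])
print(G @ np.array([a, b]))                  # ~[5., 0.]
print(r)                                     # 5.0
```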

When applied to matrices that represent the output of joins, Givens rotations can compute the QR decomposition more efficiently than for arbitrary matrices. We sketch this next for the Cartesian product of two unary relations. Consider the matrices \(\textbf{S} = \begin{bmatrix} s_1 \\ \cdots \\ s_p \end{bmatrix}\), \(\textbf{T} = \begin{bmatrix} t_1 \\ \cdots \\ t_q \end{bmatrix}\), representing unary relations \(S(Y_1)\) and \(T(Y_2)\). The matrix \(\textbf{A} = \begin{bmatrix} \textbf{A}_1 \\ \cdots \\ \textbf{A}_p \end{bmatrix}\), where for \(i\in [p]\) the \(q\times 2\) block \(\textbf{A}_i\) consists of the rows \(\begin{bmatrix} s_i&t_j \end{bmatrix}\) for \(j \in [q]\), represents the Cartesian product of these two unary relations, i.e., their natural join. We would like to compute the upper-triangular matrix \(\textbf{R}\) in the QR decomposition of \(\textbf{A}\): \(\textbf{A} = \textbf{Q} \textbf{R}\) (Footnote 1).

The classical Givens rotations algorithm constructs the upper-triangular matrix \(\textbf{R}\) from \(\textbf{A}\) by using Givens rotations to zero each cell below the diagonal in \(\textbf{A}\). A sequence of Givens rotations can be used to set all entries below the diagonal of any matrix \(\textbf{A}\) to 0, thus obtaining an upper triangular matrix. The left multiplication of these rotation matrices yields the orthogonal matrix \(\textbf{Q}\) in the QR decomposition of \(\textbf{A}\). This approach needs time quadratic in the input \(\textbf{S}\) and \(\textbf{T}\): It involves applying \(2pq-3\) rotations, one rotation for zeroing each cell below the diagonal in \(\textbf{A}\). This takes \(13(2pq-3)\) multiplication, division, square, and square root operations. In contrast, FiGaRo constructs \(\textbf{R}\) from \(\textbf{S}\) and \(\textbf{T}\) in linear time using \(30(p+q)\) such operations.

Suppose we use Givens rotations to introduce zeros in the first column of the block \(\textbf{A}_1\). So, we want to apply Givens rotations to a matrix column that consists of a single value, and our goal is to set every occurrence of this value to 0, apart from the first one.

Fig. 1: Applying Givens rotations to a \(3 \times 1\) matrix with all entries having the value s

This setting is sketched in Fig. 1: we have a matrix that consists of a single column and all entries are equal to the same value s. Using Givens rotations, all but the first entry need to be set to 0. The first rotation is applied to the first and the second occurrence of s, so to a vector that has the value s in both components. This vector therefore has an angle of 45 degrees to the x-axis, independent of the value of s. So, the rotation angle \(\theta \) is independent of s. We see this also when we calculate \(\sin \theta = - \frac{{{\,\textrm{sign}\,}}(s)s}{\sqrt{s^2 + s^2}} = - \sqrt{\frac{1}{2}}\) and \(\cos \theta = \frac{|s|}{\sqrt{s^2 + s^2}} = \sqrt{\frac{1}{2}}\) as explained above. The result of the rotation is that the second occurrence of s is set to 0, while the first occurrence becomes \({{\,\textrm{sign}\,}}(s) \sqrt{s^2 + s^2} = \sqrt{2}s\).

For the second rotation, we choose \(\theta \) based on the value \(\sqrt{2} s\) of the first entry and the value s of the third entry of the first column, with the aim of setting the latter to 0. The corresponding rotation with \(\sin \theta = - \sqrt{\frac{1}{3}}\) and \(\cos \theta = \sqrt{\frac{2}{3}}\) sets the first entry to \(\sqrt{3} s\). This argument can be continued for larger columns: The rotation used to set the k-th occurrence of s to 0 has \(\sin \theta = - \sqrt{\frac{1}{k}}\) and \(\cos \theta = \sqrt{\frac{k-1}{k}}\). We emphasize again that these rotations do not depend on the value s but only on the number of such values.

So, to set all occurrences of the value \(s_1\) in the block \(\textbf{A}_1\), apart from the first, to 0, we can use a series of Givens rotations with \(\sin \theta = - \sqrt{\frac{1}{2}}, \ldots , - \sqrt{\frac{1}{q}}\) and \(\cos \theta = \sqrt{\frac{1}{2}}, \ldots , \sqrt{\frac{q-1}{q}}\), respectively. The same applies to all other blocks \(\textbf{A}_i\) as well, so the same sequence of rotation angles can also be used to introduce zeros in the first column of every matrix \(\textbf{A}_i\).
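As a quick numerical check, the following NumPy snippet (ours, for illustration only) applies exactly these angles to a column of q equal values and confirms that the first entry becomes \(s\sqrt{q}\) while all others become 0.

```python
import numpy as np

q, s = 5, 2.7
col = np.full(q, s)                        # a column with q equal entries s
for k in range(2, q + 1):                  # zero the k-th occurrence of s
    sin, cos = -np.sqrt(1.0 / k), np.sqrt((k - 1.0) / k)
    top, cur = col[0], col[k - 1]
    col[0]     = cos * top - sin * cur     # first entry grows to s * sqrt(k)
    col[k - 1] = sin * top + cos * cur     # k-th entry becomes (numerically) 0
print(col)                                 # ~[s*sqrt(q), 0, 0, 0, 0]
print(s * np.sqrt(q))
```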

Insight 1

For the matrix \(\textbf{A}_i\), there is a sequence \(\textbf{G}_s\) of Givens rotations that sets the first occurrence of \(s_i\) to \(s_i\sqrt{q}\) and all other occurrences to 0. \(\textbf{G}_s\) is independent of the values in \(\textbf{A}_i\).

The blocks \(\textbf{A}_i\) have a second column, each consisting of the same values \(t_1, \ldots , t_q\). As the same rotations are applied to each \(\textbf{A}_i\), the corresponding results have the same values \(t'_j\) throughout all blocks. They do not depend on the values \(s_i\), so they can be computed from \(\textbf{T}\) alone. In Sect. 3, we show that this is possible in time \(\mathcal {O}(q)\).

Insight 2

\(\textbf{G}_s\) yields the same values \(t'_1,\ldots ,t'_q\) in each matrix \(\textbf{A}'_i\). These values can be computed once.

Among the blocks \(\textbf{A}'_i\), there are p occurrences of the row \(\begin{bmatrix} 0&t'_j \end{bmatrix}\), for each \(2 \le j \le q\), and p rows of the form \(\begin{bmatrix} s_i\sqrt{q}\hspace{.4em}&t'_1 \end{bmatrix}\). These rows can be grouped to have the same structure as the blocks \(\textbf{A}_i\) above: the second column consists of the same value \(t'_j\). So, rotations analogous to \(\textbf{G}_s\) can be applied to these blocks to yield one row \(\begin{bmatrix} 0&t'_j \sqrt{p} \end{bmatrix}\) and \(p-1\) rows \(\begin{bmatrix} 0&0 \end{bmatrix}\) for each \(j\ge 2\), as well as one row \(\begin{bmatrix} s'_1\sqrt{q}\hspace{.4em}&t'_1 \sqrt{p} \end{bmatrix}\) and \(p-1\) rows \(\begin{bmatrix} s'_i\sqrt{q}\hspace{.4em}&0 \end{bmatrix}\). The values \(s'_i\) can be computed from \(\textbf{S}\) in time \(\mathcal {O}(p)\) the same way as mentioned above and detailed in Sect. 3.

When all mentioned rotations are applied, we obtain the matrix \(\textbf{A}''\) that consists of the block with the rows \(\begin{bmatrix} s'_1\sqrt{q}\hspace{.4em}&t'_1 \sqrt{p} \end{bmatrix}\) and \(\begin{bmatrix} 0&t'_j \sqrt{p} \end{bmatrix}\) for \(2\le j \le q\), and of the rows \(\begin{bmatrix} s'_i\sqrt{q}\hspace{.4em}&0 \end{bmatrix}\) for \(2\le i \le p\), besides all-zero rows.

We observe the following four key points:

  (1) The matrix \(\textbf{A}''\) has only \(p+q\) nonzero values. We do not need to represent the all-zero rows as they are not part of the result \(\textbf{R}\).

  (2) The values in \(\textbf{A}''\) can be computed in one pass over \(\textbf{S}\) and \(\textbf{T}\). We need to compute: \(\sqrt{p}\), \(\sqrt{q}\), and the values \(s'_1, \ldots , s'_p,t'_1,\ldots ,t'_q\).

  (3) Our linear-time computation of the nonzero entries in \(\textbf{A}''\) yields the same result as quadratically many Givens rotations.

  (4) We see in Sect. 3 that the computation of the values in \(\textbf{A}''\) does not require taking the squares of the input values, as done by the Givens QR decomposition. This means fewer rounding errors than would otherwise appear when squaring very large or very small input values.

So far, \(\textbf{A}''\) is not upper-triangular. We obtain \(\textbf{R}\) from \(\textbf{A}''\) using a post-processing step that applies a sequence of \(p+q-3\) rotations to zero all remaining nonzero values apart from the three entries that form the upper-triangular result.

In summary, FiGaRo needs \(\mathcal {O}(p+q)\) time to compute \(\textbf{R}\) in our example. Including post-processing, the number of needed squares, square roots, multiplications and divisions is at most \(4(p+q)\), \(3(p+q)\), \(17(p+q)\) and \(6(p+q)\), respectively. In contrast, the Givens QR decomposition works on the materialized Cartesian product and needs \(\mathcal {O}(pq)\) time and computes 4pq squares, 2pq square roots, 16pq multiplications and 4pq divisions.
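The following NumPy sketch (ours; a minimal illustration, not the paper's implementation) assembles the \(p+q\) nonzero values of \(\textbf{A}''\) directly from \(\textbf{S}\) and \(\textbf{T}\) and checks that its upper-triangular factor agrees (up to signs) with the one computed from the materialized Cartesian product.

```python
import numpy as np

def rotate_column(v):
    """Apply the rotation sequence of Sect. 1.1 (sin = -1/sqrt(k), cos = sqrt((k-1)/k))
    to a column v; runs in O(len(v)) time."""
    v = np.array(v, dtype=float)
    for k in range(2, len(v) + 1):
        sin, cos = -1.0 / np.sqrt(k), np.sqrt((k - 1.0) / k)
        a, b = v[0], v[k - 1]
        v[0], v[k - 1] = cos * a - sin * b, sin * a + cos * b
    return v

rng = np.random.default_rng(0)
p, q = 4, 6
S, T = rng.normal(size=p), rng.normal(size=q)               # two unary relations

# FiGaRo-style: only p + q nonzero entries, computed from S and T alone.
sp, tp = rotate_column(S), rotate_column(T)                 # s'_1..s'_p and t'_1..t'_q
A2 = np.zeros((p + q - 1, 2))
A2[0] = [np.sqrt(q) * sp[0], np.sqrt(p) * tp[0]]            # row [s'_1 sqrt(q), t'_1 sqrt(p)]
A2[1:p, 0] = np.sqrt(q) * sp[1:]                            # rows [s'_i sqrt(q), 0]
A2[p:, 1] = np.sqrt(p) * tp[1:]                             # rows [0, t'_j sqrt(p)]

# Reference: QR of the materialized Cartesian product with p*q rows.
A = np.column_stack([np.repeat(S, q), np.tile(T, p)])
R_full = np.linalg.qr(A,  mode='r')
R_fast = np.linalg.qr(A2, mode='r')
print(np.allclose(np.abs(R_full), np.abs(R_fast)))          # True: same R up to signs
```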

The linear-time behaviour of FiGaRo holds not only for the example of a Cartesian product, but for matrices defined by any acyclic join over arbitrary relations. In contrast, the runtime of any algorithm that works on the materialized join output, as well as its number of square, square root, multiplication, and division operations, is proportional to the size of the join output.

2 The FiGaRo algorithm: setup

Relations and Joins. A database \({\mathcal D} \) consists of a set \(S_1, \ldots , S_r\) of relations. Each relation has a schema \((Z_{1}, \ldots , Z_{k})\), which is a tuple of attribute names, and contains a set of tuples over this schema. We denote tuples \((Z_1, \ldots , Z_\ell )\) of attribute names by \(\bar{Z}\) and tuples \((z_1, \ldots , z_\ell )\) of values by \(\bar{z}\). For every relation \(S_i\), we denote by \(\bar{X}_i\) the tuple of its key (join) attributes and by \(\bar{Y}_i\) the tuple of the data (non-join) attributes. We allow key values of any type, e.g., categorical, while the data values are real numbers. We denote by \(\bar{X}_{ij}\) the join attributes common to both relations \(S_i\) and \(S_j\). We consider the natural join of all relations in a database and write \(\bar{X}\) and \(\bar{Y}\) for the tuple of all key and data attributes in the join output. A database is fully reduced if there is no dangling input tuple, i.e., every input tuple contributes to the join output. A join is (\(\alpha \)-)acyclic if and only if it admits a join tree [1]: In a join tree, each relation is a node and, for any two relations \(S_i\) and \(S_j\), all nodes on the path between them correspond to relations whose key attributes include \(\bar{X}_{ij}\).

From Relations to Matrices. For a natural number n, we write [n] for the set \(\{1, \ldots , n\}\). Matrices are denoted by bold upper-case letters, (column) vectors by bold lower-case letters. Consider a matrix \(\textbf{A}\) with m rows and n columns, a row index \(i \in [m]\), and a column index \(j \in [n]\). Then \(\textbf{A}[i: j]\) is the entry in row i and column j, \(\textbf{A}[i:]\) is the i-th row, \(\textbf{A}[: j]\) is the j-th column of \(\textbf{A}\), and \(|\textbf{A}|\) is the number of rows in \(\textbf{A}\). For sets \(I \subseteq [m]\) and \(J \subseteq [n]\), \(\textbf{A}[I: J]\) is the matrix consisting of the rows and columns of \(\textbf{A}\) indexed by I and J, respectively.

Each relation \(S_i\) is encoded by a matrix \(\textbf{S}_i\) that consists of all rows \((\bar{x}_i, \bar{y}_i)\) \(\in S_i\). The relation representing the output of the natural join of the input relations is encoded by the matrix \(\textbf{A}\) whose rows \((\bar{x}, \bar{y})\) are over the key attributes \(\bar{X}\) and data attributes \(\bar{Y}\). We keep the keys and the column names as contextual information [16] to facilitate convenient reference to the data in the matrix. For ease of presentation, in this paper we use relations and matrices interchangeably following our above mapping. We use relational algebra over matrices to mean over the relations they encode.

We use an indexing scheme for matrices that exploits their relational nature. We refer to the matrix columns by their corresponding attribute names. We refer to the matrix rows and blocks by tuples of key and data values. Consider for example a ternary relation S with attribute names \(X, Y_1, Y_2\). We represent this relation as a matrix \(\textbf{S}\) such that for every tuple \((x,y_1,y_2) \in S\) there is a matrix row \(\textbf{S}[x,y_1,y_2:] = \begin{bmatrix} x&y_1&y_2 \end{bmatrix}\) with entries \(\textbf{S}[x,y_1,y_2: X] = x\), \(\textbf{S}[x,y_1,y_2: Y_1] = y_1\), and \(\textbf{S}[x,y_1,y_2: Y_2] = y_2\). We also use row indices that are different from the content of the row. We use \(*\) to denote sets of indices. For example, \(\textbf{S}[x, *: X, Y_1]\) denotes the block that contains all rows of \(\textbf{S}\) whose identifier starts with x, restricted to the columns X and \(Y_1\).
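As a small illustration of this indexing scheme, the following Python sketch (ours; the names and layout are illustrative, not the paper's code) keeps the key values next to a NumPy block of data values and selects a block by a key prefix and a list of column names.

```python
import numpy as np

S_keys = [("x1",), ("x1",), ("x2",)]                       # values of the join attribute X
S_data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # data columns Y1, Y2
cols = {"Y1": 0, "Y2": 1}

def block(key_prefix, columns):
    """S[x, * : columns]: all rows whose key starts with key_prefix, restricted to columns."""
    idx = [i for i, k in enumerate(S_keys) if k[:len(key_prefix)] == key_prefix]
    return S_data[np.ix_(idx, [cols[c] for c in columns])]

print(block(("x1",), ["Y1", "Y2"]))    # the rows with key x1, all data columns
print(block(("x2",), ["Y2"]))          # [[6.]]
```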

Input. The input to FiGaRo consists of (1) the set of matrices \(\textbf{S}_1,\ldots , \textbf{S}_r\), one per relation in the input fully-reduced database \({\mathcal D} \), and (2) a join tree \(\tau \) of the acyclic natural join of these matrices (or equivalently of the underlying relations).Footnote 2

Output. FiGaRo computes the upper-triangular matrix \(\textbf{R}\) in the QR decomposition of the matrix block \(\textbf{A}[: \bar{Y}]\), which consists of the data columns \(\bar{Y}\) in the join of the database relations. It first computes an almost upper-triangular matrix \(\textbf{R}_0\) such that \(\textbf{A}[: \bar{Y}] = \textbf{Q}_0 \textbf{R}_0\) for an orthogonal matrix \(\textbf{Q}_0\). A post-processing step (Sect. 7) further decomposes \(\textbf{R}_0\): \(\textbf{R}_0 = \textbf{Q}_1 \textbf{R}\), where \(\textbf{Q}_1\) is orthogonal. Thus, \(\textbf{A}[: \bar{Y}] = (\textbf{Q}_0 \textbf{Q}_1) \textbf{R}\), where the final orthogonal matrix is \(\textbf{Q} = \textbf{Q}_0 \textbf{Q}_1\). FiGaRo does not compute \(\textbf{Q}\) explicitly, it only computes \(\textbf{R}\).

The QR decomposition always exists; whenever \(\textbf{A}\) has full rank and \(\textbf{R}\) is required to have positive diagonal values, the decomposition is unique [55, Chapter 4, Theorem 1.1].

3 Heads and tails

Given a matrix, the classical Givens QR decomposition algorithm repeatedly zeroes the values below its diagonal. Each zero is obtained by one Givens rotation. In case the matrix encodes a join output, a pattern emerges: The matrix consists of blocks representing the Cartesian products of arbitrary matrices and one-row matrices (see Fig. 2a). The effects of applying Givens rotations to such blocks are captured using so-called heads and tails. Before we introduce these notions, we recall the notion of Givens rotations.

Definition 1

(Givens rotation) A \(d \times d\) Givens rotation matrix is obtained from the d-dimensional identity matrix by changing four entries: \(\textbf{G}[i:i] = \textbf{G}[j:j] = \cos \theta \), \(\textbf{G}[i:j] = - \sin \theta \) and \(\textbf{G}[j:i] = \sin \theta \), for some pair ij of indices and an angle \(\theta \).

We denote such a d-dimensional rotation matrix for given i, j and \(\theta \) by \({{\,\mathrm{\textbf{Giv}}\,}}_d(i,j,\sin \theta , \cos \theta )\). \(\square \)

A rotation matrix is orthogonal. The result \(\textbf{B} = \textbf{G} \textbf{A}\) of the product of a rotation \(\textbf{G}\) and a matrix \(\textbf{A} \in \mathbb {R}^{d \times n}\) is \(\textbf{A}\) except for the i-th and the j-th rows, which are subject to the counterclockwise rotation through the angle \(\theta \) about the origin of the 2-dimensional Cartesian coordinate system: for each column c, \({\textbf{B}[i:c]} = {\textbf{A}[i:c]} \cos \theta - {\textbf{A}[j:c]} \sin \theta \) and \(\textbf{B}[j:c] = \textbf{A}[i:c] \sin \theta + {\textbf{A}[j:c]} \cos \theta \).
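Since only two rows change, the rotation never needs to be materialized as a \(d \times d\) matrix. A small sketch (ours, for illustration) that updates the two affected rows in place, following the formulas above:

```python
import numpy as np

def apply_givens(A, i, j, sin, cos):
    """Left-multiply A by Giv_d(i, j, sin, cos) in place, touching only rows i and j
    (0-based indices)."""
    ri, rj = A[i].copy(), A[j].copy()
    A[i] = cos * ri - sin * rj
    A[j] = sin * ri + cos * rj

# Example: zero the entry A[2, 0] against A[0, 0].
A = np.array([[3.0, 1.0], [7.0, 2.0], [4.0, 5.0]])
a, b = A[0, 0], A[2, 0]
h = np.hypot(a, b)
apply_givens(A, 0, 2, -np.sign(a) * b / h, abs(a) / h)
print(A)                                   # A[2, 0] is now ~0, A[0, 0] is now 5
```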

The following notions are used throughout the paper. Given matrices \(\textbf{S} \in \mathbb {R}^{m_1 \times n_1}\) and \(\textbf{T} \in \mathbb {R}^{m_2 \times n_2}\), their Cartesian product \(\textbf{S} \times \textbf{T}\) is the matrix \(\begin{bmatrix} \textbf{A}_1 \\ \vdots \\ \textbf{A}_{m_1} \end{bmatrix}\in \mathbb {R}^{m_1m_2\times (n_1+n_2)}\) with \(\textbf{A}_i = \begin{bmatrix} \textbf{S}[i:] &{} \textbf{T}[1:] \\ \vdots &{} \vdots \\ \textbf{S}[i:] &{} \textbf{T}[m_2:] \\ \end{bmatrix}\). For \(\textbf{S} \in \mathbb {R}^{1 \times n}\) and \(\textbf{v} \in \mathbb {R}^{m}\), the Kronecker product \(\textbf{S} \otimes \textbf{v}\) is \(\begin{bmatrix} v[1] \textbf{S} \\ \vdots \\ v[m] \textbf{S} \end{bmatrix}\). We denote by \(||\textbf{v}||_2\) the \(\ell _2\) norm \(\sqrt{\sum _{i=1}^m \textbf{v}[i]^2}\) of a vector \(\textbf{v} \in \mathbb {R}^m\).
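Both constructions are easy to spell out explicitly; the following NumPy sketch (ours) builds \(\textbf{S} \times \textbf{T}\) and \(\textbf{S} \otimes \textbf{v}\) exactly as defined above.

```python
import numpy as np

def cartesian_product(S, T):
    """S x T: every row of S paired with every row of T, blocks A_1, ..., A_{m1} stacked."""
    m1, m2 = S.shape[0], T.shape[0]
    return np.hstack([np.repeat(S, m2, axis=0), np.tile(T, (m1, 1))])

def kron_row(S, v):
    """S (x) v for a one-row matrix S and a vector v: the rows v[i] * S stacked vertically."""
    return np.outer(v, S.ravel())

S = np.array([[1.0, 2.0], [3.0, 4.0]])
T = np.array([[10.0], [20.0], [30.0]])
print(cartesian_product(S, T))             # 6 x 3 matrix
print(kron_row(np.array([[1.0, 2.0]]), np.array([2.0, 3.0])))   # [[2, 4], [3, 6]]
```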

We now define the head and tail of a matrix, which we later use to express the Givens rotations over Cartesian products of matrices.

Definition 2

(Head and Tail) Let \(\textbf{A}\) be a matrix from \(\mathbb {R}^{m \times n}\).

The head \({\mathcal H} (\textbf{A})\in \mathbb {R}^{1 \times n}\) of \(\textbf{A}\) is the one-row matrix

$$\begin{aligned}{\mathcal H} (\textbf{A}) = \frac{1}{\sqrt{m}} \sum _{i = 1}^m \textbf{A}[i:].\end{aligned}$$

The tail \({\mathcal T} (\textbf{A})\in \mathbb {R}^{(m-1) \times n}\) of \(\textbf{A}\) is the matrix (for \(j \in [m-1]\))

$$\begin{aligned}{\mathcal T} (\textbf{A})[j:] = \frac{1}{\sqrt{j+1}} \big (\sqrt{j} \textbf{A}[j+1:] - \frac{\sum _{i=1}^j \textbf{A}[i:]}{\sqrt{j}} \big ) \hspace{2em}\Box \end{aligned}$$

Consider a matrix \(\textbf{A}\) that represents the Cartesian product of a one-row matrix \(\textbf{S}\) and an arbitrary matrix \(\textbf{T}\). Then \(\textbf{A}\) is obtained by extending each row in \(\textbf{T}\) with the one-row \(\textbf{S}\), as in Fig. 2a. As exemplified in Sect. 1.1, if all but the first occurrence of \(\textbf{S}\) in \(\textbf{A}\) are to be replaced by zeros using Givens rotations, the specific sequence of rotations only depends on the number of rows of \(\textbf{T}\) and not on the entries of \(\textbf{S}\). The result of these rotations can be described by the head and tail of \(\textbf{T}\).

Lemma 1

For matrices \(\textbf{S} \in \mathbb {R}^{1 \times n_1}\) and \(\textbf{T} \in \mathbb {R}^{m \times n_2}\), let \(\textbf{A} = \textbf{S} \times \textbf{T} \in \mathbb {R}^{m \times (n_1+n_2)}\) be their Cartesian product. Let \(\textbf{R}_i = {{\,\mathrm{\textbf{Giv}}\,}}_m(1,i, -\frac{1}{\sqrt{i}}, \frac{\sqrt{i-1}}{\sqrt{i}})\), for all \(i \in \{2, \ldots , m\}\), be a Givens rotation matrix and let \(\textbf{G}= \textbf{R}_m \cdots \textbf{R}_2\) be the orthogonal matrix that is the product of the rotations \(\textbf{R}_m\) to \(\textbf{R}_2\).

The matrix \(\textbf{U} = \textbf{G} \textbf{A}\) obtained by applying the rotations \(\textbf{R}_i\) to \(\textbf{A}\) is:

$$\begin{aligned} \textbf{U} = \begin{bmatrix} {\mathcal H} (\textbf{A}) \\ {\mathcal T} (\textbf{A}) \end{bmatrix} = \begin{bmatrix} \sqrt{m}\textbf{S} &{} {\mathcal H} (\textbf{T}) \\ \textbf{0}^{{(m-1)} \times n_1} &{} {\mathcal T} (\textbf{T}) \end{bmatrix}.\end{aligned}$$

In other words, the application of the \(m-1\) rotations to \(\textbf{A}\) yields a matrix with four blocks: the head and tail of \(\textbf{T}\), \(\textbf{S}\) scaled by the square root of the number m of rows in \(\textbf{T}\), and zeroes over \(m-1\) rows and \(n_1\) columns. So instead of applying the rotations, we may compute these blocks directly from \(\textbf{S}\) and \(\textbf{T}\).
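The following NumPy sketch (ours, not the paper's implementation) implements head and tail as in Definition 2, builds the product of the rotations \(\textbf{R}_m \cdots \textbf{R}_2\) explicitly, and checks the identity stated in Lemma 1 on random data.

```python
import numpy as np

def head(M):
    """H(M) from Definition 2."""
    return M.sum(axis=0) / np.sqrt(M.shape[0])

def tail(M):
    """T(M) from Definition 2, computed with a running prefix sum of the rows."""
    m = M.shape[0]
    out, run = np.empty((m - 1, M.shape[1])), M[0].copy()
    for j in range(1, m):
        out[j - 1] = (np.sqrt(j) * M[j] - run / np.sqrt(j)) / np.sqrt(j + 1)
        run += M[j]
    return out

rng = np.random.default_rng(1)
S, T = rng.normal(size=(1, 2)), rng.normal(size=(5, 3))
m = T.shape[0]
A = np.hstack([np.repeat(S, m, axis=0), T])           # A = S x T

G = np.eye(m)                                         # G = R_m ... R_2
for i in range(2, m + 1):
    R = np.eye(m)
    sin, cos = -1.0 / np.sqrt(i), np.sqrt((i - 1.0) / i)
    R[0, 0] = R[i - 1, i - 1] = cos
    R[0, i - 1], R[i - 1, 0] = -sin, sin
    G = R @ G

U = np.vstack([head(A), tail(A)])
print(np.allclose(G @ A, U))                          # True
print(np.allclose(U[0, :2], np.sqrt(m) * S.ravel()))  # first row starts with sqrt(m) * S
print(np.allclose(U[1:, :2], 0))                      # zeros below, as in Lemma 1
```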

Fig. 2: Visualization of Lemmas 1 and 2: Matrices and results of applying the (generalized) head and tail

We also encounter cases where blocks of matrices do not have the simple form depicted in Fig. 2a, but where the multiple copies of the one-row matrix \(\textbf{S}\) are scaled by different real numbers \(v_1\) to \(v_m\), as in Fig. 2b. In these cases, we cannot compute heads and tails as in Lemma 1 to capture the effect of a sequence of Givens rotations. However, a generalized version of heads and tails can describe these effects, as defined next.

Definition 3

(Generalized Head and Tail) For any vector \(\textbf{v} \in \mathbb {R}_{>0}^{m}\) and matrix \(\textbf{A} \in \mathbb {R}^{m \times n}\), the generalized head \({\mathcal H} (\textbf{A}, \textbf{v})\in \mathbb {R}^{1 \times n}\) and generalized tail \({\mathcal T} (\textbf{A}, \textbf{v})\in \mathbb {R}^{(m-1) \times n}\) of \(\textbf{A}\) weighted by \(\textbf{v}\) are:

$$\begin{aligned} {\mathcal H} (\textbf{A}, \textbf{v})&= \frac{1}{||\textbf{v}||_2} \sum _{i = 1}^m \textbf{v}[i] \textbf{A}[i:]\\ {\mathcal T} (\textbf{A}, \textbf{v})[j:]&= \frac{1}{||\textbf{v}[1, \ldots , j+1]||_2} \Big (||\textbf{v}[1, \ldots , j]||_2\, \textbf{A}[j+1:] - \frac{\textbf{v}[j+1]\sum _{i=1}^j \textbf{v}[i] \textbf{A}[i:]}{||\textbf{v}[1, \ldots , j]||_2} \Big ) \hspace{0.3em}\Box \end{aligned}$$

If \(\textbf{v}\) is the vector of ones, then \(||\textbf{v}||_2 = \sqrt{m}\) and the generalized head and generalized tail become the head and tail, respectively.

We next generalize Lemma 1, where we consider each copy of \(\textbf{S}\) in \(\textbf{A}\) weighted by a positive scalar value from a weight vector \(\textbf{v}\): the i-th row of \(\textbf{T}\) is extended with \(v_i \textbf{S}\), for some positive real \(v_i\), see Fig. 2b. This scaling is expressed using the Kronecker product \(\otimes \). Here again, to set all but the first (scaled) occurrence of \(\textbf{S}\) to zero using Givens rotations, we use that these rotations are independent of \(\textbf{S}\) and only depend on the scaling factors \(\textbf{v}\) and the number of rows in \(\textbf{T}\). We use \({\mathcal H} (\textbf{A}, \textbf{v})\) and \({\mathcal T} (\textbf{A}, \textbf{v})\) to construct the result of applying the rotations to \(\textbf{A}\).

Lemma 2

Let \(\textbf{v} \in \mathbb {R}_{>0}^m, \textbf{S} \in \mathbb {R}^{1 \times n_1}\), and \(\textbf{T} \in \mathbb {R}^{m \times n_2}\) be given and let \(\textbf{A} = \begin{bmatrix} \textbf{S} \otimes \textbf{v}&\textbf{T} \end{bmatrix}\in \mathbb {R}^{m \times (n_1+n_2)}\). Let \(\textbf{R}_i = {{\,\mathrm{\textbf{Giv}}\,}}_m(1,i, -\frac{\textbf{v}[i]}{||\textbf{v}[1, \ldots , i]||_2}, \frac{||\textbf{v}[1, \ldots , i-1]||_2}{||\textbf{v}[1, \ldots , i]||_2})\), for all \(i \in \{2, \ldots , m\}\), be a Givens rotation matrix and let \(\textbf{G}\) be the orthogonal matrix \(\textbf{G} = \textbf{R}_m \cdots \textbf{R}_2\).

The matrix \(\textbf{U} = \textbf{G} \textbf{A}\) obtained by applying the rotations \(\textbf{R}_i\) to \(\textbf{A}\) is:

$$\begin{aligned} \textbf{U} = \begin{bmatrix} {\mathcal H} (\textbf{A}, \textbf{v}) \\ {\mathcal T} (\textbf{A}, \textbf{v}) \end{bmatrix} = \begin{bmatrix} ||\textbf{v}||_2\textbf{S} &{} {\mathcal H} (\textbf{T},\textbf{v}) \\ \textbf{0}^{{(m-1)} \times n_1} &{} {\mathcal T} (\textbf{T},\textbf{v}) \end{bmatrix}.\end{aligned}$$

The generalized head and tail can be computed in linear time in the size of the input matrix \(\textbf{A}\).

Lemma 3

Given any matrix \(\textbf{A} \in \mathbb {R}^{m \times n}\) and vector \(\textbf{v} \in \mathbb {R}_{>0}^m\), the generalized head \({\mathcal H} (\textbf{A}, \textbf{v})\) and the generalized tail \({\mathcal T} (\textbf{A}, \textbf{v})\) can be computed in time \(\mathcal {O}(mn)\).

The statement is obvious for generalized heads \({\mathcal H} (\textbf{A}, \textbf{v})\). We give the explanation for the tail.

Given the \(\ell _2\)-norm \(||\textbf{v}[1, \ldots , j]||_2\) of the vector \(\textbf{v}\) restricted to the first j entries, we can compute the same norm of \(\textbf{v}\), now restricted to the first \(j+1\) entries, in constant time. Likewise, the sum \(\sum _{i=1}^j \textbf{v}[i] \textbf{A}[i:]\) of the rows \(\textbf{v}[i] \textbf{A}[i:]\), where i goes up to j, can be reused to compute the sum of these rows up to \(j+1\) in time proportional to the row length n. See Alg. 1 for details.
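To make the incremental computation concrete, here is a NumPy sketch (ours, in the spirit of Alg. 1 but not a reproduction of it) that maintains the running prefix sum \(\sum _{i\le j} \textbf{v}[i]\textbf{A}[i:]\) and the running norm \(||\textbf{v}[1,\ldots ,j]||_2\) to produce the generalized head and tail in \(\mathcal {O}(mn)\) time.

```python
import numpy as np

def generalized_head_tail(A, v):
    """H(A, v) and T(A, v) from Definition 3, computed in a single pass over the rows."""
    m, n = A.shape
    tails = np.empty((m - 1, n))
    prefix = v[0] * A[0].copy()              # sum_{i <= j} v[i] * A[i]
    norm_sq = v[0] ** 2                      # ||v[1..j]||_2^2
    for j in range(1, m):
        norm_j  = np.sqrt(norm_sq)
        norm_j1 = np.sqrt(norm_sq + v[j] ** 2)
        tails[j - 1] = (norm_j * A[j] - v[j] * prefix / norm_j) / norm_j1
        prefix  += v[j] * A[j]
        norm_sq += v[j] ** 2
    return prefix / np.sqrt(norm_sq), tails  # head, tail

A = np.arange(12, dtype=float).reshape(4, 3)
v = np.array([1.0, 2.0, 1.0, 3.0])
h, t = generalized_head_tail(A, v)
print(h)                                     # equals sum_i v[i] A[i] / ||v||_2
print(t.shape)                               # (3, 3)
```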

Algorithm 1: Computing generalized tails in linear time.

The following lemma follows immediately from Def. 3.

Lemma 4

For any matrix \(\textbf{A} \in \mathbb {R}^{m \times n}\), any vector \(\textbf{v} \in \mathbb {R}_{>0}^m\), and any numbers \(k \in \mathbb {R}\) and \(\ell \in \mathbb {R}_{>0}\), the following holds:

$$\begin{aligned} {\mathcal H} (k \textbf{A}, \ell \textbf{v}) =&\ k\ {\mathcal H} (\textbf{A}, \textbf{v}) \\ {\mathcal T} (k \textbf{A}, \ell \textbf{v}) =&\ k\ {\mathcal T} (\textbf{A}, \textbf{v}) \end{aligned}$$

4 Example: rotations on a join output

Section 1.1 shows how to transform a Cartesian product of two matrices into an almost upper-triangular matrix by applying Givens rotations. Section 3 subsequently shows how the same result can be computed in linear time from the input matrices using the new operations head and tail, whose effects on matrices of a special form are summarized in Lemmas 1 and 2.

Towards an algorithm for the general case, this section gives an example of applying Givens rotations to a matrix that represents the natural join of four input relations. The insights gained from this example lead to the formulation of the FiGaRo algorithm.

Our approach is as follows. The result of joining two relations is a union of Cartesian products: for each join key \(\bar{x}\), all rows from the first relation that have \(\bar{x}\) as their value for the join attributes are paired with all rows from the second relation with \(\bar{x}\) as their join value. Sections 1.1 and 3 explain how to process Cartesian products. Then we take the union of their results, i.e., we accumulate the resulting rows in a single matrix.

Fig. 3: Matrices and join tree used in Sect. 4

Figure 3 sketches the input matrices \(\textbf{S}_1, \ldots , \textbf{S}_4\) and the join tree \(\tau \) used in this section. We explain \(\textbf{S}_1\); the other matrices are similar. The matrix \(\textbf{S}_1\) in Fig. 3a has a column \(X_{12}\) that represents the key attribute common with the matrix \(\textbf{S}_2\). It contains two distinct values \(a_1\) and \(a_2\) for its join attribute \(X_{12}\). For each \(a_j\), there is one vector \(\textbf{S}_1^{a_j}\) of values for the data column \(Y_1\) in \(\textbf{S}_1\). The rows of \(\textbf{S}_1\) are the union of the Cartesian products \([a_j] \times \textbf{S}_1^{a_j}\), for all \(j \in \{1,2\}\).

The output \(\textbf{A}\) of the natural join of \(\textbf{S}_1, \ldots , \textbf{S}_4\) contains the Cartesian product

$$\begin{aligned}{}[a_j \; b_k \; c_\ell ] \times \textbf{S}_1^{a_j} \times \textbf{S}_2^{a_ j b_k c_\ell } \times \textbf{S}_3^{b_k} \times \textbf{S}_4^{c_\ell } \end{aligned}$$

for every tuple \((a_j, b_k, c_\ell )\) of join keys, with \(j,k,\ell \in \{1,2\}\). As mentioned in Sect. 2, we want to transform the matrix \(\textbf{A}[: \bar{Y}]\), i.e., the projection of \(\textbf{A}\) to the data columns \(\bar{Y}\), into a matrix \(\textbf{R}_0\) that is almost upper-triangular. In particular, the number of rows with nonzero entries in \(\textbf{R}_0\) needs to be linear in the size of the input matrices. We do this by repeatedly identifying parts of the matrix \(\textbf{A}[: \bar{Y}]\) that have the form depicted in Fig. 2 and by applying Lemmas 1 and 2 to these parts. Recall that each of these applications corresponds to the application of a sequence of Givens rotations.

The matrix \(\textbf{A}[: \bar{Y}]\) contains the rows with the following Cartesian product associated with the join keys \((a_1, b_1, c_1)\):

[Displayed matrix: the data columns of \(\textbf{S}_1^{a_1} \times \textbf{S}_2^{a_1 b_1 c_1} \times \textbf{S}_3^{b_1} \times \textbf{S}_4^{c_1}\)]

In turn, this Cartesian product is the union of Cartesian products \([t_1 \; t_2 \; t_3] \times \textbf{S}_4^{c_1}\), for each choice \(t_1, t_2, t_3\) of rows of \(\textbf{S}_1^{a_1}\), \(\textbf{S}_2^{a_1b_1c_1}\), and \(\textbf{S}_3^{b_1}\), respectively. The latter products have the form depicted in Fig. 2a: the Cartesian product of a one-row matrix \([t_1 \; t_2 \; t_3]\) and an arbitrary matrix \(\textbf{S}_4^{c_1}\). By applying Lemma 1 to each of them, we obtain the following rows, where we use \(\alpha = \sqrt{|\textbf{S}_4^{c_1}|}\):

[Displayed matrix: for each choice of \(t_1, t_2, t_3\), one row \([\alpha t_1 \;\; \alpha t_2 \;\; \alpha t_3 \;\; {\mathcal H} (\textbf{S}_4^{c_1})]\) followed by the rows \([0 \; 0 \; 0] \times {\mathcal T} (\textbf{S}_4^{c_1})\)]

This matrix contains one copy of \({\mathcal T} (\textbf{S}_4^{c_1})\) for every choice of \(t_1, t_2, t_3\). More copies of \({\mathcal T} (\textbf{S}_4^{c_1})\) arise when we analogously apply Lemma 1 to the rows of \(\textbf{A}[: \bar{Y}]\) associated with join keys \((a_j, b_k, c_1)\) that are different from \((a_1, b_1, c_1)\). In total, the number of copies of \({\mathcal T} (\textbf{S}_4^{c_1})\) we obtain is

$$\begin{aligned}\sum _{j,k \in \{1,2\}} |\textbf{S}_1^{a_j}|\cdot |\textbf{S}_2^{a_jb_kc_1}|\cdot |\textbf{S}_3^{b_k}| = \big | \sigma _{X_{24}=c_1} (\textbf{S}_1 \mathop {\mathrm {\bowtie }}\limits \textbf{S}_2 \mathop {\mathrm {\bowtie }}\limits \textbf{S}_3) \big |,\end{aligned}$$

so, the size of the join of all matrices except \(\textbf{S}_4\), where the attribute \(X_{24}\) is fixed to \(c_1\). We denote this number by \(\varPhi _4^\circ (c_1)\).

We apply more Givens rotations to the rows of the form \([0 \; 0 \; 0] \times \mathbf {{\mathcal T}}(\textbf{S}_4^{c_1})\). These are again Cartesian products, namely the Cartesian products of the zero matrix of dimension \(\varPhi _4^\circ (c_1) \times 3\) and each row t of \({\mathcal T} (S_4^{c_1})\). We can apply again Lemma 1 to these Cartesian products and obtain the rows

[Displayed matrix: for each row t of \({\mathcal T} (\textbf{S}_4^{c_1})\), one row \([0 \;\; 0 \;\; 0 \;\; \sqrt{\varPhi _4^\circ (c_1)}\, t]\) and \(\varPhi _4^\circ (c_1)-1\) all-zero rows]

The rows that only consist of zeros can be discarded, rows with nonzero entries become part of the (almost upper-triangular) result matrix \(\textbf{R}_0\).

We now turn back to the remaining rows, which are of the form

[Displayed matrix: the rows \([\alpha t_1 \;\; \alpha t_2 \;\; \alpha t_3 \;\; {\mathcal H} (\textbf{S}_4^{c_1})]\), for each choice \(t_1, t_2, t_3\) of rows of \(\textbf{S}_1^{a_1}\), \(\textbf{S}_2^{a_1b_1c_1}\), and \(\textbf{S}_3^{b_1}\)]

As above, we view these rows as unions of (column-permutations of) Cartesian products \([\alpha t_1 \; \alpha t_2 \; {\mathcal H} (\textbf{S}_4^{c_1})] \times \alpha S_3^{b_1}\), for each choice \(t_1, t_2\) of rows of \(\textbf{S}_1^{a_1}\) and \(\textbf{S}_2^{a_1b_1c_1}\), respectively. The corresponding applications of Lemma 1 yield the rows

[Displayed matrix: for each choice of \(t_1, t_2\), one row \([\gamma t_1 \;\; \gamma t_2 \;\; \alpha {\mathcal H} (\textbf{S}_3^{b_1}) \;\; \beta {\mathcal H} (\textbf{S}_4^{c_1})]\) and the rows \([0 \;\; 0 \;\; \alpha {\mathcal T} (\textbf{S}_3^{b_1}) \;\; 0]\)]

where \(\beta = \sqrt{|\textbf{S}_3^{b_1}|}\) and \(\gamma = \alpha \beta \). Considering also the rows that we analogously obtain for join keys \((a_j, b_1, c_\ell )\) different from \((a_1, b_1, c_1)\), we get \(\sum _{j\in \{1,2\}} |\textbf{S}_1^{a_j}|\cdot |\textbf{S}_2^{a_jb_1c_1}|\) copies of \(\sqrt{|\textbf{S}_4^{c_1}|}{\mathcal T} (\textbf{S}_3^{b_1})\) in total, and \(\sum _{j \in \{1,2\}} |\textbf{S}_1^{a_j}|\cdot |\textbf{S}_2^{a_jb_1c_2}|\) copies of \(\sqrt{|\textbf{S}_4^{c_2}|}{\mathcal T} (\textbf{S}_3^{b_1})\).

We apply further Givens rotations to the rows that consist of zeros and the scaled copies of \({\mathcal T} (\textbf{S}_3^{b_1})\). For each row t of \({\mathcal T} (\textbf{S}_3^{b_1})\), we apply this time Lemma 2 to the matrix of all scaled copies of t and three columns of zeros. The result contains some all-zero rows and one row with entry t scaled by the \(\ell _2\) norm of the vector of the original scaling factors. This factor is the square root of

$$\begin{aligned}&\sum _{j \in \{1,2\}} |\textbf{S}_1^{a_j}|\cdot |\textbf{S}_2^{a_jb_1c_1}|\cdot \sqrt{|\textbf{S}_4^{c_1}|}^2 \\&\qquad + \sum _{j \in \{1,2\}} |\textbf{S}_1^{a_j}|\cdot |\textbf{S}_2^{a_jb_1c_2}|\cdot \sqrt{|\textbf{S}_4^{c_2}|}^2 \\ =&\sum _{j,\ell \in \{1,2\}} |\textbf{S}_1^{a_j}|\cdot |\textbf{S}_2^{a_jb_1c_\ell }|\cdot |\textbf{S}_4^{c_\ell }| \\ =&\, \big | \sigma _{X_{23}=b_1} (\textbf{S}_1 \mathop {\mathrm {\bowtie }}\limits \textbf{S}_2 \mathop {\mathrm {\bowtie }}\limits \textbf{S}_4) \big |, \end{aligned}$$

which, similarly as above, is the size of the join of all matrices except \(\textbf{S}_3\), where the attribute \(X_{23}\) is fixed to \(b_1\). We denote this number by \(\varPhi _3^\circ (b_1)\). In summary, the result of applying Lemma 2 contains, besides some all-zero rows, the rows

[Displayed matrix: the rows \([0 \;\; 0 \;\; \sqrt{\varPhi _3^\circ (b_1)}\, {\mathcal T} (\textbf{S}_3^{b_1}) \;\; 0]\)]

The remaining columns are processed analogously. Up to now, the transformations result in all-zero rows, rows that contain the scaled tail of a part of an input matrix with other columns being 0, and one row for every join key \((a_j, b_k, c_\ell )\), for \(j,k,\ell \in \{1,2\}\). We give the resulting rows for \(a_j = a_1\), where for \(k,\ell \in \{1,2\}\) we use \(\alpha _{k \ell } = \sqrt{|\textbf{S}_1^{a_1}|\cdot |\textbf{S}_2^{a_1 b_k c_\ell }|\cdot |\textbf{S}_3^{b_k}|\cdot |\textbf{S}_4^{c_\ell }|}\).

The first column \(Y_1\) has entries

[Displayed column: the entries \(\frac{\alpha _{k \ell }}{\sqrt{|\textbf{S}_1^{a_1}|}} {\mathcal H} (\textbf{S}_1^{a_1})\), for \(k,\ell \in \{1,2\}\)]

the remaining columns are

[Displayed matrix: the corresponding entries for the columns \(Y_2, Y_3, Y_4\), built from scaled heads \({\mathcal H} (\textbf{S}_2^{a_1 b_k c_\ell })\), \({\mathcal H} (\textbf{S}_3^{b_k})\), and \({\mathcal H} (\textbf{S}_4^{c_\ell })\)]

We observe that the entries in the column \(Y_1\) only differ by the scaling factors \(\alpha _{k \ell }\). It follows that we can apply Lemma 2, with \(\textbf{S} = {\mathcal H} (\textbf{S}_1^{a_1})\), \(\textbf{T}\) consisting of the columns \(Y_2, Y_3, Y_4\) above, and \(\textbf{v} = \begin{bmatrix} \frac{\alpha _{11}}{\sqrt{|\textbf{S}_1^{a_1}|}} \\ \vdots \\ \frac{\alpha _{22}}{\sqrt{|\textbf{S}_1^{a_1}|}} \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{|\textbf{S}_1^{a_1}|\cdot |\textbf{S}_2^{a_1 b_1 c_1}|\cdot |\textbf{S}_3^{b_1}|\cdot |\textbf{S}_4^{c_1}|}}{\sqrt{|\textbf{S}_1^{a_1}|}} \\ \vdots \\ \frac{\sqrt{|\textbf{S}_1^{a_1}|\cdot |\textbf{S}_2^{a_1 b_2 c_2}|\cdot |\textbf{S}_3^{b_2}|\cdot |\textbf{S}_4^{c_2}|}}{\sqrt{|\textbf{S}_1^{a_1}|}} \end{bmatrix} = \begin{bmatrix} \sqrt{|\textbf{S}_2^{a_1b_1c_1}| {\cdot } |\textbf{S}_3^{b_1}| {\cdot } |\textbf{S}_4^{c_1}|} \\ \vdots \\ \sqrt{|\textbf{S}_2^{a_1b_2c_2}| {\cdot } |\textbf{S}_3^{b_2}| {\cdot } |\textbf{S}_4^{c_2}|} \end{bmatrix}\).

The first row of the result is the generalized head of the above rows. For column \(Y_1\), this is \(||\textbf{v}||_2 {\mathcal H} (\textbf{S}_1^{a_1})\), where \(||\textbf{v}||_2\) equals \(\sqrt{\sum _{k, \ell \in \{1,2\}}|\textbf{S}_2^{a_1b_kc_\ell }|\cdot |\textbf{S}_3^{b_k}|\cdot |\textbf{S}_4^{c_\ell }|} = \sqrt{|\sigma _{X_{12}=a_1} \textbf{S}_2 \mathop {\mathrm {\bowtie }}\limits \textbf{S}_3 \mathop {\mathrm {\bowtie }}\limits \textbf{S}_4|}\). The term under the square root is the size of the join of all matrices in the subtree of \(\textbf{S}_3\) in \(\tau \), where \(X_{12}\) is fixed to \(a_1\). We denote this number by \(\varPhi _3^\downarrow (a_1)\). We skip the discussion for the other columns.

The remaining rows are formed by the generalized tail. In column \(Y_1\), these rows only have zeros. For column \(Y_4\), the result is \(\sqrt{|\textbf{S}_1^{a_1}|} \textbf{T}_{4,a_1}\), where \(\textbf{T}_{4,a_1} = {\mathcal T} \left( \begin{bmatrix} \sqrt{|\textbf{S}_2^{a_1 b_1 c_1}|\cdot |\textbf{S}_3^{b_1}|} {\mathcal H} (\textbf{S}_4^{c_1}) \\ \vdots \\ \sqrt{|\textbf{S}_2^{a_1 b_2 c_2}|\cdot |\textbf{S}_3^{b_2}|} {\mathcal H} (\textbf{S}_4^{c_2}) \end{bmatrix},\textbf{v}\right) \), and analogously for the columns \(Y_2\) and \(Y_3\).

Remark 1

Suppose \(\textbf{S}_1\) had another child \(\textbf{S}_5\) in \(\tau \), so that \(\textbf{S}_1\) also had the join attributes \(X_{15}\). Then we would get one scaled copy of \(\textbf{T}_{4,a_1}\) for every value of \(X_{15}\) that occurs together with the value \(a_1\) for \(X_{12}\). We would then apply Lemma 2 again. The scaling factor for \(\textbf{T}_{4,a_1}\) we obtain in the general case is the square root of the size of the join of all relations that are not in the subtree of \(\textbf{S}_3\) in \(\tau \), where \(X_{12}\) is fixed to \(a_1\). In the simple case considered here, this value is \(\sqrt{|\textbf{S}_1^{a_1}|}\).

To sum up, the resulting nonzero rows are of the following form:

  1. \(\sqrt{\varPhi ^\circ _i(\bar{x}_i)} {\mathcal T} (\textbf{S}_i^{\bar{x}_i})\), for all \(1 \le i \le 4\) and values \(\bar{x}_i\) for the join attributes of \(\textbf{S}_i\),

  2. \(\sqrt{|\textbf{S}_1^{a_j}|} [\textbf{T}_{2,a_j} \; \textbf{T}_{3,a_j} \; \textbf{T}_{4,a_j}]\), for \(j \in \{1,2\}\), and

  3. one row of scaled generalized heads for \(a_1\) and \(a_2\).

The number of rows of type (1) is bounded by the overall number of rows in the input matrices minus the number of different join keys, as the tail of a matrix has one row less than the input matrix. The number of rows of type (2) is bounded by the number of different values for the join attributes \((X_{12},X_{23}, X_{24})\) minus the number of different values for \(X_{12}\). The number of rows of type (3) is bounded by the number of different values for the join attribute \(X_{12}\). Together, the number of nonzero rows is bounded by the overall number of rows in the input matrices, as desired.

5 Scaling factors as count queries

The example in Sect. 1.1 applies Givens rotations to a Cartesian product. The resulting matrix uses scaling factors that are square roots of the numbers of rows of the input matrices. These numbers can be trivially computed directly from these matrices. In case of a matrix defined by joins, as seen in the preceding Sect. 4, these scaling factors are defined by group-by count queries over the joins. FiGaRo needs three types of such count queries at each node in the join tree. They can be computed together in linear time in the size of the input matrices. This section defines these queries and explains how to compute them efficiently.

Fig. 4: Count queries used by FiGaRo, defined over the join tree \(\tau \) of the r input matrices \(\textbf{S}_1, \ldots , \textbf{S}_r\). \(\bar{X}_i\) are the join attributes of \(\textbf{S}_i\) and \(\bar{X}_p\) are the join attributes that \(\textbf{S}_i\) has in common with its parent in \(\tau \). \(\varPhi _i^\circ \) gives the number of tuples in the join of all relations except \(\textbf{S}_i\), grouped by the attributes \(\bar{X}_i\). \(\varPhi _i^\downarrow \) and \(\varPhi _i^\uparrow \) give the number of tuples in the join of the relations that are in the subtree of \(\textbf{S}_i\) in \(\tau \) respectively not in this subtree, grouped by the attributes \(\bar{X}_p\)

We define the count queries \(\varPhi _i^\circ \), \(\varPhi _i^\uparrow \) and \(\varPhi _i^\downarrow \) for each input matrix \(\textbf{S}_i\) in a given join tree \(\tau \) (\(i \in [r]\)). Figure 4 depicts graphically the meaning of these count queries: They give the sizes of joins of subsets of the input matrices, grouped by various join attributes of \(\textbf{S}_i\): \(\varPhi _i^\uparrow \) and \(\varPhi _i^\downarrow \) join all matrices above and respectively below \(\textbf{S}_i\) in the join tree \(\tau \), while \(\varPhi _i^\circ \) joins all matrices except \(\textbf{S}_i\).

Each of these aggregates can be computed individually in time linear in the size of the input matrices using standard techniques for group-by aggregates over acyclic joins [42, 59]. This section shows how to further speed up their evaluation by identifying and executing common computation across all these aggregates. This reduces the necessary number of scans of each of the input matrices from \(\mathcal {O}(r)\) scans to two scans.

We recall our notation: \(\tau \) is the join tree; \(\bar{X}_i\) are the join attributes of \(\textbf{S}_i\) and \(\bar{X}_{ij}\) are the common join attributes of \(\textbf{S}_i\) and its child \(\textbf{S}_j\) in \(\tau \). We assume that each matrix \(\textbf{S}_i\) is sorted on the attributes \(\bar{X}_i\), and that a child \(\textbf{S}_j\) of \(\textbf{S}_i\) in \(\tau \) is sorted first on the attributes \(\bar{X}_{ij}\).

We first initialize the counts \(\textsc {rows\_per\_key}_i(\bar{x}_i)\) for each matrix \(\textbf{S}_i\) and value \(\bar{x}_i\) for its join attributes \(\bar{X}_i\). These counts give the number of those rows in \(\textbf{S}_i\) that contain the join key \(\bar{x}_i\) for the columns \(\bar{X}_i\). This step is done in one pass over the sorted matrices and takes linear time in their size. The aggregates \(\varPhi _i^\downarrow \) are computed in the same data pass. The computation of the aggregates \(\varPhi _i^\circ \) and \(\varPhi _i^\uparrow \) is done in a second pass over the matrices. We utilize the following relationships between the aggregates.

Computing \(\varPhi _i^\downarrow \): This aggregate gives the size of the join of the matrices in the subtree of \(\textbf{S}_i\) in \(\tau \) grouped by the join attributes \(\bar{X}_{p}\) that are common to \(\textbf{S}_i\) and its parent in \(\tau \). If \(\textbf{S}_i\) is a leaf in \(\tau \), then \(\bar{X}_{p} = \bar{X}_i\) and \(\varPhi _i^\downarrow (\bar{x}_i) = \textsc {rows\_per\_key}_i(\bar{x}_i)\). If \(\textbf{S}_i\) is not a leaf in \(\tau \), then we compute the intermediate aggregate

$$\begin{aligned}\varTheta ^\downarrow _{i}(\bar{x}_i) = \Big |\sigma _{\bar{X}_i = \bar{x}_i}\Big (\mathop {\mathrm {\bowtie }}\limits \limits _{\begin{array}{c} j \in [r] \\ \textbf{S}_j \text { is in the subtree of } \textbf{S}_i \end{array}} \textbf{S}_j\Big )\Big | \end{aligned}$$

which gives the size of the join of the same matrices as \(\varPhi _i^\downarrow \), but grouped by \(\bar{X}_i\) instead of \(\bar{X}_{p}\). The reason for computing \(\varTheta _i^\downarrow \) is that it is useful for other aggregates as well, as discussed later. This aggregate can also be expressed as

$$\begin{aligned}\varTheta _i^\downarrow (\bar{x}_i) = \textsc {rows\_per\_key}_i(\bar{x}_i) \cdot \prod _{\text {child }\textbf{S}_j \text { of } \textbf{S}_i} \varPhi _j^\downarrow (\bar{x}_{ij}),\end{aligned}$$

where \(\bar{x}_{ij}\) is the projection of \(\bar{x}_i\) to the join attributes \(\bar{X}_{ij}\). The aggregate \(\varPhi _i^\downarrow (\bar{x}_{p})\) is obtained by summing all values \(\varTheta _i^\downarrow (\bar{x}_i)\) such that \(\bar{x}_i\) projected to \(\bar{X}_p\) equals \(\bar{x}_p\).

All aggregates \(\varPhi _1^\downarrow ,\ldots ,\varPhi _r^\downarrow \) can thus be computed in a bottom-up traversal of the join tree \(\tau \) and in one pass over the sorted matrices.
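The following Python sketch (ours; a toy illustration over a hypothetical join-tree representation, not the paper's implementation) runs this bottom-up pass: each node carries its join-key columns and its rows projected onto those columns, and the pass fills in \(\varTheta _i^\downarrow \) and \(\varPhi _i^\downarrow \).

```python
from collections import Counter

class Node:
    """A join-tree node: the relation's join-key columns and its rows (keys only)."""
    def __init__(self, name, key_cols, keys, children=()):
        self.name, self.key_cols = name, key_cols
        self.keys, self.children = list(keys), list(children)
        self.theta_down, self.phi_down = {}, {}

def proj(key, cols, onto):
    return tuple(key[cols.index(c)] for c in onto)

def compute_phi_down(node, parent_cols):
    rows_per_key = Counter(node.keys)
    for child in node.children:
        shared = [c for c in child.key_cols if c in node.key_cols]
        compute_phi_down(child, shared)
    for key, cnt in rows_per_key.items():                    # Theta_down: grouped by X_i
        prod = cnt
        for child in node.children:
            shared = [c for c in child.key_cols if c in node.key_cols]
            prod *= child.phi_down.get(proj(key, node.key_cols, shared), 0)
        node.theta_down[key] = prod
    phi = Counter()                                          # Phi_down: grouped by X_p
    for key, val in node.theta_down.items():
        phi[proj(key, node.key_cols, parent_cols)] += val
    node.phi_down = dict(phi)

# A toy instance shaped like Fig. 3: S1(X12) -- S2(X12, X23, X24) -- S3(X23), S4(X24).
S3 = Node("S3", ("X23",), [("b1",), ("b1",), ("b2",)])
S4 = Node("S4", ("X24",), [("c1",), ("c2",), ("c2",)])
S2 = Node("S2", ("X12", "X23", "X24"),
          [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a2", "b1", "c2")], [S3, S4])
S1 = Node("S1", ("X12",), [("a1",), ("a1",), ("a2",)], [S2])
compute_phi_down(S1, ())      # for the root, Phi_down grouped by () is the full join size
print(S1.phi_down)            # {(): 12}
print(S2.phi_down)            # grouped by X12, the attribute shared with its parent S1
```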

Computing \(\varPhi _i^\uparrow \): This aggregate is defined similarly to \(\varPhi _i^\downarrow \), where the join is now over all matrices that are not in the subtree of \(\textbf{S}_i\) in the join tree \(\tau \). It is computed in a top-down traversal of \(\tau \).

Assume \(\varPhi _i^\uparrow (\bar{x}_{p})\) is already computed, where \(\bar{x}_p\) is a value for the join attributes \(\bar{X}_p\) that \(\textbf{S}_i\) has in common with its parent in \(\tau \). The intermediate aggregate \(\textsc {full\_join\_size}_i(\bar{x}_i) = \varPhi _i^\uparrow (\bar{x}_{p}) \cdot \varTheta _i^\downarrow (\bar{x}_i)\) gives the size of the join over all matrices, grouped by \(\bar{X}_{i}\). For the root \(\textbf{S}_1\) of \(\tau \), we set \(\textsc {full\_join\_size}_1(\bar{x}_1) = \varTheta _1^\downarrow (\bar{x}_1)\).

We next define aggregates \(\textsc {full\_join\_size}_{ij}(\bar{x}_{ij})\) that give the size of the join of all matrices for each value of the join attributes \(\bar{X}_{ij}\) that are common to \(\textbf{S}_i\) and its child \(\textbf{S}_j\). They can be computed by summing up the values \(\textsc {full\_join\_size}_i(\bar{x}_i)\) for the keys \(\bar{x}_i\) that agree with \(\bar{x}_{ij}\) on \(\bar{X}_{ij}\). Since this size is the product of \(\varPhi _j^\downarrow (\bar{x}_{ij})\) and \(\varPhi _j^\uparrow (\bar{x}_{ij})\), we have \(\varPhi _j^\uparrow (\bar{x}_{ij}) = \textsc {full\_join\_size}_{ij}(\bar{x}_{ij})/\varPhi _j^\downarrow (\bar{x}_{ij})\), where \(\varPhi _j^\downarrow \) is already computed. For all children \(\textbf{S}_j\) of \(\textbf{S}_i\), we can compute all sums \(\textsc {full\_join\_size}_{ij}\) together in one pass over \(\textbf{S}_i\).

Algorithm 2: Computing the count aggregates for FiGaRo.

Computing \(\varPhi _i^\circ \): This aggregate gives the size of the join of all matrices except \(\textbf{S}_i\), grouped by the attributes \(\bar{X}_i\). If \(\textbf{S}_i\) is a leaf in \(\tau \), then \(\varPhi _i^\circ = \varPhi _i^\uparrow \) by definition. For a non-leaf \(\textbf{S}_i\), we re-use the values \(\textsc {full\_join\_size}_i(\bar{x}_i)\) defined above. This value is very similar to \(\varPhi _i^\circ (\bar{x}_i)\), but additionally accounts for the rows of \(\textbf{S}_i\) itself via the factor \(\textsc {rows\_per\_key}_i(\bar{x}_i)\). More precisely, we need to divide \(\textsc {full\_join\_size}_i(\bar{x}_i)\) by \(\textsc {rows\_per\_key}_i(\bar{x}_i)\) to obtain \(\varPhi _i^\circ (\bar{x}_i)\).
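A continuation of the join-tree sketch above (again ours, purely illustrative) performs the top-down pass and derives full_join_size, \(\varPhi _i^\uparrow \), and \(\varPhi _i^\circ \) from the already computed \(\varTheta _i^\downarrow \) and \(\varPhi _i^\downarrow \), using exactly the relationships just described; it reuses the Node class, proj, and the bottom-up pass from the previous sketch.

```python
from collections import Counter

def compute_phi_up(node, phi_up_parent=None, parent_cols=()):
    rows_per_key = Counter(node.keys)
    full_join_size = {}                                # grouped by the node's key X_i
    for key, theta in node.theta_down.items():
        up = 1 if phi_up_parent is None else phi_up_parent[proj(key, node.key_cols, parent_cols)]
        full_join_size[key] = up * theta
    # Phi_circ: size of the join of all matrices except this one, grouped by X_i.
    node.phi_circ = {k: v // rows_per_key[k] for k, v in full_join_size.items()}
    for child in node.children:
        shared = [c for c in child.key_cols if c in node.key_cols]
        sums = Counter()                               # full_join_size grouped by X_ij
        for key, v in full_join_size.items():
            sums[proj(key, node.key_cols, shared)] += v
        child.phi_up = {k: sums[k] // child.phi_down[k] for k in child.phi_down}
        compute_phi_up(child, child.phi_up, shared)

compute_phi_up(S1)        # S1 is the root; no Phi_up is needed there
print(S2.phi_up)          # {('a1',): 2, ('a2',): 1}: the join of S1 alone, grouped by X12
print(S4.phi_circ)        # join size of S1 join S2 join S3, grouped by the key X24 of S4
```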

Alg. 2 gives the procedure compute-counts for shared computation of the three aggregate types. First, in a bottom-up pass of the join tree \(\tau \), it computes the aggregates \(\varTheta _i^\downarrow \) and \(\varPhi _i^\downarrow \). Then, it computes \(\varPhi _i^\uparrow \) and \(\varPhi _i^\circ \) in a top-down pass over \(\tau \). The view \(\textsc {full\_join\_size}_i\) is defined depending on whether \(\textbf{S}_i\) is the root. For every child \(\textbf{S}_j\), the value \(\textsc {full\_join\_size}_i\) is added to \(\varPhi _j^\uparrow (\bar{x}_{ij})\). The division by \(\varPhi _j^\downarrow (\bar{x}_{ij})\) is done just before the recursive call. Before that call, \(\varPhi _i^\circ \) is computed.

Lemma 5

Given the matrices \(\textbf{S}_1, \ldots , \textbf{S}_r\) and the join tree \(\tau \), the aggregates \(\varPhi _i^\downarrow \), \(\varPhi _i^\uparrow \) and \(\varPhi _i^\circ \) from Fig. 4 can be computed in time \(\mathcal {O}(|\textbf{S}_1|+\cdots +|\textbf{S}_r|)\) and with two passes over these matrices.

We parallelize Alg. 2 by executing the loops starting in lines 7 and 15 in parallel for different values \(\bar{x}_i\). Atomic operations are used for the assignments in lines 13 and 22 to handle concurrent write operations of different threads.

6 FiGaRo: pushing rotations past joins

We are now ready to introduce FiGaRo. Algorithm 3 gives its pseudocode. It takes as input the matrices \(\textbf{S}_1, \ldots , \textbf{S}_r\) and a join tree \(\tau \) of the natural join of the input matrices. It computes an almost upper-triangular matrix \(\textbf{R}_0\) with at most \(|\textbf{S}_1| + \cdots + |\textbf{S}_r|\) nonzero rows such that \(\textbf{R}_0\) can be alternatively obtained from the natural join of the input matrices by a sequence of Givens rotations. It does so in linear time in the size of the input matrices. In contrast to the example in Sect. 4, FiGaRo does not require the join output \(\textbf{A}\) to be materialized. Instead, it traverses the input join tree bottom-up and computes heads and tails of the join of the matrices under each node in the join tree.

FiGaRo uses the following two high-level ideas.

First, the join output typically contains multiple copies of an input row, where the precise number of copies depends on the number of rows from other matrices that share the same join key. When using Givens rotations to set all but the first occurrence of a value to zero, this first occurrence will be scaled by some factor, which is given by a count query (Sect. 5). FiGaRo applies the scaling factor directly to the value, without the detour of constructing the multiple copies and getting rid of them again.

Second, all matrix head and tail computations can be done independently on the different columns of the input matrix. FiGaRo performs them on the input matrix, where these columns originate, and combines the results, including the application of scaling factors, in a bottom-up pass over the join tree \(\tau \).

Algorithm 3: FiGaRo computes the almost upper-triangular matrix \(\textbf{R}_0\) from the input matrices \(\textbf{S}_1, \ldots , \textbf{S}_r\) with join tree \(\tau \).

The algorithm. FiGaRo manages two matrices \(\textbf{Data} \) and \(\textbf{Out} \). The matrix \(\textbf{Out} \) holds the result of FiGaRo, while the matrix \(\textbf{Data} \) holds (generalized) heads that will be processed further in later steps. The algorithm proceeds recursively along the join tree \(\tau \), starting from the root. It maintains the invariant that after the computation has finished for a non-root node \(\textbf{S}_i\) of the join tree, \(\textbf{Data} \) contains exactly one row for every value of the join attributes that are shared between \(\textbf{S}_i\) and its parent in \(\tau \).

For a matrix \(\textbf{S}_i\), FiGaRo first iterates over all values \(\bar{x}_i\) of its join attributes \(\bar{X}_i\). For each \(\bar{x}_i\) it computes head and tail of the matrix \(\textbf{S}_i^{\bar{x}_i}\) that is obtained by selecting all rows of \(\textbf{S}_i\) that have the join key \(\bar{x}_i\) and then by projecting these rows onto the data columns \(\bar{Y}_i\). The tail is multiplied by the scaling factor \(\sqrt{\varPhi _i^\circ (\bar{x}_i)}\) and written to the output \(\textbf{Out}\), adding zeros to other columns. The head is stored in the matrix \(\textbf{Data} \). The scaling factor \(\sqrt{|\textbf{S}_i^{\bar{x}_i}|}\) that, as per Lemma 1, is applied to other columns, is stored in the vector \(\textbf{scales} \). If \(\textbf{S}_i\) is a leaf in the join tree \(\tau \), then the invariant is satisfied.

If \(\textbf{S}_i\) is not a leaf in the join tree \(\tau \), then the algorithm is called recursively for its children. Any output of the recursive calls is written to \(\textbf{Out}\). The recursive call for a child \(\textbf{S}_j\) also returns a matrix \(\textbf{Data} _j\) that contains, for all k such that \(\textbf{S}_k\) is in the subtree of \(\textbf{S}_j\) in \(\tau \), the columns \(\bar{Y}_k\) holding the heads computed for these columns, as well as a vector of scaling factors \(\textbf{scales} _j\). As to the maintained invariant, \(\textbf{Data} _j\) contains exactly one row for every value of the join attributes \(\bar{X}_{ij}\) that are shared between \(\textbf{S}_i\) and \(\textbf{S}_j\).

The contents of the matrices \(\textbf{Data} \) and \(\textbf{Data} _j\) are joined. More precisely, for each join value \(\bar{x}_i\) of the attributes \(\bar{X}_i\) of \(\textbf{S}_i\), the entry \(\textbf{Data} [\bar{x}_i: \bar{Y}_k]\) is set to the entry \(\textbf{Data} _j[\bar{x}_{ij}: \bar{Y}_k]\), where \(\bar{x}_{ij}\) is the projection of \(\bar{x}_i\) onto \(\bar{X}_{ij}\). Also, scaling factors are applied to the different columns of \(\textbf{Data} \). For the columns \(\bar{Y}_k\), all scaling factors from \(\textbf{scales} \) and from the vectors \(\textbf{scales} _j\) need to be applied, for all j such that \(\textbf{S}_k\) is not in the subtree of \(\textbf{S}_j\). In the case of \(\bar{Y}_i\), all factors from the vectors \(\textbf{scales} _j\) are applied, but not from \(\textbf{scales} \).

If \(\textbf{S}_i\) is not the root of the join tree, rows of \(\textbf{Data} \) are aggregated in project_away_join_attributes to satisfy the invariant. Before entering the loop of line 28, \(\textbf{Data} \) contains one row for every value \(\bar{x}_i\) of the join attributes \(\bar{X}_i\) of \(\textbf{S}_i\). The number of rows is reduced in project_away_join_attributes such that there is only one row for every value \(\bar{x}_p\) of the join attributes \(\bar{X}_p\) that are shared between \(\textbf{S}_i\) and its parent in \(\tau \). To do so, FiGaRo iterates over all values \(\bar{x}_p\) of these attributes. For each \(\bar{x}_p\) it takes the rows of \(\textbf{Data} \) for the keys \(\bar{x}_i\) that agree with \(\bar{x}_p\) on \(\bar{X}_p\) and computes their generalized head and tail. The generalized tail is scaled by the factor \(\sqrt{\varPhi _i^\uparrow (\bar{x}_p)}\) and written to \(\textbf{Out} \), adding zeros for the other columns. The matrix \(\textbf{Data}\) is overwritten with the collected generalized heads, so the invariant is now satisfied. The scaling factor \(\sqrt{\varPhi _i^\downarrow (\bar{x}_p)}\), which as per Lemma 2 is applied to the other columns, is stored.

The next theorem states that indeed the result of FiGaRo is an almost upper-triangular matrix with a linear number of nonzero rows. It also states the runtime guarantees of FiGaRo.

Theorem 1

Let the matrices \(\textbf{S}_1, \ldots , \textbf{S}_r\) represent fully reduced relations and let \(\tau \) be any join tree of the natural join over these relations. Let \(\textbf{A}\) be the matrix representing this natural join. Let M be the overall number of rows and N be the overall number of data columns in the input matrices \(\textbf{S}_i\).

On input \((\textbf{S}_1, \ldots , \textbf{S}_r, \tau , 1)\), FiGaRo returns a matrix \(\textbf{R}_0\) in \(\mathcal {O}(MN)\) time such that:

  1. \(\textbf{R}_0\) has at most M rows,

  2. there is an orthogonal matrix \(\textbf{Q}\) such that

     $$\begin{aligned}\textbf{A}[: \bar{Y}] = \textbf{Q} \begin{bmatrix} \textbf{R}_0 \\ \textbf{0} \end{bmatrix}.\end{aligned}$$

We can parallelize FiGaRo by executing the loop starting in line 12 of Alg. 3 in parallel for different values \(\bar{x}_i\), similarly for the loops starting in lines 28 and 21. As the executions are independent, no synchronization between the threads is necessary.

Comparison with Sect. 4. We exemplify the execution of FiGaRo using the same input as in Sect. 4 and depicted in Fig. 3. We will obtain the same result \(\textbf{R}_0\) as in that section, modulo permutations of rows.

After the initial call of \({\textsc {FiGaRo}} (\textbf{S}_1, \ldots , \textbf{S}_4,\tau ,1)\), the heads and tails of \(\textbf{S}_1\) are computed first. We skip the explanation of this part and focus on the subsequent recursive call \({\textsc {FiGaRo}} (\textbf{S}_1, \ldots , \textbf{S}_4,\tau ,2)\). When computing the heads and tails of \(\textbf{S}_2\), first the matrices \(\sqrt{\varPhi ^\circ _2(a_jb_kc_\ell )}{\mathcal T} (\textbf{S}_2^{a_jb_kc_\ell })\) are padded with zeros and written to \(\textbf{Out} \), for all \(j, k, \ell \in \{1,2\}\). Then, the entry \(\textbf{Data} [a_jb_kc_\ell : Y_2]\) is set to \({\mathcal H} (\textbf{S}_2^{a_jb_kc_\ell })\) and \(\textbf{scales} [a_jb_kc_\ell ]\) is set to \(\sqrt{|\textbf{S}_2^{a_jb_kc_\ell }|}\).

Next, the children \(\textbf{S}_3\) and \(\textbf{S}_4\) of \(\textbf{S}_2\) in \(\tau \) are processed and the intermediate results are joined. The recursive calls for the children return \(\textbf{Data} _3[b_k: Y_3] = {\mathcal H} (\textbf{S}_3^{b_k})\) and \(\textbf{scales} _3[b_k] = \sqrt{|\textbf{S}_3^{b_k}|}\) as well as \(\textbf{Data} _4[c_\ell : Y_4] = {\mathcal H} (\textbf{S}_4^{c_\ell })\) and \(\textbf{scales} _4[c_\ell ] = \sqrt{|\textbf{S}_4^{c_\ell }|}\); the matrices \(\textbf{Out} _3\), \(\textbf{Out} _4\) contain as nonzero entries \(\sqrt{\varPhi ^\circ _3(b_k)} {\mathcal T} (\textbf{S}_3^{b_k})\) and \(\sqrt{\varPhi ^\circ _4(c_\ell )} {\mathcal T} (\textbf{S}_4^{c_\ell })\), respectively, for every \(k, \ell \in \{1,2\}\).

After joining the results and applying the scaling factors, the rows \(\textbf{Data} [a_jb_kc_\ell :]\) are as follows for every \(j, k, \ell \in \{1,2\}\), with \(\beta _{jk \ell } = \sqrt{|\textbf{S}_2^{a_jb_kc_\ell }|\cdot |\textbf{S}_3^{b_k}|\cdot |\textbf{S}_4^{c_\ell }|}\):

$$\begin{aligned} \textbf{Data} [a_jb_kc_\ell : Y_2]&= \tfrac{\beta _{jk\ell }}{\sqrt{|\textbf{S}_2^{a_jb_kc_\ell }|}}\, {\mathcal H} (\textbf{S}_2^{a_jb_kc_\ell }),\quad \textbf{Data} [a_jb_kc_\ell : Y_3] = \tfrac{\beta _{jk\ell }}{\sqrt{|\textbf{S}_3^{b_k}|}}\, {\mathcal H} (\textbf{S}_3^{b_k}),\quad \textbf{Data} [a_jb_kc_\ell : Y_4] = \tfrac{\beta _{jk\ell }}{\sqrt{|\textbf{S}_4^{c_\ell }|}}\, {\mathcal H} (\textbf{S}_4^{c_\ell }). \end{aligned}$$

It holds that \(\textbf{scales} [a_j b_kc_\ell ] = \beta _{j k \ell }\).

Notice the similarity to the corresponding rows from Sect. 4. There we obtained, for \(j=1\) and with \(\alpha _{k \ell } = \sqrt{|\textbf{S}_1^{a_1}|\cdot |\textbf{S}_2^{a_1 b_k c_\ell }|\cdot |\textbf{S}_3^{b_k}|\cdot |\textbf{S}_4^{c_\ell }|}\), rows that have a column \(Y_1\) consisting of \(\frac{\alpha _{k \ell }}{\sqrt{|\textbf{S}_1^{a_1}|}} {\mathcal H} (\textbf{S}_1^{a_1})\), and the columns

$$\begin{aligned} Y_2: \tfrac{\alpha _{k \ell }}{\sqrt{|\textbf{S}_2^{a_1 b_k c_\ell }|}}\, {\mathcal H} (\textbf{S}_2^{a_1 b_k c_\ell }), \quad Y_3: \tfrac{\alpha _{k \ell }}{\sqrt{|\textbf{S}_3^{b_k}|}}\, {\mathcal H} (\textbf{S}_3^{b_k}), \quad Y_4: \tfrac{\alpha _{k \ell }}{\sqrt{|\textbf{S}_4^{c_\ell }|}}\, {\mathcal H} (\textbf{S}_4^{c_\ell }). \end{aligned}$$

For \(j = 1\), the only difference in the columns \(Y_2\) to \(Y_4\) is in the scaling factors \(\alpha _{k \ell }\) and \(\beta _{1 k \ell }\), which only differ by the factor \(\sqrt{|\textbf{S}_1^{a_1}|}\).

The join attributes \(X_{23}\) and \(X_{24}\) are then projected away, as they do not appear in the parent \(\textbf{S}_1\) of \(\textbf{S}_2\). Observe that the vectors that are used for defining the generalized heads and tails are the same here and in Sect. 4. For example, the part of \(\textbf{scales} \) that corresponds to \(a_j = a_1\) equals \(\textbf{v} = \begin{bmatrix} \sqrt{|\textbf{S}_2^{a_1b_1c_1}|\cdot |\textbf{S}_3^{b_1}|\cdot |\textbf{S}_4^{c_1}|} \\ \vdots \\ \sqrt{|\textbf{S}_2^{a_1b_2c_2}|\cdot |\textbf{S}_3^{b_2}|\cdot |\textbf{S}_4^{c_2}|} \end{bmatrix}\), the vector used in Sect. 4. So, the generalized heads and tails for the rows of \(\textbf{Data} \) associated with join keys \(a_1\) and \(a_2\), respectively, have the same results as in Sect. 4 for the columns \(Y_2\) to \(Y_4\), modulo the factor \(\sqrt{|\textbf{S}_1^{a_1}|}\). This factor equals \(\sqrt{\varPhi _2^\uparrow (a_1)}\), which FiGaRo applies to the generalized tails. That factor is also applied to the generalized heads in \(\textbf{Data} \), namely by the procedure process_and_join_children in the call \({\textsc {FiGaRo}} (\textbf{S}_1, \ldots , \textbf{S}_4,\tau ,1)\), after the call of FiGaRo for \(\textbf{S}_2\) terminates. There, also the column \(Y_1\) is added. The rows in the final result of \({\textsc {FiGaRo}} (\textbf{S}_1, \ldots , \textbf{S}_4,\tau ,1)\) are then the same as in Sect. 4.

7 Post-processing

The output of FiGaRo is not yet an upper-triangular matrix, but a matrix \(\textbf{R}_0\) whose number of rows is linear in the size of the input matrices. The transformation of \(\textbf{R}_0\) into the final upper-triangular matrix \(\textbf{R}\) is the task of a post-processing step that we describe in this section. This step can be achieved by general techniques for QR decomposition, such as Householder transformations or a textbook Givens rotations approach. The approach we present here exploits the structure of \(\textbf{R}_0\), which has large blocks of zeros.

The nonzero blocks of \(\textbf{R}_0\) are either tails of matrices (see line 13 of Alg. 3), generalized tails of sets of matrices (line 29), or the content of the matrix \(\textbf{Data} \) at the end of the execution of FiGaRo (line 8). In a first post-processing step, these blocks are upper-triangularized individually and any all-zero rows that arise are discarded. In a second post-processing step, the resulting block of rows is transformed into the upper-triangular matrix \(\textbf{R}\).

In both steps we need to make blocks of rows upper-triangular, i.e., we need to compute their QR decomposition. This can be done using off-the-shelf Householder transformations. We implemented an alternative method dubbed THIN [12, 27]. It first divides the rows of a block among the available threads; each thread then applies a sequence of Givens rotations to bring its share of rows into upper-triangular form. THIN then uses a classical parallel Givens rotations approach [26, p. 257] on the collected intermediate results to obtain the final upper-triangular matrix.
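
To make the per-thread step concrete, the following sketch upper-triangularizes one block of rows with Givens rotations (in Python with NumPy, which is not the language of our implementation; the helper givens and the bottom-up sweep order are one standard formulation, not necessarily the exact variant used by THIN).

```python
import numpy as np

def givens(a, b):
    # Return (c, s) with c^2 + s^2 = 1 such that
    # [[c, s], [-s, c]] @ [a, b] = [r, 0].
    if b == 0.0:
        return 1.0, 0.0
    if abs(b) > abs(a):
        t = a / b
        s = 1.0 / np.sqrt(1.0 + t * t)
        return s * t, s
    t = b / a
    c = 1.0 / np.sqrt(1.0 + t * t)
    return c, c * t

def triangularize_block(block):
    # Bring one thread's block of rows into upper-triangular form,
    # zeroing subdiagonal entries column by column, bottom to top.
    R = np.array(block, dtype=float)
    k, n = R.shape
    for j in range(min(k - 1, n)):
        for i in range(k - 1, j, -1):
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], j:] = G @ R[[i - 1, i], j:]  # rotate two rows
    return R

# Example: a 5 x 3 block becomes upper trapezoidal (zeros below the diagonal).
block = np.random.randn(5, 3)
print(np.round(triangularize_block(block), 3))
```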

All approaches need time \(\mathcal {O}(MN^2)\), where M is the number of rows and N is the number of columns of the input \(\textbf{R}_0\) to post-processing. While the main part of FiGaRo works in time linear in the overall number of rows and columns of the input matrices (Theorem 1), post-processing is linear only in the number of rows and quadratic in the number of columns.

8 From \(\textbf{R}\) to \(\textbf{Q}\), SVD and PCA

The previous sections focus on computing the upper-triangular matrix \(\textbf{R}\in \mathbb {R}^{n\times n}\) in the QR decomposition of the matrix \(\textbf{A}\in \mathbb {R}^{m\times n}\) representing the join output with m tuples and n data attributes. Using \(\textbf{R}\) and the non-materialized join matrix \(\textbf{A}\), this section shows how to compute the orthogonal matrix \(\textbf{Q}\) in the QR decomposition of \(\textbf{A}\), the singular value decomposition of \(\textbf{A}\), and the principal component analysis of \(\textbf{A}\).

8.1 The orthogonal matrix \(\textbf{Q}\)

Using \(\textbf{A} = \textbf{Q} \textbf{R}\), we can compute the orthogonal matrix \(\textbf{Q}\in \mathbb {R}^{m\times n}\) as follows: \(\textbf{Q} = \textbf{A} \textbf{R}^{-1}\). The upper-triangular matrix \(\textbf{R}\) admits an inverse, which is also upper triangular, provided the values along its diagonal are nonzero. To compute \(\textbf{Q}\), we do not need to materialize \(\textbf{A}\)! Each row in \(\textbf{A}\) represents one tuple in the join result. We can enumerate the rows in \(\textbf{A}\) without materializing \(\textbf{A}\) using factorization techniques from prior work [42, 44]. This enumeration has delay constant in the size of the input database and linear in the number n of data attributes. The delay is the maximum of three times: the time to output the first tuple, the time between outputting one tuple and the next, and the time to finish the enumeration after the last tuple has been output. Before the enumeration starts, we calibrate the input relations by removing the dangling tuples, i.e., tuples that do not contribute to any output tuple. For acyclic joins (as considered in this article), this calibration takes time linear in the size of the input database [1]. To enumerate, we keep an iterator over the tuples of each relation. We construct an output tuple in a top-down traversal of the join tree, where for each tuple at the iterator of a relation P we consider the possible matching tuples at the iterators of the relations that are children of P in the join tree.

Let \(\textbf{R}^{-1} = [\textbf{r}_1 \cdots \textbf{r}_n]\), where \(\textbf{r}_1, \ldots , \textbf{r}_n\) are column vectors. Let the rows in \(\textbf{A}\) be \(\textbf{t}_1,\ldots , \textbf{t}_m\). The cell \(\textbf{Q}(i,j)\) is the dot product \(\langle \textbf{t}_i,\textbf{r}_j\rangle \). Once \(\textbf{R}^{-1}\) is computed in \(O(n^3)\) time, we can enumerate the cells in \(\textbf{Q}\) with delay constant in the size of the input database and linear in the number n of data attributes. Constructing the entire matrix \(\textbf{Q}\) then takes \(O(mn^2)\) time.
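
As a quick sanity check of the identity \(\textbf{Q} = \textbf{A}\textbf{R}^{-1}\), the following toy sketch (NumPy, with a small random matrix standing in for the join matrix) recovers the orthogonal factor from \(\textbf{R}\); it materializes \(\textbf{A}\) only for illustration, unlike the factorized computation described above.

```python
import numpy as np

A = np.random.randn(8, 3)                # toy stand-in for the join matrix
Q_ref, R = np.linalg.qr(A)               # thin QR: Q_ref is 8x3, R is 3x3
Q = A @ np.linalg.inv(R)                 # the identity Q = A R^{-1}
print(np.allclose(Q, Q_ref))             # True up to floating-point error
print(np.allclose(Q.T @ Q, np.eye(3)))   # Q has orthonormal columns
```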

We further use an optimization that pushes the computation of \(\textbf{Q}\) past the enumeration of the join output to reduce the complexity from \(O(mn^2)\) to \(O(mn\ell )\), where \(\ell \) is the number of relations in the database. The idea is to compute dot products of each tuple in each relation with selected fragments of each column vector in \(\textbf{R}^{-1}\). Such dot products can then be reused in the computation of many cells in \(\textbf{Q}\), as explained next.

We assign a unique index \(i\in [n]\) to each of the n data attributes. Let I be the set of indices of the data attributes in an input relation. For each tuple \(\textbf{s}\) in that relation, we compute the dot products \(\langle \textbf{s}(I), \textbf{r}_1(I)\rangle ,\ldots ,\) \(\langle \textbf{s}(I), \textbf{r}_n(I)\rangle \), where \(\textbf{s}(I)\) is the vector of the values for data attributes in \(\textbf{s}\) and \(\textbf{r}_i(I)\) is the vector of those values in \(\textbf{r}_i\) at indices in I. Let us denote these dot products computed using \(\textbf{s}\) as \(\textbf{s}.d_1,\ldots ,\textbf{s}.d_n\). The cell (i, j) in \(\textbf{Q}\) is then the sum of the j-th dot products for the input tuples \(\textbf{s}_1,\ldots ,\textbf{s}_\ell \) that make up the i-th output tuple in the enumeration: \(\textbf{Q}(i,j) = \sum _{k\in [\ell ]}\textbf{s}_k.d_j\).
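
A minimal sketch of this pushed-down computation for a single binary join is shown below; the relation layout (a shared key followed by data columns) and the function name q_rows_pushed are our own illustration and not part of FiGaRo's implementation.

```python
import numpy as np
from collections import defaultdict

def q_rows_pushed(S, T, R_inv):
    """Enumerate the rows of Q = A @ R_inv where A is the natural join of
    S and T on a single key. S and T are lists of (key, data_vector); the
    data columns of the join output are those of S followed by those of T."""
    nS = len(S[0][1])
    # Per-tuple partial dot products with the matching fragment of R_inv.
    dS = [y @ R_inv[:nS, :] for (_, y) in S]
    dT = [y @ R_inv[nS:, :] for (_, y) in T]
    t_by_key = defaultdict(list)
    for j, (key, _) in enumerate(T):
        t_by_key[key].append(j)
    for i, (key, _) in enumerate(S):        # enumerate the join output tuples
        for j in t_by_key[key]:
            yield dS[i] + dT[j]             # one row of Q per output tuple

# Toy check against the materialized join.
S = [(1, np.array([1.0, 2.0])), (2, np.array([3.0, 0.5]))]
T = [(1, np.array([4.0])), (1, np.array([5.0])), (2, np.array([6.0]))]
A = np.array([[1, 2, 4], [1, 2, 5], [3, 0.5, 6.0]])   # join of S and T
R_inv = np.linalg.inv(np.linalg.qr(A)[1])
print(np.allclose(list(q_rows_pushed(S, T, R_inv)), A @ R_inv))  # True
```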

In case of categorical data attributes, such dot products can be performed even faster. Assume a data attribute X with k distinct categories, and a tuple \(\textbf{s}\) that has at position i the \(\ell \)-th value of X. If we were to one-hot encode X in \(\textbf{s}\), then \(\textbf{s}\) would include a vector \(\textbf{v}\) of k values in place of this category, where \(\textbf{v}(\ell )=1\) and \(\textbf{v}(\ell ')=0\) for \(\ell '\in [k], \ell '\ne \ell \). Given a k-dimensional vector \(\textbf{r}\), which is part of a vector in \(\textbf{R}^{-1}\), the dot product \(\langle \textbf{v}, \textbf{r} \rangle \) is then \(\textbf{r}(\ell )\). This can be achieved even without an explicit one-hot encoding of X; the index \(\ell \) suffices.
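
The effect for a single categorical attribute can be illustrated as follows (a toy example; the variable names are ours).

```python
import numpy as np

k, ell = 4, 2                         # attribute X has 4 categories; s holds the 3rd
r = np.array([0.7, -1.2, 0.4, 2.5])   # the k-dimensional fragment of a column of R^{-1}

v = np.zeros(k); v[ell] = 1.0         # explicit one-hot encoding of the category
print(v @ r, r[ell])                  # both give 0.4: the index ell suffices
```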

8.2 Singular value decomposition

The singular value decomposition of \(\textbf{A}\in \mathbb {R}^{m\times n}\) is given by \(\textbf{A} = \textbf{U} \mathbf {\Sigma }\textbf{V}^\textsf{T}\), where \(\textbf{U}\in \mathbb {R}^{m\times n}\) is the orthogonal matrix of left singular vectors, \(\mathbf {\Sigma }\in \mathbb {R}^{n\times n}\) is the diagonal matrix with the singular values along the diagonal, and \(\textbf{V}\in \mathbb {R}^{n\times n}\) is the orthogonal matrix of right singular vectors. We can compute the SVD of \(\textbf{A}\) without materializing \(\textbf{A}\) using FiGaRo as follows.

The first step computes the upper triangular matrix \(\textbf{R}\in \mathbb {R}^{n\times n}\) in the QR decomposition of \(\textbf{A}\): \(\textbf{A} = \textbf{Q} \textbf{R}\).

The second step computes the SVD of \(\textbf{R}\) as \(\textbf{R} = \textbf{U}_R \mathbf {\Sigma }_R \textbf{V}_R^\textsf{T}\), where \(\textbf{U}_R, \mathbf {\Sigma }_R, \textbf{V}_R\in \mathbb {R}^{n\times n}\). There are several off-the-shelf SVD algorithms [11]; we used the divide-and-conquer algorithm [29] as it was numerically the most accurate in our experiments. Note that computing the SVD takes time \(O(n^3)\) for \(\textbf{R}\in \mathbb {R}^{n\times n}\) and \(O(mn^2)\) for \(\textbf{A}\in \mathbb {R}^{m\times n}\). The former is much less than the latter, since n is the number of data columns in the database, whereas \(m \gg n\) is the size of the join output.

The SVD of \(\textbf{A}\) can then be computed using the SVD of \(\textbf{R}\) as follows: \(\textbf{A} = \textbf{U} \mathbf {\Sigma }\textbf{V}^\textsf{T}= \textbf{Q} \textbf{R} = \textbf{Q} \textbf{U}_R \mathbf {\Sigma }_R \textbf{V}_R^\textsf{T}\), where \(\textbf{U} = \textbf{Q} \textbf{U}_R\) is orthogonal as it is the multiplication of two orthogonal matrices, \(\mathbf {\Sigma }= \mathbf {\Sigma }_R\), and \(\textbf{V} = \textbf{V}_R\). This means that the singular values of \(\textbf{R}\), as given by \(\mathbf {\Sigma }_R\), are also the singular values of \(\textbf{A}\). The right singular vectors of \(\textbf{A}\) are also those of \(\textbf{R}\), as given by \(\textbf{V}_R\).

The third and final step computes the left singular vectors of \(\textbf{A}\):

$$\begin{aligned} \textbf{U}&= \textbf{Q} \textbf{U}_R = \textbf{A} \textbf{R}^{-1} \textbf{U}_R = \textbf{A} (\textbf{U}_R \mathbf {\Sigma }_R \textbf{V}_R^\textsf{T})^{-1} \textbf{U}_R \\&= \textbf{A} \textbf{V}_R \mathbf {\Sigma }_R^{-1} \textbf{U}_R^{-1} \textbf{U}_R = \textbf{A} \textbf{V}_R \mathbf {\Sigma }_R^{-1}, \end{aligned}$$

where we use that \(\textbf{U}_R^{-1} \textbf{U}_R = \textbf{I}_n\). The matrices \(\textbf{V}_R\) and \(\mathbf {\Sigma }_R\) are as computed in the previous step. The inverse \(\mathbf {\Sigma }_R^{-1}\) of the diagonal matrix \(\mathbf {\Sigma }_R\) is a diagonal matrix with the diagonal values \(\sigma _{i,i}^{-1}\) corresponding to the diagonal values \(\sigma _{i,i}\) in \(\mathbf {\Sigma }_R\). The multiplication \(\textbf{S} = \textbf{V}_R \mathbf {\Sigma }_R^{-1}\) takes time \(O(n^2)\). We are then left to perform the multiplication of the non-materialized matrix \(\textbf{A}\) with the \(n\times n\) matrix \(\textbf{S}\), for which we proceed as for computing \(\textbf{Q}\) in Sect. 8.1.
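
The three steps can be summarized in a short NumPy sketch; it materializes a small random \(\textbf{A}\) only to validate the algebra, whereas FiGaRo obtains \(\textbf{R}\) without materializing \(\textbf{A}\) and performs the final product in factorized form.

```python
import numpy as np

A = np.random.randn(50, 4)                  # toy stand-in for the join matrix
R = np.linalg.qr(A)[1]                      # step 1: R from the QR decomposition
U_R, sigma, V_R_T = np.linalg.svd(R)        # step 2: SVD of the small n x n matrix R
S = V_R_T.T @ np.diag(1.0 / sigma)          # S = V_R Sigma_R^{-1}
U = A @ S                                   # step 3: left singular vectors of A
# A = U Sigma V^T with Sigma = Sigma_R and V = V_R:
print(np.allclose(A, U @ np.diag(sigma) @ V_R_T))   # True
print(np.allclose(U.T @ U, np.eye(4)))               # U has orthonormal columns
```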

8.3 Principal component analysis

Following standard derivations, we can compute the principal components (PCs) of \(\textbf{A}\) using the computation of the SVD of \(\textbf{A}\) explained in Sect. 8.2.

We compute the eigenvalues and eigenvectors of

$$\begin{aligned}\textbf{A}^\textsf{T}\textbf{A} = (\textbf{U} \mathbf {\Sigma }\textbf{V}^\textsf{T})^\textsf{T}\textbf{U} \mathbf {\Sigma }\textbf{V}^\textsf{T}= \textbf{V} \mathbf {\Sigma }\textbf{U}^\textsf{T}\textbf{U} \mathbf {\Sigma }\textbf{V}^\textsf{T}= \textbf{V} \mathbf {\Sigma }^2 \textbf{V}^\textsf{T}\end{aligned}$$

The sought eigenvalues and their corresponding eigenvectors are the squares of the singular values and, respectively, the right singular vectors of \(\textbf{A}\), or equivalently of the upper-triangular matrix \(\textbf{R}\) computed by FiGaRo. The PCs of \(\textbf{A}\) are these right singular vectors in \(\textbf{V} = \textbf{V}_R\). It takes \(O(n^3)\) to compute all these PCs from \(\textbf{R}\); in contrast, it takes \(O(mn^2)\) to compute them directly from \(\textbf{A}\).

A truncated SVD of \(\textbf{A}\) is \(\textbf{U}_{:, 1:k} \mathbf {\Sigma }_{1:k, 1:k} \textbf{V}_{:,1:k}^\textsf{T}\), where we only keep the top-k largest singular values and their corresponding left and right singular vectors. This defines a k-dimensional linear projection of \(\textbf{A}\):

$$\begin{aligned}\textbf{A} \textbf{V}_{:,1:k} = \textbf{U}_{:, 1:k} \mathbf {\Sigma }_{1:k, 1:k} \textbf{V}_{:,1:k}^\textsf{T}\textbf{V}_{:,1:k} = \textbf{U}_{:, 1:k} \mathbf {\Sigma }_{1:k, 1:k}.\end{aligned}$$

Among all possible sets of k n-dimensional vectors, the vectors in \(\textbf{V}_{:,1:k}\), which are the top-k PCs of \(\textbf{A}\), preserve the maximal variance in \(\textbf{A}\). Equivalently, they induce the lowest error \(||\textbf{A} - \textbf{A}\textbf{V}_{:,1:k}\textbf{V}_{:,1:k}^\textsf{T}||_2\) for reconstructing \(\textbf{A}\) as a k-dimensional linear projection (Eckart-Young-Mirsky theorem [17, 39]). Applications of PCA require the top-k PCs for some value of k and sometimes also the k-dimensional linear projection of \(\textbf{A}\).
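
A compact sketch of PCA via \(\textbf{R}\) is given below (again with a small materialized matrix standing in for the join output, only to check the algebra against a direct SVD; signs of individual singular vectors may differ between the two computations).

```python
import numpy as np

A = np.random.randn(200, 6)                 # toy stand-in for the join matrix
k = 2
R = np.linalg.qr(A)[1]
_, sigma, V_T = np.linalg.svd(R)            # singular values / right vectors of R = those of A
top_k_pcs = V_T[:k].T                       # the top-k principal components (n x k)
projection = A @ top_k_pcs                  # k-dimensional projection = U_{:,1:k} Sigma_{1:k,1:k}

# Sanity check against the SVD of the materialized A.
U_A, s_A, V_A_T = np.linalg.svd(A, full_matrices=False)
print(np.allclose(sigma, s_A))              # same singular values
print(np.allclose(np.abs(projection), np.abs(U_A[:, :k] * s_A[:k])))  # same up to sign
```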

9 Experiments

We evaluate the runtime performance and accuracy of FiGaRo against Intel MKL 2021.2.0 using the MKL C++ API (MKL called from numpy is slower) and our custom implementations. We also benchmarked OpenBLAS 0.13.3 called from numpy 1.22.0 built from source, but it was consistently slower (1.5x) than MKL, so we do not report its performance further. Both MKL and OpenBLAS implement the Householder algorithm for the QR decomposition of dense matrices [31]. We also benchmarked THIN as a standalone algorithm to compute \(\textbf{R}\) (Sect. 7).

We use the following naming conventions: FiGaRo-THIN and FiGaRo-MKL are FiGaRo with THIN and MKL post-processing, respectively. In the plots, we also precede the system names by SVD or PCA to denote their versions that compute SVD or PCA respectively. FiGaRo and THIN use row-major order for storing matrices, while MKL uses column-major order; we verified that they perform best for these orders.

Table 1 Characteristics of the datasets (original) and their one-hot encodings (OHE)

Experimental Setup. All experiments were performed on an Intel Xeon Silver 4214 (24 physical/48 logical cores, 1GHz, 188GiB) with Debian GNU/Linux 10. We use g++ 10.1 for compiling the C++ code with the -Ofast optimization flag. The performance numbers are averages over 20 consecutive runs. We do not consider the time to load the database into RAM and assume that all relations and the join output are sorted by their join attributes. All systems use all cores in all experiments apart from Experiment 2.

Datasets. We use three datasets. Retailer (R) [51] and Favorita (F) [19] are used for forecasting user demands and sales. Yelp (Y) [60] has review ratings given by users to businesses. The characteristics of these datasets (Table 1) are common in retail and advertising, where data is generated by sales transactions or click streams. Retailer has a snowflake schema; Favorita has a star schema. Both have key-fkey joins, a large fact table, and several small dimension tables. Yelp has a star schema with many-to-many joins. We also consider one-hot encoded versions of 1% of these datasets (OHE), where some keys are one-hot encoded into new data columns; these yield wider matrices. Some plots show performance for a percentage of the join output. The corresponding input dataset is computed by projecting the fragment of the join output onto the schemas of the relations. When taking a percentage of an OHE dataset, we map an attribute domain to that percentage of it using division hashing.

We also use synthetic datasets of relations \(\textbf{S},\textbf{T} \in {\mathbb {R}}^{ m \times n}\), whose join is the Cartesian product. The data in each column follows a uniform distribution in the range \([-3, 3)\). For accuracy experiments, we fix (part of) the output \(\textbf{R}_{\textit{fixed}}\) and derive the input relation \(\textbf{S}\) so that the QR decomposition of the Cartesian product agrees with \(\textbf{R}_{\textit{fixed}}\). The advantage of this approach is that \(\textbf{R}_{\textit{fixed}}\) can be effectively used as ground truth for checking the accuracy of the algorithms.

9.1 Summary

The main takeaway of our experiments is that FiGaRo significantly outperforms its competitors in runtime for computing QR, SVD, and PCA, as well as in accuracy for computing \(\textbf{R}\), by a factor proportional to the gap between the join output and input sizes. This factor is up to two orders of magnitude. This includes scenarios with key-fkey joins over real data with millions of rows and tens of data columns and with many-to-many joins over synthetic data with thousands of rows and columns. Whenever the join input and output sizes are very close, however, FiGaRo ’s benefit is small; we verify this by reducing the number of rows and increasing the number of columns of our real data. When the number of data columns is at least the square root of the number of rows, e.g., 1K columns and 1 M rows in our experiments, the performance gap almost vanishes. This is to be expected, since the complexity of computing the matrix \(\textbf{R}\) increases linearly with the number of rows but quadratically with the number of columns.

We further show that FiGaRo ’s performance depends on the join tree. Also, the accuracy of the orthogonal matrices \(\textbf{Q}\) and \(\textbf{U}\) computed by FiGaRo depends on the condition number of the datasets and can be better or worse than MKL.

Yet there is no single flavour of FiGaRo that performs best in all our experiments. In case of thin matrices, i.e., with fewer than 50 data columns in the join output, or for very sparse matrices, such as those obtained by one-hot encoding, FiGaRo-THIN is best, as its QR post-processing phase can effectively exploit the sparsity in the result of FiGaRo's Givens rotations. For dense and wide matrices, i.e., with hundreds or thousands of data columns, FiGaRo-MKL is best, as its QR post-processing phase uses MKL, which parallelizes better than THIN. Therefore, the following experiments primarily focus on the performance (in both runtime and accuracy) of the better of the two algorithms, with brief notes on the performance of the other. This means that we use FiGaRo-MKL for the experiments with the synthetic datasets and FiGaRo-THIN for the real-world datasets and their OHE versions.

FiGaRo is the only algorithm that works directly on the input database; all others work on the materialized join output. For FiGaRo, we report the time to compute the join and the matrix decompositions together, as they are intertwined, whereas for the others we only report the time to compute the matrix decompositions over the precomputed join matrix. Table 1 gives the times for materializing the join for the three datasets; these join times are typically larger than the MKL compute times.

9.2 QR decomposition

We first consider the task of computing the upper-triangular matrix \(\textbf{R}\) only. After that, we investigate the computation of the orthogonal matrix \(\textbf{Q}\).

Fig. 5 Exp. 1: Runtime performance for computing \(\textbf{R}\) in the QR decomposition over the original datasets R, F, Y (top) and their OHE versions (bottom)

Experiment 1: Runtime performance for computing \(\textbf{R}\). When compared to its competitors, FiGaRo performs very well for real data in case the join output is larger than its input. Figure 5 (top) gives the runtime performance of FiGaRo-THIN, THIN, and MKL for the three real datasets as a function of the percentage of the dataset. The performance gap remains mostly unchanged as the data size is increased. Relative to MKL, FiGaRo-THIN is 2.9x faster for Retailer, 16.1x for Favorita and 120.5x for Yelp. MKL outperforms THIN, except for Favorita where the latter is 4.3x faster than the former. This is because Favorita has the smallest number of columns amongst the three datasets and THIN works particularly well on thin matrices. We do not show FiGaRo-MKL to avoid clutter; it consistently performs worse (up to 3x) than FiGaRo-THIN.

We next consider the OHE versions of Retailer and Favorita, for which the join input and output sizes are close. Figure 5 (bottom) gives the runtime performance of FiGaRo-THIN, THIN, and MKL for the one-hot encoded fragments of the three real datasets as a function of the percentage of one-hot encoded columns. FiGaRo's strategy of pushing computation past joins is less beneficial in this case, as it copies heads and tails along the join tree without a significant optimization benefit. Relative to MKL, it is 2.7x faster for Retailer and 3.1x faster for Favorita. The explanation for this speed-up is elsewhere: Whereas the speed-up in Fig. 5 (top) is due to structural sparsity, as enforced by the joins, here we have value sparsity due to the many zeros introduced by one-hot encoding. By sorting on the one-hot encoded attribute and allocating blocks with the same attribute value to each thread, we ensure that large blocks of zeros in the one-hot encoding will not be dirtied by the rotations performed by the thread. This effectively preserves many zeros from the input and reduces the number of rotations needed in post-processing to zero values in the output of FiGaRo. This strategy, however, needs to be supported by a good join tree: a relation with one-hot encoded attributes is sorted on these attributes, and this order must be a prefix of the sort order used for its join with its parent relation in the join tree.

Fig. 6 Exp. 1: Runtime performance of FiGaRo-MKL and MKL for computing \(\textbf{R}\) in the QR decomposition of the Cartesian product of two relations. The numbers of rows and columns are per relation; for relations of \(2^{13}\) rows and \(2^{12}\) columns, MKL’s input is a \(2^{26}\times 2^{13}\) matrix. Left: Runtime performance of FiGaRo-MKL (sec). Right: Speed-up of FiGaRo-MKL over MKL (rounded to the closest natural number). An empty cell means that MKL runs out of memory

We further verified that more input rows allowed FiGaRo to scale better, but MKL ran out of memory. THIN and MKL have similar performance. For the Yelp OHE dataset, its size remains much less than its join result and FiGaRo outperforms MKL by 11x.

Pronounced benefits are obtained for many-to-many joins, for which the join output is much larger than the input. We verified this claim for the Cartesian products of two relations of thousands of rows and columns. FiGaRo outperforms MKL by up to three orders of magnitude on this synthetic data (Fig. 6). As expected, FiGaRo scales linearly with the number of rows, while MKL scales quadratically as it works on the materialized Cartesian product. The speed-up of FiGaRo over MKL increases as we increase the number of rows and columns of the two relations. For wide relations (\(2^{12}\) columns), FiGaRo-MKL (shown in figure) is up to 10x faster than FiGaRo-THIN (not shown).

Most of the time for FiGaRo is taken by post-processing. Computing the batch of group-by aggregates and the heads and tails of the input relations takes under 10% of the overall time for the OHE datasets, and about 50% for the original datasets.

Fig. 7 Exp. 2: Speed-up of multi-threading over single-threading for FiGaRo-THIN computation of \(\textbf{R}\) in the QR decomposition over the three datasets

Experiment 2: Multi-cores scalability. FiGaRo uses domain parallelism: It splits each input relation into as many contiguous blocks as available threads and applies the same transformation to each block independently. Figure 7 shows the performance of FiGaRo-THIN as we vary the number of threads up to the number of available logical cores (48). The fastest speed-up increase is achieved up to 8 threads for all datasets and up to 16 threads for Retailer and Favorita. The speed-up still increases up to 48 threads, yet its slope gets smaller. This is in particular the case for the smallest dataset Yelp, for which FiGaRo-THIN only takes under 0.5 s using 8 threads.

Table 2 Exp. 3: FiGaRo-THIN’s runtime to compute \(\textbf{R}\) in the QR decomposition using a bad join tree and relative speed-up using a good join tree. The join trees use term notation over the abbreviated relation names: Retailer: Inventory(I), Item(T), Weather(W), Location(L), Census(C); Favorita: Sales(S), Stores(R), Oil(O), Holidays(H), Items(I), Transactions(T); Yelp: Business(B), Category(C), CheckIn(I), User(U), Hours(H), Review(R). The relation with OHE values is in bold

Experiment 3: Effect of join trees. The runtime of FiGaRo is influenced by the join tree. As in classical query optimization, FiGaRo prefers join trees for which the overall size of the intermediate results is smallest. For our datasets, this translates to having the large fact tables involved earlier in the joins. We explain this for Retailer, whose large and narrow fact table Inventory uses key-fkey joins on composite keys (item id, date, location) to join with the small dimension tables Locations, Weather, Item, and Census. FiGaRo aggregates away join keys as it moves from the leaves to the root of the join tree. By aggregating away join keys over the large table as soon as possible, it creates small intermediate results and reduces the number of copies of dimension-table tuples in the intermediate results to pair with tuples in the fact table. If the fact table were the root, then FiGaRo would first join it with its children in the join tree and then apply transformations, with no benefit over MKL, as in both cases we would work on the materialized join output. Therefore, a bad join tree for Retailer has Inventory (I) as the root. In contrast, a good join tree first joins Inventory with Weather and Item, thereby aggregating away the item id and date keys and reducing the size of the intermediate result to be joined with Census and Location. As shown in Table 2, FiGaRo-THIN performs 8.06x faster using a good join tree instead of a bad one.

For the OHE datasets, a key performance differentiator is the sorting order of the relations with one-hot encoded attributes, as explained in Experiment 1. For instance, the relation Inventory in the Retailer dataset (printed in bold in Table 2) has the one-hot encoded attribute product id. In a bad join tree, it is not sorted on this attribute: it joins with its parent relation Location on location id, so it is primarily sorted on location id. In a good join tree, Inventory is sorted on product id as it joins with its parent Item on product id.

Fig. 8 Exp. 4: Runtime performance for computing the fully materialized \(\textbf{Q}\) for the original datasets R, F, Y (top) and their OHE versions (bottom)

Experiment 4: Runtime performance for computing \(\textbf{Q}\). FiGaRo outperforms MKL for computing the orthogonal matrix \(\textbf{Q}\) (Fig. 8). For this experiment, FiGaRo takes as input the already computed \(\textbf{R}\), while MKL takes as input the precomputed Householder vectors, which are also based on \(\textbf{R}\). The speed-up is 5.4x for Retailer, 3.7x for Favorita, and 13x for Yelp (Fig. 8 top). As we linearly increase the join matrix size, the runtimes of FiGaRo and MKL increase linearly. For the OHE datasets, the speed-up is 1.9–6.7x for Retailer, 1.9–10.5x for Favorita, and 29–47x for Yelp (Fig. 8 bottom).

Fig. 9 Exp. 5: Runtime performance for end-to-end QR decomposition for the original datasets R, F, Y (top) and their OHE versions (bottom)

Experiment 5: Runtime performance for end-to-end QR decomposition. To complete the picture, Fig. 9 reports the performance for computing both \(\textbf{R}\) and \(\textbf{Q}\). The input to FiGaRo-THIN is the database and to MKL is the materialized join output. As expected, FiGaRo-THIN clearly outperforms MKL for all considered datasets.

Experiment 6: Accuracy of computed \(\textbf{R}\). To assess the accuracy of computing \(\textbf{R}\), we designed a synthetic dataset consisting of two relations and want to compute the upper-triangular \(\textbf{R}\) in the QR decomposition of their Cartesian product. The input relations are defined based on a given matrix \(\textbf{R}_{\textit{fixed}}\) such that \(\textbf{R}_{\textit{fixed}}\) is part of \(\textbf{R}\). The construction is detailed in App. A. We report the relative error \(\frac{{\Vert {\textbf{R}_{\textit{fixed}} - \hat{\textbf{R}}_{\textit{fixed}}} \Vert }_{\textrm{F}}}{{\Vert {\textbf{R}_{\textit{fixed}}} \Vert }_{\textrm{F}}}\) of the computed partial result \(\hat{\textbf{R}}_{\textit{fixed}}\) compared to the ground truth \(\textbf{R}_{\textit{fixed}}\), where \({\Vert {\cdot } \Vert }_{\textrm{F}}\) is the Frobenius norm.

Table 3 Exp. 6: (Left) Error of FiGaRo-MKL relative to the ground truth. (Right) The relative error of MKL divided by the relative error of FiGaRo-MKL. The empty bottom-right cell is due to the out-of-memory error for MKL

Table 3 (left) shows the error of FiGaRo-MKL as we vary the number of rows and columns of the two relations following a geometric progression. As the number of rows increases, the accuracy changes only slightly. The accuracy drops as the number of columns increases. The error remains, however, close to machine precision (\(10^{-16}\)). We also verified that FiGaRo-THIN’s error is very similar.

Fig. 10 Exp. 7: Orthogonality error of \(\textbf{Q}\) computed by both FiGaRo variants and MKL for the three datasets (top) and their OHE versions (bottom)

We also computed the relative error for MKL. Table 3 (right) shows the relative error of MKL divided by that of FiGaRo-MKL. A number greater than 1 means that FiGaRo-MKL is more accurate than MKL. The error gap increases with the number of rows and decreases with the number of columns. The latter is as expected, as post-processing dominates the computation for wide matrices. FiGaRo-MKL is up to three orders of magnitude more accurate than MKL.

Experiment 7: Accuracy of computed \(\textbf{Q}\). The accuracy of computing \(\textbf{Q}\in {\mathbb {R}}^{ m \times n}\) is given by the orthogonality error \(\frac{{\Vert {{\textbf{Q}^{\textsf{T}}} \textbf{Q} - \textbf{I}} \Vert }_{\textrm{F}}}{{\Vert {\textbf{I}} \Vert }_{\textrm{F}}}\), where \(\textbf{I} \in {\mathbb {R}}^{ n \times n}\) is the identity matrix and \({\Vert {\cdot } \Vert }_{\textrm{F}}\) is the Frobenius norm [30, p. 360]. An orthogonality error of 0 is the ideal outcome; it means that \(\textbf{Q}\) is perfectly orthogonal.
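
For reference, this metric can be computed as follows (a NumPy sketch; the function name is ours).

```python
import numpy as np

def orthogonality_error(Q):
    # || Q^T Q - I ||_F / || I ||_F ; 0 means Q has perfectly orthonormal columns.
    n = Q.shape[1]
    I = np.eye(n)
    return np.linalg.norm(Q.T @ Q - I, 'fro') / np.linalg.norm(I, 'fro')

# Example: the Q factor of a small random matrix is orthogonal up to rounding.
print(orthogonality_error(np.linalg.qr(np.random.randn(100, 5))[0]))  # ~1e-16
```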

Figure 10 depicts the orthogonality errors of the matrices \(\textbf{Q}\) constructed by FiGaRo-MKL, FiGaRo-THIN and MKL. We disregard FiGaRo-THIN in the following analysis, as it is less accurate than FiGaRo-MKL in this experiment. This is as expected, as the THIN post-processing is far less optimized with respect to numerical stability than MKL. We see that FiGaRo-MKL is more accurate for Retailer (one order of magnitude) and Favorita (two orders of magnitude), but less accurate for Yelp (one order of magnitude). The reason is that the condition number [30] of the join matrix \(\textbf{A}\) influences the orthogonality of \(\textbf{Q}\) computed by FiGaRo. The condition number is the largest singular value of \(\textbf{A}\) divided by the smallest singular value of \(\textbf{A}\). A very large condition number means that the smallest singular value is very close to 0, so the matrix is almost singular [14, 46]. If this is the case for \(\textbf{A}\), the same holds for the matrix \(\textbf{R}\) of its QR decomposition, as they have the same singular values. Recall that FiGaRo inverts \(\textbf{R}\) to compute \(\textbf{Q}\) (Sect. 8); this step may introduce larger rounding errors when \(\textbf{R}\) is close to being singular.

Table 4 Exp. 7: Condition numbers and smallest singular values for the three datasets and their OHE versions
Table 5 Exp. 7: (Left) Orthogonality error of \(\textbf{Q}\) for FiGaRo-MKL over the Cartesian product. (Right) Division of orthogonality error of MKL by the orthogonality error of FiGaRo-MKL. The right-bottom cell is empty as MKL ran out of memory

Table 4 gives the condition numbers and smallest singular values for the join matrix \(\textbf{A}\) for each of our three datasets and their OHE versions. Since Favorita has the smallest condition number, the orthogonality error of \(\textbf{Q}\) computed by FiGaRo-MKL is very low, i.e., \(\textbf{Q}\) is very close to an orthogonal matrix. In contrast, Yelp has the largest condition number and FiGaRo produces a less orthogonal matrix \(\textbf{Q}\) for this dataset.

Table 4 also shows that the condition numbers are much larger for the OHE versions of Retailer and Favorita (by a factor of 100x), while the condition number remains almost the same for the OHE version of Yelp. We therefore expect less accurate matrices \(\textbf{Q}\) computed by FiGaRo-MKL. This is indeed the case, as shown in Fig. 10 (bottom). Remarkably, MKL improves the orthogonality of \(\textbf{Q}\) in case of the OHE datasets relative to the original datasets, which suggests that it can effectively exploit the sparsity due to the one-hot encoding to avoid accumulating too many rounding errors.

Table 5 gives the orthogonality error of \(\textbf{Q}\) for the Cartesian product of two relations as we vary their numbers of rows and columns. The condition numbers, the smallest singular values, and even the orthogonality vary in the same range as for Retailer original and OHE in Table 4. The orthogonality error for \(\textbf{Q}\) computed by FiGaRo-MKL remains rather low and is 1–100x lower than for MKL (Table 5 right). The gap between the two systems closes as we increase the number of columns.

9.3 Singular value decomposition

We extended FiGaRo to compute the SVD of the join matrix \(\textbf{A}\) as detailed in Sect. 8. We call this extension SVD-Fig. It works directly on the input database. We compare it against SVD-MKL, the divide-and-conquer approach of MKL that computes the QR decomposition of \(\textbf{A}\) and then bidiagonalizes the upper-triangular matrix \(\textbf{R}\). When only singular values are computed, all methods use the dqds algorithm [20] to compute these values from the intermediate bidiagonal matrices. We also experimented with the QR iteration, the power iteration, and the eigendecomposition approaches. QR iteration performs similarly to SVD-MKL, although for the OHE datasets it is one order of magnitude less accurate, where accuracy is measured as the orthogonality of the matrix \(\textbf{U}\). Power iteration is at least one order of magnitude slower than SVD-Fig for achieving comparable accuracy. In our tests, the orthogonality error of the eigendecomposition approach was much larger than for the other approaches.

Fig. 11 Exp. 8: Runtime performance of SVD-Fig and MKL over the original datasets R, F and Y. Top: Only singular values (\(\mathbf {\Sigma }\)) are computed. Bottom: The entire SVD is computed: \(\textbf{U}, \mathbf {\Sigma }, \textbf{V}\)

Fig. 12 Exp. 9: Orthogonality error for \(\textbf{U}\) computed by SVD-Fig and MKL for our datasets R, F and Y as we vary the percentage of non-truncated columns in \(\textbf{U}\). Top: Original datasets. Bottom: OHE versions

Experiment 8: Runtime performance for computing SVD. Fig. 11 (bottom) shows that SVD-Fig clearly outperforms SVD-MKL. They both scale linearly with the join output size, yet SVD-Fig is 5x faster than SVD-MKL for Retailer and Favorita and 16x for Yelp. When only the singular values are computed (Fig. 11 top), SVD-Fig outperforms SVD-MKL by factors 1–1.9x for Retailer, 2–4.5x for Favorita, and up to 90x for Yelp. The reason for the large gap for Yelp is twofold. The singular values of \(\textbf{A}\) are the same as those of the upper-triangular matrix \(\textbf{R}\) in the QR decomposition of \(\textbf{A}\). To compute \(\textbf{R}\), FiGaRo does not need the materialized join matrix \(\textbf{A}\), which is much larger than the Yelp input dataset. For the OHE datasets, SVD-Fig is 2–3x faster than SVD-MKL for Retailer, 2–5x for Favorita, and 16–22x for Yelp (not shown in the figure). These speed-ups are at the same scale as in Experiment 4.

Experiment 9: Accuracy of computed SVD. We first investigate the orthogonality error for the truncated matrix \(\textbf{U}\) of left singular vectors as we vary the percentage of non-truncated columns (Fig. 12). The reason we look into the accuracy of truncated \(\textbf{U}\) is twofold. First, this is the largest matrix in the SVD of \(\textbf{A}\). Second, its truncated version is used for PCA. When the percentage of non-truncated columns increases, the orthogonality error increases for SVD-Fig, whereas for MKL it remains roughly the same.

This is because the condition number of the join matrix \(\textbf{A}\) grows with the number of non-truncated columns.

Fig. 13 Exp. 9: Error of SVD-Fig relative to MKL for computing the top \(k-5\) to k per cent of the largest singular values of the join matrices for the original datasets

We see a similar behaviour for the accuracy of the computed singular values. Figure 13 reports the relative error of a sliding window of 5% of the singular values as computed by \({\textsc {SVD-Fig}} \) and by MKL, sliding over all singular values in decreasing order. For vectors \(\textbf{v}_1\) and \(\textbf{v}_2\) computed by \({\textsc {SVD-Fig}} \) and MKL, respectively, the relative error is \(\frac{{\Vert {\textbf{v}_1-\textbf{v}_2} \Vert }_{\textrm{F}}}{{\Vert {\textbf{v}_2} \Vert }_{\textrm{F}}}\). The difference between the singular values as computed by SVD-Fig and MKL tends to be small for the largest singular values and increases for smaller singular values. Also, the singular values computed by SVD-Fig are closer to the values computed by MKL for the join matrices with smaller condition numbers, such as Favorita, while for Retailer and Yelp they differ more. We observed similar behaviour for the OHE datasets.

We finally investigate the accuracy for the synthetic dataset, considering different percentages of truncated columns in the matrix \(\textbf{U}\). Table 6 reports the orthogonality error of the matrix \(\textbf{U}\) truncated to 40% of its columns. The matrix \(\textbf{U}\) computed by SVD-Fig is more orthogonal than the one computed by MKL; the reasoning is similar to that for the orthogonality of \(\textbf{Q}\) in Experiment 7. If we compute the entire matrix \(\textbf{U}\), then SVD-Fig incurs a higher orthogonality error than MKL due to the large condition numbers, as shown in Table 7.

Table 6 Exp. 9: (Left) Orthogonality error for \(\textbf{U}\) computed by SVD-Fig and truncated to 40% of columns. (Right) The orthogonality error of MKL divided by the orthogonality error of SVD-Fig; a result greater than one means that SVD-Fig is more accurate. The right-bottom cell is empty as MKL ran out of memory
Table 7 Exp. 9: (Left) Orthogonality error for the full \(\textbf{U}\) computed by SVD-Fig. (Right) The orthogonality error of MKL divided by the orthogonality error of SVD-Fig; a result greater than one means that SVD-Fig is more accurate. The right-bottom cell is empty as MKL ran out of memory

9.4 Principal component analysis

Section 8 shows how to compute PCA using the SVD. The top-k principal components of the join matrix \(\textbf{A}\) are the right singular vectors in \(\textbf{V}\) that correspond to the top-k largest singular values of \(\textbf{A}\). Both \(\textbf{V}\) and the singular values of \(\textbf{A}\) are also those of the upper-triangular matrix \(\textbf{R}\) computed by FiGaRo. The projection of \(\textbf{A}\) onto the k-dimensional space is given by \(\textbf{U}_{:,1:k} \mathbf {\Sigma }_{1:k,1:k}\), where \(\textbf{U}_{:,1:k}\) is the truncated matrix \(\textbf{U}\) of left singular vectors and the diagonal matrix \(\mathbf {\Sigma }_{1:k,1:k}\) has the k largest singular values along the diagonal. The findings of Experiment 9 on the accuracy of the computed truncated \(\textbf{U}\) and of the singular values carry over to PCA immediately.

Experiment 10: Runtime performance for computing PCA. Fig. 14 reports the runtimes to compute \(\textbf{U}_{:,1:k} \mathbf {\Sigma }_{1:k,1:k}\) for \(k= 10\%\) of the number of data columns, i.e., k is 4 for Retailer, 3 for Favorita, and 5 for Yelp. PCA-Fig is the FiGaRo adaptation to PCA; it takes the database as input. PCA-MKL is a custom implementation that uses MKL to compute the SVD from the join matrix \(\textbf{A}\) and then multiplies the truncated matrices \(\textbf{U}\) and \(\mathbf {\Sigma }\). For this experiment, we did not center the data. Figure 14 shows a consistent gap of one order of magnitude between the two systems. This is similar to the performance reported in Fig. 11, since the most expensive computation is the SVD.

Fig. 14 Exp. 10: Runtime performance of PCA-Fig and MKL for our datasets R, F, and Y

FiGaRo can compute the principal components corresponding to the largest singular values in our experiments faster than MKL and with similar accuracy. FiGaRo is thus a good alternative to MKL in case of matrices defined by joins over relational data.

10 Related work

Our work sits at the interface of linear algebra and databases and is the first to investigate QR decomposition over database joins using Givens rotations. It complements seminal seven-decades-old work on QR decomposition of matrices. Our motivation for this work is the emergence of machine learning algorithms and applications that are expressed using linear algebra and are computed over relational data.

Improving performance of linear algebra operations. Factorized computation [42] is a paradigm that uses structural compression to lower the cost of computing a variety of database and machine learning workloads. It has been used for the efficient computation of \(\textbf{A}^{\textbf{T}}\textbf{A}\) over training datasets \(\textbf{A}\) created by feature extraction queries, as used for learning a variety of linear and nonlinear regression models [33, 36, 40, 50, 51]. It is also used for linear algebra operations, such as matrix multiplication and element-wise matrix and scalar operations [9]. Our work also uses a factorized operator that multiplies the non-materialized matrix \(\textbf{A}\) with another, materialized matrix for the computation of the orthogonal matrix \(\textbf{Q}\) in the QR decomposition of \(\textbf{A}\) and for the SVD and PCA of \(\textbf{A}\). Besides structural compression, value-based compression is also useful for fast in-memory matrix–vector multiplication [18]. Existing techniques for matrix multiplication, as supported by open-source systems like SystemML [7], Spark [62], and Julia [3], exploit the sparsity of the matrices. This calls for estimating the sparsity, i.e., the number of zeroes [54].

The matrix layout can have a large impact on the performance of matrix computation in a distributed environment [38]. Distributed database systems that can support linear algebra computation [37] may represent an alternative to high-performance computing efforts such as ScaLAPACK [10]. There are distributed algorithms for QR decomposition in ScaLAPACK [8, 58].

QR decomposition. There are three main approaches to QR decomposition of a materialized matrix \(\textbf{A}\). The first one is based on Givens rotations [24], as also considered in our work. Each application of a Givens rotation to a matrix \(\textbf{A}\) sets one entry of \(\textbf{A}\) to 0. This method can be parallelized, as each rotation only affects two rows of \(\textbf{A}\). It is particularly efficient for sparse matrices. One form of sparsity is the presence of many zeros. We show its efficiency for another form of sparsity: the presence of repeating blocks whose one-off transformation can be reused multiple times.

Householder transformations [31] are particularly efficient for dense matrices [30, p. 366]. MKL and OpenBLAS, as used in our experiments, implement this method. One Householder transformation sets all but one entry of a vector to 0, so the QR decomposition of an \(m \times n\) matrix can be obtained using n transformations. Both Givens and Householder are numerically stable [30, Chapter 19].

The Gram-Schmidt process [28, 52] computes the orthogonal columns of \(\textbf{Q}\) one at a time by subtracting iteratively from the i-th column of \(\textbf{A}\) all of its projections onto the previously computed \(i-1\) orthogonal columns of \(\textbf{Q}\). A slightly modified variant [5] is numerically stable. The modified Gram-Schmidt is mathematically and numerically equivalent to applying Householder transformations to a matrix that is padded with zeros [6].

Implementing Givens rotations. There are several options on how to set the values c and s of a Givens rotation matrix in the presence of negative values. We follow the choice proposed by Bindel [4]. Anderson [2] defends the original choice [24] of signs using numerical stability arguments. Gentleman [23] shows that one can compute an upper-triangular matrix \(\textbf{R}'\) and a diagonal matrix \(\textbf{D}\) without computing square roots such that \(\textbf{R} = \textbf{D}^{\frac{1}{2}} \textbf{R}'\). Stewart [55, p. 291] discusses in depth these fast rotations.

SVD. The approach to computing the SVD that is closest to ours proceeds as follows [26, p. 285], [35]. It computes the matrices \(\textbf{R}\) and \(\textbf{Q}\) in the QR decomposition of \(\textbf{A}\), followed by the computation of the SVD of \(\textbf{R} = \textbf{U}_R \mathbf {\Sigma }_R \textbf{V}_R^\textsf{T}\). Finally, it computes the orthogonal matrix \(\textbf{U} = \textbf{Q} \textbf{U}_R\). The SVD of \(\textbf{A}\) is then \(\textbf{U} \mathbf {\Sigma }_R \textbf{V}_R^\textsf{T}\). Our approach does not compute \(\textbf{Q}\) but instead expresses it as a multiplication of the non-materialized \(\textbf{A}\) and the inverse of \(\textbf{R}\). Further approaches use Householder transformations to transform \(\textbf{A}\) into a bidiagonal matrix [26, p. 284], [25], or compute the eigenvalue decomposition of the matrix \(\textbf{A}^{\textsf{T}}\textbf{A}\) [26, p. 486], where the singular values are the square roots of the eigenvalues and the matrix \(\textbf{V}\) consists of the corresponding eigenvectors. All prior approaches require the input matrix \(\textbf{A}\) to be materialized. The existing approaches to computing an SVD of a bidiagonal matrix vary in terms of accuracy [11]. When only singular values are required, the square-root-free or dqds method is the most popular [20, 47]. The QR-iteration [13, 15] and divide-and-conquer [29] methods are used in case the left and right singular matrices are also required.

PCA. Principal component analysis is a technique for reducing the dimensionality of a dataset [45]. The approach taken in this paper to computing the PCA of a matrix \(\textbf{A}\) relies on the SVD of the upper-triangular matrix \(\textbf{R}\) in the QR decomposition of \(\textbf{A}\). The novelty of FiGaRo over all prior approaches to PCA is that it does not require the materialization of \(\textbf{A}\) to compute \(\textbf{R}\), in case \(\textbf{A}\) is defined by database joins. A further approach to PCA over database joins was recently proposed in the database theory literature, without a supporting implementation [33]: It uses the min-max theorem based on the Rayleigh quotient [56] and iteratively computes one eigenvector of \(\textbf{A}^\textsf{T}\textbf{A}\) at a time. The benefit of that approach is that the computation of \(\textbf{A}^\textsf{T}\textbf{A}\) can be pushed past the joins and may take less time than materializing the matrix \(\textbf{A}\) representing the join result.

11 Conclusion

This article introduces FiGaRo, an algorithm that computes the QR decomposition of matrices defined by joins over relational data. By pushing the computation past the joins, FiGaRo can significantly reduce the number of computation steps and rounding errors. FiGaRo can also be used to compute singular value and eigenvalue decompositions as well as the principal component analysis of the non-materialized matrix representing the joins over relational databases. We experimentally verified that FiGaRo can outperform the linear algebra package Intel MKL in runtime and, in many cases, also in accuracy. In the future, we plan to extend FiGaRo to work with categorical data directly, to avoid one-hot encoding such data.