
On the Complexity of String Matching for Graphs

Published: 12 April 2023


Abstract

Exact string matching in labeled graphs is the problem of searching for paths of a graph G=(V, E) such that the concatenation of their node labels is equal to a given pattern string P[1..m]. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks.

We prove a conditional lower bound stating that, for any constant ε > 0, an O(|E|^{1-ε} m) time, or an O(|E| m^{1-ε}) time algorithm for exact string matching in graphs, with node labels and pattern drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. This holds even if restricted to undirected graphs with maximum node degree 2—that is, to zig-zag matching in bidirectional strings—or to deterministic directed acyclic graphs whose nodes have maximum sum of indegree and outdegree 3. These restricted cases make the lower bound stricter than what can be directly derived from related bounds on regular expression matching (Backurs and Indyk, FOCS’16). In fact, our bounds are tight in the sense that lowering the degree or the alphabet size yields linear time solvable problems.

An interesting corollary is that exact and approximate matching are equally hard (i.e., quadratic time) in graphs under SETH. In comparison, the same problems restricted to strings have linear time and quadratic time solutions, respectively (approximate pattern matching also having a matching SETH lower bound (Backurs and Indyk, STOC’15)).


1 INTRODUCTION

String matching is the classical problem of finding the occurrences of a pattern as a substring of a text [36]. As most of today’s data is linked, it is natural to investigate string matching not only in text strings but also in labeled graphs. Indeed, large-scale labeled graphs are becoming ubiquitous in several areas, such as graph databases, graph mining, and computational biology. Applications require sophisticated operations on these graphs, and they often rely on primitives that locate paths whose nodes have labels matching a pattern given at query time. The most basic pattern to search in a graph is a string, and in this article we will prove that performing string matching in graphs is computationally challenging, even on very restricted graph classes.

In graph databases, query languages provide the user with the ability to select paths based on the labels of their nodes or edges. In this way, graph databases explicitly lay out the dependencies between the nodes of data, whereas these dependencies are implicit in classical relational databases [7]. Although a standard query language has not yet been universally adopted (as occurred for SQL in relational databases), popular query languages such as Cypher [26], Gremlin [46], and SPARQL [43] offer the possibility of specifying paths by matching the labels of their nodes.

In graph mining and machine learning for network analysis, heterogeneous networks specify the type of each node [48]. A basic task related to graph kernels [33] and node similarity [16] is to find paths whose label matches a specific pattern. For example, in the DBLP network [53], the nodes for authors can be marked with the letter ‘A,’ the nodes for papers can be marked with the letter ‘P,’ and edges connect authors to their papers. Coauthors can then be identified by the pattern ‘APA’ if the two ‘A’ letters match two different nodes.

In genome research, the very first step of many standard analysis pipelines of high-throughput sequencing data has been to align sequenced fragments of DNA (called reads) on a reference genome of a species. Further analysis reveals a set of positions where the sequenced individual differs from the reference genome. After years of such studies, there is now a growing dataset of frequently observed differences between individuals and the reference. A natural representation of this gained knowledge is a variation graph in which the reference sequence is represented as a backbone path and variations are encoded as alternative paths [47]. Aligning reads (i.e., string matching) on this labeled graph gives the basis for the new paradigm called computational pan-genomics [15]. There are already practical tools that use such ideas (e.g., [28]).

The string matching problem that we consider in this article is defined as follows. Given an alphabet \(\Sigma\) of symbols, consider a labeled graph \(G=(V,E,L)\), where \((V,E)\) represents a directed or undirected graph and \(L: V \rightarrow \Sigma\) is a function that defines which symbol from \(\Sigma\) is assigned to each node as label.1 A node labeled with \(\sigma \in \Sigma\) is called a \(\sigma\)-node, and an edge whose endpoints are labeled \(\sigma _1\) and \(\sigma _2\), respectively, is called a \(\sigma _1\sigma _2\)-edge. If G is a directed graph, we say that G is deterministic if, for any two out-neighbors of the same node, their labels are different. In the following, we introduce the acronym 3-DDAG to indicate a deterministic Directed Acyclic Graph (DAG) such that the sum of the indegree and outdegree of each node is at most 3.

Given a pattern string \(P[1..m]\) over \(\Sigma\), we say that P has a match in G if there is a path \(u_1, \ldots , u_m\) in G such that \(P = L(u_1) \cdots L(u_m)\) (we also say that P occurs in G, and that \(u_1, \ldots , u_m\) is an occurrence of P).

Problem 1

(String Matching in Labeled Graphs (SMLG))

  • input: A labeled graph \(G = (V,E,L)\) and a pattern string P, both over alphabet \(\Sigma\).

  • output: True if and only if there is at least one occurrence of P in G.
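For intuition, the SMLG problem as stated can be decided in \(O(m\,|E|)\) time by a simple dynamic programming over pattern positions, keeping for each prefix of P the set of nodes at which a match of that prefix can end. A minimal Python sketch (the graph encoding and all names are ours, for illustration only):

```python
def smlg(labels, adj, pattern):
    """Return True iff some path u_1, ..., u_m in the graph spells `pattern`.

    labels: dict node -> single-character label L(u)
    adj:    dict node -> list of out-neighbor nodes (directed edges)
    """
    # current = set of nodes at which a match of the prefix read so far can end
    current = {v for v in labels if labels[v] == pattern[0]}
    for ch in pattern[1:]:
        # extend every partial match by one edge: O(|E|) work per character
        current = {w for v in current for w in adj.get(v, ())
                   if labels[w] == ch}
        if not current:
            return False
    return bool(current)
```

Since paths may revisit nodes, the same node can re-enter `current` at several pattern positions; it is exactly this relaxation that keeps the problem in quadratic time, in contrast to the NP-hard simple-path variant discussed later.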

1.1 Our Results

We give conditional bounds for the String Matching in Labeled Graphs (SMLG) problem using the Orthogonal Vectors (OV) hypothesis [52]. The latter states that for any constant \(\epsilon \gt 0\), no algorithm can solve in \(O(n^{2-\epsilon }\text{poly}(d))\) time the OV problem: given two sets \(X, Y \subseteq \lbrace 0,1 \rbrace ^d\) such that \(|X| = |Y| = n\) and \(d = \omega (\log n)\), decide whether there exist \(x \in X\) and \(y \in Y\) such that x and y are orthogonal, namely, \(x \cdot y = 0\). It is common practice to use the Strong Exponential Time Hypothesis (SETH) [34] instead, but since SETH implies the OV hypothesis [52], it suffices to prove our bounds under the OV hypothesis: they then hold under SETH as well.
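For reference, the OV problem admits an obvious \(O(n^2 d)\)-time algorithm that tries all pairs; the hypothesis asserts that no polynomially faster algorithm exists. A sketch (the function name is ours):

```python
def has_orthogonal_pair(X, Y):
    """Naive O(n^2 d) test: is some x in X orthogonal to some y in Y?

    X, Y: lists of 0/1 vectors of equal dimension d (lists of ints).
    """
    return any(all(xh * yh == 0 for xh, yh in zip(x, y))
               for x in X for y in Y)
```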

First, we consider the SMLG problem on directed graphs. Their weakest form is a 3-DDAG, for which we prove in Section 3 that subquadratic time for exact string matching cannot be achieved unless the OV hypothesis is false.

Theorem 1.1.

For any constant \(\epsilon \gt 0\), the SMLG problem for labeled deterministic DAGs cannot be solved in either \(O(|E|^{1-\epsilon } \, m)\) or \(O(|E| \, m^{1-\epsilon })\) time unless the OV hypothesis fails. This holds even if restricted to a binary alphabet, and to DAGs in which the sum of outdegree and indegree of any node is at most 3 (i.e., 3-DDAGs).

Next, we consider the SMLG problem on undirected graphs and introduce the zig-zag pattern matching problem in strings, which models searching a string P along a path of an undirected graph. An exact occurrence of P in a text string is found by scanning the text forward for increasing positions in P; however, a zig-zag occurrence of P in a text can be found by partially scanning forward and backward adjacent text positions, as many times as needed (e.g., for an edge \(\lbrace u,v\rbrace\) with \(L(u) = \mathtt {a}\) and \(L(v)=\mathtt {b}\), all patterns of the form \(\mathtt {a}, \mathtt {a} \mathtt {b}, \mathtt {a} \mathtt {b} \mathtt {a}, \mathtt {a} \mathtt {b} \mathtt {a} \mathtt {b}, \ldots\) occur starting from u). We prove in Section 4 the following result.
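Zig-zag matching itself is easy to decide in \(O(m\,n)\) time on a text of length n by tracking the set of text positions at which each pattern prefix can end, moving one position left or right at each step. A hedged sketch mirroring the \(\mathtt{a}, \mathtt{a}\mathtt{b}, \mathtt{a}\mathtt{b}\mathtt{a}, \ldots\) example above (all names are ours):

```python
def zigzag_match(text, pattern):
    """Decide whether `pattern` has a zig-zag occurrence in `text`:
    consecutive pattern characters must match adjacent text positions,
    moving either left or right at each step (an undirected path graph)."""
    current = {i for i, c in enumerate(text) if c == pattern[0]}
    for ch in pattern[1:]:
        # step to an adjacent position in either direction
        current = {j for i in current for j in (i - 1, i + 1)
                   if 0 <= j < len(text) and text[j] == ch}
        if not current:
            return False
    return bool(current)
```

On the text "ab" (the edge \(\lbrace u,v\rbrace\) above), every pattern of the form a, ab, aba, abab, ... matches, exactly as described.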

Theorem 1.2.

The conditional lower bound stated in Theorem 1.1 holds even if it is restricted to undirected graphs whose nodes have degree at most 2, where the pattern and the node labels are drawn from a binary alphabet.

Our results cover arbitrary graphs as follows. Interpreting the graphs from Theorem 1.2 as directed, we observe that they have nodes with both indegree 2 and outdegree 2. Theorem 1.1, in turn, involves directed graphs with both nodes of indegree at most 1 and outdegree 2, and nodes of outdegree at most 1 and indegree 2. Thus, the only uncovered case is that of directed graphs whose nodes all have indegree at most 1, or directed graphs whose nodes all have outdegree at most 1. The edges of such graphs can be decomposed into forests of directed trees (arborescences), whose roots may be connected in a directed cycle (at most one cycle per forest). We show in Section 5.1 that the Knuth-Morris-Pratt (KMP) algorithm [36] can be easily extended to solve exact string matching for these special directed graphs in linear time, thus completing the picture of the complexity of the SMLG problem.
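The key observation behind this linear-time case is that the KMP automaton state can be carried along a depth-first traversal of each arborescence and restored on backtracking, so each tree edge is visited once. A hedged sketch for forests of arborescences, ignoring the optional cycle through the roots (all names are ours; the full linear-time bound of Section 5.1 requires a slightly more careful argument):

```python
def kmp_failure(p):
    """Standard KMP failure function for pattern p."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def match_in_forest(labels, children, roots, p):
    """Run KMP along all downward paths of a forest of arborescences.

    labels: dict node -> character; children: dict node -> child list.
    The automaton state k is passed into each child and implicitly
    restored on backtracking, so every tree edge is traversed once.
    """
    fail = kmp_failure(p)

    def dfs(v, k):
        c = labels[v]
        while k and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            return True
        return any(dfs(w, k) for w in children.get(v, ()))

    return any(dfs(r, 0) for r in roots)
```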

1.2 History and Implications

The idea of extending the problem of string matching to graphs, as given in SMLG, is not new. If the nodes \(u_1, \ldots , u_m\) are required to be distinct (i.e., to form a simple path), the problem is NP-hard, as it subsumes the well-known Hamiltonian Path problem; for this reason, this requirement is dropped. The SMLG problem was studied more than 25 years ago as a search problem for hypertext by Manber and Wu [38]. The history of key contributions is given in Table 1, where a common feature of the reported bounds is the appearance of the quadratic term \(m \, |E|\) (except for some special cases). Specifically, Amir et al. [5, 6] gave the first quadratic time solution for exact string matching in \(O(N + m \cdot |E|)\) time, where \(N = \sum _{u \in V} |L(u)|\).

Table 1.
Year | Authors | Graph | Exact/Approximate | Time
1992 | Manber and Wu [38] | DAG | Approximate\(^{\it (1)}\) | \(O(m|E| + occ\lg \lg m)\)
1993 | Akutsu [2] | Tree | Exact | \(O(N)\)
1995 | Park and Kim [40] | DAG | Exact\(^{\it (3)}\) | \(O(N + m|E|)\)
1997 | Amir et al. [6] | General | Exact | \(\boldsymbol {O(N + m|E|)}\)
1997 | Amir et al. [6] | General | Approximate\(^{\it (2)}\) | NP-hard
1997 | Amir et al. [6] | General | Approximate\(^{\it (1)}\) | \(O(Nm\lg N + m|E|)\)
1998 | Navarro [39] | General | Approximate\(^{\it (1)}\) | \(O(Nm + m|E|)\)
2017 | Rautiainen and Marschall [45] | General | Approximate\(^{\it (1)}\) | \(\boldsymbol {O(N + m|E|)}\)
2019 | Jain et al. [35] | General, binary alphabet | Approximate\(^{\it (2)}\) | NP-hard
  • \(V,\) set of nodes; \(E,\) set of edges; \(occ,\) number of matches for the pattern in the graph; \(m,\) pattern length; \(N,\) total length of text in all nodes; (1), errors only in the pattern; (2), errors in the graph; (3), matches span only one edge. The two rows highlighted in bold report the best known bounds for exact and approximate string matching, respectively.

Table 1. State of the Art for SMLG


In the approximate matching case, allowing errors in the graph makes the problem NP-hard [6], so from here on we consider errors only in the pattern. In that case, the quadratic cost of approximate matching in graphs is asymptotically optimal under SETH since (i) it solves approximate string matching as a special case, because a graph consisting of just one directed path of \(|E|+1\) nodes and \(|E|\) edges is a text string of length \(n=|E|+1\), and (ii) it has been recently proved that the edit distance of two strings of length n cannot be computed in \(O(n^{2-\epsilon })\) time, for any constant \(\epsilon \gt 0\), unless SETH is false [10]. This conditional lower bound explains why the \(O(m|E|)\) barrier has been difficult to cross in the approximate case. Rautiainen and Marschall [45] and Jain et al. [35] recently gave the best bound for errors in the pattern only, \(O(N + m \cdot |E|)\) time, matching the bound for exact string matching. The two best results for exact and approximate pattern matching, both taking quadratic time in the worst case, are highlighted in Table 1.

In this scenario and the application domains mentioned at the beginning, our results have a number of implications:

  • Although we can explain the complexity of approximate string matching in graphs, not much is known about the complexity of exact string matching in graphs. The classical exact string matching can be solved in linear time [36], so one could expect the corresponding problem on graphs to be easier than approximate string matching. A lower bound (i.e., NP-hard, as mentioned earlier) exists only in the case when the pattern is restricted to match only simple paths in the graph. Extensions of this type of matching for special graph classes were studied in the work of Limasset et al. [37]. Here we study the general case, where paths can pass through nodes multiple times. Somewhat surprisingly, Theorems 1.1 and 1.2 imply that exact and approximate pattern matching are equally hard in graphs, even if they are 3-DDAGs.

  • Our results imply that the algorithm for directed graphs by Amir et al. [5, 6] is essentially the best asymptotic bound we can hope for unless the OV hypothesis is false. This also applies to undirected graphs via the simple transformation in which each edge \(\lbrace u,v\rbrace\) is replaced by the pair of arcs \((u,v)\) and \((v,u)\). Note that Theorem 1.2 is also needed to explicitly state that this is the best possible even for undirected graphs of maximum degree 2. To complete the picture, we show how to get linear time for the preceding special case of directed graphs whose nodes have indegree at most 1, or directed graphs whose nodes have outdegree at most 1.

  • Our results also explain why it has been difficult to find indexing schemes for fast exact string matching in graphs, with other than best-case or average-case guarantees [27, 49], except for limited search scenarios [50]. They complement recent findings about Wheeler graphs [3, 27, 31]. Wheeler graphs are a class of graphs admitting an index structure that supports linear time exact pattern matching. Gibney and Thankachan [31] show that it is NP-complete to recognize whether a (non-deterministic) DAG is a Wheeler graph. Alanko et al. [3] give a linear time algorithm for recognizing whether a deterministic automaton is a Wheeler graph. They also give an example where the minimum size Wheeler graph is exponentially smaller than an equivalent deterministic automaton. Theorem 1.1 shows that converting an arbitrary deterministic DAG into an equivalent (not necessarily minimum size) Wheeler graph should take at least quadratic time unless the OV hypothesis is false; moreover, a later refinement of this result [24] shows that exponential time for the conversion is needed under the OV hypothesis. In particular, the 3-DDAG obtained in the reduction from OV in the proof of Theorem 1.1 is not a Wheeler graph.

  • We describe a simple transformation in Section 5.2 that lets us see our 3-DDAG and the pattern P as two Deterministic Finite Automata (DFAs), so that our SMLG problem reduces to checking the emptiness of the intersection of the string sets recognized by these two DFAs. This highlights a connection between the two problems and immediately provides a quadratic conditional lower bound under OV for the latter problem. However, this might not be the best obtainable bound for the emptiness of intersection problem, as ongoing work attempts to prove a quadratic lower bound, under SETH [51], already when the two DFAs are trees. Nevertheless, our algorithm from Section 5.1 shows that emptiness of intersection between a tree and a chain of nodes is solvable in linear time.
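The emptiness-of-intersection check itself is a textbook product construction: a breadth-first search over pairs of states, quadratic in the sizes of the two DFAs. A self-contained sketch (the encoding of a DFA as a (start, accepting, transitions) triple is ours):

```python
from collections import deque

def intersection_nonempty(dfa1, dfa2):
    """Product-automaton BFS: do the two DFAs accept a common string?

    Each DFA is a triple (start, accepting, delta) where delta maps
    (state, symbol) -> state; missing entries act as rejecting sinks.
    """
    (s1, acc1, d1), (s2, acc2, d2) = dfa1, dfa2
    seen = {(s1, s2)}
    queue = deque(seen)
    while queue:
        q1, q2 = queue.popleft()
        if q1 in acc1 and q2 in acc2:
            return True          # both DFAs accept the string read so far
        # follow every symbol defined from q1 in dfa1 and from q2 in dfa2
        for (state, a), r1 in d1.items():
            if state == q1 and (q2, a) in d2:
                pair = (r1, d2[(q2, a)])
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return False
```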

Our reductions share some similarities with those for string problems [1, 8, 9, 10, 11, 13, 41]. The closest connection is with the conditional hardness of several forms of regular expression matching [9]. We describe these similarities in Section 2, highlighting the main limitations of this reduction scheme. (For the interested reader, we went through the details of this reduction in an early version of this work [21].) Later we explain why our strategy is crucial in achieving stronger results, such as covering the case of deterministic directed graphs with bounded degree. This strategy yields a graph of small degree and enables local merging of non-deterministic subgraphs into deterministic counterparts. This locality feature of our reduction is of cardinal importance, since converting a Non-Deterministic Finite Automaton (NFA) into a DFA can take exponential time [44]. Finally, although this reduction works also for undirected graphs of small degree, it does not cover undirected graphs of degree 2. For this case (zig-zag matching in a bidirectional string), we need a more intricate reduction, as the underlying graph has less structure.


2 OVERVIEW OF THE REDUCTION AND CONNECTIONS WITH REGULAR EXPRESSIONS

As mentioned in Section 1, our lower bounds have deep connections with previous results on regular expression matching. We use these connections to conceptually introduce some internal components of our reductions before proceeding to their formal definitions. Additionally, this allows us to point out why a simple modification of an earlier reduction is not sufficient for our purposes.

Backurs and Indyk [9] analyzed which types of regular expressions are hard to match, and one of their lower bounds can be adapted to address the SMLG problem in the case of a non-deterministic DAG. The type of regular expressions in question is \(\mid \cdot \mid\)—that is, a composition of two levels of or (\(\mid\)) operations. An example of such a regular expression is \([(a|b)(b|c)]|[(a|c)b]\). Given a regular expression p of this type and a text t, determining whether or not a substring of t can be generated by p requires quadratic time, unless there exists a subquadratic time algorithm for OV. The reduction used to prove this result defines text \(t = t_1\texttt {2}t_2\texttt {2}\ldots \texttt {2}t_n\) as the concatenation of all the binary vectors \(t_1,\dots ,t_n\) of X, placing the separator character \(\texttt {2}\) between them. Regular expression \(p = G_W^{(1)} \mid G_W^{(2)} \mid \cdots \mid G_W^{(n)}\) is an or of n gadgets, one for each vector in the set Y. Moreover, gadget \(G_W^{(j)}\) is designed in such a way that it accepts substring \(t_i\) if and only if the i-th vector of X and the j-th vector of Y are orthogonal. Hence, it is fairly straightforward to prove that a substring of t is accepted by p if and only if there exists a pair of orthogonal vectors in X and Y, respectively.

The idea behind this reduction can be modified for the SMLG problem as follows. In the SMLG problem, we need to construct a pattern P and a graph G such that P has a match in G if and only if there is a vector in X orthogonal to a vector in Y. Consider the NFA that accepts the same language as the regular expression p defined earlier, and call \(\mathtt {b}\) and \(\mathtt {e}\) its start and accepting states, respectively.

We can enrich such automaton with a universal gadget \(G_{U}^{(j)}\), which accepts any binary vector of length d. We place \(n-1\) universal gadgets on each side of the \(G_W^{(j)}\)’s, to allow P to shift, as shown in Figure 1. Pattern P is again defined as the concatenation of the vectors in X with separator characters. (Again, see the next section for the formal definition.) Because we placed only \(n-1\) universal gadgets on each side, pattern P matches in G if and only if a subpattern of P matches in one of the \(G_W^{(j)}\) gadgets, which can happen if and only if there exists a pair of orthogonal vectors.

Fig. 1.

Fig. 1. A sketch of the structure of the reduction for non-deterministic graphs. Pattern P can shift to select a matching subpattern, shown in bold.

Observe that this reduction builds a non-deterministic graph because of the out-neighbors of node \(\mathtt {b}\). This non-deterministic feature appears inherent to this type of construction. Our contribution is a heavy restructuring of this reduction, whose two main ideas can be intuitively summarized as follows. First, instead of placing the \(G_W^{(j)}\) gadgets on a “column,” we place them on a “row.” We then place the left universal gadgets on a “row” on top of this one, the right universal gadgets on a “row” below this one, and force the pattern to have a match starting in the top row and ending in the bottom row. See Section 3.3 and Figure 4 presented later in the article. This allows us to restrain the non-deterministic parts of the graph to nodes having only two out-neighbors with the same label. Second, we then show how to remove this non-determinism by locally merging the parts of the graph labeled with the same letter while still maintaining the properties of the graph. See Section 3.4 and Figure 6 presented later in the article.


3 DETERMINISTIC DAGS

In this section, we reduce the OV problem to the SMLG problem for the restricted case of 3-DDAGs. These are the most restricted graphs for which the lower bound can hold, since on even more restricted graphs the SMLG problem can be solved in linear time (see Section 5.1).

Given an OV instance with sets \(X = \lbrace x_1, \ldots , x_n\rbrace\) and \(Y = \lbrace y_1, \ldots , y_{n}\rbrace\) of d-dimensional binary vectors, we show how to build a pattern P and a 3-DDAG G such that P has a match in G if and only if there exists a vector in X orthogonal to one in Y. We first describe how to build P and how to obtain a directed graph whose nodes are labeled with a constant-sized alphabet. Then we discuss how to turn such a graph into the 3-DDAG G.

3.1 Pattern

Pattern P is over the alphabet \(\Sigma = \lbrace \mathtt {b},\mathtt {e},\mathtt {0},\mathtt {1} \rbrace\), has length \(|P| = O(nd)\), and can be built in \(O(nd)\) time from the first set of vectors \(X = \lbrace x_1, \ldots , x_n\rbrace\). Namely, we define \(\begin{equation*} P = \mathtt {b} \mathtt {b} P_{x_1}\mathtt {e} \,\mathtt {b} P_{x_2}\mathtt {e} \ldots \mathtt {b} P_{x_n}\mathtt {e} \mathtt {e}, \end{equation*}\) where \(P_{x_i}\) is a string of length d that is associated with \(x_i \in X\), for \(1 \le i \le n\). The h-th symbol of \(P_{x_i}\) is either \(\mathtt {0}\) or \(\mathtt {1}\), for each \(h \in \lbrace 1,\dots ,d\rbrace\), such that \(P_{x_i}[h] = \mathtt {1}\) if and only if \(x_i[h] = 1\).2 We thus view the vectors in X as subpatterns \(P_{x_i}\), concatenated with the separator string \(\mathtt {e} \mathtt {b}\) between them. Note that P starts with \(\mathtt {b} \mathtt {b}\) and ends with \(\mathtt {e} \mathtt {e}\): these strings occur nowhere else in P, thus marking its beginning and its end.
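The construction of P is direct to implement; a sketch, assuming the vectors are given as lists of 0/1 integers (the function name is ours):

```python
def build_pattern(X):
    """Build P = bb P_{x_1} e b P_{x_2} e ... b P_{x_n} e e, where the
    string P_{x_i} spells the 0/1 entries of vector x_i."""
    blocks = "".join("b" + "".join(str(bit) for bit in x) + "e" for x in X)
    # the extra leading b and trailing e produce the bb prefix and ee suffix
    return "b" + blocks + "e"
```

For instance, X = {10, 01} yields P = bb10eb01ee, of length O(nd) as claimed.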

3.2 Graph Gadgets

The gadget implementing the main logic of the reduction is a directed graph \(G_W = (V_W,E_W,L_W)\), illustrated in Fig. 2. Starting from the second set of vectors Y, set \(V_W\) can be seen as n disjoint groups of nodes \(V_W^{(1)}, V_W^{(2)}, \ldots , V_W^{(n)}\) (plus some extra nodes), where the nodes in \(V_W^{(j)}\) are uniquely associated with vector \(y_j \in Y\), for \(1 \le j \le n\). The corresponding induced subgraph \(G_W^{(j)} = (V_W^{(j)}, E_W^{(j)})\) will contain an occurrence of a subpattern \(P_{x_i}\) if and only if \(x_i \cdot y_j = 0\). We give more details in the following.

Fig. 2.

Fig. 2. Gadget \(G_W\) .

The nodes in \(V_W^{(j)}\) are defined as follows. For \(1 \le h \le d\), we consider entry \(y_j[h]\) of vector \(y_j \in Y\). If \(y_j[h] = 1\), we place just a \(\mathtt {0}\)-node \(w^0_{j h}\) to indicate that we only accept \(P_{x_i}[h] = \mathtt {0}\) for this h coordinate. Instead, if \(y_j[h] = 0\), we place both a \(\mathtt {0}\)-node \(w^0_{j h}\) and a \(\mathtt {1}\)-node \(w^1_{j h}\) to indicate that the value of \(P_{x_i}[h]\) does not matter. The nodes in \(V_W^{(j)}\) are preceded by a special begin \(\mathtt {b}\)-node \(b_W^{(j)}\) and succeeded by a special end \(\mathtt {e}\)-node \(e_W^{(j)}\). The overall nodes are thus \(V_W = \bigcup _{1 \le j \le n} (V_W^{(j)} \cup \lbrace b_W^{(j)},e_W^{(j)}\rbrace)\), and it holds that \(|V_W| = O(nd)\).

As for the edges in \(E_W^{(j)}\), they properly connect the nodes inside each group \(V_W^{(j)}\). Specifically, node \(b_W^{(j)}\) is connected to \(w^0_{j 1}\) and, if it exists, to \(w^1_{j 1}\). Additionally, we place edges connecting both nodes \(w^0_{j d}\) and \(w^1_{j d}\) (if it exists) to node \(e_W^{(j)}\). Moreover, there is an edge for every pair of nodes that are consecutive in terms of h coordinate, for \(1 \le h \lt d\) (e.g., \(w^1_{j h}\) is connected to \(w^0_{j \, h+1}\)). The overall edges are thus \(E_W = \bigcup _{1 \le j \le n} E_W^{(j)}\), where \(|E_W| = O(nd)\).

In this way, we define the directed graph \(G_W = (V_W,E_W,L_W)\), which can be built in \(O(nd)\) time from set Y and consists of n connected components \(G_W^{(j)}\), one for each vector \(y_j \in Y\).
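As a sanity check, the gadget can be exercised programmatically. The following hedged sketch builds \(G_W\) from Y, with node names encoding the (label, j, h) coordinates (our own encoding), together with the generic \(O(m|E|)\) path-spelling check from Problem 1:

```python
def build_gw(Y):
    """Construct gadget G_W from the vector set Y.

    Returns (labels, adj): node labels in {'b','e','0','1'} and the
    directed adjacency lists, one connected component per vector y_j.
    """
    labels, adj = {}, {}
    d = len(Y[0])
    for j, y in enumerate(Y):
        b, e = ("b", j), ("e", j)
        labels[b], labels[e] = "b", "e"
        prev = [b]
        for h in range(d):
            layer = [("0", j, h)]          # 0-node is always present
            if y[h] == 0:                  # 1-node only when y_j[h] = 0
                layer.append(("1", j, h))
            for node in layer:
                labels[node] = node[0]
            for u in prev:                 # connect consecutive h layers
                adj.setdefault(u, []).extend(layer)
            prev = layer
        for u in prev:                     # last layer feeds the e-node
            adj.setdefault(u, []).append(e)
    return labels, adj

def occurs(labels, adj, pattern):
    """O(m|E|) check that some path spells `pattern` (as in Problem 1)."""
    cur = {v for v in labels if labels[v] == pattern[0]}
    for ch in pattern[1:]:
        cur = {w for v in cur for w in adj.get(v, ()) if labels[w] == ch}
    return bool(cur)
```

For Y = {10, 01}, the subpattern b00e (from the vector 00, orthogonal to both) occurs, while b11e (from 11, orthogonal to neither) does not, in accordance with Lemma 3.2 below.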

We observe that pattern occurrences in \(G_W\) have some useful combinatorial properties. The following lemma is an immediate observation, which follows from the fact that each \(G_W^{(j)}\) is acyclic and not connected to any other \(G_W^{(j^{\prime })}\).

Lemma 3.1.

If subpattern \(\mathtt {b} P_{x_i} \mathtt {e}\) has a match in \(G_W\), then the nodes matching \(P_{x_i}\) share the same j coordinate and have distinct and consecutive h coordinates.

The following lemma instead relates the occurrence of a subpattern to the OV problem.

Lemma 3.2.

Subpattern \(\mathtt {b} P_{x_i}\mathtt {e}\) has a match in \(G_W\) if and only if there exists \(y_j \in Y\) such that \(x_i \cdot y_j = 0\).

Proof.

Recall that, by construction, \(w^0_{j h} \in V_W^{(j)}\) and \(w^1_{j h} \in V_W^{(j)}\) hold for those h such that \(y_j[h] = 0\), whereas \(w^0_{j h} \in V_W^{(j)}\) and \(w^1_{j h} \not\in V_W^{(j)}\) hold in case \(y_j[h] = 1\). We handle the two implications of the statement individually.

(\(\Rightarrow\)) By Lemma 3.1, we can focus on the d distinct and consecutive nodes of \(G_W^{(j)}\) that match \(P_{x_i}\). In particular, we know that each character \(P_{x_i}[h]\) is matched by either \(w^0_{j h}\) or \(w^1_{j h}\). Consider vectors \(x_i \in X\) and \(y_j \in Y\). If \(P_{x_i}[h] = \mathtt {1}\) has a match in \(G_W^{(j)}\), it means that node \(w^1_{j h}\) exists and hence \(y_j[h]=0\), implying \(x_i[h] \cdot y_j[h] = 0\). If \(P_{x_i}[h] = \mathtt {0}\), by construction we know that \(x_i[h]=0\) and, no matter whether node \(w^1_{j h}\) exists or not, the pattern will match \(w^0_{j h}\), and it clearly holds that \(x_i[h] \cdot y_j[h] = 0\). At this point, we can conclude that \(x_i[h] \cdot y_j[h] = 0\) for every \(1 \le h \le d\), thus \(x_i \cdot y_j = 0\).

(\(\Leftarrow\)) Consider vectors \(x_i \in X\) and \(y_j \in Y\) such that \(x_i \cdot y_j = 0\). For \(h = 1, 2, \ldots , d\), if \(y_j[h] = 0\) then \(w^0_{j h},w^1_{j h} \in V_W^{(j)}\) and \(P_{x_i}[h]\) can match either \(w^0_{j h}\) or \(w^1_{j h}\) in \(G_W^{(j)}\). If \(y_j[h] = 1\), it must be that \(x_i[h] = 0\) since \(x_i \cdot y_j = 0\); thus \(P_{x_i}[h] = \mathtt {0}\) and it can match node \(w^0_{j h}\), which is always present in \(G_W^{(j)}\). Finally, characters \(\mathtt {b}\) and \(\mathtt {e}\) can match nodes \(b_W^{(j)}\) and \(e_W^{(j)}\), respectively. All characters of \(\mathtt {b} P_{x_i} \mathtt {e}\) now have a matching node, and the definition of the edges in \(E_W\) allows us to visit all such nodes via a matching path starting at \(b_W^{(j)}\) and ending at \(e_W^{(j)}\).□

In the following, we will also use gadget \(G_U = (V_U,E_U,L_U)\), the degenerate case of \(G_W\) with \(2n-2\) (instead of just n) connected components \(G_U^{(j)}\) where, for all \(1 \le j \le 2n-2\) and \(1 \le h \le d\), we place both a \(\mathtt {0}\)-node and a \(\mathtt {1}\)-node: we call these two nodes \(u^0_{j h}\) and \(u^1_{j h}\), respectively, to distinguish them from those in \(G_W\). Moreover, every \(\mathtt {e}\)-node of this gadget is connected with the next \(\mathtt {b}\)-node, in terms of the j coordinate (Fig. 3). As can be seen, any subpattern \(P_{x_i}\) occurs in \(G_U\), so it can be used as a “jolly” gadget.

Fig. 3.

Fig. 3. Gadget \(G_U\) .

3.3 Non-Deterministic Graph

A possible approach is based on suitably combining one instance of gadget \(G_W\) and two instances of gadgets \(G_U\), named \(G_{U1}\) and \(G_{U2}\). The idea is that when \(x_i \cdot y_j = 0\), we want P to occur in G, so that the following three conditions hold:

  • Instance \(G_{U1}\): \(P_{x_1}\) occurs in \(G_{U1}^{(n-1+j-(i-1))}\), ..., \(P_{x_{i-1}}\) occurs in \(G_{U1}^{(n-1+j-1)}\).

  • Instance \(G_W\): \(P_{x_i}\) occurs in \(G_W^{(j)}\).

  • Instance \(G_{U2}\): \(P_{x_{i+1}}\) occurs in \(G_{U2}^{(j)}\), ..., \(P_{x_{n}}\) occurs in \(G_{U2}^{(j+n-i-1)}\).

However, when \(x_i \cdot y_j \ne 0\), we do not want \(P_{x_i}\) to occur in \(G_W^{(j)}\). We can suitably link the instances \(G_W\), \(G_{U1}\), and \(G_{U2}\) so that we get the preceding conditions. We connect the \(\mathtt {e}\)-nodes in \(G_{U1}\) to \(\mathtt {b}\)-nodes in \(G_W\) and connect the \(\mathtt {e}\)-nodes in \(G_W\) to \(\mathtt {b}\)-nodes in \(G_{U2}\). Additionally, we place extra starting \(\mathtt {b}\)-nodes and ending \(\mathtt {e}\)-nodes, to properly match the \(\mathtt {b} \mathtt {b}\) prefix and the \(\mathtt {e} \mathtt {e}\) suffix of P, respectively. More precisely, for every \(\mathtt {b}\)-node in \(G_{U1}\) and \(G_W,\) we add a new \(\mathtt {b}\)-node as an in-neighbor of it, and for every \(\mathtt {e}\)-node in \(G_{W}\) and \(G_{U2}\), we add a new \(\mathtt {e}\)-node as an out-neighbor of it. This construction is depicted in Figure 4.

Fig. 4.

Fig. 4. Non-deterministic graph G.

However, even if \(G_W\), \(G_{U1}\), and \(G_{U2}\) are deterministic, their resulting composition is not so, because of the out-neighbors of the \(\mathtt {e}\)-nodes.3 In the following, we show how to obtain a deterministic graph by suitably merging \(G_W\) with portions of \(G_U\).

3.4 Deterministic Graph

To obtain a deterministic DAG, we need to suitably combine one instance of gadget \(G_W\) with the two instances \(G_{U1}\) and \(G_{U2}\) (recall that both \(G_{U1}\) and \(G_{U2}\) have instances of gadget \(G_U^{(j)}\), for all \(1 \le j \le 2n-2\)). Although \(G_{U2}\) will be used as is, \(G_{U1}\) needs to be partially merged with \(G_{W}\) to obtain determinism. We start building our final graph G from \(G_W\) by adding parts of \(G_{U1}\) when needed, obtaining a deterministic graph called \(G_{U1W}\), as shown in Figure 5. Consider subgraph \(G_W^{(j)}\) and assume that the first position in which the \(\mathtt {1}\)-node is lacking is h. We place a partial version of subgraph \({G}_{U1}^{(j^{\prime })}, j^{\prime }:=n-1+j\), by adding to the graph the nodes and edges of \({G}_{U1}^{(j^{\prime })}\) that are located between position \(h+1\) and node \({e}_{U1}^{(j^{\prime })}\) (included). If \(h=d,\) we place only node \(e_{U1}^{(j^{\prime })}\). We also place \(\mathtt {1}\)-node \(u^1_{jh}\) and connect the \(\mathtt {0}\)-node and the \(\mathtt {1}\)-node (if any) of \(G_W^{(j)}\) in position \(h-1\) to it (if \(h \gt 1\)), or we connect \({b}_{W}^{(j)}\) to it (if \(h=1\)). Moreover, we connect node \(u^1_{jh}\) to the first \(\mathtt {0}\)- and \(\mathtt {1}\)-node of partial \(G_{U1}^{(j^{\prime })}\). If \(h=d,\) we connect \(u^1_{j h}\) to \(e_{U1}^{(j^{\prime })}\). Then we scan \(G_W^{(j)}\) from left to right looking for those positions \(h^{\prime }\), \(h \le h^{\prime } \lt d\), such that there is no \(\mathtt {1}\)-node in position \(h^{\prime }+1\). We connect the \(\mathtt {0}\)-node and the \(\mathtt {1}\)-node (if any) of \(G_W^{(j)}\) in position \(h^{\prime }\) to the \(\mathtt {1}\)-node of \(G_{U1}^{(j^{\prime })}\) in position \(h^{\prime }+1\). Finally, we place edge \(({e}_{U1}^{(j^{\prime })}, {b}_{W}^{(j+1)})\). 
To complete the merging task, we apply the preceding modification to all \({G}_W^{(j)}\), for \(1 \le j \le n-1\), and thus obtain gadget \(G_{U1W}\).

Fig. 5.

Fig. 5. Graph \(G_{U1W}\) after merging \(G_{U1}\) (from Figure 3) with \(G_W\) (from Figure 2).

At this point, we place gadget \(G_{U2}\) and connect \(G_{U1W}\) to it by placing edges \((e_W^{(j)}, b_{U2}^{(j)})\), for all \(1 \le j \le n\). Additionally, for every \(\mathtt {b}\)-node of \(G_{U1W}\), we place an additional \(\mathtt {b}\)-node as in-neighbor. We do the same for every \(\mathtt {e}\)-node of \(G_{U2}\), placing an \(\mathtt {e}\)-node as out-neighbor. Adding subgraphs \(G_{U1}^{(1)}, \ldots , G_{U1}^{(n-1)}\) with one additional \(\mathtt {b}\)-node as in-neighbor of their \(\mathtt {b}\)-nodes, and connecting the \(\mathtt {e}\)-node of \(G_{U1}^{(n-1)}\) to the \(\mathtt {b}\)-node of \(G_{W}^{(1)}\), completes the transformation into the wanted deterministic DAG, which we call G. Figure 6 gives an overall picture of G.

Fig. 6.

Fig. 6. Final deterministic DAG G.

It is easy to verify that every \(\mathtt {b}\)- and \(\mathtt {e}\)-node in G has no more than two out-neighbors, and in that case, the two out-neighbors have different labels. This shows that graph G is deterministic.

The deterministic DAG G has a crucial property which, combined with Lemma 3.1 and Lemma 3.2, is essential to ensure the correctness of our reduction.

Lemma 3.3.

Pattern P has a match in G if and only if a subpattern \(\mathtt {b} P_{x_i}\mathtt {e}\) of P has a match in the underlying subgraph \(G_W\) of \(G_{U1W}\).

Proof.

For the \((\Rightarrow)\) implication, because of the directed \(\mathtt {e} \mathtt {b}\)-edges, each distinct subpattern \(\mathtt {b} P_{x_i}\mathtt {e}\) matches a path in either a distinct portion of \(G_{U1W}\) (or in the \(G_{U1}^{(j)}\) subgraphs, \(1 \le j \le n-1\), before it) or in \(G_{U2}\). Moreover, each occurrence of P must begin with \(\mathtt {b} \mathtt {b}\) and end with \(\mathtt {e} \mathtt {e}\). String \(\mathtt {b} \mathtt {b}\) can be matched only in \(G_{U1W}\) (or in the \(G_{U1}^{(j)}\) subgraphs before it), hence the match must start there. However, string \(\mathtt {e} \mathtt {e}\) is found either in \(G_{U1W}\) or in \(G_{U2}\). Observe that, by construction, once a match for pattern P is started in \(G_{U1W}\) (or in the \(G_{U1}^{(j)}\) subgraphs before it), the only way to successfully conclude it is either by matching \(\mathtt {e} \mathtt {e}\) within \(G_{U1W}\), or by matching also a portion of \(G_{U2}\) and then \(\mathtt {e} \mathtt {e}\). Because of the structure of the graph, in both cases a subpattern \(\mathtt {b} P_{x_i}\mathtt {e}\) of P must match one of the subgraphs \({G}_W^{(j)}\) that are present in \(G_{U1W}\).

The \((\Leftarrow)\) implication is trivial. In fact, if \(\mathtt {b} P_{x_i}\mathtt {e}\) has a match in one subgraph \(G_W^{(j)}\), then by construction we can match \(\mathtt {b} P_{x_1}\mathtt {e} \ldots \mathtt {b} P_{x_{i-1}}\mathtt {e}\) possibly in the \(G_{U1}^{(j)}\) subgraphs before \(G_{U1W}\), then possibly in the partial \(G_{U1}^{(j)}\) subgraphs of \(G_{U1W}\). We can then match \(\mathtt {b} P_{x_{i+1}}\mathtt {e} \ldots \mathtt {b} P_{x_n}\mathtt {e}\) in \(G_{U2}\) and thus have a full match for P in G.□

We conclude this section by proving the following weaker version of Theorem 1.1. In the next two sections, we show how to obtain the full proof of Theorem 1.1 by transforming G to have maximum sum of indegree and outdegree 3 and by reducing the alphabet to binary.

Theorem 3.4.

For any constant \(\epsilon \gt 0\), the SMLG problem for a labeled deterministic DAG cannot be solved in either \(O(|E|^{1-\epsilon } \, m)\) or \(O(|E| \, m^{1-\epsilon })\) time unless the OV hypothesis fails. This holds even if restricted to an alphabet of size 4.

Proof.

First, we argue that the reduction given in this section is correct. Then we analyze its cost and argue how a subquadratic time algorithm for SMLG would contradict the OV hypothesis.

Correctness. We need to ensure that pattern P has a match in G if and only if there exist vectors \(x_i \in X\) and \(y_j \in Y\) which are orthogonal. This follows from Lemma 3.3, which guarantees that P has a match in G if and only if a subpattern \(\mathtt {b} P_{x_i}\mathtt {e}\) has a match in \(G_W\), and from the fact that, by Lemma 3.2, this holds if and only if \(x_i \cdot y_j = 0\) for some \(y_j \in Y\).

Cost. As observed during the construction in Sections 3.1 and 3.2, both pattern P and graph G have size \(O(nd)\). Indeed, for each one of the n vectors \(x_i \in X,\) we place in P characters \(\mathtt {b}\) and \(\mathtt {e}\) plus d characters that can be either \(\mathtt {0}\) or \(\mathtt {1}\). In graph G, the size of each subgraph is proportional to the dimension d of the vectors, and we place \(O(n)\) of them.

Using the OV Hypothesis. The last step is to show that any \(O(|E|^{1-\epsilon } \, m)\)-time or \(O(|E| \, m^{1-\epsilon })\)-time algorithm A for SMLG contradicts the OV hypothesis. Given two sets of vectors X and Y, we can perform our reduction obtaining pattern P and graph G in \(O(nd)\) time while observing that \(|E| = O(nd)\) and \(m = O(nd)\). No matter whether A has \(O(|E|^{1-\epsilon } m)\) or \(O(|E| \, m^{1-\epsilon })\) time complexity, we end up with an algorithm deciding whether X and Y contain a pair of orthogonal vectors in \(O(nd \cdot (nd)^{1-\epsilon }) = O(n^{2-\epsilon }\text{poly}(d))\) time, which contradicts the OV hypothesis.□
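For concreteness, the pattern side of the reduction can be replayed in a few lines of Python. This is a sketch under our reading of the construction (function names are ours); the layout follows the example P = bb 100 e b 101 ee recalled at the start of Section 4.1, where each vector \(x_i\) contributes a subpattern \(\mathtt {b} P_{x_i}\mathtt {e}\), with one extra leading \(\mathtt {b}\) and trailing \(\mathtt {e}\).

```python
def build_pattern(X):
    """Concatenate the subpatterns b P_{x_i} e for all x_i in X, with one
    extra leading b and trailing e so that P starts with bb and ends with ee."""
    body = "".join("b" + "".join(str(bit) for bit in x) + "e" for x in X)
    return "b" + body + "e"

X = [[1, 0, 0], [1, 0, 1]]
P = build_pattern(X)
assert P == "bb100eb101ee"
# |P| = n(d + 2) + 2, i.e., O(nd), matching the cost analysis above
n, d = len(X), len(X[0])
assert len(P) == n * (d + 2) + 2
```

The size bound \(|P| = O(nd)\) is exactly the one used in the cost analysis.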

3.5 Reduced Degree

In this section, we show how to transform the deterministic graph G from the previous section to be a 3-DDAG.

Observe that every node in G can have at most two in-neighbors and two out-neighbors. An emblematic case is that of four nodes, say v, w, \(v^{\prime }\), and \(w^{\prime }\), with edges \((v,w)\), \((v,w^{\prime }),(v^{\prime },w)\), and \((v^{\prime },w^{\prime })\). To reduce the outdegree of v and \(v^{\prime }\), and the indegree of w and \(w^{\prime }\), to 1, the idea is to add two dummy nodes \(\bar{v}\) and \(\bar{w}\) connected by an edge \((\bar{v},\bar{w})\), then replace the four preceding edges with \((v,\bar{v})\), \((v^{\prime },\bar{v})\), \((\bar{w},w)\), and \((\bar{w},w^{\prime })\). The dummy nodes can be labeled, for example, with \(\mathtt {0}\), and then one can apply a symmetric modification to the pattern. One needs to apply such transformations between any two consecutive columns of G.

To be more precise, we need to consider four node configurations. The first three, shown in Figure 7, are slightly simpler than the fourth one, in Figure 8. The final result is achieved by applying these adjustments among (sequences of) consecutive columns of G, observing that these four cases cover all possible configurations in the graph.

Fig. 7.

Fig. 7. Adjustments to the graph needed for achieving maximum sum of indegree and outdegree 3 for every node. The squared nodes are the new artificial nodes added to reach this goal. (a) This is the general case that captures the main idea. Notice that one or both \(\mathtt {1}\)-nodes may be missing in G, but we apply the transformation nonetheless. (b) A special case of (a). Node v can be either a \(\mathtt {b}\)- or an \(\mathtt {e}\)-node. Node v and the \(\mathtt {1}\)-node may be missing in G. (c) The other special case of (a). The \(\mathtt {1}\)-node, the \(\mathtt {b}\)-node, or the \(\mathtt {e}\)-node on the right may be missing in G.

Fig. 8.

Fig. 8. An example of the special case that occurs in gadget \(G_{U1W}\). Notice that some nodes may be missing in G.

Since we always insert either a pair of new \(\mathtt {0}\)-nodes or a \(\mathtt {b}\)- and an \(\mathtt {e}\)-node between prescribed columns in G, we can analogously modify the pattern to match the new structure of G.

The encoding that we present next to obtain a binary alphabet can be safely applied after reducing the degree of the nodes of G with this technique.

3.6 Binary Alphabet

The size of the alphabet used until this point is 4. One can reduce the alphabet size to binary using the following encoding, \(\begin{equation*} \alpha (\mathtt {0})=\mathtt {0} \mathtt {0} \mathtt {0} \mathtt {0}, \quad \alpha (\mathtt {1})=\mathtt {1} \mathtt {1} \mathtt {1} \mathtt {1}, \quad \alpha (\mathtt {b})=\mathtt {1} \mathtt {0}, \quad \alpha (\mathtt {e})=\mathtt {0} \mathtt {1}, \end{equation*}\) for both the pattern and the graph. Given any string \(x = x[1..m]\), we define its binary encoding \(\alpha (x) := \alpha (x[1]) \cdots \alpha (x[m])\). In the graph, we replace each \(\sigma\)-node with a path of as many nodes as characters in \(\alpha (\sigma)\).

To make this encoding work, we need to additionally make the pattern start with characters \(\mathtt {e} \mathtt {b} \mathtt {b}\) (instead of just \(\mathtt {b} \mathtt {b}\)) and end with characters \(\mathtt {e} \mathtt {e} \mathtt {b}\) (instead of just \(\mathtt {e} \mathtt {e}\)) to exploit the properties of sequence \(\mathtt {e} \mathtt {b}\). Moreover, this entails that, in the graph, we also have to place and connect a new \(\mathtt {e}\)-node to each \(\mathtt {b}\)-node used to mark the beginning of a viable match; in the same manner, we need to add a new \(\mathtt {b}\)-node after every \(\mathtt {e}\)-node used to mark the end of a match.

We can now assume that the graph and the pattern have been changed as described in the previous section so that the graph has maximum sum of indegree and outdegree 3. The goal is to show that there is a bijection between the matches before and after applying such encoding and reduction adjustments.

At this point, we apply the \(\alpha\) encoding, and nodes with labels of length 2 and 4 will be replaced by chains of nodes labeled by single characters each. Note that in graph \(G,\) the two out-neighbors of a node can only be a \(\mathtt {0}\)-node and a \(\mathtt {1}\)-node, or a \(\mathtt {b}\)-node and an \(\mathtt {e}\)-node; since the two encodings in each pair begin with distinct characters, this encoding keeps the graph deterministic. We now prove some key properties of the chosen encoding.

Observe that even if we modified the graph to reduce the degree, it still holds that the subgraphs of G where matches of some subpattern can be present are separated by an \(\mathtt {e} \mathtt {b}\)-edge (recall Figure 7(b) and (c)). Thus, the following synchronizing property is useful.

Lemma 3.5.

For any string \(x \in \Sigma ^+\), its binary encoding \(\alpha (x)\) contains \(\mathtt {0} \mathtt {1} \mathtt {1} \mathtt {0}\) if and only if x contains \(\mathtt {e} \mathtt {b}\).

Proof.

We observe that \(\mathtt {e}\) and \(\mathtt {b}\) are encoded by two bits each, whereas \(\mathtt {0}\) and \(\mathtt {1}\) are encoded by four bits each. Hence, \(\mathtt {0} \mathtt {1} \mathtt {1} \mathtt {0}\) can appear in \(\alpha (x)\) only by concatenating the binary encodings of two or three consecutive symbols of x. Likewise, \(\mathtt {e} \mathtt {b}\) occurs in x if and only if it occurs in some substring of x of length at most 3. Consequently, it suffices to check the claim by inspecting all 64 strings of length 3 over \(\Sigma\), \(\mathtt {0} \mathtt {0} \mathtt {0},\dots , \mathtt {e} \mathtt {e} \mathtt {e}\), and their encodings to see that the property holds.□
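The case analysis in this proof can be replayed mechanically. The following Python sketch (helper names are ours) checks the synchronizing property over all 64 strings of length 3:

```python
from itertools import product

alpha = {"0": "0000", "1": "1111", "b": "10", "e": "01"}

def encode(x):
    """Binary encoding alpha applied character by character."""
    return "".join(alpha[c] for c in x)

# alpha(x) contains 0110 exactly when x contains the substring eb
for x in map("".join, product("01be", repeat=3)):
    assert ("0110" in encode(x)) == ("eb" in x)
```

For instance, \(\alpha(\mathtt{e}\mathtt{b}) = \mathtt{0}\mathtt{1}\cdot\mathtt{1}\mathtt{0} = \mathtt{0}\mathtt{1}\mathtt{1}\mathtt{0}\), while no other boundary between code words produces this string.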

An immediate consequence of Lemma 3.5 is that the encoding preserves the occurrences. Let \(G^{(ex)}\) be the deterministic DAG reduced to have the maximum sum of indegree and outdegree 3, extended with the extra \(\mathtt {b}\)- and \(\mathtt {e}\)-nodes, and let \(P^{(ex)}\) be the pattern corresponding to this reduced graph, extended with the \(\mathtt {b}\) and \(\mathtt {e}\) characters. Let \(\alpha (G^{(ex)})\) denote the graph obtained from \(G^{(ex)}\) by relabeling its nodes with the binary encoding \(\alpha\) of their labels and substituting the nodes that now have labels of length 2 and 4 with directed chains of 2 and 4 nodes, respectively, whose nodes are labeled with single characters.

Lemma 3.6.

In the reduction, \(P^{(ex)}\) has a match in \(G^{(ex)}\) if and only if \(\alpha (P^{(ex)})\) has a match in \(\alpha (G^{(ex)})\).

Proof.

The forward implication is trivial. For the reverse implication, observe that by Lemma 3.5, in any match of \(\alpha (P^{(ex)})\) in \(\alpha (G^{(ex)})\), the encoding of the string \(\mathtt {e} \mathtt {b}\) in the pattern is aligned with the encoding of the \(\mathtt {e} \mathtt {b}\)-edges in the graph. As such, the encodings of all other characters of the pattern are aligned with the encodings of their corresponding nodes of the graph, and thus \(P^{(ex)}\) has a match in \(G^{(ex)}\).□

Skip 4UNDIRECTED GRAPHS: ZIG-ZAG MATCHING Section

4 UNDIRECTED GRAPHS: ZIG-ZAG MATCHING

In this section, we prove Theorem 1.2. To this end, we need to modify the previous reduction, defining a new alphabet, pattern, and graph. The main ideas will be the same, but since the graph will now be a single undirected path, some key changes will be needed. In Section 4.1, we introduce a reduction in which the alphabet has cardinality 6, and in Section 4.2, we show how to reduce the alphabet to binary.

4.1 Non-Binary Alphabet

The original alphabet \(\Sigma = \lbrace \mathtt {b},\mathtt {e},\mathtt {0},\mathtt {1} \rbrace\) is replaced with \(\Sigma ^{\prime } = \lbrace \mathtt {b},\mathtt {e},\mathtt {A},\mathtt {B},\mathtt {s},\mathtt {t} \rbrace\). Characters \(\mathtt {0}\) and \(\mathtt {1}\) are encoded in the following manner: \(\begin{equation*} \mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \qquad\text{ and }\qquad \mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}. \end{equation*}\) When such encoding is applied, character \(\mathtt {s}\) will be used as a separator marking the beginning and the end of the old characters. As an example, the subpattern \(\begin{equation*} P_{x_i} = \mathtt {1} ~\mathtt {0} ~\mathtt {1}\qquad \text{ will be encoded as }\qquad P^{\prime }_{x_i} = \mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s}. \end{equation*}\)

A new pattern \(P^{\prime }\) is built by applying this encoding to each one of the subpatterns \(P_{x_i}\), thus obtaining new subpatterns \(P^{\prime }_{x_i}\). We then concatenate all the subpatterns \(P^{\prime }_{x_i}\) by placing the new character \(\mathtt {t}\) to separate them, instead of \(\mathtt {e} \mathtt {b}\). Finally, we place characters \(\mathtt {b} \mathtt {t}\) at the beginning of the new pattern and \(\mathtt {t} \mathtt {e}\) at the end. We have the following example. \(\begin{align*} &P = \texttt {bb 100 e b 101 ee}\\ &\begin{matrix} ~ & ~ & \mathtt {1} & ~ & \mathtt {0} & ~ & \mathtt {0} &\\ P^{\prime } = \mathtt {b} &\mathtt {t} ~\mathtt {s} & \mathtt {A} \mathtt {B} \mathtt {A} & \mathtt {s} & \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} & \mathtt {s} &\mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} & \mathtt {s} \\ ~ & ~ & \mathtt {1} & ~ & \mathtt {0} & ~ & \mathtt {1} &\\ ~ & \mathtt {t} ~\mathtt {s} &\mathtt {A} \mathtt {B} \mathtt {A} & \mathtt {s} &\mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} & \mathtt {s} &\mathtt {A} \mathtt {B} \mathtt {A} & \mathtt {s} & \mathtt {t} ~\mathtt {e} \end{matrix} \end{align*}\)

Note that for each subpattern, we are introducing a constant number of new characters, hence the size of the entire pattern \(P^{\prime }\) still is \(O(nd)\).
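This encoding step is mechanical; the following Python sketch (function names are ours) reproduces the example above:

```python
def encode_subpattern(x):
    """P'_{x_i}: entries encoded as ABA (for 1) or ABABABA (for 0),
    delimited by the separator s."""
    enc = {1: "ABA", 0: "ABABABA"}
    return "s" + "s".join(enc[bit] for bit in x) + "s"

def build_P_prime(X):
    """b t P'_{x_1} t P'_{x_2} t ... t P'_{x_n} t e."""
    return "bt" + "t".join(encode_subpattern(x) for x in X) + "te"

X = [[1, 0, 0], [1, 0, 1]]
assert build_P_prime(X) == "btsABAsABABABAsABABABAstsABAsABABABAsABAste"
```

Each subpattern contributes \(O(d)\) characters, confirming the \(O(nd)\) bound on \(|P^{\prime }|\).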

An analogous encoding will be applied to the graph. The strategy is to encode \(G_W\) in an undirected path by concatenating subpaths representing each \(G_W^{(j)}\), one after another.

The positions h in which both a \(\mathtt {0}\)- and a \(\mathtt {1}\)-node are present in \(G_W^{(j)}\) are replaced by a path that can be matched both by \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) and \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\). Positions h with only a \(\mathtt {0}\)-node and no \(\mathtt {1}\)-node are encoded instead with a path that can be matched only by \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) (Figure 9). We use \(\mathtt {s}\)-nodes to separate these paths. We denote by \(LG_W^{(j)}\) (Linear \(G_W^{(j)}\)) this linearized version of \(G_W^{(j)}\). Moreover, given subgraph \(G_W^{(j)}\), two new \(\mathtt {t}\)-nodes will mark the beginning and the ending of its encoding. Figure 10 illustrates this transformation for \(G_W^{(j)}\).

Fig. 9.

Fig. 9. New substructures. (a) The old substructure is replaced by an undirected path that can match either \(\mathtt {s} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {s}\) (which represents \(\mathtt {1}\)) by going forward only, or \(\mathtt {s} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {s}\) (which represents \(\mathtt {0}\)) by going forward, backward, and forward again. (b) An undirected path replacing a \(\mathtt {0}\)-node can match only the string \(\mathtt {s} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {s}\).
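The behavior claimed for these substructures can be verified by brute force. The sketch below (our own naming; not the algorithm under study) simulates all walks on an undirected path, NFA-style, and checks the cases of Figure 9:

```python
def walk_matches(labels, pattern, start, end):
    """Return True if `pattern` spells the labels of some walk on the
    undirected path `labels` (node i is adjacent to i-1 and i+1) that
    starts at node `start` and ends at node `end`."""
    if labels[start] != pattern[0]:
        return False
    frontier = {start}
    for ch in pattern[1:]:
        frontier = {w for v in frontier for w in (v - 1, v + 1)
                    if 0 <= w < len(labels) and labels[w] == ch}
    return end in frontier

both = "sABAs"        # position holding both a 0-node and a 1-node
zero = "sABABABAs"    # position holding only a 0-node
assert walk_matches(both, "sABAs", 0, 4)       # matches 1, forward only
assert walk_matches(both, "sABABABAs", 0, 4)   # matches 0, via a zig-zag
assert walk_matches(zero, "sABABABAs", 0, 8)   # matches 0, forward only
assert not walk_matches(zero, "sABAs", 0, 8)   # cannot match 1
```

On the short path, the encoded \(\mathtt {0}\) matches by visiting nodes \(\mathtt {A}\mathtt {B}\mathtt {A}\mathtt {B}\mathtt {A}\mathtt {B}\mathtt {A}\) as positions 1, 2, 3, 2, 1, 2, 3, that is, forward, backward, and forward again.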

Fig. 10.

Fig. 10. A subgraph \(G_W^{(j)}\) is converted into a linear structure \(LG_W^{(j)}\) using \(\mathtt {s}\) as the separator.

In a similar manner, \(G_U\) is also encoded as a path. We do not need to encode all of its \(2n-2\) subgraphs: since the matching path can go through nodes more than once, we only need to encode one of these subgraphs, in the same manner as done for \(G_W^{(j)}\). Let \(LG_U\) be the linearized version of only one of the “wildcard” gadgets that composed the original \(G_U\).

Then, for each \(1 \le j \le n\), we build structure \(LG^{(j)}\) by placing \(\mathtt {t}\)-nodes, \(LG_U\) instances, \(LG_W^{(j)}\), a \(\mathtt {b}\)-node on the left, and an \(\mathtt {e}\)-node on the right, as in Figure 11. In such structure, the \(\mathtt {b}\)-node and the \(\mathtt {e}\)-node delimit the beginning and the end of a viable match for a pattern. The \(\mathtt {t}\)-nodes separate the \(LG_U\) structures from \(LG_W^{(j)}\), and in general, they mark the beginning and the end of a match for a subpattern \(P^{\prime }_{x_i}\). The idea behind \(LG^{(j)}\) is that a match of \(P^{\prime }\) can traverse \(LG_U\) from the beginning to the end, backward and forward as many times as needed, before starting a match of some subpattern \(P_{x_i}^{\prime }\) inside \(LG_W^{(j)}\). Notice also that this allows only subpatterns in even positions i to match inside \(LG_W^{(j)}\). We will address this minor issue at the end; see the paragraph following the proof of Lemma 4.3.

Fig. 11.

Fig. 11. The \(LG_W^{(j)}\) structure surrounded by two instances of \(LG_U\). The \(\mathtt {t}\)-nodes establish the beginning and the end of a match for a subpattern \(\mathtt {t} P^{\prime }_{x_i}\mathtt {t}\), while the \(\mathtt {b}\)- and \(\mathtt {e}\)-nodes are the starting and ending points for a match of the whole pattern \(P^{\prime }\).

To construct the final graph \(LG,\) we concatenate all \(LG^{(1)}\), \(LG^{(2)}\), ..., \(LG^{(n)}\) into a single undirected path. Figure 12 gives a picture of the end result.

Fig. 12.

Fig. 12. The final graph LG.

No issues arise regarding the size of the graph, since we are replacing every \(\mathtt {0}\)-node, or every pair of a \(\mathtt {0}\)-node and a \(\mathtt {1}\)-node, with a constant number of new nodes. By construction, the two gadgets \(LG_U\) and \(LG_W^{(j)}\) both have size \(O(d)\), since for each one of the d entries of a vector we place one of the two possible encodings. In \(LG,\) there are n instances of \(LG_W^{(j)}\), each one surrounded by two \(LG_U\) instances. Hence, the total size of the graph remains \(O(nd)\).

To prove the correctness of the reduction, we will show some properties of LG by introducing the following lemmas. We use \(t_lLG_W^{(j)}t_r\) to refer to \(LG_W^{(j)}\) extended with the \(\mathtt {t}\)-nodes on its left and on its right. When referring to the k-th \(\mathtt {s}\)-character in \(P^{\prime }_{x_i}\), we mean the k-th \(\mathtt {s}\)-character found scanning \(P^{\prime }_{x_i}\) from left to right; in the same manner, we refer to the k-th \(\mathtt {s}\)-node in \(LG_W^{(j)}\).

Lemma 4.1.

If subpattern \(\mathtt {t} P^{\prime }_{x_i}\mathtt {t}\) has a match in \(t_lLG_W^{(j)}t_r\) starting at \(t_l\) and ending at \(t_r\), then the k-th \(\mathtt {s}\)-character in \(P^{\prime }_{x_i}\) matches the k-th \(\mathtt {s}\)-node in \(LG_W^{(j)}\), for all \(1\le k \le d+1\).

Proof.

First we prove that all the \(\mathtt {s}\)-nodes in \(t_lLG_W^{(j)}t_r\) are matched exactly once by \(\mathtt {t} P^{\prime }_{x_i}\mathtt {t}\). By construction, subpattern \(P^{\prime }_{x_i}\) has \(d+1\) \(\mathtt {s}\)-characters, and \(LG_W^{(j)}\) has \(d+1\) \(\mathtt {s}\)-nodes. Since we are working on a chain of nodes and the match is starting at \(t_l\) and ending at \(t_r\), all the nodes between \(t_l\) and \(t_r\) have to be matched at least once by \(P^{\prime }_{x_i}\). Assume by contradiction that one such \(\mathtt {s}\)-node is matched more than once. Subpattern \(P^{\prime }_{x_i}\) is left with strictly less than d \(\mathtt {s}\)-characters available for matching the other d \(\mathtt {s}\)-nodes, and we reach a contradiction. Now we can prove the statement of the lemma by induction on k—that is, the index of the \(\mathtt {s}\)-characters and \(\mathtt {s}\)-nodes. Let \(\mathtt {s} ^{(P^{\prime }_{x_i})}_k\) denote the k-th \(\mathtt {s}\)-character in \(P^{\prime }_{x_i}\), and let \(s^{ \left(LG_W^{(j)} \right) }_k\) denote the k-th \(\mathtt {s}\)-node in \(LG_W^{(j)}\).

Base Case \(k = 1\). The match starts at \(t_l,\) hence the only node that \(\mathtt {s} ^{(P^{\prime }_{x_i})}_1\) can match is the first \(\mathtt {s}\)-node to the right of \(t_l\)—that is, \(s^{ \left(LG_W^{(j)} \right) }_1\).

Inductive Case \(k \gt 1\). The inductive hypothesis tells us that all the nodes up to \(s^{ \left(LG_W^{(j)} \right) }_k\) have been matched by consecutive \(\mathtt {s}\)-characters of \(P^{\prime }_{x_i}\) up to \(\mathtt {s} ^{(P^{\prime }_{x_i})}_k\). We have to prove the statement for \(k+1\). Starting from node \(s^{ \left(LG_W^{(j)} \right) }_k\), the next \(\mathtt {s}\)-nodes that can be matched by \(\mathtt {s} ^{(P^{\prime }_{x_i})}_{k+1}\) are \(s^{ \left(LG_W^{(j)} \right) }_{k-1}\) and \(s^{ \left(LG_W^{(j)} \right) }_{k+1}\). Character \(\mathtt {s} ^{(P^{\prime }_{x_i})}_{k+1}\) cannot match node \(s^{ \left(LG_W^{(j)} \right) }_{k-1}\) since it has already been matched by \(s^{(P^{\prime }_{x_i})}_{k-1}\) and, as argued earlier, every \(\mathtt {s}\)-node can be matched only once. Thus, \(\mathtt {s} ^{(P^{\prime }_{x_i})}_{k+1}\) has to match \(s^{ \left(LG_W^{(j)} \right) }_{k+1}\).□

Lemma 4.2.

Subpattern \(\mathtt {t} P^{\prime }_{x_i}\mathtt {t}\) has a match in \(t_lLG_W^{(j)}t_r\) starting at \(t_l\) and ending at \(t_r\) if and only if there exists \(y_j \in Y\) such that \(x_i \cdot y_j = 0\).

Proof.

This property has already been proved for gadget \(G_W\) in Lemma 3.2, thus it remains to prove that \(LG_W^{(j)}\) behaves the same as the subgadget \(G_W^{(j)}\). First recall that in the construction of \(LG_W^{(j)}\), we placed an encoded \(\mathtt {1}\) if in \(G_W^{(j)}\) we had both a \(\mathtt {0}\)-node and a \(\mathtt {1}\)-node in the same position, whereas we placed an encoded \(\mathtt {0}\) if we had only a \(\mathtt {0}\)-node. Lemma 4.1 guarantees that the encoding in \(P^{\prime }\) of a single character of P is aligned with the encoding in \(LG_W^{(j)}\) of a single node of \(G_W\), preventing (the encoding of) a character of P from matching (the encoding of) multiple nodes of \(G_W\) and vice versa. By construction, \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) can match the encoding of a \(\mathtt {1}\)-node while it fails to match the encoding of the \(\mathtt {0}\)-nodes, since their encoding involves too many characters. On the other hand, \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) can match an encoded \(\mathtt {0}\)-node with a natural alignment, but it can also match the encoding of a \(\mathtt {1}\)-node by scanning it forward, backward, and forward again. Therefore, the logic behind \(LG_W^{(j)}\) safely implements that of \(G_W^{(j)}\), and from this point onward, one can follow the same reasoning as in Lemma 3.2 to complete the proof.□
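For small d, Lemma 4.2 can also be validated exhaustively. In the sketch below we assume the convention that position h of \(LG_W^{(j)}\) carries the short segment ABA when \(y_j[h] = 0\) (both nodes present in \(G_W^{(j)}\)) and the long segment ABABABA when \(y_j[h] = 1\) (only the \(\mathtt {0}\)-node); the helper names are ours:

```python
from itertools import product

def walk_matches(labels, pattern, start, end):
    """NFA-style check: does `pattern` spell a walk on the undirected
    path `labels` from node `start` to node `end`?"""
    frontier = {start} if labels[start] == pattern[0] else set()
    for ch in pattern[1:]:
        frontier = {w for v in frontier for w in (v - 1, v + 1)
                    if 0 <= w < len(labels) and labels[w] == ch}
    return end in frontier

def t_pattern(x):   # t P'_x t
    enc = {1: "ABA", 0: "ABABABA"}
    return "ts" + "s".join(enc[b] for b in x) + "st"

def t_lgw(y):       # t_l LG_W^{(j)} t_r, under the assumed convention
    seg = {0: "ABA", 1: "ABABABA"}
    return "ts" + "s".join(seg[b] for b in y) + "st"

# the subpattern matches from t_l to t_r exactly when x and y are orthogonal
d = 3
for x in product([0, 1], repeat=d):
    for y in product([0, 1], repeat=d):
        path = t_lgw(y)
        matched = walk_matches(path, t_pattern(x), 0, len(path) - 1)
        assert matched == all(a * b == 0 for a, b in zip(x, y))
```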

The main difference from the original proof lies in assuming that a match for \(P^{\prime }_{x_i}\) starts at \(t_l\) and ends at \(t_r\). This feature is crucial for the correctness of the reduction and can be safely exploited since, as shown in the following, the \(\mathtt {b}\)- and \(\mathtt {e}\)-nodes guarantee that in case of a match for \(P^{\prime }\) we will cross the \(LG_W^{(j)}\) gadget from left to right at least once.

Lemma 4.3.

Pattern \(P^{\prime }\) has a match in LG if and only if there exist i and j such that i is even and subpattern \(\mathtt {t} P^{\prime }_{x_i}\mathtt {t}\) has a match in \(t_lLG_W^{(j)}t_r\) starting at \(t_l\) and ending at \(t_r\).

Proof.

For the (\(\Rightarrow\)) implication, first observe that the \(\mathtt {b}\)- and \(\mathtt {e}\)-nodes in LG force a direction to follow. Let \(LG_{Ul}^{(j)}\) and \(LG_{Ur}^{(j)}\) be the \(LG_U\) gadgets to the left and to the right of \(LG_W^{(j)}\), respectively. Since pattern \(P^{\prime }\) starts with a \(\mathtt {b}\) and ends with an \(\mathtt {e}\), a match can only start at the \(\mathtt {b}\)-node on the left of \(LG_{Ul}^{(j)}\) and end at the \(\mathtt {e}\)-node on the right of \(LG_{Ur}^{(j)}\), for some j. Hence, \(LG_W^{(j)}\) needs to be crossed by a match from left to right at least once. Thus, there must exist a subpattern \(\mathtt {t} P^{\prime }_{x_i}\mathtt {t}\) that has a match starting at \(t_l\) and ending at \(t_r\). For such a pattern, Lemma 4.2 applies. Moreover, because of our construction, only a subpattern in an even position can achieve such a match.

The (\(\Leftarrow\)) implication is immediate since, given a subpattern \(\mathtt {t} P^{\prime }_{x_i}\mathtt {t}\) that has a match in \(t_lLG_W^{(j)}t_r\), one can match \(\mathtt {b} \mathtt {t} P^{\prime }_{x_1}\mathtt {t} \ldots \mathtt {t} P^{\prime }_{x_{i-1}}\mathtt {t}\) in \(LG_{Ul}^{(j)}\) and \(\mathtt {t} P^{\prime }_{x_{i+1}}\mathtt {t} \ldots \mathtt {t} P^{\prime }_{x_n}\mathtt {t} \mathtt {e}\) in \(LG_{Ur}^{(j)}\) and thus have a full match for \(P^{\prime }\) in LG.□

Since Lemma 4.3 gives us a property that holds only if a subpattern is in an even position, we need to tweak pattern \(P^{\prime }\) to make the reduction work. Indeed, we define two patterns. The first pattern \(P^{\prime (1)}\) is \(P^{\prime }\) itself; the second pattern \(P^{\prime (2)}\) is obtained by swapping the subpatterns \(P^{\prime }_{x_i}\) in odd positions with the next subpatterns \(P^{\prime }_{x_{i+1}}\) in even positions, for every \(i = 1, 3, \ldots\) . For example, if n is even, we will have the following. \(\begin{align*} P^{\prime (1)} &= \mathtt {b} \mathtt {t} ~P^{\prime }_{x_1}~\mathtt {t} ~P^{\prime }_{x_2}~\mathtt {t} ~P^{\prime }_{x_3}~\mathtt {t} ~P^{\prime }_{x_4}~\mathtt {t} ~ \ldots ~\mathtt {t} ~P^{\prime }_{x_{n-1}}~\mathtt {t} ~ P^{\prime }_{x_n}~\mathtt {t} \mathtt {e} = P^{\prime }\\ P^{\prime (2)} &= \mathtt {b} \mathtt {t} ~P^{\prime }_{x_2}~\mathtt {t} ~P^{\prime }_{x_1}~\mathtt {t} ~P^{\prime }_{x_4}~\mathtt {t} ~P^{\prime }_{x_3}~\mathtt {t} ~ \ldots ~\mathtt {t} ~P^{\prime }_{x_n}~\mathtt {t} ~ P^{\prime }_{x_{n-1}}~\mathtt {t} \mathtt {e} \end{align*}\) While \(P^{\prime (1)}\) checks the even positions of \(P^{\prime }\), \(P^{\prime (2)}\) checks the odd ones. If n is even, then neither \(P^{\prime (1)}\) nor \(P^{\prime (2)}\) can have a match in LG, since after matching an even number of subpatterns it is not possible to match any \(\mathtt {e}\)-node. In such case, we can simply add a dummy subpattern \(\bar{P} = \mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s} \ldots \mathtt {s} ~\mathtt {A} \mathtt {B} \mathtt {A} ~\mathtt {s}\) (with d repetitions of \(\mathtt {A} \mathtt {B} \mathtt {A}\)) at the end of \(P^{\prime }\) as if it were its last subpattern so that the number of subpatterns becomes odd. Indeed, observe that \(\bar{P}\) corresponds to vector \(\bar{x} = (1 1 \ldots 1)\), which has zero inner product only with vector \(\bar{y} = (0 0 \ldots 0)\).
Hence, if \(\bar{y} \not\in Y,\) then \(\bar{P}\) does not have a match in any \(LG^{(j)}\), whereas if \(\bar{y} \in Y,\) every subpattern \(P^{\prime }_{x_i}\) has a match in the \(LG^{(j)}\) built on top of \(\bar{y}\). This means that \(\bar{P}\) does not disrupt our reduction.4
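The observation about \(\bar{P}\) boils down to a one-line fact about inner products, checked exhaustively for a small d in this sketch (d = 4 is an arbitrary choice):

```python
from itertools import product

d = 4
x_bar = [1] * d   # the vector corresponding to the dummy subpattern P-bar

# x_bar has zero inner product with y exactly when y is the all-zero vector
for y in product([0, 1], repeat=d):
    orthogonal = all(a * b == 0 for a, b in zip(x_bar, y))
    assert orthogonal == (sum(y) == 0)
```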

Now we are ready to present the end result.

Lemma 4.4.

Either \(P^{\prime (1)}\) or \(P^{\prime (2)}\) has a match in LG if and only if there exist vectors \(x_i \in X\) and \(y_j \in Y\) which are orthogonal.

Proof.

For (\(\Rightarrow\)), we assume that either \(P^{\prime (1)}\) or \(P^{\prime (2)}\) has a match in LG. By Lemma 4.3, this means that there exists a subpattern \(P^{\prime (q)}_{x_i}, \; q \in \lbrace 1,2\rbrace\) that has a match in \(LG_W^{(j)}\), for some j. Lemma 4.2 then ensures that \(x_i \cdot y_j = 0\), thus \(x_i\) and \(y_j\) are orthogonal. For the other implication (\(\Leftarrow\)), we assume that there exist two OVs \(x_i \in X\) and \(y_j \in Y\). Thanks to Lemma 4.2, we find a subpattern \(P^{\prime }_{x_i}\) matching \(LG_W^{(j)}\). By construction, \(P^{\prime }_{x_i}\) has to be in an even position either in \(P^{\prime (1)}\) or in \(P^{\prime (2)}\). By Lemma 4.3, this means that either \(P^{\prime (1)}\) or \(P^{\prime (2)}\) has a match in LG.□

Theorem 1.2 follows directly from the correctness of these constructions, except for the alphabet size reduction to binary, which we cover in the next section.

4.2 Binary Alphabet

In this section, we explain how to reduce the alphabet from the reduction in Section 4.1 to be binary. For this purpose, we apply the following encoding \(\alpha\) to the characters: \(\begin{equation*} \alpha (\mathtt {A}) = \mathtt {A}, \quad \alpha (\mathtt {B}) = \mathtt {B}, \quad \alpha (\mathtt {s}) = \mathtt {A} \mathtt {A} \mathtt {A}, \quad \alpha (\mathtt {t}) = \mathtt {B} \mathtt {B} \mathtt {B}, \quad \alpha (\mathtt {b}) = \alpha (\mathtt {e}) = \mathtt {A} \mathtt {B} \mathtt {B} \mathtt {A} \mathtt {A} \mathtt {B}. \end{equation*}\)

Denote by \(\alpha (P^{\prime })\) and \(\alpha (LG)\) the encoded pattern and graph, respectively. Note that when applying the encoding to LG, we replace each \(\sigma\)-node with a sequence of nodes labeled with the characters of the encoding of \(\sigma\). Thus, we maintain the property that the label of each node is a single character. To prove correctness, it suffices to prove the following two lemmas.

Lemma 4.5.

If \(P^{\prime }\) has a match in LG, then \(\alpha (P^{\prime })\) has a match in \(\alpha (LG)\).

Proof.

Since the encoding replaces single symbols with multiple symbols, difficulties arise when a match of \(P^{\prime }\) in LG performs a change of direction. To understand how to handle this issue, let us follow a match of \(P^{\prime }\) in LG from left to right. As long as such match has no zig-zags (i.e., it does not change direction in LG), then it trivially holds that we can construct a match of \(\alpha (P^{\prime })\) in \(\alpha (LG)\). Suppose now that a change of direction happens: we first match node v, followed by w, followed by v again (i.e., the match changes direction at w).

If w is an old \(\mathtt {A}\)- or \(\mathtt {B}\)-node, then the encoding did not change w and \(\alpha (P^{\prime })\) can still match w. Observe also that w cannot be a \(\mathtt {b}\)- or an \(\mathtt {e}\)-node, by construction. The remaining case is when w is an \(\mathtt {s}\)- or a \(\mathtt {t}\)-node. If w is a \(\mathtt {t}\)-node, then v cannot be a \(\mathtt {b}\)-node, because sequence \(\mathtt {b}\) \(\mathtt {t}\) \(\mathtt {b}\) never occurs in the pattern. Note, however, that the encodings of \(\mathtt {s}\) and \(\mathtt {t}\) consist of three identical characters. Thus, the match of \(\alpha (P^{\prime })\) in \(\alpha (LG)\) can be made to use the border node of the encoding of w (the one adjacent to the encoding of v), then the middle node, and then the same border node again (i.e., to change direction at the middle node of the encoding of w).

At this point the match continues in the reverse direction. Notice that the encodings of \(\mathtt {A}\), \(\mathtt {B}\), \(\mathtt {s}\), and \(\mathtt {t}\) are all palindromic strings. Hence, all the previous reasoning for matching in the forward direction also applies in the reverse direction.□

Lemma 4.6.

If \(\alpha (P^{\prime })\) has a match in \(\alpha (LG)\), then \(P^{\prime }\) has a match in LG.

Proof.

To simplify notation, in this proof we treat \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) and \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) as single characters of \(P^{\prime }\). To prove the lemma, it suffices to prove that in any match of \(\alpha (P^{\prime })\) in \(\alpha (LG)\), the encodings of \(\mathtt {b}\), \(\mathtt {e}\), \(\mathtt {s}\), \(\mathtt {t}\), and \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) and \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) in the encoded pattern are precisely aligned with encoded \(\mathtt {b}\)-, \(\mathtt {e}\)-, \(\mathtt {s}\)-, and \(\mathtt {t}\)-nodes, and with nodes encoding \(\mathtt {1}\) and \(\mathtt {0}\), respectively, in the encoded graph. When saying that such encodings are aligned, we mean that the first and last characters of an encoding \(\alpha (\sigma)\) in the pattern must match either the first or the last node (irrespectively of which) of an encoding of the same character \(\sigma\) in \(\alpha (LG)\). For example, when character \(\alpha (\mathtt {s}) = \mathtt {A} \mathtt {A} \mathtt {A}\) separates \(\mathtt {0}\)- or \(\mathtt {1}\)-nodes, it can be properly aligned to the graph as shown in Figures 13 and 14.

Fig. 13. The three possible alignments for \(\alpha (\mathtt {s})=\mathtt {A} \mathtt {A} \mathtt {A}\) that can be obtained starting in the first position of \(\alpha (\mathtt {s})\) in the graph.

Fig. 14. The three possible alignments for \(\alpha (\mathtt {s})=\mathtt {A} \mathtt {A} \mathtt {A}\) that can be obtained starting in the last position of \(\alpha (\mathtt {s})\) in the graph.

We organize the proof of this lemma in two parts, proving separately two claims. The goal is to show that the encoding of the characters preserves the properties already proven for the non-binary case.

Claim 1.

Encodings \(\alpha (\mathtt {b})\) and \(\alpha (\mathtt {e})\) in the pattern can only be exactly aligned with encodings \(\alpha (\mathtt {b})\) and \(\alpha (\mathtt {e})\) in the graph, respectively, from left to right.

Proof.

First, note that the substrings \(\alpha (\mathtt {b}) = \alpha (\mathtt {e})\) of the encoded pattern cannot have a match in the encoded graph starting anywhere else than in the encoding of a \(\mathtt {b}\)- or \(\mathtt {e}\)-node. Indeed, \(\alpha (\mathtt {b})\) contains \(\mathtt {B} \mathtt {B}\), which appears in the graph only in the encoding of a \(\mathtt {t}\)-node (apart from the encoding of \(\mathtt {b}\)- or \(\mathtt {e}\)-nodes). Suppose for a contradiction that a match of \(\alpha (\mathtt {b})\) matches two \(\mathtt {B}\) characters from an encoding of a \(\mathtt {t}\)-node in the graph; in particular, \(\alpha (\mathtt {b})\) starts with \(\mathtt {A} \mathtt {B} \mathtt {B}\), and this prefix of \(\alpha (\mathtt {b})\) must end at the middle \(\mathtt {B}\)-node of \(\alpha (\mathtt {t})\). However, the character following \(\mathtt {A} \mathtt {B} \mathtt {B}\) in \(\alpha (\mathtt {b})\) is \(\mathtt {A}\), whereas any neighbor of the middle \(\mathtt {B}\)-node of \(\alpha (\mathtt {t})\) is labeled with \(\mathtt {B}\), a contradiction.

We now prove that \(\alpha (\mathtt {b})\) in the pattern can only be exactly aligned with \(\alpha (\mathtt {b})\) in the graph, from left to right. Suppose for a contradiction that this is not the case. We draw below the configuration in LG and \(\alpha (LG)\) at the border between \(LG^{(j)}\) and \(LG^{(j+1)}\) (the beginning and end of LG are the same, but missing \(\mathtt {t} \mathtt {e}\) and \(\mathtt {b} \mathtt {t}\), respectively). \(\begin{equation*} \begin{matrix} \hfill LG: & \cdots & \mathtt {t} & \mathtt {e} & \mathtt {b} & \mathtt {t} & \cdots \\ \hfill \alpha (LG): & \cdots & \mathtt {B} \mathtt {B} \mathtt {B} & \mathtt {A} \mathtt {B} \mathtt {B} \mathtt {A} \mathtt {A} \mathtt {B} & \mathtt {A} \mathtt {B} \mathtt {B} \mathtt {A} \mathtt {A} \mathtt {B} & \mathtt {B} \mathtt {B} \mathtt {B} & \cdots \end{matrix} \end{equation*}\) Under this assumption, there must be a way of aligning \(\alpha (\mathtt {b})\) to the graph other than an exact left-to-right match with \(\alpha (\mathtt {b})\) in the graph. We can analyze the alternative ways of aligning \(\alpha (\mathtt {b})\) to the graph by considering the possible starting positions of a potential alignment. Since \(\alpha (\mathtt {b})\) starts with \(\mathtt {A} \mathtt {B}\), a potential alignment in the graph might start at the second or third \(\mathtt {A}\) character of \(\alpha (\mathtt {b})\), or at the first, second, or third \(\mathtt {A}\) character of \(\alpha (\mathtt {e})\). In Figure 15, we analyze all five of these cases, concluding that each of them fails at some point. Completely symmetrically, we can argue that \(\alpha (\mathtt {e})\) in the pattern can only be exactly aligned with \(\alpha (\mathtt {e})\) in the graph, from left to right.□

Fig. 15. The five potential alignments for string \(\alpha (\mathtt {b})=\mathtt {A} \mathtt {B} \mathtt {B} \mathtt {A} \mathtt {A} \mathtt {B}\) that do not start in the first position of \(\alpha (\mathtt {b})\) in the graph. The squares around the characters highlight the mismatches. Cases (a) and (b) take into account potential alignments starting at the fourth and fifth position of \(\alpha (\mathtt {b})\) in the graph, respectively. Cases (c), (d), and (e) depict potential alignments starting at the first, fourth, or fifth position of \(\alpha (\mathtt {e})\) in the graph, respectively. In case (c), the mismatch occurs on the first \(\mathtt {B}\) character of \(\alpha (\mathtt {t})\) , which we know always follows \(\alpha (\mathtt {b})\) .

Claim 2.

Encodings of \(\mathtt {t}\), \(\mathtt {s}\), and of the substrings \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) and \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) in the encoded pattern are aligned with the corresponding encoded nodes in \(\alpha (LG)\).

Proof.

We prove this claim by induction on the position of the current character in \(P^{\prime }\).

Substrings \(\alpha (\mathtt {s})\) and \(\alpha (\mathtt {t})\) of the encoded pattern are allowed to change direction inside the encoded graph, but they must still start and end at an extremity of the occurrences of \(\alpha (\mathtt {s})\) and \(\alpha (\mathtt {t})\) in the encoded graph, respectively.

Suppose that the prefix \(\alpha (\mathtt {b})\) of \(\alpha (P^{\prime })\) matches \(\alpha (\mathtt {b})\) in the substructure \(\alpha (LG^{(j)})\) of \(\alpha (LG)\). As the base case, observe that the characters in \(\alpha (P^{\prime })\) following \(\alpha (\mathtt {b})\) are \(\alpha (\mathtt {t})\alpha (\mathtt {s})\), followed by \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) or \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\). It can be easily checked that they must match from left to right those nodes that follow \(\alpha (\mathtt {b})\) in the encoded graph.

For the inductive case, suppose first that the current character of \(P^{\prime }\) is \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) (or \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\)). By construction, the character of \(P^{\prime }\) preceding it can only be \(\mathtt {s}\), and by induction, we have that \(\alpha (\mathtt {s}) = \mathtt {A} \mathtt {A} \mathtt {A}\) is aligned with the nodes encoding \(\mathtt {s}\) in the graph. The match of \(\alpha (P^{\prime })\) cannot go back using the \(\mathtt {A}\)-nodes of \(\alpha (\mathtt {s})\), because it would not have a \(\mathtt {B}\)-node to continue the match. Thus, it must use the nodes of the encoded graph corresponding to \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) (or \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\)). Moreover, it cannot continue into the next occurrence of \(\alpha (\mathtt {s})\) in the graph, since those are all \(\mathtt {A}\)-nodes, whereas the pattern requires a \(\mathtt {B}\) character at that point. Therefore, if the current character of \(P^{\prime }\) is \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) or \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\), then it is aligned with the corresponding nodes in the encoded graph.

Suppose now that the current character of \(P^{\prime }\) is \(\mathtt {s}\). The character of \(P^{\prime }\) preceding it can be \(\mathtt {t}\), \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\), or \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\). In case the preceding character is \(\mathtt {t}\), the match of \(\alpha (P^{\prime })\) in the encoded graph cannot go back to using nodes encoding \(\mathtt {t}\) because they are all \(\mathtt {B}\)-nodes. Suppose thus that the preceding character is \(\mathtt {1} = \mathtt {A} \mathtt {B} \mathtt {A}\) or \(\mathtt {0} = \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A} \mathtt {B} \mathtt {A}\) and the encoding of \(\mathtt {s}\) in the pattern goes back to use nodes from the encoding of such preceding character. This means it can only match the first \(\mathtt {A}\)-node of \(\alpha (\mathtt {s})\), then go back to the \(\mathtt {A}\) node of the encoding of this previous character, and then use the same first \(\mathtt {A}\)-node of \(\alpha (\mathtt {s})\). Notice that this is allowed by our notion of alignment.

The last remaining case is when the current character of \(P^{\prime }\) is \(\mathtt {t}\). Since this \(\mathtt {t}\) occurrence is not the first one (which was handled in the base case), the character preceding it in \(P^{\prime }\) is always \(\mathtt {s}\), and recall that \(\alpha (\mathtt {s}) = \mathtt {A} \mathtt {A} \mathtt {A}\). Also in this case, the encoding \(\alpha (\mathtt {t}) = \mathtt {B} \mathtt {B} \mathtt {B}\) cannot go back and use such \(\mathtt {A}\)-nodes of \(\alpha (\mathtt {s})\); thus, it must align to the encoding of a \(\mathtt {t}\)-node.□

Claims 1 and 2 complete the proof of this lemma, since they allow us to apply the same reasoning as in the non-binary case.□

5 ADDITIONAL RESULTS

5.1 A Linear Time Algorithm for Almost Trees

Directed pseudo-forests are directed graphs whose nodes have outdegree at most 1; their transposes are graphs whose nodes have indegree at most 1. Both of these graph classes lie between our conditional hardness results and the linear time solvable case of string matching. Such structures are forests of directed trees whose roots may be connected in a directed cycle (at most one cycle per connected component).

Exact string matching in a tree whose edges are directed from root to leaves (graphs whose nodes have indegree at most 1) can be solved in linear time. One such algorithm [2] works on constant alphabet, but there is a folklore alphabet-independent solution through a simple variation of the KMP algorithm [36]: recall that after linear time preprocessing of the pattern \(P[1..m]\), KMP scans through the text string T, updating index i in the pattern in amortized constant time to find the longest prefix \(P[1..i]\) that matches suffix \(T[j-i+1..j]\) of the current position j in the text. One can simulate this algorithm on a tree by just storing the current value of index i at each node before branching.
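The tree simulation just described can be sketched as follows. This is our own illustrative Python, not code from [2] or [36]; the tree representation (child lists and a label dictionary), function names, and driver conventions are assumptions.

```python
def failure_function(p):
    """Standard KMP failure (border) table for pattern p."""
    f = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = f[k - 1]
        if p[i] == p[k]:
            k += 1
        f[i] = k
    return f

def tree_matches(root, children, label, p):
    """Nodes at which a match of p ends, on a tree directed root-to-leaves."""
    f = failure_function(p)
    hits = []
    # Each stack entry (v, i) stores i = length of the longest prefix of p
    # matching a suffix of the labels on the path ending at v's parent.
    stack = [(root, 0)]
    while stack:
        v, i = stack.pop()
        while i > 0 and label[v] != p[i]:
            i = f[i - 1]
        if label[v] == p[i]:
            i += 1
        if i == len(p):
            hits.append(v)
            i = f[i - 1]          # keep scanning for overlapping matches
        for w in children.get(v, []):
            stack.append((w, i))  # each branch resumes from the stored i
    return hits
```

Each branch resumes the scan from the prefix length stored at its parent, so every edge is processed in amortized constant time, giving an alphabet-independent linear time algorithm.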

One can reduce our special case to the tree case as follows. Cut the cycle at any edge \((v,w)\) to form a tree rooted at w. Read the cycle from v backward (possibly wrapping around several times) to form a string \(S[1..m]\), where m is the pattern length. Create a path spelling the reverse of \(S[1..m]\) and connect this path to the root w, forming a new tree. Pattern matching on this tree takes linear time [2].

To see that the reduction works correctly, consider the root r of some tree hanging from the cycle. Let \(S^r\) be the infinite string formed by reading the cycle backward starting at r. For searching a pattern of length m spanning r, it suffices to add a path spelling the reverse of \(S^r[1..m]\) on top of r and use the linear time solution for trees [2]. Furthermore, observe that the infinite strings \(S^r\) for all roots r along the cycle overlap, so it suffices to linearize the cycle until each root is preceded by a length-m prefix of the reverse of its infinite string \(S^r\). To also cover matches inside the cycle, one can similarly treat any node on the cycle as a root. The reduction covers these cases.
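The unrolling step can be sketched as follows (an illustrative helper of our own; the list representation of the cycle and the function name are assumptions). The cycle is given as the list of its node labels, with an arc from `cycle[i]` to `cycle[(i + 1) % n]`, and the function returns, top to bottom, the labels of the path to graft above the root `w = cycle[(cut + 1) % n]` after cutting the edge at index `cut`.

```python
def unroll_cycle(cycle, cut, m):
    """Labels of the grafted path: read the cycle backward from v = cycle[cut],
    possibly wrapping around several times, then reverse the result so the
    path spells the m labels preceding w along the cycle."""
    n = len(cycle)
    v = cut
    backward = []
    for _ in range(m):
        backward.append(cycle[v])
        v = (v - 1) % n
    return backward[::-1]
```

Reading the grafted path top-down and continuing through w reproduces the labels encountered along the cycle, so any match spanning w in the original pseudo-forest survives in the tree.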

Finally, the symmetric case of a cycle containing roots of upward directed trees (graphs whose nodes have outdegree at most 1) can be reduced to the former case by reversing all edges and the pattern.

5.2 Language Intersection of Two DFAs

We can show a connection between SMLG and the intersection emptiness problem by turning a deterministic DAG and a pattern into two DFAs. We do so by modifying the graph of our reduction so that we also obtain a reduction from OV to intersection emptiness.

Let G be the 3-DDAG obtained in the reduction of Section 3. We can obtain a DFA \(D_1\) from G as follows. First, the nodes in G become the states of \(D_1\), and each arc \((u,v)\) in G gives a transition from state u to state v in \(D_1\) with symbol \(L(v)\). Also let S be the states in \(D_1\) that correspond to \(\mathtt {b}\)-nodes in G with zero indegree. We add \(O(|S|)\) states to \(D_1\) forming a binary tree whose root becomes the initial state of \(D_1\); the leaves of this tree have transitions to the states in S with symbol \(\mathtt {b}\). Each of these new states has a transition to its left child labeled with L and a transition to its right child labeled with R.

The other DFA \(D_2\) is obtained from P as follows. We employ the same tree with \(|S|\) leaves as earlier, except that the transitions from these leaves with \(\mathtt {b}\) go to the same state: from this state, we have a simple chain of states that spells P. We can observe that P occurs in G if and only if the languages of \(D_1\) and \(D_2\) have a nonempty intersection, as this amounts to finding an occurrence of P starting from one of the \(\mathtt {b}\)-nodes in G corresponding to a state in S.
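For illustration, intersection emptiness of two DFAs can be decided by a breadth-first search over the (lazily built) product automaton. The sketch below, with its toy transition-table representation, is our own and is not the specific \(D_1\) and \(D_2\) of the reduction.

```python
from collections import deque

def intersection_nonempty(d1, d2, start1, start2, acc1, acc2):
    """Decide whether L(D1) and L(D2) intersect.

    d1, d2: partial transition tables mapping (state, symbol) -> state
    (missing entries behave as a dead state); acc1, acc2: accepting sets.
    """
    seen = {(start1, start2)}
    queue = deque(seen)
    alphabet = {s for (_, s) in d1} | {s for (_, s) in d2}
    while queue:
        q1, q2 = queue.popleft()
        if q1 in acc1 and q2 in acc2:
            return True          # some word reaches accepting states in both
        for s in alphabet:
            if (q1, s) in d1 and (q2, s) in d2:
                nxt = (d1[(q1, s)], d2[(q2, s)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False
```

The search visits at most one product state per pair, so the check runs in time proportional to the product of the two automata sizes; the conditional lower bound above says this quadratic behavior cannot be substantially improved.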

6 DISCUSSION

The lower bounds that we presented for directed deterministic graphs are tight with regard to the structure of the graph, in the sense that lowering the degree or the alphabet size makes the problem solvable in subquadratic time. Lowering the degree from 3 makes the problem fall into the almost-tree category that we dealt with in Section 5.1. Lowering the alphabet size to unary means that the graph can only consist of a set of paths or cycles. If there is a cycle in the graph, the pattern always matches; otherwise, one can easily check in linear time whether there is a long enough path for the pattern to match. Similarly trivial or degenerate cases occur when considering the same restrictions for directed non-deterministic, undirected deterministic, and undirected non-deterministic graphs.

Our reductions create sparse graphs \(G=(V,E)\) with \(|E|=O(|V|)\), and hence the results also cover the difficulty of finding \(O(|V|^{1-\epsilon } \, m)\) or \(O(|V| \, m^{1-\epsilon })\) time algorithms for SMLG. This difficulty carries over to non-deterministic subdense graphs with \(|E|=O(|V|^{2-\epsilon })\) and alphabet size at least 3: given a sparse graph \(G^{\prime }=(V,E^{\prime })\) and a pattern P of length m from a binary alphabet, convert \(G^{\prime }\) into a subdense graph \(G=(V,E)\) by adding \(|E|\) spurious arcs labeled with a third symbol. In other words, unless the OV hypothesis fails, there is no \(O(|E|+|V|^{1-\epsilon } \, m +|E|^{\frac{1}{2}}\, m)\) time algorithm for SMLG on subdense graphs G for \(m=O(|V|)\). However, for dense graphs with \(|E|=\omega (|V|^{2-\epsilon })\), there is room to improve the bounds.

Open Problem 1.

Is there an \(O(|E|+|V|\, m+ |E|^{\frac{1}{2}}\, m)\) time algorithm for SMLG on dense graphs?

Other natural directions to continue the study include the tradeoff between indexing and query time on string matching for graphs, as well as a closer examination of other possible string-alike graph classes than those already covered here.

For the former, a slight modification of the proof of Theorem 1.1 results in conditional hardness of finding \(O(|E|^\alpha m^\beta)\) time algorithms for SMLG for any \(\alpha ,\beta \gt 0\) with \(\alpha +\beta \lt 2\). This observation can then be exploited in a self-reduction [24], showing that one cannot achieve subquadratic search times using polynomial time for indexing (under the OV hypothesis).

For the latter, one possible direction is to consider degenerate generalized strings [4]: a sequence \(S=S_1, S_2, \ldots , S_n\) is a degenerate generalized string if each set \(S_i\) consists of strings of fixed length \(n_i\). When interpreted as an automaton, the language of S is the Cartesian product of its sets. It was recently shown that language intersection emptiness on two degenerate generalized strings can be decided in linear time in the total size of the sets [4]. However, if the requirement of equal-length strings is relaxed, the complexity of string matching on such elastic degenerate strings has been shown to have a tight connection with fast matrix multiplication [12]. Naturally, our reductions do not cover graphs representing degenerate generalized strings. They also do not cover the elastic case, but rather another relaxation of degenerate generalized strings: suppose that the Cartesian product taking all combinations of consecutive sets is replaced by an arbitrary selection of a subset of the combinations of consecutive sets. A characteristic feature of graphs resulting from this relaxation is that all paths from one node to another have the same length. This is also a feature of our reduction graphs. Hence, other features need to be identified to close this gap between linear time solvability and conditional quadratic time hardness; interestingly, conditional hardness of indexing elastic degenerate strings has been established without a direct link to the complexity of the online version [29].

After our last submission of this work for review, many new research directions have emerged around the topic. Some of these are already covered in a survey [42]. In the following, we briefly discuss some recent directions.

The conditional lower bounds have been strengthened to consider how many logarithmic factors can be shaved off from the quadratic complexity [30]. The conclusion is that if the time complexity has an \(O(\log ^c m)\) or \(O(\log ^c |E|)\) term in its denominator, then the exponent c is bounded by a constant. New graph properties have been identified that make graphs amenable to indexing: graphs that can be partially sorted [18], graphs parameterized by the maximum width of their co-lexicographic relation [17], and graphs induced from a suitable segmentation of multiple sequence alignments [25] admit efficient indexing schemes. The latter work adapts a reduction technique from this work to show that an arbitrary segmentation of a multiple sequence alignment does not break the conditional lower bound, but one needs a stronger property. Further complexity results have also been derived for online exact and approximate matching on different graph classes [14, 20, 32]. Finally, SMLG has also been studied under the quantum model of computation [19], achieving a subquadratic solution for non-sparse graphs.

ACKNOWLEDGMENTS

We would like to thank Alessio Conte, Luca Versari, and Bastien Cazaux for useful and inspirational conversations. We would also like to thank an anonymous reviewer of a previous version of this article for pointing out the open problem on dense graphs.

Although the work of Backurs and Indyk [9] represents the closest connection with our results, a folklore proof by Russell Impagliazzo about the hardness of the NFA acceptance problem was also known. We would like to thank Karl Bringmann for bringing this proof to our attention.5

Footnotes

  1. Note that we can also define the node labels as nonempty strings, but it suffices to use single symbols to show that string matching in graphs is challenging.

  2. Note that \(\mathtt {1}\) is a symbol of \(\Sigma\), whereas 1 is the truth value in \(x_i\).

  3. An \(\mathtt {e}\)-node can have two \(\mathtt {b}\)-nodes as out-neighbors when linking \(G_{U1}\) to \(G_W\) (see [23]).

  4. An alternative strategy is to use only one pattern \(P^{\prime \prime }\) instead of two, defined as \(\begin{equation*} P^{\prime \prime } = \mathtt {b} \mathtt {t} ~\bar{P}~\mathtt {t} ~P^{\prime }_{x_1}~\mathtt {t} ~\bar{P}~\mathtt {t} ~P^{\prime }_{x_2}~\mathtt {t} ~\bar{P}~ \ldots ~\mathtt {t} ~\bar{P}~\mathtt {t} ~ P^{\prime }_{x_n}~\mathtt {t} ~\bar{P}~\mathtt {t} \mathtt {e}. \end{equation*}\) The “dummy” subpatterns \(\bar{P}\) encode a \(\mathtt {1}\) in every position and guarantee that we always have an odd number of subpatterns in \(P^{\prime \prime }\). Moreover, every actual subpattern \(P^{\prime }_{x_i}\) has a chance to be matched in \(LG_W^{(j)}\), for some j, since every such subpattern occurs in an even position.

  5. An example of this proof can be found in Chapter 1, page 6, of the lecture notes of the course Fine-Grained Complexity Theory, taught by Karl Bringmann and Marvin Künnemann in 2019 at the Max Planck Institute for Informatics. The lecture notes are available online at https://www.mpi-inf.mpg.de/departments/algorithms-complexity/teaching/summer19/fine-complexity/.

REFERENCES

[1] Abboud Amir, Backurs Arturs, and Williams Virginia Vassilevska. 2015. Tight hardness results for LCS and other sequence similarity measures. In Proceedings of the IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS’15). IEEE, Los Alamitos, CA, 59–78.
[2] Akutsu Tatsuya. 1993. A linear time pattern matching algorithm between a string and a tree. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 684. Springer, 1–10.
[3] Alanko Jarno, D’Agostino Giovanna, Policriti Alberto, and Prezza Nicola. 2020. Regular languages meet prefix sorting. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA’20). 911–930.
[4] Alzamel Mai, Ayad Lorraine A. K., Bernardini Giulia, Grossi Roberto, Iliopoulos Costas S., Pisanti Nadia, Pissis Solon P., and Rosone Giovanna. 2018. Degenerate string comparison and applications. In Proceedings of the 18th International Workshop on Algorithms in Bioinformatics (WABI’18). Leibniz International Proceedings in Informatics, Vol. 113. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, Article 21, 14 pages.
[5] Amir Amihood, Lewenstein Moshe, and Lewenstein Noa. 1997. Pattern matching in hypertext. In Algorithms and Data Structures. Lecture Notes in Computer Science, Vol. 1272. Springer, 160–173.
[6] Amir Amihood, Lewenstein Moshe, and Lewenstein Noa. 2000. Pattern matching in hypertext. J. Algorithms 35, 1 (2000), 82–99.
[7] Angles Renzo and Gutierrez Claudio. 2008. Survey of graph database models. ACM Comput. Surv. 40, 1 (Feb. 2008), Article 1, 39 pages.
[8] Backurs Arturs and Indyk Piotr. 2015. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC’15). ACM, New York, NY, 51–58.
[9] Backurs Arturs and Indyk Piotr. 2016. Which regular expression patterns are hard to match? In Proceedings of the IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS’16). IEEE, Los Alamitos, CA, 457–466.
[10] Backurs Arturs and Indyk Piotr. 2018. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput. 47, 3 (2018), 1087–1097.
[11] Backurs Arturs and Tzamos Christos. 2017. Improving Viterbi is hard: Better runtimes imply faster clique algorithms. In Proceedings of the 34th International Conference on Machine Learning (ICML’17). Proceedings of Machine Learning Research, Vol. 70, 311–321. http://proceedings.mlr.press/v70/backurs17a.html.
[12] Bernardini Giulia, Gawrychowski Pawel, Pisanti Nadia, Pissis Solon P., and Rosone Giovanna. 2019. Even faster elastic-degenerate string matching via fast matrix multiplication. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP’19). Leibniz International Proceedings in Informatics, Vol. 132. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 21, 15 pages.
[13] Bringmann Karl and Künnemann Marvin. 2015. Quadratic conditional lower bounds for string problems and dynamic time warping. In Proceedings of the IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS’15). IEEE, Los Alamitos, CA, 79–97.
[14] Caceres Manuel. 2022. Parameterized algorithms for string matching to DAGs: Funnels and beyond. arXiv:2212.07870 (2022).
[15] The Computational Pan-Genomics Consortium. 2018. Computational pan-genomics: Status, promises and challenges. Brief Bioinformatics 19, 1 (2018), 118–135.
[16] Conte Alessio, Ferraro Gaspare, Grossi Roberto, Marino Andrea, Sadakane Kunihiko, and Uno Takeaki. 2018. Node similarity with q-grams for real-world labeled networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’18). ACM, New York, NY, 1282–1291.
[17] Cotumaccio Nicola. 2022. Graphs can be succinctly indexed for pattern matching in \(O(|E|^{2} + |V|^{5/2})\) time. In Proceedings of the Data Compression Conference (DCC’22). IEEE, Los Alamitos, CA, 272–281.
[18] Cotumaccio Nicola and Prezza Nicola. 2021. On indexing and compressing finite automata. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA’21). 2585–2599.
[19] Darbari Parisa, Gibney Daniel, and Thankachan Sharma V. 2022. Quantum time complexity and algorithms for pattern matching on labeled graphs. In String Processing and Information Retrieval. Lecture Notes in Computer Science, Vol. 13617. Springer, 303–314.
[20] Dondi Riccardo, Mauri Giancarlo, and Zoppis Italo. 2022. On the complexity of approximately matching a string to a directed graph. Information and Computation 288 (2022), 104748.
[21] Equi Massimo, Grossi Roberto, and Mäkinen Veli. 2019. On the complexity of exact pattern matching in graphs: Binary strings and bounded degree. arXiv e-prints, arXiv:1901.05264 [cs.CC] (2019).
[22] Equi Massimo, Grossi Roberto, Mäkinen Veli, and Tomescu Alexandru I. 2019. On the complexity of string matching for graphs. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP’19). Leibniz International Proceedings in Informatics, Vol. 132. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 55, 15 pages.
[23] Equi Massimo, Grossi Roberto, Tomescu Alexandru I., and Mäkinen Veli. 2019. On the complexity of exact pattern matching in graphs: Determinism and zig-zag matching. arXiv e-prints, arXiv:1902.03560 [cs.CC] (2019).
[24] Equi Massimo, Mäkinen Veli, and Tomescu Alexandru I. 2021. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. In SOFSEM 2021: Theory and Practice of Computer Science. Lecture Notes in Computer Science, Vol. 12607. Springer, 608–622.
[25] Equi Massimo, Norri Tuukka, Alanko Jarno, Cazaux Bastien, Tomescu Alexandru I., and Mäkinen Veli. 2022. Algorithms and complexity on indexing founder graphs. Algorithmica. Published online, July 28, 2022.
[26] Francis Nadime, Green Alastair, Guagliardo Paolo, Libkin Leonid, Lindaaker Tobias, Marsault Victor, Plantikow Stefan, Rydberg Mats, Selmer Petra, and Taylor Andrés. 2018. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD’18). 1433–1445.
[27] Gagie Travis, Manzini Giovanni, and Sirén Jouni. 2017. Wheeler graphs: A framework for BWT-based data structures. Theor. Comput. Sci. 698 (2017), 67–78.
[28] Garrison Erik, Sirén Jouni, Novak Adam M., Hickey Glenn, Eizenga Jordan M., Dawson Eric T., Jones William, et al. 2018. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36 (Aug. 2018), 875.
[29] Gibney Daniel. 2020. An efficient elastic-degenerate text index? Not likely. In String Processing and Information Retrieval. Lecture Notes in Computer Science, Vol. 12303. Springer, 76–88.
[30] Gibney Daniel, Hoppenworth Gary, and Thankachan Sharma V. 2021. Simple reductions from formula-SAT to pattern matching on labeled graphs and subtree isomorphism. In Proceedings of the 4th Symposium on Simplicity in Algorithms (SOSA’21). 232–242.
[31] Gibney Daniel and Thankachan Sharma V. 2019. On the hardness and inapproximability of recognizing wheeler graphs. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA’19). Leibniz International Proceedings in Informatics, Vol. 144. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article 51, 16 pages.
[32] Gibney Daniel, Thankachan Sharma V., and Aluru Srinivas. 2022. On the hardness of sequence alignment on de Bruijn graphs. J. Comput. Biol. 29, 12 (2022), 1377–1396.
[33] Hido Shohei and Kashima Hisashi. 2009. A linear-time graph kernel. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM’09). IEEE, Los Alamitos, CA, 179–188.
[34] Impagliazzo Russell and Paturi Ramamohan. 2001. On the complexity of k-SAT. J. Comput. Syst. Sci. 62, 2 (2001), 367–375.
[35] Jain Chirag, Zhang Haowen, Gao Yu, and Aluru Srinivas. 2019. On the complexity of sequence to graph alignment. In Research in Computational Molecular Biology, Cowen Lenore J. (Ed.). Springer International Publishing, Cham, Switzerland, 85–100.
[36] Knuth Donald E., Morris James H. Jr., and Pratt Vaughan R. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 2 (1977), 323–350.
[37] Limasset Antoine, Cazaux Bastien, Rivals Eric, and Peterlongo Pierre. 2016. Read mapping on de Bruijn graphs. BMC Bioinform. 17 (2016), 237.
[38] Manber Udi and Wu Sun. 1992. Approximate string matching with arbitrary costs for text and hypertext. In Advances in Structural and Syntactic Pattern Recognition. World Scientific, 22–33.
[39] Navarro Gonzalo. 2000. Improved approximate pattern matching on hypertext. Theor. Comput. Sci. 237, 1-2 (2000), 455–463.
[40] Park Kunsoo and Kim Dong Kyue. 1995. String matching in hypertext. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 937. Springer, 318–329.
[41] Potechin Aaron and Shallit Jeffrey. 2020. Lengths of words accepted by nondeterministic finite automata. Inform. Process. Lett. 162 (2020), 105993.
[42] Prezza Nicola. 2021. Subpath queries on compressed graphs: A survey. Algorithms 14, 1 (2021), 14.
[43] Prud’hommeaux Eric and Seaborne Andy. 2008. SPARQL Query Language for RDF. World Wide Web Consortium Recommendation REC-rdf-sparql-query-20080115. W3C.
  44. [44] Rabin M. O. and Scott D.. 1959. Finite automata and their decision problems. IBM J. Res. Dev. 3, 2 (April1959), 114125. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. [45] Rautiainen Mikko and Marschall Tobias. 2017. Aligning sequences to general graphs in \(O(V + mE)\) time. bioRxiv (2017). Google ScholarGoogle ScholarCross RefCross Ref
  46. [46] Rodriguez Marko A.. 2015. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages. 110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. [47] Schneeberger Korbinian, Hagmann Jörg, Ossowski Stephan, Warthmann Norman, Gesing Sandra, Kohlbacher Oliver, and Weigel Detlef. 2009. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10 (2009), R98. Google ScholarGoogle ScholarCross RefCross Ref
  48. [48] Shi Chuan, Li Yitong, Zhang Jiawei, Sun Yizhou, and Yu Philip S.. 2017. A survey of heterogeneous information network analysis. IEEE Trans. Knowl. Data Eng. 29, 1 (2017), 1737. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. [49] Sirén Jouni, Välimäki Niko, and Mäkinen Veli. 2014. Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans. Comput. Biol. Bioinform. 11, 2 (March2014), 375388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. [50] Thachuk Chris. 2013. Indexing hypertext. J. Discrete Algorithms 18 (2013), 113122. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. [51] Wehar Michael. 2016. On the Complexity of Intersection Non-Emptiness Problems. Ph.D. Dissertation. University at Buffalo, State University of New York. http://www.michaelwehar.com/documents/mwehar_dissertation.pdf.Google ScholarGoogle Scholar
  52. [52] Williams Ryan. 2005. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci. 348, 2 (2005), 357365. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. [53] Yang Jaewon and Leskovec Jure. 2015. Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 42, 1 (2015), 181213. Google ScholarGoogle ScholarDigital LibraryDigital Library
