1 Introduction

The planted clique problem, also referred to as hidden clique, is a problem of central importance in the design of algorithms. We introduce a variation of this problem where instead of planting the clique at random, an adversary plants the clique. Our main results are that in certain regimes of the parameters of the problem, the known polynomial time algorithms can be extended to work also in the adversarial settings, whereas for other regimes, the adversarial planting version becomes NP-hard. We find the results interesting for three reasons. One is that they concern an extensively studied problem (planted clique), but from a new direction, and we find that the results lead to a better understanding of what aspects of the planted clique problem are made use of by the known algorithms. Another is that extending the known algorithms (based on semidefinite programming) to the adversarial planted setting involves some new techniques regarding how semidefinite programming can be used and analysed. Finally, the NP-hardness results are interesting as they are proven in a semi-random model in which most of the input instance is random, and the adversary controls only a relatively small aspect of the input instance. One may hope that this brings us closer to proving NP-hardness results for purely random models, a task whose achievement would be a breakthrough in complexity theory.

1.1 The Random Planted Clique Model

Our starting point is the Erdos-Renyi \(G(n, p)\) random graph model, which generates graphs on n vertices in which every two vertices are connected by an edge independently with probability p. We start our discussion with the special case in which \(p = \frac{1}{2}\); other values of p will be considered later. Given a graph G, let \(\omega (G)\) denote the size of the maximum clique in G, and let \(\alpha (G)\) denote the size of the maximum independent set. Given a distribution D over graphs, we use the notation \(G \sim D\) to denote a graph sampled at random according to D. The (edge) complement of a graph \(G \sim G(n, \frac{1}{2})\) is by itself a graph sampled from \(G(n, \frac{1}{2})\), and the complement of a clique is an independent set; hence the discussion concerning cliques in \(G(n, \frac{1}{2})\) extends without change to independent sets (and vice versa).

It is well known (proved by computing the expectation and variance of the number of cliques of the appropriate size) that for \(G \sim G(n, \frac{1}{2})\), w.h.p. \(\omega (G) \simeq 2\log n\) (the logarithm is in base 2). However, there is no known polynomial time algorithm that can find cliques of size \(2\log n\) in such graphs. A polynomial time greedy algorithm can find a clique of size \((1 + o(1))\log n\). The existence of \(\rho > 1\) for which polynomial time algorithms can find cliques of size \(\rho \log n\) is a longstanding open problem.
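To make the greedy bound concrete, the following is a minimal sketch of such a greedy procedure (in Python; the adjacency-matrix representation and the function name are ours, for illustration). Each kept vertex roughly halves the pool of remaining candidates, which is why the clique reached has size about \(\log _2 n\).

```python
def greedy_clique(adj):
    """Grow a clique greedily: scan vertices in a fixed order and keep
    every vertex that is adjacent to all vertices kept so far.

    In G(n, 1/2), each kept vertex roughly halves the candidate pool,
    so the resulting clique has size about log2(n) with high probability.
    """
    clique = []
    for v in range(len(adj)):
        if all(adj[v][u] for u in clique):
            clique.append(v)
    return clique
```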

In the classical planted clique problem, one starts with a graph \(G' \sim G(n, \frac{1}{2})\) and a parameter k. In \(G'\) one chooses at random a set K of k vertices, and makes this set into a clique by inserting all missing edges between pairs of vertices in K. We refer to K as the planted clique, and say that the resulting graph G is distributed according to \(G(n, \frac{1}{2}, k)\). Given \(G \sim G(n, \frac{1}{2}, k)\), the algorithmic goal can be one of the following three: find K, find a clique of maximum size, or find any clique of size at least k. It is not difficult to show that when k is sufficiently large (say, \(k > 3\log n\)), then with high probability K is the unique maximum size clique in \(G \sim G(n, \frac{1}{2}, k)\), and hence all three goals coincide. Hence in the planted clique problem, the goal is simply to design polynomial time algorithms that (with high probability over the choice of \(G \sim G(n, \frac{1}{2}, k)\)) find the planted clique K. The question is how large should k be (as a function of n) so as to make this task feasible.
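For concreteness, here is a sketch of sampling from \(G(n, \frac{1}{2}, k)\) (Python; names are ours, for illustration): draw \(G' \sim G(n, p)\), pick a uniformly random k-subset, and insert all its missing edges.

```python
import itertools
import random

def random_planted_clique(n, p, k, seed=0):
    """Sample G ~ G(n, p, k): draw G(n, p), then plant a clique on a random k-set."""
    rng = random.Random(seed)
    adj = [[False] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        adj[i][j] = adj[j][i] = (rng.random() < p)
    K = rng.sample(range(n), k)
    for i, j in itertools.combinations(K, 2):  # insert all missing edges inside K
        adj[i][j] = adj[j][i] = True
    return adj, set(K)
```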

For some sufficiently large constant \(c > 0\) (throughout, we use c to denote a sufficiently large constant), if \(k > c\sqrt{n \log n}\), with high probability the vertices of K are simply the k vertices of highest degree in G (see [15]), and hence K can easily be recovered. Alon, Krivelevich and Sudakov [1] managed to shave off the \(\sqrt{\log n}\) factor, designing a spectral algorithm that recovers K when \(k > c\sqrt{n}\). They also showed that c can be made an arbitrarily small constant, by increasing the running time by a factor of \(n^{O(\log (\frac{1}{c}))}\) (this is done by “guessing" a set \(K'\) of \(O(\log (\frac{1}{c}))\) vertices of K, and finding the maximum clique in the subgraph induced on their common neighbors). Subsequently, additional algorithms were developed that find the planted clique when \(k > c\sqrt{n}\). They include algorithms based on the Lovasz theta function, which is a form of semi-definite programming [8], algorithms based on a “reverse-greedy" principle [5, 11], and message passing algorithms [6]. There have been many attempts to find polynomial time algorithms that succeed when \(k = o(\sqrt{n})\), but so far all of them have failed (see for example [9, 13, 18]). It is a major open problem whether there is any such polynomial time algorithm.
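The first of these algorithms is simple enough to state as code. A minimal sketch (assuming the adjacency-matrix representation used above): for \(k > c\sqrt{n\log n}\), outputting the k highest-degree vertices recovers K with high probability [15].

```python
def top_degree_recovery(adj, k):
    """Return the k vertices of highest degree; for k > c*sqrt(n log n)
    these are exactly the planted clique with high probability [15]."""
    by_degree = sorted(range(len(adj)), key=lambda v: sum(adj[v]), reverse=True)
    return set(by_degree[:k])
```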

Planted clique when \(p \not = \frac{1}{2}\) was not studied as extensively, but it is quite well understood how results from the \(G(n, \frac{1}{2}, k)\) model transfer to the \(G(n, p, k)\) model. For p much smaller than \(\frac{1}{2}\), say \(p = n^{\delta -1}\) for some \(0< \delta < 1\) (hence average degree \(n^{\delta }\)), the problem changes completely. Even without planting, with high probability over the choice of \(G \sim G(n, p)\) (with \(p = n^{\delta -1}\)) we have that \(\omega (G) = O(\frac{1}{1 - \delta })\), and the maximum clique can be found in polynomial time. This also extends to finding maximum cliques in the planted setting, regardless of the value of k. (We are not aware of such results being previously published, but they are not difficult. See Section 3.) For \(p > \frac{1}{2}\), it is more convenient to instead look at the equivalent problem in which \(p < \frac{1}{2}\), but with the goal of finding a planted independent set instead of a planted clique. We refer to this model as \(\bar{G}(n, p, k)\). For \(G \sim G(n, p)\) (with \(p = n^{\delta - 1}\)) we have that with high probability \(\alpha (G) = \Theta (n^{1 - \delta }\log n)\). For \(G \sim \bar{G}(n, p, k)\) the known algorithms extend to finding planted independent sets of size \(k = cn^{1 - \frac{\delta }{2}}\) in polynomial time. We remark that the approach of [1] of making c arbitrarily small does not work for such sparse graphs.

1.2 The Adversarial Planted Clique Model

In this paper we introduce a variation on the planted clique model (and planted independent set model) that we refer to as the adversarial planted clique model. As in the random planted clique model, we start with a graph \(G' \sim G(n, p)\) and a parameter k. However, now a computationally unbounded adversary may inspect \(G'\), select within it a subset K of k vertices of its choice, and make this set into a clique by inserting all missing edges between pairs of vertices in K. We refer to this model as \(AG(n, p, k)\) (and the corresponding model for planted independent set as \(A\bar{G}(n, p, k)\)). As shorthand notation, we shall use \(G \sim AG(n, p, k)\) to denote a graph generated by this process. Let us clarify that \(AG(n, p, k)\) is not a distribution over graphs, but rather a family of distributions, where each adversarial strategy (a strategy of an adversary is a mapping from \(G'\) to a choice of K) gives rise to a different distribution.

In the adversarial planted model, it is no longer true that the planted clique is the one of maximum size in the resulting graph G. Moreover, finding K itself may be information theoretically impossible, as K might be statistically indistinguishable from some other clique of size k (that differs from K by a small number of vertices). The three goals, that of finding K, finding a clique of maximum size, or finding any clique of size at least k, are no longer equivalent. Consequently, for our algorithmic results we shall aim at the more demanding goal of finding a clique of maximum size, whereas for our hardness results, we shall want them to hold even for the less demanding goal of finding an arbitrary clique of size k.

1.3 Our Results

Our results cover a wide range of values of \(0< p < 1\), where p may be a function of n. For simplicity of the presentation and to convey the main insights of our results, we present here the results for three representative regimes: \(p = \frac{1}{2}\), \(p = n^{\delta - 1}\) for \(0< \delta < 1\), and \(p = 1 - n^{\delta - 1}\). For the latter regime, it will be more convenient to replace it by the equivalent problem of finding adversarially planted independent sets when \(p = n^{\delta - 1}\).

The term almost surely denotes a probability that tends to 1 as n grows. The term extremely high probability denotes a probability of the form \(1 - e^{-n^r}\) for some \(r > 0\). By \(\exp (x)\) for some expression x we mean \(e^x\).

Informally, our results show the following phenomenon. We consider only the case that \(p \le \frac{1}{2}\), but consider both the planted clique and the planted independent set problems, and hence the results can be translated to \(p > \frac{1}{2}\) as well. For clique, we show (Theorems 1.1 and 1.2) how to extend the algorithmic results known for the random planted clique setting to the adversarial planted clique setting. However, for independent set, we show that this is no longer possible. Specifically, when p is sufficiently small, we prove (Theorem 1.3) that finding an independent set of size k (any independent set, not necessarily the planted one) in the adversarial planted independent set setting is NP-hard. Moreover, the NP-hardness result holds even for large values of k for which finding a random planted independent set is trivial.

In statements of our theorems, we refer to probabilities with which an algorithm finds a maximum clique in an input graph \(G \sim AG(n, p, k)\). The probability of success is taken over the choice of \(G' \sim G(n, p)\), the random graph in which the adversary later plants a clique of size k. For positive results, success needs to hold for every adversarial planting strategy, namely, for every possible choice of k vertices as the planted clique K. For negative results, it suffices that there is one adversarial planting strategy that causes the algorithm to fail. If this adversarial strategy is randomized, then the probability of failure is taken also over the random choices of the adversary. (This explanation holds in an analogous way also for the planted independent set problem.)

Theorem 1.1

For every fixed \(\varepsilon > 0\) and for every \(k \ge \varepsilon \sqrt{n}\), there is an explicitly described algorithm running in time \(n^{O(\log (\frac{1}{\varepsilon }))}\) which almost surely finds the maximum clique in a graph \(G \sim AG(n, \frac{1}{2}, k)\).

Theorem 1.2

Let \(p = n^{\delta -1}\) for \(0< \delta < 1\). Then for every k, there is an explicitly described algorithm running in time \(n^{O(\frac{1}{1-\delta })}\) which almost surely finds the maximum clique in a graph \(G \sim AG(n, p, k)\).

Theorem 1.3

For \(0< \gamma < 1\), \(0< \delta < 1\), \(p = n^{\delta -1}\) and \(cn^{1 - \delta }\log n \le k \le \frac{2}{3}n\), where c is a sufficiently large constant, the following holds. There is no polynomial time algorithm that has probability at least \(\gamma \) of finding an independent set of size k in \(G \sim A\bar{G}(n, p, k)\), unless NP has randomized polynomial time algorithms (NP=RP).

The constant \(\frac{2}{3}\) in Theorem 1.3 was chosen for concreteness; any other constant smaller than 1 would work as well.

1.4 Related Work

Some related work was already mentioned in Section 1.1.

Our algorithm for Theorem 1.1 is based on an adaptation of the algorithm of [8] that applied to the random planted clique setting. In turn, that algorithm is based on the theta function of Lovasz [16].

A work that is closely related to ours and served as an inspiration both to the model that we study, and to the techniques that are used in the proof of the NP-hardness result (Theorem 1.3) is the work of David and Feige [4] on adversarially planted 3-colorings. That work uncovers a phenomenon similar to the one displayed in the current work. Specifically, for the problem of 3-coloring (rather than clique or independent set) it shows that for certain values of p, algorithms that work in the random planted setting can be extended to the adversarial planted setting, and for other values of p, finding a 3-coloring in the adversarial planted setting becomes NP-hard. However, there are large gaps left open in the picture that emerges from the work of [4]. For large ranges of the values of p, specifically, \(n^{-1/2}< p < n^{-1/3}\) and \(p < n^{-2/3}\), there are neither algorithmic results nor hardness results in the work of [4]. Unfortunately, the most interesting values of p for the 3-coloring problem, which are \(p \le \frac{c\log n}{n}\), lie within these gaps, and hence the results of [4] do not apply to them. Our work addresses a different problem (planted clique instead of planted 3-coloring), and for our problem, our analysis leaves almost no such gaps. We are able to determine for which values of p the problem is polynomial time solvable, and for which values it is NP-hard. See Section 5 for more details.

Our model is an example of a semi-random model, in which part of the input is determined at random and part is determined by an adversary. There are many other semi-random models, both for the clique problem and for other problems. Describing all these models is beyond the scope of this paper, and the interested reader is referred to [7] and references therein for additional information.

2 Finding Cliques Using the Theta Function

In this section we prove Theorem 1.1. We start with an overview of the proof, and then provide more details in Sections 2.1 and 2.2.

Our algorithm is an adaptation of the algorithm of [8] that finds the maximum clique in the random planted model. We shall first review that algorithm, then describe why it does not apply in our setting in which an adversary plants the clique, and finally explain how we modify that algorithm and its analysis so as to apply it in the adversarial planted setting.

The key ingredient in the algorithm of [8] is the theta function of Lovasz, denoted by \(\vartheta \). Given a graph G, \(\vartheta (G)\) can be computed in polynomial time (up to arbitrary precision, using semidefinite programming (SDP)), and satisfies \(\vartheta (G) \ge \alpha (G)\). As we are interested here in cliques and not in independent sets, we shall consider \(\bar{G}\), the edge complement of G, and then \(\vartheta (\bar{G}) \ge \omega (G)\). The theta function has several equivalent definitions, and the one that we shall use here (referred to as \(\vartheta _4\) in [16]) is the following.

Given a graph \(G = G(V, E)\), a collection of unit vectors \(s_i \in \mathbb {R}^n\) (one vector for every vertex \(i \in V\)) is an orthonormal representation of G, if \(s_i\) and \(s_j\) are orthogonal (\(s_i \cdot s_j = 0\)) whenever \((i, j) \in E\). The theta function is the maximum value of the following expression, where maximization is over all orthonormal representations \(\{s_i\}\) of G and over all unit vectors h (h is referred to as the handle):

$$\begin{aligned} \vartheta (G) = \max _{h,\{s_i\}} \sum _{i\in V} (h \cdot s_i)^2 \end{aligned}$$
(1)

The optimal orthonormal representation and the associated handle that maximize the above formulation for \(\vartheta \) can be found (up to arbitrary precision) in polynomial time by formulating the problem as an SDP (details omitted). Observe that for any independent set S the following is a feasible solution for the SDP: choose \(s_i = h\) for all \(i \in S\), and choose all remaining vectors \(s_j\) for \(j \not \in S\) to be orthogonal to h and to each other. Consequently, \(\vartheta (G) \ge \alpha (G)\), as claimed.
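For illustration, \(\vartheta \) can also be computed (up to numerical precision) from one of its standard equivalent SDP formulations, \(\vartheta (G) = \max \{\langle J, X\rangle : \mathrm{tr}(X) = 1,\ X_{ij} = 0 \text { for } (i, j)\in E,\ X \succeq 0\}\), where J is the all-ones matrix. The sketch below assumes the cvxpy package is available and is meant only as a small illustration; applying it to the complement \(\bar{G}\) yields the upper bound \(\vartheta (\bar{G}) \ge \omega (G)\) used throughout this section.

```python
import cvxpy as cp

def lovasz_theta(adj):
    """Compute theta(G) via the standard SDP: maximize <J, X> subject to
    tr(X) = 1, X_ij = 0 on edges of G, and X positive semidefinite."""
    n = len(adj)
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.trace(X) == 1]
    constraints += [X[i, j] == 0
                    for i in range(n) for j in range(i + 1, n) if adj[i][j]]
    problem = cp.Problem(cp.Maximize(cp.sum(X)), constraints)
    problem.solve()
    return problem.value

# theta of the complement upper-bounds the clique number: theta(G-bar) >= omega(G)
```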

The main content of the algorithm of [8] is summarized in the following theorem. We phrased it in a way that addresses cliques rather than independent sets, implicitly using \(\alpha (\bar{G}) = \omega (G)\). We also remind the reader that in the random planted model, the planted clique K is almost surely the unique maximum clique.

Theorem 2.1

(Results of [8]) Consider \(G \sim G(n, \frac{1}{2}, k)\), a graph selected in the random planted clique model, with \(k \ge c\sqrt{n}\) for some sufficiently large constant c. Then with extremely high probability (over choice of G) it holds that \(\vartheta (\bar{G}) = \omega (G)\).

Moreover, for every vertex i that belongs to the planted clique K, the corresponding vector \(s_i\) has inner product larger than \(1 - \frac{1}{n}\) with the handle h, and for every other vertex, the corresponding inner product is at most \(\frac{1}{n}\).

Given Theorem 2.1, the following algorithm finds the planted clique when \(G \sim G(n, \frac{1}{2}, k)\), and \(k \ge c\sqrt{n}\) for some sufficiently large constant c. Solve the optimization problem (1) (on \(\bar{G}\)) to sufficiently high precision, and output all vertices whose corresponding inner product with h is at least \(\frac{1}{2}\).

The algorithm above does not apply to \(G \sim AG(n, \frac{1}{2}, k)\), a graph selected in the adversarial planted clique model, for the simple reason that Theorem 2.1 is incorrect in that model. The following example illustrates what might go wrong.

Example 2.1

Consider a graph \(G' \sim G(n, \frac{1}{2})\). In \(G'\), first select a random vertex set T of size slightly smaller than \(\frac{1}{2}\log n\). Observe that the number of vertices in \(G'\) that are in the common neighborhood of all vertices of T is roughly \(2^{-|T|}n > \sqrt{n}\). Plant a clique K of size k in the common neighborhood of T. In this construction, K is no longer the largest clique in G. This is because the subgraph induced on T (being a random graph) is expected to have a clique \(K'\) of size \(2\log |T| \simeq 2\log \log n\), and \(K' \cup K\) forms a clique of size roughly \(k + 2\log \log n\) in G. Moreover, as the subgraph induced on T is itself a random graph with edge probability \(\frac{1}{2}\), the value of the theta function on it is roughly \(\sqrt{|T|}\) (see [14]), and consequently one would expect the value of \(\vartheta (\bar{G})\) to be roughly \(k + \sqrt{\log n}\).
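A sketch of this adversarial strategy in code (Python; the exact size of T and all names are ours, for illustration):

```python
import itertools
import math
import random

def adversary_example(adj, k, seed=0):
    """Example 2.1 as code: plant the k-clique inside the common neighborhood
    of a random set T of size slightly below (1/2)*log2(n), so that cliques
    of T extend the planted clique."""
    rng = random.Random(seed)
    n = len(adj)
    t = max(1, int(0.5 * math.log2(n)) - 1)
    T = rng.sample(range(n), t)
    common = [v for v in range(n) if v not in T and all(adj[v][u] for u in T)]
    K = rng.sample(common, k)  # whp |common| is about n / 2^t > sqrt(n) >= k
    for i, j in itertools.combinations(K, 2):
        adj[i][j] = adj[j][i] = True
    return set(K)
```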

Summarizing, it is not difficult to come up with strategies for planting cliques of size k that result in the maximum clique having size strictly larger than k, and the value of \(\vartheta (\bar{G})\) being even larger. Consequently, the solution of the optimization problem (1) by itself is not expected to correspond to the maximum clique in G.

We now explain how we overcome the above difficulty. A relatively simple, yet important, observation is the following.

Proposition 2.1

Let \(G \sim AG(n, p, k)\) with \(p = 1/2\) and \(k > \sqrt{n}\), and let \(K'\) be the maximum clique in G (which may differ from the planted clique K). Then with extremely high probability over the choice of \(G' \sim G(n, \frac{1}{2})\), for every possible choice of k vertices by the adversary, \(K'\) contains at least \(k - O(\log n)\) vertices from K, and at most \(O(\log n)\) additional vertices.

Proof

(Sketch) Standard probabilistic arguments show that with extremely high probability, the largest clique in \(G'\) (prior to planting a clique of size k) is of size at most \(\frac{k}{2}\). When this holds, \(K'\) contains at least \(\frac{k}{2}\) vertices from K. Each of the remaining vertices of \(K'\) needs to be connected to all vertices in \(K' \cap K\). Consequently, with extremely high probability, \(K'\) contains at most \(2\log n\) vertices not from K. This is because a \(G' \sim G(n,\frac{1}{2})\) graph, with extremely high probability, does not contain two sets of vertices A and B, with \(|A| = 2\log n\), \(|B| = \Omega (\sqrt{n})\), such that all pairs of vertices in \(A \times B\) induce edges in \(G'\).

As \(|K'| \ge k\), we further conclude that all but \(O(\log n)\) vertices of K must be members of \(K'\).\(\square \)
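For completeness, the union-bound calculation behind the last step is the standard one: with \(|A| = a = 2\log n\) and \(|B| = b = \Omega (\sqrt{n})\), the probability that some such fully connected pair A, B exists is at most

$$\begin{aligned} \left( {\begin{array}{c}n\\ a\end{array}}\right) \left( {\begin{array}{c}n\\ b\end{array}}\right) 2^{-ab} \le 2^{(a + b)\log n - ab} = 2^{-(1 - o(1))b\log n} = 2^{-\Omega (\sqrt{n}\log n)}, \end{aligned}$$

since \(ab = 2b\log n\) while \((a + b)\log n = (1 + o(1))b\log n\).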

A key theorem that we prove is:

Theorem 2.2

Let \(G \sim AG(n, p, k)\) with \(p = 1/2\) and \(k = k(n) \ge 10\sqrt{n}\). Then \(k \le \vartheta (\bar{G}) \le k + O(\log n)\) with extremely high probability over the choice of \(G' \sim G(n, \frac{1}{2})\), for every possible choice of k vertices by the adversary.

We now explain how Theorem 2.2 is proved. The bound \(\vartheta (\bar{G}) \ge k\) was already explained above. Hence it remains to show that \(\vartheta (\bar{G}) \le k + O(\log n)\). In general, to bound \(\vartheta (G)\) from above for a graph G(V, E), one considers the following dual formulation of \(\vartheta \), as a minimization problem.

$$\begin{aligned} \vartheta (G) = \min _M[\lambda _1(M)] \end{aligned}$$
(2)

Here M ranges over all n by n symmetric matrices in which \(M_{ij} = 1\) whenever \((i, j) \not \in E\), and \(\lambda _1(M)\) denotes the largest eigenvalue of M. (Observe that if G has an independent set S of size k, then M contains a k by k block of 1 entries. A Rayleigh quotient argument then implies that \(\lambda _1(M) \ge k\), thus verifying the inequality \(\vartheta (G) \ge \alpha (G)\).) To prove Theorem 2.2 we exhibit a matrix M as above (for the graph \(\bar{G}\)) for which we prove that \(\lambda _1(M) \le k + O(\log n)\).
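To spell out the Rayleigh quotient step: taking \(x_S \in \{0, 1\}^n\) to be the indicator vector of S, every entry of M in the \(S \times S\) block equals 1, so \(\lambda _1(M) \ge \frac{x_S^{\top } M x_S}{x_S^{\top } x_S} = \frac{k^2}{k} = k\).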

We first review how a matrix M was chosen by [8] in the proof of Theorem 2.1. First, recall that we consider \(\bar{G}\), and let E be the set of edges of \(\bar{G}\) (non-edges of G). We need to associate values with the entries \(M_{ij}\) for \((i, j)\in E\) (as other entries are 1). The matrix block corresponding to the planted clique K (planted independent set in \(\bar{G}\)) is all 1 (by necessity). For every \((i, j)\in E\) where both vertices are not in K one sets \(M_{ij} = -1\). For every pair \((i, j)\in E\) with \(i \in K\) and \(j \not \in K\), one sets \(M_{i,j} = M_{j,i} = -\frac{k - d_{j,K}}{d_{j,K}}\), where \(d_{j,K}\) is the number of neighbors that vertex j has in the set K (in \(\bar{G}\)). In order to show that \(\lambda _1(M) = k\), one first observes that the vector \(x_K\) (with value 1 at entries that correspond to vertices of K, and value 0 elsewhere) is an eigenvector of M with eigenvalue k. Then one proves that \(\lambda _2(M)\), the second largest eigenvalue of M, has value smaller than k. This is done by decomposing M into a sum of several matrices, bounding the second largest eigenvalue for one of these matrices, and the largest eigenvalue for the other matrices. By Weyl’s inequality, the sum of these eigenvalues is an upper bound on \(\lambda _2(M)\). This upper bound is not tight, but it does show that \(\lambda _2(M) < k\). It follows that the eigenvalue k associated with \(x_K\) is indeed \(\lambda _1(M)\). Further details are omitted.

We now explain how to choose a matrix M so as to prove the bound \(\vartheta (\bar{G}) \le k + O(\log n)\) in Theorem 2.2. Recall (see Example 2.1) that we might be in a situation in which \(\vartheta (\bar{G})> \alpha (\bar{G}) > k\) (with all inequalities being strict). In this case, let \(K'\) denote the largest independent set in \(\bar{G}\), and note that \(K'\) is larger than K. In M, the matrix block corresponding to \(K'\) is all 1. One may attempt to complete the construction of M as described above for the random planting case, by replacing K with \(K'\) everywhere in that construction. If one does so, the vector \(x_{K'}\) (with value 1 at entries that correspond to vertices of \(K'\), and value 0 elsewhere) is an eigenvector of M with eigenvalue \(\alpha (\bar{G}) > k\). However, M would necessarily have another eigenvector with a larger eigenvalue, because \(\vartheta (\bar{G}) > \alpha (\bar{G})\). Hence we are still left with the problem of bounding \(\lambda _1(M)\), rather than bounding \(\lambda _2(M)\). Having failed to identify an eigenvector for \(\lambda _1(M)\), we may still obtain an upper bound on \(\lambda _1(M)\) by using approaches based on Weyl’s inequality (or other approaches). However, these upper bounds are not tight, and it seems difficult to limit the error that they introduce to be as small as \(O(\log n)\), which is needed for proving the inequality \(\lambda _1(M) \le k + O(\log n)\).

For the above reason, we choose M differently. For some constant \(\frac{1}{2}< \rho < 1\), we extend the clique K to a possibly larger clique Q, by adding to it every vertex that has at least \(\rho k\) neighbors in K. (In Example 2.1, the corresponding clique Q will include all vertices of \(K \cup T\). In contrast, if K is planted at random and not adversarially, then we will simply have \(Q = K\).) Importantly, we prove (see Corollary 2.2) that if \(G' \sim G(n, \frac{1}{2})\), then with high probability \(|Q| < k + O(\log n)\) (for every possible choice of planting a clique of size k by the adversary). For the resulting graph \(G_Q\), we choose the corresponding matrix M in the same way as it was chosen for the random planting case. Now we do manage to show that the eigenvector \(x_Q\) (with eigenvalue |Q|) associated with this M indeed has the largest eigenvalue. This part is highly technical, and significantly more difficult than the corresponding proof for the random planting case. The reason for the added level of difficulty is that, unlike the random planting case in which we are dealing with only one random graph, here the adversary can plant the clique in any one of \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \) locations, and our analysis needs to hold simultaneously for all \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \) graphs that may result from such plantings. Further details can be found in Section 2.1.
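The extension step itself is elementary. A minimal sketch (Python; representation and names are ours, for illustration):

```python
def extend_clique(adj, K, rho):
    """Extend K to Q: add every outside vertex that has at least rho*|K|
    neighbors in K, for a constant rho in (1/2, 1)."""
    threshold = rho * len(K)
    Q = set(K)
    for v in range(len(adj)):
        if v not in Q and sum(adj[v][u] for u in K) >= threshold:
            Q.add(v)
    return Q
```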

Having established that \(\vartheta (\bar{G}_Q) = |Q| \le k + O(\log n)\), we use monotonicity of the theta function to conclude that \(\vartheta (\bar{G}) \le k + O(\log n)\). This concludes our overview for the proof of Theorem 2.2.

Given Theorem 2.2, let us now explain our algorithm for finding a maximum clique in \(G \sim AG(n, \frac{1}{2}, k)\).

Given a graph \(G \sim AG(n, \frac{1}{2}, k)\), the first step in our algorithm is to solve the optimization problem (1) on the complement graph \(\bar{G}\). By Theorem 2.2, we will have \(\vartheta (\bar{G}) \le k + c\log n\) for some constant \(c > 0\). Let \(\{s_i\}\) denote the orthonormal representation found by our solution, and let h be the corresponding handle.

The second step of our algorithm is to extract from G a set of vertices that we shall refer to as H, containing all those vertices i for which \((h \cdot s_i)^2 \ge \frac{3}{4}\).

Lemma 2.1

For H as defined above, with extremely high probability, at least \(k - O(\log n)\) vertices of K are in H, and at most \(O(\log n)\) vertices not from K are in H.

Proof

Let T denote the set of those vertices in K for which \((h \cdot s_i)^2 < \frac{3}{4}\). Remove T from G, thus obtaining the graph \(G_T\). We have that \(\vartheta (\bar{G}_T) \ge \vartheta (\bar{G}) - \sum _{i\in T} (h\cdot s_i)^2 \ge k - \frac{3}{4}|T|\).

The exact same graph \(G_T\) could also be generated by switching the order of operations. Given \(G' \sim G(n, \frac{1}{2})\), first the adversary removes T from \(G'\), thus obtaining a graph \(G'_{T}\) which is a subgraph of the random graph \(G'\). Then the adversary plants the clique \(K \setminus T\) of size \(k - |T|\) in \(G'_{T}\), so as to obtain \(G_T\). By Theorem 2.2 we have that \(\vartheta (\bar{G}_T) \le k - |T| + O(\log n)\), with extremely high probability. This last claim is based on applying Theorem 2.2 to each of the \(\left( {\begin{array}{c}n\\ |T|\end{array}}\right) \) possible choices of T, and taking a union bound over their probabilities of failure.

By comparing the lower bound on \(\vartheta (\bar{G}_T)\) with the upper bound, we get that \(k - \frac{3}{4}|T| \le k - |T| + O(\log n)\), implying that \(|T| \le O(\log n)\).

Having established that T is small, let R be the set of vertices not in K for which \((h \cdot s_i)^2 \ge \frac{3}{4}\). We claim that every such vertex \(i\in R\) is a neighbor of every vertex \(j\in K \setminus T\). This is because in the orthonormal representation (for \(\bar{G}\)), if i and j are not neighbors we have that \(s_i \cdot s_j = 0\), and then the fact that \(s_i, s_j\) and h are unit vectors implies that \((h\cdot s_i)^2 \le 1 - (h\cdot s_j)^2 \le \frac{1}{4}\). Given this claim, and using the fact that \(|K \setminus T| > \sqrt{n}\), it follows that \(|R| \le 2\log n\). This is because a \(G' \sim G(n, \frac{1}{2})\) graph, with extremely high probability, does not contain two sets of vertices A and B, with \(|A| = 2\log n\), \(|B| = \sqrt{n}\), such that all pairs of vertices in \(A \times B\) induce edges in \(G'\).\(\square \)

The third step of our algorithm constructs a set F that contains all those vertices that have at least \(\frac{3k}{4}\) neighbors in H.

Lemma 2.2

With extremely high probability, the set F described above contains the maximum clique in G, and at most \(O(\log n)\) additional vertices.

Proof

(Sketch) We may assume that H satisfies the properties of Lemma 2.1. Proposition 2.1 then implies that with extremely high probability, every vertex of the maximum clique in G has at least \(\frac{3k}{4}\) neighbors in H, and hence is contained in F. A probabilistic argument (similar to the end of the proof of Lemma 2.1) establishes that F has at most \(O(\log n)\) vertices not from K. As K itself has at most \(O(\log n)\) vertices not from the maximum clique (by Proposition 2.1), the total number of vertices in F that are not members of the maximum clique is at most \(O(\log n)\).\(\square \)

Finally, in the last step of our algorithm we find a maximum clique in F, and this is a maximum clique in G. This last step can be performed in polynomial time by a standard algorithm (used for example to show that vertex cover is fixed parameter tractable). For every non-edge in the subgraph induced on F, at least one of its end-vertices needs to be removed. Try both possibilities in parallel, and recurse on each subgraph that remains. The recursion terminates when the graph is a clique. The shortest branch in the recursion gives the maximum clique. As only \(O(\log n)\) vertices need to be removed in order to obtain a clique, the depth of the recursion is at most \(O(\log n)\), consequently the running time (which is exponential in the depth) is polynomial in n.
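A sketch of this branching step (Python; adj and the candidate set F are as computed in the previous steps, names ours):

```python
import itertools

def max_clique_in(adj, F):
    """Vertex-cover-style branching: find a maximum clique in the subgraph
    induced on F. Depth is the number of deleted vertices, O(log n) here."""
    for u, v in itertools.combinations(F, 2):
        if not adj[u][v]:
            # (u, v) is a non-edge: any clique avoids u or v; try both branches.
            a = max_clique_in(adj, [w for w in F if w != u])
            b = max_clique_in(adj, [w for w in F if w != v])
            return a if len(a) >= len(b) else b
    return list(F)  # no non-edge left: F induces a clique
```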

This completes our overview of our algorithm for finding a clique in \(G \sim AG(n, \frac{1}{2}, k)\) when \(k > c\sqrt{n}\) for a sufficiently large constant \(c > 0\). To complete the proof of Theorem 1.1 we need to also address the case that \(k > \varepsilon \sqrt{n}\) for an arbitrarily small constant \(\varepsilon \). This we do (as in [1]) by guessing \(t \simeq 2\log \frac{c}{\varepsilon }\) vertices from K (try all \(n^{t}\) possibilities), and considering the subgraph of G induced on their common neighbors. This subgraph corresponds to a subgraph of \(G' \sim G(n, \frac{1}{2})\) with roughly \(n'\simeq 2^{-t} n\) vertices, and a planted clique of size \(\varepsilon \sqrt{n} - t \simeq c\sqrt{n'}\). Now on this new graph \(G''\) we can invoke the algorithm based on the theta function. The proof that \(\vartheta (\bar{G}'') \le k + O(\log n)\) uses the fact that Theorem 2.2 holds with extremely high probability (see more details in Section 2.2).

The many details that were omitted from the above overview of the proof of Theorem 1.1 can be found in the upcoming sections. Specifically, in Section 2.1 we present the proof of Theorem 2.2, generalized to values of p other than 1/2, and \(k \ge c\sqrt{np}\). In Section 2.2 we present the proof of Theorem 1.1, first addressing the case that c is sufficiently large, and then extending the results to the case that c can be arbitrarily small.

2.1 Bounding the Theta Function

In this section we will prove Theorem 2.2, which is stated here again for convenience, with slightly more details.

Theorem 2.3

(Theorem 2.2 restated) Let \(G \sim AG(n, p, k)\) with \(p = 1/2\) and \(k = k(n) \ge 10\sqrt{n}\). Then \(k \le \vartheta (\bar{G}) \le k + 96\log n\) with probability at least \(1 - \exp (-2k\log n)\), for every possible choice of k vertices by the adversary.

Instead of proving exactly this theorem, we will prove a generalization to other values of p. Let \(c \in (0, 1)\) be an arbitrary constant.

Theorem 2.4

Consider an arbitrary function w(n), such that \(n^{2/3} \ll w(n) \le cn\). Let \(G \sim AG(n, p, k)\), where \(p = w(n)/n\) and \(k = Cw(n)^{1/2}\) for constant \(C \ge \frac{5}{1 - p}\). Let \(a(n, p) := \frac{48}{(1 - p)^2}p\log n\). Then \(k \le \vartheta (\bar{G}) \le k + a(n, p)\), for every possible choice of k vertices by the adversary, with probability at least \(1 - \exp (-2k\log n)\).

This theorem has an important corollary, which follows from the Lipschitz property of the Lovasz theta function [16].

Corollary 2.1

Let p and k be as in Theorem 2.4, and let \(K \subset V\) be the vertices belonging to the planted clique of \(G\sim AG(n, p, k)\). Then, with probability at least \(1 - \exp (-2k\log n)\),

  (i)

    for every subset \(T\subset K\),

    $$\begin{aligned} k - |T| \le \vartheta (\overline{G\setminus T}) \le k - |T| + a(n, p), \end{aligned}$$

    where \(G\setminus T\) denotes the graph G with vertices from T deleted;

  (ii)

    for every subset \(S\subset V\setminus K\), if we “add” S to the planted clique by drawing all edges between S and \(S\cup K\), for the resulting graph \(G_S\)

    $$\begin{aligned} k + |S| \le \vartheta (\overline{G_S}) \le k + |S| + a(n, p). \end{aligned}$$

We now prove Theorem 2.4. For \(G \sim AG(n, p, k)\), its complement graph \(\bar{G}\) contains a planted independent set of size k, so \(\vartheta (\bar{G}) \ge \alpha (\bar{G}) \ge k\). It remains to prove the upper bound. We will use the formulation of the theta function as an eigenvalue minimization problem:

$$\begin{aligned} \vartheta (G) = \min _M[\lambda _1(M)] \end{aligned}$$
(3)

Here M ranges over all n by n symmetric matrices in which \(M_{ij} = 1\) whenever \((i, j) \not \in E\), and \(\lambda _1(M)\) denotes the largest eigenvalue of M.

The following proposition will be used in the proof of Theorem 2.4.

Proposition 2.2

Let k and p be as in Theorem 2.4. Let \(G' \sim G(n, p)\), \(G' = (V, E)\). Let \(\mu p < \nu \le \mu \in (0, 1]\) be arbitrary constants. For any \(t \ge 0\), for every set \(Q\subset V\) of size \((\mu + o(1))k\), there are at most \(g(n, p, \mu , \nu , t) := \frac{6\mu ^2}{(\nu - p\mu )^2}p\log n + t\) vertices from \(V\setminus Q\) that have at least \((\nu - o(1))k\) neighbors in Q, with probability at least \(1 - \exp \left( -\frac{(\nu - p\mu )^2tk}{3\mu p}\right) \). Here o(1) is any function of n tending to 0.

Proof

For convenience, we will consider the size of Q to be exactly \(\mu k\), and consider the set of vertices that have at least \(\nu k\) neighbors in Q, as the additive o(1) terms do not affect anything in the proof. We shall also use g(n, p, t) as shorthand notation for \(g(n, p, \mu , \nu , t)\).

Fix some set \(Q\subset V\) of size \(\mu k\), and a set \(I \subset V\setminus Q\) of size m. Let T(I, Q) denote the event that every vertex in I has at least \(\nu k\) neighbors in Q. Consider a random bipartite graph with parts I and Q and edge probability p, and let e(I, Q) be the number of edges between I and Q. It is clear that \(\mathbb {E}[e(I, Q)] = m\mu kp\), and the event T(I, Q) implies the event \(\{e(I, Q) \ge m\nu k\}\). Hence

$$\begin{aligned} \mathbb {P}[T(I, Q)] &\le \mathbb {P}[e(I, Q) \ge m\nu k] = \mathbb {P}\big [e(I, Q) \ge \mathbb {E}[e(I, Q)] + m(\nu - p\mu )k\big ] \\ &\le 2\exp \left( -\frac{(\nu - p\mu )^2mk}{2\mu p}\right) . \end{aligned}$$

There are \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \le \left( \frac{ne}{m}\right) ^m\le \exp (2m\log n)\) possible vertex sets I, and \(\left( {\begin{array}{c}n\\ \mu k\end{array}}\right) \le \exp (2\mu k\log n)\) possible subsets Q. Let \(T_m\) be the event that for at least one such choice of I and Q the event T(I, Q) holds. By a union bound,

$$\begin{aligned} \mathbb {P}[T_m] \le 2\exp \left( 2\mu k\log n + m \cdot \left( 2\log n -\frac{(\nu - p\mu )^2k}{2 \mu p}\right) \right) . \end{aligned}$$

Since \(k = Cw(n)^{1/2}\), \(p = w(n)/n\) and \(w(n) = O(n)\), we have \(\frac{k}{p} = \Omega (\sqrt{n})\), so \(2\log n -\frac{(\nu - p\mu )^2k}{2 \mu p} \le -\frac{(\nu - p\mu )^2k}{3 \mu p}\), and hence \(\mathbb {P}[T_m] \le \exp \left( 2\mu k\log n - m\cdot \frac{(\nu - p\mu )^2k}{3\mu p}\right) \). The exponent is negative if and only if \(m > \frac{6\mu ^2}{(\nu - p\mu )^2}p\log n\); thus for \(m > g(n, p, t) = \frac{6 \mu ^2}{(\nu - p\mu )^2}p\log n + t\) we get \(\mathbb {P}[T_m] \le \exp \left( -\frac{(\nu - p\mu )^2tk}{3\mu p}\right) \xrightarrow {n\rightarrow \infty }0\). Therefore, with probability at least \(1 - \exp \left( -\frac{(\nu - p\mu )^2tk}{3\mu p}\right) \), for every set Q of size \(\mu k\) there are at most \(g(n, p, t) = g(n, p, \mu , \nu , t)\) vertices from \(V\setminus Q\) that have at least \(\nu k \) neighbors in Q.\(\square \)

By setting \(\mu = 1\) and \(\nu = \frac{1 + p}{2}\) and choosing \(t = \frac{24p}{(1-p)^2}\log n\) we get an immediate corollary (indeed, \(\nu - p\mu = \frac{1-p}{2}\), so \(g(n, p, \mu , \nu , t) = \frac{24p\log n}{(1 - p)^2} + t = \frac{48p\log n}{(1 - p)^2} = a(n, p)\), and the failure probability becomes \(\exp (-2k\log n)\)).

Corollary 2.2

Let k and p be as in Theorem 2.4. With probability at least \(1 - \exp (-2k\log n)\) over choice of \(G'\), for every \(S\subset V\), \(|S| = k\), there are at most \(a(n, p) := \frac{48}{(1 - p)^2}p\log n\) vertices from \(V\setminus S\) with at least \(\frac{1 + p}{2}k\) neighbors in S.

In order to upper bound \(\vartheta (\bar{G})\) for every possible adversarial strategy, we fix some particular set K of k vertices in advance, and prove that the bound holds with extremely high probability for the case that K is the chosen planted clique. Theorem 2.4 follows by a union bound over all possible choices of K.

The graph G, with \(V(G) = [n]\) and \(E = E(G)\), is generated by first drawing a random \(G'\sim G(n, p)\), and then adding edges between all vertices in K. Let Q be the set of all vertices with at least \(\frac{1 + p}{2}k\) neighbors in the planted clique K. By Corollary 2.2, \(k' := |Q| \le k + a(n, p)\). We number the vertices of V(G) in such a way that \(Q = [k']\), and the planted k-clique of G is [k].

We derive an upper bound on \(\vartheta (\bar{G})\) by presenting a particular matrix M, for which \(\vartheta (\bar{G}) \le \lambda _1(M) \le k' \le k + a(n, p)\). We use d(i, Q) to denote the number of edges between the vertex \(i \in [n]\setminus [k']\) and the set \(Q = [k']\). The symmetric matrix M we choose is as follows (a sketch of its assembly in code appears after the list).

  • The upper left \(k' \times k'\) block is the all-ones matrix of order \(k'\).

  • The lower right block of size \((n - k') \times (n - k')\), denoted by C, is defined as \(c_{ij} = 1\) if \((i, j) \in E\) and \(c_{ij} = -\frac{p}{1-p}\) if \((i,j) \not \in E\).

  • The lower left block is an \((n - k') \times k'\) matrix B. For this matrix, \(b_{ij} = 1\) if \((i, j) \in E\), and \(b_{ij} = -d(i, Q)/(k' - d(i, Q))\) if \((i,j) \not \in E\). Observe that every row of B sums up to zero.

  • The upper right block is the transpose of the lower left block B.
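A sketch of the assembly of M (numpy; we assume, as in the text, that the vertices are numbered so that \(Q = [k']\); names are ours):

```python
import numpy as np

def build_M(adj, k_prime, p):
    """Assemble the certificate matrix M: all ones on the Q-block, B-block
    rows forced to sum to zero, and -p/(1-p) on non-edges outside Q."""
    n = len(adj)
    M = np.ones((n, n))  # entries on edges of G (and the Q-block) stay 1
    for i in range(k_prime, n):
        d = sum(adj[i][j] for j in range(k_prime))  # d(i, Q)
        for j in range(k_prime):
            if not adj[i][j]:
                M[i, j] = M[j, i] = -d / (k_prime - d)  # d < k' by choice of Q
        for j in range(k_prime, n):
            if i != j and not adj[i][j]:
                M[i, j] = M[j, i] = -p / (1 - p)
    return M
```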

We rewrite \(b_{ij}\) for \((i, j) \notin E\) in the following way:

$$\begin{aligned} b_{ij} = x_i - \frac{p}{1 - p}, \qquad \qquad x_i = \frac{k'p - d(i, Q)}{(1 - p)(k' - d(i, Q))}. \end{aligned}$$

The vector with 1 in its first \(k'\) entries and 0 in the other \(n - k'\) coordinates is an eigenvector of M with eigenvalue \(k'\). To show that \(k'\) is the largest eigenvalue, it suffices to prove that \(\lambda _2(M) < k'\). We represent M as a sum of three symmetric matrices \(M = U + V + W\), and apply Weyl's theorem [12]:

$$\lambda _2(M) \le \lambda _1(U) + \lambda _2(V + W) \le \lambda _1(U) + \lambda _2(V) + \lambda _1(W).$$

Matrices U, V and W are as follows (a numerical sanity check of this decomposition is sketched after the list).

  • The matrix U is derived from the adjacency matrix of the original graph \(G' \sim G(n,p)\). \(U_{ii} = 0\) for all i, \(U_{ij} = 1\) if \((i, j)\in E\) (in \(G'\)), and \(U_{ij} = -p/(1 - p)\) for all other \(i \not =j\).

  • Matrix V describes the difference between M and U on the top \(k'\) by \(k'\) block. This difference is due to the modification that \(G'\) undergoes by planting the clique K and extending it to Q, and also to the change of diagonal entries from 0 to 1. For \(i, j \le k'\) we have \(V_{ij} = 1 - U_{ij}\), which is \(1/(1 - p)\) if \(i \ne j\) and \((i, j)\) was not an edge of \(G'\). All other entries are 0.

  • The matrix W is the correction matrix for having the row sums of B equal to 0. In its lower left block (\(i > k'\) and \(j \le k'\)), \(W_{ij} = 0\) if \(b_{ij} = 1\) and \(W_{ij} = x_i\) if \(b_{ij} = x_i - p/(1 - p)\). Its upper right block is the transpose of the lower left block. All other entries are 0.
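The decomposition and the Weyl bound can be sanity-checked numerically; a minimal sketch (numpy; eigenvalues taken in descending order):

```python
import numpy as np

def weyl_bound_check(U, V, W):
    """Check lambda_2(U + V + W) <= lambda_1(U) + lambda_2(V) + lambda_1(W)
    for symmetric matrices, as guaranteed by Weyl's inequality."""
    eigs = lambda A: np.sort(np.linalg.eigvalsh(A))[::-1]  # descending order
    lhs = eigs(U + V + W)[1]
    rhs = eigs(U)[0] + eigs(V)[1] + eigs(W)[0]
    return lhs, rhs, lhs <= rhs + 1e-9
```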

Claim

With probability at least \(1 - \exp (-k\log n)\), for every possible choice of k vertices by the adversary, we have

$$\begin{aligned} \lambda _1(U) \le \frac{2 + o(1)}{\sqrt{1 - p}}w(n)^{1/2}, \qquad \lambda _2(V) = o(k'), \qquad \lambda _1(W) \le \frac{2 + o(1)}{1 - p}w(n)^{1/2}. \end{aligned}$$

To bound the eigenvalues of U, V and W, we shall use upper bounds on the eigenvalues of random matrices, as they appear in [19].

Theorem 2.5

There are constants \(C'\) and \(C''\) such that the following holds. Let \(a_{ij}\), \(i, j \in [n]\) be independent random variables, each of which has mean 0 and variance at most \(\sigma ^2\) and is bounded in absolute value by L, where \(\sigma \ge C''L\frac{\log ^2n}{\sqrt{n}}\). Let A be the corresponding \(n\times n\) matrix. Then with probability at least \(1 - O(1/n^3)\),

$$\begin{aligned} \lambda _1(A) \le 2\sigma \sqrt{n} + C'(L\sigma )^{1/2}n^{1/4}\log n. \end{aligned}$$

The bound holds regardless of what the diagonal elements of A are, since by subtracting the diagonal we may decrease the eigenvalues at most by L.

The matrix U is a random matrix, as it is generated from the graph \(G' \sim G(n, p)\). The entries of matrix U have mean zero, \(|U_{ij}| = O(1)\) since p is bounded by constant \(c < 1\), and the variance is \(\sigma ^2(U_{ij}) = \mathbb {E}[U_{ij}^2] = p/(1 - p) \gg n^{-1/3}\), so by Theorem 2.5 we have

$$\begin{aligned} \lambda _1(U) &\le 2\sqrt{\frac{np}{1 - p}} + O\left( \left( \frac{np}{1 - p}\right) ^{1/4}\log n\right) \\ &\le \frac{2}{\sqrt{1 - p}}w(n)^{1/2} + O\left( w(n)^{1/4}\log n\right) =: \Lambda _U \end{aligned}$$

with probability at least \(1 - O(1/n^3)\). Since \(|U_{ij}| = O(1)\) for all \(i, j \in [n]\), \(\lambda _1(U)\) is at most \(O(n^2)\). Then, the expected value of \(\lambda _1(U)\) is at most \(\mathbb {E}[\lambda _1(U)] \le \Lambda _U + O(1/n)\). It follows that \(\mathbb {P}[\lambda _1(U) \ge \Lambda _U + t] \le \mathbb {P}[\lambda _{1}(U) \ge \mathbb {E}[\lambda _{1}(U)] + t]\) for all non-negative t. Hence, to show that \(\lambda _1(U)\) does not exceed \(\Lambda _U\) by too much with extremely high probability, it suffices to show that the probability of \(\lambda _1(U)\) deviating from its mean is exponentially small in \(k \log n \simeq w(n)^{1/2}\log n\). The result of Alon, Krivelevich and Vu [2] ensures that the eigenvalues of U are well-concentrated around their means.

Theorem 2.6

(Concentration of eigenvalues) For \(1 \le i \le j \le n\), let \(a_{ij}\) be independent, real random variables with absolute value at most 1. Define \(a_{ji} = a_{ij}\) for all ij, and let A be the \(n \times n\) matrix with \(A(i, j) = a_{ij}\), \(i, j \in [n]\). Let \(\lambda _1(A) \ge \lambda _2(A) \ge \ldots \ge \lambda _n(A)\) be the eigenvalues of A. For all \(s \in [n]\) and for all \(t = \omega (\sqrt{s})\):

$$\begin{aligned} \mathbb {P}[\left| \lambda _s(A) - \mathbb {E}[\lambda _s(A)]\right| \ge t] \le \exp \left( -\frac{(1 - o(1))t^2}{32s^2}\right) . \end{aligned}$$

The same estimate holds for \(\lambda _{n - s + 1}(A)\).

Taking \(t = \Theta (w(n)^{1/4}\log n)\), from Theorem 2.6 we get

$$\begin{aligned} \mathbb {P}\left[ \lambda _1(U) \ge \frac{2}{\sqrt{1 - c}}w(n)^{1/2} + \Theta \left( w(n)^{1/4}\log n\right) \right] &\le \mathbb {P}\left[ \lambda _1(U) \ge \mathbb {E}[\lambda _1(U)] + \Theta \left( w(n)^{1/4}\log n\right) \right] \\ &\le \exp \big (-\Omega \big (w(n)^{1/2}\log ^2n\big )\big ), \end{aligned}$$

so \(\lambda _1(U) \le \frac{2}{\sqrt{1 - c}}w(n)^{1/2} + O\left( w(n)^{1/4}\log n\right) \) with probability at least \(1 - \exp \big (-\Omega \big (k\log ^2n\big )\big )\). Note that the bound holds for any choice of the adversary, as the matrix U does not depend on the vertices of the planted clique and is determined by the initial graph G(n, p) only.

Moving to V, recall that matrix V captures the difference between M and U on the vertices of Q. In order to simplify the analysis, we will treat the set Q as a fixed vertex set of size \(k' \le k + a(n, p)\) chosen in advance, and establish probabilistic bounds for this fixed Q. Taking a union bound over all possible sets of size \(k'\), we will guarantee that the established properties hold for the case when Q is the set of all vertices with at least \(\frac{1 + p}{2}k\) neighbors in K, as originally described. Note that for fixed Q, the entries of matrix V are independent.

It is convenient for our analysis to consider a modified matrix \(V'\), obtained from V as follows. For all \(i, j > k'\) we have \(V'_{ij} = V_{ij} = 0\), for \(i \in [k']\) we have \(V'_{ii} = 0 = V_{ii} -1\), and for \(i < j \le k'\) we have \(V'_{ij} = V'_{ji} = V_{ij} - 1\). Consequently, for \(i \not = j\) satisfying \(i, j \le k'\), it holds that \(V'_{ij} = -1\) with probability p and \(V'_{ij} = p/(1- p)\) with probability \((1 - p)\), where the probability is taken over the original choice of \(G'\sim G(n, p)\) (that was followed by planting a clique on the set K of the first k vertices, and extending the clique to Q). Since \(V'\) is obtained from V by subtracting a matrix with an all-ones \(k'\times k'\) block and zeroes elsewhere, \(\lambda _2(V) \le \lambda _1(V')\), and we can obtain the bounds for \(\lambda _2(V)\) by applying Theorem 2.5 to the matrix \(V'\). The variance is \(\sigma ^2(V'_{ij}) = p/(1 - p) \gg \frac{\log ^4n}{n}\), so with probability at least \(1 - O(1/k'^3)\)

$$\begin{aligned} \lambda _1(V') &\le 2\sqrt{\frac{k'p}{1 - p}} + O\left( \left( \frac{k'p}{1 - p}\right) ^{1/4}\log k'\right) \\ &\le \frac{C'w(n)^{3/4}}{\sqrt{n}} + O\left( \frac{w(n)^{3/8}}{n^{1/4}}\log n\right) \end{aligned}$$

for some \(C' > 0\); we denote this bound by \(\Lambda _{V'}\). Similarly to \(\lambda _1(U)\), \(\lambda _1(V') \le O(k'^2)\) and \(\mathbb {E}[\lambda _1(V')] \le (1 + o(1))\Lambda _{V'}\). Applying Theorem 2.6 to \(\lambda _1(V')\) with \(t \gg \frac{w(n)^{3/8}}{n^{1/4}}\log n\), we get \(\mathbb {P}\left[ \lambda _1(V') > \frac{C'w(n)^{3/4}}{\sqrt{n}} + t\right] \le \exp \left( -\Omega \left( t^2\right) \right) \).

We would like these bounds to hold for any choice of the adversarial k-clique, and any extension to Q. There are \(\left( {\begin{array}{c}n\\ k'\end{array}}\right) \le \left( \frac{ne}{k'}\right) ^{k'} \le \exp (2k'\log n) \le \exp (O(w(n)^{1/2}\log n))\) possible choices, so by setting \(t =\Theta (w(n)^{1/4}\log n)\) in the bound above and applying a union bound over all possible choices of the set Q, we prove

$$\begin{aligned} \lambda _2(V) \le \frac{C'w(n)^{3/4}}{\sqrt{n}} + O\left( w(n)^{1/4}\log n\right) = o(k') \end{aligned}$$

for any choice of the adversary with probability at least \(1 - \exp \big (-\Omega \big (k\log ^2n\big )\big )\).

It remains to bound \(\lambda _1(W)\). We will use the trace of \(W^2\).

$$\begin{aligned} \lambda _1(W)^2 \le \text {tr}(W^2) = 2\sum _{i< j}W_{ij}^2 = 2\sum _{i = k' + 1}^n(k' - d(i, Q))x_i^2 = 2\sum _{i = k' + 1}^n\frac{\left( d(i, Q) - k'p\right) ^2}{(1 - p)^2(k' - d(i, Q))}. \end{aligned}$$

By definition of set Q, for every \(k' + 1 \le i \le n\) we have \(d(i, Q) \le \frac{1 + p}{2}k\), so \(k' - d(i, Q) \ge \frac{1 - p}{2}k\) and

$$\begin{aligned} 2\sum _{i = k' + 1}^n\frac{\left( d(i, Q) - k'p\right) ^2}{(1 - p)^2(k' - d(i, Q))} \le \frac{4}{(1 - p)^3k}\sum _{i = k' + 1}^n\left( d(i, Q) - k'p\right) ^2. \end{aligned}$$

It turns out that we can always bound the sum above.

Theorem 2.7

With probability at least \(1 - \exp (-2k\log n)\),

$$\begin{aligned} \sum _{i = k' + 1}^n\left( d(i, Q) - k'p\right) ^2 \le (n - k')k'p(1 - p) + o(nk'p(1 - p)) \end{aligned}$$

for every possible choice of k vertices by the adversary.

The proof is rather technical and is presented in Section B. From Theorem 2.7 we get

$$\begin{aligned} \lambda _1(W)^2 \le \frac{4 + o(1)}{(1 - p)^3k}(n - k')k'p(1 - p) \le \frac{4 + o(1)}{(1 - p)^2}(n - k')p, \end{aligned}$$

so

$$\begin{aligned} \lambda _1(W) \le \frac{2 + o(1)}{1 - p}\sqrt{(n - k')p} \le \frac{2 + o(1)}{1 - p}w(n)^{1/2}. \end{aligned}$$

Combining the bounds for \(\lambda _1(U)\), \(\lambda _2(V)\) and \(\lambda _1(W)\), the following sequence of inequalities holds with probability at least \(1 - \exp (-2k\log n)\):

$$\begin{aligned} \lambda _2(M)&\le \lambda _1(U) + \lambda _2(V) + \lambda _1(W) \le \\&\le \frac{2}{\sqrt{1 - p}}w(n)^{1/2} + \frac{2}{1 - p}w(n)^{1/2} + o(k') \le \frac{4}{1 - p}w(n)^{1/2} + o(k'). \end{aligned}$$

By choosing \(C \ge \frac{5}{1 - p}\) in \(k = Cw(n)^{1/2}\), we guarantee that the expression above is less than \(k'\). Therefore, \(k'\) is indeed the largest eigenvalue of matrix M, and \(\vartheta (\bar{G}) \le k' \le k + a(n, p)\) for every choice of adversarial k-clique with extremely high probability. This finishes the proof of Theorem 2.4.

2.2 Main Algorithm

In this section we prove Theorem 1.1 in its generalized version, considering \(G \sim AG(n, p, k)\) for a wide range of values of p, and not just \(p= \frac{1}{2}\).

Theorem 2.8

(Theorem 1.1 restated) Let \(c \in (0, 1)\) be an arbitrary constant. For every fixed \(\varepsilon > 0\) and for every \(n^{-1/3}\ll p \le c\) and \(k \ge \varepsilon \sqrt{np}\), there is an (explicitly described) algorithm running in time \(n^{O(\log (\frac{1}{\varepsilon }))}\) which almost surely finds the maximum clique in a graph \(G \sim AG(n, p, k)\). The statement holds for every adversarial planting strategy, and the probability of success is taken over the choice of \(G' \sim G(n, p)\).

We first prove such a theorem when \(k \ge C\sqrt{np}\) for a sufficiently large constant C. Afterwards, we shall extend the proof to the case that C can be an arbitrarily small constant.

Theorem 2.9

Let \(c \in (0, 1)\) be an arbitrary constant. Consider an arbitrary function w(n), such that \(n^{2/3} \ll w(n) \le cn\). Let \(G \sim AG(n, p, k)\), where \(p = w(n)/n\) and \(k \ge \frac{5}{1 - p}w(n)^{1/2}\). There is an (explicitly described) algorithm running in time \(n^{O(1)}\) which almost surely finds the maximum clique in G, for every adversarial planting strategy.

Proof

As described in Section 2, we solve the optimization problem

$$\begin{aligned} \vartheta (G) = \max _{h,\{s_i\}} \sum _{i\in V} (h \cdot s_i)^2, \end{aligned}$$
(4)

finding the optimal orthonormal representation \(\{s_i\}\) and handle h, using the SDP formulation.

Suppose that we have solved (4) for \(\bar{G}\), where \(G\sim AG(n, p, k)\) (with p and k as in Theorem 2.9). By Theorem 2.4, \(k \le \vartheta (\bar{G})\le k + a(n, p)\). Let \(G = (V, E)\), and let K denote the set of vertices chosen by the adversary.

As h and \(s_i\) are unit vectors, we have that for all \(i \in V\), \((h \cdot s_i)^2 \le 1\). Let T be the set of vertices \(i \in K\) with \((h \cdot s_i)^2 < 3/4\). We claim that \(|T| \le 4a(n, p)\). Suppose the contrary, so \(|T| > 4a(n, p)\). Delete T from the graph G and consider \(\vartheta (\overline{G\setminus T})\). We get

$$\begin{aligned} \vartheta (\overline{G\setminus T}) \ge \sum _{i \in V\setminus T}(h\cdot s_i)^2 = \vartheta (\bar{G}) - \sum _{j \in T}(h\cdot s_j)^2, \end{aligned}$$

hence by applying Corollary 2.1 to \(G\setminus T\) we get

$$\begin{aligned} \vartheta (\bar{G}) &\le \vartheta (\overline{G\setminus T}) + \sum _{j \in T}(h\cdot s_j)^2 \le k - |T| + a(n, p) + (3/4)|T| \\ &= k + a(n, p) - (1/4)|T| < k + a(n, p) - a(n, p) = k, \end{aligned}$$

a contradiction. So, there are at most 4a(np) vertices \(j \in K\) with \((h \cdot s_j)^2 < 3/4\), implying that there are at least \(k - 4a(n, p)\) vertices in K with \((h \cdot s_j)^2 \ge 3/4\). Denote this set by \(K_{3/4}\).

Observe that if \(i \in V\setminus K\) is not connected to some \(j \in K_{3/4}\), then \((h\cdot s_i)^2 \le 1/4\). Indeed, \((i, j)\notin E\) implies \(s_i \cdot s_j = 0\), so \((h\cdot s_i)^2 + (h\cdot s_j)^2 \le 1\) and therefore \((h\cdot s_i)^2 \le 1/4\). Hence, if \(i \in V\setminus K\) has \((h\cdot s_i)^2\ge 3/4\), it must be connected to the whole set \(K_{3/4}\). The set \(K_{3/4}\) has size at least \(k - 4a(n, p)\), so by Corollary 2.2 there are fewer than a(n, p) vertices \(i \in V\setminus K\) with \((h\cdot s_i)^2 \ge 3/4\). As a result, for the set H of vertices \(i \in V\) with \((h\cdot s_i)^2 \ge 3/4\), we have \(k - 4a(n, p) \le |H| \le k + a(n, p)\).

Let \(F \subset V\) be the set of all vertices that have at least 3k/4 neighbors in H. Similarly to Lemma 2.2, with extremely high probability F contains the maximum clique in G. Moreover, by Proposition 2.2 there are at most O(a(n, p)) vertices from \(V\setminus H\) that have at least 3k/4 neighbors in H, implying \(|F| \le k + O(a(n, p))\).

It follows that the maximum clique of G[F], the subgraph of G induced on F, is the maximum clique of G. Moreover, \(K\subseteq F\), so F contains a clique of size at least k, and \(|F| \le k + O(a(n, p))\). The maximum clique in G[F] can be found in polynomial time by a standard algorithm (used for example to show that vertex cover is fixed parameter tractable). For every non-edge in the subgraph induced on F, at least one of its end-vertices needs to be removed, so we try both possibilities in parallel, and recurse on each subgraph that remains. Each branch of the recursion is terminated either when the graph is a clique, or when k vertices remain (whichever happens first). At least one of the branches of the recursion finds the maximum clique. The depth of the recursion is at most \(O(a(n, p)) = O\left( \frac{p}{(1 - p)^2}\log n\right) \). Consequently the running time (which is exponential in the depth) is in the order of \(n^{O(1)} 2^{O(a(n,p))} = n^{O(1/(1-p)^2)}\). This running time is polynomial if p is upper bounded by a constant smaller than 1. This finishes the description of the algorithm, proving Theorem 2.9.\(\square \)

Returning to Theorem 1.1, which considers \(G \sim AG(n, \frac{1}{2}, k)\) and \(k \ge \varepsilon \sqrt{n}\), we prove it when \(\varepsilon \ge \frac{10}{\sqrt{2}}\) by plugging \(p = \frac{1}{2}\) into Theorem 2.9, as Theorem 2.9 assumes the condition \(k \ge \frac{5}{1 - p}w(n)^{1/2}\) (where \(w(n) = np\)). To complete the proof of Theorem 1.1 (and Theorem 2.8) we need to handle arbitrarily small constant \(\varepsilon > 0\). We will now work with the case \(k \ge \varepsilon w(n)^{1/2}\) for arbitrary \(\varepsilon > 0\).

Suppose that \(k = \varepsilon \sqrt{np}\) for \(0< \varepsilon < \frac{5}{1 - p}\). Similar to the approach of [1], we can use the algorithm that works for the case \(k \ge \frac{5}{1 - p}\sqrt{np}\) in order to obtain the algorithm for \(k = \varepsilon \sqrt{np}\).

Let s be the smallest integer satisfying \(k = \varepsilon \sqrt{np} > 2\cdot \frac{5}{1-p}\sqrt{np}\cdot p^{s/2}\). This gives \(s > \frac{2\log \frac{10}{(1-p)\varepsilon }}{\log (1/p)}\), which is a constant for constant \(\varepsilon > 0\) and p bounded away from 1. Observe that if \(p < \frac{\varepsilon ^2}{100}\) then \(s = 1\). Given a graph \(G \sim AG(n, p, k)\), we try all \(\left( {\begin{array}{c}n\\ s\end{array}}\right) \) possible choices for sets \(S\subset V\) of size s. For each such choice, if S is a clique in G, then we apply the algorithm of Theorem 2.9 on G[N(S)], the subgraph induced on the common neighborhood of S (not including S itself). The size of this subgraph is at most roughly \(p^s n + k\), and we chose the value of s so that \(k - s \ge \frac{5}{1-p}\sqrt{|N(S)|p}\). As we show below, for all \(\left( {\begin{array}{c}k\\ s\end{array}}\right) \) choices in which \(S\subset K\), the algorithm will return the largest clique in G[N(S)]. As the largest clique \(K^*\) of G contains at least s vertices of K, for at least one choice of \(S \subset K\) we also have \(S \subset K^*\). For this case, the union of S and the largest clique in G[N(S)] is the largest clique of G, as desired.
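A sketch of this reduction (Python; base_solver stands for a hypothetical implementation of the algorithm of Theorem 2.9 restricted to a vertex subset):

```python
import itertools

def find_clique_small_k(adj, s, base_solver):
    """Try every s-subset S; when S is a clique, solve the large-k instance on
    the common neighborhood N(S), and return S together with the best clique."""
    n = len(adj)
    best = []
    for S in itertools.combinations(range(n), s):
        if all(adj[u][v] for u, v in itertools.combinations(S, 2)):
            N = [v for v in range(n) if v not in S and all(adj[v][u] for u in S)]
            clique = base_solver(adj, N)  # maximum clique inside G[N(S)]
            if s + len(clique) > len(best):
                best = list(S) + clique  # S is adjacent to all of N(S)
    return best
```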

It remains to show that if \(S \subset K\), then the algorithm of Theorem 2.9 finds the maximum clique in G[N(S)]. In more details, what we need to show is that with high probability over the choice of \(G' \sim G(n, p)\), for every choice of k vertices as the adversarial planted clique K (giving the graph G), and for every choice of \(S \subset K\) of size s, the algorithm succeeds on G[N(S)].

We shall employ a union bound over all possible choices of S and K. Given that we consider all possible K (and not just the one selected by the adversary), we may describe the generation of G[N(S)] in the following way.

  1. 1.

    Start with the empty graph on a set V of n vertices.

  2. 2.

    Pick a set \(K \subset V\) of k vertices, and a set \(S \subset K\) of s vertices.

  3. 3.

    Generate \(G' \sim G(n, p)\) in a need to know basis. First reveal only those edges between S and \(V \setminus S\). Let \(V'\) denote the set of vertices that each has all of S as its neighbors. Observe that the expected size of \(V'\) is exactly \(\mathbb {E}[|V'|] = (n-k)p^s\).

  4. 4.

    Form the set \(N(S) = V' \cup (K \setminus S)\). We now reveal the edges of \(G'\) inside the set N(S), giving a graph that we call \(G'_{S,K}\). Crucially, this graph is distributed exactly like G(|N(S)|, p).

  5. 5.

    Turn \(K \setminus S\) into a clique in \(G'_{S, K}\), effectively planting a clique of size \(k - s = k - O(1)\).

Claim

With probability at least \(1 - \exp (-2k\log n)\) over the choice of \(G' \sim G(n, p)\), for all possible choices of \(K \subset V\) and \(S \subset K\) it holds simultaneously that \(|N(S)| = (1 + o(1))np^s\).

Proof

Fix some particular choices of K and S. By construction, set N(S) is a union of \(V'\) and vertices from \(K\setminus S\). We are going to show that

  • for \(k = \varepsilon \sqrt{np}\), it holds \(k\log n = o\left( np^s\right) \);

  • with probability \(\ge 1 - \exp (-2k\log n)\), \(|N(S)| \le np^s + k + 3\sqrt{np^sk\log n}\).

Given these two statements, the claim follows directly.

First consider the case \(p < \varepsilon ^2/100\), so \(s = 1\). We can assume that \(p \gg n^{-1/3}\), as when \(p = O(n^{-1/3})\) we can find the maximum clique using the enumeration algorithm from Section 3. Then \(k = O(\sqrt{np}) = O(n^{1/2})\), while \(np^s = np = \Omega (n^{2/3})\), therefore \(k\log n = o(np^s)\) holds. Since \(\mathbb {E}[|V'|] = (n - k)p^s\), by Chernoff bound the probability of \(|V'| > (n - k)p^s + 3\sqrt{(n - k)p^sk\log n}\) is at most \(\exp (-2k\log n)\), and we get the desired.

Now consider the case \(\varepsilon ^2/100 \le p \le c\), so p is a constant. But then \(k\log n = O(\sqrt{n}\log n) = o(n) = o(np^s)\) since s is a constant, and Chernoff bound again gives us \(|V'| \le np^s + 3\sqrt{np^sk\log n}\) with probability at least \(1 - \exp (-2k\log n)\).

Uniting \(V'\) with \(K\setminus S\) cannot add more than \(|K| = k\) vertices, therefore \(|N(S)| \le |V'| + |K| \le np^s + k + 3\sqrt{np^sk\log n}\), with probability at least \(1 - \exp (-2k\log n)\).\(\square \)

By the above claim and our choice of s we now have that

\(k - s > \frac{5}{1-p}\sqrt{|N(S)|p}\), where \(k-s\) is the size of the clique planted in \(G'_{S, K}\). Consequently, we are in a position to apply Theorem 2.9 on G[N(S)], and conclude that the algorithm given in the proof of the theorem finds the maximum clique in G[N(S)]. This indeed holds almost surely for every particular choice of \(K\subset V\) and \(S\subset K\), but we are not done yet, as we want this to hold for all choices of K and S in \(G' \sim G(n, p)\). To reach such a conclusion we need to analyse the failure probability of Theorem 2.9 more closely, so as to be able to take a union bound over all choices of K and S. This union bound involves \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \cdot \left( {\begin{array}{c}k\\ s\end{array}}\right) \simeq \exp (k\log n)\) events (the term \(\left( {\begin{array}{c}k\\ s\end{array}}\right) \) is negligible compared to \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \), because s is a constant).

Indeed the failure probability for Theorem 2.9 can withstand such a union bound. This is because the proof of Theorem 2.9 is based on earlier claims whose failure probability is at most \(\exp (-2k\log n)\). This upper bound on the failure probability is stated explicitly in Theorem 2.4 and Corollary 2.2, and can be shown to also hold in claims that do not state it explicitly (such as Proposition 2.1, Lemma 2.1 and Lemma 2.2, and versions of them generalized to arbitrary p), using analysis similar to that of the proof of Proposition 2.2.

3 Finding Cliques by Enumeration

In this section we prove Theorem 1.2.

Theorem 3.1

(Theorem 1.2 restated) Let \(p = n^{\delta -1}\) for \(0< \delta < 1\). Then for every k, there is an explicitly described algorithm running in time \(n^{O(\frac{1}{1-\delta })}\) which almost surely finds the maximum clique in a graph \(G \sim AG(n, p, k)\).

Let \(p = n^{\delta - 1}\) for \(0< \delta < 1\), and consider first \(G' \sim G(n, p)\) (hence \(G'\) has average degree roughly \(n^{\delta }\)). For every size \(t \ge 1\), let \(N_t\) denote the number of cliques of size t in \(G'\). The expectation (over choice of \(G' \sim G(n, p)\)) satisfies:

$$E[N_t] = \left( {\begin{array}{c}n\\ t\end{array}}\right) p^{\left( {\begin{array}{c}t\\ 2\end{array}}\right) } \le \frac{1}{t!}n^{\frac{\delta - 1}{2}t^2 + \frac{3-\delta }{2}t}$$

The exponent is maximized when \(t = \frac{3 - \delta }{2(1 - \delta )}\). For the maximizing (not necessarily integer) t, the exponent equals \(\frac{(3-\delta )^2}{8(1 - \delta )}\). We denote this last expression by \(e_{\delta }\), and note that \(e_{\delta } \le 1.3\cdot \frac{1}{1 - \delta }\). The expected number of cliques of all sizes is then:

$$\sum _{t\ge 1} E[N_t] \le n + \sum _{t\ge 2} \frac{1}{t!}n^{\frac{\delta - 1}{2}t^2 + \frac{3-\delta }{2}t} \le n^{e_{\delta }}$$

(The last inequality holds for sufficiently large n.) By Markov’s inequality, with probability at least \(1 - \frac{1}{n}\), the actual number of cliques in \(G'\) is at most \(n^{e_{\delta } + 1}\). Note that stronger concentration results can be used here, but are not needed for the proof of Theorem 1.2.

Now, for arbitrary \(1 \le k \le n\), let the adversary plant a clique K of size k in \(G'\), thus creating the graph \(G \sim AG(n, p, k)\). As every subgraph of K is a clique, the total number of cliques in G is at least \(2^k\), which might be exponential in n (if k is large). However, the number of maximal cliques in G (a clique is maximal if it is not contained in any larger clique) is much smaller. Given a maximal clique C in G, consider \(C'\), the subgraph of C not containing any vertex from K. \(C'\) is a clique in \(G'\) (which is nonempty, except for one special case of \(C = K\)). \(C'\) uniquely determines C, as the remaining vertices in C are precisely the set of common neighbors of \(C'\) in K (this is because the clique C is maximal). Consequently, the number of maximal cliques in G is not larger than the number of cliques in \(G'\).

As all maximal cliques in a graph can be enumerated in time linear in their number times some polynomial in n (see e.g. [17] and references therein), one can list all maximal cliques in G in time \(n^{e_{\delta } + O(1)}\) (this holds with probability at least \(1 - \frac{1}{n}\), over the choice of \(G'\), regardless of where the adversary plants clique K), and output the largest one.

This completes the proof of Theorem 1.2.

4 Proving NP-hardness Results

In this section we prove Theorem 1.3.

Theorem 4.1

(Theorem 1.3 restated) For \(p = n^{\delta -1}\) with \(0< \delta < 1\), \(0< \gamma < 1\), and \(cn^{1 - \delta }\log n \le k \le \frac{2}{3}n\), where c is a sufficiently large constant, the following holds. There is no polynomial time algorithm that has probability at least \(\gamma \) of finding an independent set of size k in \(G \sim A\bar{G}(n, p, k)\), unless NP has randomized polynomial time algorithms (NP=RP).

Our proof is an adaptation to our setting of a proof technique developed in [4]. We first present an overview of the proof, and then provide more details in later sections.

Recall that we are considering a graph \(G \sim A\bar{G}(n, p, k)\) (adversarial planted independent set) with \(p = n^{\delta - 1}\) and \(0< \delta < 1\). Let us first explain why the algorithm described in Section 2 fails when \(k = cn^{1 - \frac{\delta }{2}}\) (whereas if the independent set is planted at random, algorithms based on the theta function are known to succeed). The problem is that the bound in Theorem 2.2 is not true anymore, and instead one has the much weaker bound of \(\vartheta (G) \le k + n^{1 - \delta }\log n\). Following the steps of the algorithm of Section 2, in the final step, we need to remove a minimum vertex cover from F. However, now the upper bound on the size of this vertex cover is \(O(n^{1 - \delta }\log n)\) rather than \(O(\log n)\). Consequently, we do not know of a polynomial time algorithm that will do so. It may seem that we also do not know that no such algorithm exists. After all, F is not an arbitrary worst case instance for vertex cover, but rather an instance derived from a random graph. However, our NP-hardness result shows that indeed this obstacle is insurmountable, unless NP has randomized polynomial time algorithms. We remark that using an approximation algorithm for vertex cover in the last step of the algorithm of Section 2 does allow one to find in G an independent set of size \(k - O(n^{1 - \delta }\log n) = (1 - o(1))k\), and the NP-hardness result applies only because we insist on finding an independent set of size at least k.

Let us proceed now with an overview of our NP hardness proof. We do so for the case that \(k = \frac{n}{3}\) (for which we can easily find the maximum independent set if the planted independent set is random). Assume for the sake of contradiction that ALG is a polynomial time algorithm that with high probability over choice of \(G' \sim G(n, p)\), for every planted independent set of size \(k = \frac{n}{3}\), it finds in the resulting graph G an independent set of size k.

We now introduce a class \(\mathcal {H}\) of graphs that, in anticipation of the proofs that will follow, is required to have the following three properties. (Two of the properties are stated below in a qualitative manner, but they have precise quantitative requirements in the proofs that follow.)

  1. 1.

    Solving maximum independent set on graphs from this class is NP-hard.

  2. 2.

    Graphs in this class are very sparse.

  3. 3.

    The number of vertices in each graph is small.

To satisfy the above requirements, we consider graphs that are balanced.

Definition 4.1

Given a graph H, denote its average degree by \(\alpha \). A graph H is balanced if every induced subgraph of H has average degree at most \(\alpha \).

We choose \(0< \varepsilon < \min [\frac{\delta }{2}, 1 - \delta ]\), and let \(\mathcal {H}\) be the class of balanced graphs on \(n^{\epsilon }\) vertices, and of average degree \(2 + \delta \).

Given a graph \(H \in \mathcal {H}\) and a parameter \(k'\), it is NP-hard to determine whether H has an independent of size at least \(k'\) or not.

Theorem 4.2

For any \(0 < \eta \le 1\), determining the size of the maximum independent set in a balanced graph with average degree \(2< \alpha < 2 + \eta \) is NP-hard.

The proof of Theorem 4.2 is provided in Section A.

In our proofs, we view \(G' \sim G(n, p)\) as an n vertex graph with vertices numbered from 1 to n, and \(H \in \mathcal {H}\) as an m vertex graph with vertices numbered from 1 to m. For simplicity, we assume that m divides n (this assumption can easily be removed with only negligible effect on the results). As part of subsequent proofs, we shall consider the question of whether \(G'\) is likely to have H as an induced subgraph. So as to simplify the analysis of this question, we shall only consider induced subgraphs of \(G'\) that are said to obey a given partition. We now explain this concept. Given an n-vertex graph \(G'\), we fix the following partition of the vertex set of \(G'\) into m disjoint subsets of vertices, each of size \(\frac{n}{m}\). Part i for \(1 \le i \le m\) contains the vertices \([(i-1)\frac{n}{m} + 1, i\frac{n}{m}]\). A vertex set M of size m that contains one vertex in each part is said to obey the partition. If M is ordered, then to obey the partition we require vertex i of M to be in part i of the partition, for every \(1 \le i \le m\).

We will reach a contradiction to the existence of ALG by showing how ALG could be used in order to find in H an independent set of size \(k'\), if one exists. For this, we use the following randomized algorithm ALGRAND.

  1. 1.

    Generate a random graph \(G' \sim G(n, p)\), with \(p = n^{\delta -1}\).

  2. 2.

    Plant in \(G'\) a random copy of H that obeys the partition. That is, pick a random set M of m vertices that obeys the partition, associate the ith vertex of H with the vertex of M that is in the ith part, and replace the subgraph of \(G'\) induced on M by H. We refer to the resulting distribution as \(G_H(n,p)\), and to the graph sampled from this distribution as \(G_H\). Observe that the number of vertices in \(G_H\) that have a neighbor in H is with high probability not larger than \(|H|n^{\delta } \le \frac{n}{2}\).

  3. 3.

    Within the non-neighbors of H, plant at random an independent set \(I'\) of size \(k - k'\). We refer to the resulting distribution as \(G_H(n, p, k)\), and to the graph sampled from this distribution as \(\tilde{G}_H\). Observe that with extremely high probability, \(\alpha (\tilde{G}_H \setminus H) = k - k'\). Hence we may assume that this indeed holds. If furthermore \(\alpha (H) \ge k'\), then \(\alpha (\tilde{G}_H) \ge k\).

  4. 4.

    Run ALG on \(\tilde{G}_H\). We say that ALGRAND succeeds if ALG outputs an independent set IS of size k. Observe that then at least \(k'\) vertices of H are in IS, and hence ALGRAND finds an independent set of size \(k'\) in H.

If H does not have an independent set of size \(k'\), ALGRAND surely fails to output such an independent set. But if H does have an independent set of size \(k'\), why should ALGRAND succeed? This is because ALG (which is used in ALGRAND) is fooled to think that the graph \(\tilde{G}_H\) generated by ALGRAND was generated from \(A\bar{G}(n, p, k)\), and on such graphs ALG does find independent sets of size k. And why is ALG fooled? This is because the distribution of graphs generated by ALGRAND is statistically close to a distribution that can be created by the adversary in the \(A\bar{G}(n, p, k)\) model. Specifically, consider the following distribution that we refer to as \(A_H(n, p, k)\).

  1. 1.

    Generate \(G' \sim G(n, p)\), with \(p = n^{\delta -1}\).

  2. 2.

    The computationally unbounded adversary finds within \(G'\) all subsets of vertices of size \(m=|H|\) that obey the partition, such that the subgraph induced on them is H, with vertex i of H in the ith part of the partition. (If there is no such subset, fail.) Choose one such copy of H uniformly at random.

  3. 3.

    As H is assumed to have an independent set of size \(k'\), plant an independent set K of size k as follows. \(k'\) of the vertices of K are vertices of an independent set in the selected copy of H. The remaining \(k - k'\) vertices of K are chosen at random among the vertices of \(G'\) that have no neighbor at all in the copy of H. Observe that we expect there to be at least roughly \(n - |H|n^{\delta } \ge \frac{n}{2}\) such vertices, and with extremely high probability the actual number will be at least \(\frac{n}{3} > k - k'\). We refer to the resulting graph with planted independent set as \(\tilde{G}\).

We shall show that the distribution \(\tilde{G}_H \sim G_H(n, p, k)\) generated by ALGRAND is “statistically similar" to the distribution \(\tilde{G}\sim A_H(n, p, k)\) generated by the adversary. The notion of statistical similarity that we use is presented in the following definition.

Definition 4.2

Let AB be two probability distributions on the same domain X. For fixed \(x \in X\), let \(p_{A}(x)\) denote the probability that a random sample from A gives x, and let \(p_{B}(x)\) denote the probability that a random sample from B gives x. We say that A is \((\beta , \eta )\)-close to B for constants \(\beta , \eta \in (0, 1)\), if \(p_{A}(x) \ge \beta \cdot p_{B}(x)\) with probability at least \(1 - \eta \) over \(x\sim B\).

Proposition 4.1 below follows immediately from Definition 4.2.

Proposition 4.1

Let f be an arbitrary function that gets \(x \in X\) as input and outputs either 0 or 1. If probability distribution A is \((\beta , \eta )\)-close to probability distribution B, then

$$\begin{aligned} \underset{x\sim A}{\mathbb {P}}[f(x) = 1] \ge \beta \left( \underset{x\sim B}{\mathbb {P}}[f(x) = 1] - \eta \right) . \end{aligned}$$

The following theorem will play a central part in the proof of Theorem 1.3.

Theorem 4.3

Let pk be as in Theorem 1.3. There exist universal constants \(\beta , \eta \in (0, 1)\), such that for large enough n, for every \(H \in \mathcal {H}\), the distribution \(\tilde{G}_H \sim G_H(n, p, k)\) generated by ALGRAND is \((\beta , \eta )\)-close to the distribution \(\tilde{G}\sim A_H(n, p, k)\) generated by the adversary.

The proof of Theorem 4.3 appears in Section 4.2. Here we explain the main ideas in the proof.

For a graph G and a given partition, let \(X_H(G)\) denote the number of sets S of size m obeying the partition, such that the subgraph of G induced on S is H (with vertex i of H in part i of the partition, for every \(1 \le i \le m\)). For a graph G chosen at random from some distribution, \(X_H(G)\) is a random variable.

A minimum requirement for Theorem 4.3 to hold is that with sufficiently high probability, \(X_H(G')\ge 1\) for \(G' \sim G(n, p)\), as otherwise \(A_HG(n, p, k)\) fails to produce any output. But this by itself does not suffice. Intuitively, the condition we need is that \(X_H(G')\) is typically “large". Then the fact that \(G_H(n, p)\) of ALGRAND adds another copy of H to \(G'\) does not appear to make much of a difference to \(G'\), because \(G'\) anyway has many copies of H. Hopefully, this will imply that \(G' \sim G(n, p)\) and \(G_H \sim G_H(n, p)\) come from two distributions that are contiguous. This intuition is basically correct, though another ingredient (a concentration result) is also needed. Specifically, we need the following lemma (stated informally).

Lemma 4.1

(Informal) For every \(H \in \mathcal {H}\), for \(G' \in G(n, p)\) with p as above, the expected number of copies of H that obey the partition in \(G'\) is high. Specifically, \(E[X_H(G')] \ge 2^{n^{\eta }}\) for some \(\eta > 0\) that depends on \(\delta \) and \(\varepsilon \). Moreover, with high probability over choice of \(G'\), \(X_H(G')\) is not much lower than its expectation.

The proof of Lemma 4.1 is based on known techniques (first and second moment methods). It uses in an essential way the fact that the graph H is sparse (average degree barely above 2) and does not have many vertices (these properties hold by definition of the class \(\mathcal {H}\)). See more details in Section 4.1. Armed with Lemma 4.1, we first prove the closeness of distributions on graphs without the planted independent set.

Lemma 4.2

Let pk be as in Theorem 1.3. There exist universal constants \(\beta , \eta \in (0, 1)\), such that for large enough n, for every \(H \in \mathcal {H}\), the distribution \(G_H(n, p)\) is \((\beta , \eta )\)-close to the distribution G(np).

Lemma 4.2 is proved by considering graphs \(G' \sim G(n, p)\) that do contain a copy of H that obeys the partition (Lemma 4.1 establishes that this is a typical case), and comparing for each such graph the probability of it being generated by \(G_H(n, p)\) with the probability of it being generated by G(np). Conveniently, the ratio between these probabilities is the same as the ratio between \(X_H(G')\) and \(E[X_H(G')]\). By Lemma 4.1, for most graphs, this ratio is close to 1. For more details, see Section 4.2.

Theorem 4.3 follows quite easily from Lemma 4.2. Consequently, ALG’s performance on the distributions \(G_H(n, p, k)\) and \(A_H(n, p, k)\) is similar. By our assumption, ALG finds (with high probability) an independent set of size k in \(\tilde{G}\sim A_H(n, p, k)\), which now implies that it also does so for \(\tilde{G}_H \sim G_H(n, p, k)\). But as argued above, finding an independent set of size k in \(\tilde{G}_H \sim G_H(n, p, k)\) implies that ALGRAND finds an independent set of size \(k'\) in \(H \in \mathcal {H}\), thus solving an NP-hard problem. Hence the assumption that there is a polynomial time algorithm ALG that can find independent sets of size k in \(G \sim A\bar{G}(n, p, k)\) implies that NP has randomized polynomial time algorithms.

In the upcoming sections we provide the details that were omitted from our overview of the proof of Theorem 1.3.

4.1 Proof of Lemma 4.1

Recall that the random variable \(X_H(G)\) denotes the number of induced copies of H that obey the partition in the random graph G. The main technical content of Lemma 4.1 is handled by the following lemma.

Lemma 4.3

Let \(G \sim G(n, p)\) be a random graph with \(p \in (0, 1)\). Let H be a balanced graph on m vertices with average degree \(2< \alpha < 3\), and let \(0< \varepsilon < 1/7\) be a constant. If \(\varepsilon > m^2p\) and \(\varepsilon ^2 > 2\frac{m^4}{n^2p^\alpha }\) (or equivalently, \(m \le \min [\sqrt{\varepsilon /p},\; 2^{-1/4}p^{\alpha /4}\sqrt{\varepsilon n}]\)), then for every \(\beta \in [0, 1)\)

$$\begin{aligned} \mathbb {P}[X_H(G) \le \beta \mathbb {E}[X_H(G)]] \le \frac{4\varepsilon }{(1 - \beta )^2}. \end{aligned}$$

Proof

Let \(w(n) := np\), so \(p = w(n)/n\). Let \(Y_H(G)\) be a random variable counting the number of sets S obeying the partition that have H as an edge induced subgraph of G, but may have additional internal edges. By definition, \(X_H(G) \le Y_H(G)\) and

$$\begin{aligned} \mathbb {E}[Y_H(G)] = \left( \frac{n}{m}\right) ^mp^{\frac{\alpha m}{2}} = \left( \frac{n}{m}\right) ^m\cdot \left( \frac{w(n)}{n}\right) ^{\frac{\alpha m}{2}} = \left( \frac{w(n)^{\alpha }}{m^2n^{\alpha - 2}}\right) ^{\frac{m}{2}}. \end{aligned}$$

A set S in \(Y_H(G)\) contributes to \(X_H(G)\) if it has no internal edges beyond those of H. This happens with probability

$$\begin{aligned} (1 - p)^{\left( {\begin{array}{c}m\\ 2\end{array}}\right) - \frac{\alpha m}{2}} \ge 1 - \left( {\begin{array}{c}m\\ 2\end{array}}\right) \frac{w(n)}{n} \ge 1 - \frac{m^2w(n)}{n} \ge 1 - \varepsilon , \end{aligned}$$

as \(\varepsilon > m^2p\), so

$$\begin{aligned} \mathbb {E}[X_H(G)] \ge (1 - p)^{\left( {\begin{array}{c}m\\ 2\end{array}}\right) - \frac{\alpha m}{2}}\mathbb {E}[Y_H(G)] \ge (1 - \varepsilon )\mathbb {E}[Y_H(G)]. \end{aligned}$$

Note that if \(\mathbb {E}[Y_H(G)^2] \le (1 + \varepsilon )\mathbb {E}[Y_H(G)]^2\), the inequality above gives us \(\mathbb {E}[X_H(G)^2] \le \frac{1 + \varepsilon }{(1 - \varepsilon )^2}\mathbb {E}[X_H(G)]^2\). We will now compute \(\mathbb {E}[Y_H(G)^2]\). Given the occurrence of H, consider another potential occurrence \(H'\) that differs from it by t vertices. Since H is balanced graph,

$$\begin{aligned} |E(G[H'])| - |E(G[H'\cap H])| \ge \frac{\alpha |V(G[H'])|}{2} - \frac{\alpha |V(G[H\cap H'])|}{2} \ge \frac{\alpha t}{2}. \end{aligned}$$

Hence, the probability that \(H'\) is realized conditioned on H being realized is at most \(p^{\frac{\alpha t}{2}}\). The number of ways to choose t other vertices is \(\left( {\begin{array}{c}m\\ t\end{array}}\right) \left( \frac{n}{m}\right) ^t\) (first choose t groups out of m in the partition, then choose one vertex in each group). Hence, the expected number of such occurrences is

$$\begin{aligned} \mu _t \le \left( {\begin{array}{c}m\\ t\end{array}}\right) \left( \frac{n}{m}\right) ^tp^{\frac{\alpha t}{2}} = \left( {\begin{array}{c}m\\ t\end{array}}\right) \left( \frac{w(n)^{\alpha }}{m^2n^{\alpha - 2}}\right) ^{\frac{t}{2}}. \end{aligned}$$

It follows that \(\mu _m \le \mathbb {E}[Y_H(G)]\). Moreover,

$$\begin{aligned}&\sum _{t = 1}^{m/2}\frac{\mu _t}{\mathbb {E}[Y_H(G)]} \le \sum _{t = 1}^{m/2}\left( {\begin{array}{c}m\\ t\end{array}}\right) \left( \frac{w(n)^{\alpha }}{m^2n^{\alpha - 2}}\right) ^{\frac{t - m}{2}} \le 2^m\left( \frac{w(n)^{\alpha }}{m^2n^{\alpha - 2}}\right) ^{-\frac{m}{4}} =\\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad = \left( \frac{16m^2n^{\alpha - 2}}{w(n)^{\alpha }}\right) ^{\frac{m}{4}}. \end{aligned}$$

By assumption, \(\varepsilon ^2 > 2\frac{m^4}{n^2p^\alpha }\) where \(0< \varepsilon < 1/7\), and substituting \(p = w(n)/n\) gives \(w(n)^\alpha > 2m^4n^{\alpha - 2}\) and \(w(n)^\alpha \gg m^2n^{\alpha - 2}\). As a result, the expression above is o(1). Furthermore,

$$\begin{aligned} \sum _{t = m/2}^{m - 1}\frac{\mu _t}{\mathbb {E}[Y_H(G)]} \le \sum _{t = m/2}^{m - 1}m^{m - t}\left( \frac{w(n)^{\alpha }}{m^2n^{\alpha - 2}}\right) ^{\frac{t - m}{2}} = \sum _{t = m/2}^{m - 1}\left( \frac{w(n)^{\alpha }}{m^4n^{\alpha - 2}}\right) ^{\frac{t - m}{2}}. \end{aligned}$$

When \(w(n)^\alpha > 2m^4n^{\alpha - 2}\) the term \(t = m - 1\) dominates, and hence the sum is at most roughly \(\sqrt{\frac{m^4n^{\alpha - 2}}{w(n)^{\alpha }}}\). Since \(\varepsilon ^2 > 2\frac{m^4n^{\alpha - 2}}{w(n)^\alpha }\) we have

$$\begin{aligned} \sum _{t = 1}^{m - 1}\frac{\mu _t}{\mathbb {E}[Y_H(G)]} \le \left( \frac{16m^2n^{\alpha - 2}}{w(n)^{\alpha }}\right) ^{\frac{m}{4}}&+ (1 + o(1))\sqrt{\frac{m^4n^{\alpha - 2}}{w(n)^{\alpha }}}\le \\&\qquad \le o(1) + (1 + o(1))\sqrt{\frac{m^4n^{\alpha - 2}}{w(n)^{\alpha }}} \le \varepsilon \end{aligned}$$

and \(\sum _{t = 1}^m\mu _t \le (1 + \varepsilon )\mathbb {E}[Y_H(G)]\). Hence can bound \(\mathbb {E}[Y_H(G)^2] \le \mathbb {E}[Y_H(G)]\sum _{t = 1}^m\mu _t \le (1 + \varepsilon )\mathbb {E}[Y_H(G)]^2\), and

$$\begin{aligned} \mathbb {E}[X_H(G)^2] \le \mathbb {E}[Y_H(G)^2]&\le (1 + \varepsilon )\mathbb {E}[Y_H(G)]^2 \le \\&\le \frac{1 + \varepsilon }{(1 - \varepsilon )^2}\mathbb {E}[X_H(G)]^2 \le (1 + 4\varepsilon )\mathbb {E}[X_H(G)]^2. \end{aligned}$$

The last inequality holds as \(\varepsilon < 1/7\). We get that \(\mathbb {V}[X_H(G)] = \mathbb {E}[X_H(G)^2] - \mathbb {E}[X_H(G)]^2 \le 4\varepsilon \mathbb {E}[X_H(G)]^2\). By Chebyshev’s inequality we conclude that

$$\begin{aligned} \mathbb {P}[X_H(G) \le \beta \mathbb {E}[X_H(G)]]&= \mathbb {P}[X_H(G) \le \mathbb {E}[X_H(G)] - (1 -\beta )\mathbb {E}[X_H(G)]] \le \\&\qquad \qquad \qquad \le \frac{\mathbb {V}[X_H(G)]}{(1 - \beta )^2\mathbb {E}[X_H(G)]^2}\le \frac{4\varepsilon }{(1 -\beta )^2}, \end{aligned}$$

as desired.\(\square \)

Corollary 4.1

(Lemma 4.1 restated) For every constants \(0< \delta < 1\), \(2< \alpha < \min (\frac{2}{1 - \delta }, 3)\), \(0< \rho < \min [\frac{1 - \delta }{2}, \frac{2 - \alpha (1 - \delta )}{4}]\) and \(0< \varepsilon < 1/2\) the following holds for large enough n. Let \(G \sim G(n, p)\) be a random graph with \(p = n^{\delta - 1}\), and let H be a balanced graph on \(m = n^{\rho }\) vertices and with average degree \(\alpha \). Then \(\mathbb {E}[X_H(G)]\xrightarrow {n\rightarrow \infty }+\infty \), and for every \(\beta \in [0, 1)\)

$$\begin{aligned} \mathbb {P}[X_H(G) \le \beta \mathbb {E}[X_H(G)]] \le \frac{\varepsilon }{(1 - \beta )^2}. \end{aligned}$$

Proof

We first note that \(\alpha < \frac{2}{1 - \delta }\) implies that \(2 - \alpha (1 - \delta ) > 0\), and hence we can take \(\rho > 0\) in the above Corollary. The inequality \(\rho < (1 - \delta )/2\) implies that for every constant \(\epsilon \in (0, \frac{1}{8})\), for large enough n,

$$\begin{aligned} m = n^{\rho } < \sqrt{\frac{\epsilon }{n^{\delta - 1}}} = \sqrt{\frac{\epsilon }{p}} \end{aligned}$$

Likewise, \(\rho < (2 - \alpha (1 - \delta ))/4\) implies that for large enough n, for all \(\epsilon \in (0, \frac{1}{8})\)

$$\begin{aligned} 2m^4 < \epsilon ^2 n^{2 - \alpha (1 - \delta )} = \epsilon ^2 n^2 p^{\alpha } \end{aligned}$$

The above bounds on m satisfy the requirements of Lemma 4.3, with \(\epsilon \) serving as \(\varepsilon \) of the statement Lemma 4.3. Hence,

$$\begin{aligned} \mathbb {P}[X_H(G) \le \beta \mathbb {E}[X_H(G)]] \le \frac{4\epsilon }{(1 - \beta )^2}. \end{aligned}$$

To show that \(\mathbb {E}[X_H(G)]\xrightarrow {n\rightarrow \infty }+\infty \), recall the notation \(w(n) = np\) and the following bound from the proof of Lemma 4.3

$$\begin{aligned} \mathbb {E}[X_H(G)]\ge (1 - \epsilon ) \left( \frac{w(n)^{\alpha }}{m^2n^{\alpha - 2}}\right) ^{\frac{m}{2}}. \end{aligned}$$

As \(w(n) = n^\delta \) and \(m = n^\rho \),

$$\begin{aligned} \frac{w(n)^{\alpha }}{m^2n^{\alpha - 2}} = n^{2 -2\rho - \alpha (1 - \delta )}. \end{aligned}$$

The inequality \(\rho < (2 - \alpha (1 - \delta ))/4\) implies that \(2 -2\rho - \alpha (1 - \delta ) > 2\rho \). Hence,

$$\begin{aligned} \mathbb {E}[X_H(G)]\ge (1 - \epsilon ) \left( n^{2 -2\rho - \alpha (1 - \delta )}\right) ^{\frac{m}{2}} \ge (1 - \epsilon )n^{m\rho } \xrightarrow {n\rightarrow \infty }+\infty . \end{aligned}$$

Setting \(\varepsilon = 4\epsilon \) finishes the proof.\(\square \)

4.2 Proofs of Lemma 4.2 and Theorem 4.3

We can give now the full statement of Lemma 4.2.

Lemma 4.4

(Lemma 4.2 restated) Let constants \(\delta , \alpha , \rho , {\varepsilon }\) be as in Corollary 4.1. The following holds for large enough n. Let \(G \sim G(n, p)\) be a random graph with \(p = n^{\delta - 1}\), and let H be a balanced graph on \(m = n^{\rho }\) vertices and with average degree \(\alpha \). For every constant \(\beta \in (0, 1)\), the distribution \(G_H(n, p)\) is \((\beta , \varepsilon /(1 - \beta )^2)\)-close to the distribution G(np).

Proof

Fix \(H \in \mathcal {H}\). For an n vertex graph G, let p(G) denote the probability that a random sample from G(np) gives G, and let \(p_H(G)\) denote the probability that a random sample from \(G_H(n, p)\) gives G. Consider \(p_H(G)\). Out of the \(\left( \frac{n}{m}\right) ^m\) options to choose a subset M in \(G_H(n, p)\), only \(X_H(G)\) options are such that the subgraph induced on M is H, so that the resulting graph could be G. Since H has average degree \(\alpha \), it has exactly \(\alpha m / 2\) edges. Note that

$$\begin{aligned} \mathbb {E}[X_H(G)] = \left( \frac{n}{m}\right) ^mp^{\frac{\alpha m}{2}}(1 - p)^{\left( {\begin{array}{c}m\\ 2\end{array}}\right) - \frac{\alpha m}{2}}. \end{aligned}$$

Let e be the number of edges in G. Given that we chose a suitable M, the rest of the edges \((e - \frac{\alpha m}{2})\) of \(G_H(n, p)\) should agree with G. It follows that

$$\begin{aligned} p_H(G)&= \frac{X_H(G)}{\left( \frac{n}{m}\right) ^m}p^{e - \frac{\alpha m}{2}}(1 - p)^{\left( {\begin{array}{c}n\\ 2\end{array}}\right) - \left( \left( {\begin{array}{c}m\\ 2\end{array}}\right) + e - \frac{\alpha m}{2}\right) } = \\&\qquad \quad \,\,\,= \frac{X_H(G)}{\left( \frac{n}{m}\right) ^mp^{\frac{\alpha m}{2}}(1 - p)^{\left( {\begin{array}{c}m\\ 2\end{array}}\right) - \frac{\alpha m}{2}}}p^e(1 - p)^{\left( {\begin{array}{c}n\\ 2\end{array}}\right) - e} = \\&\qquad \qquad \qquad \qquad \qquad =\frac{X_H(G)}{\mathbb {E}[X_H(G)]}p^e(1 - p)^{\left( {\begin{array}{c}n\\ 2\end{array}}\right) - e} = \frac{X_H(G)}{\mathbb {E}[X_H(G)]}p(G). \end{aligned}$$

By Corollary 4.1, for every \(\beta \in [0, 1)\),

$$\begin{aligned} \underset{G\sim G(n, p)}{\mathbb {P}}[X_H(G) \le \beta \mathbb {E}[X_H(G)]] \le \frac{\varepsilon }{(1 - \beta )^2}. \end{aligned}$$

It follows that for every G with \(X_H(G) \ge \beta \mathbb {E}[X_H(G)]\) we have

$$\begin{aligned} p_H(G) = \frac{X_H(G)}{\mathbb {E}[X_H(G)]}p(G) \ge \frac{\beta \mathbb {E}[X_H(G)]}{\mathbb {E}[X_H(G)]}p(G) = \beta p(G). \end{aligned}$$

Therefore, by Corollary 4.1, for every \(\beta \in [0, 1)\), for at least \(1 - \frac{\varepsilon }{(1 - \beta )^2}\) fraction of all graphs \(G \sim G(n, p)\) we will have \(p_H(G) \ge \beta p(G)\).\(\square \)

We now restate and prove Theorem 4.3.

Theorem 4.4

(Theorem 4.3 restated) Let constants \(\delta , \alpha , \rho , {\varepsilon }\) be as in Corollary 4.1. The following holds for large enough n. Let \(p = n^{\delta - 1}\), let \(cn^{1 - \delta }\log n \le k \le \frac{2}{3}n\) where c is a sufficiently large constant, and let H be a balanced graph on \(m = n^{\rho }\) vertices and with average degree \(\alpha \). For every constant \(\beta \in (0, 1)\), the distribution \(\tilde{G}_H \sim G_H(n, p, k)\) generated by ALGRAND is \((\beta , \varepsilon /(1 - \beta )^2)\)-close to the distribution \(\tilde{G}\sim A_H(n, p, k)\) generated by the adversary.

Proof

We will break the generations of \(\tilde{G}\sim A_H(n, p, k)\) and \(\tilde{G}_H \sim G_H(n, p, k)\) into small steps. Recall that these generation processes use a parameter \(k'\). For us (the readers), \(k'\) represents a conjectured value for the size of the maximum independent set in the given graph H. However, for the generation algorithms, \(k'\) is a parameter with no special meaning, and they produce an output regardless of the true size of the maximum independent set in H.

The graph \(\tilde{G}\sim A_H(n, p, k)\) is created as follows:

  1. 1.

    Generate a graph \(G \sim G(n, p)\).

  2. 2.

    Choose in G uniformly at random an induced copy of H that obeys the partition. If there is no such induced copy this step is said to fail, and one invokes the default (explained in item 4).

  3. 3.

    Plant uniformly at random an independent set of size \(k - k'\) among the non-neighbors of the chosen induced copy of H, giving the graph \(\tilde{G}\). If this induced copy has fewer than \(k - k'\) non-neighbors, this step is said to fail, and one invokes the default (explained in item 4).

  4. 4.

    If the default is invoked, plant at random an independent set of size k in G, giving the graph \(\tilde{G}\).

The generation of \(\tilde{G}_H \sim G_H(n, p, k)\) by ALGRAND is identical to the above, except that step 2 is replaced by the following:

  • Plant in G uniformly at random an induced copy of H that obeys the partition, giving the graph \(G_H \sim G_H(n,p)\);

Graph \(\tilde{G}_H\) is obtained from \(G_H\) by planting an independent set in the exact same way as described in steps 3 and 4.

Fixing some \(\beta > 0\), call a graph G typical if the probability of generating G under \(G_H(n, p)\) is at least \(\beta \) times the probability of generating G under G(np). Observe that if \(G \sim G(n, p)\) is typical, then step 2 of the \(A_H(n, p, k)\) process does not fail, since there is at least one copy of H present in the graph.

Suppose that graph G is typical. Let \(p_A\) denote the probability that G is generated under G(np), and let \(p_H\) denote the probability that G is generated under \(G_H(n, p)\). Then \(p_H \ge \beta p_A\). Now consider a set M of m vertices in G that obeys the partition and on which the induced subgraph is H. We refer to such a set M as a feasible set. Let \(p_{A, M}\) denote the probability that the pair (GM) is generated by the adversary, in the sense that M is the m-vertex set containing the chosen random copy of H in step 2 of the \(A_H(n, p, k)\) process above. Similarly, let \(p_{H, M}\) denote the probability that (GM) is the pair generated by ALGRAND, by the end of its step 2. We will show that \(p_{H, M} \ge \beta p_{A, M}\). Let t denote the number of feasible sets in G. Then, as the adversary chooses one such feasible set uniformly at random, we have that \(p_{A, M} = \frac{1}{t} p_A\). So, to establish \(p_{H, M} \ge \beta p_{A, M}\) it remains to show that \(p_{H, M} = \frac{1}{t} p_H\). Observe that \(p_H = \sum _{M'} p_{H, M'}\), where \(M'\) ranges over the t feasible sets. Then, if for any two feasible choices of M, say, \(M_1\) and \(M_2\), it holds \(p_{H,M_1} = p_{H, M_2}\), the equality \(p_{H, M} = \frac{1}{t} p_H\) follows immediately. To prove \(p_{H,M_1} = p_{H, M_2}\) for feasible sets \(M_1, M_2\), we may think of the generation of (GM) by ALGRAND as first choosing M and planting a copy of H there, and then completing at random the rest of the graph G. Both \(M_1\) and \(M_2\) have the exact same probability of being chosen by ALGRAND. Likewise, both have the exact same probability of being completed to G, which is precisely

$$\begin{aligned} p^{|E(G)| - |E(H)|}(1 - p)^{\left( {\begin{array}{c}n\\ 2\end{array}}\right) - \left( {\begin{array}{c}m\\ 2\end{array}}\right) - |E(G)|}. \end{aligned}$$

Hence, \(p_{H,M_1} = p_{H, M_2}\) for every feasible \(M_1\) and \(M_2\), implying that \(p_{H, M} = \frac{1}{t} p_H\), and consequently, \(p_{H, M} \ge \beta p_{A, M}\).

As a result, when graph G is typical, for every feasible set M the probability to generate (GM) by ALGRAND is at least \(\beta \) times the probability to generate (GM) by the adversary. By Lemma 4.2, the probability of \(G \sim G(n, p)\) being typical is at least \(1 - \frac{\varepsilon }{(1 - \beta )^2}\). Hence the distribution \(D_H\) over pairs \((G_H, M_H)\) generated by step 2 of ALGRAND is \((\beta , \varepsilon /(1 - \beta )^2)\)-close to the distribution \(D_A\) over pairs (GM) generated by step 2 of the adversary.

Now, given the pair \((G_H, M_H)\) of the graph \(G_H\sim G_H(n, p)\) and the vertex set \(M_H\) of the planted copy of H, the process of generating \(\tilde{G}_H \sim G_H(n, p, k)\) from \((G_H, M_H)\) is identical to the process of generating \(\tilde{G}\sim A_H(n, p, k)\) from the pair (GM) of the graph \(G \sim G(n, p)\) and the vertex set M of the induced copy of H – steps 3 and 4. But then, since the distribution \((G_H, M_H)\) is \((\beta , \varepsilon /(1 - \beta )^2)\)-close to the distribution (GM), we conclude that the distribution \(G_H(n, p, k)\) is \((\beta , \varepsilon /(1 - \beta )^2)\)-close to the distribution \(A_H(n, p, k)\).\(\square \)

Together Theorem 4.3 and Proposition 4.1 give us the following Corollary.

Corollary 4.2

Let f be an arbitrary function that gets an n vertex graph as input and outputs either 0 or 1. Let constants \(\delta , \alpha , \rho , {\varepsilon }\) be as in Corollary 4.1. The following holds for large enough n. Let \(p = n^{\delta - 1}\), let \(cn^{1 - \delta }\log n \le k \le \frac{2}{3}n\) where c is a sufficiently large constant, and let H be a balanced graph on \(m = n^{\rho }\) vertices and with average degree \(\alpha \). For every constant \(\beta \in (0, 1)\),

$$\begin{aligned} \underset{\tilde{G}_H\sim G_H(n, p, k)}{\mathbb {P}}{\left[ f(\tilde{G}_H) = 1\right] } \ge \beta \left( \underset{\tilde{G}\sim A_H(n, p, k)}{\mathbb {P}}{\left[ f(\tilde{G}) = 1\right] } - \frac{\varepsilon }{(1 - \beta )^2}\right) . \end{aligned}$$

4.3 Proof of Theorem 1.3

To prove Theorem 1.3 we shall use Theorem 4.2 together with a few relatively simple lemmas. Lemma 4.5 implies that the probability that \(G_H(n, p, k)\) fails to produce an output graph is negligible. (For \(A_H(n, p, k)\), the same is implied by the combination of Lemma 4.5 and Theorem 4.3.)

Lemma 4.5

Let \(\delta \in (0, 1)\) and \(0< \rho < (1-\delta )/2\). Let \(G \sim G(n, p)\) be a random graph with \(p = n^{\delta - 1}\). For every set S of \(n^{\rho }\) vertices of G the size of the common non-neighborhood of S is at least \(n - 2n^{\delta + \rho }\) with probability at least \(1 - \exp (-\frac{1}{4}n^{\delta })\).

Proof

We clearly have \(\rho + \delta < 1\). By Chernoff bound, the maximum degree of G is at most \(2n^{\delta }\) with probability at least \(1 - \exp (-\frac{1}{4}n^{\delta })\). Hence, any set S of \(n^{\rho }\) vertices has at most \(2n^{\delta + \rho }\) neighbors. Then, for any set S of this size the common non-neighborhood of S has size at least \(n - 2n^{\delta + \rho }\) with probability at least \(1 - \exp (-\frac{1}{4}n^{\delta })\).\(\square \)

Recall that given graph H with independent set of size \(k'\), the algorithm ALGRAND uses it to generate two random graphs. First, it creates \(G_H \sim G_H(n, p)\) by planting in \(G\sim G(n, p)\) a random copy of H that obeys the partition. We denote by M the set of vertices of G on which H is planted. Second, it generates \(\tilde{G}_H \sim G_H(n, p, k)\) by planting a random independent set \(I'\) of size \(k - k'\) within the non-neighbors of the planted copy of H. The following lemmas establish that with high probability the graph \(\tilde{G}_H\) has no independent set that has more than \(k - k'\) vertices outside the induced copy of H, implying that the maximum independent set in \(\tilde{G}_H\) is a combination of the planted \(I'\) and the original independent set of size \(k'\) in H. In Lemma 4.6 we show that with extremely high probability, there is no independent set of size \(\frac{k - k'}{2}\) in G. In Lemma 4.7 we show that with high probability, any subset Q of size \(t \le \frac{k - k'}{2}\) outside of both H and independent set \(I'\) has at least \(t + 1\) neighbors in \(I'\), implying that \(I'\) is the maximum independent set in \(G_H[V\setminus M]\) (vertices outside of \(I'\) will have too many neighbors in \(I'\), so combining them with \(I'\) will not increase the size of the independent set beyond \(k - k'\)).

Lemma 4.6

Let \(p = n^{\delta - 1}\) and \(k - k' \ge 4n^{1 - \delta }\log n\). With probability at least \(1 - \exp (-\frac{1}{2}n^{1 - \delta }\log ^2 n)\), there is no independent set of size \(\frac{k - k'}{2}\) in \(G\sim G(n, p)\).

Proof

By first moment method the probability that there exists an independent set of size t in G is at most

$$\begin{aligned} \left( {\begin{array}{c}n\\ t\end{array}}\right) (1 - p)^{\frac{t(t - 1)}{2}}&\le \exp \left( t\log n + t - t\log t -\frac{t(t - 1)}{2}n^{\delta - 1}\right) =\\&\qquad \qquad \qquad \quad \quad = \exp \left( \log n + 1 - \log t - \frac{t - 1}{2}n^{\delta - 1}\right) ^t, \end{aligned}$$

which for \(t \ge \frac{k - k'}{2} \ge 2n^{1 - \delta }\log n\) is at most \(\exp (-\frac{1}{2}n^{1 - \delta }\log ^2 n)\xrightarrow {n\rightarrow \infty }0\).\(\square \)

Lemma 4.7

Let \(p = n^{\delta - 1}\) and \(k - k' \ge 6n^{1 - \delta }\log n\). Consider the graph \(G_H \sim G_H(n, p)\) (with M denoting the set of vertices on which H is planted), and let \(W\subseteq V\setminus M\) be the non-neighbors of the planted copy of H. Let \(I'\) be a random subset of W of size \(k - k'\). For every integer t satisfying \(1 \le t \le \frac{k-k'}{2}\), with probability at least \(1 - 2/n\) every subset \(Q \subset V\setminus (M\cup I')\ = W \setminus I'\) of t vertices of graph \(G_H[V\setminus M]\) has at least \(t + 1\) neighbors in \(I'\).

Proof

To prove this, view the process of generating \(\tilde{G}_H\) in a following way. Initially, we have the graph H (on the set M of m vertices) and \(n - m\) isolated vertices. Then, for every pair of vertices uv where \(u \in M\) and \(v \in V\setminus M\), draw an edge (uv) with probability p. By doing so, we determine the set \(W\subseteq V\setminus M\) of vertices that have no neighbors in H. Select a random subset \(I' \subset W\) of size \(k - k'\). For every pair of vertices from \(V\setminus M\), if at least one of them does not belong to \(I'\), draw an edge with probability p.

There are at most \(\left( {\begin{array}{c}n\\ t\end{array}}\right) \le n^t\) possible choices for the set Q. There are at most \(\left( {\begin{array}{c}k-k'\\ t\end{array}}\right) \le k^t \le n^t\) possible choices for the set Y of at most t neighbors of Q within \(I'\). The probability that Q has no neighbors in \(I' \setminus Y\) is \((1-p)^{t(k-k'-t)} \le (1-p)^{t(k-k')/2} \le n^{-3t}\). By a union bound the probability that some subset \(Q\subset V\setminus (M\cup I')\) of size t has at most t neighbors in \(I'\) is at most \(n^{-t}\). The probability of this happening for some value \(t \le \frac{k - k'}{2}\) is at most \(\sum _{t = 1}^{(k - k')/2}n^{-t} \le \frac{2}{n}\), as desired.\(\square \)

Combining the above lemmas we have the following Corollary.

Corollary 4.3

Let \(p = n^{\delta - 1}\) and \(6n^{1 - \delta }\log n \le k - k' \le \frac{2n}{3}\). Then with probability at least \(1 - 4/n\) over the choice of graph \(G \sim G_H(n, p, k)\), every independent set of size k in G contains at least \(k'\) vertices in the planted copy of H.

Proof

There are three events that might cause the Corollary to fail.

  • \(G_H(n, p, k)\) fails to produce an output. By Lemma 4.5 and the upper bound on k, the probability of this event is smaller than \(\frac{1}{n}\).

  • Even before planting \(I'\), there is an independent set larger than \(\frac{k - k'}{2}\) in \(G_H[V\setminus M]\). By Lemma 4.6 the probability of this event is smaller than \(\frac{1}{n}\).

  • After planting \(I'\), one can obtain an independent set larger than \(I'\) in \(G_H[V\setminus M]\) by combining an independent set \(Q \subset V\setminus (M\cup I')\) with some of the vertices of \(I'\). As we already assume that Lemma 4.6 holds, Q can be of size at most \(\frac{k-k'}{2}\). Lemma 4.7 then implies that the probability of this event is at most \(\frac{2}{n}\).

The sum of the above three failure probabilities is at most \(\frac{4}{n}\). \(\square \)

Now we restate and prove Theorem 1.3.

Theorem 4.5

(Theorem 1.3 restated) For \(p = n^{\delta -1}\) with \(0< \delta < 1\), \(0< \gamma < 1\), and \(6n^{1 - \delta }\log n \le k \le \frac{2}{3}n\) the following holds. There is no polynomial time algorithm that has probability at least \(\gamma \) of finding an independent set of size k in \(G \sim A\bar{G}(n, p, k)\), unless NP has randomized polynomial time algorithms (NP=RP).

Proof

Suppose for the sake of contradiction that algorithm ALG has probability at least \(\gamma \) of finding an independent set of size k in the setting of the Theorem. Choose \(2< \alpha < \min [\frac{2}{1 - \delta } , 3]\) and \(0< \rho < \min [\frac{1 - \delta }{2}, \frac{2 - \alpha (1 - \delta )}{4}]\). Let \(\mathcal {H}\) be the class of balanced graphs of average degree \(\alpha \) on \(m = n^{\rho }\) vertices. By Theorem 4.2, given a graph \(H \in \mathcal {H}\) and a parameter \(k'\), it is NP-hard to determine whether H has an independent set of size \(k'\). We now show how ALG can be leveraged to design a randomized polynomial time algorithm that solves this NP-hard problem with high probability.

Repeat the following procedure \(5\frac{\log n}{\gamma }\) times.

  • Sample a graph \(G \sim G_H(n, p, k)\).

  • Run ALG on G. If ALG returns an independent set of size k that has at least \(k'\) vertices in the planted copy of H, then answer yes (H has an independent set of size \(k'\)) and terminate.

If \(5\frac{\log n}{\gamma }\) iterations are completed without answering yes, then answer no (H probably does not have an independent set of size \(k'\)).

Clearly, the above algorithm runs in random polynomial time. Moreover, if it answers yes then its answer is correct, because it actually finds an independent set of size \(k'\) in H. It remains to show that if H has an independent set of size \(k'\), the probability of failing to give a yes answer is small.

We now lower bound the probability that a single run of ALG on \(G \sim G_H(n, p, k)\) fails to output yes. Recall that ALG succeeds (finds an independent set of size k) with probability at least \(\gamma \) over graphs with adversarially planted independent sets, and in particular, over the distribution \(A_H(n, p, k)\).

Choose \(\varepsilon = \frac{\gamma }{9}\) and \(\beta = \frac{1}{3}\). Our choices of \(\alpha , \rho , \varepsilon \) and \(m = n^{\rho }\) satisfy the requirements of Corollary 4.1, and hence we can apply Corollary 4.2. In Corollary 4.2 use the function f that has value 1 if ALG succeeds on G. It follows from Corollary 4.2 that ALG succeeds with probability at least \(\beta (\gamma - \frac{\varepsilon }{(1 - \beta )^2})=\frac{\gamma }{4}\) over graphs \(G \sim G_H(n, p, k)\). Corollary 4.3 implies that there is probability at most \(\frac{4}{n}\) that there is an independent set of size k in G that does not contain \(k'\) vertices in the induced copy of H. Hence a single iteration returns yes with probability at least \(\frac{\gamma }{4} - \frac{4}{n} \ge \frac{\gamma }{5}\) (for sufficiently large n).

Finally, as we have \(5\frac{\log n}{\gamma }\) iterations, the probability that none of the iterations finds an independent set of size k is at most \((1 - \frac{\gamma }{5})^{5\frac{\log n}{\gamma }} \simeq \frac{1}{n}\).\(\square \)

5 Discussion

Our results show that for some ranges of parameters there are polynomial time algorithms that find the maximum independent set in \(G \sim A\bar{G}(n, p, k)\), whereas for other ranges of parameters the problem is NP-hard (under randomized reductions).

An interesting range of parameters that remains open is that of \(p = \frac{d}{n}\) for some large constant d. The case of a random planted independent set of size \(\sqrt{\frac{c}{d}} n\) (for some sufficiently large constant \(c > 0\) independent of d) was addressed in [10]. In such sparse graphs, the planted independent set is unlikely to be the maximum independent set. The main result in [10] is a polynomial time algorithm that with high probability finds the maximum independent set in that range of parameters. It would be interesting to see whether the positive results extend to the case of adversarial planted independent set. We remark that neither Theorem 1.1 nor Theorem 1.3 apply in this range of parameters.