1 Introduction

Lossy kernelization stems from parameterized complexity, a branch of theoretical computer science that studies the complexity of problems as functions of multiple parameters of the input or output [1]. A central notion in parameterized complexity is kernelization, a generic technique for designing efficient algorithms based on a polynomial-time preprocessing step that transforms a “large” instance of a problem into a smaller, equivalent instance. Naturally, the preprocessing step is called the kernelization algorithm and the smaller instance is called the kernel. One limitation of the classical kernelization technique is that it can only deal with “lossless” preprocessing, in the sense that a kernel must be equivalent to the original instance. This is why most of the interesting problems arising from machine learning, e.g., clustering, are intractable from the perspective of kernelization. Lossy or approximate kernelization is a successful attempt at combining kernelization with approximation algorithms. Informally, in lossy kernelization, given an instance of the problem and a parameter, we would like the kernelization algorithm to output a reduced instance of size polynomial in the parameter; however, the notion of equivalence is relaxed in the following way. Given a c-approximate solution (i.e., one with cost within a factor c of the optimal cost) to the reduced instance, it should be possible to produce in polynomial time an \(\alpha c\)-approximate solution to the original instance. The factor \(\alpha \) is the loss incurred while going from the reduced instance to the original instance. The notion of lossy kernelization was introduced by Lokshtanov et al. [2]. However, most developments in lossy kernelization so far are in graph algorithms [3,4,5,6,7]; see also [8, Chapter 23] for an overview.

One of the actively developing areas of parameterized complexity is fixed-parameter tractable (FPT) approximation. We refer to the survey [9] for an overview of the area. Several important advances in FPT-approximation concern clustering problems. These include tight algorithmic and complexity results for k-means and k-median [10] and a constant-factor FPT-approximation for capacitated clustering [11]. The popular approach to data compression used for FPT-approximation of clustering is based on coresets. The notion of coresets originated in computational geometry; it was introduced by Har-Peled and Mazumdar [12] for k-means and k-median clustering. Informally, a coreset is a summary of the data that, for every set of k centers, approximately (within a \((1\pm \varepsilon )\) factor) preserves the optimal clustering cost.

Lossy kernels and coresets have a lot of similarities. Both compress the original data, and any algorithm can be applied to a coreset or kernel to efficiently retrieve a solution with almost the same guarantee as the one provided by the algorithm on the original input. The crucial difference is that coreset constructions result in a small set of weighted points, where the weights could be as large as the input size n. Thus, a coreset of size polynomial in \(k/\varepsilon \) is not a polynomial-sized lossy kernel for parameters \(k, \varepsilon \) because of the \(\log {n}\) bits required to encode the weights. Moreover, coreset constructions usually do not bound the number of coordinates or the dimension of the points.

While the notion of lossy kernelization proved to be useful in the design of graph algorithms, we are not aware of its applicability in clustering. This brings us to the following question: What can lossy kernelization offer to clustering?

In this work, we take the first step towards the development of lossy kernels for clustering problems. Our main result is the design of a lossy kernel for a variant of the well-studied \(k\)-Median clustering with clusters of equal sizes. More precisely, consider a collection (multiset) of points from \(\mathbb {Z}^d\) under the \(\ell _p\)-norm. Thus, every point is a d-dimensional vector with integer coordinates. For a nonnegative integer p, we use \(\Vert \textbf{x}\Vert _p\) to denote the \(\ell _p\)-norm of a d-dimensional vector \(\textbf{x}=(x[1],\ldots ,x[d])\in \mathbb {R}^d\), that is, for \(p\ge 1\),

$$\Vert \textbf{x}\Vert _p=\left( \sum _{i=1}^d| x[i]|^p\right) ^{1/p}$$

and for \(p=0\), \(\Vert \textbf{x}\Vert _0\) is the number of nonzero elements of \(\textbf{x}\), i.e., the Hamming norm. For any subset of points \(T\subseteq \mathbb {Z}^d\), we define

$$\begin{aligned} \textsf{cost}_p(T)=\min _{\textbf{c}\in \mathbb {R}^d}\sum _{\textbf{x}\in T}\Vert \textbf{c}-\textbf{x}\Vert _p. \end{aligned}$$

Then \(k\)-Median clustering (without constraints) is the task of finding a partition \(\{X_1,\ldots ,X_k\}\) of a given set \(\textbf{X}\subseteq \mathbb {Z}^d\) of points minimizing the sum

$$ \sum _{i=1}^k \textsf{cost}_p(X_i). $$
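For intuition, the objective is easy to evaluate for \(p=1\), where a coordinate-wise median of a cluster is an optimum center. The following minimal sketch (in Python; the helper name cost_l1 is ours, introduced for illustration) computes \(\textsf{cost}_1(T)\) for a cluster given as an integer array.

```python
import numpy as np

def cost_l1(T: np.ndarray) -> float:
    """cost_1(T) for a cluster T given as an (m, d) integer array.

    Under the l1-norm, a coordinate-wise median minimizes the sum of
    distances to a center, so no numerical optimization is needed.
    """
    center = np.median(T, axis=0)           # optimum median; may be fractional
    return float(np.abs(T - center).sum())  # sum of l1 distances to the center

# Example: the optimum median of the first coordinates 0, 1, 3 is 1,
# so the cost is |0-1| + |1-1| + |3-1| = 3.
T = np.array([[0, 1], [1, 1], [3, 1]])
print(cost_l1(T))  # 3.0
```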

In many real-life scenarios, it is desirable to group data into clusters of equal sizes. For example, to tailor teaching methods to the specific needs of various students, one may be interested in forming k classes of equal size by grouping students with homogeneous abilities and skills [13]. In scheduling, a standard task is to distribute n jobs among k machines while keeping identical workloads on each machine and simultaneously reducing the configuration time. In the setting of designing a conference program, one might be interested in allocating n scientific papers according to their similarities to k “balanced” sessions [14].

The following model is an attempt to capture such scenarios:

Equal Clustering

Input:

A collection (multiset) \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) of n points of \(\mathbb {Z}^d\) and a positive integer k such that n is divisible by k.

Task:

Find a partition \(\{X_1,\ldots ,X_k\}\) (k-clustering) of \(\textbf{X}\) with \(|X_1|=\cdots =|X_k|=\frac{n}{k}\) minimizing

$$ \sum _{i=1}^k \textsf{cost}_p(X_i). $$
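On toy instances, the Equal Clustering objective can be checked by brute force. The sketch below (reusing the cost_l1 helper above; restricted to \(k=2\)) enumerates all clusters of size n/2 and is meant purely as an executable restatement of the definition, not as an algorithm from the paper.

```python
from itertools import combinations
import numpy as np

def equal_2_clustering_bruteforce(X: np.ndarray) -> float:
    """Exact optimum of Equal Clustering for k = 2 by enumeration.

    Tries every candidate first cluster of size s = n/2; only sensible
    for toy inputs since the number of candidates grows exponentially.
    """
    n = len(X)
    s = n // 2
    best = float("inf")
    for idx in combinations(range(n), s):
        rest = [i for i in range(n) if i not in set(idx)]
        best = min(best, cost_l1(X[list(idx)]) + cost_l1(X[rest]))
    return best

# Two copies of (0,0) form one cluster at cost 0; (5,5) and (5,6) form
# the other at cost 1, so the optimum is 1.0.
X = np.array([[0, 0], [0, 0], [5, 5], [5, 6]])
print(equal_2_clustering_bruteforce(X))  # 1.0
```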

First, note that Equal Clustering is a restricted variant of the capacitated version [11] of \(k\)-Median, where the size of each cluster is required to be bounded by a given number U. Also note that some points in \(\textbf{X}\) may be equal (in the above examples, several students, jobs, or scientific papers can have identical features but could be assigned to different clusters due to the size limitations). We refer to the multisets \(X_1,\ldots ,X_k\) as the clusters.

To describe the lossy-kernel result, we need to define the parameterized version of Equal Clustering with the clustering cost B (the budget) as the parameter. Following the framework of lossy kernelization [2], when the cost of an optimal clustering exceeds the budget, we assume it is equal to \(B+1\). More precisely, in Parameterized Equal Clustering, we are given an additional integer B (the budget parameter). The task is to find a k-clustering \(\{X_1,\ldots ,X_k\}\) with \(| X_1|=\cdots =| X_k|\) minimizing the value

$$\begin{aligned} \textsf{cost}_p^B(X_1,\ldots ,X_k)= {\left\{ \begin{array}{ll} \sum _{i=1}^k \textsf{cost}_p(X_i) &{}\text{ if } \sum _{i=1}^k \textsf{cost}_p(X_i) \le B,\\ B+1&{}\text{ otherwise. } \end{array}\right. } \end{aligned}$$

Before stating our results, let us first discuss some limitations and advantages of parameterizing Equal Clustering by the budget B. First, parameterization by B is reasonable when the vectors are integer-valued, which is a common situation when the data is categorical, that is, admits a fixed number of possible values. For example, it could be gender, blood type, or political orientation. A prominent example of categorical data is binary data, where the points are binary vectors. Binary data arise in several critical applications. For example, in electronic commerce, each transaction can be modeled as a binary vector (known as market basket data), each of whose coordinates denotes whether a particular item is purchased or not [15, 16]. In document clustering, each document can be modeled as a binary vector, each of whose coordinates denotes whether a specific word is present or not in the document [15, 16].

The most drastic effect of compression occurs when B is small. Intuitively, this means that many of the data points are the same. Such a condition is common in handling personal data that cannot be re-identified. For example, the k-anonymity property requires each person in the data set to be indistinguishable from at least k individuals whose information appears in the release [17].

Finally, let us compare the lossy kernel of Theorem 1 (stated below) with the coresets for \(k\)-Median and k-Means. The sizes of all known coreset constructions depend on the number of clusters k and thus guarantee compression only for small values of k. On the other hand, the size of our lossy kernel is independent of k. In particular, this type of result is interesting when we have to identify many clusters of small size.

Our first main result is the following theorem providing a polynomial 2-approximate kernel.

Theorem 1

For every nonnegative integer constant p, Parameterized Equal Clustering admits a 2-approximate kernel when parameterized by B, where the output collection of points has \(\mathcal {O}(B^2)\) points of \(\mathbb {Z}^{d'}\) with \(d'= \mathcal {O}(B^{p+2})\), and each coordinate of a point has absolute value \(\mathcal {O}(B^3)\).

In other words, the theorem provides a polynomial-time algorithm that compresses the original instance \(\textbf{X}\) to a new instance whose size is bounded by a polynomial of B and such that any c-approximate solution to the new instance can be turned in polynomial time into a 2c-approximate solution to the original instance.

A natural question is whether the approximation ratio of the lossy kernel in Theorem 1 is optimal. While we do not have a complete answer to this question, we provide lower bounds supporting our study of the problem from the perspective of approximate kernelization. Our next result rules out the existence of an “exact” kernel for the problem. To state the result, we need to define the decision version of Equal Clustering. In this version, which we call Decision Equal Clustering, the question is whether, for a given budget B, there is a k-clustering \(\{X_1,\ldots ,X_k\}\) with clusters of the same size such that \(\sum _{1\le i \le k} \textsf{cost}_p(X_i) \le B\).

Theorem 2

For the \(\ell _0\) and \(\ell _1\)-norms, Decision Equal Clustering has no polynomial kernel when parameterized by B unless \({{{\,\mathrm{\textsf{NP}}\,}}\subseteq {{\,\mathrm{\textsf{coNP}}\,}}/{{\,\mathrm{\textsf{poly}}\,}}}\), even if the input points are binary, that is, are from \(\{0,1\}^d\).

On the other hand, we prove that Decision Equal Clustering admits a polynomial kernel when parameterized by k and B.

Theorem 3

For every nonnegative integer constant p, Decision Equal Clustering admits a polynomial kernel when parameterized by k and B, where the output collection of points has \(\mathcal {O}(kB)\) points of \(\mathbb {Z}^{d'}\) with \(d'= \mathcal {O}(kB^{p+1})\), and each coordinate of a point has absolute value \(\mathcal {O}(kB^2)\).

When it comes to approximation in polynomial time, we show (Theorem 5) that it is \({{\,\mathrm{\textsf{NP}}\,}}\)-hard to obtain a \((1+\varepsilon _c)\)-approximation for Equal Clustering with \(\ell _0\) (or \(\ell _1\)) distances for some \(\varepsilon _c > 0\). However, parameterized by k and \(\varepsilon \), standard techniques yield a \((1+\varepsilon )\)-approximation in FPT time. For the \(\ell _2\)-norm, there is a general framework for designing algorithms of this form for \(k\)-Median with additional constraints on cluster sizes, introduced by Ding and Xu [18]. The best-known improvements by Bhattacharya et al. [19] achieve a running time of \(2^{\widetilde{\mathcal {O}}(k/\varepsilon ^{\mathcal {O}(1)})} n^{\mathcal {O}(1)}d\) in the case of Equal Clustering, where \(\widetilde{\mathcal {O}}\) hides polylogarithmic factors. In another line of work, FPT-time approximation is achieved via constructing small-sized coresets of the input; the work [20] guarantees an \(\varepsilon \)-coreset for Equal Clustering (in the \(\ell _2\)-norm) of size \((k d\log n/\varepsilon )^{\mathcal {O}(1)}\), and consequently a \((1 + \varepsilon )\)-approximation algorithm with running time \(2^{\widetilde{\mathcal {O}}(k/\varepsilon ^{\mathcal {O}(1)})} (nd)^{\mathcal {O}(1)}\).

Moreover, specifically for Equal Clustering, simple \((1 + \varepsilon )\)-approximations with similar running time can be designed directly via sampling. A seminal work of Kumar et al. [21] achieves a \((1 + \varepsilon )\)-approximation for \(k\)-Median (in the \(\ell _2\)-norm) with running time \(2^{\widetilde{\mathcal {O}}(k/\varepsilon ^{\mathcal {O}(1)})} nd\). The algorithm proceeds as follows. First, take a small uniform sample of the input points, and by guessing ensure that the sample is taken only from the largest cluster. Second, estimate the optimal center of this cluster from the sample. In the case of \(k\)-Median, Theorem 5.4 of [21] guarantees that from a sample of size \((1/\varepsilon )^{\mathcal {O}(1)}\) one can compute in time \(2^{(1/\varepsilon )^{\mathcal {O}(1)}} d\) a set of candidate centers such that at least one of them provides a \((1 + \varepsilon )\)-approximation to the cost of the cluster. Finally, “prune” the set of points so that the next largest cluster is at least an \(\Omega (1/k)\) fraction of the remaining points, and continue the same process with one less cluster. One can observe that in the case of Equal Clustering, a simplification of the above algorithm suffices: one does not need to perform the “pruning” step, as we are only interested in clusterings where all the clusters have size exactly n/k. Thus, \((1/\varepsilon )^{\mathcal {O}(1)}\)-sized uniform samples from each of the clusters can be computed immediately in total time \(2^{\widetilde{\mathcal {O}}(k/\varepsilon ^{\mathcal {O}(1)})} nd\). This achieves a \((1 + \varepsilon )\)-approximation for Equal Clustering with the same running time as the algorithm of Kumar et al. In fact, the same procedure works for the \(\ell _0\)-norm as well, where for estimating the cluster center it suffices to compute the optimal center of a sample of size \(\mathcal {O}(1/\varepsilon ^2)\), as proven by Alon and Sudakov [22]. Thus, in terms of FPT approximation, Equal Clustering is surprisingly “simpler” than its unconstrained variant \(k\)-Median; however, our hardness result of Theorem 5 shows that the problems are similarly hard in terms of polynomial-time approximation.
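To illustrate the center-estimation step in the \(\ell _0\) case, here is a hedged sketch: for binary points, an optimal Hamming center of a sample is its coordinate-wise majority vector, and a sample of size \(\mathcal {O}(1/\varepsilon ^2)\) suffices by [22]. The function name and the sample-size constant are ours, chosen for illustration; this is not the algorithm of [21, 22] verbatim.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_center_l0(cluster: np.ndarray, eps: float) -> np.ndarray:
    """Candidate l0 (Hamming) center of a cluster of binary points.

    Takes a uniform sample of size about 1/eps^2 and returns its
    coordinate-wise majority vector, mirroring the sampling idea
    discussed above (an illustrative sketch only).
    """
    m = min(len(cluster), max(1, int(np.ceil(1.0 / eps ** 2))))
    sample = cluster[rng.choice(len(cluster), size=m, replace=False)]
    return (2 * sample.sum(axis=0) >= m).astype(int)  # majority per coordinate
```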

Related work

Since the work of Har-Peled and Mazumdar [12] for k-means and k-median clustering, designing small coresets for clustering has become a flourishing research direction. For these problems, after a series of interesting works, the best-known upper bound on coreset size in general metric space is \(\mathcal {O}((k\log n)/\varepsilon ^2)\) [23] and the lower bound is known to be \(\Omega (({k}\log n)/{\varepsilon })\) [24]. For the Euclidean space (i.e., the \(\ell _2\)-norm) of dimension d, it is possible to construct coresets of size \((k/\varepsilon )^{\mathcal {O}(1)}\) [25, 26]. Remarkably, the size of the coresets in this case does not depend on n and d. For Equal Clustering, the best known coreset size of \((kd\log n/\varepsilon )^{O(1)}\) (for \(p=2\)) follows from coresets for the more general capacitated clustering problem [11, 20].

Clustering is undoubtedly one of the most common procedures in unsupervised machine learning. We refer to the book [27] for an overview of clustering. Approximating k-median and k-means has been one of the most interesting challenges in the area of approximation algorithms and has led to a plethora of work [28,29,30,31,32]. For k-median, the best known polynomial-time approximation factor is 2.675 [31], and for k-means, it is 6.356 [32]. Moreover, if one is allowed to use FPT time parameterized by k, then these two factors can be improved to \(\approx (1+2/e)\le 1.736\) and \(\approx (1+8/e)\le 3.944\), respectively [10]. Assuming the Gap-ETH [33], these factors are indeed tight [10]. PTASes are known for the Euclidean versions of k-median and k-means when the dimension is a constant [34,35,36,37,38]. Cohen-Addad et al. [39] obtained a lower bound ruling out \(n^{o(k)}\)-time algorithms for k-median when \(d\ge 4\). When the dimension d is arbitrary, one can obtain a \((1+\varepsilon )\)-approximation in FPT(k) time where the dependence on n and d is only linear [21]. Feng et al. [40] gave a unified framework for designing FPT approximation algorithms for clustering problems.

Equal Clustering belongs to a wide class of clustering problems with constraints on the sizes of the clusters. In many applications of clustering, constraints come naturally [41]. In particular, there is a rich literature on approximation algorithms for various versions of capacitated clustering. While no polynomial-time O(1)-approximation is known for the capacitated versions of k-median and k-means in general metric spaces, bicriteria constant-factor approximations violating either the capacity constraints or the constraint on the number of clusters by an O(1) factor can be obtained [28, 42,43,44,45,46,47]. Cohen-Addad and Li [11] designed FPT \(\approx 3\)- and \(\approx 9\)-approximations with parameter k for the capacitated versions of k-median and k-means, respectively. For these problems in the Euclidean plane, Cohen-Addad [48] obtained a true PTAS. Moreover, for higher-dimensional spaces (i.e., \(d\ge 3\)), he designed a \((1+\varepsilon )\)-approximation that runs in time \(n^{{(\log n/\varepsilon )}^{O(d)}}\) [48]. Being a restricted version of capacitated clustering, Equal Clustering inherits all the approximation results mentioned above.

Our approach

We briefly sketch the main ideas behind the construction of our lossy kernel for Parameterized Equal Clustering. The lossy kernel’s main ingredients are (a) a polynomial-time algorithm based on computing a minimum-weight perfect matching in bipartite graphs, (b) preprocessing rules reducing the size and dimension of the instance, and (c) a greedy algorithm. Each of the steps is relatively simple and easy to implement. However, proving that these steps result in a lossy kernel with the required properties is not easy.

Recall that for a given budget B, we are looking for a clustering of a collection of points \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) into k clusters of the same size minimizing the cost. We also assume that the cost is \(B+1\) if the points do not admit a clustering of cost at most B. Informally, we are only interested in an optimal clustering when its cost does not exceed the budget. First, if the cluster size \(s=\frac{n}{k}\) is sufficiently large (with respect to the budget), we can construct an optimal clustering in polynomial time. More precisely, we prove that if \(s\ge 4B+1\), then the clusters’ medians can be selected from \(\textbf{X}\). Moreover, we show how to identify the (potential) medians in polynomial time. In this case, constructing an optimal k-clustering can be reduced to the classical problem of computing a minimum-weight perfect matching in a bipartite graph.

The case of cluster size \(s\le 4B\) is different. Here we apply a set of reduction rules, each running in polynomial time. After their exhaustive application, we either correctly conclude that the considered instance has no clustering of cost at most B or construct an equivalent reduced instance. In the equivalent instance, the dimension is reduced to \(\mathcal {O}(kB^{p+1})\), while the absolute values of the coordinates of the points are in \(\mathcal {O}(kB^2)\).

Finally, we apply the only approximate reduction to the reduced instance. The approximation procedure is greedy: whenever there are s equal points, we form a cluster out of them. For the points remaining after the exhaustive application of the greedy procedure, we conclude that either there is no clustering of cost at most B or the number of points is \(\mathcal {O}(B^2)\). This construction leads us to the lossy kernel. However, the greedy selection of clusters composed of equal points may not be optimal. In particular, the reductions used to obtain our algorithmic lower bounds in Sections 4 and 5 exploit the property that it may be beneficial to split a block of s equal points between distinct clusters.

Nevertheless, the greedy clustering of equal points leads to a 2-approximation. The proof of this fact requires some work. We bound the cost of the clustering obtained from a given optimal clustering by swapping some points to form clusters composed of equal points, and then upper bound the obtained value by the cost of the optimum clustering. For the last step, we introduce an auxiliary clustering problem formulated as a min-cost flow problem. This reduction allows us to evaluate the cost and obtain the required upper bound.

Organization of the paper

The remaining part of the paper is organized as follows. In Section 2, we introduce basic notation and show some properties of clusterings. In Section 3, we show our main result that Parameterized Equal Clustering admits a lossy kernel. In Section 4, we complement this result by proving that it is unlikely that Decision Equal Clustering admits an (exact) kernel of a polynomial size when parameterized by B. Still, the problem has a polynomial kernel when parameterized by B and k. In Section 5, we show that Equal Clustering is APX-hard. We conclude in Section 6 by stating some open problems.

2 Preliminaries

In this section, we give basic definitions and introduce the notation used throughout the paper. We also state some useful auxiliary results.

Parameterized complexity and kernelization

We refer to the recent books [8, 49] for a formal introduction to the area. Here we only define the notions used in our paper.

Formally, a parameterized problem \(\Pi \) is a subset of \(\Sigma ^*\times \mathbb {N}\), where \(\Sigma \) is a finite alphabet. Thus, an instance of \(\Pi \) is a pair \((I,k)\), where \(I\in \Sigma ^*\) and k is a nonnegative integer called a parameter. A parameterized problem \(\Pi \) is said to be fixed-parameter tractable (\({{\,\mathrm{\textsf{FPT}}\,}}\)) if it can be solved in \(f(k)\cdot | I|^{\mathcal {O}(1)}\) time for some computable function \(f(\cdot )\).

A kernelization algorithm (or kernel) for a parameterized problem \(\Pi \) is an algorithm that, given an instance \((I,k)\) of \(\Pi \), in polynomial time produces an instance \((I',k')\) of \(\Pi \) such that

  (i) \((I,k)\in \Pi \) if and only if \((I',k')\in \Pi \), and

  (ii) \(| I'|+k'\le g(k)\) for a computable function \(g(\cdot )\).

The function \(g(\cdot )\) is called the size of a kernel; a kernel is polynomial if \(g(\cdot )\) is a polynomial. Every decidable \({{\,\mathrm{\textsf{FPT}}\,}}\) problem admits a kernel. However, it is unlikely that all \({{\,\mathrm{\textsf{FPT}}\,}}\) problems have polynomial kernels, and parameterized complexity theory provides tools for refuting the existence of polynomial kernels under reasonable complexity assumptions. The standard assumption here is that \({{\,\mathrm{\textsf{NP}}\,}}\not \subseteq {{\,\mathrm{\textsf{coNP}}\,}}/{{\,\mathrm{\textsf{poly}}\,}}\).

We also consider the parameterized analog of optimization problems. Since we only deal with minimization problems where the minimized value is nonnegative, we state the definitions only for optimization problems of this type. A parameterized minimization problem P is a computable function

$$\begin{aligned} P:\Sigma ^*\times \mathbb {N} \times \Sigma ^* \rightarrow \mathbb {R}_{\ge 0} \cup \{+\infty \}. \end{aligned}$$

The instances of a parameterized minimization problem P are pairs \((I,k) \in \Sigma ^* \times \mathbb {N}\), and a solution to \((I,k)\) is simply a string \(s \in \Sigma ^*\) such that \(| s| \le | I|+k\). Then the function \(P(\cdot ,\cdot ,\cdot )\) defines the value \(P(I,k,s)\) of a solution s to an instance \((I,k)\). The optimum value of an instance \((I,k)\) is

$$\begin{aligned} \textsf{Opt}_{P}(I,k)=\min _{s \in \Sigma ^* \text { s.t. } | s| \le | I|+k}P(I,k,s). \end{aligned}$$

A solution s is optimal if \(\textsf{Opt}_P(I,k)=P(I,k,s)\). A parameterized minimization problem P is said to be \({{\,\mathrm{\textsf{FPT}}\,}}\) if there is an algorithm that for each instance \((I,k)\) of P computes an optimal solution s in \(f(k)\cdot | I|^{\mathcal {O}(1)}\) time, where \(f(\cdot )\) is a computable function. Let \(\alpha \ge 1\) be a real number. An \({{\,\mathrm{\textsf{FPT}}\,}}\) \(\alpha \)-approximation algorithm for P is an algorithm that in \(f(k)\cdot | I|^{\mathcal {O}(1)}\) time computes a solution s for \((I,k)\) such that \(P(I,k,s)\le \alpha \cdot \textsf{Opt}_P(I,k)\), where \(f(\cdot )\) is a computable function.

It is useful for us to make some comments about defining \(P(\cdot ,\cdot ,\cdot )\) for the case when the considered problem is parameterized by the solution value. For simplicity, we do it informally and refer to [8] for details and explanations. If s is not a “feasible” solution to an instance \((I,k)\), then it is convenient to assume that \(P(I,k,s)=+\infty \). Otherwise, if s is “feasible” but its value is at least \(k+1\), we set \(P(I,k,s)=k+1\).

Lossy kernels

Finally, we define \(\alpha \)-approximate or lossy kernels for parameterized minimization problems. Informally, an \(\alpha \)-approximate kernel of size \(g(\cdot )\) is a polynomial-time algorithm that, given an instance \((I,k)\), outputs an instance \((I',k')\) such that \(| I'|+k' \le g(k)\) and any c-approximate solution \(s'\) to \((I',k')\) can be turned in polynomial time into a \((c \cdot \alpha )\)-approximate solution s to the original instance \((I,k)\). More precisely, let P be a parameterized minimization problem and let \(\alpha \ge 1\). An \(\alpha \)-approximate (or lossy) kernel for P is a pair of polynomial-time algorithms \(\mathcal {A}\) and \(\mathcal {A}'\) such that

  (i) given an instance \((I,k)\), \(\mathcal {A}\) (called a reduction algorithm) computes an instance \((I',k')\) with \(| I'|+k'\le g(k)\), where \(g(\cdot )\) is a computable function,

  (ii) the algorithm \(\mathcal {A}'\) (called a solution-lifting algorithm), given the initial instance \((I,k)\), the instance \((I',k')\) produced by \(\mathcal {A}\), and a solution \(s'\) to \((I',k')\), computes a solution s to \((I,k)\) such that

    $$\begin{aligned} \frac{P(I,k,s)}{\textsf{Opt}_P(I,k)}\le \alpha \cdot \frac{P(I',k',s')}{\textsf{Opt}_P(I',k')}. \end{aligned}$$

For simplicity, we assume here that \(\frac{P(I,k,s)}{\textsf{Opt}_P(I,k)}=1\) if \(\textsf{Opt}_P(I,k)=P(I,k,s)=0\) and \(\frac{P(I,k,s)}{\textsf{Opt}_P(I,k)}=+\infty \) if \(\textsf{Opt}_P(I,k)=0\) and \(P(I,k,s)>0\); the same assumption is used for \(\frac{P(I',k',s')}{\textsf{Opt}_P(I',k')}\). As with classical kernels, \(g(\cdot )\) is called the size of an approximate kernel, and an approximate kernel is polynomial if \(g(\cdot )\) is a polynomial.

Vectors and clusters

For a vector \(\textbf{x}\in \mathbb {R}^d\), we use \(\textbf{x}[i]\) to denote the i-th element of the vector for \(i\in \{1,\ldots ,d\}\). For a set of indices \(R \subseteq \{1,\ldots ,d\}\), \(\textbf{x}[R]\) denotes the vector of \(\mathbb {R}^{| R|}\) composed of the elements of \(\textbf{x}\) with indices in R, that is, if \(R=\{i_1,\ldots ,i_r\}\) with \(i_1<\ldots <i_r\) and \(\textbf{y}=\textbf{x}[R]\), then \(\textbf{y}[j]=\textbf{x}[i_j]\) for \(j\in \{1,\ldots ,r\}\). In our paper, we consider collections \(\textbf{X}\) of points of \(\mathbb {Z}^d\). We stress that some points of such a collection may be equal. However, to simplify notation, we treat equal points of \(\textbf{X}\) as distinct elements, assuming that the points are supplied with unique identifiers. By this convention, we often refer to (sub)collections of points as (sub)sets and apply the standard set notation.

Let X be a collection of points of \(\mathbb {Z}^d\). For a vector \(\textbf{c}\in \mathbb {R}^d\), we define the cost of X with respect to \(\textbf{c}\) as

$$\begin{aligned} \textsf{cost}_p(X,\textbf{c})=\sum _{\textbf{x}\in X}\Vert \textbf{c}-\textbf{x}\Vert _p. \end{aligned}$$

Slightly abusing notation we often refer to \(\textbf{c}\) as a (given) median of X. We say that \(\textbf{c}^*\in \mathbb {R}^d\) is an optimum median of X if

$$\begin{aligned} \textsf{cost}_p(X)=\textsf{cost}_p(X,\textbf{c}^*)=\min _{\textbf{c}\in \mathbb {R}^d}\textsf{cost}_p(X,\textbf{c}). \end{aligned}$$

Notice that the considered collections of points have integer coordinates but the coordinates of medians are not constrained to integers and may be real.

Let \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) be a collection of points of \(\mathbb {Z}^d\) and let k be a positive integer such that n is divisible by k. We say that a partition \(\{X_1,\ldots ,X_k\}\) of \(\textbf{X}\) is an equal k-clustering of \(\textbf{X}\) if \(| X_i|=\frac{n}{k}\) for all \(i \in \{1,\ldots ,k\}\). For an equal k-clustering \(\{X_1,\ldots ,X_k\}\) and given vectors \(\textbf{c}_1,\ldots ,\textbf{c}_k\), we define the cost of the clustering with respect to \(\textbf{c}_1,\ldots ,\textbf{c}_k\) as

$$\begin{aligned} \textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)=\sum _{i=1}^k\textsf{cost}_p(X_i,\textbf{c}_i). \end{aligned}$$

The cost of an equal k-clustering \(\{X_1,\ldots ,X_k\}\) is

$$\begin{aligned} \textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k), \end{aligned}$$

where \(\textbf{c}_1,\ldots ,\textbf{c}_k\) are optimum medians of \(X_1,\ldots ,X_k\), respectively. For an integer \(B\ge 0\),

$$\begin{aligned} \textsf{cost}_p^B(X_1,\ldots ,X_k)= {\left\{ \begin{array}{ll} \textsf{cost}_p(X_1,\ldots ,X_k)&{}\text{ if } \textsf{cost}_p(X_1,\ldots ,X_k)\le B,\\ B+1&{}\text{ otherwise. } \end{array}\right. } \end{aligned}$$

We define

$$\begin{aligned} \textsf{Opt}(\textbf{X},k)=\min \{&\textsf{cost}_p(X_1,\ldots ,X_k)\mid \\ {}&\{X_1,\ldots ,X_k\}\text { is an equal }k\text {-clustering of }\textbf{X}\}, \end{aligned}$$

and given a nonnegative integer B,

$$\begin{aligned} \textsf{Opt}(\textbf{X},k,B)=\min \{&\textsf{cost}_p^B(X_1,\ldots ,X_k)\mid \\ {}&\{X_1,\ldots ,X_k\}\text { is an equal }k\text {-clustering of }\textbf{X}\}. \end{aligned}$$

We conclude this section with the observation that, given vectors \(\textbf{c}_1,\ldots ,\textbf{c}_k\in \mathbb {R}^d\), we can find an equal k-clustering \(\{X_1,\ldots ,X_k\}\) that minimizes \(\sum _{i=1}^k\textsf{cost}_p(X_i,\textbf{c}_i)\) via a reduction to the classical Minimum Weight Perfect Matching problem on bipartite graphs, which is well known to be solvable in polynomial time. Recall that a matching M of a graph G is a set of edges without common vertices. A matching M saturates a vertex v if M has an edge incident to v. A matching M is perfect if every vertex of G is saturated. The task of Minimum Weight Perfect Matching is, given a bipartite graph G and a weight function \(w:E(G)\rightarrow \mathbb {Z}_{\ge 0}\), to find a perfect matching M (if it exists) such that its weight \(w(M)=\sum _{e\in M}w(e)\) is minimum. The proof of the following lemma essentially repeats the proof of Lemma 1 of [50], but we provide it here for completeness.

Lemma 1

Let \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) be a collection of points of \(\mathbb {Z}^d\) and k be a positive integer such that n is divisible by k. Let also \(\textbf{c}_1,\ldots ,\textbf{c}_k\in \mathbb {R}^d\). Then an equal k-clustering \(\{X_1,\ldots ,X_k\}\) of minimum \(\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\) can be found in polynomial time.

Proof

Given \(\textbf{X}\) and \(\textbf{c}_1,\ldots ,\textbf{c}_k\), we construct the bipartite graph G as follows. Let \(s=\frac{n}{k}\).

  • For each \(i\in \{1,\ldots ,k\}\), we construct a set of s vertices \(V_i=\{v_1^i,\ldots ,v_s^i\}\) corresponding to the median \(\textbf{c}_i\). Denote \(V=\bigcup _{i=1}^kV_i\).

  • For each \(i\in \{1,\ldots ,n\}\), construct a vertex \(u_i\) corresponding to the vector \(\textbf{x}_i\) of \(\textbf{X}\) and make \(u_i\) adjacent to the vertices of V. Denote \(U=\{u_1,\ldots ,u_n\}\).

We define the edge weights as follows.

  • For every \(i\in \{1,\ldots ,n\}\) and \(j\in \{1,\ldots ,k\}\), set \(w(u_iv_h^j)=\Vert \textbf{c}_j-\textbf{x}_i\Vert _p\) for \(h\in \{1,\ldots ,s\}\), that is, the weight of all edges joining \(u_i\) corresponding to \(\textbf{x}_i\) with the vertices of \(V_j\) corresponding to the median \(\textbf{c}_j\) are the same and coincide with the \(\ell _p\) distance between \(\textbf{x}_i\) and \(\textbf{c}_j\).

Observe that G is a complete bipartite graph, where U and V form the bipartition. Note also that \(| U| =| V|=n\).

Notice that we have the following one-to-one correspondence between perfect matchings of G and equal k-clusterings of \(\textbf{X}\). In the forward direction, assume that M is a perfect matching of G. We construct the clustering \(\{X_1,\ldots ,X_k\}\) as follows. For every \(h \in \{1,\ldots ,n\}\), \(u_h\) is saturated by M and, therefore, there are \(i_h\in \{1,\ldots ,k\}\) and \(j_h \in \{1,\ldots ,s\}\) such that the edge \(u_hv^{i_h}_{j_h}\in M\). We cluster the vectors of \(\textbf{X}\) according to M. Formally, we place \(\textbf{x}_h\) in \(X_{i_h}\) for each \(h\in \{1,\ldots ,n\}\). Clearly, \(\{X_1,\ldots ,X_k\}\) is a partition of \(\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) and \(| X_i|=s\) for all \(i\in \{1,\ldots ,k\}\). By the definition of the weights of the edges of G, \(\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)=w(M)\). For the reverse direction, consider an equal k-clustering \(\{X_1,\ldots ,X_k\}\) of \(\textbf{X}\). Let \(i \in \{1,\ldots ,k\}\). Consider the cluster \(X_i\) and assume that \(X_i=\{\textbf{x}_{j_1},\ldots ,\textbf{x}_{j_s}\}\). Denote \(M_i=\{u_{j_1}v_1^i,\ldots ,u_{j_{s}}v_s^i\}\). Clearly, \(M_i\) is a matching saturating the vertices of \(V_i\). We construct \(M_i\) for every \(i\in \{1,\ldots ,k\}\) and set \(M=\bigcup _{i=1}^{k}M_i\). Since \(\{X_1,\ldots ,X_k\}\) is a partition of \(\textbf{X}\), M is a matching saturating every vertex of U. By the definition of the weights of the edges, \(w(M)=\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\). Thus, finding an equal k-clustering \(\{X_1,\ldots ,X_k\}\) that minimizes \(\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\) is equivalent to computing a perfect matching of minimum weight in G. Then, because a perfect matching of minimum weight in G can be found in polynomial time [51, 52], an equal k-clustering of minimum cost can be found in polynomial time. This completes the proof of the lemma.\(\square \)
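In practice, the reduction in the proof of Lemma 1 can be carried out with any assignment solver: duplicating each median into s vertex copies turns the minimum-weight perfect matching into a square assignment problem. The following sketch (our helper name; SciPy assumed; \(p\ge 1\)) implements exactly this construction.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_to_medians(X: np.ndarray, centers: np.ndarray, p: int = 1):
    """Optimal equal k-clustering for FIXED medians (Lemma 1 sketch).

    Each median c_j is duplicated into s = n/k copies, and the weight
    of the edge between x_i and any copy of c_j is ||x_i - c_j||_p.
    """
    n, k = len(X), len(centers)
    s = n // k
    # dist[i, j] = ||x_i - c_j||_p  (assumes p >= 1)
    dist = (np.abs(X[:, None, :] - centers[None, :, :]) ** p).sum(axis=2) ** (1.0 / p)
    cost = np.repeat(dist, s, axis=1)          # s consecutive copies per median
    rows, cols = linear_sum_assignment(cost)   # minimum-weight perfect matching
    labels = cols // s                         # copy index -> median index
    return labels, cost[rows, cols].sum()
```

Since every median has exactly s copies, the returned labels assign exactly n/k points to each cluster.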

3 Lossy Kernel

In this section, we prove Theorem 1 by establishing a 2-approximate polynomial kernel for Parameterized Equal Clustering. In Subsection 3.1, we provide some auxiliary results, and in Subsection 3.2, we prove the main results. Throughout this section, we assume that \(p\ge 0\) defining the \(\ell _p\)-norm is a fixed constant.

3.1 Technical Lemmata

We start by proving the following results about the medians of clusters when their size is sufficiently large with respect to the budget.

Lemma 2

Let \(\{X_1,\ldots ,X_k\}\) be an equal k-clustering of a collection of points \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) of \(\mathbb {Z}^d\) of cost at most \(B\in \mathbb {Z}_{\ge 0}\), and let \(s=\frac{n}{k}\). Then each cluster \(X_i\) for \(i\in \{1,\ldots ,k\}\) contains at least \(s-2B\) equal points.

Proof

The claim is trivial if \(s \le 2B+1\). Let \(s \ge 2B+2\). Assume to the contrary that a cluster \(X_i\) has at most \(s-2B-1\) equal points for some \(i \in \{1,\ldots ,k\}\). Let \(\textbf{c}_1,\ldots ,\textbf{c}_k\) be optimum medians for the clusters \(X_1,\ldots ,X_k\), respectively. Then we have that \(\textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\).

Let \(\textbf{x}_{i_0} \in X_i\) be a point at the minimum distance from \(\textbf{c}_i\). Since there are at most \(s-2B-1\) points in \(X_i\) that are equal to \(\textbf{x}_{i_0}\), there are \(t=2B+1\) points \(\textbf{x}_{i_1},\ldots ,\textbf{x}_{i_t} \in X_i\) distinct from \(\textbf{x}_{i_0}\). Observe that

$$\begin{aligned} {} \sum _{\textbf{x}_h \in X_i}\Vert \textbf{c}_i-\textbf{x}_h\Vert _p \ge \sum _{j =0}^t\Vert \textbf{c}_i-\textbf{x}_{i_j}\Vert _p\ge \sum _{j=1}^{t}\Vert \textbf{c}_i-\textbf{x}_{i_j}\Vert _p. \end{aligned}$$
(1)

Because the points have integer coordinates and by the triangle inequality,

$$\begin{aligned} {} 1 \le \Vert \textbf{x}_{i_0}-\textbf{x}_{i_j}\Vert _p \le \Vert \textbf{x}_{i_0}-\textbf{c}_i\Vert _p+\Vert \textbf{x}_{i_j}-\textbf{c}_{i}\Vert _p \end{aligned}$$
(2)

for every \(j\in \{1,\ldots ,t\}\). Since \(\textbf{x}_{i_0}\) is a point of \(X_i\) at minimum distance from \(\textbf{c}_i\),

$$\begin{aligned} {} \Vert \textbf{x}_{i_0}-\textbf{c}_i\Vert _p+\Vert \textbf{x}_{i_j}-\textbf{c}_{i}\Vert _p \le 2\Vert \textbf{x}_{i_j}-\textbf{c}_{i}\Vert _p. \end{aligned}$$
(3)

From (2) and (3), we get \(\Vert \textbf{x}_{i_j}-\textbf{c}_{i}\Vert _p \ge \frac{1}{2}\) for \(j\in \{1,\ldots ,t\}\). Thus, from (1), we get

$$\begin{aligned}{} \sum _{\textbf{x}_h \in X_i}\Vert \textbf{c}_i-\textbf{x}_h\Vert _p \ge \sum _{j =1}^t\Vert \textbf{c}_i-\textbf{x}_{i_j}\Vert _p \ge \frac{1}{2}t =\frac{1}{2}(2B+1)> B, \end{aligned}$$

which contradicts \(\textsf{cost}_p(X_1,\ldots ,X_k)\le B\). This completes the proof.\(\square \)

Lemma 3

Let \(\{X_1,\ldots ,X_k\}\) be an equal k-clustering of a collection of points \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) of \(\mathbb {Z}^d\) of cost at most \(B\in \mathbb {Z}_{\ge 0}\), and let \(s=\frac{n}{k}\ge 4B+1\). Let also \(\textbf{c}_1,\ldots ,\textbf{c}_k \in \mathbb {R}^d\) be optimum medians of \(X_1,\ldots ,X_k\), respectively. Then for every \(i \in \{1,\ldots ,k\}\), \(\textbf{c}_i=\textbf{x}_{j}\) for some \(\textbf{x}_j \in X_i\) such that \(X_i\) contains at least \(s-2B\) points that are equal to \(\textbf{x}_j\), and the choice of \(\textbf{c}_i\) is unique.

Proof

Consider a cluster \(X_i\) with the median \(\textbf{c}_i\) for arbitrary \(i \in \{1,\ldots ,k\}\). Since \(s \ge 4B+1\), by Lemma 2 there is \(\textbf{x}_j\in X_i\) such that \(X_i\) contains at least \(s-2B\) points that are equal to \(\textbf{x}_j\). We show that \(\textbf{c}_i=\textbf{x}_j\). Notice that the choice of the set of at least \(s-2B\) equal points is unique because \(X_i\) can contain at most \(s-(s-2B)=2B\) points distinct from \(\textbf{x}_j\), and since \(s\ge 4B+1\), \(s-2B\ge 2B+1>2B\).

The proof is by contradiction. Assume that \(\textbf{c}_i \ne \textbf{x}_j\). Let \(S\subseteq \{1,\ldots ,n\}\) be the set of indices of the points \(\textbf{x}_h\in X_i\) that coincide with \(\textbf{x}_j\), and denote by T the set of indices of the remaining points in \(X_i\). We know that \(| T| \le 2B <| S| \) because \(s \ge 4B+1\) and \(| S| \ge 2B+1\). Then

$$\begin{aligned} \begin{aligned} \textsf{cost}_p(X_i)=&\textsf{cost}_p(X_i,\textbf{c}_i)=\sum _{h\in S\cup T}\Vert \textbf{c}_i-\textbf{x}_h\Vert _p\\=&\sum _{h\in S}\Vert \textbf{c}_i-\textbf{x}_h\Vert _p+\sum _{h \in T}\Vert \textbf{c}_i-\textbf{x}_{h}\Vert _p\\ =&(| S|-| T|)\Vert \textbf{c}_i-\textbf{x}_j\Vert _p +\sum _{h \in T}(\Vert \textbf{c}_i-\textbf{x}_j\Vert _p+\Vert \textbf{c}_i-\textbf{x}_{h}\Vert _p). \end{aligned} \end{aligned}$$
(4)

Using the triangle inequality, we get

$$\begin{aligned} {} \begin{aligned} (| S| -| T|)\Vert \textbf{c}_i-\textbf{x}_j\Vert _p +&\sum _{h\in T}(\Vert \textbf{c}_i-\textbf{x}_j\Vert _p+\Vert \textbf{c}_i-\textbf{x}_{h}\Vert _p)\\ \ge&(| S|-| T|)\Vert \textbf{c}_i-\textbf{x}_j\Vert _p+\sum _{h\in T}\Vert \textbf{x}_j-\textbf{x}_{h}\Vert _p. \end{aligned} \end{aligned}$$
(5)

We know that \((| S| -| T|)\Vert \textbf{c}_i-\textbf{x}_j\Vert _p>0\) because \(| S| >| T|\) and \(\textbf{c}_i \ne \textbf{x}_j\). Then by (5), we have

$$\begin{aligned} (| S|-| T|)\Vert \textbf{c}_i-\textbf{x}_j\Vert _p+\sum _{h \in T}\Vert \textbf{x}_j-\textbf{x}_{h}\Vert _p >\sum _{h\in T}\Vert \textbf{x}_j-\textbf{x}_{h}\Vert _p. \end{aligned}$$
(6)

Combining (4)–(6), we conclude that \(\textsf{cost}_p(X_i)>\sum _{h\in T}\Vert \textbf{x}_j-\textbf{x}_{h}\Vert _p\). Let \(\textbf{c}'_i=\textbf{x}_j\). Then

$$\begin{aligned} \textsf{cost}_p(X_i,\textbf{c}_i')=&\sum _{h\in S\cup T}\Vert \textbf{c}'_i-\textbf{x}_h\Vert _p=\sum _{h\in S}\Vert \textbf{c}'_i-\textbf{x}_h\Vert _p+\sum _{h \in T}\Vert \textbf{c}'_i-\textbf{x}_{h}\Vert _p\\ =&\sum _{h \in T}\Vert \textbf{c}'_i-\textbf{x}_{h}\Vert _p<\textsf{cost}_p(X_i) \end{aligned}$$

which contradicts that \(\textbf{c}_i\) is an optimum median for \(X_i\). This concludes the proof.\(\square \)

We use the following lemma to identify medians.

Lemma 4

Let \(\{X_1,\ldots ,X_k\}\) be an equal k-clustering of a collection of points \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) of \(\mathbb {Z}^d\) of cost at most \(B\in \mathbb {Z}_{\ge 0}\), and let \(s=\frac{n}{k}\ge 4B+1\). Suppose that \(Y\subseteq \textbf{X}\) is a collection of at least \(B+1\) equal points of \(\textbf{X}\). Then there is \(i\in \{1,\ldots ,k\}\) such that an optimum median of \(X_i\) coincides with \(\textbf{x}_j\) for \(\textbf{x}_j\in Y\).

Proof

Let \(\textbf{c}_1,\ldots ,\textbf{c}_k\) be optimum medians of \(X_1,\ldots ,X_k\), respectively. Since \(s\ge 4B+1\), by Lemma 3, for every \(i \in \{1,\ldots ,k\}\), \(\textbf{c}_i\) coincides with some element \(\textbf{x}_{h}\) of the cluster \(X_i\). For the sake of contradiction, assume that \(\textbf{c}_1,\ldots ,\textbf{c}_k\) are all distinct from \(\textbf{x}_j \in Y\). This means that \(\Vert \textbf{x}_j-\textbf{c}_i\Vert _p\ge 1\) for every \(i\in \{1,\ldots ,k\}\) because the coordinates of the points of \(\textbf{X}\) are integers. Then

$$\begin{aligned} \textsf{cost}_p(X_1,\ldots ,X_k)=&\sum _{i=1}^k\textsf{cost}_p(X_i,\textbf{c}_i) \ge \sum _{i=1}^k\sum _{\textbf{x}_h \in Y \cap X_i}\Vert \textbf{c}_i-\textbf{x}_h\Vert _p\ge \sum _{i=1}^k | X_i\cap Y |\\=&| Y|\ge B+1 >B, \end{aligned}$$

contradicting that \(\textsf{cost}_p(X_1,\ldots ,X_k)\le B\). This proves the lemma. \(\square \)

We use our next lemma to upper bound the clustering cost if we collect \(s=\frac{n}{k}\) equal points in the same cluster.

Lemma 5

Let \(\{X_1,\ldots ,X_k\}\) be an equal k-clustering of a collection of points \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) of \(\mathbb {Z}^d\), and let \(\textbf{c}_1,\ldots ,\textbf{c}_k\in \mathbb {R}^d\). Suppose that S is a collection of \(s=\frac{n}{k}\) equal points of \(\textbf{X}\) and \(\textbf{x}_j\in S\). Then there is an equal k-clustering \(\{X_1',\ldots ,X_k'\}\) of \(\textbf{X}\) with \(X_1'=S\) such that

$$\textsf{cost}_p(X_1',\ldots ,X_k',\textbf{c}_1',\ldots ,\textbf{c}_k')\le \textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)+s\Vert \textbf{c}_1-\textbf{x}_j\Vert _p,$$

where \(\textbf{c}_1'=\textbf{x}_j\) and \(\textbf{c}_h'=\textbf{c}_h\) for \(h\in \{2,\ldots ,k\}\).

Proof

The claim is trivial if \(S=X_1\) because we can set \(X_i'=X_i\) for \(i\in \{1,\ldots ,k\}\). Assume that this is not the case and there are elements of S that are not in \(X_1\); denote these elements by \(\textbf{x}_{i_1},\ldots ,\textbf{x}_{i_t}\). We assume that \(\textbf{x}_{i_h} \in X_{i_h'}\) with \(i_h' \ge 2\) for \(h \in \{1,\ldots ,t\}\). Because \(| S|=s\), there are \(\textbf{x}_{j_1},\ldots ,\textbf{x}_{j_t} \in X_1\) such that \(\textbf{x}_{j_1},\ldots ,\textbf{x}_{j_t} \notin S\). We construct \(X_1',\ldots ,X_k'\) from \(X_1,\ldots ,X_k\) by exchanging the points \(\textbf{x}_{j_h}\) and \(\textbf{x}_{i_h}\) between \(X_1\) and \(X_{i_h'}\) for every \(h \in \{1,\ldots ,t\}\). Notice that \(| X_1'|=\cdots =| X_k'|\) because the exchanges do not modify the sizes of the clusters. Thus, \(\{X_1',\ldots ,X_k'\}\) is an equal k-clustering. We claim that \(\{X_1',\ldots ,X_k'\}\) satisfies the required property.

We have that

$$\begin{aligned} \begin{aligned}&\textsf{cost}_p(X'_1,\ldots ,X'_k,\textbf{c}'_1,\ldots ,\textbf{c}'_k) - \textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\\&=\sum _{i=1}^{k}\sum _{\textbf{x}_h \in X'_i}\Vert \textbf{x}_h-\textbf{c}'_i\Vert _p-\sum _{i=1}^{k}\sum _{\textbf{x}_h \in X_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p\\&= \sum _{\textbf{x}_h \in X'_1}\Vert \textbf{x}_h-\textbf{c}'_1\Vert _p-\sum _{\textbf{x}_h \in X_1}\Vert \textbf{x}_h-\textbf{c}_1\Vert _p\\ {}&~~~+\sum _{i=2}^{k}\big (\sum _{\textbf{x}_h \in X'_i}\Vert \textbf{x}_h-\textbf{c}'_i\Vert _p -\sum _{\textbf{x}_h \in X_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p\big ). \end{aligned} \end{aligned}$$
(7)

Note that \(\sum _{\textbf{x}_h \in X'_1}\Vert \textbf{x}_h-\textbf{c}'_1\Vert _p=0\) and \(\sum _{\textbf{x}_h \in X_1}\Vert \textbf{x}_h-\textbf{c}_1\Vert _p \ge \sum _{h=1}^{t}\Vert \textbf{x}_{j_h}-\textbf{c}_1\Vert _p\). Also by the construction of \(X'_1,\ldots ,X'_k\) and because \(\textbf{c}_{i}=\textbf{c}'_{i}\) for \(i \in \{2,\ldots ,k\}\), we have that

$$\begin{aligned} \sum _{i=2}^{k}\big (\sum _{\textbf{x}_h \in X'_i}\Vert \textbf{x}_h-\textbf{c}'_i\Vert _p -\sum _{\textbf{x}_h \in X_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p\big )=&\sum _{h=1}^{t}\Vert \textbf{x}_{j_h}-\textbf{c}'_{i_h'}\Vert _p -\sum _{h=1}^t\Vert \textbf{x}_{i_h}-\textbf{c}_{i_h'}\Vert _p\\ =&\sum _{h=1}^{t}\Vert \textbf{x}_{j_h}-\textbf{c}_{i_h'}\Vert _p -\sum _{h=1}^t\Vert \textbf{x}_{i_h}-\textbf{c}_{i_h'}\Vert _p. \end{aligned}$$

Then extending (7) and applying the triangle inequality twice, we obtain that

$$\begin{aligned}&\textsf{cost}_p(X'_1,\ldots ,X'_k,\textbf{c}'_1,\ldots ,\textbf{c}'_k) - \textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\\&\le - \sum _{h=1}^{t}\Vert \textbf{x}_{j_h}-\textbf{c}_1\Vert _p+\sum _{h=1}^{t}\Vert \textbf{x}_{j_h}-\textbf{c}_{i_h'}\Vert _p -\sum _{h=1}^{t}\Vert \textbf{x}_{i_h}-\textbf{c}_{i_h'}\Vert _p \\&= \sum _{h=1}^{t}\big (-\Vert \textbf{x}_{j_h}-\textbf{c}_1\Vert _p+\Vert \textbf{x}_{j_h}-\textbf{c}_{i_h'}\Vert _p-\Vert \textbf{x}_{i_h}-\textbf{c}_{i_h'}\Vert _p\big )\\&\le \sum _{h=1}^{t}\big (\Vert \textbf{c}_1-\textbf{c}_{i_h'}\Vert _p-\Vert \textbf{x}_{i_h}-\textbf{c}_{i_h'}\Vert _p\big ) \\&\le \sum _{h=1}^t\Vert \textbf{x}_{i_h}-\textbf{c}_1\Vert _p = t\Vert \textbf{x}_j-\textbf{c}_1\Vert _p \le s\Vert \textbf{x}_j-\textbf{c}_1\Vert _p \end{aligned}$$

as required by the lemma.\(\square \)

Our next lemma shows that we can solve Parameterized Equal Clustering in polynomial time if the cluster size is sufficiently large with respect to the budget.

Lemma 6

There is a polynomial-time algorithm that, given a collection \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) of n points of \(\mathbb {Z}^d\), a positive integer k such that n is divisible by k, and a nonnegative integer B such that \(\frac{n}{k}\ge 4B+1\), either computes \(\textsf{Opt}(\textbf{X},k)\le B\) and produces an equal k-clustering of minimum cost, or correctly concludes that \(\textsf{Opt}(\textbf{X},k)>B\).

Proof

Let \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) be a collection of n points of \(\mathbb {Z}^d\) and let k be a positive integer such that n is divisible by k, and suppose that \(s=\frac{n}{k}\ge 4B+1\) for a nonnegative integer B.

First, we exhaustively apply the following reduction rule.

Reduction Rule 1

If \(\textbf{X}\) contains a collection of s equal points S, then set \(\textbf{X}:=\textbf{X}\setminus S\) and \(k:=k-1\).

To argue that the rule is safe, let \(\textbf{X}'=\textbf{X}\setminus S\), where S is a collection of s equal points of \(\textbf{X}\), and let \(k'=k-1\). Clearly, \(\textbf{X}'\) contains \(n'=n-s\) points and \(\frac{n'}{k'}=s\). If \(\{X_1',\ldots ,X_{k'}'\}\) is an equal \(k'\)-clustering of \(\textbf{X}'\), then \(\{S,X_1',\ldots ,X_{k'}'\}\) is an equal k-clustering of \(\textbf{X}\). Note that \(\textsf{cost}_p(S)=0\) because the elements of S are the same. Then \(\textsf{cost}_p(S,X_1',\ldots ,X_{k'}')=\textsf{cost}_p(X_1',\ldots ,X_{k'}')\). Therefore, \(\textsf{Opt}(\textbf{X},k)\le \textsf{Opt}(\textbf{X}',k')\). We show that if \(\textsf{Opt}(\textbf{X},k)\le B\), then \(\textsf{Opt}(\textbf{X},k)\ge \textsf{Opt}(\textbf{X}',k')\).

Suppose that \(\{X_1,\ldots ,X_k\}\) is an equal k-clustering of \(\textbf{X}\) with \(\textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{Opt}(\textbf{X},k)\le B\). Denote by \(\textbf{c}_1,\ldots ,\textbf{c}_k\) optimum medians of \(X_1,\ldots ,X_k\), respectively. Because \(| S| =s\ge 4B+1\ge B+1\), by Lemma 4 there is a cluster whose optimum median is \(\textbf{x}_j\) for some \(\textbf{x}_j\in S\). We assume without loss of generality that \(X_1\) is such a cluster and \(\textbf{c}_1=\textbf{x}_j\). By Lemma 5, there is a k-clustering \(\{S,X_2',\ldots ,X_{k}'\}\) of \(\textbf{X}\) such that \(\textsf{cost}_p(S,X_2',\ldots ,X_k',\textbf{c}_1',\ldots ,\textbf{c}_k')\le \textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)+s\Vert \textbf{c}_1-\textbf{x}_j\Vert _p\), where \(\textbf{c}_1'=\textbf{x}_j\) and \(\textbf{c}_h'=\textbf{c}_h\) for \(h\in \{2,\ldots ,k\}\). Because \(\textbf{c}_1=\textbf{x}_j\) and \(\textsf{cost}_p(S,\textbf{c}_1')=0\), we conclude that \(\textsf{cost}_p(X_2',\ldots ,X_k')\le \textsf{cost}_p(S,X_2',\ldots ,X_k',\textbf{c}_1',\ldots ,\textbf{c}_k')\le \textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)=\textsf{Opt}(\textbf{X},k)\). Since \(\{X_2',\ldots ,X_k'\}\) is a \(k'\)-clustering of \(\textbf{X}'\), we have that \(\textsf{Opt}(\textbf{X}',k')\le \textsf{cost}_p(X_2',\ldots ,X_k')\le \textsf{Opt}(\textbf{X},k)\) as required.

We obtain that either \(\textsf{Opt}(\textbf{X},k)=\textsf{Opt}(\textbf{X}',k')\le B\), or both \(\textsf{Opt}(\textbf{X},k)>B\) and \(\textsf{Opt}(\textbf{X}',k')> B\). Notice also that, given an optimum equal \(k'\)-clustering of \(\textbf{X}'\), we can construct an optimum equal k-clustering of \(\textbf{X}\) by making S a cluster. Thus, it is sufficient to prove the lemma for the collection of points obtained by the exhaustive application of Reduction Rule 1. Note that if this collection is empty, then \(\textsf{Opt}(\textbf{X},k)=0\) and the lemma holds. This allows us to assume from now on that \(\textbf{X}\) is nonempty and has no s equal points.

Suppose that \(\{X_1,\ldots ,X_k\}\) is an equal k-clustering with \(\textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{Opt}(\textbf{X},k)\le B\). By Lemma 3, we have that for every \(i \in \{1,\ldots ,k\}\), the optimum median \(\textbf{c}_i\) of \(X_i\) is unique and \(\textbf{c}_i=\textbf{x}_{j}\) for some \(\textbf{x}_j \in X_i\) such that \(X_i\) contains at least \(s-2B\) points that are equal to \(\textbf{x}_j\). Notice that \(\textbf{c}_1,\ldots ,\textbf{c}_k\) are pairwise distinct because a collection of equal points cannot be split between distinct clusters in such a way that each of these clusters would contain at least \(s-2B\) of them. This holds because any collection of equal points of \(\textbf{X}\) contains at most \(s-1\) elements and \(2(s-2B)>s\) as \(s\ge 4B+1\). By Lemma 4, if \(\textbf{X}\) contains a collection S of \(B+1\le s-2B\) equal points, then one of the optimum medians should be equal to a point from S.

These observations allow us to construct the (potential) medians \(\textbf{c}_1,\ldots ,\textbf{c}_t\) as follows: we iteratively compute inclusion-maximal collections S of equal points of \(\textbf{X}\), and if \(| S| \ge B+1\), we set the next median \(\textbf{c}_i\) to be equal to a point of S. If the number t of constructed potential medians is not equal to k, we conclude that \(\textbf{X}\) has no equal k-clustering of cost at most B. Otherwise, if \(t=k\), then \(\textbf{c}_1,\ldots ,\textbf{c}_k\) must be the optimum medians of an equal k-clustering of minimum cost, provided \(\textsf{Opt}(\textbf{X},k)\le B\).

Then we compute in polynomial time an equal k-clustering \(\{X_1,\ldots ,X_k\}\) of \(\textbf{X}\) that minimizes \(\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\) using Lemma 1. If \(\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)>B\), then we conclude that \(\textsf{Opt}(\textbf{X},k)>B\). Otherwise, we have that \(\textsf{Opt}(\textbf{X},k)=\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\) and \(\{X_1,\ldots ,X_k\}\) is an equal k-clustering of minimum cost.\(\square \)
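Combining the pieces of the proof, the large-cluster case boils down to counting multiplicities. Below is a minimal sketch of the median-identification step (our helper name; multisets represented as rows of an integer array); the returned medians would then be fed to a matching routine as in Lemma 1, for instance the assign_to_medians sketch above.

```python
from collections import Counter
import numpy as np

def medians_for_large_clusters(X: np.ndarray, k: int, B: int):
    """Candidate medians when s = n/k >= 4B + 1 (sketch of Lemma 6).

    After Reduction Rule 1 has removed all blocks of s equal points,
    every optimum median of a clustering of cost <= B must be a point
    repeated at least B + 1 times (Lemmas 3 and 4).  If the number of
    such points is not exactly k, no clustering of cost <= B exists.
    """
    counts = Counter(map(tuple, X))  # multiplicity of every distinct point
    candidates = [np.array(x) for x, m in counts.items() if m >= B + 1]
    return candidates if len(candidates) == k else None
```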

Our next aim is to show that we can reduce the dimension and the absolute values of the coordinates of the points if \(\textsf{Opt}(\textbf{X},k)\le B\). To achieve this, we mimic some ideas of the kernelization algorithm of Fomin et al. [53] for a related clustering problem. However, they considered only points from \(\{0,1\}^d\) and the Hamming norm.

Lemma 7

There is a polynomial-time algorithm that, given a collection \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) of n points of \(\mathbb {Z}^d\), a positive integer k such that n is divisible by k, and a nonnegative integer B, either correctly concludes that \(\textsf{Opt}(\textbf{X},k)>B\) or computes a collection of n points \(\textbf{Y}=\{\textbf{y}_1,\ldots ,\textbf{y}_n\}\) of \(\mathbb {Z}^{d'}\) such that the following holds:

  (i) For every partition \(\{I_1,\ldots ,I_k\}\) of \(\{1,\ldots ,n\}\) such that \(| I_1| =\cdots =| I_k| =\frac{n}{k}\), either \(\textsf{cost}_p(X_1,\ldots ,X_k)>B\) and \(\textsf{cost}_p(Y_1,\ldots ,Y_k)>B\), or \(\textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{cost}_p(Y_1,\ldots ,Y_k)\), where \(X_i=\{\textbf{x}_h\mid h\in I_i\}\) and \(Y_i=\{\textbf{y}_h\mid h\in I_i\}\) for every \(i\in \{1,\ldots ,k\}\).

  (ii) \(d' = \mathcal {O}(kB^{p+1})\).

  (iii) \(|\textbf{y}_i[h]| = \mathcal {O}(kB^2)\) for \(h \in \{1,\ldots ,d'\}\) and \(i \in \{1,\ldots ,n\}\).

Proof

Let \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) be a collection of n points of \(\mathbb {Z}^d\) and let k be a positive integer such that n is divisible by k. Let also B be a nonnegative integer.

We iteratively construct the partition \(S=\{S_1,\ldots ,S_t\}\) of \(\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) using the following greedy algorithm (see also the sketch after the list below). Let \(j \ge 1\) be an integer and suppose that the sets \(S_0,\ldots ,S_{j-1}\) are already constructed, where \(S_0=\emptyset \). Let \(\textbf{Z}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\} \setminus \cup _{i=0}^{j-1}S_i\). If \(\textbf{Z}= \emptyset \), then the construction of S is completed. If \(\textbf{Z}\ne \emptyset \), we construct \(S_j\) as follows:

  • set \(S_j:=\{\textbf{x}_h\}\) for arbitrary \(\textbf{x}_h \in \textbf{Z}\) and set \(\textbf{Z}:=\textbf{Z}\setminus \{\textbf{x}_h\}\),

  • while there is \(\textbf{x}_r \in \textbf{Z}\) such that \(\Vert \textbf{x}_r-\textbf{x}_{r'}\Vert _p\le B\) for some \(\textbf{x}_{r'} \in S_j\), set \(S_j:=S_j\cup \{\textbf{x}_r\}\) and \(\textbf{Z}:=\textbf{Z}\setminus \{\textbf{x}_r\}\).
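The greedy construction above amounts to computing the connected components of the graph that links two points whenever their \(\ell _p\) distance is at most B. A hedged sketch (our helper name; \(p\ge 1\); quadratic time for simplicity):

```python
from collections import deque
import numpy as np

def greedy_partition(X: np.ndarray, B: int, p: int = 1):
    """The greedy partition S_1, ..., S_t described above (a sketch).

    Grows each part by absorbing any remaining point within l_p
    distance B of a point already in the part; the parts are the
    connected components of the distance-at-most-B graph.
    """
    unassigned = set(range(len(X)))
    parts = []
    while unassigned:
        seed = unassigned.pop()
        part, queue = [seed], deque([seed])
        while queue:
            u = queue.popleft()
            near = [v for v in unassigned
                    if (np.abs(X[u] - X[v]) ** p).sum() ** (1.0 / p) <= B]
            for v in near:
                unassigned.discard(v)
                part.append(v)
                queue.append(v)
        parts.append(part)
    return parts
```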

The crucial property of the partition S is that every cluster of an equal k-clustering of cost at most B is entirely in some part of the partition.

Claim 3.1

Let \(\{X_1,\ldots ,X_k\}\) be an equal k-clustering of \(\textbf{X}\) of cost at most B. Then for every \(i \in \{1,\ldots ,k\}\) there is \(j \in \{1,\ldots ,t\}\) such that \(X_i \subseteq S_j\).

Proof of Claim 3.1

Denote by \(\textbf{c}_1,\ldots ,\textbf{c}_k \in \mathbb {R}^d\) the optimum medians of the clusters \(X_1,\ldots ,X_k\), respectively. Assume to the contrary that there is a cluster \(X_i\) containing points \(\textbf{x}_u,\textbf{x}_v \in X_i\) that lie in distinct parts of the partition \(\{S_1,\ldots ,S_t\}\). Then \(\Vert \textbf{x}_u-\textbf{x}_v\Vert _p > B\) by the construction of \(S_1,\ldots ,S_t\), and

$$\begin{aligned} \textsf{cost}_p(X_1,\ldots ,X_k)\ge&\textsf{cost}_p(X_i)=\textsf{cost}_p(X_i,\textbf{c}_i)\ge \Vert \textbf{c}_i-\textbf{x}_u\Vert _p+\Vert \textbf{c}_i-\textbf{x}_v\Vert _p\\ \ge&\Vert \textbf{x}_u-\textbf{x}_v\Vert _p > B \end{aligned}$$

contradicting that \(\textsf{cost}_p(X_1,\ldots ,X_k)\le B\).\(\square \)

From Claim 3.1, we have that if \(t>k\), then \(\textbf{X}\) has no equal k-clustering of cost at most B, that is, \(\textsf{Opt}(\textbf{X},k)>B\). In this case, we return this answer and stop. From now on, we assume that this is not the case and \(t \le k\).

By Lemma 2, at least \(\frac{n}{k}-2B\) points in every cluster of an equal k-clustering of cost at most B are the same. Thus, if \(\{X_1,\ldots ,X_k\}\) is an equal k-clustering of cost at most B, then for each \(i\in \{1,\ldots ,k\}\), \(X_i\) contains at most \(2B+1\) distinct points. By Claim 3.1, we obtain that for every \(i\in \{1,\ldots ,t\}\), \(S_i\) should contain at most \(k(2B+1)\) distinct points if \(\textbf{X}\) admits an equal k-clustering of cost at most B. Then for each \(i\in \{1,\ldots ,t\}\), we compute the number of distinct points in \(S_i\), and if this number is bigger than \(k(2B+1)\), we conclude that \(\textsf{Opt}(\textbf{X},k)>B\). In this case, we return this answer and stop. From now on, we assume that this is not the case and each \(S_i\), \(i\in \{1,\ldots ,t\}\), contains at most \(k(2B+1)\) distinct points.

For a collection of points \(Z\subseteq \textbf{X}\), we say that a coordinate \(h\in \{1,\ldots ,d\}\) is uniform for Z if \(\textbf{x}_j[h]\) is the same for all \(\textbf{x}_j\in Z\), and h is nonuniform otherwise.

Let \(\ell _i\) be the number of nonuniform coordinates for \(S_i\) for \(i \in \{1,\ldots ,t\}\), and let \(\ell =\max _{1 \le i \le t}\ell _i\). For each \(i\in \{1,\ldots ,t\}\), we select a set of indices \(R_i\subseteq \{1,\ldots ,d\}\) of size \(\ell \) such that \(R_i\) contains all nonuniform coordinates for \(S_i\). Note that \(R_i\) may be empty if \(\ell =0\). We also define a set of coordinates \(T_i=\{1,\ldots ,d\}\setminus R_i\), for \(i \in \{1,\ldots ,t\}\).

For every \(i \in \{1,\ldots ,n\}\) and \(j \in \{1,\ldots ,t\}\) such that \(\textbf{x}_i \in S_j\), we define an \((\ell +1)\)-dimensional point \(\textbf{x}'_i\), where \(\textbf{x}'_i[1,\ldots ,\ell ]=\textbf{x}_i[R_j]\) and \(\textbf{x}'_i[\ell +1]=(j-1)(B+1)\). This way we obtain a collection of points \(\textbf{X}'=\{\textbf{x}'_1,\ldots ,\textbf{x}'_n\}\). For every \(j \in \{1,\ldots ,t\}\), we define \(S'_j=\{\textbf{x}'_h \mid \textbf{x}_h \in S_j\}\), that is, we construct the partition \(S'=\{S'_1,\ldots ,S'_t\}\) of \(\{\textbf{x}'_1,\ldots ,\textbf{x}'_n\}\) corresponding to S.

For each \(i\in \{1,\ldots ,t\}\), we do the following:

  • For each \(h \in \{1,\ldots ,\ell \}\), we find \(M_h^{(i)}=\min \{\textbf{x}'_j[h] \mid \textbf{x}'_j \in S'_i\}\).

  • For every \(\textbf{x}_j' \in S_i'\), we define a new point \(\textbf{y}_j\) by setting \(\textbf{y}_j[h]=\textbf{x}'_j[h]-M_h^{(i)}\) for \(h \in \{1,\ldots ,\ell \}\) and \(\textbf{y}_j[\ell +1]=\textbf{x}'_j[\ell +1]=(i-1)(B+1)\).

This way, we construct the collection \(\textbf{Y}=\{\textbf{y}_1,\ldots ,\textbf{y}_n\}\) of points from \(\mathbb {Z}^{\ell +1}\). Our algorithm returns this collection of points.
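The whole construction of \(\textbf{Y}\) can be summarized in code (a sketch under the naming above; for simplicity it emits the points of each part consecutively rather than in their original order, and the padding of each \(R_j\) to size \(\ell \) is arbitrary, exactly as in the proof):

```python
def compress(parts, B):
    """Given the partition S_1, ..., S_t of integer points, build the
    collection Y of (ell + 1)-dimensional points of the lemma."""
    d = len(parts[0][0])
    nonuni = [[h for h in range(d) if len({x[h] for x in part}) > 1]
              for part in parts]
    ell = max(len(nu) for nu in nonuni)
    Y = []
    for j, part in enumerate(parts):          # j is 0-based here
        # R_j: all nonuniform coordinates of S_j, padded arbitrarily to ell
        R = (nonuni[j] + [h for h in range(d) if h not in nonuni[j]])[:ell]
        M = [min(x[h] for x in part) for h in R]   # the minima M_h^{(j)}
        for x in part:
            # shift by the minima and append the separating last coordinate
            Y.append(tuple(x[h] - m for h, m in zip(R, M)) + (j * (B + 1),))
    return Y
```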

It is easy to see that the described algorithm runs in polynomial time. We show that if the algorithm outputs \(\textbf{Y}\), then this collection of points satisfies conditions (i)–(iii) of the lemma.

To show (i), let \(\{I_1,\ldots ,I_k\}\) be a partition of \(\{1,\ldots ,n\}\) such that \(| I_1| =\cdots =| I_k| =\frac{n}{k}\), and let \(X_i=\{\textbf{x}_h\mid h\in I_i\}\) and \(Y_i=\{\textbf{y}_h\mid h\in I_i\}\) for every \(i\in \{1,\ldots ,k\}\). We show that either \(\textsf{cost}_p(X_1,\ldots ,X_k)>B\) and \(\textsf{cost}_p(Y_1,\ldots ,Y_k)>B\) or \(\textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{cost}_p(Y_1,\ldots ,Y_k)\).

Suppose that \(\textsf{cost}_p(X_1,\ldots ,X_k)\le B\). Consider \(i\in \{1,\ldots , k\}\) and denote by \(\textbf{c}_i\) the optimum median for \(X_i\). By Claim 3.1, there is \(j\in \{1,\ldots ,t\}\) such that \(X_i\subseteq S_j\). We define \(\textbf{c}_i'\in \mathbb {R}^{\ell +1}\) by setting \(\textbf{c}_i'[1,\ldots ,\ell ]=\textbf{c}_i[R_j]\) and \(\textbf{c}_i'[\ell +1]=(j-1)(B+1)\). Further, we consider \(\textbf{c}_i''\in \mathbb {R}^{\ell +1}\) such that \(\textbf{c}_i''[h]=\textbf{c}_i'[h]-M_h^{(j)}\) for \(h\in \{1,\ldots ,\ell \}\) and \(\textbf{c}_i''[\ell +1]=(j-1)(B+1)\). Then, denoting \(X_i'=\{\textbf{x}_h'\mid \textbf{x}_h\in X_i\}\), by the definitions of \(X_i'\) and \(Y_i\), we have that

$$\begin{aligned} \textsf{cost}_p(X_i)=\textsf{cost}_p(X_i,\textbf{c}_i)=\textsf{cost}_p(X_i',\textbf{c}_i')=\textsf{cost}_p(Y_i,\textbf{c}_i'')\ge \textsf{cost}_p(Y_i). \end{aligned}$$

This implies that \(\textsf{cost}_p(X_1,\ldots ,X_k)\ge \textsf{cost}_p(Y_1,\ldots ,Y_k)\).

For the opposite direction, assume that \(\textsf{cost}_p(Y_1,\ldots ,Y_k)\le B\). Similarly to \(S'\), for every \(j \in \{1,\ldots ,t\}\), we define \(S''_j=\{\textbf{y}_h \mid \textbf{x}_h \in S_j\}\), that is, we construct the partition \(S''=\{S''_1,\ldots ,S''_t\}\) of \(\textbf{Y}\) corresponding to S. We claim that for each \(i\in \{1,\ldots ,k\}\), there is \(j\in \{1,\ldots ,t\}\) such that \(Y_i\subseteq S''_j\).

The proof is by contradiction and is similar to the proof of Claim 3.1. Assume that there is \(i\in \{1,\ldots ,k\}\) such that there are \(\textbf{y}_u,\textbf{y}_v \in Y_i\) belonging to distinct sets of \(S''\). Then \(\Vert \textbf{y}_u-\textbf{y}_v\Vert _p\ge |\textbf{y}_u[\ell +1]-\textbf{y}_v[\ell +1]| > B\) by the construction of \(S_1'',\ldots ,S_t''\). Then

$$\begin{aligned} \textsf{cost}_p(Y_1,\ldots ,Y_k)\ge&\textsf{cost}_p(Y_i)=\textsf{cost}_p(Y_i,\textbf{c}_i)\ge \Vert \textbf{c}_i-\textbf{y}_u\Vert _p+\Vert \textbf{c}_i-\textbf{y}_v\Vert _p\\ \ge&\Vert \textbf{y}_u-\textbf{y}_v\Vert _p > B, \end{aligned}$$

where \(\textbf{c}_i\) is an optimum median of \(Y_i\). However, this contradicts that \(\textsf{cost}_p(Y_1,\ldots ,Y_k)\le B\).

Consider \(i\in \{1,\ldots , k\}\) and let \(\textbf{c}_i''\in \mathbb {R}^{\ell +1}\) be an optimum median for \(Y_i\). Let also \(j\in \{1,\ldots ,t\}\) be such that \(Y_i\subseteq S''_j\). Notice that \(\textbf{c}_i''[\ell +1]=(j-1)(B+1)\) by the definition of \(S''_j\). We define \(\textbf{c}_i'\in \mathbb {R}^{\ell +1}\) by setting \(\textbf{c}_i'[h]=\textbf{c}_i''[h]+M_h^{(j)}\) for \(h\in \{1,\ldots ,\ell \}\) and \(\textbf{c}_i'[\ell +1]=\textbf{c}_i''[\ell +1]=(j-1)(B+1)\). Then we define \(\textbf{c}_i\in \mathbb {R}^d\) by setting \(\textbf{c}_i[R_j]=\textbf{c}_i'[1,\ldots ,\ell ]\) and \(\textbf{c}_i[T_j]=\textbf{x}_h[T_j]\) for arbitrary \(\textbf{x}_h\in S_j\). Because the coordinates in \(T_j\) are uniform for \(S_j\), the points of \(X_i\) all have the same value in each coordinate \(h\in T_j\). This implies that

$$\begin{aligned} \textsf{cost}_p(X_i)\le \textsf{cost}_p(X_i,\textbf{c}_i)=\textsf{cost}_p(X_i',\textbf{c}_i')=\textsf{cost}_p(Y_i,\textbf{c}_i'')=\textsf{cost}_p(Y_i). \end{aligned}$$

Hence, \(\textsf{cost}_p(X_1,\ldots ,X_k)\le \textsf{cost}_p(Y_1,\ldots ,Y_k)\). This completes the proof of (i).

To show (ii), we prove that \(\ell \le kB^p(2B+1)\). For this, we show that \(\ell _i\le kB^p(2B+1)\) for every \(i\in \{1,\ldots ,t\}\). Consider \(i\in \{1,\ldots ,t\}\). Recall that \(S_i\) contains at most \(k(2B+1)\) distinct points. Denote by \(\textbf{x}_{j_1},\ldots ,\textbf{x}_{j_r}\) the distinct points in \(S_i\) and assume that they are numbered in the order in which they are included in \(S_i\) by the greedy procedure constructing this set.

Let \(Z_q=\{\textbf{x}_{j_1},\ldots ,\textbf{x}_{j_q}\}\) for \(q\in \{1,\ldots ,r\}\). We claim that \(Z_q\) has at most \((q-1)B^p\) nonuniform coordinates for each \(q\in \{1,\ldots ,r\}\). The proof is by induction. The claim is trivial if \(q=1\). Let \(q>1\) and assume that the claim is fulfilled for \(Z_{q-1}\). By the construction of \(S_i\), \(\textbf{x}_{j_q}\) is at distance at most B from \(\textbf{x}_{j_h}\) for some \(h\in \{1,\ldots ,q-1\}\). Because \(\Vert \textbf{x}_{j_q}-\textbf{x}_{j_h}\Vert _p\le B\) and the coordinates are integers, each coordinate in which the two points differ contributes at least one to \(\Vert \textbf{x}_{j_q}-\textbf{x}_{j_h}\Vert _p^p\le B^p\); hence \(\textbf{x}_{j_q}\) and \(\textbf{x}_{j_h}\) differ in at most \(B^p\) coordinates. Then because \(Z_{q-1}\) has at most \((q-2)B^p\) nonuniform coordinates, \(Z_q\) has at most \((q-1)B^p\) nonuniform coordinates as required.

Because the number of nonuniform coordinates for \(S_i\) is the same as the number of nonuniform coordinates for \(Z_r\) and \(r\le k(2B+1)\), we obtain that \(\ell _i\le kB^p(2B+1)\). Then \(\ell =\max _{1\le i\le t}\ell _i\le kB^p(2B+1)\). Because the points of \(\textbf{Y}\) are in \(\mathbb {Z}^{\ell +1}\), we have the required upper bound for the dimension. This concludes the proof of (ii).

Finally, to show (iii), we again exploit the property that every \(S_i\) contains at most \(k(2B+1)\) distinct points. Let \(i\in \{1,\ldots ,t\}\) and denote by \(\textbf{x}_{j_1},\ldots ,\textbf{x}_{j_r}\) the distinct points in \(S_i\). Let \(h\in \{1,\ldots ,d\}\). We can assume without loss of generality that \(\textbf{x}_{j_1}[h]\le \cdots \le \textbf{x}_{j_r}[h]\). We claim that \(\textbf{x}_{j_r}[h]-\textbf{x}_{j_1}[h]\le B(k(2B+1)-1)\). This is trivial if \(r=1\). Assume that \(r>1\). Observe that \(\textbf{x}_{j_q}[h]-\textbf{x}_{j_{q-1}}[h]\le B\) for \(q\in \{2,\ldots ,r\}\). Otherwise, if there is \(q\in \{2,\ldots ,r\}\) such that \(\textbf{x}_{j_q}[h]-\textbf{x}_{j_{q-1}}[h]> B\), then the distance from any point in \(\{\textbf{x}_{j_1},\ldots ,\textbf{x}_{j_{q-1}}\}\) to any point in \(\{\textbf{x}_{j_q},\ldots ,\textbf{x}_{j_{r}}\}\) is more than B, but this contradicts that these points are the distinct points of \(S_i\). Then because \(\textbf{x}_{j_q}[h]-\textbf{x}_{j_{q-1}}[h]\le B\) for \(q\in \{2,\ldots ,r\}\) and \(r\le k(2B+1)\), we obtain that \(\textbf{x}_{j_r}[h]-\textbf{x}_{j_1}[h]\le B(k(2B+1)-1)\).

Then, by the definition of \(\textbf{x}_1',\ldots ,\textbf{x}_n'\), we obtain that \(|\textbf{x}_q'[h]-\textbf{x}_r'[h]|\le B(k(2B+1)-1)\) for every \(i\in \{1,\ldots ,t\}\), every \(\textbf{x}_{q}',\textbf{x}_r'\in S_i'\), and every \(h\in \{1,\ldots ,\ell \}\). By the definition of \(M_h^{(i)}\) for \(i\in \{1,\ldots ,t\}\), we obtain that \(|\textbf{y}_j[h]|\le B(k(2B+1)-1)\) for every \(j\in \{1,\ldots ,n\}\) and every \(h\in \{1,\ldots ,\ell \}\). Because \(|\textbf{y}_j[\ell +1]|\le (k-1)(B+1)\), we have that \(|\textbf{y}_i[h]|\le \max \{B(k(2B+1)-1),(k-1)(B+1)\}=\mathcal {O}(kB^2)\) for \(h \in \{1,\ldots ,d'\}\) and \(i \in \{1,\ldots ,n\}\). This completes the proof of (iii) and the proof of the lemma.\(\square \)

Finally in this subsection, we show the following lemma that is used to upper bound the additional cost incurred by the greedy clustering of blocks of equal points.

Lemma 8

Let \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) be a collection of n points of \(\mathbb {Z}^d\) and let k be a positive integer such that n is divisible by k. Suppose that \(S_1,\ldots ,S_t\) are disjoint collections of equal points of \(\textbf{X}\) such that \(| S_1| =\cdots =| S_t| =\frac{n}{k}\) and \(\textbf{Y}=\textbf{X}\setminus \big (S_1\cup \cdots \cup S_t\big )\). Then \(\textsf{Opt}(\textbf{Y},k-t)\le 2\cdot \textsf{Opt}(\textbf{X},k)\).

Proof

Let \(\{X_1,\ldots ,X_k\}\) be an optimum equal k-clustering of \(\textbf{X}\) with optimum medians \(\textbf{c}_1,\ldots ,\textbf{c}_k\) of \(X_1,\ldots ,X_k\), respectively, that is, \(\textsf{Opt}(\textbf{X},k)=\textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{cost}_p(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\). Let \(\textbf{x}_{i_h}\in S_h\) for \(h\in \{1,\ldots ,t\}\). Consider a t-tuple \((j_1,\ldots ,j_t)\) of distinct indices from \(\{1,\ldots ,k\}\) such that

$$\begin{aligned} \Vert \textbf{x}_{i_1}-\textbf{c}_{j_1}\Vert _p+\cdots +\Vert \textbf{x}_{i_t}-\textbf{c}_{j_t}\Vert _p=\min _{(q_1,\ldots ,q_t)}\big (\Vert \textbf{x}_{i_1}-\textbf{c}_{q_1}\Vert _p+\cdots +\Vert \textbf{x}_{i_t}-\textbf{c}_{q_t}\Vert _p\big ), \end{aligned}$$
(8)

where the minimum in the right part is taken over all t-tuples \((q_1,\ldots ,q_t)\) of distinct indices from \(\{1,\ldots ,k\}\). Denote \(\ell =k-t\) and \(s=\frac{n}{k}\). Iteratively applying Lemma 5 for \(S_1,\ldots ,S_t\) and the medians \(\textbf{c}_{j_1},\ldots ,\textbf{c}_{j_t}\), we obtain that there is an equal \(\ell \)-clustering \(\{Y_1,\ldots ,Y_\ell \}\) of \(\textbf{Y}\) such that

$$\begin{aligned} \textsf{cost}_p(S_1,\ldots ,S_t,Y_1,\ldots ,Y_\ell )\le \textsf{cost}_p(X_1,\ldots ,X_k)+s\sum _{h=1}^t \Vert \textbf{x}_{i_h}-\textbf{c}_{j_h}\Vert _p. \end{aligned}$$
(9)

Because the points in each \(S_i\) are the same, \(\textsf{cost}_p(S_i)=0\) and, therefore, \(\textsf{cost}_p(S_1,\ldots ,S_t,Y_1,\ldots ,Y_\ell )=\textsf{cost}_p(Y_1,\ldots ,Y_\ell )\). Then by (9),

$$\begin{aligned} \textsf{Opt}(\textbf{Y},\ell )\le \textsf{cost}_p(Y_1,\ldots ,Y_\ell )\le \textsf{Opt}(\textbf{X},k)+s\sum _{h=1}^t \Vert \textbf{x}_{i_h}-\textbf{c}_{j_h}\Vert _p. \end{aligned}$$
(10)

This implies that to prove the lemma, it is sufficient to show that

$$\begin{aligned} s\sum _{h=1}^t \Vert \textbf{x}_{i_h}-\textbf{c}_{j_h}\Vert _p\le \textsf{Opt}(\textbf{X},k). \end{aligned}$$
(11)

To prove (11), we consider the following auxiliary clustering problem. Let \(\textbf{Z}=S_1\cup \cdots \cup S_t\) and recall that \(s=\frac{n}{k}\). The task of the problem is to find a partition \(\{Z_1,\ldots ,Z_k\}\) of \(\textbf{Z}\), where some sets may be empty and \(| Z_i| \le s\) for every \(i\in \{1,\ldots ,k\}\), such that

$$\begin{aligned} \sum _{i=1}^k\sum _{\textbf{x}_h\in Z_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p \end{aligned}$$
(12)

is minimum. In words, we cluster the elements of \(\textbf{Z}\) in the optimum way into clusters of size at most s using the optimum medians \(\textbf{c}_1,\ldots ,\textbf{c}_k\) for the clustering \(\{X_1,\ldots ,X_k\}\). Denote by \(\textsf{Opt}^*(\textbf{Z},k)\) the minimum value of (12). Because in this problem the task is to cluster a subcollection of points of \(\textbf{X}\) and we relax the cluster size constraints, we have that \(\textsf{Opt}^*(\textbf{Z},k)\le \textsf{Opt}(\textbf{X},k)\). We show the following claim.

Claim 3.2

$$\begin{aligned} \textsf{Opt}^*(\textbf{Z},k)\ge s\cdot \min _{(q_1,\ldots ,q_t)}\sum _{h=1}^t\Vert \textbf{x}_{i_h}-\textbf{c}_{q_h}\Vert _p, \end{aligned}$$

where the minimum is taken over all t-tuples \((q_1,\ldots ,q_t)\) of distinct indices from \(\{1,\ldots ,k\}\).

Proof of Claim 3.2

We show that the considered auxiliary clustering problem can be reduced to the Min Cost Flow problem (see, e.g., the textbook of Kleinberg and Tardos [54] for an introduction). We construct the directed graph G and define the capacity and cost functions \(c(\cdot )\) and \(\omega (\cdot )\) on the set of arcs A(G) as follows.

  • Construct two vertices a and b that are the source and target vertices, respectively.

  • For every \(i\in \{1,\ldots ,t\}\), construct a vertex \(u_i\) (corresponding to \(S_i\)) and an arc \((a,u_i)\) with \(\omega (a,u_i)=0\).

  • For every \(j\in \{1,\ldots ,k\}\), construct a vertex \(v_j\) (corresponding to \(Z_j\)) and an arc \((v_j,b)\) with \(\omega (v_j,b)=0\).

  • For every \(h\in \{1,\ldots ,t\}\) and every \(j\in \{1,\ldots ,k\}\), construct an arc \((u_h,v_j)\) and set \(\omega (u_h,v_j)=\Vert \textbf{x}_{i_h}-\textbf{c}_j\Vert _p\) (recall that \(\textbf{x}_{i_h}\in S_h\)).

  • For every arc e of G, set \(c(e)=s\), where \(s=\frac{n}{k}\).

Then the volume of a flow \(f:A(G)\rightarrow \mathbb {R}_{\ge 0}\) is \(v(f)=\sum _{i=1}^tf(a,u_i)\) and its cost is \(\omega (f)=\sum _{a\in A(G)}\omega (a)\cdot f(a)\). Let \(f^*(\cdot )\) be a flow of volume st with minimum cost. We claim that \(\omega (f^*)=\textsf{Opt}^*(\textbf{Z},k)\).
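The minimum cost \(\omega (f^*)\) can be computed with any Min Cost Flow solver; the following sketch uses networkx (our choice for illustration, not part of the paper's argument; these routines expect integer arc weights, so in general the distances would have to be scaled and rounded):

```python
import networkx as nx

def opt_star(reps, centers, s, dist):
    """reps[h] is the representative point x_{i_h} of S_h, centers[j] is c_j,
    dist is an integer-valued distance; returns Opt*(Z, k) = omega(f*)."""
    G = nx.DiGraph()
    t, k = len(reps), len(centers)
    for h in range(t):
        G.add_edge("a", ("u", h), capacity=s, weight=0)
        for j in range(k):
            G.add_edge(("u", h), ("v", j), capacity=s,
                       weight=dist(reps[h], centers[j]))
    for j in range(k):
        G.add_edge(("v", j), "b", capacity=s, weight=0)
    # All arcs out of a have capacity s, so the maximum flow has volume s*t.
    flow = nx.max_flow_min_cost(G, "a", "b")
    return nx.cost_of_flow(G, flow)
```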

Assume that \(\{Z_1,\ldots ,Z_k\}\) is a partition of \(\textbf{Z}\) such that \(| Z_i| \le s\) for every \(i\in \{1,\ldots ,k\}\) and \(\textsf{Opt}^*(\textbf{Z},k)=\sum _{i=1}^k\sum _{\textbf{x}_h\in Z_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p\). We define the flow \(f(\cdot )\) as follows:

  • for every \(i\in \{1,\ldots ,t\}\), set \(f(a,u_i)=s\),

  • for every \(i\in \{1,\ldots ,t\}\) and \(j\in \{1,\ldots ,k\}\), set \(f(u_i,v_j)=| S_i\cap Z_j|\), and

  • for every \(j\in \{1,\ldots ,k\}\), set \(f(v_j,b)=| Z_j| \).

It is easy to verify that f is a feasible flow of volume st and \(\omega (f)=\sum _{i=1}^k\sum _{\textbf{x}_h\in Z_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p\). Thus, \(\omega (f^*)\le \omega (f)=\textsf{Opt}^*(\textbf{Z},k)\).

For the opposite inequality, consider \(f^*(\cdot )\). By the well-known property of flows (see [54]), we can assume that \(f^*(\cdot )\) is an integer flow, that is, \(f^*(e)\) is a nonnegative integer for every \(e\in A(G)\). Since \(v(f^*)=st\), we have that \(f^*(a,u_i)=s\) for every \(i\in \{1,\ldots ,t\}\). Then we construct the clustering \(\{Z_1,\ldots ,Z_k\}\) as follows: for every \(i\in \{1,\ldots ,t\}\) and \(j\in \{1,\ldots ,k\}\), we put exactly \(f^*(u_i,v_j)\) points of \(S_i\) into \(Z_j\). Because \(f^*(a,u_i)=s\) for every \(i\in \{1,\ldots ,t\}\) and \(c(v_j,b)=s\) for every \(j\in \{1,\ldots ,k\}\), we obtain that \(\{Z_1,\ldots ,Z_k\}\) is a partition of \(\textbf{Z}\) such that \(| Z_i| \le s\) for every \(i\in \{1,\ldots ,k\}\) and \(\sum _{i=1}^k\sum _{\textbf{x}_h\in Z_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p=\omega (f^*)\). This implies that \(\textsf{Opt}^*(\textbf{Z},k)\le \sum _{i=1}^k\sum _{\textbf{x}_h\in Z_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p=\omega (f^*)\).

This proves that \(\omega (f^*)=\textsf{Opt}^*(\textbf{Z},k)\). Moreover, we can observe that, given an integer flow \(f(\cdot )\) with \(v(f)=st\), we can construct a feasible clustering \(\{Z_1,\ldots ,Z_k\}\) of cost \(\omega (f)\) such that for every \(i\in \{1,\ldots ,t\}\) and every \(j\in \{1,\ldots ,k\}\), \(| S_i\cap Z_j|=f(u_i,v_j)\). Recall that the capacities of the arcs of G are the same and are equal to s. Then again exploiting the properties of flows (see [54]), we observe that there is a flow \(f^*(\cdot )\) with \(v(f^*)=st\) of minimum cost such that the saturated arcs (that is, arcs e with \(f^*(e)=c(e)=s\)) compose internally vertex-disjoint (a, b)-paths, and the flow on other arcs is zero. This implies that for the clustering \(\{Z_1,\ldots ,Z_k\}\) constructed for \(f^*(\cdot )\) and for every \(j\in \{1,\ldots ,k\}\), either \(Z_j=\emptyset \) or there is \(i\in \{1,\ldots ,t\}\) such that \(Z_j=S_i\). Assume that \(j_1,\ldots ,j_t\) are distinct indices from \(\{1,\ldots ,k\}\) such that \(Z_{j_h}=S_h\) for \(h\in \{1,\ldots ,t\}\). Then \(\omega (f^*)=\sum _{i=1}^k\sum _{\textbf{x}_h\in Z_i}\Vert \textbf{x}_h-\textbf{c}_i\Vert _p=s\sum _{h=1}^t\Vert \textbf{x}_{i_h}-\textbf{c}_{j_h}\Vert _p\) and

$$\begin{aligned} \textsf{Opt}^*(\textbf{Z},k)=\omega (f^*)=s\sum _{h=1}^t\Vert \textbf{x}_{i_h}-\textbf{c}_{j_h}\Vert _p\ge s\cdot \min _{(q_1,\ldots ,q_t)}\sum _{h=1}^t\Vert \textbf{x}_{i_h}-\textbf{c}_{q_h}\Vert _p, \end{aligned}$$

where the minimum is taken over all t-tuples \((q_1,\ldots ,q_t)\) of distinct indices from \(\{1,\ldots ,k\}\). This proves the claim.\(\square \)

Recall that \(\textsf{Opt}^*(\textbf{Z},k)\le \textsf{Opt}(\textbf{X},k)\). By the choice of \(j_1,\ldots ,j_t\) in (8) and Claim 3.2, we obtain that inequality (11) holds. Then by (10) and (11), we have that \(\textsf{Opt}(\textbf{Y},k-t)\le 2\cdot \textsf{Opt}(\textbf{X},k)\) as required by the lemma.\(\square \)

3.2 Proof of Theorem 1

Now we are ready to show the result about the approximate kernel that we restate.

Theorem 1

For every nonnegative integer constant p, Parameterized Equal Clustering admits a 2-approximate kernel when parameterized by B, where the output collection of points has \(\mathcal {O}(B^2)\) points of \(\mathbb {Z}^{d'}\) with \(d'= \mathcal {O}(B^{p+2})\), where each coordinate of a point takes an absolute value of \(\mathcal {O}(B^3)\).

Proof

Let \((\textbf{X},k,B)\) be an instance of Parameterized Equal Clustering with \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\), where the points are from \(\mathbb {Z}^d\) and n is divisible by k. Recall that a lossy kernel consists of two algorithms. The first algorithm is a polynomial-time reduction producing an instance \((\textbf{X}',k',B')\) of bounded size. The second algorithm is a solution-lifting algorithm: for every equal \(k'\)-clustering \(\{X_1',\ldots ,X_{k'}'\}\) of \(\textbf{X}'\), it produces in polynomial time an equal k-clustering \(\{X_1,\ldots ,X_k\}\) of \(\textbf{X}\) such that

$$\begin{aligned} \frac{\textsf{cost}_p^B(X_1,\ldots ,X_k)}{\textsf{Opt}(\textbf{X},k,B)}\le 2\cdot \frac{\textsf{cost}_p^{B'}(X_1',\ldots ,X_{k'}')}{\textsf{Opt}(\textbf{X}',k',B')}. \end{aligned}$$
(13)

We separately consider the cases when \(\frac{n}{k} \ge 4B+1\) and \(\frac{n}{k} \le 4B\).

Suppose that \(\frac{n}{k} \ge 4B+1\). Then we apply the algorithm from Lemma 6. If the algorithm returns the answer that \(\textbf{X}\) does not admit an equal k-clustering of cost at most B, then the reduction algorithm returns a trivial no-instance \((\textbf{X}',k',B')\) of constant size, that is, an instance such that \(\textbf{X}'\) has no clustering of cost at most \(B'\). For example, we set \(\textbf{X}'=\{(0),(1)\}\), \(k'=1\), and \(B'=0\). Here and in the further cases when the reduction algorithm returns a trivial no-instance, the solution-lifting algorithm returns an arbitrary equal k-clustering of \(\textbf{X}\). Since \(\textsf{cost}_p^B(X_1,\ldots ,X_k)=\textsf{Opt}(\textbf{X},k,B)=B+1\), (13) holds. Assume that the algorithm from Lemma 6 produced an equal k-clustering \(\{X_1,\ldots ,X_k\}\) of minimum cost. Then the reduction returns an arbitrary instance of Parameterized Equal Clustering of constant size. For example, we can use \(\textbf{X}'=\{(0)\}\), \(k'=1\), and \(B'=0\). The solution-lifting algorithm always returns \(\{X_1,\ldots ,X_k\}\). Clearly, \(\textsf{cost}_p^B(X_1,\ldots ,X_k)=\textsf{Opt}(\textbf{X},k,B)\) and (13) is fulfilled.

From now on, we assume that \(\frac{n}{k} \le 4B\), that is, \(n \le 4Bk\). We apply the algorithm from Lemma 6. If this algorithm reports that there is no equal k-clustering of cost at most B, then the reduction algorithm returns a trivial no-instance and the solution-lifting algorithm outputs an arbitrary equal k-clustering of \(\textbf{X}\). Clearly, (13) is satisfied. Assume that this is not the case. Then we obtain a collection of \(n\le 4Bk\) points \(\textbf{Y}=\{\textbf{y}_1,\ldots ,\textbf{y}_n\}\) of \(\mathbb {Z}^{d'}\) satisfying conditions (i)–(iii) of Lemma 7. That is,

  1. (i)

    for every partition \(\{I_1,\ldots ,I_k\}\) of \(\{1,\ldots ,n\}\) such that \(| I_1| =\cdots =| I_k| =\frac{n}{k}\), either \(\textsf{cost}_p(X_1,\ldots ,X_k)>B\) and \(\textsf{cost}_p(Y_1,\ldots ,Y_k)>B\) or \(\textsf{cost}_p(X_1,\ldots ,X_k)=\textsf{cost}_p(Y_1,\ldots ,Y_k)\), where \(X_i=\{\textbf{x}_h\mid h\in I_i\}\) and \(Y_i=\{\textbf{y}_h\mid h\in I_i\}\) for every \(i\in \{1,\ldots ,k\}\),

  2. (ii)

    \(d' = \mathcal {O}(kB^{p+1})\), and

  3. (iii)

    \(|\textbf{y}_i[h]| = \mathcal {O}(kB^2)\) for \(h \in \{1,\ldots ,d'\}\) and \(i \in \{1,\ldots ,n\}\).

By (i), given an equal k-clustering \(\{Y_1,\ldots ,Y_k\}\) of \(\textbf{Y}\), we can compute the corresponding clustering \(\{X_1,\ldots ,X_k\}\) by setting \(X_i=\{\textbf{x}_h\mid \textbf{y}_h\in Y_i\}\) for \(i\in \{1,\ldots ,k\}\). Then \(\textsf{Opt}(\textbf{X},k,B)=\textsf{Opt}(\textbf{Y},k,B)\) and

$$\begin{aligned} \frac{\textsf{cost}_p^B(X_1,\ldots ,X_k)}{\textsf{Opt}(\textbf{X},k,B)}= \frac{\textsf{cost}_p^{B}(Y_1,\ldots ,Y_{k})}{\textsf{Opt}(\textbf{Y},k,B)}. \end{aligned}$$
(14)

Hence the instances \((\textbf{X},k,B)\) and \((\textbf{Y},k,B)\) are equivalent. We continue with the compressed instance \((\textbf{Y},k,B)\).

Now we apply the greedy procedure that constructs clusters \(S_1,\ldots ,S_t\) composed of equal points. Formally, we initially set \(\textbf{X}':=\textbf{Y}\), \(k':=k\), and \(i:=0\). Then we do the following:

  • while \(\textbf{X}'\) contains a collection S of \(s=\frac{n}{k}\) identical points, set \(i:=i+1\), \(S_i:=S\), \(\textbf{X}':=\textbf{X}'\setminus S\), and \(k':=k'-1\).

Denote by \(\textbf{X}'\) the set of points obtained by the application of the procedure and let \(S_1,\ldots ,S_t\) be the collections of equal points constructed by the procedure. Note that \(k'=k-t\). We also define \(B'=2B\). Notice that it may happen that \(\textbf{X}'=\textbf{Y}\) or \(\textbf{X}'=\emptyset \). The crucial property exploited by the kernelization is that by Lemma 8, \(\textsf{Opt}(\textbf{X}',k')\le 2\cdot \textsf{Opt}(\textbf{Y},k)\).
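This greedy extraction amounts to simple multiset bookkeeping, for example (an illustrative sketch):

```python
from collections import Counter

def extract_equal_blocks(Y, k):
    """Peel off blocks of s = n/k identical points; returns the blocks
    S_1, ..., S_t, the remaining collection X', and k' = k - t."""
    s = len(Y) // k
    counts = Counter(map(tuple, Y))
    blocks = []
    for point, c in list(counts.items()):
        while c >= s:                    # one more block S_i of s equal points
            blocks.append([point] * s)
            c -= s
        counts[point] = c
    rest = [pt for pt, c in counts.items() for _ in range(c)]   # this is X'
    return blocks, rest, k - len(blocks)
```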

We argue that if \(k'>B'\), then \(\textbf{X}\) has no equal k-clustering of cost at most B. Suppose that \(k'>B'\). Consider an arbitrary equal \(k'\)-clustering \(\{X_1',\ldots ,X_{k'}'\}\) of \(\textbf{X}'\). Because the construction of \(S_1,\ldots ,S_t\) stops when there is no collection of s equal points, each cluster \(X_i'\) (which has size s) contains at least two distinct points. Since all points have integer coordinates, we have that \(\textsf{cost}_p(X_i')\ge 1\) for every \(i\in \{1,\ldots ,k'\}\). Therefore, \(\textsf{cost}_p(X_1',\ldots ,X_{k'}')=\sum _{i=1}^{k'}\textsf{cost}_p(X_i')\ge k'>B'=2B\). This means that \(2\cdot \textsf{Opt}(\textbf{Y},k)\ge \textsf{Opt}(\textbf{X}',k')>2B\) and \(\textsf{Opt}(\textbf{Y},k)>B\). Using this, our reduction algorithm returns a trivial no-instance. Then the solution-lifting algorithm outputs an arbitrary equal k-clustering of \(\textbf{X}\) and this satisfies (13).

From now on we assume that \(k'\le B'=2B\) and construct the reduction and solution-lifting algorithms for this case.

If \(k'=0\), then \(\textbf{X}'=\emptyset \) and the reduction algorithm simply returns an arbitrary instance of constant size. Otherwise, our reduction algorithm returns \((\textbf{X}',k',B')\). Observe that since \(k'\le B'=2B\) and each cluster has size \(\frac{n}{k}\le 4B\), we have \(| \textbf{X}'|=k'\cdot \frac{n}{k}\le 8B^2\). Recall that \(d'=\mathcal {O}(B^{p+2})\) and \(|\textbf{x}_i'[h]| = \mathcal {O}(B^3)\) for \(h \in \{1,\ldots ,d'\}\) for every point \(\textbf{x}_i'\in \textbf{X}'\). We conclude that the instance \((\textbf{X}',k',B')\) of Parameterized Equal Clustering satisfies the size conditions of the theorem.

Now we describe the solution-lifting algorithm and argue that inequality (13) holds.

If \(k'=0\), then the solution-lifting algorithm ignores the output of the reduction algorithm, which was arbitrary. It takes the equal k-clustering \(\{S_1,\ldots ,S_k\}\) of \(\textbf{Y}\) and outputs the equal k-clustering \(\{X_1,\ldots ,X_k\}\) of \(\textbf{X}\) by setting \(X_i=\{\textbf{x}_h\mid \textbf{y}_h\in S_i\}\) for \(i\in \{1,\ldots ,k\}\). Clearly, \(\textsf{cost}_p(S_1,\ldots ,S_k)=\textsf{cost}_p(X_1,\ldots ,X_k)=0\). Therefore, (13) holds.

If \(k'>0\), we consider an equal \(k'\)-clustering \(\{X_1',\ldots ,X_{k'}'\}\) of \(\textbf{X}'\). The solution-lifting algorithm constructs an equal k-clustering \(\{S_1,\ldots ,S_t,X_1',\ldots ,X_{k'}'\}\), that is, we just add the clusters constructed by our greedy procedure. Since the points in each set \(S_i\) are the same, \(\textsf{cost}_p(S_i)=0\) for every \(i\in \{1,\ldots ,t\}\). Therefore,

$$\begin{aligned} \textsf{cost}_p(S_1,\ldots ,S_t,X_1',\ldots ,X_{k'}')=\textsf{cost}_p(X_1',\ldots ,X_{k'}'). \end{aligned}$$

Notice that since \(\textsf{Opt}(\textbf{X}',k')\le 2\cdot \textsf{Opt}(\textbf{Y},k)\), we have that \(\textsf{Opt}(\textbf{X}',k',B')\le 2\cdot \textsf{Opt}(\textbf{Y},k,B)\). Indeed, if \(\textsf{Opt}(\textbf{Y},k)\le B\), then \(\textsf{Opt}(\textbf{X}',k')\le 2B=B'\). Hence, \(\textsf{Opt}(\textbf{Y},k,B)=\textsf{Opt}(\textbf{Y},k)\), \(\textsf{Opt}(\textbf{X}',k',B')=\textsf{Opt}(\textbf{X}',k')\), and \(\textsf{Opt}(\textbf{X}',k',B')\le 2\cdot \textsf{Opt}(\textbf{Y},k,B)\). If \(\textsf{Opt}(\textbf{Y},k)>B\), then \(\textsf{Opt}(\textbf{Y},k,B)=B+1\). In this case, \(2\cdot \textsf{Opt}(\textbf{Y},k,B)=2B+2>\textsf{Opt}(\textbf{X}',k',B')\) because \(\textsf{Opt}(\textbf{X}',k',B')\le B'+1=2B+1\). Finally, since \(\textsf{cost}_p(S_1,\ldots ,S_t,X_1',\ldots ,X_{k'}')=\textsf{cost}_p(X_1',\ldots ,X_{k'}')\) and \(\textsf{Opt}(\textbf{X}',k',B')\le 2\cdot \textsf{Opt}(\textbf{Y},k,B)\), we conclude that

$$\begin{aligned} \frac{\textsf{cost}_p^B(S_1,\ldots ,S_t,X_1',\ldots ,X_{k'}')}{\textsf{Opt}(\textbf{Y},k,B)}\le 2\cdot \frac{\textsf{cost}_p^{B'}(X_1',\ldots ,X_{k'}')}{\textsf{Opt}(\textbf{X}',k',B')}. \end{aligned}$$
(15)

Then the solution-lifting algorithm computes the equal k-clustering \(\{X_1,\ldots ,X_k\}\) for the equal k-clustering \(\{Y_1,\ldots ,Y_k\}=\{S_1,\ldots ,S_t,X_1',\ldots ,X_{k'}'\}\) of \(\textbf{Y}\) by setting \(X_i=\{\textbf{x}_h\mid \textbf{y}_h\in Y_i\}\) for \(i\in \{1,\ldots ,k\}\). Combining (14) and (15), we obtain (13).

This concludes the description of the reduction and solution-lifting algorithms, as well as the proof of their correctness. To argue that the reduction algorithm runs in polynomial time, we observe that the algorithms from Lemmata 6 and 7 run in polynomial time. Trivially, the greedy construction of \(S_1,\ldots ,S_t\), \(\textbf{X}'\), and \(k'\) can be done in polynomial time. Therefore, the reduction algorithm runs in polynomial time. The solution-lifting algorithm is also easily implementable to run in polynomial time. This concludes the proof.\(\square \)

4 Kernelization

In this section, we study (exact) kernelization of clustering with equal sizes. In Subsection 4.1 we prove Theorem 2, claiming that the decision version of the problem, Decision Equal Clustering, does not admit a polynomial kernel when parameterized by B only. We also show in Subsection 4.2 that the technical lemmata developed in the previous section for the approximate kernel can be used to prove that Decision Equal Clustering parameterized by k and B admits a polynomial kernel.

4.1 Kernelization Lower Bound

In this subsection, we show that it is unlikely that Decision Equal Clustering admits a polynomial kernel when parameterized by B only. We prove this for the \(\ell _0\) and \(\ell _1\)-norms. Our lower bound holds even for points with binary coordinates, that is, for points from \(\{0,1\}^d\). For this, we use the result of Dell and Marx [55] about kernelization lower bounds for the Perfect \(r\) -Set Matching problem.

A hypergraph \(\mathcal {H}\) is said to be r-uniform for a positive integer r if every hyperedge of \(\mathcal {H}\) has size r. Similarly to graphs, a set of hyperedges M is a matching if the hyperedges in M are pairwise disjoint, and M is perfect if every vertex of \(\mathcal {H}\) is saturated in M, that is, included in one of the hyperedges of M. Perfect \(r\) -Set Matching asks, given an r-uniform hypergraph \(\mathcal {H}\), whether \(\mathcal {H}\) has a perfect matching. Dell and Marx [55] proved the following kernelization lower bound.

Proposition 1

([55]) Let \(r\ge 3\) be an integer and let \(\varepsilon \) be a positive real. Unless \(\textsf{NP}\subseteq \textsf{coNP}/\textsf{poly}\), Perfect \(r\) -Set Matching does not have kernels with \(\mathcal {O}(\big (\frac{| V(\mathcal {H})|}{r}\big )^{r-\varepsilon })\) hyperedges.

We need a weaker claim.

Corollary 1

Perfect \(r\) -Set Matching admits no polynomial kernel when parameterized by the number of vertices of the input hypergraph unless \(\textsf{NP}\subseteq \textsf{coNP}/\textsf{poly}\).

Proof

To see the claim, it is sufficient to observe that the existence of a polynomial kernel for Perfect \(r\) -Set Matching parameterized by \(| V(\mathcal {H})|\) implies that the problem has a kernel such that the number of hyperedges is polynomial in \(| V(\mathcal {H})|\) with the degree of the polynomial not depending on r, contradicting Proposition 1.\(\square \)

We show the kernelization lower bound for \(\ell _0\) and \(\ell _1\) using the fact that optimum medians can be computed by the majority rule for a collection of binary points. Let X be a collection of points of \(\{0,1\}^d\). We construct \(\textbf{c}\in \{0,1\}^d\) as follows: for \(i\in \{1,\ldots ,d\}\), consider the multiset \(S_i=\{\textbf{x}[i]\mid \textbf{x}\in X\}\) and set \(\textbf{c}[i]=0\) if at least half of the elements of \(S_i\) are zeros, and set \(\textbf{c}[i]=1\) otherwise. It is straightforward to observe the following.

Observation 1

Let X be a collection of points of \(\{0,1\}^d\) and let \(\textbf{c}\in \{0,1\}^d\) be a vector constructed by the majority rule. Then for the \(\ell _0\) and \(\ell _1\)-norms, \(\textbf{c}\) is
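In code, the majority rule reads as follows (a small sketch; ties are resolved to 0, an asymmetry that is used in the proof of Theorem 2 below):

```python
def majority_median(X):
    """Coordinate-wise majority for binary points; ties go to 0."""
    n, d = len(X), len(X[0])
    return [0 if 2 * sum(x[h] for x in X) <= n else 1 for h in range(d)]
```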

We also use the following lemma which is a special case of Lemma 9 of [50].

Lemma 9

([50]) Let \(\{X_1,\ldots ,X_k\}\) be an equal k-clustering of a collection of points \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\) from \(\{0,1\}^d\), and let \(\textbf{c}_1,\ldots ,\textbf{c}_k\) be optimum medians for \(X_1,\ldots ,X_k\), respectively. Let also \(\textbf{C}\subseteq \{\textbf{c}_1,\ldots ,\textbf{c}_k\}\) be the set of medians coinciding with some points of \(\textbf{X}\). Suppose that every collection of the same points of \(\textbf{X}\) has size at most \(\frac{n}{k}\). Then there is an equal k-clustering \(\{X_1',\ldots ,X_k'\}\) of \(\textbf{X}\) such that \(\textsf{cost}_0(X_1',\ldots ,X_k',\textbf{c}_1,\ldots ,\textbf{c}_k)\le \textsf{cost}_0(X_1,\ldots ,X_k,\textbf{c}_1,\ldots ,\textbf{c}_k)\) and for every \(i\in \{1,\ldots ,k\}\), the following is fulfilled: if \(\textbf{c}_i\in \textbf{C}\), then each \(\textbf{x}_h\in \textbf{X}\) coinciding with \(\textbf{c}_i\) is in \(X_i'\).

Now we are ready to prove Theorem 2, which we restate here.

Theorem 2

For the \(\ell _0\) and \(\ell _1\)-norms, Decision Equal Clustering has no polynomial kernel when parameterized by B unless \(\textsf{NP}\subseteq \textsf{coNP}/\textsf{poly}\), even if the input points are binary, that is, are from \(\{0,1\}^d\).

Proof

Notice that for any binary vector \(\textbf{x}\in \{0,1\}^d\), \(\Vert \textbf{x}\Vert _0=\Vert \textbf{x}\Vert _1\). Since we consider only instances where the input points are binary, we can assume that the medians of clusters are binary as well by Observation 1. Then it is sufficient to prove the theorem for one norm, say \(\ell _0\). We reduce from Perfect \(r\) -Set Matching. Let \(\mathcal {H}\) be an r-uniform hypergraph. Denote by \(v_1,\ldots ,v_n\) the vertices and by \(E_1,\ldots ,E_m\) the hyperedges of \(\mathcal {H}\), respectively. We assume that n is divisible by r, as otherwise \(\mathcal {H}\) has no perfect matching. We also assume that \(r\ge 3\) because for \(r\le 2\), Perfect \(r\) -Set Matching can be solved in polynomial time [52].

We construct the instance \((\textbf{X},k,B)\) of Decision Equal Clustering, where \(\textbf{X}\) is a collection of \((r-1)n+rm\) points of \(\{0,1\}^d\), where \(d=2rn\).

To describe the construction of \(\textbf{X}\), we partition the set \(\{1,\ldots ,2rn\}\) of coordinate indices into n blocks \(R_1,\ldots ,R_n\) of size 2r each. For every \(i\in \{1,\ldots ,n\}\), we select an index \(p_i\in R_i\) and set \(R_i'=R_i\setminus \{p_i\}\). Formally,

  • \(R_i=\{2r(i-1)+1,\ldots ,2ri\}\) for \(i\in \{1,\ldots ,n\}\),

  • \(p_i=2r(i-1)+1\) for \(i\in \{1,\ldots ,n\}\), and

  • \(R_i'=\{2r(i-1)+2,\ldots ,2ri\}\) for \(i\in \{1,\ldots ,n\}\).

The set of points \(\textbf{X}\) consists of \(n+m\) blocks of equal points \(V_1,\ldots ,V_n\) and \(F_1,\ldots ,F_m\), where \(| V_i| =r-1\) for each \(i\in \{1,\ldots ,n\}\) and \(| F_i| =r\) for \(i\in \{1,\ldots ,m\}\). Each block \(V_i\) is used to encode the vertex \(v_i\), and each block \(F_i\) is used to encode the corresponding hyperedge \(E_i\). An example is shown in Fig. 1.

For each \(i\in \{1,\ldots ,n\}\), we define the vector \(\textbf{v}_i\in \{0,1\}^{2rn}\) corresponding to the vertex \(v_i\) of \(\mathcal {H}\):

$$\begin{aligned} \textbf{v}_i[j]= {\left\{ \begin{array}{ll} 1&{}\text{ if } j\in R_i,\\ 0&{}\text{ otherwise }. \end{array}\right. } \end{aligned}$$

Then \(V_i\) consists of \(r-1\) copies of \(\textbf{v}_i\) that we denote \(\textbf{v}_i^{(1)},\ldots ,\textbf{v}_i^{(r-1)}\).

For every \(j\in \{1,\ldots ,m\}\), we define the vector \(\textbf{f}_j\in \{0,1\}^{2rn}\) corresponding to the hyperedge \(E_j=\{v_{i_1^{(j)}},\ldots ,v_{i_r^{(j)}}\}\):

$$\begin{aligned} \textbf{f}_j[h]= {\left\{ \begin{array}{ll} 1&{}\text{ if } h=p_s\text { for some }s\in \{i_1^{(j)},\ldots ,i_r^{(j)}\},\\ 0&{}\text{ otherwise }. \end{array}\right. } \end{aligned}$$

Then \(F_j\) includes r copies of \(\textbf{f}_j\) denoted by \(\textbf{f}_j^{(1)},\ldots ,\textbf{f}_j^{(r)}\).

Fig. 1

The construction of \(\textbf{X}\) for \(\mathcal {H}\) with \(V(\mathcal {H})=\{v_1,\ldots ,v_6\}\) and the hyperedges \(E_1=\{v_1,v_2,v_3\}\), \(E_2=\{v_4,v_5,v_6\}\), \(E_3=\{v_1,v_3,v_5\}\), and \(E_4=\{v_2,v_4,v_5\}\). The collection of the points \(\textbf{X}\) is shown here as a matrix, where each column is a point of \(\textbf{X}\). Note that \(r=3\) here. The blocks of \(\textbf{X}\) are shown by solid lines and the part of \(\textbf{X}\) corresponding to the vertices of \(\mathcal {H}\) is separated from the part corresponding to hyperedges by a double line. The blocks of coordinates with indices \(R_1,\ldots ,R_6\) are separated by solid lines. The coordinates with the indices \(p_1=1\), \(p_2=7\), \(p_3=13\), \(p_4=19\), \(p_5=25\), and \(p_6=31\) are underlined by dashed lines

To complete the construction of the instance of Decision Equal Clustering, we define

  • \(k=n+m-\frac{n}{r}\),

  • \(B=(3r-2)n\).

Recall that n is divisible by r and note that \(\frac{(r-1)n+rm}{k}=r\).

It is straightforward to verify that the construction of \((\textbf{X},k,B)\) is polynomial. We claim that the hypergraph \(\mathcal {H}\) has a perfect matching if and only if \((\textbf{X},k,B)\) is a yes-instance of Decision Equal Clustering. The proof uses the following property of the points of \(\textbf{X}\): for every \(i\in \{1,\ldots ,n\}\) and every \(j\in \{1,\ldots ,m\}\),

$$\begin{aligned} \Vert \textbf{v}_i-\textbf{f}_j\Vert _0= {\left\{ \begin{array}{ll} 3r-2&{}\text{ if } v_i\in E_j,\\ 3r&{}\text{ if } v_i\notin E_j. \end{array}\right. } \end{aligned}$$
(16)
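Property (16) is easy to check mechanically; the following sketch (ours, with 0-based indices) builds the points for the instance of Fig. 1 and verifies it:

```python
def build_points(n, triples, r):
    """Vertex columns v_i and hyperedge columns f_j of the reduction;
    0-based, so R_i = {2ri, ..., 2r(i+1)-1} and p_i = 2ri."""
    d = 2 * r * n
    def v(i):
        return [1 if 2 * r * i <= h < 2 * r * (i + 1) else 0 for h in range(d)]
    def f(E):
        P = {2 * r * i for i in E}
        return [1 if h in P else 0 for h in range(d)]
    return [v(i) for i in range(n)], [f(E) for E in triples]

# The hypergraph of Fig. 1, with vertices renamed to 0, ..., 5.
triples = [{0, 1, 2}, {3, 4, 5}, {0, 2, 4}, {1, 3, 4}]
V, F = build_points(6, triples, r=3)
ham = lambda a, b: sum(p != q for p, q in zip(a, b))
assert all(ham(V[i], F[j]) == (7 if i in triples[j] else 9)   # 3r-2 vs. 3r
           for j in range(len(triples)) for i in range(6))
```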

For the forward direction, assume that \(\mathcal {H}\) has a perfect matching M. Assume without loss of generality that \(M=\{E_1,\ldots ,E_s\}\) for \(s=\frac{n}{r}\). Since M is a perfect matching, for every \(i\in \{1,\ldots ,n\}\), there is a unique \(h_i\in \{1,\ldots ,s\}\) such that \(v_i\in E_{h_i}\). We construct the equal k-clustering \(\{X_1,\ldots ,X_k\}\) as follows.

For every \(i\in \{1,\ldots ,n\}\), we define \(X_i=V_i\cup \{\textbf{f}_{h_i}^{(t)}\}\), where t is chosen from the set \(\{1,\ldots ,r\}\) in such a way that \(X_1,\ldots ,X_n\) are disjoint. In words, we initiate each cluster \(X_i\) by setting \(X_i:=V_i\) for \(i\in \{1,\ldots ,n\}\). This way, we obtain n clusters of size \(r-1\) each. Then we consider the blocks of points \(F_1,\ldots ,F_s\) corresponding to the hyperedges of M and split them between the clusters \(X_1,\ldots ,X_n\) by including a single element into each cluster. It is crucial that each \(X_i=V_i\) is complemented by an element of \(F_{h_i}\), that is, by an element of the initial cluster corresponding to the hyperedge saturating the vertex \(v_i\). Since M is a perfect matching, this splitting is feasible.

Notice that the first s blocks of points \(F_1,\ldots ,F_s\) are split between \(X_1,\ldots ,X_n\). The remaining \(m-s\) blocks \(F_{s+1},\ldots , F_{m}\) have size r each and form clusters \(X_{n+1},\ldots ,X_{k}\). This completes the construction of \(\{X_1,\ldots ,X_k\}\).

To evaluate \(\textsf{cost}_0(X_1,\ldots ,X_k)\), notice that the optimal median is \(\textbf{c}_i=\textbf{v}_i\) for \(i\in \{1,\ldots ,n\}\) by the majority rule. Then, by (16), \(\textsf{cost}_0(X_i)=\Vert \textbf{v}_i-\textbf{f}_{h_i}\Vert _0=3r-2\) for \(i\in \{1,\ldots ,n\}\). Since the clusters \(X_{n+1},\ldots ,X_{k}\) consist of equal points, we have that \(\textsf{cost}_0(X_i)=0\) for \(i\in \{n+1,\ldots ,k\}\). Then \(\textsf{cost}_0(X_1,\ldots ,X_k)=(3r-2)n=B\). Therefore, \((\textbf{X},k,B)\) is a yes-instance of Decision Equal Clustering.

For the opposite direction, let \(\{X_1,\ldots ,X_k\}\) be an equal k-clustering of \(\textbf{X}\) of cost at most B. Denote by \(\textbf{c}_1,\ldots ,\textbf{c}_k\) the optimal medians constructed by the majority rule. Observe that the choice of a median by the majority rule described above is not symmetric: if the i-th coordinates of the points in a cluster have the same number of zeros and ones, the rule selects the zero value for the i-th coordinate of the median. We show the following claim.

Claim 4.1

For every \(i\in \{1,\ldots ,k\}\), either \(\textbf{c}_i\in \{\textbf{v}_1,\ldots ,\textbf{v}_n\}\) or \(\textbf{c}_i[j]=0\) for all \(j\in R_1'\cup \ldots \cup R_n'\). Moreover, the medians of the first type, that is, coinciding with one of \(\textbf{v}_1,\ldots ,\textbf{v}_n\), are distinct.

Proof of Claim 4.1

Suppose that \(\textbf{c}_i[h]\ne 0\) for some \(h\in R_j'\), where \(j\in \{1,\ldots ,n\}\). Observe that, by the construction of \(\textbf{X}\), for every point \(\textbf{x}\in \textbf{X}\), \(\textbf{x}[h]=1\) only if \(\textbf{x}\in V_j\). Since \(\textbf{c}_i\) is constructed by the majority rule, we obtain that more than half of the elements of \(X_i\) are from \(V_j\) and \(\textbf{c}_i=\textbf{v}_j\). To see the second part of the claim, notice that \(| V_j| =r-1\) and, therefore, at most one cluster \(X_i\) of size r can have more than half of its elements from \(V_j\).\(\square \)

By Claim 4.1, we assume without loss of generality that \(\textbf{c}_i=\textbf{v}_i\) for \(i\in \{1,\ldots ,\ell \}\) for some \(\ell \in \{0,\ldots ,n\}\) (\(\ell =0\) if there is no cluster with the median from \(\{\textbf{v}_1,\ldots ,\textbf{v}_n\}\)) and \(\textbf{c}_i[j]=0\) for \(j\in R_1'\cup \ldots \cup R_n'\) whenever \(i\in \{\ell +1,\ldots ,k\}\). Because the medians \(\textbf{c}_1,\ldots ,\textbf{c}_\ell \) are equal to points of \(\textbf{X}\), by Lemma 9, we can assume that \(V_i\subset X_i\) for \(i\in \{1,\ldots ,\ell \}\).

Claim 4.2

\(\ell =n\).

Proof of Claim 4.2

The proof is by contradiction. Assume that \(\ell <n\). Consider the elements of the \(n-\ell \) blocks \(V_{\ell +1},\ldots ,V_n\). Let p be the number of elements of \(V_{\ell +1}\cup \ldots \cup V_n\) included in \(X_1,\ldots ,X_\ell \); the remaining \(q=(r-1)(n-\ell )-p\) elements are in \(X_{\ell +1},\ldots ,X_k\). By the definition of \(\textbf{v}_1,\ldots ,\textbf{v}_n\), if a point \(\textbf{v}_h^{(t)}\in V_h\) for some \(h\in \{\ell +1,\ldots ,n\}\) is in \(X_i\) for some \(i\in \{1,\ldots ,\ell \}\), then \(\Vert \textbf{v}_h^{(t)}-\textbf{c}_i\Vert _0=\Vert \textbf{v}_h-\textbf{v}_i\Vert _0=4r\). Also we have that if \(\textbf{v}_h^{(t)}\in V_h\) for some \(h\in \{\ell +1,\ldots ,n\}\) is in \(X_i\) for some \(i\in \{\ell +1,\ldots ,k\}\), then \(\Vert \textbf{v}_h^{(t)}-\textbf{c}_i\Vert _0=\Vert \textbf{v}_h-\textbf{c}_i\Vert _0\ge | R_h'| =2r-1\). By (16), if the unique point of \(X_i\setminus V_i\) is \(\textbf{f}_h^{(t)}\in F_h\) for some \(h\in \{1,\ldots ,m\}\), then \(\Vert \textbf{f}_h^{(t)}- \textbf{c}_i\Vert _0=\Vert \textbf{f}_h-\textbf{v}_i\Vert _0\ge 3r-2\). Then \(\sum _{i=1}^\ell \textsf{cost}_0(X_i)\ge 4rp+(3r-2)(\ell -p)\) and \(\sum _{i=\ell +1}^k\textsf{cost}_0(X_i)\ge (2r-1)q\). Recall also that \(r\ge 3\) and, therefore, \(r+2\le 2r-1\) and \((r+2)(r-1)>3r-2\). Summarizing, we obtain that

$$\begin{aligned} \textsf{cost}_0(X_1,\ldots ,X_k)&=\sum _{i=1}^\ell \textsf{cost}_0(X_i)+\sum _{i=\ell +1}^k\textsf{cost}_0(X_i)\\ &\ge \big (4rp+(3r-2)(\ell -p)\big )+(2r-1)q\\ &=(3r-2)\ell +(r+2)p+(2r-1)q\\ &\ge (3r-2)\ell +(r+2)(p+q)\\ &=(3r-2)\ell +(r+2)(r-1)(n-\ell )\\ &>(3r-2)n=B, \end{aligned}$$

but this contradicts that \(\textsf{cost}_0(X_1,\ldots ,X_k)\le B\). This proves the claim.\(\square \)

By Claim 4.2, we obtain that \(\textbf{c}_i=\textbf{v}_i\) and \(V_i\subset X_i\) for \(i\in \{1,\ldots ,n\}\). For every \(i\in \{1,\ldots ,n\}\), \(X_i\setminus V_i\) contains a unique point. Clearly, this is a point from \(F_1\cup \cdots \cup F_m\). Denote by \(\textbf{f}_{h_i}^{(t_i)}\) the unique point of \(X_i\setminus V_i\) for \(i\in \{1,\ldots ,n\}\). By (16), \(\Vert \textbf{c}_i-\textbf{f}_{h_i}^{(t_i)}\Vert _0=\Vert \textbf{c}_i-\textbf{f}_{h_i}\Vert _0\ge 3r-2\) for every \(i\in \{1,\ldots ,n\}\). This means that

$$\begin{aligned} B\ge&\textsf{cost}_0(X_1,\ldots ,X_k)=\sum _{i=1}^n\textsf{cost}_0(X_i)+\sum _{i=n+1}^k\textsf{cost}_0(X_i) \ge \sum _{i=1}^n\textsf{cost}_0(X_i)\\ \ge&(3r-2)n=B. \end{aligned}$$

Therefore, \(\sum _{i=n+1}^k\textsf{cost}_0(X_i)=0\). Hence, the \(k-n=m-s\) clusters \(X_{n+1},\ldots ,X_k\subseteq F_1\cup \cdots \cup F_m\), where \(s=\frac{n}{r}\), consist of equal points. Without loss of generality, we assume that \(F_{s+1},\ldots ,F_m\) form these clusters. Then the elements of \(F_1,\ldots ,F_s\) are split to complement \(V_1,\ldots ,V_n\) to form \(X_1,\ldots ,X_n\). In particular, for every \(i\in \{1,\ldots ,n\}\), there is \(\textbf{f}_{h_i}^{(t_i)}\in X_i\) for some \(h_i\in \{1,\ldots ,s\}\) and \(t_i\in \{1,\ldots ,r\}\).

We claim that \(M=\{E_1,\ldots ,E_s\}\) is a perfect matching of \(\mathcal {H}\). To show this, consider a vertex \(v_i\in V(\mathcal {H})\). We prove that \(v_i\in E_{h_i}\). For the sake of contradiction, assume that \(v_i\notin E_{h_i}\). Then \(\Vert \textbf{f}_{h_i}^{(t_i)}-\textbf{c}_i\Vert _0=\Vert \textbf{f}_{h_i}-\textbf{v}_i\Vert _0=3r\) by (16) and

$$\begin{aligned} \textsf{cost}_0(X_1,\ldots ,X_k)=&\sum _{j=1}^n\textsf{cost}_0(X_j)\ge \sum _{j=1}^n\Vert \textbf{f}_{h_j}^{(t_j)}-\textbf{c}_j\Vert _0= \sum _{j=1}^n\Vert \textbf{f}_{h_j}-\textbf{v}_j\Vert _0\\ \ge&(3r-2)n+2>B; \end{aligned}$$

a contradiction with \(\textsf{cost}_0(X_1,\ldots ,X_k)\le B\). Hence, every vertex of \(V(\mathcal {H})\) is saturated by some hyperedge of M. Since \(| M| =s=\frac{n}{r}\), we have that the hyperedges of M are pairwise disjoint, that is, M is a matching. Since every vertex is saturated and M is a matching, M is a perfect matching.

This concludes the proof of our claim that \(\mathcal {H}\) has a perfect matching if and only if \((\textbf{X},k,B)\) is a yes-instance of Decision Equal Clustering.

Observe that \(B=(3r-2)n\) in the reduction, meaning that \(B=\mathcal {O}(n^2)\). Since Decision Equal Clustering is in \(\textsf{NP}\), there is a polynomial reduction from Decision Equal Clustering to Perfect \(r\) -Set Matching. Thus, if Decision Equal Clustering has a polynomial kernel when parameterized by B, then Perfect \(r\) -Set Matching has a polynomial kernel when parameterized by the number of vertices of the input hypergraph. This leads to a contradiction with Corollary 1 and completes the proof of the theorem.\(\square \)

4.2 Polynomial Kernel for \(k+B\) Parameterization

In this subsection, we prove Theorem 3 that we restate here.

Theorem 3

For every nonnegative integer constant p, Decision Equal Clustering admits a polynomial kernel when parameterized by k and B, where the output collection of points has \(\mathcal {O}(kB)\) points of \(\mathbb {Z}^{d'}\) with \(d'= \mathcal {O}(kB^{p+1})\) and each coordinate of a point takes an absolute value of \(\mathcal {O}(kB^2)\).

Proof

Let \((\textbf{X},k,B)\) be an instance of Decision Equal Clustering with \(\textbf{X}=\{\textbf{x}_1,\ldots ,\textbf{x}_n\}\), where the points are from \(\mathbb {Z}^d\). Recall that n is divisible by k.

Suppose \(\frac{n}{k}\ge 4B+1\). Then we can apply the algorithm from Lemma 6. If the algorithm returns that there is no equal k-clustering of cost at most B, then the kernelization algorithm returns a trivial no-instance of Decision Equal Clustering. Otherwise, \(\textsf{Opt}(\textbf{X},k)\le B\) and the kernelization algorithm returns a trivial yes-instance.

Assume from now that \(\frac{n}{k} \le 4B\), that is, \(n \le 4Bk\). Then we apply the algorithm from Lemma 7. If this algorithm reports that there is no equal k-clustering of cost at most B, then the kernelization algorithm returns a trivial no-instance of Decision Equal Clustering. Otherwise, the algorithm from Lemma 7 returns a collection of \(n\le 4Bk\) points \(\textbf{Y}=\{\textbf{y}_1,\ldots ,\textbf{y}_n\}\) of \(\mathbb {Z}^{d'}\) satisfying conditions (i)–(iii) of the lemma. By (i), we obtain that the instances \((\textbf{X},k,B)\) and \((\textbf{Y},k,B)\) of Decision Equal Clustering are equivalent. By (ii), we have that the dimension \(d'= \mathcal {O}(kB^{p+1})\), and by (iii), each coordinate of a point takes an absolute value of \(\mathcal {O}(kB^2)\). Thus, \((\textbf{Y},k,B)\) is a required kernel.\(\square \)

5 APX-hardness of Equal Clustering

In this section, we prove APX-hardness of Decision Equal Clustering w.r.t. the Hamming (\(\ell _0\)) and \(\ell _1\) distances. The constructed hard instances consist of high-dimensional binary (0/1) points. As the \(\ell _0\) and \(\ell _1\) distances between any two binary points are the same, we focus on the case of \(\ell _0\) distances. Our reduction is from 3-Dimensional Matching (3DM), where we are given three disjoint sets of elements X, Y, and Z such that \(| X| =| Y| =| Z|=n\) and a set of m triples \(T\subseteq X\times Y\times Z\). In addition, each element of \(W:= X\cup Y\cup Z\) appears in at most 3 triples. A set \(M \subseteq T\) is called a matching if no element of W is contained in more than one triple of M. The goal is to find a maximum cardinality matching. We need the following theorem due to Petrank [56].

Theorem 4

(Restatement of Theorem 4.4 from [56]) There exists a constant \(0<\gamma < 1\), such that it is NP-hard to distinguish the instances of the 3DM problem in which a perfect matching exists, from the instances in which there is a matching of size at most \((1-\gamma )n\).

Here \(\gamma \) should be seen as a very small constant close to 0. We use the construction described in Section 4.1, with a small modification.

We are given an instance of 3DM. Let \(N=3n\), the total number of elements. We construct a binary matrix A of dimension \(6N\times (2N+3m)\). For each element, we take 2 columns and for each triple 3 columns. The 6N row indexes are partitioned into N parts each of size 6. In particular, let \(R_1=\{1,\ldots ,6\}\), \(R_2=\{7,\ldots ,12\}\) and so on. For the i-th element, we construct the column \(a_i\) of length 6N which has 1 corresponding to the indexes in \(R_i\) and 0 elsewhere.

Recall that each element can appear in at most 3 triples. For each element x, consider any arbitrary ranking of the triples that contain it. The occurrence of x in a triple with rank j is called its j-th occurrence for \(1\le j \le 3\). For example, suppose x appears in three triples \(t_w, t_y\) and \(t_z\). One can consider the ranking \(1. t_w, 2. t_y, 3. t_z\). Then, the occurrence of x in \(t_y\) is called 2-nd occurrence. Let \(v_i^j\) be the j-th index of \(R_i\) for \(1\le i\le N, 1\le j\le 3\). For each triple t with \(j_1\)-, \(j_2\)- and \(j_3\)-th occurrences of the elements pq and r in t, respectively, we construct the column \(b_t\) of length 6N which has 1 corresponding to the indexes \(v_p^{j_1}, v_q^{j_2}\) and \(v_r^{j_3}\), and 0 elsewhere.
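To make the occurrence-based indexing concrete, here is a small sketch (ours; 0-based indices, with occurrence ranks assigned in the order the triples are listed) that builds the triple columns and checks that distinct triple columns are at Hamming distance 6, as observed next:

```python
def triple_columns(N, triples):
    """Column b_t has a 1 at index v_x^j = 6x + j for each element x of t,
    where j is the rank of this occurrence of x (at most 3 occurrences)."""
    d = 6 * N
    occ = {}                                   # element -> occurrences seen
    cols = []
    for t in triples:
        ones = set()
        for x in t:
            j = occ.get(x, 0)                  # this is x's j-th occurrence
            occ[x] = j + 1
            ones.add(6 * x + j)
        cols.append([1 if h in ones else 0 for h in range(d)])
    return cols

cols = triple_columns(4, [(0, 1, 2), (0, 1, 3), (0, 2, 3)])
ham = lambda a, b: sum(p != q for p, q in zip(a, b))
assert all(ham(cols[i], cols[j]) == 6
           for i in range(len(cols)) for j in range(i + 1, len(cols)))
```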

The triple columns are defined differently from our reduction in Section 4.1, where for each triple and each element a fixed index is set to 1; here we set different indexes based on the occurrences of the element. This ensures that for two different triple columns \(b_s\) and \(b_t\), their Hamming distance is \(d_H(b_s,b_t)=6\). Note that \(d_H(a_i,b_t)=7\) if the element i is in triple t, and otherwise \(d_H(a_i,b_t)=9\). Set the cluster size to be 3 and the number of clusters k to be \((2N/3)+m\). We will prove the following lemma.

Lemma 10

If there is a perfect matching, there is a feasible clustering of cost 7N. If all matchings have size at most \((1-\gamma )n\), any feasible clustering has cost at least \(7(1-\gamma )N+(23/3)\gamma N\).

Note that it is sufficient to prove the above lemma for showing the APX-hardness of the problem. The proof of the first part of the lemma is exactly the same as in the previous construction. We will prove the second part. To give some intuition about the cost, suppose there is a matching of the maximum size \((1-\gamma )n\). Then we can cluster the matched elements and triples in the same way as in the perfect matching case by paying a cost of \(7(1-\gamma )N\). Now for each unmatched element, we put its two columns in a cluster. Now we have \(\gamma N\) clusters with one free slot in each. One can fill in these slots by columns corresponding to \(\gamma N/3\) unmatched triples. All the remaining unmatched triples form their own clusters. Now, consider an unmatched triple s whose 3 columns are used to fill in slots of unmatched elements p, q, and r. As this triple was not matched, it cannot contain all these three elements, i.e., it can contain at most 2 of these elements. Thus, for at least one element, the cost of the cluster must be 9. Therefore, the total cost of the three clusters corresponding to p, q, and r is at least \(7+7+9=23\). The total cost corresponding to all \(\gamma N/3\) unmatched triples is then \((23/3)\gamma N\). We will show that one cannot find a feasible clustering of lesser cost.

For our convenience, we will prove the contrapositive of the second part of the above lemma: if there is a feasible clustering of cost less than \(7(1-\gamma )N+(23/3)\gamma N\), then there is a matching of size greater than \((1-\gamma )n\). So, assume that there is such a clustering. Let \(c_1,c_2,\ldots ,c_k\) be the cluster centers.

By Lemma 9, we can assume that if a column f of A is a center of a cluster C, all the columns equal to f are in C. We will use this in the following. A center \(c_i\) is called an element center if \(c_i\) is an element column. Suppose the given clustering contains \(\ell \) clusters with element centers for some \(\ell \). WLOG, we assume that these are the first \(\ell \) clusters.

Lemma 11

If the cost of the given clustering is less than \(7(1-\gamma )N+(23/3)\gamma N\), then \(\ell > (1-2\gamma /9)N\).

Proof

Note that if a cluster center is an element column, then by Lemma 9 we can assume that both element columns are present in the cluster. Thus, in our case, each of the first \(\ell \) clusters contains two element columns and some other column. Now, each of these \(\ell \) other columns can be either a column of some other element or a triple column. Let \(\ell _1\) of these be element columns and \(\ell _2\) of these be triple columns, where \(\ell =\ell _1+\ell _2\). For each cluster corresponding to these \(\ell _1\) element columns, the cost is 12, as \(d_H(a_i,a_j)=12\) for all \(i\ne j\). Similarly, for each cluster corresponding to the \(\ell _2\) triple columns, the cost is at least 7, as \(d_H(a_i,b_t)\ge 7\) for all i, t.

Note that out of the 2N element columns, \(2\ell +\ell _1\) are in the first \(\ell \) clusters. The rest of the element columns are in the other clusters. Now there can be two cases: such a column is in a cluster that contains (i) at least 2 element columns or (ii) exactly one element column.

Claim 5.1

The cost of each element column which is not in the first \(\ell \) clusters is at least 5 in the first case.

Proof

Consider such a column \(a_i\) and let \(c_j\) be the center of the cluster that contains \(a_i\). Note that the only 1 entries in \(a_i\) are those corresponding to the indexes in \(R_i\). We claim that at most one entry of \(c_j\) corresponding to the indexes in \(R_i\) can be 1. This proves the original claim, as \(| R_i| =6\). Consider an index \(z \in R_i\) such that \(c_j[z]=1\). As \(c_j\) is not an element column and the centers are defined based on the majority rule, there is a column e in the cluster, other than \(a_i\), with \(e[z]=1\) (e cannot be the second copy of \(a_i\), as then \(c_j\) would be equal to \(a_i\) by the majority rule). This must be a column of a triple that contains the element i. By construction, e does not contain 1 corresponding to the indexes in \(R_i \setminus \{z\}\). As the third column in the cluster is another element column (as we are in the first case), its entries corresponding to the indexes in \(R_i\) are again 0. Hence, by the majority rule, at most one entry of \(c_j\) corresponding to the indexes in \(R_i\) can be 1.\(\square \)

Next, we consider case (ii).

Claim 5.2

Consider a cluster that is not one of the first \(\ell \) clusters and contains exactly one element column. Then, its cost is at least 5. Moreover, the cost of the element column is at least 4.

Proof

Consider the element column \(a_i\) of the cluster and let \(c_j\) be the center of the cluster. Note that the only 1 entries in \(a_i\) are those corresponding to the indexes in \(R_i\). Now, if the other two (triple) columns in the cluster are the same, they have at most one 1 entry at the indexes in \(R_i\); this is true by the construction of the triple columns. Hence, in this case, at most one entry of \(c_j\) corresponding to the indexes in \(R_i\) can be 1 and the cost is at least 5. Otherwise, there are two distinct triple columns \(b_s\) and \(b_t\) in the cluster and at most two indexes \(z_1,z_2 \in R_i\) such that \(z_1\ne z_2\) and \(b_s[z_1]=b_t[z_2]=1\). By the construction of the triple columns, there are no other indexes \(z \in R_i \setminus \{z_1,z_2\}\) such that \(b_s[z]=1\) or \(b_t[z]=1\). Thus, by the majority rule, at most two entries of \(c_j\) corresponding to the indexes in \(R_i\) can be 1. Hence, the cost of \(a_i\) is at least 4. Now, as \(b_s\) and \(b_t\) are distinct, the cost of at least one of them must be at least 1. It follows that the cost of this cluster is at least 5. \(\square \)

Now, consider again the \(2N-2\ell -\ell _1\) element columns that are not in the first \(\ell \) clusters. Let \(\kappa \) be the number of clusters that are not among the first \(\ell \) clusters and contain exactly one element column. Then \(2N-2\ell -\ell _1-\kappa \) element columns are contained in clusters that are not among the first \(\ell \) clusters and contain at least two element columns. By Claim 5.1, the cost of each such column is at least 5. By Claim 5.2, the cost of each of the \(\kappa \) clusters defined above is at least 5.

It follows that the total cost of the clustering is at least \(12\ell _1+7\ell _2+5(2N-2\ell -\ell _1-\kappa )+5\kappa =7(\ell _1+\ell _2)+10N-10\ell =10N-3\ell \), as \(\ell =\ell _1+\ell _2\). Since the cost is less than \(7(1-\gamma )N+(23/3)\gamma N\), we obtain

$$\begin{aligned}&10N-3\ell< 7(1-\gamma )N+(23/3)\gamma N=7N+2\gamma N/3\\&3N-3\ell < 2\gamma N/3\\&\ell > (1-2\gamma /9)N \end{aligned}$$

\(\square \)
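The simplification to \(10N-3\ell \) above is routine and can be verified symbolically; a minimal sympy sketch (variable names are ours):

```python
# Symbolic sanity check (ours) of the cost simplification in the lemma above.
from sympy import simplify, symbols

N, ell1, ell2, kappa = symbols('N ell1 ell2 kappa')
ell = ell1 + ell2
cost = 12*ell1 + 7*ell2 + 5*(2*N - 2*ell - ell1 - kappa) + 5*kappa
assert simplify(cost - (10*N - 3*ell)) == 0
```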

As before, let \(\ell _2\) be the number of clusters among the first \(\ell \) clusters that contain a triple column.

Claim 5.3

\(\ell _2 > (1-2\gamma /3)N\).

Proof

Again consider the cost of the given clustering. The cost of each of the \(\ell _2\) clusters is at least 7, and the cost of each of the remaining \(\ell -\ell _2\) clusters is exactly 12, as before. Now, as \(\ell > (1-2\gamma /9)N\) by Lemma 11,

$$\begin{aligned}&7\ell _2+12((1-2\gamma /9)N-\ell _2)< 7(1-\gamma )N+(23/3)\gamma N=7N+2\gamma N/3\\&7\ell _2+12N-24\gamma N/9-12\ell _2 < 7N +2\gamma N/3\\&5\ell _2> 5N - 30\gamma N/9\\&\ell _2 > (1-2\gamma /3)N \end{aligned}$$

\(\square \)
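The boundary value for \(\ell _2\) can again be checked symbolically; a small sketch in the same style as before (names ours):

```python
# Symbolic check (ours): the threshold for ell2 in Claim 5.3.
from sympy import Rational, simplify, solve, symbols

N, gamma, ell2 = symbols('N gamma ell2')
lhs = 7*ell2 + 12*((1 - Rational(2, 9)*gamma)*N - ell2)
rhs = 7*N + Rational(2, 3)*gamma*N
boundary = solve(lhs - rhs, ell2)[0]  # lhs < rhs iff ell2 > boundary
assert simplify(boundary - (1 - Rational(2, 3)*gamma)*N) == 0
```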

We now show that, among the \(\ell _2\) elements corresponding to these \(\ell _2\) clusters, more than \((1-\gamma )N\) elements can be matched.

Fig. 2 Hierarchy of the clusters. Illustration of the proof of Lemma 12

Lemma 12

There is a matching that matches more than \((1-\gamma )N\) elements.

Proof

Consider the set of elements corresponding to the \(\ell _2\) clusters, each of which contains a triple column. Let M be a maximum matching involving these elements and triples, and let \(\mu \) be the number of elements it matches. We will show that \(\mu > (1-\gamma )N\). The total cost of the clusters corresponding to these matched elements is \(7\mu \). Let \(\ell _{1}\) be the number of clusters among the first \(\ell \) clusters that consist of element columns only (see Fig. 2). The total cost of these clusters is \(12\ell _{1}\), and \(3{\ell _1}\) columns are involved in them. For the remaining at least \(2(N-\mu ) - 3{\ell _1}\) element columns, corresponding to at least \(N-\mu - 3{\ell _1}/2\) elements, the two columns of each element are either in one cluster along with a triple column or split into two clusters. Let \(\ell _3\) be the number of elements whose columns are in one cluster along with a triple column, and let \(\ell _4\) be the number of the remaining elements, whose columns are split into two clusters (see Fig. 2). By Claims 5.1 and 5.2, the cost of each split column is at least 4. Thus, the total cost corresponding to these \(\ell _4\) elements is at least \(8\ell _4\).

Next, we compute the cost corresponding to the \(\ell _3\) elements whose columns are in one cluster along with a triple column. Consider the triples involved in these \(\ell _3\) clusters, and let \(T_1\) be the set of triples all three of whose columns appear in these clusters. The total cost of the three columns of each triple in \(T_1\) is at least \(7+7+9=23\): if all three of the corresponding clusters had cost 7, then all three elements of the triple would be unmatched and the triple could be added to M, contradicting the maximality of the matching. Let \(\ell _5\) be the number of clusters among the \(\ell _3\) clusters in which no triple of \(T_1\) appears, and let \(T_2\) be the set of triples associated with these clusters. Each triple in \(T_2\) thus appears in at most two clusters among the \(\ell _3\) clusters (see Fig. 2). Let \(T_3\subseteq T_2\) be the set of triples each of which is associated only with clusters of cost 7, and let \(\ell _6\) be the number of these clusters. As these triples are not part of the maximum matching, each of them can cover at most two unmatched elements; thus, the size of \(T_3\) is at least \(\ell _6/2\). Note that, by definition, at least one column of each such triple does not belong to the first \(\ell \) clusters.

We now bound the cost of these triple columns. If such a triple column appears in a cluster consisting of triple columns only, its cost is at least 3, by the construction of the triple columns and because two copies of the same column cannot appear in the cluster. If such a triple column is in a cluster with exactly one element column, its cost is at least 2, as at most one 1 entry of the element column can coincide with the 1 entries of the triple column. Finally, if such a triple column appears in a cluster with two element columns, then its cost is at least 1; however, the cost of the two element columns is then at least 10. We charged each such element column a cost of 4 when charging the split columns corresponding to the \(\ell _4\) elements, so we can charge the additional \(10-8=2\) cost to these element columns. Instead, we charge it to the triple column, whose charged cost thus becomes \(1+2=3\). Therefore, the total cost corresponding to the triples in \(T_3\) is at least \((\ell _6/2)\cdot 2\).

The total cost of the clustering is at least,

$$\begin{aligned}&7\mu +12\ell _1+8\ell _4+23| T_1|+(\ell _5-\ell _6)((7+9)/2)+7\ell _6+(\ell _6/2)\cdot 2\\ =&7\mu +12\ell _1+8\ell _4+(23/3)(\ell _3-\ell _5)+8\ell _5 \qquad \qquad \text { (as } 3| T_1|=\ell _3-\ell _5)\\ \ge&7\mu +12\ell _1+8\ell _4+(23/3)\ell _3 \\ \ge&7\mu +12\ell _1+(23/3)(\ell _3+\ell _4)\\ \ge&7\mu +12\ell _1+(23/3)(N-\mu -3{\ell _1}/2) \qquad \qquad \text { (as } \ell _3+\ell _4\ge N-\mu -3\ell _1/2)\\ =&7\mu + (23/3)(N-\mu )+{\ell _1}/2\\ \ge&(23/3)N-(2/3)\mu \qquad \qquad \text { (as } \ell _1\ge 0) \end{aligned}$$

Recall that the cost of the clustering is strictly less than \(7N+(2/3)\gamma N\). Thus,

$$\begin{aligned}&(23/3)N-(2/3)\mu< 7N+(2/3)\gamma N\\&(23/3-7)N-(2/3)\gamma N< (2/3)\mu \\&(2/3)(1-\gamma )N < (2/3)\mu \\&\mu > (1-\gamma )N \end{aligned}$$

\(\square \)
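The only nontrivial algebraic step in the long chain above, the penultimate equality, can likewise be verified symbolically (names ours):

```python
# Symbolic check (ours) of the penultimate equality in the chain above.
from sympy import Rational, simplify, symbols

N, mu, ell1 = symbols('N mu ell1')
lhs = 7*mu + 12*ell1 + Rational(23, 3)*(N - mu - Rational(3, 2)*ell1)
rhs = 7*mu + Rational(23, 3)*(N - mu) + ell1/2
assert simplify(lhs - rhs) == 0
```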

We summarize the results of this section in the following theorem.

Theorem 5

There exists a constant \(\varepsilon _c > 0\) such that it is \({{\,\mathrm{\textsf{NP}}\,}}\)-hard to obtain a \((1+\varepsilon _c)\)-approximation for Equal Clustering with \(\ell _0\) (or \(\ell _1\)) distances, even if the input points are binary, that is, belong to \(\{0,1\}^d\).

6 Conclusion

We initiated the study of lossy kernelization for clustering problems and proved that Parameterized Equal Clustering admits a 2-approximate kernel. It is natural to ask whether the approximation factor can be improved. In particular, does the problem admit a polynomial-size approximate kernelization scheme (PSAKS), the lossy-kernelization analog of a PTAS (we refer to [8] for the definition)? Note that we proved that Equal Clustering is APX-hard, which rules out a PTAS (unless \(\textsf{P}={{\,\mathrm{\textsf{NP}}\,}}\)) and makes the question about a PSAKS natural. We also believe it is interesting to consider variants of the studied problems for means instead of medians. Here, the cost of a collection of points \(\textbf{X}\subseteq \mathbb {Z}^d\) is defined as \(\min _{\textbf{c}\in \mathbb {R}^d}\sum _{\textbf{x}\in \textbf{X}}\Vert \textbf{c}-\textbf{x}\Vert _p^p\) for \(p\ge 1\). Clearly, for \(p=1\), that is, for the Manhattan norm, our results hold. However, for \(p\ge 2\), our results do not translate directly because our arguments rely on the triangle inequality. We conclude by underlining our belief that lossy kernelization may become a natural tool in the fruitful area of approximation algorithms for clustering problems.
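To make the distinction between the median (\(p=1\)) and means (\(p\ge 2\)) objectives concrete, the following sketch (ours) evaluates both costs on a toy point set, using the standard facts that the coordinate-wise median minimizes the \(p=1\) cost and the coordinate-wise mean minimizes the \(p=2\) cost:

```python
# Illustration (ours): the cost min_c sum_x ||c - x||_p^p for p = 1 and p = 2.
import numpy as np

X = np.array([[0, 0], [0, 2], [4, 2]], dtype=float)

c1 = np.median(X, axis=0)        # optimal center for p = 1 (per coordinate)
c2 = X.mean(axis=0)              # optimal center for p = 2 (per coordinate)

cost_p1 = np.abs(X - c1).sum()   # sum of Manhattan distances: 6.0
cost_p2 = ((X - c2) ** 2).sum()  # sum of squared Euclidean distances: ~13.33
print(c1, cost_p1, c2, cost_p2)
```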