
Approximating (k,ℓ)-Median Clustering for Polygonal Curves

Published: 23 February 2023


Abstract

In 2015, Driemel, Krivošija, and Sohler introduced the (k,ℓ)-median clustering problem for polygonal curves under the Fréchet distance. Given a set of input curves, the problem asks to find k median curves of at most ℓ vertices each that minimize the sum of Fréchet distances over all input curves to their closest median curve. A major shortcoming of their algorithm is that the input curves are restricted to lie on the real line. In this article, we present a randomized bicriteria-approximation algorithm that works for polygonal curves in ℝ^d and achieves approximation factor (1+ɛ) with respect to the clustering costs. The algorithm has worst-case running time linear in the number of curves, polynomial in the maximum number of vertices per curve (i.e., their complexity), and exponential in d, ℓ, 1/ɛ, and 1/δ, where δ is the failure probability. We achieve this result through a shortcutting lemma, which guarantees the existence of a polygonal curve with cost similar to that of an optimal median curve of complexity ℓ, but of complexity at most 2ℓ−2, and whose vertices can be computed efficiently. We combine this lemma with the superset sampling technique by Kumar et al. to derive our clustering result. In doing so, we describe and analyze a generalization of the algorithm by Ackermann et al., which may be of independent interest.


1 INTRODUCTION

Since the development of \(k\)-means—the pioneer of modern computational clustering—the past 65 years have brought a diversity of specialized [6, 7, 15, 21, 22, 33, 34] as well as generalized clustering algorithms [2, 5, 26]. However, in most cases, clustering of point sets was studied. Many clustering problems indeed reduce to clustering of point sets, but for sequential data like time series and trajectories—which arise in the natural sciences, medicine, sports, finance, ecology, audio/speech analysis, handwriting, and many more fields—this is not the case. Hence, we need specialized clustering methods for these purposes (cf. [1, 14, 20, 31, 32]).

A promising branch of this active research deals with \((k,\ell)\)-center and \((k,\ell)\)-median clustering—adaptations of the well-known Euclidean \(k\)-center and \(k\)-median clustering. In \((k,\ell)\)-center clustering (respectively, \((k,\ell)\)-median clustering), we are given a set of \(n\) polygonal curves in \(\mathbb {R}^d\) of complexity (i.e., the number of vertices of the curve) at most \(m\) each and want to compute \(k\) centers that minimize the objective function—just as in Euclidean \(k\)-clustering. In addition, the complexities of the centers are bounded by a constant \(\ell\) to obtain compact aggregates of the clusters, which is enabled by the continuous nature of the employed distance measure. This also prevents overfitting (see the discussions in other works [8, 17]). A great benefit of regarding the sequential data as polygonal curves is that we introduce an implicit linear interpolation. This does not require any additional storage space since we only need to store the vertices of the curves, which are the sequences at hand. We compare the polygonal curves by their Fréchet distance, which is a continuous distance measure that takes the entire course of the curves into account, not only the pairwise distances among their vertices. Therefore, irregularly sampled sequences are automatically handled by the interpolation, which is desirable in many cases. Moreover, Buchin et al. [12] showed, by using heuristics, that the \((k,\ell)\)-clustering objectives yield promising results on trajectory data.

This branch of research was formed only recently, about 20 years after Alt and Godau [4] developed an algorithm to compute the Fréchet distance between polygonal curves. Several works have since studied this type of clustering [10, 11, 12, 17, 28]. However, all of these clustering algorithms, except the approximation schemes for polygonal curves in \(\mathbb {R}\) [17] and the heuristics in the work of Buchin et al. [12], choose a \(k\)-subset of the input as centers. (This is also often called discrete clustering.) This \(k\)-subset is later simplified, or all input curves are simplified before choosing a \(k\)-subset. Either way, using these techniques, one cannot achieve an approximation factor of less than 2. This is because there need not be an input curve whose distance to its median is less than the average distance of the input curves to their median.

Driemel et al. [17], who were the first to study clustering of polygonal curves under the Fréchet distance in this setting, already overcame this problem in one dimension by defining and analyzing \(\delta\)-signatures, which are succinct representations of classes of curves that allow synthetic center curves to be constructed. However, it seems that \(\delta\)-signatures are only applicable in \(\mathbb {R}\). Here, we extend their work and obtain the first randomized bicriteria approximation algorithm for \((k,\ell)\)-median clustering of polygonal curves in \(\mathbb {R}^d\).

1.1 Related Work

Driemel et al. [17] introduced the \((k,\ell)\)-center and \((k,\ell)\)-median objectives and developed the first approximation schemes for these objectives, for curves in \(\mathbb {R}\). Furthermore, they proved that \((k,\ell)\)-center as well as \((k,\ell)\)-median clustering is NP-hard, where \(k\) is a part of the input and \(\ell\) is fixed. In addition, they showed that the doubling dimension of the metric space of polygonal curves under the Fréchet distance is unbounded, even when the complexity of the curves is bounded.

Following this work, Buchin et al. [10] developed a constant-factor approximation algorithm for \((k,\ell)\)-center clustering in \(\mathbb {R}^d\). Furthermore, they provide improved results on the hardness of approximating \((k,\ell)\)-center clustering under the Fréchet distance: the \((k,\ell)\)-center problem is NP-hard to approximate within a factor of \((1.5 - \varepsilon)\) for curves in \(\mathbb {R}\) and within a factor of \((2.25 - \varepsilon)\) for curves in \(\mathbb {R}^d\), where \(d \ge 2\), in both cases even if \(k = 1\). Furthermore, for the \((k,\ell)\)-median variant, Buchin et al. [11] proved NP-hardness using a similar reduction. Again, the hardness holds even if \(k=1\). In addition, they provided \((1+\varepsilon)\)-approximation algorithms for \((k,\ell)\)-center, as well as \((k,\ell)\)-median clustering, under the discrete Fréchet distance. Nath and Taylor [30] give improved algorithms for \((1+\varepsilon)\)-approximation of \((k,\ell)\)-median clustering under discrete Fréchet and Hausdorff distance. Recently, Meintrup et al. [28] introduced a practical \((1+\varepsilon)\)-approximation algorithm for discrete \(k\)-median clustering under the Fréchet distance, when the input adheres to a certain natural assumption (i.e., the presence of a certain number of outliers).

Our algorithms build upon the clustering algorithm of Kumar et al. [27], which was later extended by Ackermann et al. [2]. As a main result, they proved that any dissimilarity measure \(\rho (\cdot ,\cdot)\) that satisfies a so-called sampling property (i.e., given an input set \(T\) and a constant-size uniform sample \(S\) from \(T\), a median of \(S\) can be computed in time depending only on \(\vert S \vert\) and is a \((1+\varepsilon)\)-approximate median of \(T\) with high probability) admits a randomized \((1+\varepsilon)\)-approximation algorithm for the \(k\)-median problem under \(\rho\). This algorithm also exists under a weak variant of the sampling property, where a median of \(S\) does not need to be computed exactly; instead, a constant-size set of candidates is computed (in time depending only on \(\vert S \vert\), \(\varepsilon ,\) and the failure probability \(\delta\)) that contains a \((1+\varepsilon)\)-approximate median of \(T\) with high probability. It is a recursive approximation scheme that employs two phases in each call. In the so-called candidate phase, it computes candidates by taking a sample \(S\) from the input set \(T\) and running an algorithm on each subset of \(S\) of a certain size. Which algorithm to use depends on the dissimilarity measure at hand. The idea behind this is simple: if \(T\) contains a cluster \(T^\prime\) that takes a constant fraction of its size, then a constant fraction of \(S\) is from \(T^\prime\) with high probability. By brute-force enumeration of all subsets of \(S\), we find this subset \(S^\prime \subseteq T^\prime ,\) and if \(S\) is taken uniformly and independently at random from \(T,\) then \(S^\prime\) is a uniform and independent sample from \(T^\prime\). Ackermann et al. [2] proved for various metric and non-metric dissimilarity measures that \(S^\prime\) can be used for computing candidates that contain a \((1+\varepsilon)\)-approximate median for \(T^\prime\) with high probability. The algorithm recursively calls itself for each candidate to eventually evaluate these together with the candidates for the remaining clusters.

The second phase of the algorithm is the so-called pruning phase, where it partitions its input according to the candidates at hand into two sets of equal size: one with the smaller distances to the candidates and one with the larger distances to the candidates. It then recursively calls itself with the second set as input. The idea behind this is that small clusters now become large enough to find candidates for these. Furthermore, the partitioning yields a provably small error. Finally, it returns the set of \(k\) candidates that together evaluated best.
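To make the interplay of the two phases concrete, the following is a rough Python sketch of such a recursive scheme. It is a simplification for illustration only, not the actual algorithm of Ackermann et al. [2]; the helpers candidate_algo (the plugin that computes candidate medians from a sample) and dist (the dissimilarity measure) are assumptions of this sketch.

```python
import random

def k_median_scheme(T, k, candidate_algo, sample_size, dist):
    """Schematic sketch of the two-phase recursive scheme described above.
    candidate_algo(sample) is assumed to return a finite set of candidate
    medians for the sample; dist is the dissimilarity measure."""
    best = None

    def cost(curves, centers):
        return sum(min(dist(c, t) for c in centers) for t in curves)

    def recurse(remaining, centers, k_left):
        nonlocal best
        if k_left == 0 or not remaining:
            if centers and (best is None or cost(T, centers) < cost(T, best)):
                best = list(centers)
            return
        # Candidate phase: a uniform sample of `remaining` contains, w.h.p.,
        # a uniform sub-sample of every cluster taking a constant fraction of it.
        sample = [random.choice(remaining) for _ in range(sample_size)]
        for cand in candidate_algo(sample):
            recurse(remaining, centers + [cand], k_left - 1)
        # Pruning phase: recurse on the half of `remaining` that is farthest
        # from the centers chosen so far, so small clusters become large enough.
        if centers:
            ordered = sorted(remaining,
                             key=lambda t: min(dist(c, t) for c in centers))
            recurse(ordered[(len(ordered) + 1) // 2:], centers, k_left)

    recurse(list(T), [], k)
    return best
```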

1.2 Our Contributions

We present several algorithms for approximating \((1,\ell)\)-median clustering of polygonal curves under the Fréchet distance (Figure 1 presents an illustration of the operation principles of our algorithms). Although the first one, Algorithm 1, yields only a coarse approximation (factor 34), it is suitable as a plugin for the following two algorithms, Algorithms 2 and 4, due to its asymptotically fast running time. These algorithms yield better approximations (factors \(3+\varepsilon\) and \(1+\varepsilon\), respectively). In addition, Algorithms 2 and 4 are not only able to yield an approximation for the input set \(T\) but also for a cluster \(T^\prime \subseteq T\) that takes a constant fraction of \(T\). We would like to use these as plugins for the \((1+\varepsilon)\)-approximation algorithm for \(k\)-median clustering by Ackermann et al. [2], but that would require our algorithms to comply with one of the sampling properties. Recall that for an input set \(T,\) the weak sampling property requires that a constant-size set of candidates that contains with high probability a \((1+\varepsilon)\)-approximate median of \(T\) can be computed from a constant-size uniform and independent sample of \(T\). Furthermore, the running time for computing the candidates depends only on the size of the sample, \(\varepsilon ,\) and the failure probability parameter. The strong sampling property is defined similarly, but instead of a candidate set, an approximate median can be computed directly and the running time may only depend on the size of the sample. In our algorithms, however, the running time for computing the candidate set depends on \(m,\) which is a parameter of the input. In addition, our first algorithm for computing candidates, which contain a \((3+\varepsilon)\)-approximate \((1,\ell)\)-median with high probability, does not achieve the required approximation factor of \((1+\varepsilon)\). However, looking into the analysis of Ackermann et al. [2], any approximation algorithm computing candidates can be used in the recursive approximation scheme. Therefore, we decided to generalize the \(k\)-median clustering algorithm of Ackermann et al. [2]. Nath and Taylor [30] use a similar approach, but they developed yet another way to compute candidates: they define and analyze \(g\)-coverability, which is a generalization of the notion of doubling dimension, and indeed, for the discrete Fréchet distance, the proof builds upon the doubling dimension of points in \(\mathbb {R}^d\). However, the doubling dimension of polygonal curves under the Fréchet distance is unbounded, even when the complexities of the curves are bounded, and it is an open question whether \(g\)-coverability holds for the continuous Fréchet distance.

Fig. 1. From left to right: symbolic depiction of the operation principle of Algorithms 1, 2, and 4. Among all approximate \(\ell\)-simplifications (depicted in blue) of the input curves (depicted in black), Algorithm 1 returns the one that evaluates best (the solid curve) with respect to a sample of the input. Algorithm 2 does not return a single curve but a set of candidates. These include the curve returned by Algorithm 1 plus all curves with \(\ell\) vertices from the cubic grids covering balls of a certain radius centered at the vertices of an input curve that is, w.h.p., close to a median. Algorithm 4 is similar to Algorithm 2 but covers the vertices of multiple curves instead of only a single curve. We depict the best approximate median that can be generated from the grids in solid green.

We circumvent this by taking a different approach using the idea of shortcutting. It is well known that shortcutting a polygonal curve (i.e., replacing a subcurve by the line segment connecting its endpoints) does not increase its Fréchet distance to a line segment. This idea has been used before for a variety of Fréchet-distance related problems [3, 9, 16, 17]. Specifically, we introduce two new shortcutting lemmata. These lemmata guarantee the existence of good approximate medians, with complexity at most \(2\ell -2\) and whose vertices can be computed efficiently. The first one enables us to return candidates, which contain a \((3+\varepsilon)\)-approximate median for any cluster that takes a constant fraction of the input, w.h.p., and we call it simple shortcutting. The second one enables us to return candidates, which contain a \((1+\varepsilon)\)-approximate median for any cluster that takes a constant fraction of the input, w.h.p., and we call it advanced shortcutting. All in all, we obtain as our main result Theorem 1.1, following from Corollary 7.5.

Theorem 1.1.

Given a set \(T\) of \(n\) polygonal curves in \(\mathbb {R}^d\), of complexity at most \(m\) each, parameter values \(\varepsilon \in (0, 0.158]\) and \(\delta \in (0,1)\), and constants \(k,\ell \in \mathbb {N}\), there exists an algorithm, which computes a set \(C\) of \(k\) polygonal curves, each of complexity at most \(2\ell -2\), such that with probability at least \((1-\delta)\), it holds that \(\begin{align*} \text{cost}( T,C) = \sum _{\tau \in T} \min _{c \in C} \text{d}_\text{F}\;(c,\tau) \le (1+\varepsilon) \sum _{\tau \in T} \min _{c \in C^{\ast }} \text{d}_\text{F}\;(c,\tau) = (1+\varepsilon) \text{cost}( T,C^\ast ) , \end{align*}\) where \(C^{\ast }\) is an optimal \((k,\ell)\)-median solution for \(T\) under the Fréchet distance \(\text{d}_\text{F}(\cdot , \cdot)\).

The algorithm has worst-case running time linear in \(n\), polynomial in \(m,\) and exponential in \(1/\delta ,1/\varepsilon , d\) and \(\ell\).

We remark that \(\text{cost}( T,C) \lt \text{cost}( T,C^\ast )\) may happen due to the possibly increased center complexity of up to \(2\ell-2\).

1.3 Organization

The article is organized as follows. First, we present a simple and fast 34-approximation algorithm for \((1,\ell)\)-median clustering. Then, we present the \((3+\varepsilon)\)-approximation algorithm for \((1,\ell)\)-median clustering inside a cluster that takes a constant fraction of the input, which builds upon simple shortcutting and the 34-approximation algorithm. Then, we present a more practical modification of the \((3+\varepsilon)\)-approximation algorithm, which achieves a \((5+\varepsilon)\)-approximation for \((1,\ell)\)-median clustering. Following this, we present the similar but more involved \((1+\varepsilon)\)-approximation algorithm for \((1,\ell)\)-median clustering inside a cluster that takes a constant fraction of the input, which builds upon the advanced shortcutting and the 34-approximation algorithm. Finally, we present the generalized recursive \(k\)-median approximation scheme, which leads to our main result.


2 PRELIMINARIES

Here we introduce all necessary definitions. In the following, \(d \in \mathbb {N}\) is an arbitrary constant. For \(n \in \mathbb {N,}\) we denote \([n] = \lbrace 1, \ldots , n\rbrace ,\) and by \(X \uplus Y,\) we denote the union of the disjoint sets \(X,Y\). By \(\Vert \cdot \Vert ,\) we denote the Euclidean norm, and for \(p \in \mathbb {R}^d\) and \(r \in \mathbb {R}_{\ge 0}\), we denote by \(B(p,r) = \lbrace q \in \mathbb {R}^d \mid \Vert p - q \Vert \le r\rbrace\) the closed ball of radius \(r\) with center \(p\). By \(\mathcal {S}_n,\) we denote the symmetric group of degree \(n\). We give a standard definition of grids in Definition 2.1.

Definition 2.1

(Grid).

Given a number \(r \in \mathbb {R}_{\gt 0}\), for \(p=(p_1, \ldots , p_d) \in \mathbb {R}^d,\) we define by \(G(p,r) = (\lfloor p_1 / r \rfloor \cdot r, \ldots , \lfloor p_d / r \rfloor \cdot r)\) the \(r\)-grid-point of \(p\). Let \(X\) be a subset of \(\mathbb {R}^d\). The grid of cell width \(r\) that covers \(X\) is the set \(\mathbb {G}(X,r) = \lbrace G(p,r) \mid p \in X\rbrace\).

Such a grid partitions the set \(X\) into cubic regions, and for each \(r \in \mathbb {R}_{\gt 0}\) and \(p \in X,\) we have that \(\Vert p - G(p,r) \Vert \le \sqrt {d} r\). We give a standard definition of polygonal curves in Definition 2.2.
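For illustration, the following Python sketch (with helper names of our choosing) snaps a point to its \(r\)-grid-point and enumerates the grid covering a ball, which is how our algorithms later use grids around curve vertices.

```python
import itertools
import math

def grid_point(p, r):
    """The r-grid-point G(p, r) of p from Definition 2.1."""
    return tuple(math.floor(x / r) * r for x in p)

def grid_covering_ball(center, radius, r):
    """Grid of cell width r covering the closed ball B(center, radius):
    the r-grid-points of all cells that intersect the ball."""
    ranges = [range(math.floor((c - radius) / r), math.floor((c + radius) / r) + 1)
              for c in center]
    cover = []
    for idx in itertools.product(*ranges):
        g = tuple(i * r for i in idx)
        # closest point of the cell [g, g + r]^d to the ball's center
        closest = tuple(min(max(c, gi), gi + r) for gi, c in zip(g, center))
        if math.dist(closest, center) <= radius:
            cover.append(g)
    return cover

# Every point p is moved by at most sqrt(d) * r when snapped to its grid point.
p = (0.73, -1.28, 2.51)
assert math.dist(p, grid_point(p, 0.1)) <= math.sqrt(3) * 0.1
```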

Definition 2.2

(Polygonal Curve).

A (parameterized) curve is a continuous mapping \(\tau :[0,1] \rightarrow \mathbb {R}^d\). A curve \(\tau\) is polygonal, iff there exist \(v_1, \ldots , v_m \in \mathbb {R}^d\), no three consecutive on a line, called \(\tau\)’s vertices and \(t_1, \ldots , t_m \in [0,1]\) with \(t_1 \lt \dots \lt t_m\), \(t_1 = 0\) and \(t_m = 1\), called \(\tau\)’s instants, such that \(\tau\) connects every two consecutive vertices \(v_i = \tau (t_i), v_{i+1} = \tau (t_{i+1})\) by a line segment.

We call the line segments \(\overline{v_1v_2}, \ldots , \overline{v_{m-1}v_m}\) the edges of \(\tau\) and \(m\) the complexity of \(\tau\), denoted by \(\vert \tau \vert\). Sometimes we will argue about a subcurve \(\tau\) of a given curve \(\sigma\). We will then refer to \(\tau\) by restricting the domain of \(\sigma\), denoted by \(\sigma \vert _X\), where \(X \subseteq [0,1]\).
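For concreteness, the following is a minimal Python sketch of how a polygonal curve can be represented and evaluated; the class and the uniform default instants are our own choices, and the concrete parameterization is immaterial for the Fréchet distance.

```python
import bisect

class PolygonalCurve:
    """Minimal sketch of a polygonal curve (Definition 2.2): vertices
    v_1, ..., v_m, connected by line segments and evaluated by linear
    interpolation between consecutive vertices."""

    def __init__(self, vertices, instants=None):
        m = len(vertices)
        self.vertices = [tuple(v) for v in vertices]
        self.instants = instants or [i / (m - 1) for i in range(m)]

    def __call__(self, t):
        i = min(max(1, bisect.bisect_right(self.instants, t)),
                len(self.vertices) - 1)
        t0, t1 = self.instants[i - 1], self.instants[i]
        lam = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
        return tuple((1 - lam) * a + lam * b
                     for a, b in zip(self.vertices[i - 1], self.vertices[i]))

    @property
    def complexity(self):
        return len(self.vertices)

tau = PolygonalCurve([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
assert tau(0.75) == (1.0, 0.5)
```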

Definition 2.3

(Fréchet Distance).

Let \(\mathcal {H}\) denote the set of all continuous bijections \(h:[0,1] \rightarrow [0,1]\) with \(h(0) = 0\) and \(h(1) = 1\), which we call reparameterizations. The Fréchet distance between curves \(\sigma\) and \(\tau\) is defined as \(\begin{equation*} \text{d}_\text{F}(\sigma , \tau)\ =\ \inf _{h \in \mathcal {H}}\ \max _{t \in [0,1]}\ \Vert \sigma (t) - \tau (h(t)) \Vert . \end{equation*}\)

Sometimes, given two curves \(\sigma , \tau\), we will refer to an \(h \in \mathcal {H}\) as matching between \(\sigma\) and \(\tau\).
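The continuous Fréchet distance can be computed with the algorithm of Alt and Godau [4]. As a simple illustration only, the following Python sketch computes the classical discrete Fréchet distance of the vertex sequences; it matches only vertices to vertices and can therefore overestimate \(\text{d}_\text{F}\), so it is merely a stand-in, not the distance used in our analysis.

```python
from functools import lru_cache
from math import dist

def discrete_frechet(P, Q):
    """Classical dynamic program for the discrete Fréchet distance between
    the vertex sequences P and Q (lists of points)."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)

    return c(len(P) - 1, len(Q) - 1)

# For these two parallel segments, vertex-to-vertex matching is optimal and
# the value coincides with the continuous Fréchet distance:
assert discrete_frechet([(0, 0), (2, 0)], [(0, 1), (2, 1)]) == 1.0
```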

Note that there may not exist a matching \(h \in \mathcal {H}\) such that \(\max _{t \in [0,1]} \Vert \sigma (t) - \tau (h(t)) \Vert = \text{d}_\text{F}(\sigma , \tau)\). This is because, in some cases, a matching realizing the Fréchet distance would need to match multiple points \(p_1, \ldots , p_n\) on \(\tau\) to a single point \(q\) on \(\sigma\), which is not possible since matchings need to be bijections. However, the points \(p_1, \ldots , p_n\) can get matched arbitrarily close to \(q\), realizing \(\text{d}_\text{F}(\sigma , \tau)\) in the limit, which we formalize in the following lemma.

Lemma 2.4.

Let \(\sigma , \tau :[0,1] \rightarrow \mathbb {R}^d\) be curves. Let \(r = \text{d}_\text{F}(\sigma , \tau)\). There exists a sequence \((h_i)_{i=1}^\infty\) in \(\mathcal {H}\), such that \( \begin{equation*} \lim \limits _{i \rightarrow \infty } \max \limits _{t \in [0,1]} \Vert \sigma (t) - \tau (h_i(t)) \Vert = r. \end{equation*} \)

Proof.

Define \(\rho :\mathcal {H} \rightarrow \mathbb {R}_{\ge 0}, h \mapsto \max \limits _{t\in [0,1]} \Vert \sigma (t) - \tau (h(t)) \Vert\) with image \(R = \lbrace \rho (h) \mid h \in \mathcal {H} \rbrace\). Per definition, we have \(\text{d}_\text{F}(\sigma , \tau) = \inf R = r\).

For any non-empty subset \(X\) of \(\mathbb {R}\) that is bounded from below and for every \(\varepsilon \gt 0,\) it holds that there exists an \(x \in X\) with \(\inf X \le x \lt \inf X + \varepsilon\), by definition of the infimum. Since \(R \subseteq \mathbb {R}\) and \(\inf R\) exists, for every \(\varepsilon \gt 0\) there exists an \(r^\prime \in R\) with \(\inf R \le r^\prime \lt \inf R + \varepsilon\).

Now, let \(a_i = 1/i\) be a sequence converging to zero. For every \(i \in \mathbb {N}\), there exists an \(r_i \in R\) with \(r \le r_i \lt r + a_i\); thus, \(\lim \limits _{i \rightarrow \infty } r_i = r\).

Let \(\rho ^{-1}(r^\prime) = \lbrace h \in \mathcal {H} \mid \rho (h) = r^\prime \rbrace\) be the preimage of \(r^\prime\) under \(\rho\). Since \(R\) is the image of \(\rho\), we have \(\vert \rho ^{-1}(r^\prime) \vert \ge 1\) for each \(r^\prime \in R\). Now, for \(i \in \mathbb {N}\), let \(h_i\) be an arbitrary element from \(\rho ^{-1}(r_i)\). By definition, it holds that \(\begin{equation*} \lim \limits _{i \rightarrow \infty } \max \limits _{t \in [0,1]} \Vert \sigma (t) - \tau (h_i(t)) \Vert = \lim _{i \rightarrow \infty } \rho (h_i) = \lim _{i \rightarrow \infty } r_i = r = \inf R, \end{equation*}\) which proves the claim.□

We define the relation \(\sigma \sim \tau \iff \exists h \in \mathcal {H}: \sigma = \tau \circ h\), which is an equivalence relation [4]. Based on this, we introduce the classes of curves we are interested in.

Definition 2.5

(Polygonal Curve Classes).

For \(d \in \mathbb {N}\), we define by \(\mathbb {R}^d_{\ast }\) the set of equivalence classes with respect to \(\sim\), of polygonal curves in ambient space \(\mathbb {R}^d\). For \(m \in \mathbb {N,}\) we define by \(\mathbb {R}^d_m\) the subclass of polygonal curves of complexity at most \(m\).

Simplification is a fundamental problem related to curves, which appears as subproblem in our algorithms.

Definition 2.6

(Minimum-Error ℓ-Simplification).

For a polygonal curve \(\tau \in \mathbb {R}^d_\ast\), we denote by \(\text{simpl}( \alpha ,\tau )\) an \(\alpha\)-approximate minimum-error \(\ell\)-simplification of \(\tau\) (i.e., a curve \(\sigma \in \mathbb {R}^d_\ell\) with \(\text{d}_\text{F}(\tau , \sigma) \le \alpha \cdot \text{d}_\text{F}(\tau , \sigma ^\prime)\) for all \(\sigma ^\prime \in \mathbb {R}^d_\ell\)).

Now we define the \((k,\ell)\)-median clustering problem for polygonal curves.

Definition 2.7

((k,ℓ)-Median Clustering).

The \((k,\ell)\)-median clustering problem is defined as follows, where \(k,\ell \in \mathbb {N}\) are fixed (constant) parameters of the problem: given a finite and non-empty set \(T \subset \mathbb {R}^d_m\) of polygonal curves, compute a set of \(k\) curves \(C^\ast \subset \mathbb {R}^d_\ell\), such that \(\text{cost}( T,C^\ast ) = \sum \nolimits _{\tau \in T} \min \nolimits _{c^\ast \in C^\ast } \text{d}_\text{F}(\tau , c^\ast)\) is minimal.

We call \(\text{cost}( \cdot ,\cdot )\) the objective function, and we often write \(\text{cost}( T,c)\) as shorthand for \(\text{cost}( T,\lbrace c\rbrace)\). In addition, for simplicity, we call the \((1,\ell)\)-median problem the \(\ell\)-median problem and a solution to it an \(\ell\)-median.
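Spelled out in code, the objective and the shorthand read as follows, where frechet stands for some (possibly approximate) Fréchet-distance routine:

```python
def clustering_cost(T, C, frechet):
    """cost(T, C) from Definition 2.7: every input curve pays its Fréchet
    distance to the closest center in C."""
    return sum(min(frechet(c, tau) for c in C) for tau in T)

def cost_single(T, c, frechet):
    """Shorthand cost(T, c) = cost(T, {c})."""
    return clustering_cost(T, [c], frechet)
```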

2.1 Sampling

Sampling is the process of repeatedly drawing elements from some non-empty set with a certain probability (cf. [29]). The result of a sampling process is a multiset \(S\), which we call sample. The following theorem of Indyk [25] utilizes uniform sampling and is useful for evaluating the cost of a curve at hand.

Theorem 2.8 ([25, Theorem 31]).

Let \(\varepsilon \in (0,1]\) and \(T \subset \mathbb {R}^d_\ast\) be a non-empty set of polygonal curves. Further, let \(W\) be a non-empty sample, drawn uniformly and independently at random from \(T\), with replacement. For \(\tau , \sigma \in T\) with \(\text{cost}( T,\tau ) \gt (1+\varepsilon) \text{cost}( T,\sigma ) ,\) it holds that \(\Pr [\text{cost}( W,\tau ) \le \text{cost}( W,\sigma ) ] \lt \exp (- {\varepsilon ^2 \vert W \vert }/{64})\).

2.1.1 Superset Sampling.

This technique was coined by Kumar et al. [27] and also applied by Ackermann et al. [2]. Here, we draw a sample \(S\) uniformly and independently with replacement from a non-empty finite set \(X\), and we want \(S\) to contain a uniform and independent sample \(S^\prime\) from a subset \(Y \subseteq X\). As can easily be verified, the elements \(y \in Y\) contained in \(S\) form a uniform sample \(S^\prime\) from \(Y\), as shown in Proposition 2.9.

Proposition 2.9.

Let \(X\) be a non-empty finite set and \(Y \subseteq X\) be a non-empty subset. Let \(S\) be sampled uniformly and independently with replacement from \(X,\) and let \(F\) be the event that there is a subset \(S^\prime \subseteq S\) of size at least \(n \in \mathbb {N}\), with \(S^\prime \subseteq Y\). For each \(s^\prime \in S^\prime\) and \(y \in Y,\) it holds that \(\Pr [s^\prime = y \mid F] = \frac{1}{\vert Y \vert }\).

This statement follows from elementary probability theory; for completeness, however, we provide a proof.

Proof of Proposition 2.9

The sample space is \(X^{\vert S \vert }\), where we neglect the order of the elements of the elementary events, and consists of \(\vert X \vert ^{\vert S \vert }\) tuples, each of which occurs with the same probability. In \(\binom{\vert S \vert }{n} \cdot \vert Y \vert ^n \cdot \vert X \vert ^{\vert S \vert - n}\) of them, at least \(n\) elements are from \(Y\). Thus, \(\Pr [F] = \frac{\binom{\vert S \vert }{n} \cdot \vert Y \vert ^n \cdot \vert X \vert ^{\vert S \vert - n}}{\vert X \vert ^{\vert S \vert }}\).

Without loss of generality, we assume that \(s^\prime\) corresponds to the first draw from \(Y\). There are \(\binom{\vert S \vert }{n} \cdot 1 \cdot \vert Y \vert ^{n-1} \cdot \vert X \vert ^{\vert S \vert - n}\) tuples such that \(s^\prime = y\) and at least \(n\) elements are from \(Y\). Hence, \(\Pr [(s^\prime = y) \cap F] = \frac{\binom{\vert S \vert }{n} \cdot 1 \cdot \vert Y \vert ^{n-1} \cdot \vert X \vert ^{\vert S \vert - n}}{\vert X \vert ^{\vert S \vert }}\). Finally, we have \(\begin{equation*} \Pr [s^\prime = y \mid F] = \frac{\Pr [(s^\prime = y) \cap F]}{\Pr [F]} = \frac{ \frac{\binom{\vert S \vert }{n} \cdot 1 \cdot \vert Y \vert ^{n-1} \cdot \vert X \vert ^{\vert S \vert - n}}{\vert X \vert ^{\vert S \vert }}}{\frac{\binom{\vert S \vert }{n} \cdot \vert Y \vert ^{n} \cdot \vert X \vert ^{\vert S \vert - n}}{\vert X \vert ^{\vert S \vert }}} = \frac{1}{\vert Y \vert }. \end{equation*}\)□
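In our algorithms, the technique boils down to a brute-force enumeration of subsets of the sample; a minimal sketch (the subset size mirrors the \(\vert S \vert /(2\beta)\) used by Algorithm 2 later):

```python
import math
from itertools import combinations

def candidate_subsets(S, beta):
    """All subsets of the sample S of size about |S| / (2 * beta). If a
    cluster T' makes up at least a 1/beta fraction of the input, then w.h.p.
    one of these subsets is a uniform and independent sample of T'
    (Proposition 2.9 together with Lemma 2.11)."""
    size = math.ceil(len(S) / (2 * beta))
    return combinations(S, size)
```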

We often employ the following, which can be seen as a union bound for superset sampling.

Lemma 2.10.

Let \(E_1, \ldots , E_n, F\) be events, with \(\Pr [F] \gt 0\). It holds that \(\begin{equation*} \Pr \left[ F \cap \bigcap _{i=1}^n E_i \right] \ge 1- \left(\Pr [\overline{F}] + \sum _{i=1}^n \Pr [\overline{E}_i \mid F]\right). \end{equation*}\)

Proof.

We have the following: \(\begin{align*} \Pr \left[F \cap \bigcap _{i=1}^n E_i \right] & = 1 - \Pr \left[ \overline{F} \cup \bigcup _{i=1}^n \overline{E}_i \right] = 1 - \Pr \left[ \bigcup _{i=1}^n (\overline{E}_i \cap \overline{F}) \cup \bigcup _{i=1}^n (E_i \cap \overline{F}) \cup \bigcup _{i=1}^n (\overline{E}_i \cap F) \right]\\ & = 1 - \Pr \left[\overline{F} \cup \bigcup _{i=1}^n (\overline{E}_i \cap F)\right] \ge 1 - \left(\Pr [\overline{F}] + \sum _{i=1}^n \Pr [\overline{E}_i \cap F]\right) \ge 1 - \left(\Pr [\overline{F}] + \sum _{i=1}^n \Pr [\overline{E}_i \mid F]\right). \end{align*}\) In the first equation, we use De Morgan’s law, the first inequality follows from a union bound, and the last inequality follows, since \(\Pr [\overline{E}_i \mid F] \ge \Pr [\overline{E}_i \cap F ]\) for all \(i \in [n]\).□

The following concentration bound also applies to independent Bernoulli trials, which are a special case of Poisson trials where each trial has the same probability of success. In particular, it can be used to bound the probability that superset sampling is not successful.

Lemma 2.11 (Chernoff Bound for Independent Poisson Trials [29, Theorem 4.5]).

Let \(X_1, \ldots , X_n\) be independent Poisson trials. For \(\delta \in (0,1),\) it holds that \(\begin{equation*} \Pr \left[\sum _{i=1}^n X_i \le (1-\delta) E\left[\sum _{i=1}^n X_i\right] \right] \le \exp \left(- \frac{\delta ^2}{2} E\left[\sum _{i=1}^n X_i\right]\right). \end{equation*}\)


3 SIMPLE AND FAST 34-APPROXIMATION FOR ℓ-MEDIAN

Here, we present Algorithm 1, a 34-approximation algorithm for the \(\ell\)-median problem. It is based on the following. We can obtain a 3-approximate solution (in terms of objective value) to the \(\ell\)-median problem for a given set \(T = \lbrace \tau _1, \ldots , \tau _n \rbrace \subset \mathbb {R}^d_m\) of polygonal curves with high probability by uniformly and independently sampling a sufficient number of curves from \(T\). In detail, when the sample size only depends on the failure probability, it contains with high probability one of the at least \(n/2\) input curves (by an averaging argument) that are within distance \(2 \cdot \text{cost}( T,c^\ast ) /n\) to an optimal \(\ell\)-median \(c^\ast\) for \(T\). These curves have cost up to \(3 \cdot \text{cost}( T,c^\ast )\) by the triangle inequality. We call this sample the candidate sample.

We can improve the running time by using Theorem 2.8 and evaluating the cost of each curve in the candidate sample against another sample (which we call the witness sample) of similar size instead of against the complete input. However, we have to accept an approximation factor of 5 (if we set \(\varepsilon = 1\) in Theorem 2.8). That is indeed acceptable, since we only obtain an approximate solution in terms of objective value and ignore the bound on the number of vertices of the center curve, which is a disadvantage of this approach and may even result in the obtained center having cost less than \(\text{cost}( T,c^\ast )\), if \(m \gt \ell\). To fix this, we simplify the candidate curve that evaluated best against the witness sample, using an efficient minimum-error \(\ell\)-simplification approximation algorithm, which downgrades the approximation factor to \(6+7\alpha\), where \(\alpha\) is the approximation factor of the minimum-error \(\ell\)-simplification.

However, Algorithm 1 is very fast in terms of the input size. Indeed, it has worst-case running time independent of \(n\) and sub-quartic in \(m\). The purpose of Algorithm 1 is to provide us with an approximate median for a given set of polygonal curves: the bicriteria approximation algorithms (Algorithms 2 and 4), which we present afterward and which are capable of generating center curves with up to \(2\ell -2\) vertices, need an approximate median (and the approximation factor) to bound the optimal objective value. Furthermore, there is a case where Algorithms 2 and 4 may fail to provide a good approximation (i.e., when the cost estimation fails and the grids are thus not set up properly), but it can be proven that the result of Algorithm 1 is then a very good approximation, which can be used instead.
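The following Python sketch mirrors the procedure just described (it is not the literal Algorithm 1); the routines frechet and simplify are assumed helpers, and the sample sizes are chosen in the spirit of the proof of Theorem 3.1.

```python
import math
import random

def algorithm1_sketch(T, delta, frechet, simplify):
    """Sketch of the coarse median procedure described above. frechet is an
    assumed Fréchet-distance routine and simplify an assumed alpha-approximate
    minimum-error ell-simplification (e.g., alpha = 4 via Imai-Iri combined
    with Alt-Godau)."""
    s_size = math.ceil(2 * (math.log(2) + math.log(1 / delta)))            # candidate sample
    w_size = math.ceil(64 * (math.log(1 / delta) + math.log(2 * s_size)))  # witness sample
    S = [random.choice(T) for _ in range(s_size)]
    W = [random.choice(T) for _ in range(w_size)]
    # the candidate that evaluates best against the witness sample ...
    best = min(S, key=lambda s: sum(frechet(s, w) for w in W))
    # ... is simplified to restore the complexity bound of ell vertices
    return simplify(best)
```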

Next, we prove the quality of approximation of Algorithm 1.

Theorem 3.1.

Given a parameter \(\delta \in (0,1)\) and a set \(T = \lbrace \tau _1, \ldots , \tau _n \rbrace \subset \mathbb {R}^d_m\) of polygonal curves, Algorithm 1 returns with probability at least \(1-\delta\) a polygonal curve \(c \in \mathbb {R}^d_\ell\), such that \(\text{cost}( T,c^\ast ) \le \text{cost}( T,c) \le (6 + 7\alpha) \cdot \text{cost}( T,c^\ast )\), where \(c^\ast\) is an optimal \(\ell\)-median for \(T\) and \(\alpha\) is the approximation factor of the utilized minimum-error \(\ell\)-simplification approximation algorithm.

Proof.

First, we know that \(\text{d}_\text{F}(\tau , \text{simpl}( \alpha ,\tau )) \le \alpha \cdot \text{d}_\text{F}(\tau , c^\ast)\), for each \(\tau \in T\), by Definition 2.6.

Now, there are at least \(\frac{n}{2}\) curves in \(T\) that are within distance at most \(\frac{2\text{cost}( T,c^\ast ) }{n}\) to \(c^\ast\). Otherwise, the cost of the remaining curves would exceed \(\text{cost}( T,c^\ast )\), which is a contradiction. Hence, each \(s \in S\) has probability at least \(\frac{1}{2}\) to be within distance \(\frac{2\text{cost}( T,c^\ast ) }{n}\) to \(c^\ast\).

Since the elements of \(S\) are sampled independently, we conclude that the probability that every \(s \in S\) has distance to \(c^\ast\) greater than \(\frac{2\text{cost}( T,c^\ast ) }{n}\) is at most \((1-\frac{1}{2})^{\vert S \vert } \le \exp (-\frac{2(\ln (2)+\ln (1/\delta))}{2}) = \frac{\delta }{2}\).

Now, assume there is an \(s \in S\) with \(\text{d}_\text{F}(s, c^\ast) \le \frac{2 \text{cost}( T,c^\ast ) }{n}\). We do not want any \(\tau \in S \setminus \lbrace s\rbrace\) with \(\text{cost}( T,\tau ) \gt 2 \text{cost}( T,s)\) to have \(\text{cost}( W,\tau ) \le \text{cost}( W,s)\). Using Theorem 2.8, we conclude that this happens with probability at most \( \begin{equation*} \exp \left(-\frac{64 (\ln (1/\delta) + \ln (\lceil 4(\ln (2)+\ln (1/\delta))\rceil))}{64}\right) \le \frac{\delta }{\lceil 4 (\ln (2) + \ln (1/\delta)) \rceil } \le \frac{\delta }{2 \vert S \vert }, \end{equation*} \) for each \(\tau \in S \setminus \lbrace s\rbrace\).

Using a union bound over all bad events, we conclude that with probability at least \(1-\delta\), Algorithm 1 samples a curve \(s \in S\), with \(\text{d}_\text{F}(s, c^\ast) \le 2 \text{cost}( T,c^\ast ) /n\) and returns the simplification \(c = \text{simpl}( \alpha ,t)\) of a curve \(t \in S\), with \(\text{cost}( T,t) \le 2 \text{cost}( T,s)\). The triangle inequality yields \(\begin{equation*} \sum _{\tau \in T} (\text{d}_\text{F}(t, c^\ast) - \text{d}_\text{F}(\tau , c^\ast)) \le \sum _{\tau \in T} \text{d}_\text{F}(t, \tau) \le 2 \sum _{\tau \in T} \text{d}_\text{F}(s, \tau) \le 2 \sum _{\tau \in T} (\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , s)), \end{equation*}\) which is equivalent to \(\begin{equation*} n \cdot \text{d}_\text{F}(t, c^\ast) \le 2 \text{cost}( T,c^\ast ) + \text{cost}( T,c^\ast ) + 2 n \frac{2\text{cost}( T,c^\ast ) }{n} \iff {} \text{d}_\text{F}(t, c^\ast) \le \frac{7\text{cost}( T,c^\ast ) }{n}. \end{equation*}\)

Hence, we have \(\begin{align*} \text{cost}( T,c) & ={} \sum _{\tau \in T} \text{d}_\text{F}(\tau , \text{simpl}( \alpha ,t)) \le \sum _{\tau \in T} (\text{d}_\text{F}(\tau , t) + \text{d}_\text{F}(t, \text{simpl}( \alpha ,t))) \\ & \le {} 2 \text{cost}( T,s) + \sum _{\tau \in T} \alpha \cdot \text{d}_\text{F}(t, c^\ast) \le {} 2 \sum _{\tau \in T} (\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , s)) + 7\alpha \cdot \text{cost}( T,c^\ast ) \\ & \le {} 2 \text{cost}( T,c^\ast ) + 4 \text{cost}( T,c^\ast ) + 7\alpha \cdot \text{cost}( T,c^\ast ) ={} (6 + 7\alpha) \text{cost}( T,c^\ast ) . \end{align*}\)

The lower bound \(\text{cost}( T,c^\ast ) \le \text{cost}( T,c)\) follows from the fact that the returned curve has \(\ell\) vertices and that \(c^\ast\) has minimum cost among all curves with \(\ell\) vertices.□

The following lemma enables us to obtain a concrete approximation factor and worst-case running time of Algorithm 1.

Lemma 3.2 (Buchin et al. [10, Lemma 7.1]).

Given a curve \(\sigma \in \mathbb {R}^d_m\), a 4-approximate minimum-error \(\ell\)-simplification can be computed in \(O(m^3 \log m)\) time.

The simplification algorithm used for obtaining this statement is a combination of the algorithm by Imai and Iri [24] and the algorithm by Alt and Godau [4]. Combining Theorem 3.1 and Lemma 3.2, we obtain the following corollary.

Corollary 3.3.

Given a parameter \(\delta \in (0,1)\) and a set \(T \subset \mathbb {R}^d_m\) of polygonal curves, Algorithm 1 returns with probability at least \(1-\delta\) a polygonal curve \(c \in \mathbb {R}^d_\ell\), such that \(\begin{equation*} \text{cost}( T,c^\ast ) \le \text{cost}( T,c) \le 34 \cdot \text{cost}( T,c^\ast ) , \end{equation*}\) where \(c^\ast\) is an optimal \(\ell\)-median for \(T\), in time \(O(m^2 \log (m) \ln ^2(1/\delta) + m^3 \log m)\), when the algorithms by Imai and Iri [24] and Alt and Godau [4] are combined for \(\ell\)-simplification.

Proof.

We use Lemma 3.2 together with Theorem 3.1, which yields an approximation factor of 34.

Now, drawing the samples takes time \(O(\ln (1/\delta))\) each. Evaluating the samples against each other takes time \(O(m^2 \log (m) \ln ^2(1/\delta))\) and simplifying one of the curves that evaluates best takes time \(O(m^3 \log m)\). We conclude that Algorithm 1 has running time \(O(m^2 \log (m) \ln ^2(1/\delta) + m^3 \log m)\).□


4 (3+ɛ)-APPROXIMATION FOR ℓ-MEDIAN BY SIMPLE SHORTCUTTING

Here, we present Algorithm 2, which returns a set of candidates that, with high probability, contains a \((3+\varepsilon)\)-approximate \((1,\ell)\)-median of complexity at most \(2\ell -2\) for any (fixed) subset that takes a constant fraction of the input. Algorithm 2 can be used as a plugin in our generalized version (Algorithm 5, Section 7) of the algorithm by Ackermann et al. [2].

In contrast to Nath and Taylor [30], we cannot use the property that the vertices of a median must be found in the balls of radius \(\text{d}_\text{F}(\tau , c^\ast)\), centered at \(\tau\)’s vertices, where \(c^\ast\) is an optimal \((1,\ell)\)-median for a given input \(T\) that contains \(\tau\). This is an immediate consequence of using the continuous Fréchet distance.

We circumvent this by proving the following shortcutting lemmata. We start with the simplest, which states that we can indeed search the aforementioned balls, if we accept a resulting curve of complexity at most \(2\ell -2\). Figure 2 presents a visualization.

Fig. 2. Visualization of a simple shortcut. The black curve is an input curve that is close to an optimal median, which is depicted in red. By inserting the blue shortcut, we can find a curve that has the same distance to the black curve as the median but with all vertices contained in the balls centered at the black curve’s vertices.

Lemma 4.1.

Let \(\sigma , \tau \in \mathbb {R}^d_{\ast }\) be polygonal curves. Let \(v^\tau _1, \ldots , v^\tau _{\vert \tau \vert }\) be the vertices of \(\tau ,\) and let \(r = \text{d}_\text{F}(\sigma , \tau)\). There exists a polygonal curve \(\sigma ^\prime \in \mathbb {R}^d_{2 \vert \sigma \vert - 2}\) with \(\text{d}_\text{F}(\sigma ^\prime , \tau) \le \text{d}_\text{F}(\sigma , \tau)\) and every vertex contained in at least one of \(B(v^\tau _1, r), \ldots , B(v^\tau _{\vert \tau \vert }, r)\).

Proof.

Let \(v^\sigma _1, \ldots , v^\sigma _{\vert \sigma \vert }\) be the vertices of \(\sigma\). Further, let \(t^\sigma _1, \ldots , t^\sigma _{\vert \sigma \vert }\) and \(t^\tau _1, \ldots , t^\tau _{\vert \tau \vert }\) be the instants of \(\sigma\) and \(\tau\), respectively. In addition, for \(h \in \mathcal {H}\) (recall that \(\mathcal {H}\) is the set of all continuous bijections \(h:[0,1] \rightarrow [0,1]\) with \(h(0) = 0\) and \(h(1) = 1\)), let \(r_h = \max \nolimits _{t \in [0,1]} \Vert \sigma (t) - \tau (h(t)) \Vert\) be the distance realized by \(h\). We know from Lemma 2.4 that there exists a sequence \((h_x)_{x=1}^\infty\) in \(\mathcal {H}\), such that \(\lim \nolimits _{x \rightarrow \infty } r_{h_x} = \text{d}_\text{F}(\sigma , \tau) = r\).

Now, fix an arbitrary \(h \in \mathcal {H}\) and assume that there is a vertex \(v^\sigma _i\) of \(\sigma\), with instant \(t^\sigma _i\), which is not contained in any of \(B(v^\tau _1, r_h), \ldots , B(v^\tau _{\vert \tau \vert }, r_h)\). Let \(j\) be the maximum of \([\vert \tau \vert - 1]\), such that \(t^\tau _j \le h(t^\sigma _i) \le t^\tau _{j+1}\). So \(v^\sigma _i\) is matched to \(\overline{\tau (t^\tau _j) \tau (t^\tau _{j+1})}\) by \(h\). We modify \(\sigma\) in such a way that \(v^\sigma _i\) is replaced by two new vertices that are elements of \(B(v^\tau _j, r_h)\) and \(B(v^\tau _{j+1}, r_h)\), respectively.

Namely, let \(t^-\) be the maximum of \([0, t^\sigma _i)\), such that \(\sigma (t^-) \in B(v^\tau _j, r_h),\) and let \(t^+\) be the minimum of \((t^\sigma _i, 1]\), such that \(\sigma (t^+) \in B(v^\tau _{j+1}, r_h)\). These are the instants when \(\sigma\) leaves \(B(v^\tau _j, r_h)\) before visiting \(v^\sigma _i\) and \(\sigma\) enters \(B(v^\tau _{j+1}, r_h)\) after visiting \(v^\sigma _i\), respectively. Let \(\sigma ^\prime _h\) be the piecewise defined curve, defined just like \(\sigma\) on \([0,t^-]\) and \([t^+,1]\), but on \((t^-, t^+)\) it connects \(\sigma (t^-)\) and \(\sigma (t^+)\) with the line segment \(s(t) =(1-\frac{t-t^-}{t^+-t^-}) \sigma (t^-) + \frac{t-t^-}{t^+-t^-} \sigma (t^+)\).

We know that \(\Vert \sigma (t^-) - \tau (h(t^-)) \Vert \le r_h\) and \(\Vert \sigma (t^+) - \tau (h(t^+)) \Vert \le r_h\). Note that \(t^\tau _j \le h(t^-)\) and \(h(t^+) \le t^\tau _{j+1}\) since \(\sigma (t^-)\) and \(\sigma (t^+)\) are the closest points to \(v^\sigma _i\) on \(\sigma\) that have distance \(r_h\) to \(v^\tau _j\) and \(v^\tau _{j+1}\), respectively, by definition. Therefore, \(\tau\) has no vertices between the instants \(h(t^-)\) and \(h(t^+)\). Now, \(h\) can be used to match \(\sigma ^\prime _h\vert _{[0,t^-)}\) to \(\tau \vert _{[0,h(t^-))}\) and \(\sigma ^\prime _h\vert _{(t^+,1]}\) to \(\tau \vert _{(h(t^+),1]}\) with distance at most \(r_h\). Since \(\sigma ^\prime _h\vert _{[t^-, t^+]}\) and \(\tau \vert _{[h(t^-), h(t^+)]}\) are just line segments, they can be linearly matched to each other with distance at most \(\max \lbrace \Vert \sigma ^\prime _h(t^-) - \tau (h(t^-)) \Vert , \Vert \sigma ^\prime _h(t^+) - \tau (h(t^+)) \Vert \rbrace \le r_h\). We conclude that \(\text{d}_\text{F}(\sigma ^\prime _h, \tau) \le r_h\).

Because this modification works for every \(h \in \mathcal {H}\), we have \(\text{d}_\text{F}(\sigma ^\prime _{h}, \tau) \le r_h\) for every \(h \in \mathcal {H}\). Thus, \(\lim \nolimits _{x \rightarrow \infty } \text{d}_\text{F}(\sigma ^\prime _{h_x}, \tau) \le \text{d}_\text{F}(\sigma , \tau) = r\).

Now, to prove the claim, for every \(h \in \mathcal {H}\) we apply this modification to \(v^\sigma _i\) and successively to every other vertex \(v^{\sigma ^\prime _h}_i\) of the resulting curve \(\sigma ^\prime _h\) that is not contained in one of the balls, until every vertex of \(\sigma ^\prime _h\) is contained in a ball. Note that the modification is repeated at most \(\vert \sigma \vert - 2\) times for every \(h \in \mathcal {H}\), since the start and end vertex of \(\sigma\) must be contained in \(B(v^\tau _1, r_h)\) and \(B(v^\tau _{\vert \tau \vert }, r_h)\), respectively. Therefore, the number of vertices of every \(\sigma ^\prime _h\) can be bounded by \(2 \cdot (\vert \sigma \vert - 2) + 2\), since only the at most \(\vert \sigma \vert - 2\) interior vertices can lie outside the balls and each such vertex is replaced by two new vertices. Thus, \(\vert \sigma ^\prime _h \vert \le 2 \vert \sigma \vert - 2\).□

We now present Algorithm 2, which works similarly to Algorithm 1 but uses shortcutting instead of simplification. As a consequence, we can achieve an approximation factor of \(3+\varepsilon\) instead of \((2+\varepsilon)(1+\alpha)\), where \((2+\varepsilon)\) comes from the candidate sampling and \((1+\alpha)\) comes from simplification with approximation factor \(\alpha \ge 1\). Indeed, this factor is the best we can achieve by the previously used techniques in combination with simplification. Thus, to achieve an approximation factor of \((4+\varepsilon),\) one would need to compute the optimal minimum-error \(\ell\)-simplifications of the input curves, and to the best of our knowledge, there is no such algorithm for the continuous Fréchet distance.

In contrast to Algorithm 1, Algorithm 2 utilizes the superset sampling technique (see Section 2.1.1) to fulfill the requirements to be used as a plugin algorithm for the \((k,\ell)\)-median approximation algorithm to be presented in Section 7—that is, it is able to obtain an approximate \(\ell\)-median for any cluster \(T^\prime\) that takes a constant fraction of the input \(T\). Therefore, it has running time exponential in the size of the sample \(S\). A further difference is that we need an upper and a lower bound on the cost of an optimal \(\ell\)-median \(c^\ast\) for \(T^\prime\), to properly set up the grids we use for shortcutting. The lower bound can be obtained by simple estimation, using Markov’s inequality—with high probability, a multiple of the cost of the result of Algorithm 1 run on a subset \(S^\prime \subseteq S\) with \(S^\prime \subseteq T^\prime\) and with respect to \(S^\prime\) is a lower bound on the cost of \(c^\ast\) with respect to \(T^\prime\). For the upper bound, we utilize a case distinction, which guarantees us that if we fail to obtain an upper bound on the optimal cost, the result of Algorithm 1 is then a good approximation (factor \(2+\varepsilon\), an immediate consequence of the distinction) and can be used instead of a best curve obtained by shortcutting.

Algorithm 2 has several parameters: \(\beta\) determines the size (in terms of a fraction of the input) of the smallest subset of the input for which an approximate median can be computed, \(\delta\) determines the probability of failure of the algorithm, and \(\varepsilon\) determines the approximation factor.
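The following Python sketch mirrors the candidate generation just described (it is not the literal Algorithm 2); algorithm1, frechet, and grid_covering_ball (as sketched in Section 2) are assumed helpers, and the ball radius and grid cell width follow the quantities appearing in the proofs of Theorems 4.2 and 4.3.

```python
import math
import random
from itertools import combinations, product

def algorithm2_sketch(T, beta, delta, eps, ell, d,
                      frechet, algorithm1, grid_covering_ball):
    """Schematic sketch of the candidate generation described above.
    algorithm1 is an assumed coarse (e.g., 34-approximate) median routine,
    grid_covering_ball enumerates grid points covering a ball, and d is the
    ambient dimension."""
    s_size = math.ceil(8 * beta * (math.log(1 / delta) + math.log(4)) / eps)
    S = [random.choice(T) for _ in range(s_size)]
    candidates = []
    for S_prime in combinations(S, math.ceil(s_size / (2 * beta))):
        c = algorithm1(list(S_prime), delta)           # coarse median of the subset
        Delta = sum(frechet(c, s) for s in S_prime)    # its cost on the subset
        candidates.append(c)
        if Delta == 0:                                 # degenerate subset; c suffices
            continue
        radius = (1 + eps) * Delta / eps                              # ball radius
        width = eps * delta * Delta / (68 * s_size * math.sqrt(d))    # grid cell width
        for s in S_prime:
            points = [g for v in s.vertices
                      for g in grid_covering_ball(v, radius, width)]
            # every sequence of at most 2*ell - 2 grid points is (the vertex
            # list of) one candidate curve -- exponentially many, as analyzed
            # in Theorem 4.3
            for length in range(2, 2 * ell - 1):
                candidates.extend(product(points, repeat=length))
    return candidates
```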

We prove the quality of approximation of Algorithm 2.

Theorem 4.2.

Given three parameters \(\beta \in [1, \infty)\), \(\delta , \varepsilon \in (0,1)\) and a set \(T = \lbrace \tau _1, \ldots , \tau _n \rbrace \subset \mathbb {R}^d_m\) of polygonal curves, with probability at least \(1-\delta\) the set of candidates that Algorithm 2 returns contains a \((3+\varepsilon)\)-approximate \(\ell\)-median with up to \(2\ell -2\) vertices for any (fixed) \(T^\prime \subseteq T\), if \(\vert T^\prime \vert \ge \frac{1}{\beta } \vert T \vert\).

Proof.

We assume that \(\vert T^\prime \vert \ge \frac{1}{\beta } \vert T \vert\). Let \(n^\prime\) be the number of sampled curves in \(S\) that are elements of \(T^\prime\). Clearly, \(E\left[n^\prime \right] \ge \sum _{i=1}^{\vert S \vert } \frac{1}{\beta } = \frac{\vert S \vert }{\beta }\). In addition, \(n^\prime\) is the sum of independent Bernoulli trials. A Chernoff bound (see Lemma 2.11) yields the following: \(\begin{align*} \Pr \left[n^\prime \lt \frac{\vert S \vert }{2\beta }\right] \le \Pr \left[n^\prime \lt \frac{1}{2}E\left[n^\prime \right]\right] \le \exp \left(-\frac{1}{4}\frac{\vert S \vert }{2\beta }\right) \le \exp \left(\frac{\ln (\delta)-\ln (4)}{\varepsilon } \right) = \left(\frac{\delta }{4}\right)^{\frac{1}{\varepsilon }} \le \frac{\delta }{4}. \end{align*}\)

In other words, with probability at most \(\delta /4,\) no subset \(S^\prime \subseteq S\), of cardinality at least \(\frac{\vert S \vert }{2\beta }\), is a subset of \(T^\prime\). We condition the rest of the proof on the contrary event, denoted by \(\mathcal {E}_{T^\prime }\), namely that there is a subset \(S^\prime \subseteq S\) with \(S^\prime \subseteq T^\prime\) and \(\vert S^\prime \vert \ge \frac{\vert S \vert }{2\beta }\). Note that \(S^\prime\) is then a uniform and independent sample of \(T^\prime\) (see Section 2.1.1).

Now, let \(c^\ast \in \text{arg min}_{c \in \mathbb {R}^d_\ell }\; \text{cost}( T^\prime ,c)\) be an optimal \(\ell\)-median for \(T^\prime\). The expected distance between \(s \in S^\prime\) and \(c^\ast\) is \(\begin{equation*} E\left[\text{d}_\text{F}(s, c^\ast)\ \vert \ \mathcal {E}_{T^\prime }\right] = \sum _{\tau \in T^\prime } \text{d}_\text{F}(c^\ast , \tau) \cdot \frac{1}{\vert T^\prime \vert } = \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }. \end{equation*}\) By linearity, we have \(E\left[\text{cost}( S^\prime ,c^\ast ) \ \vert \ \mathcal {E}_{T^\prime }\right] = \frac{\vert S^\prime \vert }{\vert T^\prime \vert } \text{cost}( T^\prime ,c^\ast )\). Markov’s inequality yields the following: \(\begin{align*} \Pr \left[ \frac{\delta \vert T^\prime \vert }{4\vert S^\prime \vert }\text{cost}( S^\prime ,c^\ast ) \gt \text{cost}( T^\prime ,c^\ast ) \ \Big \vert \ \mathcal {E}_{T^\prime }\right] \le \frac{\delta }{4}. \end{align*}\) We conclude that with probability at most \(\delta /4,\) we have \(\frac{\delta \vert T^\prime \vert }{4\vert S^\prime \vert } \text{cost}( S^\prime ,c^\ast ) \gt \text{cost}( T^\prime ,c^\ast )\).

Using Markov’s inequality again, for every \(s \in S^\prime ,\) we have \(\begin{equation*} \Pr \left[\text{d}_\text{F}(s, c^\ast) \gt (1+\varepsilon) \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\ \Big \vert \ \mathcal {E}_{T^\prime }\right] \le \frac{1}{1+\varepsilon }, \end{equation*}\) and therefore by independence, \(\begin{equation*} \Pr \left[\min _{s \in S^\prime } \text{d}_\text{F}(s, c^\ast) \gt (1+\varepsilon)\frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\ \Big \vert \ \mathcal {E}_{T^\prime }\right] \le \frac{1}{(1+\varepsilon)^{\vert S^\prime \vert }} \le \exp \left(-\frac{\varepsilon }{2}\frac{\vert S \vert }{2\beta }\right). \end{equation*}\) Hence, with probability at most \(\exp (-\frac{\varepsilon \left\lceil \frac{8\beta (\ln (1/\delta) + \ln (4))}{\varepsilon } \right\rceil }{4\beta }) \le \delta ^2/16 \le \delta /4,\) there is no \(s \in S^\prime\) with \(\text{d}_\text{F}(s, c^\ast) \le (1+\varepsilon) \cdot \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\). In addition, with probability at most \(\delta /4,\) Algorithm 1 fails to compute a 34-approximate \(\ell\)-median \(c \in \mathbb {R}^d_\ell\) for \(S^\prime\) (see Corollary 3.3).

Using Lemma 2.10, we conclude that with probability at least \(1-\delta ,\) all of the following events occur simultaneously:

(1)

There is a subset \(S^\prime \subseteq S\) of cardinality at least \(\vert S \vert /(2\beta)\) that is a uniform and independent sample of \(T^\prime\);

(2)

there is a curve \(s \in S^\prime\) with \(\text{d}_\text{F}(s, c^\ast) \le (1+\varepsilon)\frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\);

(3)

Algorithm 1 computes a polygonal curve \(c \in \mathbb {R}^d_\ell\) with \(\text{cost}( S^\prime ,c^\ast _{S^\prime }) \le \text{cost}( S^\prime ,c) \le 34\, \text{cost}( S^\prime ,c^\ast _{S^\prime })\), where \(c^\ast _{S^\prime } \in \mathbb {R}^d_\ell\) is an optimal \(\ell\)-median for \(S^\prime\); and

(4)

it holds that \(\frac{\delta \vert T^\prime \vert }{4 \vert S^\prime \vert } \text{cost}( S^\prime ,c^\ast ) \le \text{cost}( T^\prime ,c^\ast )\).

Since \(c^\ast _{S^\prime }\) is an optimal \(\ell\)-median for \(S^\prime ,\) we get the following from the last two items: \(\begin{align*} \text{cost}( T^\prime ,c^\ast ) \ge \frac{\delta \vert T^\prime \vert }{4 \vert S^\prime \vert } \text{cost}( S^\prime ,c^\ast ) \ge \frac{\delta \vert T^\prime \vert }{4 \vert S^\prime \vert } \text{cost}( S^\prime ,c^\ast _{S^\prime }) \ge \frac{\delta \vert T^\prime \vert }{4 \vert S^\prime \vert } \frac{\text{cost}( S^\prime ,c) }{34}. \end{align*}\)

We now distinguish between two cases.

Case 1. \(\text{d}_\text{F}(c,c^\ast) \ge (1+2\varepsilon) \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }.\)

The triangle inequality yields \(\begin{align*} \text{d}_\text{F}(c,s) & \ge {} \text{d}_\text{F}(c,c^\ast) - \text{d}_\text{F}(c^\ast ,s) \ge \text{d}_\text{F}(c,c^\ast) - (1+\varepsilon) \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \\ & \ge {} (1+2\varepsilon) \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } - (1+\varepsilon) \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } = \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }. \end{align*}\)

As a consequence, \(\text{cost}( S^\prime ,c) \ge \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \iff \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \le \frac{1}{\varepsilon } \text{cost}( S^\prime ,c)\).

Now, let \(v^{s}_1, \ldots , v^{s}_{\vert s \vert }\) be the vertices of \(s\). By Lemma 4.1, there exists a polygonal curve \(c^\prime\) with up to \(2\ell - 2\) vertices, every vertex contained in one of \(B(v^{s}_1, \text{d}_\text{F}(c^\ast , s)), \ldots , B(v^{s}_{\vert s \vert }, \text{d}_\text{F}(c^\ast , s))\) and \(\text{d}_\text{F}(s, c^\prime) \le \text{d}_\text{F}(s, c^\ast) \le (1+\varepsilon) \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \le (1+\varepsilon) \frac{\text{cost}( S^\prime ,c) }{\varepsilon }\).

The set of candidates that Algorithm 2 returns contains a curve \(c^{\prime \prime }\) with up to \(2\ell -2\) vertices from the union of the grid covers, such that the distance between every corresponding pair of vertices of \(c^\prime\) and \(c^{\prime \prime }\) is at most \(\frac{\varepsilon \frac{\delta n}{2\vert S \vert }\text{cost}( S^\prime ,c) }{n} \le \frac{\varepsilon \frac{\delta \vert T^\prime \vert }{4\vert S^\prime \vert }\text{cost}( S^\prime ,c) }{\vert T^\prime \vert } \le \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\). We conclude that \(\text{d}_\text{F}(c^\prime , c^{\prime \prime }) \le \frac{\varepsilon \text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\).

We can now bound the cost of \(c^{\prime \prime }\) as follows: \(\begin{align*} \text{cost}( T^\prime ,c^{\prime \prime }) & ={} \sum _{\tau \in T^\prime } \text{d}_\text{F}(\tau , c^{\prime \prime }) \le \sum _{\tau \in T^\prime } \left(\text{d}_\text{F}(\tau , c^\prime) + \frac{\varepsilon \text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\right) \\ & \le {} \sum _{\tau \in T^\prime } (\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , c^\prime)) + \varepsilon \text{cost}( T^\prime ,c^\ast ) \\ & \le {} \sum _{\tau \in T^\prime } (\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , s) + \text{d}_\text{F}(s, c^\prime)) + \varepsilon \text{cost}( T^\prime ,c^\ast ) \le {} (3+3\varepsilon) \text{cost}( T^\prime ,c^\ast ) . \end{align*}\)

Case 2. \(\text{d}_\text{F}(c,c^\ast) \lt (1+2\varepsilon) \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }.\)

The cost of \(c\) can easily be bounded: \(\begin{align*} \text{cost}( T^\prime ,c) \le \sum _{\tau \in T^\prime } (\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , c)) \lt \text{cost}( T^\prime ,c^\ast ) + (1+2\varepsilon) \text{cost}( T^\prime ,c^\ast ) = (2+2\varepsilon) \text{cost}( T^\prime ,c^\ast ) . \end{align*}\)

The claim follows by rescaling \(\varepsilon\) by \(\frac{1}{3}\).□

Next we analyze the worst-case running time of Algorithm 2 and the number of candidates it returns.

Theorem 4.3.

The running time as well as the number of candidates that Algorithm 2 returns can be bounded by \(\begin{equation*} m^{O(\ell)} \left(\frac{1}{\delta \varepsilon }\right)^{O(\ell d)} \beta ^{O(\ell d 1/\varepsilon \ln (1/\delta))}. \end{equation*}\)

Proof.

The sample \(S\) has size \(O(\frac{\ln (1/\delta) \cdot \beta }{\varepsilon })\), and obtaining it takes time \(O(\frac{\ln (1/\delta) \cdot \beta }{\varepsilon })\). Let \(n_S = \vert S \vert\). The outer for-loop runs \(\begin{equation*} \binom{n_S}{\frac{n_S}{2 \beta }} \le \left(\frac{e n_S}{\frac{n_S}{2\beta }} \right)^{\frac{n_S}{2\beta }} = (2e\beta)^{\frac{n_S}{2\beta }} = \beta ^{O\left(\frac{\ln (1/\delta)}{\varepsilon }\right)} \end{equation*}\) times. In each iteration, we run Algorithm 1, taking time \(O(m^2 \log (m) \ln ^2(1/\delta) + m^3 \log m)\) (see Corollary 3.3), we compute the cost of the returned curve with respect to \(S^\prime\), taking time \(O(\frac{\ln (1/\delta)}{\varepsilon ^\prime } \cdot m \log (m))\), and per curve in \(S^\prime ,\) we build up to \(m\) grids of size \(\begin{equation*} \left(\frac{\frac{(1+\varepsilon ^\prime)\Delta }{\varepsilon ^\prime }}{\frac{\varepsilon ^\prime \delta n \Delta }{n\sqrt {d} 2 \vert S \vert 34}}\right)^d = \left(\frac{68 \sqrt {d} \vert S \vert (1+\varepsilon ^\prime)}{{\varepsilon ^\prime }^2 \delta }\right)^d \in O\left(\frac{\beta ^d\ln (1/\delta)^d}{\varepsilon ^{3d}\delta ^d}\right) \end{equation*}\) each. For each curve \(s \in S^\prime\), Algorithm 2 then enumerates all combinations of \(2\ell -2\) points from these up to \(m\) grids, resulting in \(\begin{equation*} O\left(\frac{m^{2\ell -2} \beta ^{2\ell d-2d} \ln (1/\delta)^{2\ell d-2d}}{\varepsilon ^{6\ell d-6d}\delta ^{2\ell d-2d}}\right) \end{equation*}\) candidates per \(s \in S^\prime\), per iteration of the outer for-loop.

Thus, Algorithm 2 enumerates \(m^{O(\ell)}(\beta \cdot 1/\delta \cdot 1/\varepsilon)^{O(d\ell)}\) candidates per iteration of the outer for-loop.

All in all, the running time and number of candidates the algorithm returns can be bounded by \(\begin{equation*} m^{O(\ell)} \left(\frac{1}{\delta \varepsilon }\right)^{O(\ell d)} \beta ^{O(\ell d 1/\varepsilon \ln (1/\delta))}. \end{equation*}\)□

Skip 5MORE PRACTICAL APPROXIMATION FOR (1,ℓ)-MEDIAN BY SIMPLE SHORTCUTTING Section

5 MORE PRACTICAL APPROXIMATION FOR (1,ℓ)-MEDIAN BY SIMPLE SHORTCUTTING

The following algorithm is a modification of Algorithm 2. It is more practical since it needs to cover only up to \(m\) (small) balls, using grids. Unfortunately, it is not compatible with the superset sampling technique and can therefore not be used as a plugin in Algorithm 5.
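To make the covering step concrete, the following is a minimal sketch in Python (our illustration, not the article's pseudocode) of how one might grid-cover balls around the vertices of a sampled curve and then enumerate candidate curves with up to \(2\ell -2\) vertices; the function names, the parameters radius and cell_width, and the brute-force enumeration are assumptions made purely for illustration.

import itertools
import numpy as np

def grid_cover(center, radius, cell_width):
    # Axis-parallel grid covering the ball B(center, radius): every point of the
    # ball is within (sqrt(d) / 2) * cell_width of some returned grid point.
    center = np.asarray(center, dtype=float)
    d = center.shape[0]
    steps = int(np.ceil(radius / cell_width))
    axis = np.arange(-steps, steps + 1) * cell_width
    points = []
    for offset in itertools.product(axis, repeat=d):
        p = center + np.array(offset)
        if np.linalg.norm(p - center) <= radius + cell_width * np.sqrt(d) / 2:
            points.append(p)
    return points

def candidate_curves(ball_centers, radius, cell_width, max_vertices):
    # Union of the grid covers of all balls, then every ordered vertex sequence of
    # length 2, ..., max_vertices read as a polygonal candidate curve.
    P = [p for c in ball_centers for p in grid_cover(c, radius, cell_width)]
    for k in range(2, max_vertices + 1):
        for indices in itertools.permutations(range(len(P)), k):
            yield [P[i] for i in indices]

In the setting of Algorithm 3, the ball centers would be the vertices of a sampled curve, the radius an upper bound of the form \(O(\Delta /n)\) on the distance to the shortcut median, the cell width proportional to \(\varepsilon \Delta /(n\sqrt {d})\), and max_vertices \(= 2\ell -2\); the exponential size of this enumeration is roughly what the analysis in Theorem 5.2 accounts for.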

We prove the quality of approximation of Algorithm 3.

Theorem 5.1.

Given two parameters \(\delta , \varepsilon \in (0,1)\) and a set \(T = \lbrace \tau _1, \ldots , \tau _n \rbrace \subset \mathbb {R}^d_m\) of polygonal curves, with probability at least \(1-\delta ,\) Algorithm 3 returns a \((5+\varepsilon)\)-approximate \(\ell\)-median for \(T\) with up to \(2\ell -2\) vertices.

Proof.

Let \(c^\ast \in \text{arg min} _{c \in \mathbb {R}^d_\ell }\; \text{cost}( T,c)\) be an optimal \(\ell\)-median for \(T\). The expected distance between \(s \in S\) and \(c^\ast\) is \(\begin{equation*} \text{E}\left[\text{d}_\text{F}(s, c^\ast)\right] = \sum _{i=1}^n \text{d}_\text{F}(c^\ast , \tau _i) \cdot \frac{1}{n} = \frac{\text{cost}( T,c^\ast ) }{n}. \end{equation*}\)

Now using Markov’s inequality, for every \(s \in S\) we have \(\begin{equation*} \Pr [\text{d}_\text{F}(s, c^\ast) \gt (1+\varepsilon)\text{cost}( T,c^\ast ) /n] \le \frac{\text{cost}( T,c^\ast ) n^{-1}}{(1+\varepsilon)\text{cost}( T,c^\ast ) n^{-1}} = \frac{1}{1+\varepsilon }, \end{equation*}\) and therefore by independence, \(\begin{equation*} \Pr \left[\min _{s \in S} \text{d}_\text{F}(s, c^\ast) \gt (1+\varepsilon)\text{cost}( T,c^\ast ) /n\right] \le \frac{1}{(1+\varepsilon)^{\vert S \vert }} \le \exp (-\frac{\varepsilon \vert S \vert }{2}). \end{equation*}\) Hence, with probability at most \(\exp (-\frac{\varepsilon \left\lceil \frac{2(\ln (1/\delta) + \ln (4))}{\varepsilon } \right\rceil }{2}) \le \delta /4,\) there is no \(s \in S\) with \(\text{d}_\text{F}(s, c^\ast) \le (1+\varepsilon) \frac{\text{cost}( T,c^\ast ) }{n}\). Now, assume there is an \(s \in S\) with \(\text{d}_\text{F}(s, c^\ast) \le (1+\varepsilon) \text{cost}( T,c^\ast ) /n\). We do not want any \(\tau \in S \setminus \lbrace s\rbrace\) with \(\text{cost}( T,\tau ) \gt (1+\varepsilon)\text{cost}( T,s)\) to have \(\text{cost}( W,\tau ) \le \text{cost}( W,s)\). Using Theorem 2.8, we conclude that this happens with probability at most \(\begin{equation*} \exp \left(-\frac{\varepsilon ^2 \lceil 64\varepsilon ^{-2}(\ln (1/\delta) + \ln (\lceil 8 (\varepsilon ^\prime)^{-1} (\ln (1/\delta)+\ln (4))\rceil)) \rceil }{64}\right) \le \frac{\delta }{\lceil -8 (\varepsilon ^\prime)^{-1} (\ln (\delta)-\ln (4))\rceil } \le \frac{\delta }{4 \vert S \vert }, \end{equation*}\) for each \(\tau \in S \setminus \lbrace s\rbrace\). In addition, with probability at most \(\delta /2,\) Algorithm 1 fails to compute a 34-approximate \(\ell\)-median \(\widehat{c} \in \mathbb {R}^d_\ell\) for \(T\) (see Corollary 3.3).

Using a union bound over these bad events, we conclude that with probability at least \(1-\delta\),

(1)

Algorithm 3 samples a curve \(s \in S\) with \(\text{d}_\text{F}(s,c^\ast) \le (1+\varepsilon) \text{cost}( T,c^\ast ) /n\),

(2)

Algorithm 3 samples a curve \(t \in S\) with \(\text{cost}( T,t) \le (1+\varepsilon) \text{cost}( T,s) ,\) and

(3)

Algorithm 1 computes a 34-approximate \(\ell\)-median \(\widehat{c} \in \mathbb {R}^d_\ell\) for \(T\) (i.e., \(\text{cost}( T,c^\ast ) \le \Delta = \text{cost}( T,\widehat{c}) \le 34 \text{cost}( T,c^\ast )\)).

Let \(v^{t}_1, \ldots , v^{t}_{\vert t \vert }\) be the vertices of \(t\). By Lemma 4.1, there exists a polygonal curve \(c^\prime\) with up to \(2\ell - 2\) vertices, each vertex contained in one of \(B(v^{t}_1, \text{d}_\text{F}(c^\ast , t)), \ldots , B(v^{t}_{\vert t \vert }, \text{d}_\text{F}(c^\ast , t))\), and with \(\text{d}_\text{F}(t, c^\prime) \le \text{d}_\text{F}(t, c^\ast)\). Using the triangle inequality yields \(\begin{align*} \sum _{\tau \in T} (\text{d}_\text{F}(t, c^\ast) - \text{d}_\text{F}(\tau , c^\ast)) \le \sum _{\tau \in T} \text{d}_\text{F}(t, \tau) \le (1+\varepsilon) \sum _{\tau \in T} \text{d}_\text{F}(s, \tau) \le (1+\varepsilon) \sum _{\tau \in T} (\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , s)), \end{align*}\) which, using \(\text{d}_\text{F}(c^\ast , s) \le (1+\varepsilon) \text{cost}( T,c^\ast ) /n\), implies \(\begin{align*} n \cdot \text{d}_\text{F}(t, c^\ast) \le (2+\varepsilon) \text{cost}( T,c^\ast ) + (1+\varepsilon)^2 \text{cost}( T,c^\ast ) \le (3 + 4\varepsilon) \text{cost}( T,c^\ast ) , \end{align*}\) since \((2+\varepsilon) + (1+\varepsilon)^2 = 3 + 3\varepsilon + \varepsilon ^2 \le 3 + 4\varepsilon\) for \(\varepsilon \le 1\). Hence, we have \(\text{d}_\text{F}(t, c^\prime) \le \text{d}_\text{F}(t, c^\ast) \le (3+4\varepsilon) \text{cost}( T,c^\ast ) /n \le (3+ 4\varepsilon) \Delta /n\).

In the last step, Algorithm 3 returns a curve \(c^{\prime \prime }\) from the set \(C\) of all curves with up to \(2\ell -2\) vertices from \(P\), the union of the grid covers, that evaluates best. We can assume that \(c^{\prime \prime }\) has distance at most \(\frac{\varepsilon \Delta _l}{n} \le \varepsilon \frac{\text{cost}( T,c^\ast ) }{n}\) between every corresponding pair of vertices of \(c^\prime\) and \(c^{\prime \prime }\). We conclude that \(\text{d}_\text{F}(c^\prime , c^{\prime \prime }) \le \frac{\varepsilon \Delta _l}{n} \le \varepsilon \frac{\text{cost}( T,c^\ast ) }{n}\).

We can now bound the cost of \(c^{\prime \prime }\) as follows: \(\begin{align*} \text{cost}( T,c^{\prime \prime }) & ={} \sum _{\tau \in T} \text{d}_\text{F}(\tau , c^{\prime \prime }) \le \sum _{\tau \in T} \left(\text{d}_\text{F}(\tau , c^\prime) + \frac{\varepsilon \Delta _l}{n}\right) \le \sum _{\tau \in T} (\text{d}_\text{F}(\tau , t) + \text{d}_\text{F}(t, c^\prime)) + \varepsilon \text{cost}( T,c^\ast ) \\ & \le {} (1+\varepsilon) \text{cost}( T,s) + (3+5\varepsilon) \text{cost}( T,c^\ast ) \\ & \le {} (1+\varepsilon) \sum _{\tau \in T}(\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , s)) + (3+5\varepsilon) \text{cost}( T,c^\ast ) \\ & \le {} (1+\varepsilon)\text{cost}( T,c^\ast ) + (1+\varepsilon)^2 \text{cost}( T,c^\ast ) + (3+5\varepsilon) \text{cost}( T,c^\ast ) \\ & \le {} (5 + 9\varepsilon) \text{cost}( T,c^\ast ) . \end{align*}\)

The claim follows by rescaling \(\varepsilon\) by \(\frac{1}{9}\).□

We analyze the worst-case running time of Algorithm 3.

Theorem 5.2.

Algorithm 3 has running time \(O(\frac{nm^{2\ell -1} \log (m)}{\varepsilon ^{(2\ell -2)d}} + \frac{m^3 \log (m) \ln ^2(1/\delta)}{\varepsilon ^3})\).

Proof.

Algorithm 1 has running time \(O(m^2 \log (m) \ln ^2(1/\delta) + m^3 \log m)\). The sample \(S\) has size \(O(\frac{\ln (1/\delta)}{\varepsilon })\), and the sample \(W\) has size \(O(\frac{\ln (1/\delta)}{\varepsilon ^2})\). Evaluating each curve of \(S\) against \(W\) takes time \(O(\frac{m^2 \log (m) \ln ^2(1/\delta)}{\varepsilon ^3})\), using the algorithm of Alt and Godau [4] to compute the distances.

Now, \(c\) has up to \(m\) vertices, and every grid consists of \((\frac{\frac{(3+\varepsilon)\Delta }{n}}{\frac{2\varepsilon ^\prime \Delta }{n\sqrt {d}}})^d = (\frac{(3+\varepsilon)\sqrt {d}}{2 \varepsilon ^\prime })^d \in O(\frac{1}{\varepsilon ^d})\) points. Therefore, we have \(O(\frac{m}{\varepsilon ^{d}})\) points in \(P,\) and Algorithm 3 enumerates all combinations of \(2\ell -2\) points from \(P\) taking time \(O(\frac{m^{2\ell -2}}{\varepsilon ^{(2\ell -2)d}})\). Afterward, these candidates are evaluated, which takes time \(O(n m \log (m))\) per candidate using the algorithm of Alt and Godau [4] to compute the distances. All in all, we then have running time \(O(\frac{nm^{2\ell -1} \log (m)}{\varepsilon ^{(2\ell -2)d}} + \frac{m^3 \log (m) \ln ^2(1/\delta)}{\varepsilon ^3})\).□

Skip 6(1+ɛ)-APPROXIMATION FOR ℓ-MEDIAN BY ADVANCED SHORTCUTTING Section

6 (1+ɛ)-APPROXIMATION FOR ℓ-MEDIAN BY ADVANCED SHORTCUTTING

Now we present Algorithm 4, which returns a set of candidates that, with high probability, contains a \((1+\varepsilon)\)-approximate \((1,\ell)\)-median of complexity at most \(2\ell -2\) for any (fixed) subset that takes a constant fraction of the input. Before we present the algorithm, we present our second shortcutting lemma. Here, we do not introduce shortcuts with respect to a single curve, but with respect to several curves: by introducing shortcuts with respect to \(\varepsilon \vert T \vert\) well-chosen curves from the given set \(T \subset \mathbb {R}^d_m\) of polygonal curves, for a given \(\varepsilon \in (0,1)\), we preserve the distances to at least \((1-\varepsilon)\vert T \vert\) curves from \(T\). In this context, well chosen means that there exists a certain number of subsets of \(T\), from each of which we have to pick a curve for shortcutting. This enables the high quality of approximation of Algorithm 4, and we formalize it in the following lemma.

Lemma 6.1.

Let \(\sigma \in \mathbb {R}^d_{\ast }\) be a polygonal curve with \(\vert \sigma \vert \gt 2\) vertices and \(T = \lbrace \tau _1, \ldots , \tau _n \rbrace \subset \mathbb {R}^d_{\ast }\) be a set of polygonal curves. For \(i \in [n]\), let \(r_i = \text{d}_\text{F}(\tau _i, \sigma),\) and for \(j \in [\vert \tau _i \vert ]\), let \(v^{\tau _i}_j\) be the \(j\)th vertex of \(\tau _i\). For any \(\varepsilon \in (0,1),\) there are \(2\vert \sigma \vert -4\) subsets \(T_1, \ldots , T_{2\vert \sigma \vert -4} \subseteq T\), not necessarily disjoint, and of \(\frac{\varepsilon n}{2\vert \sigma \vert }\) curves each, such that for every subset \(T^\prime \subseteq T\) containing at least one curve out of each \(T_k \in \lbrace T_1, \ldots , T_{2\vert \sigma \vert -4}\rbrace\), a polygonal curve \(\sigma ^\prime \in \mathbb {R}^d_{2 \vert \sigma \vert - 2}\) exists with every vertex contained in \(\begin{equation*} \bigcup \limits _{\tau _i \in T^\prime } \bigcup \limits _{j \in [\vert \tau _i \vert ]} B\left(v^{\tau _i}_j, r_i\right) \end{equation*}\) and \(\text{d}_\text{F}(\tau , \sigma ^\prime) \le \text{d}_\text{F}(\tau , \sigma)\) for each \(\tau \in T\setminus (T_1 \cup \dots \cup T_{2 \vert \sigma \vert -4})\).

The idea is the following (Figure 3 presents a visualization). One can argue that every vertex \(v\) of \(\sigma\) not contained in any of the balls centered at the vertices of the curves in \(T\) (and of radius according to their distance to \(\sigma\)) can be shortcut by connecting the last point \(p^{-}\) before \(v\) (in terms of the parameter of \(\sigma\)) contained in one of the balls and the first point \(p^{+}\) after \(v\) contained in one of the balls. This does not increase the Fréchet distances between \(\sigma\) and the \(\tau \in T\), because only matchings among line segments are affected by this modification. Furthermore, most distances are preserved when we do not actually use the last and first ball before and after \(v\), but one of the \(\frac{\varepsilon n}{2\vert \sigma \vert }\) balls before and one of the \(\frac{\varepsilon n}{2\vert \sigma \vert }\) balls after \(v\), which is the key to the following proof.

Fig. 3.

Fig. 3. By using a subset of well-chosen input curves, a shortcut can be constructed that preserves the majority of distances to the input curves: \(\text{d}_\text{F}(\sigma ^\prime , \tau) \le \text{d}_\text{F}(\sigma , \tau)\) for most \(\tau \in T\) .

Proof of Lemma 6.1

Let \(\ell = \vert \sigma \vert\). For the sake of simplicity, we assume that \(\frac{\varepsilon n}{2\ell }\) is integral. For \(i \in [n]\), let \(v^{\tau _i}_1, \ldots , v^{\tau _i}_{\vert \tau _i \vert }\) be the vertices of \(\tau _i\) with instants \(t^{\tau _i}_{1}, \ldots , t^{\tau _i}_{\vert \tau _i \vert }\), and let \(v^\sigma _1, \ldots , v^\sigma _{\ell }\) be the vertices of \(\sigma\) with instants \(t^\sigma _1, \ldots , t^\sigma _{\ell }\). In addition, for \(h \in \mathcal {H}\) (recall that \(\mathcal {H}\) is the set of all continuous bijections \(h:[0,1] \rightarrow [0,1]\) with \(h(0) = 0\) and \(h(1) = 1\)) and \(i \in [n]\), let \(r_{i,h} = \max \limits _{t \in [0,1]} \Vert \sigma (t) - \tau _i(h(t)) \Vert\) be the distance realized by \(h\) with respect to \(\tau _i\). We know from Lemma 2.4 that for each \(i \in [n],\) there exists a sequence \((h_{i,x})_{x=1}^\infty\) in \(\mathcal {H}\), such that \(\lim \limits _{x \rightarrow \infty } r_{i,h_{i,x}} = \text{d}_\text{F}(\sigma , \tau _i) = r_i\).

In the following, given arbitrary \(h_1, \ldots , h_n \in \mathcal {H}\), we describe how to modify \(\sigma\), such that its vertices can be found in the balls around the vertices of the \(\tau \in T\), of radii determined by \(h_1, \ldots , h_n\). Later we will argue that this modification can be applied using the \(h_{1,x}, \ldots , h_{n,x}\), for each \(x \in \mathbb {N}\), in particular.

Now, fix arbitrary \(h_1, \ldots , h_n \in \mathcal {H,}\) and for an arbitrary \(k \in \lbrace 2, \ldots , \vert \sigma \vert -1\rbrace\), fix the vertex \(v^\sigma _k\) of \(\sigma\) with instant \(t^\sigma _k\). For \(i \in [n]\), let \(s_i\) be the maximum of \([\vert \tau _i \vert -1]\), such that \(t^{\tau _i}_{s_i} \le h_i(t^\sigma _k) \le t^{\tau _i}_{s_{i}+1}\). Namely, \(v^\sigma _k\) is matched to a point on the line segment \(\overline{v^{\tau _1}_{s_1}v^{\tau _1}_{s_1+1}}, \ldots , \overline{v^{\tau _n}_{s_n}v^{\tau _n}_{s_n+1}}\), respectively, by \(h_1, \ldots , h_n\).

For \(i \in [n]\), let \(t^{-}_i\) be the maximum of \([0, t^\sigma _k]\), such that \(\sigma (t^{-}_i) \in B(v^{\tau _i}_{s_i}, r_{i,h_i}),\) and let \(t^{+}_i\) be the minimum of \([t^\sigma _k, 1]\), such that \(\sigma (t^+_i) \in B(v^{\tau _i}_{s_i+1}, r_{i,h_i})\). These are the instants when \(\sigma\) visits \(B(v^{\tau _i}_{s_i}, r_{i,h_i})\) before or when it visits \(v^\sigma _k\) and \(\sigma\) visits \(B(v^{\tau _i}_{s_i+1}, r_{i,h_i})\) when or after it visits \(v^\sigma _k\), respectively. Furthermore, there is a permutation \(\alpha \in \mathcal {S}_n\) of the index set \([n]\), such that \(\begin{equation*} t^{-}_{\alpha ^{-1}(1)} \le \dots \le t^{-}_{\alpha ^{-1}(n)}. \qquad \qquad \qquad \qquad {\rm (I)} \end{equation*}\) In addition, there is a permutation \(\zeta \in \mathcal {S}_n\) of the index set \([n]\), such that \(\begin{equation*} t^{+}_{\zeta ^{-1}(1)} \le \dots \le t^{+}_{\zeta ^{-1}(n)}. \qquad \qquad \qquad \qquad {\rm (II)} \end{equation*}\) As well, for each \(i \in [n],\) we have \(\begin{equation*} t^{\tau _i}_{s_i} \le h_i(t^{-}_i) \qquad \qquad \qquad \qquad {\rm (III)} \end{equation*}\) and \(\begin{equation*} h_i(t^{+}_i) \le t^{\tau _i}_{s_i+1}, \qquad \qquad \qquad \qquad {\rm (IV)} \end{equation*}\) because \(\sigma (t^-_i)\) and \(\sigma (t^+_i)\) are the closest points to \(v^\sigma\) on \(\sigma\) that have distance at most \(r_{i,h_i}\) to \(v^{\tau _i}_{s_i}\) and \(v^{\tau _i}_{s_i+1}\), respectively, by definition. We will now use Equations (I) to (IV) to prove that an advanced shortcut only affects matchings among line segments, and hence we can easily bound the resulting distances for at least \((1-\varepsilon)n\) of the curves.

Let \(\begin{equation*} I_{v^\sigma _k}(h_1, \ldots , h_n) = \lbrace \tau _{\alpha ^{-1}((1-\frac{\varepsilon }{2\ell })n + 1)}, \ldots , \tau _{\alpha ^{-1}(n)}\rbrace ,\ O_{v^\sigma _k}(h_1, \ldots , h_n) = \lbrace \tau _{\zeta ^{-1}(1)}, \ldots , \tau _{\zeta ^{-1}(\frac{\varepsilon n}{2\ell })} \rbrace . \end{equation*}\) \(I_{v^\sigma _k}(h_1, \ldots , h_n)\) is the set of the last \(\frac{\varepsilon n}{2\ell }\) curves whose balls are visited by \(\sigma\), before or when \(\sigma\) visits \(v^\sigma _k\). Similarly, \(O_{v^\sigma _k}(h_1, \ldots , h_n)\) is the set of the first \(\frac{\varepsilon n}{2\ell }\) curves whose balls are visited by \(\sigma\), when or immediately after \(\sigma\) visited \(v^\sigma _k\). We now modify \(\sigma\), such that \(v^\sigma _k\) is replaced by two new vertices that are elements of at least one \(B(v^{\tau _i}_{j}, r_{i,h_i})\), for a \(\tau _i \in I_{v^\sigma _k}(h_1, \ldots , h_n)\), respectively for a \(\tau _i \in O_{v^\sigma _k}(h_1, \ldots , h_n)\), and \(j \in [\vert \tau _i \vert ]\), each.

Let \(\sigma ^\prime _{h_1, \ldots , h_n}\) be the piecewise defined curve, defined just like \(\sigma\) on \([0,t^-_{\alpha ^{-1}(k_1)}]\) and \([t^+_{\zeta ^{-1}(k_2)},1]\) for arbitrary \(k_1 \in \lbrace (1-\frac{\varepsilon }{2\ell })n + 1, \ldots , n\rbrace\) and \(k_2 \in [\frac{\varepsilon n}{2\ell }]\), but on \((t^-_{\alpha ^{-1}(k_1)}, t^+_{\zeta ^{-1}(k_2)})\) it connects \(\sigma (t^-_{\alpha ^{-1}(k_1)})\) and \(\sigma (t^+_{\zeta ^{-1}(k_2)})\) with the line segment \(\begin{equation*} \gamma (t) = \left(1-\frac{t-t^-_{\alpha ^{-1}(k_1)}}{t^+_{\zeta ^{-1}(k_2)}-t^-_{\alpha ^{-1}(k_1)}}\right) \sigma \left(t^-_{\alpha ^{-1}(k_1)}\right) + \frac{t-t^-_{\alpha ^{-1}(k_1)}}{t^+_{\zeta ^{-1}(k_2)}-t^-_{\alpha ^{-1}(k_1)}} \sigma \left(t^+_{\zeta ^{-1}(k_2)}\right). \end{equation*}\) We now argue that for all \(\tau _i \in T \setminus (I_{v^\sigma _k}(h_1, \ldots , h_n) \cup O_{v^\sigma _k}(h_1, \ldots , h_n)),\) the Fréchet distance between \(\sigma ^\prime _{h_1, \ldots , h_n}\) and \(\tau _i\) is upper bounded by \(r_{i,h_i}\). First, note that by definition, \(h_1, \ldots , h_n\) are strictly increasing functions, since they are continuous bijections that map 0 to 0 and 1 to 1. As immediate consequence, we have that \(\begin{equation*} t^{\tau _i}_{s_i} \le h_i(t^-_i) \le h_i\left(t^-_{\alpha ^{-1}(k_1)}\right)\qquad \qquad \qquad \qquad {\rm (V)} \end{equation*}\) for each \(\tau _i \in T \setminus I_{v^\sigma _k}(h_1, \ldots , h_n)\) and \(\begin{equation*} h_i\left(t^+_{\zeta ^{-1}(k_2)}\right) \le h_i(t^+_i) \le t^{\tau _i}_{s_i+1} \qquad \qquad \qquad \qquad {\rm (VI)} \end{equation*}\) for each \(\tau _i \in T \setminus O_{v^\sigma _k}(h_1, \ldots , h_n)\), using Equations (I) to (IV).. Therefore, each \(\tau _i \in T \setminus (I_{v^\sigma _k}(h_1, \ldots , h_n) \cup O_{v^\sigma _k}(h_1, \ldots , h_n))\) has no vertex between the instants \(h_i(t^-_{\alpha ^{-1}(k_1)})\) and \(h_i(t^+_{\zeta ^{-1}(k_2)})\). We also know that for each \(\tau _i \in T,\) \(\begin{equation*} \left\Vert \sigma \left(t^-_{\alpha ^{-1}(k_1)}\right) - \tau _i\left(h_i\left(t^-_{\alpha ^{-1}(k_1)}\right)\right) \right\Vert \le r_{i,h_i} \qquad \qquad \qquad \qquad {\rm (VII)} \end{equation*}\) and \(\begin{equation*} \left\Vert \sigma \left(t^+_{\zeta ^{-1}(k_2)}\right) - \tau _i\left(h_i\left(t^+_{\zeta ^{-1}(k_2)}\right)\right) \right\Vert \le r_{i,h_i}. \qquad \qquad \qquad \qquad {\rm (VIII)} \end{equation*}\)

Let \(D_{s,\sigma } = [0,t^-_{\alpha ^{-1}(k_1)})\), \(D_{m,\sigma } = [t^-_{\alpha ^{-1}(k_1)}, t^+_{\zeta ^{-1}(k_2)}]\) and \(D_{e,\sigma } = (t^+_{\zeta ^{-1}(k_2)}, 1]\). In addition, for \(i \in [n]\), let \(D_{s,\tau _i} = [0,h_i(t^-_{\alpha ^{-1}(k_1)}))\), \(D_{m,\tau _i} = [h_i(t^-_{\alpha ^{-1}(k_1)}), h_i(t^+_{\zeta ^{-1}(k_2)})]\), and \(D_{e,\tau _i} = (h_i(t^+_{\zeta ^{-1}(k_2)}),1]\). Now, for each \(\tau _i \in T \setminus (I_{v^\sigma _k}(h_1, \ldots , h_n) \cup O_{v^\sigma _k}(h_1, \ldots , h_n)),\) we use \(h_i\) to match \(\sigma ^\prime _{h_1, \ldots , h_n}\vert _{D_{s,\sigma }}\) to \(\tau _i\vert _{D_{s,\tau _i}}\) and \(\sigma ^\prime _{h_1, \ldots , h_n}\vert _{D_{e,\sigma }}\) to \(\tau _i\vert _{D_{e,\tau _i}}\) with distance at most \(r_{i,h_i}\). Since \(\sigma ^\prime _{h_1, \ldots , h_n}\vert _{D_{m,\sigma }}\) and \(\tau _i\vert _{D_{m,\tau _i}}\) are just line segments by Equations (V) and (VI), they can be linearly matched to each other with distance at most \(\begin{equation*} \max \left\lbrace \left\Vert \sigma \left(t^-_{\alpha ^{-1}(k_1)}\right) - \tau _i\left(h_i\left(t^-_{\alpha ^{-1}(k_1)}\right)\right) \right\Vert , \left\Vert \sigma \left(t^+_{\zeta ^{-1}(k_2)}\right) - \tau _i\left(h_i\left(t^+_{\zeta ^{-1}(k_2)}\right)\right) \right\Vert \right\rbrace , \end{equation*}\) which is at most \(r_{i,h_i}\) by Equations (VII) to (VIII). We conclude that \(\text{d}_\text{F}(\sigma ^\prime _{h_1, \ldots , h_n}, \tau _i) \le r_{i,h_i}\).

Because this modification works for every \(h_1, \ldots , h_n \in \mathcal {H}\), we conclude that \(\text{d}_\text{F}(\sigma ^\prime _{h_1, \ldots , h_n}, \tau _i) \le r_{i,h_i}\) for every \(h_1, \ldots , h_n \in \mathcal {H}\) and \(\tau _i \in T \setminus (I_{v^\sigma _k}(h_1, \ldots , h_n) \cup O_{v^\sigma _k}(h_1, \ldots , h_n))\). Thus, \(\lim \limits _{x \rightarrow \infty } \text{d}_\text{F}(\sigma ^\prime _{h_{1,x}, \ldots , h_{n,x}}, \tau _i) \le \text{d}_\text{F}(\sigma , \tau _i) = r_i\) for each \(\tau _i \in T \setminus (I_{v^\sigma _k}(h_{1,x}, \ldots , h_{n,x}) \cup O_{v^\sigma _k}(h_{1,x}, \ldots , h_{n,x}))\).

Now, to prove the claim, for each combination \(h_1, \ldots , h_n \in \mathcal {H}\), we apply this modification to \(v^\sigma _k\) and successively to every other vertex \(v^{\sigma ^\prime _{h_1, \ldots , h_n}}_l\) of the resulting curve \(\sigma ^\prime _{h_1, \ldots , h_n}\), except \(v^{\sigma ^\prime _{h_1, \ldots , h_n}}_1\) and \(v^{\sigma ^\prime _{h_1, \ldots , h_n}}_{\vert \sigma ^\prime _{h_1, \ldots , h_n}\vert }\), since these must be elements of \(B(v^{\tau _i}_1, r_{i,h_i})\) and \(B(v^{\tau _i}_{\vert \tau _i \vert }, r_{i,h_i})\), respectively, for each \(i \in [n]\), by definition of the Fréchet distance.

Since the modification is repeated at most \(\vert \sigma \vert - 2\) times for each combination \(h_1, \dots h_n \in \mathcal {H}\), we conclude that the number of vertices of each \(\sigma ^\prime _{h_1, \ldots , h_n}\) can be bounded by \(2 \cdot (\vert \sigma \vert - 2) + 2\).

\(T_1, \ldots , T_{2\ell -4}\) are therefore all the \(I_{v^\sigma _k}(h_{1,x}, \ldots , h_{n,x})\) and \(O_{v^\sigma _k}(h_{1,x}, \ldots , h_{n,x})\) for \(k \in \lbrace 2, \ldots , 2\vert \sigma \vert - 3\rbrace\), when \(x \rightarrow \infty\). Note that every \(I_{v^\sigma _k}(h_{1,x}, \ldots , h_{n,x})\) and \(O_{v^\sigma _k}(h_{1,x}, \ldots , h_{n,x})\) is determined by the visiting order of the balls, and since their radii converge, these sets do as well.□

We now present Algorithm 4, which is nearly identical to Algorithm 2 but uses the advanced shortcutting lemma. Furthermore, like Algorithm 2, it can be used as a plugin in the recursive \(k\)-median approximation scheme (Algorithm 5) that we present in Section 7.

We prove the quality of approximation of Algorithm 4.

Theorem 6.2.

Given three parameters \(\beta \in [1, \infty)\), \(\delta \in (0,1)\), \(\varepsilon \in (0,0.158],\) and a set \(T = \lbrace \tau _1, \ldots , \tau _n \rbrace \subset \mathbb {R}^d_m\) of polygonal curves, with probability at least \(1-\delta ,\) the set of candidates that Algorithm 4 returns contains a \((1+\varepsilon)\)-approximate \(\ell\)-median with up to \(2\ell -2\) vertices for any (fixed) \(T^\prime \subseteq T\), if \(\vert T^\prime \vert \ge \frac{1}{\beta } \vert T \vert\).

In the following proof, we make use of a case distinction developed by Nath and Taylor [30, Proof of Theorem 10], which is a key ingredient to enable the \((1+\varepsilon)\)-approximation, although the domain of \(\varepsilon\) has to be restricted to \((0, 0.158]\).

Proof of Theorem 6.2

We assume that \(\vert T^\prime \vert \ge \frac{1}{\beta } \vert T \vert\) and \(\ell \gt 2\). Let \(n^\prime\) be the number of sampled curves in \(S\) that are elements of \(T^\prime\). Clearly, \(E\left[n^\prime \right] \ge \sum _{i=1}^{\vert S \vert } \frac{1}{\beta } = \frac{\vert S \vert }{\beta }\). In addition, \(n^\prime\) is the sum of independent Bernoulli trials. A Chernoff bound (see Lemma 2.11) yields the following: \(\begin{align*} \Pr \left[n^\prime \lt \frac{\vert S \vert }{2\beta }\right] \le \Pr \left[n^\prime \lt \frac{E\left[n^\prime \right]}{2}\right] \le \exp \left(-\frac{1}{4}\frac{\vert S \vert }{2\beta }\right) \le \exp \left(\frac{\ell \ln \left(\frac{\delta }{4(2\ell -4)}\right)}{\varepsilon } \right) \le \left(\frac{\delta ^\ell }{8^\ell }\right)^{\frac{1}{\varepsilon }} \le \frac{\delta }{8}. \end{align*}\)

In other words, with probability at most \(\delta /8,\) no subset \(S^\prime \subseteq S\), of cardinality at least \(\frac{\vert S \vert }{2\beta }\), is a subset of \(T^\prime\). We condition the rest of the proof on the contrary event, denoted by \(\mathcal {E}_{T^\prime }\), namely that there is a subset \(S^\prime \subseteq S\) with \(S^\prime \subseteq T^\prime\) and \(\vert S^\prime \vert \ge \frac{\vert S \vert }{2\beta }\). Note that \(S^\prime\) is then a uniform and independent sample of \(T^\prime\) (see Section 2.1.1).

Now, let \(c^\ast \in \mathbb {R}^d_\ell\) be an optimal \(\ell\)-median for \(T^\prime\). The expected distance between \(s \in S^\prime\) and \(c^\ast\) is \(\begin{equation*} E\left[\text{d}_\text{F}(s, c^\ast)\ \vert \ \mathcal {E}_{T^\prime }\right] = \sum _{\tau \in T^\prime } \text{d}_\text{F}(c^\ast , \tau) \cdot \frac{1}{\vert T^\prime \vert } = \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }. \end{equation*}\) By linearity, we have \(E\left[\text{cost}( S^\prime ,c^\ast ) \ \vert \ \mathcal {E}_{T^\prime }\right] = \frac{\vert S^\prime \vert }{\vert T^\prime \vert } \text{cost}( T^\prime ,c^\ast )\). Markov’s inequality yields the following: \(\begin{align*} \Pr \left[ \frac{\delta \vert T^\prime \vert }{4\vert S^\prime \vert }\text{cost}( S^\prime ,c^\ast ) \gt \text{cost}( T^\prime ,c^\ast ) \ \Big \vert \ \mathcal {E}_{T^\prime }\right] \le \frac{\delta }{4}. \end{align*}\)

We conclude that with probability at most \(\delta /4,\) we have \(\frac{\delta \vert T^\prime \vert }{4\vert S^\prime \vert } \text{cost}( S^\prime ,c^\ast ) \gt \text{cost}( T^\prime ,c^\ast )\).

Now, from Lemma 6.1, we know that there are \(2\ell -4\) subsets \(T^\prime _1, \ldots , T^\prime _{2\ell -4} \subseteq T^\prime\), of cardinality \(\frac{\varepsilon \vert T^\prime \vert }{2\ell }\) each and which are not necessarily disjoint, such that for every set \(W \subseteq T^\prime\) that contains at least one curve \(\tau \in T^\prime _i\) for each \(i \in [2\ell -4]\), there exists a curve \(c^\prime \in \mathbb {R}^d_{2\ell -2}\) that has all of its vertices contained in \(\begin{equation*} \bigcup \limits _{\tau \in W} \bigcup \limits _{j \in [\vert \tau \vert ]} B\left(v^{\tau }_j, \text{d}_\text{F}(\tau , c^\ast)\right) \end{equation*}\) and for at least \((1-\varepsilon) \vert T^\prime \vert\) curves \(\tau \in T^\prime \setminus (T^\prime _1 \cup \dots \cup T^\prime _{2\ell -4})\) it holds that \(\text{d}_\text{F}(\tau , c^\prime) \le \text{d}_\text{F}(\tau , c^\ast)\).

There are up to \(\frac{\varepsilon \vert T^\prime \vert }{4\ell }\) curves with distance to \(c^\ast\) at least \(\frac{4\ell \text{cost}( T^\prime ,c^\ast ) }{\varepsilon \vert T^\prime \vert }\). Otherwise, the cost of these curves would exceed \(\text{cost}( T^\prime ,c^\ast )\), which is a contradiction. Later we will prove that each ball we cover has radius at most \(\frac{4\ell \text{cost}( T^\prime ,c^\ast ) }{\varepsilon \vert T^\prime \vert }\). Therefore, for each \(i \in [2\ell -4],\) we have to ignore up to half of the curves \(\tau \in T^\prime _i\), since we do not cover the balls of radius \(\text{d}_\text{F}(\tau , c^\ast)\) centered at their vertices. For each \(i \in [2\ell -4]\) and \(s \in S^\prime ,\) we now have \(\begin{equation*} \Pr \left[s \in T^\prime _i \wedge \text{d}_\text{F}(s, c^\ast) \le \frac{4\ell \text{cost}( T^\prime ,c^\ast ) }{\varepsilon \vert T^\prime \vert } \ \Big \vert \ \mathcal {E}_{T^\prime } \right] \ge \frac{\varepsilon }{4\ell }. \end{equation*}\) Therefore, by independence, for each \(i \in [2\ell -4],\) the probability that no \(s \in S^\prime\) is an element of \(T^\prime _i\) and has distance to \(c^\ast\) at most \(\frac{4\ell \text{cost}( T^\prime ,c^\ast ) }{\varepsilon \vert T^\prime \vert }\) is at most \((1-\frac{\varepsilon }{4\ell })^{\vert S^\prime \vert } \le \exp (-\frac{\varepsilon }{4\ell } \frac{4\ell (\ln (1/\delta)+\ln (4(2\ell -4)))}{\varepsilon }) = \exp (\ln (\frac{\delta }{4(2\ell -4)})) = \frac{\delta }{4(2\ell -4)}\). In addition, with probability at most \(\delta /4,\) Algorithm 1 fails to compute a 34-approximate \(\ell\)-median \(c \in \mathbb {R}^d_\ell\) for \(S^\prime\) (see Corollary 3.3).

Using Lemma 2.10, we conclude that with probability at least \(1-\frac{7}{8}\delta\), all of the following events occur simultaneously:

(1)

There is a subset \(S^\prime \subseteq S\) of cardinality at least \(\vert S \vert /(2\beta)\) that is a uniform and independent sample of \(T^\prime\);

(2)

for each \(i \in [2\ell -4]\), \(S^\prime\) contains at least one curve from \(T^\prime _i\) with distance to \(c^\ast\) up to \(\frac{4\ell \text{cost}( T^\prime ,c^\ast ) }{\varepsilon \vert T^\prime \vert }\);

(3)

Algorithm 1 computes a polygonal curve \(c \in \mathbb {R}^d_\ell\) with \(\text{cost}( S^\prime ,c^\ast _{S^\prime }) \le \text{cost}( S^\prime ,c) \le 34\, \text{cost}( S^\prime ,c^\ast _{S^\prime })\), where \(c^\ast _{S^\prime } \in \mathbb {R}^d_\ell\) is an optimal \(\ell\)-median for \(S^\prime\); and

(4)

it holds that \(\frac{\delta \vert T^\prime \vert }{4 \vert S^\prime \vert } \text{cost}( S^\prime ,c^\ast ) \le \text{cost}( T^\prime ,c^\ast )\).

Let \(B_{c^\ast } = \lbrace \tau \in T^\prime \mid \text{d}_\text{F}(\tau , c^\ast) \le \frac{\text{cost}( T^\prime ,c^\ast ) }{\varepsilon ^2 \vert T^\prime \vert } \rbrace\) and \(B_{c} = \lbrace \tau \in T^\prime \mid \text{d}_\text{F}(\tau , c) \le \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \rbrace\). First, note that \(\vert T^{\prime } \setminus B_{c^\ast } \vert \le \varepsilon ^2 \vert T^\prime \vert\); otherwise, \(\text{cost}( T^\prime \setminus B_{c^\ast },c^\ast ) \gt \text{cost}( T^\prime ,c^\ast )\), which is a contradiction, and therefore \(\vert B_{c^\ast } \vert \ge (1-\varepsilon ^2) \vert T^\prime \vert\). We now distinguish two cases.

Case 1. \(\vert B_{c^\ast } \setminus B_{c} \vert \gt 2 \varepsilon \vert B_{c^\ast } \vert .\)

We have \(2 \varepsilon \vert B_{c^\ast } \vert \ge (1-\varepsilon ^2)2\varepsilon \vert T^\prime \vert \ge \varepsilon \vert T^\prime \vert\), and hence \(\Pr [\text{d}_\text{F}(s,c) \gt \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\ \vert \ \mathcal {E}_{T^\prime } ] \ge \varepsilon\) for each \(s \in S^\prime\). Using independence, we conclude that with probability at most \(\begin{equation*} (1-\varepsilon)^{\vert S^\prime \vert } \le \exp \left(-\varepsilon \frac{4\ell (\ln (1/\delta)+\ln (4(2\ell -4)))}{\varepsilon }\right) \le \frac{\delta ^{4\ell }}{4^{4\ell }} \le \frac{\delta }{8} \end{equation*}\), no \(s \in S^\prime\) has distance to \(c\) greater than \(\varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\). Including this bad event, by Lemma 2.10 we conclude that with probability at least \(1-\delta ,\) Items 1 to 4 occur simultaneously and at least one \(s \in S^\prime\) has distance to \(c\) greater than \(\varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\), and hence \(\text{cost}( S^\prime ,c) \gt \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \iff \frac{\text{cost}( S^\prime ,c) }{\varepsilon } \gt \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\) and thus we indeed cover the balls of radius at most \(\frac{4\ell \text{cost}( T^\prime ,c^\ast ) }{\varepsilon \vert T^\prime \vert } \lt \frac{4\ell }{\varepsilon } \frac{\text{cost}( S^\prime ,c) }{\varepsilon }\).

In the last step, Algorithm 4 returns a set \(C\) of all curves with up to \(2\ell -2\) vertices from the grids that contains one curve, denoted by \(c^{\prime \prime }\) with same number of vertices as \(c^\prime\) (recall that this is the curve guaranteed from Lemma 6.1) and distance at most \(\frac{\varepsilon }{n} \Delta _l \le \frac{\varepsilon }{\vert T^\prime \vert } \text{cost}( T^\prime ,c^\ast )\) between every corresponding pair of vertices of \(c^\prime\) and \(c^{\prime \prime }\). We conclude that \(\text{d}_\text{F}(c^\prime , c^{\prime \prime }) \le \frac{\varepsilon }{\vert T^\prime \vert } \text{cost}( T^\prime ,c^\ast )\). In addition, recall that \(\text{d}_\text{F}(\tau , c^\prime) \le \text{d}_\text{F}(\tau , c^\ast)\) for \(\tau \in T^\prime \setminus (T^\prime _1 \cup \dots \cup T^\prime _{2\ell -4})\). Further, \(T^\prime\) contains at least \(\frac{\vert T^\prime \vert }{2}\) curves with distance at most \(\frac{2\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\) to \(c^\ast\); otherwise, the cost of the remaining curves would exceed \(\text{cost}( T^\prime ,c^\ast )\), which is a contradiction, and since \(\varepsilon \lt \frac{1}{2}\), there is at least one curve \(\sigma \in T^\prime \setminus (T^\prime _1 \cup \dots \cup T^\prime _{2\ell -4})\) with \(\text{d}_\text{F}(\sigma , c^\prime) \le \text{d}_\text{F}(\sigma , c^\ast) \le \frac{2\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }\) by the pigeonhole principle. We can now bound the cost of \(c^{\prime \prime }\) as follows: \(\begin{align*} \text{cost}( T^\prime ,c^{\prime \prime }) & ={} \sum _{\tau \in T^\prime } \text{d}_\text{F}(\tau , c^{\prime \prime }) \le \sum _{\tau \in T^\prime \setminus (T^\prime _1 \cup \dots \cup T^\prime _{2\ell -4})} \left(\text{d}_\text{F}(\tau , c^\prime) + \frac{\varepsilon }{\vert T^\prime \vert } \text{cost}( T^\prime ,c^\ast ) \right)\ + \\ & \ \ \ \ \sum _{\tau \in (T^\prime _1 \cup \dots \cup T^\prime _{2\ell -4})} \left(\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , \sigma) + \text{d}_\text{F}(\sigma , c^{\prime }) + \text{d}_\text{F}(c^\prime , c^{\prime \prime }) \right) \\ & \le {} (1+\varepsilon) \text{cost}( T^\prime ,c^\ast ) + \sum _{\tau \in (T^\prime _1 \cup \dots \cup T^\prime _{2\ell -4})} \left((2+2+\varepsilon)\frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \right) \\ & \le {} \text{cost}( T^\prime ,c^\ast ) + \varepsilon \text{cost}( T^\prime ,c^\ast ) + 5\varepsilon \text{cost}( T^\prime ,c^\ast ) = (1 + 6 \varepsilon) \text{cost}( T^\prime ,c^\ast ) . \end{align*}\)

Case 2. \(\vert B_{c^\ast } \setminus B_{c} \vert \le 2 \varepsilon \vert B_{c^\ast } \vert .\)

Again, we distinguish two cases.

Case 2.1. \(\text{d}_\text{F}(c,c^\ast) \le 4\varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }.\)

We can easily bound the cost of \(c\): \(\begin{align*} \text{cost}( T^\prime ,c) \le \sum _{\tau \in T^\prime } (\text{d}_\text{F}(\tau , c^\ast) + \text{d}_\text{F}(c^\ast , c)) \le (1+4\varepsilon) \text{cost}( T^\prime ,c^\ast ) . \end{align*}\)

Case 2.2. \(\text{d}_\text{F}(c,c^\ast) \gt 4\varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert }.\)

Recall that \(\vert B_{c^\ast } \vert \ge (1-\varepsilon ^2) \vert T^\prime \vert\). We have \(\begin{align*} \vert T^\prime \setminus B_{c} \vert & \le {} \vert T^\prime \setminus B_{c^\ast } \vert + 2\varepsilon \vert B_{c^\ast } \vert = \vert T^\prime \vert - (1-2\varepsilon) \vert B_{c^\ast } \vert \le \vert T^\prime \vert - (1-2\varepsilon)(1-\varepsilon ^2) \vert T^\prime \vert \\ & = (2\varepsilon + \varepsilon ^2 - 2\varepsilon ^3) \vert T^\prime \vert \lt \frac{1}{3} \vert T^\prime \vert . \end{align*}\)

Hence, \(\vert B_{c} \vert \ge (1-2\varepsilon -\varepsilon ^2 + 2 \varepsilon ^3) \vert T^\prime \vert \gt \frac{2}{3} \vert T^\prime \vert\). Assume that we assign all curves to \(c\) instead of to \(c^\ast\). For \(\tau \in B_{c}\), we now have decrease in cost \(\text{d}_\text{F}(\tau , c^\ast) - \text{d}_\text{F}(\tau , c)\), which can be bounded as follows: \(\begin{align*} \text{d}_\text{F}(\tau , c^\ast) - \text{d}_\text{F}(\tau , c) & \ge {} \text{d}_\text{F}(\tau , c^\ast) - \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \ge \text{d}_\text{F}(c,c^\ast) - \text{d}_\text{F}(\tau , c) - \varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \\ & \ge \text{d}_\text{F}(c, c^\ast) - 2\varepsilon \frac{\text{cost}( T^\prime ,c^\ast ) }{\vert T^\prime \vert } \gt \frac{1}{2} \text{d}_\text{F}(c,c^\ast). \end{align*}\)

For \(\tau \in T^\prime \setminus B_{c}\), we have an increase in cost \(\text{d}_\text{F}(\tau , c) - \text{d}_\text{F}(\tau , c^\ast) \le \text{d}_\text{F}(c, c^\ast)\). Let the overall increase in cost be denoted by \(\alpha\), which can be bounded as follows: \(\begin{align*} \alpha \lt \vert T^\prime \setminus B_{c} \vert \cdot \text{d}_\text{F}(c, c^\ast) - \vert B_{c} \vert \cdot \frac{\text{d}_\text{F}(c,c^\ast)}{2}. \end{align*}\)

By the fact that \(\vert T^\prime \setminus B_{c} \vert \lt \frac{1}{2} \vert B_{c} \vert\) for our choice of \(\varepsilon\), we conclude that \(\alpha \lt 0\), which is a contradiction because \(c^\ast\) is an optimal \(\ell\)-median for \(T^\prime\). Therefore, Case 2.2 cannot occur. Rescaling \(\varepsilon\) by \(\frac{1}{6}\) proves the claim.□

We analyze the worst-case running time of Algorithm 4 and the number of candidates it returns.

Theorem 6.3.

The running time as well as the number of candidates that Algorithm 4 returns can be bounded by \(\begin{equation*} m^{O(\ell)} \left(\frac{1}{\delta \varepsilon }\right)^{O(\ell d)} \beta ^{O(\ell d 1/\varepsilon \ln (1/\delta))}. \end{equation*}\)

Proof. The sample \(S\) has size \(O(\frac{\ln (1/\delta) \cdot \beta }{\varepsilon })\), and obtaining it takes time \(O(\frac{\ln (1/\delta) \cdot \beta }{\varepsilon })\). Let \(n_S = \vert S \vert\). The outer for-loop runs \(\begin{equation*} \binom{n_S}{\frac{n_S}{2 \beta }} \le \left(\frac{e n_S}{\frac{n_S}{2\beta }} \right)^{\frac{n_S}{2\beta }} = (2e\beta)^{\frac{n_S}{2\beta }} = \beta ^{O(\frac{\ln (1/\delta)}{\varepsilon })} \end{equation*}\) times. In each iteration, we run Algorithm 1, taking time \(O(m^2 \log (m) \ln ^2(1/\delta) + m^3 \log m)\) (see Corollary 3.3); we compute the cost of the returned curve with respect to \(S^\prime\), taking time \(O(\frac{\ln (1/\delta)}{\varepsilon } \cdot m \log (m))\); and per curve in \(S^\prime ,\) we build up to \(m\) grids of size \(\begin{equation*} \left(\frac{\frac{(1+\varepsilon)\Delta }{\varepsilon }}{\frac{2\varepsilon 2 \delta n \Delta }{n\sqrt {d} 4 \vert S \vert }}\right)^d = \left(\frac{\sqrt {d} \vert S \vert (1+\varepsilon)}{\varepsilon ^2 \delta }\right)^d \in O\left(\frac{\beta ^d\ln (1/\delta)^d}{\varepsilon ^{3d}\delta ^d}\right) \end{equation*}\) each. Algorithm 4 then enumerates all combinations of \(2\ell -2\) points from up to \(\vert S^\prime \vert \cdot m\) grids, resulting in \(\begin{equation*} O\left(\frac{m^{2\ell -2} \beta ^{2\ell d-2d+2\ell -2} \ln (1/\delta)^{2\ell d-2d+2\ell -2}}{\varepsilon ^{6\ell d-6d+2\ell -2}\delta ^{2\ell d-2d}}\right) \end{equation*}\) candidates per iteration of the outer for-loop. Thus, Algorithm 4 computes \(m^{O(\ell)}(\beta \cdot 1/\delta \cdot 1/\varepsilon)^{O(d\ell)}\) candidates per iteration of the outer for-loop.

All in all, the running time and number of candidates the algorithm returns can be bounded by \(\begin{equation*} m^{O(\ell)} \left(\frac{1}{\delta \varepsilon }\right)^{O(\ell d)} \beta ^{O(\ell d 1/\varepsilon \ln (1/\delta))}. \end{equation*}\)□

Skip 7(1+ɛ)-APPROXIMATION FOR (k,ℓ)-MEDIAN Section

7 (1+ɛ)-APPROXIMATION FOR (k,ℓ)-MEDIAN

We generalize the algorithm of Ackermann et al. [2] in the following way: instead of drawing a uniform sample and running a problem-specific algorithm on this sample in the candidate phase, we only run a problem-specific “plugin” algorithm in the candidate phase, thus dropping the framework around the sampling property. We believe that the problem-specific algorithms used in the work of Ackermann et al. [2] do not fulfill the role of a plugin, since parts of the problem-specific operations, such as the uniform sampling, remain in the main algorithm. Here, we separate the problem-specific operations from the main algorithm: any algorithm can serve as a plugin if it is able to return candidates for a cluster that takes a constant fraction of the input, where the fraction is an input parameter of the algorithm and some approximation factor is guaranteed (w.h.p.). The calls to the candidate-finder plugin do not even need to be stochastically independent, allowing adaptive algorithms.

Now, let \(\mathcal {X} = (X,\rho)\) be an arbitrary (non-)metric space, where \(X\) is any non-empty (ground-)set and \(\rho :X \times X \rightarrow \mathbb {R}_{\ge 0}\) is a distance function (not necessarily a metric). We introduce a generalized definition of \(k\)-median clustering, where the input is restricted to come from a predefined subset \(Y \subseteq X\) and the medians are restricted to come from a predefined subset \(Z \subseteq X\).

Definition 7.1

(Generalized k-median)

The generalized \(k\)-median clustering problem is defined as follows, where \(k \in \mathbb {N}\) is a fixed (constant) parameter of the problem: given a finite and non-empty set \(T = \lbrace \tau _1, \ldots , \tau _n \rbrace \subseteq Y\), compute a set \(C\) of \(k\) elements from \(Z\), such that \(\text{cost}( T,C) = \sum _{\tau \in T} \min _{c \in C} \rho (\tau ,c)\) is minimal.
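For concreteness, the objective of Definition 7.1 reads as follows in code (a minimal sketch; rho stands for the abstract distance function \(\rho\) and is deliberately left unspecified):

def generalized_k_median_cost(T, C, rho):
    # Every input element is charged the distance to its nearest center under rho,
    # and the charges are summed over the whole input.
    return sum(min(rho(tau, c) for c in C) for tau in T)

For the \((k,\ell)\)-median problem, rho would be the continuous Fréchet distance between polygonal curves, computed, for example, with the algorithm of Alt and Godau [4].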

The following algorithm, Algorithm 5, can approximate every \(k\)-median problem compatible with Definition 7.1, when provided with such a problem-specific plugin algorithm for computing candidates. In particular, it can approximate the \((k,\ell)\)-median problem for polygonal curves under the Fréchet distance, when provided with Algorithm 2 or Algorithm 4. Then, we have \(X = \mathbb {R}^d_\ast\), \(Y = \mathbb {R}^d_m \subseteq \mathbb {R}^d_\ast = X\) and \(Z = \mathbb {R}^d_\ell \subseteq \mathbb {R}^d_\ast = X\). Note that the algorithm computes a bicriteria approximation—that is, the solution is approximated in terms of the cost and the number of vertices of the center curves (i.e., the centers come from \(\mathbb {R}^d_{2\ell -2}\)).

Algorithm 5 has several parameters. The first parameter \(C\) is the set of centers found so far, and \(\kappa\) is the number of centers yet to be found. The following parameters concern only the plugin algorithm used within the algorithm: \(\beta\) determines the size (in terms of a fraction of the input) of the smallest cluster for which an approximate median can be computed, \(\delta\) determines the probability of failure of the plugin algorithm, and \(\varepsilon\) determines the approximation factor of the plugin algorithm.

Algorithm 5 works as follows. If it has already computed some centers (and there are still centers to compute), it does pruning: some clusters might be too small for the plugin algorithm to compute approximate medians for them. Algorithm 5 then calls itself recursively with only half of the input: the elements with larger distances to the centers found so far. This way the small clusters will eventually take a larger fraction of the input and can be found in the candidate phase. In this phase, Algorithm 5 calls its plugin, and for each candidate that the plugin returned, it calls itself recursively: adding the candidate at hand to the set of centers found so far and decrementing \(\kappa\) by one. Eventually, all combinations of computed candidates are evaluated against the original input, and the centers that together evaluated best are returned.
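The following is a minimal structural sketch in Python of this recursion (our illustration, not the article's pseudocode); the exact pruning rule, failure handling, and parameter bookkeeping are simplified, and median_candidates stands for the plugin, for example Algorithm 2 or Algorithm 4.

def gather_center_sets(T, C, kappa, beta, delta, eps, median_candidates, rho, solutions):
    # Collects complete center sets in `solutions`; the best set is chosen against
    # the original input afterward.
    if kappa == 0:
        solutions.append(list(C))
        return
    if C and len(T) > 1:
        # Pruning phase: recurse on the half of the input farthest from the centers
        # found so far, so that small clusters eventually take a constant fraction.
        by_distance = sorted(T, key=lambda tau: min(rho(tau, c) for c in C))
        gather_center_sets(by_distance[len(T) // 2:], C, kappa, beta, delta, eps,
                           median_candidates, rho, solutions)
    # Candidate phase: the plugin proposes medians for every cluster that takes at
    # least a 1/beta fraction of the current input.
    for candidate in median_candidates(T, beta, delta, eps):
        gather_center_sets(T, C + [candidate], kappa - 1, beta, delta, eps,
                           median_candidates, rho, solutions)

def best_center_set(T_original, solutions, rho):
    # Final evaluation of all assembled center sets against the original input.
    return min(solutions, key=lambda C: sum(min(rho(tau, c) for c in C) for tau in T_original))

A call in the spirit of Theorem 7.2 would start with gather_center_sets(list(T), [], k, beta, delta, eps, median_candidates, rho, solutions) and return best_center_set(list(T), solutions, rho).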

The quality of approximation and worst-case running time of Algorithm 5 is stated in the following two theorems, which we prove further in the following. The proofs are adaptations of corresponding proofs in the work of Ackermann et al. [2]. We provide them for the sake of completeness. We note that no metric properties are used in the proofs.

Theorem 7.2.

Let \(\alpha \in [1, \infty)\) and Median-Candidates be an algorithm that, given three parameters \(\beta \in [1, \infty)\), \(\delta , \varepsilon \in [0,1),\) and a finite set \(T \subseteq Y\), returns with probability at least \(1-\delta\) an \((\alpha + \varepsilon)\)-approximate 1-median for any \(T^\prime \subseteq T\), if \(\vert T^\prime \vert \ge \frac{1}{\beta } \vert T \vert\).

Let \(T \subseteq Y\) be a finite set. Algorithm 5 called with parameters \((T,\emptyset ,k,\beta ,\delta , \varepsilon)\), where \(\beta \in (2k, \infty)\) and \(\delta , \varepsilon \in (0,1)\), returns with probability at least \(1-\delta\) a set \(C = \lbrace c_1, \ldots , c_k\rbrace\) with \(\text{cost}( T,C) \le (1+\frac{4k^2}{\beta -2k})(\alpha + \varepsilon) \text{cost}( T,C^\ast )\), where \(C^\ast\) is an optimal set of \(k\) medians for \(T\).

Theorem 7.3.

Let \(T_1(n, \beta , \delta , \varepsilon)\) denote the worst-case running time of Median-Candidates for an arbitrary input set \(T\) with \(\vert T \vert = n,\) and let \(C(n, \beta , \delta , \varepsilon)\) denote the maximum number of candidates it returns. In addition, let \(T_\rho\) denote the worst-case running time needed to compute \(\rho\) for an input element and a candidate.

If \(T_1\) and \(C\) are non-decreasing in \(n\), Algorithm 5 has running time \(O(C(n,\beta ,\delta ,\varepsilon)^{k+2} \cdot n \cdot T_\rho + C(n,\beta ,\delta ,\varepsilon)^{k+1} \cdot T_1(n, \beta , \delta , \varepsilon))\).

Our main results, which we state in the following, follow from Theorems 4.2 and 4.3 (respectively, Theorems 6.2 and 6.3) combined with Theorems 7.2 and 7.3. The first result is a \((3+\varepsilon)\)-approximation that follows by using Algorithm 2 as a plugin, and the second one is a \((1+\varepsilon)\)-approximation that follows by using Algorithm 4 as a plugin.

Corollary 7.4.

Given two parameters \(\delta , \varepsilon \in (0,1)\) and a finite set \(T \subset \mathbb {R}^d_m\) of polygonal curves, Algorithm 5 endowed with Algorithm 2 as Median-Candidates and run with parameters \((T,\emptyset ,k,\frac{20k^2}{\varepsilon }+2k,\delta , \varepsilon /5)\) returns with probability at least \(1-\delta\) a set \(C \subset \mathbb {R}^d_{2\ell -2}\) that is a \((3+\varepsilon)\)-approximate solution to the \((k,\ell)\)-median for \(T\). Algorithm 5 then has running time \(\begin{equation*} n m^{O(k\ell)}\left(\frac{1}{\delta \varepsilon }\right)^{O(k \ell d)}\left(\frac{k^2}{\varepsilon }\right)^{O(k \ell d 1/\varepsilon \ln (1/\delta))}. \end{equation*}\)
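As a sanity check on this parameter choice (our calculation, under the reading that Algorithm 2 called with approximation parameter \(\varepsilon /5\) enters Theorem 7.2 with \(\alpha = 3\) and the theorem's \(\varepsilon\) equal to \(\varepsilon /5\)): with \(\beta = \frac{20k^2}{\varepsilon } + 2k\) we have \(\frac{4k^2}{\beta -2k} = \frac{4k^2}{20k^2/\varepsilon } = \frac{\varepsilon }{5}\), so Theorem 7.2 yields \(\begin{equation*} \text{cost}( T,C) \le \left(1+\frac{\varepsilon }{5}\right)\left(3+\frac{\varepsilon }{5}\right) \text{cost}( T,C^\ast ) = \left(3 + \frac{4\varepsilon }{5} + \frac{\varepsilon ^2}{25}\right) \text{cost}( T,C^\ast ) \le (3+\varepsilon)\, \text{cost}( T,C^\ast ) \end{equation*}\) for \(\varepsilon \in (0,1)\). An analogous calculation with \(\beta = \frac{12k^2}{\varepsilon } + 2k\) and \(\varepsilon /3\) gives \((1+\frac{\varepsilon }{3})(1+\frac{\varepsilon }{3}) \le 1+\varepsilon\), matching the \((1+\varepsilon)\) bound of Corollary 7.5 below.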

We note that the following result does not differ in asymptotic running time from the previous one. However, the hidden constants are larger if we combine Algorithms 5 and 4.

Corollary 7.5.

Given two parameters \(\delta \in (0,1), \varepsilon \in (0, 0.158]\) and a finite set \(T \subset \mathbb {R}^d_m\) of polygonal curves, Algorithm 5 endowed with Algorithm 4 as Median-Candidates and run with parameters \((T,\emptyset ,k,\frac{12k^2}{\varepsilon }+2k,\delta , \varepsilon /3)\) returns with probability at least \(1-\delta\) a set \(C \subset \mathbb {R}^d_{2\ell -2}\) that is a \((1+\varepsilon)\)-approximate solution to the \((k,\ell)\)-median for \(T\). Algorithm 5 then has running time \(\begin{equation*} n m^{O(k\ell)}\left(\frac{1}{\delta \varepsilon }\right)^{O(k \ell d)}\left(\frac{k^2}{\varepsilon }\right)^{O(k \ell d 1/\varepsilon \ln (1/\delta))}. \end{equation*}\)

The following proof is an adaptation of Ackermann et al. [2, Theorems 2.2 to 2.5].

Proof of Theorem 7.2

For \(k=1\), the claim trivially holds. We now distinguish two cases. In the first case, the principle of the proof is presented in all of its detail. In the second case, we only show how to generalize the first case to \(k \gt 2\).

Case 1. \(k=2.\)

Let \(C^\ast = \lbrace c^\ast _1, c^\ast _2 \rbrace\) be an optimal set of \(k\) medians for \(T\) with clusters \(T^\ast _1\) and \(T^\ast _2\), respectively, that form a partition of \(T\). For the sake of simplicity, assume that \(n\) is a power of 2 and w.l.o.g. assume that \(\vert T^\ast _1 \vert \ge \frac{1}{2} \vert T \vert \gt \frac{1}{\beta } \vert T \vert\). Let \(C_1\) be the set of candidates returned by Median-Candidates in the initial call. With probability at least \(1-\delta /k\), there is a \(c_1 \in C_1\) with \(\text{cost}( T^\ast _1,c_1) \le (\alpha +\varepsilon)\, \text{cost}( T^\ast _1,c^\ast _1)\). We distinguish two cases.

Case 1.1. There exists a recursive call with parameters \((T^\prime ,\lbrace c_1\rbrace ,1,\beta ,\delta ,\varepsilon)\) and \(\vert T^\ast _2 \cap T^\prime \vert \ge \frac{1}{\beta } \vert T^\prime \vert\).

First, we assume that \(T^\prime\) is the maximum cardinality input with \(\vert T^\ast _2 \cap T^\prime \vert \ge \frac{1}{\beta } \vert T^\prime \vert\), occurring in a recursive call of the algorithm. Let \(C_2\) be the set of candidates returned by Median-Candidates in this call. With probability at least \(1-\delta /k\), there is a \(c_2 \in C_2\) with \(\text{cost}( T^\ast _2 \cap T^\prime ,c_2) \le (\alpha +\varepsilon)\, \text{cost}( T^\ast _2 \cap T^\prime ,\widetilde{c}_2)\), where \(\widetilde{c}_2\) is an optimal median for \(T^\ast _2 \cap T^\prime\).

Let \(P\) be the set of elements of \(T\) removed in the \(m \in \mathbb {N}\), \(m \le \log (n)\), pruning phases between obtaining \(c_1\) and \(c_2\). Without loss of generality, we assume that \(P \ne \emptyset\). For \(i \in [m]\), let \(P_i\) be the elements removed in the \(i\)th (in the order of the recursive calls occurring) pruning phase. Note that the \(P_i\) are pairwise disjoint, and we have that \(P = \cup _{i=1}^m P_i\) and \(\vert P_i \vert = \frac{n}{2^{i}}\). Since \(T = T^\ast _1 \uplus (T^\ast _2 \cap T^\prime) \uplus (T^\ast _2 \cap P)\), we have \(\begin{align*} \text{cost}( T,\lbrace c_1, c_2\rbrace ) \le \text{cost}( T^\ast _1,c_1) + \text{cost}( T^\ast _2 \cap T^\prime ,c_2) + \text{cost}( T^\ast _2 \cap P,c_1) . \qquad \qquad \qquad \qquad {\rm (I)} \end{align*}\) Our aim is now to prove that the number of elements wrongly assigned to \(c_1\) (i.e., \(T^\ast _2 \cap P\)) is small and, further, that their cost is a fraction of the cost of the elements correctly assigned to \(c_1\) (i.e., \(T^\ast _1\)).

We define \(R_0 = T,\) and for \(i \in [m],\) we define \(R_i = R_{i-1} \setminus P_i\). The \(R_i\) are the elements remaining after the \(i\)th pruning phase. Note that by definition, \(\vert R_i \vert = \frac{n}{2^i} = \vert P_i \vert\). Since \(R_{m} = T^\prime\) is the maximum cardinality input with \(\vert T^\ast _2 \cap T^\prime \vert \ge \frac{1}{\beta } \vert T^\prime \vert\), we have that \(\vert T^\ast _2 \cap R_i \vert \lt \frac{1}{\beta } \vert R_i \vert\) for all \(i \in \lbrace 0\rbrace \cup [m-1]\). In addition, for each \(i \in [m],\) we have \(P_i \subseteq R_{i-1}\), and therefore \(\begin{align*} \vert T^\ast _2 \cap P_i \vert \le \vert T^\ast _2 \cap R_{i-1} \vert \lt \frac{1}{\beta } \vert R_{i-1} \vert = \frac{2}{\beta } \frac{n}{2^i} \qquad \qquad \qquad \qquad {\rm (II)} \end{align*}\) and as an immediate consequence \(\begin{align*} \vert T^\ast _1 \cap P_i \vert = \vert P_i \vert - \vert T^\ast _2 \cap P_i \vert \gt \vert P_i \vert - \frac{1}{\beta } \vert R_{i-1} \vert = \left(1-\frac{2}{\beta }\right) \frac{n}{2^i}. \qquad \qquad \qquad \qquad {\rm (III)} \end{align*}\) This tells us that mainly the elements of \(T^\ast _1\) are removed in the pruning phase and only very few elements of \(T^\ast _2\). By definition, we have for all \(i \in [m-1]\), \(\sigma \in P_i\) and \(\tau \in P_{i+1}\) that \(\rho (\sigma , c_1) \le \rho (\tau , c_1)\), hence \(\begin{equation*} \frac{1}{\vert T^\ast _2 \cap P_i \vert } \text{cost}( T^\ast _2 \cap P_i,c_1) \le \frac{1}{\vert T^\ast _1 \cap P_{i+1} \vert } \text{cost}( T^\ast _1 \cap P_{i+1},c_1) . \end{equation*}\) Combining this inequality with Equations (II) and (III), we obtain for \(i \in [m-1]\): \(\begin{align*} & \frac{\beta 2^i}{2n} \text{cost}( T^\ast _2 \cap P_i,c_1) \lt \frac{2^{i+1}}{(1-2/\beta)n} \text{cost}( T^\ast _1 \cap P_{i+1},c_1) \\ \iff & \text{cost}( T^\ast _2 \cap P_i,c_1) \lt \frac{2^{i+1} 2n}{(1-2/\beta)n\beta 2^i} \text{cost}( T^\ast _1 \cap P_{i+1},c_1) = \frac{4}{(\beta -2)} \text{cost}( T^\ast _1 \cap P_{i+1},c_1) . \qquad \qquad \qquad \qquad {\rm (IV)} \end{align*}\) We still need such a bound for \(i=m\). Since \(\vert R_m \vert = \vert P_m \vert\) and also \(R_m \subseteq R_{m-1}\), we can use Equation (II) to obtain \(\begin{align*} \vert T^\ast _1 \cap R_m \vert = \vert R_m \vert - \vert T^\ast _2 \cap R_m \vert \ge \vert R_m \vert - \vert T^\ast _2 \cap R_{m-1} \vert \gt \left(1-\frac{2}{\beta }\right)\frac{n}{2^m}. \qquad \qquad \qquad \qquad {\rm (V)} \end{align*}\) In addition, we have for all \(\sigma \in P_m\) and \(\tau \in R_m\) that \(\rho (\sigma , c_1) \le \rho (\tau , c_1)\) by definition, thus \(\begin{equation*} \frac{1}{\vert T^\ast _2 \cap P_m \vert } \text{cost}( T^\ast _2 \cap P_m,c_1) \le \frac{1}{\vert T^\ast _1 \cap R_{m} \vert } \text{cost}( T^\ast _1 \cap R_m,c_1) . \end{equation*}\) We combine this inequality with Equations (II) and (V) and obtain \(\begin{align*} & \frac{\beta 2^m}{2n} \text{cost}( T^\ast _2 \cap P_m,c_1) \lt \frac{2^m}{(1-2/\beta)n} \text{cost}( T^\ast _1 \cap R_m,c_1) \\ \iff & \text{cost}( T^\ast _2 \cap P_m,c_1) \lt \frac{2^m 2n}{(1-2/\beta)n\beta 2^m} \text{cost}( T^\ast _1 \cap R_m,c_1) = \frac{2}{(\beta -2)} \text{cost}( T^\ast _1 \cap R_m,c_1) .\qquad \qquad \qquad \qquad {\rm (VI)} \end{align*}\) We are now ready to bound the cost of the elements of \(T^\ast _2\) wrongly assigned to \(c_1\).
Combining Equations (IV) and (VI) yields \(\begin{align*} \text{cost}( T^\ast _2 \cap P,c_1) & ={} \sum _{i=1}^m \text{cost}( T^\ast _2 \cap P_i,c_1) \lt \frac{4}{\beta -2} \sum _{i=1}^{m-1} \text{cost}( T^\ast _1 \cap P_{i+1},c_1) + \frac{2}{\beta -2} \text{cost}( T^\ast _1 \cap R_m,c_1) \\ & \lt \frac{4}{\beta -2} \text{cost}( T^\ast _1,c_1) . \end{align*}\) Here, the last inequality holds because \(P_2, \ldots , P_m\) and \(R_m\) are pairwise disjoint. In addition, we have \(\begin{align*} \text{cost}( T^\ast _2 \cap T^\prime ,c_2) \le (\alpha +\varepsilon) \text{cost}( T^\ast _2 \cap T^\prime ,\widetilde{c}_2) \le (\alpha +\varepsilon) \text{cost}( T^\ast _2 \cap T^\prime ,c^\ast _2) \le (\alpha +\varepsilon) \text{cost}( T^\ast _2,c^\ast _2) . \end{align*}\) Finally, using Equation (I) and a union bound, with probability at least \(1-\delta ,\) the following holds: \(\begin{align*} \text{cost}( T,\lbrace c_1,c_2\rbrace ) & \lt (\alpha +\varepsilon) \text{cost}( T^\ast _1,c^\ast _1) + (\alpha +\varepsilon) \text{cost}( T^\ast _2,c^\ast _2) + \frac{4}{\beta -2} (\alpha +\varepsilon) \text{cost}( T^\ast _1,c^\ast _1) \\ & \lt \left(1+\frac{4}{\beta -2}\right) (\alpha + \varepsilon) \text{cost}( T,C^\ast ) = \left(1+\frac{4k}{k\beta -2k}\right) (\alpha + \varepsilon) \text{cost}( T,C^\ast ) \\ & \le {} \left(1+\frac{4k^2}{\beta -2k}\right) (\alpha + \varepsilon) \text{cost}( T,C^\ast ) . \end{align*}\)

Case 1.2. For all recursive calls with parameters \((T^\prime ,\lbrace c_1\rbrace ,1,\beta ,\delta ,\varepsilon),\) it holds that \(\vert T^\ast _2 \cap T^\prime \vert \lt \frac{1}{\beta } \vert T^\prime \vert\).

After \(\log (n)\) pruning phases, we end up with a singleton \(\lbrace \sigma \rbrace = T^\prime\) as the input set. Since \(\vert T^\ast _2 \cap T^\prime \vert \lt \frac{1}{\beta } \vert T^\prime \vert\), it must be that \(0 = \vert T^\ast _2 \cap T^\prime \vert \lt \frac{1}{\beta } \vert T^\prime \vert = \frac{1}{\beta } \lt 1\) and thus \(\sigma \in T^\ast _1\).

Let \(C_2\) be the set of candidates returned by Median-Candidates in this call. With probability at least \(1-\delta /k,\) there is a \(c_2 \in C_2\) with \(\text{cost}( \lbrace \sigma \rbrace ,c_2) \le (\alpha + \varepsilon) \text{cost}( \lbrace \sigma \rbrace ,\widetilde{c}_2) \le (\alpha + \varepsilon) \text{cost}( \lbrace \sigma \rbrace ,c^\ast _1)\), where \(\widetilde{c}_2\) is an optimal median for \(\lbrace \sigma \rbrace\). Since \(\text{cost}( T^\ast _2 \cap P,c_1)\) is bounded as in Case 1.1, by a union bound we have with probability at least \(1- \delta\): \(\begin{align*} \text{cost}( T,\lbrace c_1, c_2\rbrace ) & \le {} \text{cost}( T^\ast _1 \setminus \lbrace \sigma \rbrace ,c_1) + \text{cost}( T^\ast _2 \cap P,c_1) + \text{cost}( \lbrace \sigma \rbrace ,c_2) \\ & \le (\alpha + \varepsilon) \text{cost}( T^\ast _1,c^\ast _1) + \text{cost}( T^\ast _2 \cap P,c_1) \\ & \le \left(1+\frac{4}{\beta -2}\right) (\alpha + \varepsilon) \text{cost}( T,C^\ast ) \\ & \le \left(1+\frac{4k^2}{\beta -2k}\right) (\alpha + \varepsilon) \text{cost}( T,C^\ast ) . \end{align*}\)

Case 2. \(k \gt 2.\)

We only prove the generalization of Case 1.1 to \(k \gt 2\); the remainder of the proof is analogous to Case 1. Let \(C^\ast = \lbrace c^\ast _1, \ldots , c^\ast _k \rbrace\) be an optimal set of \(k\) medians for \(T\) with clusters \(T^\ast _1, \ldots , T^\ast _k\), respectively, that form a partition of \(T\). For the sake of simplicity, assume that \(n\) is a power of 2 and w.l.o.g. assume that \(\vert T^\ast _1 \vert \ge \dots \ge \vert T^\ast _k \vert\). For \(i \in [k]\) and \(j \in \lbrace i, \ldots , k\rbrace ,\) we define \(T^\ast _{i,j} = \uplus _{t=i}^j T^\ast _t\).

Let \(\mathcal {T}_0 = T,\) and let \((\mathcal {T}_j = \mathcal {T}_{j-1} \setminus \mathcal {P}_j)_{j=1}^m\) be the sequence of input sets in the recursive calls of the \(m \in \mathbb {N}\), \(m \le \log (n)\), pruning phases, where \(\mathcal {P}_j\) is the set of elements removed in the \(j\)th pruning phase (in the order in which the recursive calls occur). Let \(\mathcal {T} = \lbrace \mathcal {T}_0\rbrace \cup \lbrace \mathcal {T}_j \mid j \in [m] \rbrace\). For \(i \in [k]\), let \(T_i\) be the maximum cardinality set in \(\mathcal {T}\) with \(\vert T^\ast _i \cap T_i \vert \ge \frac{1}{\beta } \vert T_i \vert\). Note that by assumption and since \(\beta \gt 2k\), \(T_1 = T\) must hold and also \(T_j \subset T_i\) for \(j \in [k] \setminus [i]\).

Using a union bound, with probability at least \(1-\delta\), for each \(i \in [k]\) the call of Median-Candidates with input \(T_i\) yields a candidate \(c_i\) with \(\begin{align*} \text{cost}( T^\ast _i \cap T_i,c_i) \le (\alpha +\varepsilon) \text{cost}( T^\ast _i \cap T_i,\widetilde{c}_i) \le (\alpha +\varepsilon) \text{cost}( T^\ast _i \cap T_i,c^\ast _i) \le (\alpha +\varepsilon) \text{cost}( T^\ast _i,c^\ast _i) , \qquad \qquad \qquad \qquad {\rm (I)} \end{align*}\) where \(\widetilde{c}_i\) is an optimal 1-median for \(T^\ast _i \cap T_i\). Let \(C = \lbrace c_1, \ldots , c_k\rbrace\) be the set of these candidates, and for \(i \in [k-1]\), let \(P_i = T_{i} \setminus T_{i+1}\) denote the set of elements of \(T\) removed by the pruning phases between obtaining \(c_{i}\) and \(c_{i+1}\). Note that the \(P_i\) are pairwise disjoint.
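To spell out the union bound: as in Case 1, each individual call of Median-Candidates fails to produce such a candidate with probability at most \(\delta /k\), so the probability that at least one of the \(k\) calls fails is at most \(\sum _{i=1}^{k} \frac{\delta }{k} = \delta\), and all \(k\) guarantees of Equation (I) hold simultaneously with probability at least \(1-\delta\).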

By definition, the sets \(\begin{equation*} T^\ast _1 \cap T_1, \ldots , T^\ast _k \cap T_k, T^\ast _{2,k} \cap P_1, \ldots , T^\ast _{k,k} \cap P_{k-1} \end{equation*}\) form a partition of \(T\), and therefore \(\begin{align*} \text{cost}( T,\lbrace c_1, \ldots , c_k \rbrace) & \le {} \sum _{i=1}^k {\rm cost}\left({T^\ast _i \cap T_i},{c_i}\right) + \sum _{i=1}^{k-1} \text{cost}( T^\ast _{i+1,k} \cap P_{i},\lbrace c_1, \ldots , c_{i} \rbrace) \\ & \le {} (\alpha + \varepsilon) \sum _{i=1}^k {\rm cost}\left({T^\ast _i},{c^\ast _i}\right) + \sum _{i=1}^{k-1} \text{cost}( T^\ast _{i+1,k} \cap P_{i},\lbrace c_1, \ldots , c_{i} \rbrace) . \qquad \qquad \qquad \qquad {\rm (II)} \end{align*}\) Now, it only remains to bound the cost of the wrongly assigned elements of \(T^\ast _{i+1,k}\). For \(i \in [k]\), let \(n_i = \vert T_i \vert ,\) and w.l.o.g. assume that \(P_i \ne \emptyset\) for each \(i \in [k-1]\). Each \(P_i\) is the disjoint union \(\uplus _{j=1}^{m_i} P_{i,j}\) of \(m_i \in \mathbb {N}\) sets of elements of \(T\) removed in the interim pruning phases, and it holds that \(\vert P_{i,j} \vert = \frac{n_{i}}{2^j}\). We now prove for each \(i \in [k-1]\) and \(j \in [m_i]\) that \(P_{i,j}\) contains many elements from \(T^\ast _{1,i}\) and only a few elements from \(T^\ast _{i+1,k}\).

For \(i \in [k-1]\), we define \(R_{i,0} = T_i\), and for \(j \in [m_i],\) we define \(R_{i,j} = R_{i,j-1} \setminus P_{i,j}\). By definition, \(\vert R_{i,j} \vert = \frac{n_i}{2^j} = \vert P_{i,j} \vert\), \(R_{i,j_1} \supset R_{i,j_2}\) for each \(j_1 \in [m_i]\) and \(j_2 \in [m_i] \setminus [j_1]\), also \(R_{i,m_i} = T_{i+1}\). Thus, \(\vert T^\ast _t \cap R_{i,j} \vert \lt \frac{1}{\beta } \vert R_{i,j} \vert\) for all \(i \in [k-1]\), \(j \in \lbrace 0, \ldots , m_i-1\rbrace ,\) and \(t \in [k] \setminus [i]\), since each such \(R_{i,j}\) is the input of a recursive call and has larger cardinality than \(T_t\). As an immediate consequence, we obtain \(\vert T^\ast _{i+1,k} \cap R_{i,j} \vert \le \frac{k}{\beta } \vert R_{i,j} \vert\). Since \(P_{i,j} \subseteq R_{i,j-1}\) for all \(i \in [k-1]\) and \(j \in [m_i]\), we have \(\begin{align*} \vert T^\ast _{i+1,k} \cap P_{i,j} \vert \le \vert T^\ast _{i+1,k} \cap R_{i,j-1} \vert \le \frac{k}{\beta } \vert R_{i,j-1} \vert = \frac{2k}{\beta } \frac{n_i}{2^j}, \qquad \qquad \qquad \qquad {\rm (III)} \end{align*}\) which immediately yields \(\begin{align*} \vert T^\ast _{1,i} \cap P_{i,j} \vert = \vert P_{i,j} \vert - \vert T^\ast _{i+1,k} \cap P_{i,j} \vert \ge \left(1-\frac{2k}{\beta }\right) \frac{n_i}{2^j}. \qquad \qquad \qquad \qquad {\rm (IV)} \end{align*}\) Now, by definition, we know for all \(i \in [k-1]\), \(j \in [m_i] \setminus \lbrace m_i\rbrace\), \(\sigma \in P_{i,j}\), and \(\tau \in P_{i,j+1}\) that \(\min \nolimits _{c \in \lbrace c_1, \ldots , c_{i}\rbrace }\rho (\sigma , c) \le \min \nolimits _{c \in \lbrace c_1, \ldots , c_{i}\rbrace }\rho (\tau , c)\). Thus, \(\begin{align*} \frac{\text{cost}( T^\ast _{i+1,k} \cap P_{i,j},\lbrace c_1, \ldots , c_i\rbrace) }{\vert T^\ast _{i+1,k} \cap P_{i,j} \vert } \le \frac{\text{cost}( T^\ast _{1,i} \cap P_{i,j+1},\lbrace c_1, \ldots , c_i\rbrace) }{\vert T^\ast _{1,i} \cap P_{i,j+1} \vert }. \end{align*}\) Combining this inequality with Equations (III) and (IV) yields for \(i \in [k-1]\) and \(j \in [m_i] \setminus \lbrace m_i\rbrace\): \(\begin{align*} & \frac{\beta 2^j}{2kn_i} \text{cost}( T^\ast _{i+1,k} \cap P_{i,j},\lbrace c_1, \ldots , c_i\rbrace) \le \frac{2^{j+1}}{(1-\frac{2k}{\beta })n_i} \text{cost}( T^\ast _{1,i} \cap P_{i,j+1},\lbrace c_1, \ldots , c_i\rbrace) \\ \iff & \text{cost}( T^\ast _{i+1,k} \cap P_{i,j},\lbrace c_1, \ldots , c_i\rbrace) \le \frac{4k}{\beta -2k} \text{cost}( T^\ast _{1,i} \cap P_{i,j+1},\lbrace c_1, \ldots , c_i\rbrace) . \qquad \qquad \qquad \qquad {\rm (V)} \end{align*}\) For each \(i \in [k-1],\) we still need an upper bound on \({\rm cost}({T^\ast _{i+1,k} \cap P_{i,m_i}},{\lbrace c_1, \ldots , c_i\rbrace })\). Since \(\vert R_{i,m_i} \vert = \vert P_{i,m_i} \vert\) and also \(R_{i,m_i} \subseteq R_{i,m_i-1}\), we can use Equation (III) to obtain \(\begin{align*} \vert T^\ast _{1,i} \cap R_{i,m_i} \vert = \vert R_{i,m_i} \vert - \vert T^\ast _{i+1,k} \cap R_{i,m_i} \vert \ge \vert R_{i,m_i} \vert - \vert T^\ast _{i+1,k} \cap R_{i,m_i-1} \vert \gt \left(1-\frac{2k}{\beta }\right)\frac{n_i}{2^{m_i}}. \qquad \qquad \qquad \qquad {\rm (VI)} \end{align*}\) By definition, we also know for all \(i \in [k-1]\), \(\sigma \in P_{i,m_i}\) and \(\tau \in R_{i,m_i}\) that \(\min \limits _{c \in \lbrace c_1, \ldots , c_{i}\rbrace }\rho (\sigma , c) \le \min \limits _{c \in \lbrace c_1, \ldots , c_{i}\rbrace }\rho (\tau , c)\). Thus, \(\begin{equation*} \frac{\text{cost}( T^\ast _{i+1,k} \cap P_{i,m_i},\lbrace c_1, \ldots , c_i \rbrace) }{\vert T^\ast _{i+1,k} \cap P_{i,m_i} \vert } \le \frac{\text{cost}( T^\ast _{1,i} \cap R_{i,m_i},\lbrace c_1, \ldots , c_i \rbrace) }{\vert T^\ast _{1,i} \cap R_{i,m_i} \vert }. \end{equation*}\) Combining this inequality with Equations (III) and (VI) yields \(\begin{align*} & \frac{\beta 2^{m_i}}{2kn_i} \text{cost}( T^\ast _{i+1,k} \cap P_{i,m_i},\lbrace c_1, \ldots , c_i \rbrace) \lt \frac{2^{m_i}}{(1-\frac{2k}{\beta })n_i} \text{cost}( T^\ast _{1,i} \cap R_{i,m_i},\lbrace c_1, \ldots , c_i \rbrace) \\ \iff & \text{cost}( T^\ast _{i+1,k} \cap P_{i,m_i},\lbrace c_1, \ldots , c_i \rbrace) \lt \frac{2k}{\beta -2k} \text{cost}( T^\ast _{1,i} \cap R_{i,m_i},\lbrace c_1, \ldots , c_i \rbrace) . \qquad \qquad \qquad \qquad {\rm (VII)} \end{align*}\) We can now give the following bound, combining Equations (V) and (VII), for each \(i \in [k-1]\): \(\begin{align*} \text{cost}( T^\ast _{i+1,k} \cap P_i,\lbrace c_1, \ldots , c_i\rbrace) & ={} \sum _{j=1}^{m_i} \text{cost}( T^\ast _{i+1,k} \cap P_{i,j},\lbrace c_1, \ldots , c_i\rbrace) \\ &\quad \lt \sum _{j=1}^{m_i-1} \frac{4k}{\beta -2k} \text{cost}( T^\ast _{1,i} \cap P_{i,j+1},\lbrace c_1, \ldots , c_i\rbrace) \\ &\quad + \frac{2k}{\beta -2k} \text{cost}( T^\ast _{1,i} \cap R_{i,m_i},\lbrace c_1, \ldots , c_i \rbrace) \\ &\quad \lt \frac{4k}{\beta -2k} \text{cost}( T^\ast _{1,i} \cap T_i,\lbrace c_1, \ldots , c_i \rbrace) . \qquad \qquad \qquad \qquad {\rm (VIII)} \end{align*}\) Here, the last inequality holds because \(P_{i,2}, \ldots , P_{i,m_i}\) and \(R_{i,m_i}\) are pairwise disjoint subsets of \(T_i\).

Now, we plug this bound into Equation (II). Note that \(T^\ast _j \cap T_i \subseteq T^\ast _j \cap T_j\) for each \(i \in [k]\) and \(j \in [i]\) by definition. We obtain \(\begin{align*} \text{cost}( T,\lbrace c_1, \ldots , c_k \rbrace) & \le {} (\alpha + \varepsilon) \sum _{i=1}^k \text{cost}( T^\ast _i,c^\ast _i) + \sum _{i=1}^{k-1} \text{cost}( T^\ast _{i+1,k} \cap P_{i},\lbrace c_1, \ldots , c_{i} \rbrace) \\ &\quad \lt (\alpha + \varepsilon) \sum _{i=1}^k \text{cost}( T^\ast _i,c^\ast _i) + \frac{4k}{\beta -2k} \sum _{i=1}^{k-1} \text{cost}( T^\ast _{1,i} \cap T_i,\lbrace c_1, \ldots , c_i \rbrace) \\ & \le {} (\alpha + \varepsilon) \sum _{i=1}^k \text{cost}( T^\ast _i,c^\ast _i) + \frac{4k}{\beta -2k} \sum _{i=1}^{k-1} \sum _{t=1}^i \text{cost}( T^\ast _{t} \cap T_i,c_t) \\ & \le {} (\alpha + \varepsilon) \sum _{i=1}^k \text{cost}( T^\ast _i,c^\ast _i) + \frac{4k}{\beta -2k} \sum _{i=1}^{k-1} \sum _{t=1}^i \text{cost}( T^\ast _{t} \cap T_t,c_t) \\ & \le (\alpha + \varepsilon) \sum _{i=1}^k \text{cost}( T^\ast _i,c^\ast _i) + \frac{4k^2}{\beta -2k} \sum _{i=1}^{k-1} \text{cost}( T^\ast _{i} \cap T_i,c_i) \\ & \le {} \left(1+\frac{4k^2}{\beta -2k}\right) (\alpha + \varepsilon) \sum _{i=1}^k \text{cost}( T^\ast _i,c^\ast _i) = \left(1+\frac{4k^2}{\beta -2k}\right) (\alpha + \varepsilon) \text{cost}( T,C^\ast ) . \end{align*}\) The last inequality follows from Equation (I).□
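The role of \(\beta\) can be read off directly from this bound; the following choice is one way to phrase it and is not taken verbatim from the text above: to turn the factor into \((1+\varepsilon ^\prime)(\alpha +\varepsilon)\) for a given \(\varepsilon ^\prime \gt 0\), it suffices to require \(\begin{align*} 1+\frac{4k^2}{\beta -2k} \le 1+\varepsilon ^\prime \iff \beta \ge 2k + \frac{4k^2}{\varepsilon ^\prime }. \end{align*}\)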

The following analysis of the worst-case running time of Algorithm 4 is a slight adaptation of the work of Ackermann et al. [2, Theorem 2.8] and is provided here for the sake of completeness.

Proof of Theorem 7.3

Let \(T(n, \kappa , \beta , \delta , \varepsilon)\) denote the worst-case running time of Algorithm 5 for input set \(T\) with \(\vert T \vert = n\). For the sake of simplicity, we assume that \(n\) is a power of 2. Note that we always have \(\kappa \le n\).

If \(\kappa = 0\), Algorithm 5 has running time \(c_1 \in O(1)\). If \(n \ge \kappa \ge 1\), Algorithm 5 has running time at most \(c_2 \cdot (n \cdot T_\rho + n) \in O(n \cdot T_\rho)\) to obtain \(P\); \(T(n/2, \kappa , \beta , \delta , \varepsilon)\) for the recursive call in the pruning phase; \(T_1(n, \beta , \delta , \varepsilon)\) to obtain the candidates; \(C(n, \beta , \delta , \varepsilon) \cdot T(n, \kappa - 1, \beta , \delta , \varepsilon)\) for the recursive calls in the candidate phase, one for each candidate; and \(c_3 \cdot n \cdot T_\rho \cdot C(n, \beta , \delta , \varepsilon) \in O(n \cdot T_\rho \cdot C(n, \beta , \delta , \varepsilon))\) to eventually evaluate the candidate sets. Let \(c = c_1 + c_2 + c_3 + 1\). We obtain the following recurrence relation: \(\begin{align*} T(n, \kappa , \beta , \delta , \varepsilon) \le {\left\lbrace \begin{array}{ll} c & \text{if } \kappa = 0 \\ C(n, \beta , \delta , \varepsilon) \cdot T(n, \kappa -1, \beta , \delta , \varepsilon) + T(n/2, \kappa , \beta , \delta , \varepsilon) \\ + T_1(n, \beta , \delta , \varepsilon) + c n \cdot T_\rho \cdot C(n, \beta , \delta , \varepsilon) & \text{else} \end{array}\right.}. \end{align*}\)
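To make the structure behind this recurrence concrete, the following is a minimal structural sketch in Python of such a recursion. It is not the authors' Algorithm 5; the helpers median_candidates (standing in for Median-Candidates and returning \(C(n, \beta , \delta , \varepsilon)\) candidates) and dist (the distance \(\rho\), costing \(T_\rho\) per evaluation) are assumed placeholders, as is the function name cluster.

def cluster(T, partial_centers, kappa, beta, delta, eps, median_candidates, dist):
    # Structural sketch only (not the authors' Algorithm 5). It assumes that
    # median_candidates always returns at least one candidate.
    if kappa == 0 or len(T) == 0:
        return list(partial_centers)
    solutions = []
    # Candidate phase: one recursive call with kappa - 1 per candidate, i.e.,
    # C(n, beta, delta, eps) recursive calls in total.
    for cand in median_candidates(T, beta, delta, eps):
        solutions.append(cluster(T, partial_centers + [cand], kappa - 1,
                                 beta, delta, eps, median_candidates, dist))
    # Pruning phase: discard the half of T closest to the centers found so far
    # and recurse once on the remaining half with the same kappa.
    if partial_centers:
        by_dist = sorted(T, key=lambda s: min(dist(s, c) for c in partial_centers))
        solutions.append(cluster(by_dist[len(T) // 2:], partial_centers, kappa,
                                 beta, delta, eps, median_candidates, dist))
    # Evaluation: keep the cheapest of the collected center sets, which takes on
    # the order of n * T_rho * C(n, beta, delta, eps) distance evaluations.
    def total_cost(centers):
        return sum(min(dist(s, c) for c in centers) for s in T)
    return min(solutions, key=total_cost)

In this accounting, the loop over candidates corresponds to the term \(C(n, \beta , \delta , \varepsilon) \cdot T(n, \kappa -1, \beta , \delta , \varepsilon)\), the single pruning call to \(T(n/2, \kappa , \beta , \delta , \varepsilon)\), the call generating the candidates to \(T_1(n, \beta , \delta , \varepsilon)\), and the final evaluation to \(O(n \cdot T_\rho \cdot C(n, \beta , \delta , \varepsilon))\).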

Let \(f(n, \beta , \delta , \varepsilon) = \frac{1}{cn} \cdot T_1(n, \beta , \delta , \varepsilon) + T_\rho \cdot C(n, \beta , \delta , \varepsilon)\).

We prove that \(T(n, \kappa , \beta , \delta , \varepsilon) \le c \cdot 4^{\kappa } \cdot C(n,\beta ,\delta ,\varepsilon)^{\kappa +1} \cdot n \cdot f(n, \beta , \delta , \varepsilon)\) by induction on \(n\) and \(\kappa\).

For \(\kappa = 0,\) we have \(T(n,\kappa , \beta , \delta , \varepsilon) \le c \le cn \le c \cdot 4^0 \cdot C(n,\beta ,\delta ,\varepsilon) \cdot n \cdot f(n, \beta , \delta , \varepsilon)\).

Now, let \(n \ge \kappa \ge 1,\) and assume the claim holds for \(T(n^\prime ,\kappa ^\prime , \beta , \delta , \varepsilon)\) for every pair with \(n^\prime \le n\) and \(\kappa ^\prime \le \kappa\), except for \((n^\prime , \kappa ^\prime) = (n, \kappa)\) itself; below we use it for the pairs \((n, \kappa -1)\) and \((n/2, \kappa)\). We have \(\begin{align*} T(n, \kappa , \beta , \delta , \varepsilon) & \le {} C(n, \beta , \delta , \varepsilon) \cdot T(n, \kappa -1, \beta , \delta , \varepsilon) + T(n/2, \kappa , \beta , \delta , \varepsilon) \\ & \ \ \ + T_1(n, \beta , \delta , \varepsilon) + cn \cdot T_\rho \cdot C(n, \beta , \delta , \varepsilon) \\ & \le {} C(n, \beta , \delta , \varepsilon) \cdot c \cdot 4^{\kappa -1} \cdot C(n,\beta ,\delta ,\varepsilon)^{\kappa } \cdot n \cdot f(n, \beta , \delta , \varepsilon) \\ & \ \ \ + c \cdot 4^{\kappa } \cdot C(n/2,\beta ,\delta ,\varepsilon)^{\kappa +1} \cdot \frac{n}{2} \cdot f(n/2, \beta , \delta , \varepsilon) \\ & \ \ \ + cn \cdot f(n, \beta , \delta , \varepsilon) \\ & \le {} \left(\frac{1}{4} + \frac{1}{2} + \frac{1}{4^\kappa C(n,\beta ,\delta ,\varepsilon)^{\kappa +1}}\right) c \cdot 4^{\kappa } \cdot C(n,\beta ,\delta ,\varepsilon)^{\kappa +1} \cdot n \cdot f(n, \beta , \delta , \varepsilon) \\ & \le {} c \cdot 4^{\kappa } \cdot C(n,\beta ,\delta ,\varepsilon)^{\kappa +1} \cdot n \cdot f(n, \beta , \delta , \varepsilon). \end{align*}\)

The last inequality holds because \(\frac{1}{4^\kappa C(n,\beta ,\delta ,\varepsilon)^{\kappa +1}} \le \frac{1}{4}\), which is the case since \(\kappa \ge 1\) and there is at least one candidate, that is, \(C(n,\beta ,\delta ,\varepsilon) \ge 1\); the claim follows by induction.□
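Although not spelled out in the proof above, substituting the definition of \(f\) makes the resulting bound explicit; this is only an unrolling of the bound just proven: \(\begin{align*} c \cdot 4^{\kappa } \cdot C(n,\beta ,\delta ,\varepsilon)^{\kappa +1} \cdot n \cdot f(n, \beta , \delta , \varepsilon) = 4^{\kappa } \cdot C(n,\beta ,\delta ,\varepsilon)^{\kappa +1} \cdot T_1(n, \beta , \delta , \varepsilon) + c \cdot 4^{\kappa } \cdot C(n,\beta ,\delta ,\varepsilon)^{\kappa +2} \cdot n \cdot T_\rho . \end{align*}\) In particular, if the initial call uses \(\kappa = k\), the worst-case running time is in \(O\left(4^{k} \cdot C(n,\beta ,\delta ,\varepsilon)^{k+1} \cdot \left(T_1(n, \beta , \delta , \varepsilon) + n \cdot T_\rho \cdot C(n,\beta ,\delta ,\varepsilon)\right)\right)\).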


8 CONCLUSION

We have developed bicriteria approximation algorithms for \((k,\ell)\)-median clustering of polygonal curves under the Fréchet distance. Although it turned out to be relatively easy to obtain, in reasonable time, a good approximation whose centers have up to \(2\ell\) vertices, a way to obtain good approximate centers with at most \(\ell\) vertices in reasonable time is not in sight. This is due to the continuous Fréchet distance: the vertices of a median need not be anywhere near a vertex of an input curve, resulting in a huge search space. If we cover the whole search space by, say, grids, the worst-case running time of the resulting algorithms becomes dependent on the arc lengths of the input curves' edges, which is not acceptable. We note that \(g\)-coverability of the continuous Fréchet distance would imply the existence of sublinear-size \(\varepsilon\)-coresets for \((k,\ell)\)-center clustering of polygonal curves under the Fréchet distance. It is an interesting open question whether \(g\)-coverability holds for the continuous Fréchet distance. In contrast to the doubling dimension, which was shown to be infinite even for curves of bounded complexity [17], the VC dimension of metric balls under the continuous Fréchet distance is bounded in terms of the complexities \(\ell\) and \(m\) of the curves [18]. Whether this bound can be combined with the framework by Feldman and Langberg [19] to achieve faster approximations for the \((k,\ell)\)-median problem under the continuous Fréchet distance is an interesting open problem (cf. [13]). The general relationship between the VC dimension of range spaces derived from metric spaces and their doubling properties is a topic of ongoing research (e.g., see Huang et al. [23]).

REFERENCES

[1] Abraham C., Cornillon P. A., Matzner-Løber E., and Molinari N. 2003. Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics 30, 3 (2003), 581–595.
[2] Ackermann Marcel R., Blömer Johannes, and Sohler Christian. 2010. Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms 6, 4 (2010), Article 59, 26 pages.
[3] Agarwal Pankaj K., Har-Peled Sariel, Mustafa Nabil H., and Wang Yusu. 2002. Near-linear time approximation algorithms for curve simplification. In Algorithms—, Möhring Rolf and Raman Rajeev (Eds.). Springer, 29–41.
[4] Alt Helmut and Godau Michael. 1995. Computing the Fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications 5 (1995), 75–91.
[5] Banerjee Arindam, Merugu Srujana, Dhillon Inderjit S., and Ghosh Joydeep. 2005. Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005), 1705–1749.
[6] Bansal Nikhil, Blum Avrim, and Chawla Shuchi. 2004. Correlation clustering. Machine Learning 56, 1–3 (2004), 89–113.
[7] Ben-Hur Asa, Horn David, Siegelmann Hava T., and Vapnik Vladimir. 2001. Support vector clustering. Journal of Machine Learning Research 2 (2001), 125–137.
[8] Brankovic Milutin, Buchin Kevin, Klaren Koen, Nusser André, Popov Aleksandr, and Wong Sampson. 2020. (k, l)-medians clustering of trajectories using continuous dynamic time warping. In SIGSPATIAL'20: 28th International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, November 3–6, 2020, Lu Chang-Tien, Wang Fusheng, Trajcevski Goce, Huang Yan, Newsam Shawn D., and Xiong Li (Eds.). ACM, New York, NY, 99–110.
[9] Buchin Kevin, Buchin Maike, and Wenk Carola. 2008. Computing the Fréchet distance between simple polygons. Computational Geometry 41, 1–2 (2008), 2–20.
[10] Buchin Kevin, Driemel Anne, Gudmundsson Joachim, Horton Michael, Kostitsyna Irina, Löffler Maarten, and Struijs Martijn. 2019. Approximating (k, l)-center clustering for curves. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms. 2922–2938.
[11] Buchin Kevin, Driemel Anne, and Struijs Martijn. 2020. On the hardness of computing an average curve. In Proceedings of the 17th Scandinavian Symposium and Workshops on Algorithm Theory. Article 19, 19 pages.
[12] Buchin Kevin, Driemel Anne, L'Isle Natasja van de, and Nusser André. 2019. klcluster: Center-based clustering of trajectories. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 496–499.
[13] Buchin Maike and Rohde Dennis. 2022. Coresets for \((k, \ell)\)-median clustering under the Fréchet distance. In Algorithms and Discrete Applied Mathematics. Lecture Notes in Computer Science, Vol. 13179. Springer, 167–180.
[14] Chiou Jeng-Min and Li Pai-Ling. 2007. Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69, 4 (2007), 679–699.
[15] Cilibrasi Rudi and Vitányi Paul M. B. 2005. Clustering by compression. IEEE Transactions on Information Theory 51, 4 (2005), 1523–1545.
[16] Driemel Anne and Har-Peled Sariel. 2013. Jaywalking your dog: Computing the Fréchet distance with shortcuts. SIAM Journal on Computing 42, 5 (2013), 1830–1866.
[17] Driemel Anne, Krivosija Amer, and Sohler Christian. 2016. Clustering time series under the Fréchet distance. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms. 766–785.
[18] Driemel Anne, Phillips Jeff M., and Psarros Ioannis. 2019. The VC dimension of metric balls under Fréchet and Hausdorff distances. In Proceedings of the 35th International Symposium on Computational Geometry. Article 28, 16 pages.
[19] Feldman Dan and Langberg Michael. 2011. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing. ACM, New York, NY, 569–578.
[20] Garcia-Escudero Luis Angel and Gordaliza Alfonso. 2005. A proposal for robust curve clustering. Journal of Classification 22, 2 (2005), 185–201.
[21] Guha Sudipto and Mishra Nina. 2016. Clustering data streams. In Data Stream Management—Processing High-Speed Data Streams, Garofalakis Minos N., Gehrke Johannes, and Rastogi Rajeev (Eds.). Springer, 169–187.
[22] Har-Peled Sariel and Mazumdar Soham. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing. 291–300.
[23] Huang Lingxiao, Jiang Shaofeng H.-C., Li Jian, and Wu Xuan. 2018. Epsilon-coresets for clustering (with outliers) in doubling metrics. In Proceedings of the 59th IEEE Annual Symposium on Foundations of Computer Science. IEEE, Los Alamitos, CA, 814–825.
[24] Imai Hiroshi and Iri Masao. 1988. Polygonal approximations of a curve–formulations and algorithms. Machine Intelligence and Pattern Recognition 6 (Jan. 1988), 71–86.
[25] Indyk Piotr. 2000. High-Dimensional Computational Geometry. Ph.D. Dissertation. Stanford University, Stanford, CA.
[26] Johnson Stephen C. 1967. Hierarchical clustering schemes. Psychometrika 32, 3 (1967), 241–254.
[27] Kumar Amit, Sabharwal Yogish, and Sen Sandeep. 2004. A simple linear time (1+\(\varepsilon\))-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS'04). IEEE, Los Alamitos, CA, 454–462.
[28] Meintrup Stefan, Munteanu Alexander, and Rohde Dennis. 2019. Random projections and sampling algorithms for clustering of high-dimensional polygonal curves. In Advances in Neural Information Processing Systems 32. 12807–12817.
[29] Mitzenmacher Michael and Upfal Eli. 2017. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis (2nd ed.). Cambridge University Press, Cambridge, MA.
[30] Nath Abhinandan and Taylor Erin. 2020. k-Median clustering under discrete Fréchet and Hausdorff distances. In Proceedings of the 36th International Symposium on Computational Geometry. Article 58, 15 pages.
[31] Petitjean François and Gançarski Pierre. 2012. Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment. Theoretical Computer Science 414, 1 (2012), 76–91.
[32] Petitjean François, Ketterlin Alain, and Gançarski Pierre. 2011. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition 44, 3 (2011), 678–693.
[33] Schaeffer Satu Elisa. 2007. Graph clustering. Computer Science Review 1, 1 (2007), 27–64.
[34] Vidal René. 2011. Subspace clustering. IEEE Signal Processing Magazine 28, 2 (2011), 52–68.
