In search of maximum non-overlapping codes

Stanovnik, Lidija; Moškon, Miha; Mraz, Miha

doi:10.1007/s10623-023-01344-z


Open access
Published: 08 January 2024


Designs, Codes and Cryptography


Abstract

Non-overlapping codes are block codes that have arisen in diverse contexts of computer science and biology. Applications typically require finding non-overlapping codes with large cardinalities, but the maximum size of non-overlapping codes has been determined only for cases where the codeword length divides the size of the alphabet, and for codes with codewords of length two or three. For all other alphabet sizes and codeword lengths no computationally feasible way to identify non-overlapping codes that attain the maximum size has been found to date. Herein we characterize maximal non-overlapping codes. We formulate the maximum non-overlapping code problem as an integer optimization problem and determine necessary conditions for optimality of a non-overlapping code. Moreover, we solve several instances of the optimization problem to show that the hitherto known constructions do not generate the optimal codes for many alphabet sizes and codeword lengths. We also evaluate the number of distinct maximum non-overlapping codes.


1 Introduction

Non-overlapping codes, also known under the terms strongly regular codes [10], cross-bifix-free codes [1, 6], mutually uncorrelated codes (MU-codes) [12, 15], and strong comma-free codes [7], are block codes with the property that no prefix of a codeword occurs as a suffix of any not necessarily distinct codeword. Over the past sixty years, they have provided solutions to various problems in different fields. They construct a special class of finite automata [11], provide codes for frame synchronization [1], and even build addresses for DNA-storage [12, 15]. For most purposes, non-overlapping codes with large cardinalities are favorable and therefore researchers have developed several constructions of such codes [2,3,4,5,6].

Identifying non-overlapping codes with the largest cardinality, however, appears to be a hard problem, and little progress has been made in this direction. Chee et al. [6] searched through binary codes of lengths up to sixteen to show that known constructions of non-overlapping codes attain the maximum size, but for larger alphabet sizes and code lengths, their approach cannot be used due to its double exponential time complexity and exponential space complexity. Later, Blackburn [5] provided formulas for the size of maximum two- and three-letter non-overlapping codes, and for the sizes of maximum q-ary n-letter non-overlapping codes where n divides q. The numbers of maximal two- and three-letter non-overlapping codes and of maximum three-letter non-overlapping codes were recently computed [9].

Fimmel et al. [9] also proposed an algorithm that they believe generates all maximal non-overlapping codes. Once this claim is proven, the algorithm could in theory be used to find the maximum q-ary n-letter non-overlapping codes, but the approach is impractical due to its double exponential time and exponential space complexity. Nevertheless, by studying some equivalence relations we show that the time complexity of their algorithm can be reduced to exponential and its space complexity to polynomial in n. We further prove some necessary conditions for optimality. We employ the results to compute optimal values for the size and number of maximum non-overlapping codes that were not yet known, and demonstrate that other existing constructions fail to generate a maximum code for many alphabet sizes and codeword lengths.

This article is arranged as follows: Sect. 2 introduces definitions and briefly presents existing methods to construct non-overlapping codes. In Sect. 3 we prove that the algorithm proposed by Fimmel et al. generates non-overlapping codes only, determine the sizes of the constructed codes, show that the algorithm produces all maximal non-overlapping codes and characterize when a code is maximal. We formulate an integer optimization problem for non-overlapping codes of maximum size and determine necessary conditions for its optimality in Sect. 4. Section 5 provides exact formulas for the number of maximal four-letter non-overlapping codes, and the size and the number of maximum four-letter non-overlapping codes. In Sect. 6 we compare the largest codes produced by existing constructions and the newly computed optimal solutions for small parameter values.

2 Definitions and constructions

Throughout this paper, we let $\Sigma $ be a finite alphabet with $q \ge 2$ elements and n a positive integer.

Definition 1

Words $c_1 = x_1\cdots x_n \in \Sigma ^n$ and $c_2 = y_1\cdots y_n \in \Sigma ^n$ are non-overlapping if

$$\begin{aligned} \forall k \in \{1,\ldots ,n-1\}:\quad&x_{n+1-k}\cdots x_n \ne y_1\cdots y_k \\ \text {and } \quad&y_{n+1-k}\cdots y_n \ne x_1\cdots x_k. \end{aligned}$$

In other words, no prefix of $c_1$ occurs as a suffix of $c_2$ and vice versa.

Example 1

The word VRV is not non-overlapping because V occurs as a prefix and a suffix. The words VRT and KRT are non-overlapping because the prefixes V, VR, K and KR never occur as suffixes.
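Definition 1 can be checked mechanically. The following is a minimal sketch, assuming words are represented as Python strings of equal length; the function name `non_overlapping` is illustrative and does not come from the paper.

```python
def non_overlapping(c1: str, c2: str) -> bool:
    """Return True if no proper prefix of one word equals a proper suffix of the other."""
    n = len(c1)
    if len(c2) != n:
        raise ValueError("words must have the same length")
    for k in range(1, n):
        # Compare the length-k suffix of each word with the length-k prefix of the other.
        if c1[n - k:] == c2[:k] or c2[n - k:] == c1[:k]:
            return False
    return True


print(non_overlapping("VRV", "VRV"))  # False: V is both a prefix and a suffix
print(non_overlapping("VRT", "KRT"))  # True
```

Calling the function with the same word twice tests whether a single word overlaps with itself, which is the situation of VRV in Example 1.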

Definition 2

$X \subseteq \Sigma ^n$ is a non-overlapping code if all (not necessarily distinct) elements $c_1, c_2 \in X$ are non-overlapping.

Definition 3

A non-overlapping code X is maximal, if it cannot be expanded, meaning that every n-letter word over $\Sigma $ that is not in X either has a prefix that overlaps with a suffix in X or a suffix that overlaps with a prefix in X.

Example 2

The non-overlapping code $\{\text {VRT}, \text {KRT}\}$ is not maximal, because it could be expanded by adding the word RRT. The non-overlapping code $\{\text {VRT}, \text {VVT}, \text {RVT}, \text {RRT}\}$ is maximal over the alphabet $\{\text {V}, \text {R}, \text {T}\}$, because no other three-letter word exists that starts with V or R, ends with T, and includes neither prefix VT nor prefix RT.
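For small parameters, Definitions 2 and 3 can be tested by exhaustive search. The sketch below is only an illustration under the assumption that a code is a Python set of equal-length strings; the helper names are hypothetical, and the search over all $q^n$ words is feasible only for tiny alphabets and lengths.

```python
from itertools import product


def non_overlapping(c1, c2):
    """Definition 1 for two words of equal length."""
    n = len(c1)
    return all(c1[n - k:] != c2[:k] and c2[n - k:] != c1[:k] for k in range(1, n))


def is_code(X):
    """Definition 2: all pairs, including each word with itself, are non-overlapping."""
    return all(non_overlapping(a, b) for a in X for b in X)


def is_maximal(X, alphabet):
    """Definition 3 by exhaustive search over all q^n words (small cases only)."""
    n = len(next(iter(X)))
    if not is_code(X):
        return False
    for letters in product(alphabet, repeat=n):
        w = "".join(letters)
        if w not in X and is_code(X | {w}):
            return False  # w can be added, so X is not maximal
    return True


print(is_maximal({"VRT", "KRT"}, "VRTK"))               # False
print(is_maximal({"VRT", "VVT", "RVT", "RRT"}, "VRT"))  # True
```

The alphabet `"VRTK"` in the first call is an assumption made for the illustration, since Example 2 does not fix the alphabet for the code $\{\text{VRT}, \text{KRT}\}$.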

Definition 4

A non-overlapping code X is maximum if for all non-overlapping codes $Y \subseteq \Sigma ^n$:

$$\begin{aligned} |X |\ge |Y |. \end{aligned}$$

We denote this greatest cardinality with S(q, n) and define N(q, n) to be the number of codes that achieve it.

Remark 1

Every maximum non-overlapping code is maximal.

Non-overlapping codes were first defined by Levenshtein [10] who found that they coincide with a class of codes for which there exists a decoding automaton that correctly decodes a sequence of input letters independently of the choice of the initial state of the automaton [11]. This property ensures that errors in the input word and random transitions between states of the automaton do not affect the decoding of subsequent input words. He proposed Construction 1 (below) that generates non-overlapping codes of not necessarily optimal sizes. He also established a lower bound $S(q,n) \gtrsim \frac{q-1}{qe} \frac{q^n}{n}$ [10] and an upper bound $S(q,n) \le \left( \frac{n-1}{n}\right) ^{n-1}\frac{q^n}{n}$ [11] for the size of a maximum non-overlapping code.

Construction 1

Let $n > 1$ and $q > 1$ be integers and $1 \le k \le n - 1$. Denote by C the set of all codewords $s = (s_1,s_2,\dots ,s_n) \in {\mathbb {Z}}_q^n$ such that $s_1 \ne 0$, $s_{n-k} \ne 0$, $s_{n-k+1} = s_{n-k+2} = \cdots = s_n = 0$, and $(s_{1},\dots ,s_{n-k})$ does not contain k consecutive 0’s. Then C is a non-overlapping code.

His construction was later rediscovered by Chee et al. [6], who reversed the codewords. They showed that the size of a code C obtained using this construction and parameter $ 2 \le k \le n - 2$ equals $|C |= (q-1)^2 F_{k,q}(n-k-2)$ with initialization $F_{k,q} (i) = q^i\; \forall i \in \{0, \dots , k-1\}$ and a $(q-1)$-weighted k-generalized Fibonacci recurrence relation $F_{k,q}(n) = (q-1) \sum _{l=1}^k F_{k,q}(n-l)$. Additionally, they searched for maximum binary non-overlapping codes with $n \le 16$ by determining a maximum clique in a graph with vertices that correspond to words in ${\mathbb {Z}}_2^n$ that are non-overlapping and edges that correspond to pairs of non-overlapping words. Note that this approach cannot be efficiently applied to larger alphabets and code lengths because computing a maximum clique of a graph with m vertices requires $O(2^{\frac{m}{4}})$ time [13] and the graph has $O(q^n)$ vertices.
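The size formula for Construction 1 can be compared against a direct enumeration for small parameters. This is a sketch under the assumption that codewords are tuples over $\{0,\dots,q-1\}$; the function names are illustrative.

```python
from itertools import product


def construction1(q, n, k):
    """Enumerate the codewords of Construction 1 as tuples over {0, ..., q-1}."""
    zeros = (0,) * k
    code = []
    for s in product(range(q), repeat=n):
        head, tail = s[:n - k], s[n - k:]
        # s_1 != 0, s_{n-k} != 0, and the last k symbols are all zero.
        if head[0] == 0 or head[-1] == 0 or tail != zeros:
            continue
        # The first n-k symbols must not contain k consecutive zeros.
        if any(head[i:i + k] == zeros for i in range(len(head) - k + 1)):
            continue
        code.append(s)
    return code


def fib(k, q, m):
    """F_{k,q}(m): q^i for i < k, then (q-1) times the sum of the previous k terms."""
    vals = [q ** i for i in range(k)]
    for i in range(k, m + 1):
        vals.append((q - 1) * sum(vals[i - k:i]))
    return vals[m]


def size_formula(q, n, k):
    """(q-1)^2 * F_{k,q}(n-k-2), stated in the text for 2 <= k <= n-2."""
    return (q - 1) ** 2 * fib(k, q, n - k - 2)


for q, n, k in [(2, 6, 2), (3, 5, 2), (3, 6, 2), (2, 8, 3)]:
    print((q, n, k), len(construction1(q, n, k)), size_formula(q, n, k))
```

The enumeration visits all $q^n$ words, so it serves only as a sanity check of the closed formula and not as a practical generator.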

Another construction that generates binary non-overlapping codes was provided by Bilotta et al. [4]. It uses Dyck sequences as described in Construction 2, and for both even length $2n+2$ and odd length $2n+1$ it generates codes whose number of codewords is the n-th Catalan number, $|C_{2n+2} |= |C_{2n+1} |= \frac{1}{n+1}\binom{2n}{n}$.

Construction 2

Let ${\mathcal {D}}$ be the set of Dyck sequences of even length 2n, i.e. binary sequences composed of n zeros and n ones such that no prefix of a sequence has more zeros than ones. The set of sequences of even length $C_{2n+2} = \{1a0: a \in {\mathcal {D}}\}$ and the set of sequences of odd length $C_{2n+1} = \{1a: a \in {\mathcal {D}}\}$ are non-overlapping codes.
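Construction 2 is easy to reproduce for small n. The sketch below assumes binary strings over the characters `"0"` and `"1"` and uses brute-force filtering to list Dyck sequences; the function names are illustrative only.

```python
from itertools import product
from math import comb


def non_overlapping(c1, c2):
    """Definition 1 for two words of equal length."""
    n = len(c1)
    return all(c1[n - k:] != c2[:k] and c2[n - k:] != c1[:k] for k in range(1, n))


def dyck(n):
    """Binary strings with n ones and n zeros where no prefix has more zeros than ones."""
    out = []
    for bits in product("10", repeat=2 * n):
        balance = 0  # number of ones minus number of zeros seen so far
        for b in bits:
            balance += 1 if b == "1" else -1
            if balance < 0:
                break
        else:
            if balance == 0:
                out.append("".join(bits))
    return out


def bilotta_codes(n):
    """Return (C_{2n+2}, C_{2n+1}) from Construction 2."""
    d = dyck(n)
    return ["1" + a + "0" for a in d], ["1" + a for a in d]


even, odd = bilotta_codes(3)
print(len(even), len(odd), comb(6, 3) // 4)  # all three are the Catalan number for n = 3
print(all(non_overlapping(a, b) for a in even for b in even))
```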

This class of codes is not restricted to binary codes as Wang and Wang [14] showed that any binary non-overlapping code can be generalized to q-ary non-overlapping codes using Construction 3 (below). The size of the generalized codes was, however, not further theoretically analysed.

Construction 3

Let S be a binary non-overlapping code and (I, J) a partition of ${\mathbb {Z}}_q$ with $q \ge 2$. For $w = w_1\cdots w_n \in S$ we define a mapping $\Phi (w) = \{y_1\cdots y_n \mid y_i \in I\text { if } w_i = 0\text { and }y_i \in J\text { if } w_i= 1\}$. The set $\Phi (S) = \bigcup _{w \in S} \Phi (w)$ is a non-overlapping code.
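The mapping $\Phi$ of Construction 3 amounts to a Cartesian product per codeword. A minimal sketch, assuming the binary code is given as strings over `"0"` and `"1"` and the parts I and J as lists of single-character strings (all names are hypothetical):

```python
from itertools import product


def non_overlapping(c1, c2):
    """Definition 1 for two words of equal length."""
    n = len(c1)
    return all(c1[n - k:] != c2[:k] and c2[n - k:] != c1[:k] for k in range(1, n))


def lift(S, I, J):
    """Phi(S): replace each 0 by any letter of I and each 1 by any letter of J."""
    out = set()
    for w in S:
        choices = [I if b == "0" else J for b in w]
        out.update("".join(y) for y in product(*choices))
    return out


# Lift the binary code {110} to the ternary alphabet with I = {0} and J = {1, 2}.
code = lift({"110"}, ["0"], ["1", "2"])
print(sorted(code))  # ['110', '120', '210', '220']
print(all(non_overlapping(a, b) for a in code for b in code))  # True
```

In this small instance the lifted code has four codewords, which equals the value that the formula for $S(q,3)$ quoted in this section gives at $q = 3$.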

Blackburn [5] provided the class of non-overlapping codes described in Construction 4 (below) and proved that it contains a maximum non-overlapping code when n divides q, as a code with $k=n-1$ and $S=I^k$ reaches the upper bound set by Levenshtein. He also proved that $S(q,2) = \lfloor \frac{q}{2} \rfloor \lceil \frac{q}{2} \rceil $ and $S(q,3) = \left[ \frac{2q}{3}\right] ^2 \left( q - \left[ \frac{2q}{3}\right] \right) $, where $\left[ x\right] $ denotes the rounding of x to the closest integer. He conjectured that for $n > 2$ and sufficiently large q, a maximum non-overlapping code is given by Construction 4 with $k=n-1$ and some value of l, but this conjecture is yet to be (dis)proven.
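The closed formulas for $S(q,2)$ and $S(q,3)$ can be evaluated exactly and compared with Levenshtein's upper bound quoted earlier in this section. The sketch below uses integer and rational arithmetic to avoid rounding issues; the helper names are illustrative, and the rounding $[2q/3]$ is computed as $\lfloor (4q+3)/6 \rfloor$, which is valid because $2q/3$ is never exactly halfway between two integers.

```python
from fractions import Fraction


def nearest_two_thirds(q):
    """[2q/3]: 2q/3 rounded to the closest integer, i.e. floor(2q/3 + 1/2)."""
    return (4 * q + 3) // 6


def S2(q):
    """S(q, 2) = floor(q/2) * ceil(q/2)."""
    return (q // 2) * ((q + 1) // 2)


def S3(q):
    """S(q, 3) = [2q/3]^2 * (q - [2q/3])."""
    r = nearest_two_thirds(q)
    return r * r * (q - r)


def levenshtein_upper(q, n):
    """Upper bound ((n-1)/n)^(n-1) * q^n / n as an exact fraction."""
    return Fraction(n - 1, n) ** (n - 1) * Fraction(q ** n, n)


for q in range(2, 8):
    print(q, S2(q), levenshtein_upper(q, 2), S3(q), levenshtein_upper(q, 3))
```

For the printed range the bound is met with equality exactly when n divides q, in line with the statement about Construction 4 above.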

Construction 4

Let k and l be integers such that $1 \le k \le n-1$ and $1 \le l \le q - 1$. Let $\Sigma = I \cup J$ be a partition of a set $\Sigma $ of cardinality q into two parts I and J of cardinalities l and $q-l$ respectively. Let $S \subseteq I^k \subseteq \Sigma ^k$. Let $C_{S,J}(n)$ be the set of all words $c \in \Sigma ^n$ such that $c_1 c_2 \cdots c_k \in S$, $c_{k+1} \in J$, $c_n \in J$ and no subword of the word $c_{k+2}c_{k+3}\cdots c_{n-1}$ lies in S. Then $C_{S,J}(n)$ is a non-overlapping code.

The size of a general code obtained using Blackburn’s construction is still an open problem, but Wang and Wang [14] provided a formula for the case $S = I^k$. It states that for $k \in \{n-1, n-2\}:\;$ $|C_{I^k,J}(n)|= |I |^k |J |^{n-k}$ and for $k \le n - 3:\;$ $|C_{I^k,J}(n) |= q \cdot |C_{I^k,J}(n-1)|- |I |^k |J|\cdot |C_{I^k,J}(n-k-1)|$.
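The recurrence for $S = I^k$ can be compared with a brute-force count of Construction 4 on small instances. The sketch below takes $I = \{0,\dots,l-1\}$ and $J = \{l,\dots,q-1\}$. It also assumes the convention $|C_{I^k,J}(m)| = 0$ for $m \le k$, which the text does not state but which is needed whenever the recurrence refers to a length $n-k-1 \le k$; this convention is an assumption of the illustration, as are the function names.

```python
from itertools import product


def size_bruteforce(q, l, k, n):
    """Count the words of C_{I^k,J}(n) by enumerating all q^n words."""
    count = 0
    for c in product(range(q), repeat=n):
        if any(x >= l for x in c[:k]):      # c_1 ... c_k must lie in I^k
            continue
        if c[k] < l or c[n - 1] < l:        # c_{k+1} and c_n must lie in J
            continue
        middle = c[k + 1:n - 1]             # c_{k+2} ... c_{n-1}
        if any(all(x < l for x in middle[i:i + k])
               for i in range(len(middle) - k + 1)):
            continue                        # a subword of the middle lies in I^k
        count += 1
    return count


def size_recurrence(q, l, k, n):
    """Closed form for k in {n-1, n-2}, recurrence for k <= n-3."""
    if n <= k:
        return 0  # assumed convention, not stated in the text
    if k >= n - 2:
        return l ** k * (q - l) ** (n - k)
    return (q * size_recurrence(q, l, k, n - 1)
            - l ** k * (q - l) * size_recurrence(q, l, k, n - k - 1))


print(size_bruteforce(3, 2, 2, 6), size_recurrence(3, 2, 2, 6))
```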

Barcucci et al. [3] defined a set of non-overlapping codes given in Construction 5 (below) that are generated using colored Motzkin words and showed they are all maximal. They did not compute the size of the codes.

Definition 5

A sequence in $ {\mathbb {Z}}_{q+2}^n$ that contains the same number of zeros and ones and in which no prefix has more zeros than ones is called a q-colored Motzkin word of length n. We denote the set of all such sequences with ${\mathcal {M}}_q(n)$.

Definition 6

An elevated q-colored Motzkin word of length n is a sequence $1\alpha 0$, where $\alpha \in {\mathcal {M}}_{q}(n-2)$. We denote the set of all such words with $\hat{{\mathcal {M}}}_{q}(n)$.

Construction 5

$CBFS_q(n) = A_q(n) \cup B_q(n) \cup C_q(n)$ is a maximal non-overlapping code where $A_q(n) = \{\alpha \beta : \alpha \in {\mathcal {M}}_{q-2}(i), \beta \in \hat{{\mathcal {M}}}_{q-2}(n-i)\} {\setminus } \{\alpha \beta : \alpha , \beta \in \hat{{\mathcal {M}}}_{q-2}(\frac{n}{2})\}$, $B_q(n) = \{1\alpha \beta : \alpha \in {\mathcal {M}}_{q-2}(i), \beta \in \hat{{\mathcal {M}}}_{q-2}(n-i-1)\}$, and $C_q(n) = \{\gamma 0: \gamma \in {\mathcal {M}}_{q-2}(n-1), \gamma \ne u\beta v, \beta \in \hat{{\mathcal {M}}}_{q-2}(j) \}$.

Fimmel, Michel and Strüngmann [7] proposed that three-letter non-overlapping codes emerged naturally. In living cells the ribosome behaves as a decoding automaton that decodes a sequence of codons to a sequence of amino acids using the genetic code as a decoding function. Data suggest that the ancestral genetic code was indeed a non-overlapping code. Inspired by Blackburn’s work, they recently showed that the set of maximal two-letter non-overlapping codes over $\Sigma $ is exactly the set of partitions of $\Sigma $ into two non-empty parts and counted that there are $2^q - 2$ such codes [9]. Moreover, they characterized maximal three-letter non-overlapping codes (see Proposition 1 below) and counted that there are $\sum _{m=1}^{q-1} \binom{q}{m} 2^{m(q-m)}$ such codes. They also determined $N(q,3) = 2\binom{q}{\left[ \frac{2q}{3}\right] }$.

Definition 7

Let L and R be sets of strings.

(i)
$\left( LR\right) $ denotes the concatenation of sets L and R, i.e. the set of all strings of the form lr, where $l \in L$ and $r \in R$,
(ii)
$L^i$ denotes the concatenation of the set L with itself i times,
(iii)
the Kleene star denotes the smallest superset that is closed under concatenation and includes the empty string, $L^* = \bigcup _{i \ge 0} L^i$, and
(iv)
the Kleene plus denotes the smallest superset that is closed under concatenation and need not include the empty string, $L^+ = \bigcup _{i > 0} L^i$.

Proposition 1

([9], Theorem 5.3) The set of maximal three-letter non-overlapping codes is exactly the set of all three-letter codes $X_3 = (L_1R_2) \cup (L_2R_1)$, where

(i)
$(L_1, R_1)$ is a partition of $\Sigma $ into two non-empty parts, and
(ii)
$(L_2, R_2)$ is a partition of $(L_1 R_1)$.
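The codes described by Proposition 1 can be listed explicitly for small alphabets and compared with the counting formulas quoted above. The following is a sketch under the assumption that words are Python strings and sets of words are Python sets; the function names are illustrative and the enumeration is exponential in $q$.

```python
from itertools import combinations
from math import comb


def subsets(items):
    """All subsets of a finite collection, as sets."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield set(c)


def concat(A, B):
    """(AB): all concatenations ab with a in A and b in B."""
    return {a + b for a in A for b in B}


def three_letter_codes(alphabet):
    """All codes (L1 R2) u (L2 R1) with the partitions of Proposition 1."""
    sigma = set(alphabet)
    codes = set()
    for L1 in subsets(sigma):
        R1 = sigma - L1
        if not L1 or not R1:
            continue  # both parts of the first partition must be non-empty
        U = concat(L1, R1)
        for L2 in subsets(U):
            R2 = U - L2
            codes.add(frozenset(concat(L1, R2) | concat(L2, R1)))
    return codes


q = 3
codes = three_letter_codes("VRT")
largest = max(len(c) for c in codes)
print(len(codes), sum(comb(q, m) * 2 ** (m * (q - m)) for m in range(1, q)))
print(largest, sum(1 for c in codes if len(c) == largest))
```

For $q = 3$ the first line compares the number of distinct codes with the sum formula, and the second line reports the largest size together with the number of codes attaining it.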

3 Constructing maximal non-overlapping codes

Fimmel et al. [9] show that a straightforward generalisation of Proposition 1 does not hold. The underlying construction can, however, be generalized as described in Construction 6 (below). They posit that it is composed of non-overlapping codes only and that it contains all maximal non-overlapping codes.

Construction 6

Let $n \ge 3$. We define ${\mathcal {M}}_{q,n}$ to be the set of all codes $C = \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) \subseteq \Sigma ^n$, where

(i)
$(L_1, R_1)$ is a partition of $\Sigma $ into two non-empty parts, and
(ii)
$(L_i, R_i)$ is a partition of $\bigcup _{j=1}^{i-1} \left( L_j R_{i-j} \right) $ for every $i \in \{2,\dots , n-1\}$.

Before proving both statements, let us observe that the mapping from the partitions of Construction 6 to the set of codes $C \subseteq \Sigma ^n$ is not injective. Two distinct collections of partitions $(L_i,R_i)_{i=1,\dots ,n-1}$ can determine the same code as demonstrated by Example 3.

Example 3

Set $n=4$ and take a partition $(L_1,R_1)$ of $\Sigma $. If $L_2 = (L_1R_1)$, $R_2 =\emptyset $ and $L_3 = \emptyset $, $R_3 = (L_1R_2) \cup (L_2R_1) = (L_1R_1^2)$, we get a code $C_1 = (L_1R_3) \cup (L_2R_2) \cup (L_3R_1) = (L_1^2 R_1^2)$. Now observe $L_2 = \emptyset $, $R_2 =(L_1R_1)$ and $L_3 = (L_1R_2) \cup (L_2R_1) = (L_1^2R_1)$, $R_3 = \emptyset $. We get a code $C_2 = (L_1R_3) \cup (L_2R_2) \cup (L_3R_1) = (L_1^2 R_1^2)$. In particular, $C_1 = C_2$.
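Example 3 can be replayed on a concrete alphabet. The sketch below assumes $\Sigma = \{a, b, c\}$ with $L_1 = \{a\}$ and $R_1 = \{b, c\}$, stores each collection of partitions as two dictionaries indexed by word length, and uses hypothetical helper names; it checks that both collections are valid for Construction 6 and yield the same code.

```python
def concat(A, B):
    """(AB): all concatenations ab with a in A and b in B."""
    return {a + b for a in A for b in B}


def union_of_products(L, R, i):
    """The union of (L_j R_{i-j}) over j = 1, ..., i-1."""
    return set().union(*(concat(L[j], R[i - j]) for j in range(1, i)))


def is_valid_collection(L, R, sigma, n):
    """Check conditions (i) and (ii) of Construction 6."""
    if not L[1] or not R[1] or L[1] & R[1] or L[1] | R[1] != set(sigma):
        return False
    for i in range(2, n):
        if L[i] & R[i] or L[i] | R[i] != union_of_products(L, R, i):
            return False
    return True


def build_code(L, R, n):
    """C = union of (L_i R_{n-i}) over i = 1, ..., n-1."""
    return union_of_products(L, R, n)


L1, R1 = {"a"}, {"b", "c"}
# First collection: L_2 = (L_1 R_1), R_2 empty, L_3 empty, R_3 = (L_1 R_1^2).
La = {1: L1, 2: concat(L1, R1), 3: set()}
Ra = {1: R1, 2: set(), 3: concat(L1, concat(R1, R1))}
# Second collection: L_2 empty, R_2 = (L_1 R_1), L_3 = (L_1^2 R_1), R_3 empty.
Lb = {1: L1, 2: set(), 3: concat(L1, concat(L1, R1))}
Rb = {1: R1, 2: concat(L1, R1), 3: set()}

print(is_valid_collection(La, Ra, "abc", 4), is_valid_collection(Lb, Rb, "abc", 4))
print(sorted(build_code(La, Ra, 4)))
print(build_code(La, Ra, 4) == build_code(Lb, Rb, 4))  # True
```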

Proposition 2

Let $C=\bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) \in {\mathcal {M}}_{q,n}$. Define $X:= \bigcup _{i=1}^{2n+1} X_i$, such that

$$\begin{aligned} X_{2i-1}&:= L_i, \\ X_{2i}&:= R_i, \\ X_{2n+1}&:= C. \end{aligned}$$

No proper prefix in X occurs as a proper suffix in X.

Proof

Assume, for the sake of contradiction, that the statement does not hold. Let p be the length of the shortest proper prefix in X that occurs as a proper suffix in X. Let i be the smallest index such that there exists a word $w \in X_i$ with a prefix $x_1\cdots x_p$ that occurs as a suffix in X. Let j be the smallest index such that $X_j$ contains a word $w'$ that ends in $x_1\cdots x_p$. Clearly $i,j > 2$, as words in $X_1$ and $X_2$ have neither proper prefixes nor proper suffixes. Therefore $w \in (L_k R_{\lceil \frac{i}{2} \rceil - k}) \subseteq X_i$ for some $k < \lceil \frac{i}{2}\rceil $.

If $k < p$, then $x_{k+1}\cdots x_p$ is a prefix in $R_{\lceil \frac{i}{2} \rceil - k} = X_{2{\lceil \frac{i}{2} \rceil } - 2k}$ and suffix in $X_j$ that is shorter than p. Since $\lceil \frac{i}{2} \rceil - k > 1$ (otherwise $x_1\cdots x_p = w$ is not a proper prefix of w), this contradicts the minimality of p. If $p < k$, then $x_1\cdots x_p$ is a prefix in $L_k = X_{2k-1}$. Since for $l > 2$ every one-letter prefix in X is from $L_1$ and every one-letter suffix from $R_1$, $p > 1$ and $k > 2$. Therefore $2k - 1 > 1$ and $2k - 1 < i$. This contradicts minimality of i. So $p = k$ and hence $x_1\cdots x_p \in L_k = L_p$. Repeat the symmetric procedure on $X_j$ to obtain a word $w' \in (L_{\lceil \frac{j}{2} \rceil - l} R_l)$ ending in $x_1\cdots x_p$ and $p=l$. So $x_1\cdots x_p \in R_l = R_p$. Then $x_1\cdots x_p \in L_p \cap R_p$, but $L_p \cap R_p = \emptyset $ by definition. $\square $

Theorem 3

$C \in {\mathcal {M}}_{q,n}$ is a non-overlapping code.

Proof

The theorem follows directly from Proposition 2. $\square $

Corollary 1

(i)
$L_k \cup R_k$ is a non-overlapping code for all $1 \le k \le n - 1$.
(ii)
If $w_1\cdots w_k \in L_k \cup R_k$ for some $k > 1$, then $w_1 \in L_1$ and $w_k \in R_1$.

Proof

The corollary follows directly from Construction 6 and Theorem 3. $\square $

3.1 Size of $C \in {\mathcal {M}}_{q,n}$

Now that we know that every $C \in {\mathcal {M}}_{q,n}$ is non-overlapping, we want to determine the size of C. Theorem 6 (below) shows that it depends only on the sizes of the sets in the partitions $(L_i,R_i)_{i=1,\dots ,n-1}$. Before stating and proving the formula, we define a finite sequence of decompositions due to Fimmel et al. [8] and prove some of its properties.

Definition 8

Let $w \in \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) $ and let us set $L:= \bigcup _{i<n} L_i$ and $R:= \bigcup _{i<n} R_i$. Define $\{p_k(w)\}_{k \ge 0}$ to be a sequence of decompositions of w in $\left( L\left( L\cup R\right) ^* R\right) $ encoded with a binary word ${p_k}$ over the alphabet $\{\text {l},\text {r}\}$ such that

(i)
$p_0(w) \in \left( \text {l}\{\text {l},\text {r}\}^{n-2}\text {r}\right) $ with $p_0(w)_i = \text {l}$ if $w_i \in L_1$ and $p_0(w)_i=\text {r}$ if $w_i \in R_1$,
(ii)
if $p_k(w) \in \left( \text {l}\{\text {l},\text {r}\}^{+}\text {r}\right) $, then $p_{k+1}(w) \in \left( \text {l}\{\text {l},\text {r}\}^* \text {r}\right) $ is obtained by replacing each occurrence of lr in $p_{k}(w)$ by l if lr corresponds to a subword $w'$ of w in $(L_i R_j) \subseteq L_{i+j}$, or by r if it corresponds to a subword $w'$ of w in $(L_i R_j) \subseteq R_{i+j}$ for some positive integers i and j.
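The sequence of decompositions in Definition 8 can be traced step by step. In the sketch below each symbol of $p_k(w)$ is stored together with the subword of w it stands for; the dictionaries `L` and `R` map a length i to the sets $L_i$ and $R_i$, and the binary example collection is an assumption chosen for illustration.

```python
def decompositions(w, L, R):
    """Return [p_0(w), p_1(w), ...] as strings over {'l', 'r'}."""
    # Each segment is a pair (label, subword of w that the label stands for).
    segs = [("l" if ch in L[1] else "r", ch) for ch in w]
    seq = ["".join(label for label, _ in segs)]
    while len(segs) > 2:
        new, i, changed = [], 0, False
        while i < len(segs):
            if i + 1 < len(segs) and segs[i][0] == "l" and segs[i + 1][0] == "r":
                sub = segs[i][1] + segs[i + 1][1]
                m = len(sub)
                if sub in L[m]:
                    new.append(("l", sub))
                elif sub in R[m]:
                    new.append(("r", sub))
                else:
                    raise ValueError(f"{sub!r} is in neither L_{m} nor R_{m}")
                i += 2
                changed = True
            else:
                new.append(segs[i])
                i += 1
        if not changed:
            raise ValueError("no occurrence of lr; w does not decompose")
        segs = new
        seq.append("".join(label for label, _ in segs))
    return seq


# Binary example with n = 4: L_2 = {ab}, R_3 = {abb}, so C = {aabb}.
L = {1: {"a"}, 2: {"ab"}, 3: set()}
R = {1: {"b"}, 2: set(), 3: {"abb"}}
print(decompositions("aabb", L, R))  # ['llrr', 'llr', 'lr']
```

With the other collection that generates the same code, $R_2 = \{ab\}$ and $L_3 = \{aab\}$, the intermediate decomposition is lrr instead of llr, while the first and last elements are unchanged.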

Proposition 4

Let $x \in L$ and $y \in R$. If the length of xy is at most $n-1$, then $xy \in L \cup R$.

Proof

Since $x \in L$ there exists some i, $0< i < n$, such that $x \in L_i$. Since $y \in R$ there exists some j, $0< j < n$, such that $y \in R_j$. Therefore $xy \in (L_iR_j) \subseteq \bigcup _{k < i + j} (L_k R_{i+j-k})$. If $i + j < n$, the latter is partitioned into $(L_{i+j}, R_{i+j})$, so either $xy \in L_{i+j} \subseteq L$ or $xy \in R_{i+j} \subseteq R$. $\square $

Proposition 5

The sequence $\{p_k(w)\}_{k \ge 0}$ is

(i)
well-defined,
(ii)
finite and its last element is lr.

Proof

(i)
Every letter in w belongs to $\Sigma = L_1 \cup R_1$. If lr occurs as a proper substring of $p_k(w)$, then it corresponds to some substring $w'$ of w of length at most $n-1$, and by Proposition 4 $w' \in L \cup R$. Therefore lr can be replaced by either l or r. If lr occurs at the beginning of $p_k(w)$, then it is replaced by l: otherwise $w' \in R_i$ for some $i < n - 1$ and there is a word in $(L_1R_i) \subseteq L \cup R$ that ends in $w'$, which contradicts Proposition 2. A symmetric observation guarantees that an lr at the end of $p_k(w)$ is replaced by r.
(ii)
If $p_k(w) \in (\text {l}\{\text {l},\text {r}\}^+\text {r})$, then the length of $p_{k+1}(w)$ is strictly smaller than the length of $p_k(w)$. If $p_k(w) \not \in (\text {l}\{\text {l},\text {r}\}^+\text {r})$, then $p_{k+1}$ is not defined and $p_k(w)$ is the last element of the sequence. The previous step reveals that then $p_k(w) = \text {lr}$. $\square $

Corollary 2

Let $1< l < n$ and $w \in L_l \cup R_l$.

The sequence $\{p_k(w)\}_{k \ge 0}$ is finite. The last element of the sequence is lr.

Proof

If w belongs to $L_l$ (alternatively to $R_l$), then it also belongs to $L_l \cup R_l$. The latter is a non-overlapping code as noted in Corollary 1, so the statement follows from Proposition 5. $\square $

Theorem 6

$|\bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) |= \sum _{i=1}^{n-1} |L_i||R_{n-i}|$.

Proof

It is sufficient to explain that for every word $w \in \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) $ there is a unique pair of sets $L_i$ and $R_{n-i}$ such that $w \in \left( L_i R_{n-i} \right) $.

Assume, for the sake of contradiction, that there exists a word $w = w_1\cdots w_n \in \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) $ and indices $i < j$ such that $w \in (L_iR_{n-i})$ and $w \in (L_jR_{n-j})$. Consider the sequence $\{p_k(w_{i+1}\cdots w_j)\}_{k \ge 0}$.

STEP 1: The sequence $\{p_k(w_{i+1}\cdots w_j)\}_{k \ge 0}$ is well-defined, i.e. $\forall k \ge 0:\; p_k \in (\text {l}\{\text {l},\text {r}\}^*\text {r})$.

(i) $p_0 (w_{i+1} \cdots w_j) \in (\text {l}\{\text {l},\text {r}\}^{j-i-2}\text {r})$.

Since $1 \le i < j$ and $w_1 \cdots w_j \in L_j$, Corollary 1 implies $w_{j} \in R_1$. Similarly, since $i+1 \le j < n$ and $w_{i+1} \cdots w_n \in R_{n-i}$, Corollary 1 implies $w_{i+1} \in L_1$.

(ii) If $p_k(w_{i+1}\cdots w_j) \in (\text {l}\{\text {l},\text {r}\}^+\text {r})$, then $p_{k+1}(w_{i+1}\cdots w_j) \in (\text {l}\{\text {l},\text {r}\}^*\text {r})$.

Suppose lr occurs in a decomposition $p_k(w_{i+1}\cdots w_j) \in (\text {l}\{\text {l},\text {r}\}^+\text {r})$ and corresponds to a substring $w'$. Then the same lr occurs in the part of the decomposition $p_k(w_1 \cdots w_j)$ that corresponds to the substring $w'$ of $w_{i+1}\cdots w_j$. Therefore, $w' \in L \cup R$. If $w'$ is a suffix of $w_{i+1}\cdots w_j$, then lr is also a suffix of $p_k(w_1 \cdots w_j)$ and $w' \in R$. At the same time lr occurs in the part of the decomposition $p_{k}(w_{i+1}\cdots w_n)$ that corresponds to $w'$. If $w'$ is a prefix of $w_{i+1}\cdots w_j$, then lr occurs as a prefix of $p_{k}(w_{i+1}\cdots w_n)$ and $w' \in L$. Therefore $p_{k+1}(w_{i+1}\cdots w_j) \in (\text {l}\{\text {l},\text {r}\}^*\text {r})$.

STEP 2: The sequence $\{p_k(w_{i+1}\cdots w_j)\}_{k \ge 0}$ is finite, the last element $p_{{\hat{k}}}(w_{i+1}\cdots w_j) = \text {lr}$.

We noticed in step 1 that every decomposition $p_k(w_{i+1}\cdots w_j)$ is a substring of $p_k(w_{i+1} \cdots w_n)$. Since $\{p_k(w_{i+1} \cdots w_n)\}_{k \ge 0}$ is finite due to Corollary 2, $\{p_k(w_{i+1}\cdots w_j)\}_{k \ge 0}$ is also finite. From part (ii) of step 1 it follows that the last element $p_{{\hat{k}}}(w_{i+1}\cdots w_j)$ equals lr.

STEP 3: $w_{i+1}\cdots w_j \in L_{j-i} \cap R_{j-i}$.

The process described in part (ii) of Step 1 shows that $p_{{\hat{k}}}(w_{1}\cdots w_j) \in (\text {l}\{\text {l},\text {r}\}^*\text {lr})$ and the last lr corresponds to $w_{i+1}\cdots w_j$, so $w_{i+1}\cdots w_j \in R_{j-i}$. Due to the same argument $w_{i+1}\cdots w_j$ corresponds to the first lr of $p_{{\hat{k}}}(w_{i+1}\cdots w_n) \in (\text {lr}\{\text {l},\text {r}\}^*\text {r})$, so $w_{i+1}\cdots w_j \in L_{j-i}$. $(L_{j-i}, R_{j-i})$ is a partition, so $L_{j-i} \cap R_{j-i} = \emptyset $ and $w_{i+1}\cdots w_j = \epsilon $, but this contradicts the assumption that $i < j$. $\square $
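Theorems 3 and 6 can be checked exhaustively for very small parameters by enumerating every collection of partitions allowed by Construction 6. The following is a sketch with hypothetical helper names; the number of collections grows very quickly, so only tiny values of q and n are practical.

```python
from itertools import combinations


def subsets(items):
    """All subsets of a finite collection, as sets."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield set(c)


def concat(A, B):
    return {a + b for a in A for b in B}


def products(L, R, i):
    """The union of (L_j R_{i-j}) over j = 1, ..., i-1."""
    return set().union(*(concat(L[j], R[i - j]) for j in range(1, i)))


def collections(sigma, n):
    """Yield every (L, R) satisfying conditions (i) and (ii) of Construction 6."""
    def extend(L, R, i):
        if i == n:
            yield L, R
            return
        U = products(L, R, i)
        for Li in subsets(U):
            yield from extend({**L, i: Li}, {**R, i: U - Li}, i + 1)

    for L1 in subsets(set(sigma)):
        R1 = set(sigma) - L1
        if L1 and R1:
            yield from extend({1: L1}, {1: R1}, 2)


def non_overlapping(c1, c2):
    n = len(c1)
    return all(c1[n - k:] != c2[:k] and c2[n - k:] != c1[:k] for k in range(1, n))


def verify(sigma, n):
    """Return (number of collections, set of distinct codes); raise if a check fails."""
    count, codes = 0, set()
    for L, R in collections(sigma, n):
        C = products(L, R, n)
        assert all(non_overlapping(a, b) for a in C for b in C)            # Theorem 3
        assert len(C) == sum(len(L[i]) * len(R[n - i]) for i in range(1, n))  # Theorem 6
        count += 1
        codes.add(frozenset(C))
    return count, codes


count, codes = verify("ab", 4)
print(count, len(codes))  # 8 6
```

The gap between the number of collections and the number of distinct codes reflects the non-injectivity already seen in Example 3.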

3.2 Characterization of maximal non-overlapping codes

We will now show that ${\mathcal {M}}_{q,n}$ contains all maximal non-overlapping codes and determine which collections of partitions $(L_i, R_i)_{i=1,\dots ,n-1}$ generate maximal non-overlapping codes.

Proposition 7

Every maximal non-overlapping code is contained in ${\mathcal {M}}_{q,n}$.

Proof

Let X be a maximal q-ary n-letter non-overlapping code. Now construct the sets $(L_i,R_i)$ corresponding to X for $i < n$ as follows

$$\begin{aligned} L_1&:= \{x \in \Sigma \mid \text {there exists a word in}\ X\ \text {beginning with}\ x\}, \\ R_1&:= \Sigma \setminus L_1, \end{aligned}$$

and for $1< i < n$

$$\begin{aligned} L_i&:= \{x_1\cdots x_i \in \bigcup _{j< i} (L_j R_{i-j}) \mid \text {there exists a word in }\ X\ \text {beginning with}\ x_1\cdots x_i\}, \\ R_i&:= \bigcup _{j < i} (L_j R_{i-j}) \setminus L_i. \end{aligned}$$

Note that if a word in X ends in $x_1\cdots x_i \in \bigcup _{j < i} (L_j R_{i-j})$, then $x_1\cdots x_i \in R_i$ (otherwise $x_1\cdots x_i \in L_i$ and, by definition of $L_i$, there exists a word in X that starts with $x_1 \cdots x_i$, which contradicts the fact that X is non-overlapping). Define $Z:= \bigcup _{j < n} (L_j R_{n-j})$. We will prove that $X \subseteq Z$. Since X is maximal and Z is non-overlapping by Theorem 3, it immediately follows that $X = Z$.

Let $w \in X$. We will show that $\{p_k(w)\}_{k\ge 0}$ is well-defined. Every letter in w belongs to $L_1 \cup R_1 = \Sigma $ since $w\in \Sigma ^n$. The first letter of w belongs to $L_1$ by definition of $L_1$ and we explained earlier that the last letter belongs to $R_1$. Decomposition $p_0(w)$ is therefore well-defined. Now suppose $p_k(w) \in (\text {l}\{\text {l},\text {r}\}^*\text {r})$ for some $k \ge 0$. If $p_k(w) = \text {lr}$, then there exists some i such that $w \in (L_iR_{n-i})$ and $w \in Z$. Otherwise $p_k(w) \in (\text {l}\{\text {l},\text {r}\}^+\text {r})$ and there exists some lr in $p_k(w)$ that corresponds to a proper substring $w'$ of w. The length of $w'$ is at most $n-1$ and $w' \in L\cup R$, so lr can be replaced by either l or r in $p_{k+1}(w)$. If $p_k(w)$ starts with lr that corresponds to an m-letter substring $w_L$ of w, then by definition of $L_m$, $w_L \in L_m$ and lr is replaced by l in $p_{k+1}(w)$. If $p_k(w)$ ends with lr that corresponds to an m-letter substring $w_R$ of w, then $w_R\not \in L_m$ since X is a non-overlapping code. So $w_R \in R_m$ and lr is replaced by r in $p_{k+1}(w)$. The length of $p_{k+1}(w)$ is strictly smaller than the length of $p_k(w)$, so the sequence $\{p_k(w)\}_{k \ge 0}$ is indeed finite and its last element is lr. As shown earlier this implies that $w \in Z$. $\square $
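The construction used in this proof can be run directly on a given maximal code. The sketch below rebuilds the sets $L_i$ and $R_i$ from X exactly as defined in the proof and returns the resulting code Z; the function names are illustrative, and the input is assumed to be a maximal non-overlapping code given as a set of strings.

```python
def concat(A, B):
    return {a + b for a in A for b in B}


def products(L, R, i):
    """The union of (L_j R_{i-j}) over j = 1, ..., i-1."""
    return set().union(*(concat(L[j], R[i - j]) for j in range(1, i)))


def partitions_from_code(X, sigma):
    """Rebuild (L_i, R_i) for i < n from a maximal code X and return (L, R, Z)."""
    n = len(next(iter(X)))
    L = {1: {w[0] for w in X}}          # letters that begin some word of X
    R = {1: set(sigma) - L[1]}
    for i in range(2, n):
        U = products(L, R, i)
        # L_i keeps the candidates that occur as a prefix of some word in X.
        L[i] = {u for u in U if any(w.startswith(u) for w in X)}
        R[i] = U - L[i]
    Z = products(L, R, n)
    return L, R, Z


X = {"VRT", "VVT", "RVT", "RRT"}       # the maximal code from Example 2
L, R, Z = partitions_from_code(X, "VRT")
print(sorted(L[1]), sorted(R[1]), sorted(L[2]), sorted(R[2]))
print(Z == X)  # True
```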

Theorem 8

$C \in {\mathcal {M}}_{q,n}$ is maximal if and only if all the following statements hold.

(i)
If $L_i$ is non-empty and $R_{n-i} = \emptyset $, then every $x \in L_i$ is a prefix in some $L_j$ such that $i< j < n$ and $R_{n-j}$ is non-empty or $q=2$, $i = \frac{n}{2}$ and $|L_i|= 1$.
(ii)
If $R_i$ is non-empty and $L_{n-i} = \emptyset $, then every $x \in R_i$ is a suffix in some $R_j$ such that $i< j < n$ and $L_{n-j}$ is non-empty or $q=2$, $i = \frac{n}{2}$ and $|R_i|= 1$.
(iii)
If $q=2$ and $L_{\frac{n}{2}} = \{u\}$ such that u is not a prefix in C, then for every non-empty $L_j$, $2 \le j \le \frac{n}{2}-2$, the word in $(L_jL_{\frac{n}{2}})$ is a prefix in C. Moreover, the word in $(L_1L_{\frac{n}{2}})$ is a prefix in C or $L_{\frac{n}{2}-1}$ is empty and for every non-empty $L_j$, $1 \le j \le n/2 - 2$, the word in $(L_jL_1L_{\frac{n}{2}})$ is a prefix in C.
(iv)
If $q=2$ and $R_{\frac{n}{2}} = \{u\}$ such that u is not a suffix in C, then for every non-empty $R_j$, $2 \le j \le \frac{n}{2}-2$, the word in $(R_{\frac{n}{2}}R_j)$ is a suffix in C. Moreover, the word in $(R_{\frac{n}{2}}R_1)$ is a suffix in C or $R_{\frac{n}{2}-1}$ is empty and for every non-empty $R_j$, $1 \le j \le n/2 - 2$, the word in $(R_{\frac{n}{2}}R_1R_j)$ is a suffix in C.

Proof

STEP 1: If C is maximal, then statements (i) to (iv) hold.

Suppose $C = \bigcup _{i<n} (L_i R_{n-i})$ is maximal.

(i) Let $x \in L_i$. If for every j, $i< j < n$, x is not a prefix in $L_j$ or $R_{n-j}$ is empty, then x is not a prefix in $\bigcup _{j = i}^{n-1} (L_j R_{n-j}) \subseteq C$. It is also not a prefix in $\bigcup _{j < i} (L_j R_{n-j})$, otherwise $R_{n-j}$ has a prefix that is a suffix of $x \in L_i$. So x is not a prefix in C. It is also not a suffix in C due to Proposition 2.

CASE 1: $i \ne \frac{n}{2}$

$R_{n-i}$ is empty, therefore $L_{n-i} = \bigcup _{j< n-i} (L_j R_{n-i-j}) {\setminus } R_{n-i} = \bigcup _{j < n-i} (L_j R_{n-i-j})$ is non-empty. Observe the set $(L_{n-i}x)$. If yx is a suffix in $(L_{n-i}x)$, then y is a suffix in $L_{n-i}$, and it is neither a prefix in C nor in $L_{n-i}$ by Proposition 2. Since $n - i \ne \frac{n}{2}$, then x itself is not a prefix in $(L_{n-i}x)$, and we already showed that it is not a prefix in C. If y is a suffix of length k in $(L_{n-i}x)$ that is shorter than x, then it is not a prefix in C by Proposition 2. If $k \le n-i$ then y is not a prefix in $(L_{n-i}x)$ as it cannot be a prefix in $L_{n-i}$ by Proposition 2. If $k > n -i$ then y can be a prefix in $(L_{n-i}x)$ only if it has a suffix that is a prefix in x, but x is itself non-overlapping. Therefore $C \cup (L_{n-i}x)$ is a non-overlapping code. This contradicts the maximality of C.

CASE 2: $i = \frac{n}{2}$

Suppose there exists $y \in L_{\frac{n}{2}} \setminus \{x\}$. Observe the word yx. If z is a suffix of x, then it is not a prefix of y nor of any word in C by Proposition 2. If z is a suffix of yx longer than $\frac{n}{2}$, then z is not a prefix of yx nor of any word in C, otherwise y is not non-overlapping or it has a suffix that is a prefix of a word in C which contradicts Proposition 2. So $C \cup \{yx\}$ is a non-overlapping code, but this contradicts the maximality of C. Now notice that for $j > 2$, $|L_j \cup R_j |= \sum _{k < j} |L_k ||R_{j-k} |\ge |L_1 ||R_{j-1} |+ |L_{j-1} ||R_1 |\ge |L_{j-1}|+ |R_{j-1}|= |L_{j-1} \cup R_{j-1}|$. If $j = 2$, then $|L_j \cup R_j |= |L_1 ||R_1 |$. In both cases $|L_\frac{n}{2} |= 1$ is only possible if $|L_1 \cup R_1 |= 2$.

(ii) The proof is symmetric to the proof of statement (i).

(iii) The case $|L_{\frac{n}{2}}|= 1$ is only possible if either $L_2 = \cdots = L_{\frac{n}{2}-2} = \emptyset $ or $R_2 = \cdots = R_{\frac{n}{2}-2} = \emptyset $. Suppose that for some $k \in \{1,\dots ,\frac{n}{2}-2\}$ the word in $(L_kL_\frac{n}{2})$ is not a prefix in X. Then if $L_{\frac{n}{2}-k}$ is non-empty, $(L_{\frac{n}{2}-k}L_kL_{\frac{n}{2}}) \cup C$ is a non-overlapping code larger than C. If $k > 1$, $L_k$ is non-empty if and only if $L_{\frac{n}{2}-k}$ is non-empty. If $k=1$, $L_{\frac{n}{2} -1} = \emptyset $ and $L_l$ is non-empty for some $l < \frac{n}{2}-1$, it follows that $(L_{\frac{n}{2}-l-1}L_lL_1L_{\frac{n}{2}}) \cup C$ is a non-overlapping code larger than C if the word in $(L_lL_1L_{\frac{n}{2}})$ is not a prefix in C.

(iv) The proof is symmetric to the proof of statement (iii).

STEP 2: If statements (i) to (iv) hold, then C is maximal.

Let $C = \bigcup _{i<n} (L_i R_{n-i}) \in {\mathcal {M}}_{q,n}$ that satisfies (i) and (ii). Suppose C is not maximal. Then there exists a word $w \in \Sigma ^n \setminus C$ such that no proper prefix of w is a suffix in C and no proper suffix of w is a prefix in C. Let $\{{\hat{p}}_k(w)\}_{k \ge 0}$ be a sequence of decompositions of $w \in \Sigma ^n \setminus C$ in $(L\cup R)^+$ encoded with a binary word ${\hat{p}}_k$ over the alphabet $\{\text {l},\text {r}\}$ such that

(a)
${\hat{p}}_0(w) \in \{\text {l},\text {r}\}^n$ with ${\hat{p}}_0(w)_i = \text {l}$ if $w_i \in L_1$ and ${\hat{p}}_0(w)_i = \text {r}$ if $w_i \in R_1$,
(b)
if ${\hat{p}}_k(w)$ contains lr as a proper substring, then obtain ${\hat{p}}_{k+1}(w) \in \{\text {l},\text {r}\}^+$ by replacing each occurrence of lr in ${\hat{p}}_k(w)$ by l if lr corresponds to a subword $w'$ of w in $(L_iR_j) \subseteq L_{i+j}$ or by r if it corresponds to a subword $w'$ of w in $(L_iR_j) \subseteq R_{i+j}$ for some positive integers i and j.

Decomposition ${\hat{p}}_0(w)$ is well-defined as every letter of w is an element of $L_1 \cup R_1 = \Sigma $. If ${\hat{p}}_k(w)$ contains a substring lr that corresponds to a substring $w'$ of w then there exist integers i and j such that $w' \in L_{i+j} \cup R_{i+j}$. The sequence is therefore well-defined. The length of decompositions is strictly decreasing, so the sequence $\{{\hat{p}}_k(w)\}_{k \ge 0}$ is finite. Denote the last element with ${\hat{p}}_{{\hat{k}}}(w)$. The sequences in $\{\text {l},\text {r}\}^+$ that have no lr as a proper substring are of the forms lr, r$^+$l$^+$, l$^+$, r$^+$. Decomposition ${\hat{p}}_{{\hat{k}}}(w)$ cannot be lr, otherwise $w \in C$.

If n is odd or every word in $L_{\frac{n}{2}}$ is a prefix in C and every word in $R_{\frac{n}{2}}$ is a suffix in C, it follows from (i) that for every $x \in L$ there exists a word $w'\in C$ that starts in x and from (ii) that for every $x \in R$ there exists a word $w' \in C$ that ends in x. Therefore $C \cup \{w\}$ is not non-overlapping if ${\hat{p}}_{{\hat{k}}}(w) \in \text {r}^+\text {l}^+ \cup \text {r}^+ \cup \text {l}^+$.

Otherwise $q = 2$ and $|L_\frac{n}{2} \cup R_\frac{n}{2} |= 1$. The latter is only possible if either $L_2=\cdots =L_{\frac{n}{2}-2} = \emptyset $ or $R_2=\cdots =R_{\frac{n}{2}-2} = \emptyset $. Since every word in L of length distinct from $\frac{n}{2}$ is a prefix in C by (i) and every word in R of length distinct from $\frac{n}{2}$ is a suffix in C by (ii), ${\hat{p}}_{{\hat{k}}}(w) \in \text {r}^+\text {l}^+$ implies ${\hat{p}}_{{\hat{k}}}(w) = \text {rl}$ and $w \in (R_{\frac{n}{2}}L_{\frac{n}{2}}) = \emptyset $. Therefore ${\hat{p}}_{{\hat{k}}}(w) \in \text {r}^+ \cup \text {l}^+$ and $w \in (L^+LL_{\frac{n}{2}}) \cup (R_{\frac{n}{2}}RR^+)$ since w itself is non-overlapping. Suppose $R_{\frac{n}{2}}$ is empty (the proof when $L_{\frac{n}{2}}$ is empty is symmetric using statement (iv) instead of (ii)). The set $(R_\frac{n}{2}RR^+)$ is empty and hence $w \in (L^+LL_{\frac{n}{2}})$. If $(L_kL_\frac{n}{2})$ is non-empty for some $k \in \{2,\dots ,\frac{n}{2}-2\}$, then $w \in (L_kL_\frac{n}{2})$ is a prefix in C by (iii). Also the word in $(L_1L_{\frac{n}{2}})$ is a prefix in C or $L_{\frac{n}{2}-1}$ is empty and for every nonempty $L_k$, $k \in \{1,\dots ,\frac{n}{2}-2\}$, the word in $(L_kL_1L_\frac{n}{2})$ is a prefix in C. Hence $w \in (L_1L_{\frac{n}{2}-1}L_\frac{n}{2})$ and $L_{\frac{n}{2}-1}$ is non-empty, but the word in $(L_{\frac{n}{2}-1}L_\frac{n}{2})$ is a prefix in $(L_{\frac{n}{2}-1}R_{\frac{n}{2}+1}) \subseteq C$ since $L_{\frac{n}{2}}$ is not a prefix in C. No word can therefore be added to C, so C is maximal. $\square $

We showed earlier that the mapping from collections $(L_i, R_i)_{i=1,\dots ,n-1}$ to non-overlapping codes is not injective. Nevertheless, we will show that almost every maximal non-overlapping code corresponds to exactly one collection of partitions. Unfortunately, a characterization of maximal non-overlapping codes in terms of partition sizes cannot be given as demonstrated by Example 4, and we will therefore not use it directly when computing S(q, n).

Proposition 9

Let $C = \bigcup _{i< n} (L_iR_{n-i}) = \bigcup _{i < n} ({\hat{L}}_i{\hat{R}}_{n-i}) \in {\mathcal {M}}_{q,n}$ be a maximal non-overlapping code. Then

$$\begin{aligned} \forall i < \frac{n}{2}:\; L_i&= {\hat{L}}_i \text { and } R_i = {\hat{R}}_i.\\ \end{aligned}$$

Moreover, exactly one of the following statements holds.

(i)
$\forall i \ge \frac{n}{2}:\; L_i = {\hat{L}}_i$ and $R_i = {\hat{R}}_i$,
(ii)
$q = 2$, $L_{\frac{n}{2}} = {\hat{R}}_\frac{n}{2}$ and ${\hat{L}}_{\frac{n}{2}} = R_\frac{n}{2} = \emptyset $,
(iii)
$q = 2$, $L_{\frac{n}{2}} = {\hat{R}}_\frac{n}{2} = \emptyset $ and ${\hat{L}}_{\frac{n}{2}} = R_\frac{n}{2}$.

Proof

Suppose there exists some i such that $L_i \ne {\hat{L}}_i$ and $\forall j < i: L_j = {\hat{L}}_j$. By definition, $L_j \cup R_j = {\hat{L}}_j \cup {\hat{R}}_j$ for all $j \le i$, so $R_j = {\hat{R}}_j$ for all $j < i$. Furthermore at least one of the sets $L_i \cap {\hat{R}}_i$ and $R_i \cap {\hat{L}}_i$ is non-empty. Without loss of generality suppose $L_i \cap {\hat{R}}_i \ne \emptyset $ (otherwise $R_i \cap {\hat{L}}_i$ is non-empty and a symmetric argument follows). Let $x \in L_i \cap {\hat{R}}_i$. From Proposition 2 we know that x is neither a prefix nor a suffix in C. If $i \ne \frac{n}{2}$ then Theorem 8 implies that C is not maximal, a contradiction, so $L_j = {\hat{L}}_j$ for every j, $1 \le j < n$. If $i = \frac{n}{2}$ then Theorem 8 implies $q = 2$, $|L_\frac{n}{2}|= 1 = |{\hat{R}}_\frac{n}{2} |$ and ${\hat{L}}_\frac{n}{2} = R_\frac{n}{2} = \emptyset $, so $L_\frac{n}{2} = {\hat{R}}_\frac{n}{2}$. $\square $

Corollary 3

If n is odd or $q \ge 3$, then every maximal non-overlapping code corresponds to exactly one partition $(L_i, R_i)_{i=1,\dots ,n-1}$.

Proof

The second and third case of Proposition 9 cannot hold for an odd n nor for $q \ge 3$. $\square $

Example 4

Let $n=6$ and $\Sigma = \{0,1,2\}$. Set $L_1 = \{0\}$, $R_1 = \{1,2\}$, $L_2 = \emptyset $, $R_2 = \{01,02\}$, $L_3 = \{001\}$, $R_3 = \{002\}$, $L_4 = \{0002\}$, $R_4 = \{0011,0012\}$. If we set $L_5 = \{00011,00012\}$ and $R_5 = \{00101,00102,00021,00022\}$, we obtain

$$\begin{aligned} C_1 =\{&000101,000102,000021, 000022,001002,000201,000202,000111,000112, \\&000121,000122 \}, \end{aligned}$$

which is not maximal as $C_1 \cup \{001101\}$ is a non-overlapping code. Now take $L_5 = \{00101,00102\}$ and $R_5 = \{00011,00012,00021,00022\}$. Code

$$\begin{aligned} C_2 = \{&000011,000012,000021,000022,001002,000201,000202,001011,001012, \\&001021,001022\} \end{aligned}$$

has partitions of the same sizes as $C_1$ but is maximal due to Theorem 8.
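Example 4 can be replayed mechanically. The following is a minimal sketch, assuming the sets $L_1,\dots ,L_4$ and $R_1,\dots ,R_4$ exactly as listed in the example; the helper names (`concat`, `words_of_length`, `is_non_overlapping`, `is_maximal`, `build`) are hypothetical and only serve the illustration. Maximality is tested by brute force over all $3^6$ words rather than through Theorem 8.

```python
from itertools import product

def concat(A, B):
    """All concatenations ab with a in A and b in B."""
    return {a + b for a in A for b in B}

def words_of_length(L, R, t):
    """Union of (L_j R_{t-j}) over 1 <= j < t; L and R are dicts keyed by word length."""
    out = set()
    for j in range(1, t):
        out |= concat(L[j], R[t - j])
    return out

def is_non_overlapping(code):
    """True if no proper prefix of a codeword is a suffix of any codeword (itself included)."""
    n = len(next(iter(code)))
    for k in range(1, n):
        prefixes = {w[:k] for w in code}
        suffixes = {w[-k:] for w in code}
        if prefixes & suffixes:
            return False
    return True

def is_maximal(code, alphabet):
    """Brute force: no word outside the code can be added without creating an overlap."""
    n = len(next(iter(code)))
    for letters in product(alphabet, repeat=n):
        w = "".join(letters)
        if w not in code and is_non_overlapping(code | {w}):
            return False
    return True

# Partitions from Example 4 (n = 6, alphabet {0, 1, 2}).
L = {1: {"0"}, 2: set(), 3: {"001"}, 4: {"0002"}}
R = {1: {"1", "2"}, 2: {"01", "02"}, 3: {"002"}, 4: {"0011", "0012"}}

length5 = words_of_length(L, R, 5)  # the six words that (L_5, R_5) must partition

def build(L5):
    """Code obtained when L_5 = L5 and R_5 is the rest of the length-5 words."""
    Lc, Rc = dict(L), dict(R)
    Lc[5] = set(L5)
    Rc[5] = length5 - Lc[5]
    return words_of_length(Lc, Rc, 6)

C1 = build({"00011", "00012"})
C2 = build({"00101", "00102"})

print(len(C1), is_maximal(C1, "012"))
print(len(C2), is_maximal(C2, "012"))
```

Both codes have 11 codewords and identical partition sizes, yet only the second choice of $L_5$ passes the brute-force maximality test, which is the point of the example.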

4 An integer optimization problem

Now that we know that ${\mathcal {M}}_{q,n}$ contains all maximal non-overlapping codes and we have a formula to compute their sizes, we can determine the maximum non-overlapping codes by maximizing the value $\sum _{i=1}^{n-1} |L_i ||R_{n-i} |$ over all constraints given by Construction 6. An integer optimization problem is formulated as follows (see Proposition 10 below). We call this formulation SQN(q, n) throughout the paper.

Proposition 10

(SQN)

$$\begin{aligned} S(q,n) =&\max \sum _{i=1}^{n-1} x_i y_{n-i} \\ \text {subject to }\quad&x_1 + y_1 = q \\&x_i + y_i = \sum _{j=1}^{i-1} x_j y_{i-j} \quad \forall i> 1 \\&x_1 , y_1> 0 \\&x_i, y_i \ge 0 \quad \forall i > 1 \\&x_i, y_i \in {\mathbb {Z}} \quad \forall i \ge 1. \end{aligned}$$

If $(x^*,y^*)$ is an optimal solution to the above optimization problem, then $C = \bigcup _{i=1}^{n-1} \left( L_iR_{n-i}\right) $ is a maximum non-overlapping code if it satisfies

(i)
$\left( L_1,R_1\right) $ is a partition of $\Sigma $ with $|L_1 |= x_1^*$ and
(ii)
for $n> i > 1: \; \left( L_i,R_i\right) $ is a partition of $\bigcup _{j=1}^{i-1} \left( L_jR_{i-j}\right) $ with $|L_i |= x_i^*$.

Proof

The proposition follows directly from the definition of S(q, n), Construction 6, Theorem 3 and Theorem 6 if we denote the size of the set $L_i$ by $x_i$ and the size of the set $R_i$ by $y_i$. $\square $

The evaluation of the objective function for all feasible solutions to SQN requires $O\left( q^{n^2}\right) $ time and $\Theta \left( n\right) $ space. If we also want to store all the optimal solutions, space requirement increases to $\Theta \left( n m\right) $ where m denotes the number of optimal solutions. To determine N(q, n) from the optimal solutions of SQN(q, n), we have to evaluate whether any pair of solutions of SQN(q, n) constructs the same maximum code to prevent double counting. Proposition 11 shows that for most parameter values one can compute the value N(q, n) directly.
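For very small parameters the exhaustive evaluation described above can be written down directly. The sketch below is an illustration only, not the reduced search developed later in this section: it enumerates every feasible $(x, y)$ of SQN(q, n) and keeps the best objective value. The names `feasible_solutions`, `objective` and `S` are hypothetical, and lists are 0-based, so `x[j - 1]` stands for $x_j$.

```python
def feasible_solutions(q, n):
    """Yield every feasible (x, y) of SQN(q, n) as tuples (x_1..x_{n-1}), (y_1..y_{n-1})."""
    def extend(x, y):
        i = len(x) + 1  # 1-based index of the next pair (x_i, y_i)
        if i == n:
            yield tuple(x), tuple(y)
            return
        if i == 1:
            # x_1 + y_1 = q with x_1, y_1 > 0
            choices = [(a, q - a) for a in range(1, q)]
        else:
            # x_i + y_i = sum_{j=1}^{i-1} x_j * y_{i-j}
            total = sum(x[j - 1] * y[i - j - 1] for j in range(1, i))
            choices = [(a, total - a) for a in range(total + 1)]
        for a, b in choices:
            yield from extend(x + [a], y + [b])
    yield from extend([], [])

def objective(x, y):
    """sum_{i=1}^{n-1} x_i * y_{n-i}."""
    n = len(x) + 1
    return sum(x[i - 1] * y[n - i - 1] for i in range(1, n))

def S(q, n):
    """Maximum objective value over all feasible solutions (exhaustive search)."""
    return max(objective(x, y) for x, y in feasible_solutions(q, n))

print([S(q, 4) for q in range(2, 6)])
```

The number of feasible solutions grows very quickly with q and n, so this is only practical for the smallest instances.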

Proposition 11

$$\begin{aligned}N(q, n) = \sum _{\begin{array}{c} x,y \text { optimal}\\ \text {solution of SQN(q,n)} \end{array}} \left( {\begin{array}{c}q\\ x_1\end{array}}\right) \cdot \prod _{i=2}^{n-1}\left( {\begin{array}{c}\sum _{j=1}^{i-1} x_j y_{i-j} \\ x_i\end{array}}\right) \end{aligned}$$

if at least one of the following holds:

(i)
$q > 2$,
(ii)
n is odd,
(iii)
no pair of solutions $(x,y), ({\hat{x}}, {\hat{y}})$ of SQN(q,n) satisfies $x_i = {\hat{x}}_i$ for all $i < \frac{n}{2}$, $x_\frac{n}{2} = {\hat{y}}_\frac{n}{2} = 0$, ${\hat{x}}_\frac{n}{2} = y_\frac{n}{2} = 1$.

Proof

Let (x, y) be an optimal solution of SQN. There are $\left( {\begin{array}{c}q\\ x_1\end{array}}\right) $ choices for a partition of $\Sigma $ into $\left( L_1,R_1\right) $ such that $|L_1|= x_1$ and $|R_1|= y_1$. For fixed partitions $\left( L_1,R_1\right) , \dots , \left( L_{i-1}, R_{i-1}\right) $ there are $\left( {\begin{array}{c}\sum _{j=1}^{i-1}|L_j||R_{i-j}|\\ x_i\end{array}}\right) = \left( {\begin{array}{c}\sum _{j=1}^{i-1} x_j y_{i-j} \\ x_i\end{array}}\right) $ choices for a partition of $\bigcup _{j=1}^{i-1} \left( L_j R_{i-j}\right) $ with $|L_i |= x_i$ and $|R_i |= y_i$. No code is double counted as the non-overlapping codes corresponding to solutions of SQN(q, n) are distinct due to Proposition 9 and Corollary 3. $\square $
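The counting formula of Proposition 11 can be evaluated by the same kind of exhaustive enumeration. The following sketch assumes that one of the conditions (i) to (iii) holds, since otherwise the formula may double count; it does not check condition (iii) itself. The names `solutions` and `count_maximum_codes` are hypothetical, and lists are 0-based.

```python
from math import comb

def solutions(q, n):
    """Yield all feasible (x, y, totals) of SQN(q, n); totals[i] = x[i] + y[i]."""
    stack = [([a], [q - a], [q]) for a in range(1, q)]
    while stack:
        x, y, t = stack.pop()
        i = len(x)  # 0-based position of the next entry, i.e. of x_{i+1}
        if i == n - 1:
            yield x, y, t
            continue
        # x_{i+1} + y_{i+1} = sum_{j=1}^{i} x_j * y_{i+1-j}
        total = sum(x[j] * y[i - 1 - j] for j in range(i))
        for a in range(total + 1):
            stack.append((x + [a], y + [total - a], t + [total]))

def count_maximum_codes(q, n):
    """Return (S(q, n), N(q, n)), with N computed by the product of binomials.

    Only valid when q > 2, or n is odd, or condition (iii) of Proposition 11 holds.
    """
    sols = list(solutions(q, n))

    def value(x, y):
        return sum(x[i] * y[n - 2 - i] for i in range(n - 1))

    best = max(value(x, y) for x, y, _ in sols)
    count = 0
    for x, y, t in sols:
        if value(x, y) == best:
            ways = 1
            for xi, ti in zip(x, t):
                ways *= comb(ti, xi)  # choose which of the ti words go to L_i
            count += ways
    return best, count

print(count_maximum_codes(3, 4))
print(count_maximum_codes(4, 4))
```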

4.1 Reduction of SQN

The set of feasible solutions of SQN can be reduced. We provide some equivalence transformations on (x, y) that preserve the value of the objective function. These enable us to evaluate only one feasible solution for each equivalence set. Moreover, we will show that every maximum $C\in {\mathcal {M}}_{q,n}$ either satisfies the property

$$\begin{aligned} |L_i||R_i|= 0 \qquad \forall i > \frac{n}{2}, \end{aligned}$$

(1)

or it can be mapped to a non-overlapping code that satisfies property (1) using a transformation we will define later.

Proposition 12

(x, y) is an optimal solution of SQN if and only if (y, x) is optimal for SQN.

Proof

Feasible solutions (x, y) and (y, x) have the same value of the objective function, since $\sum _{i=1}^{n-1}x_iy_{n-i} = \sum _{i=1}^{n-1}y_ix_{n-i}$. $\square $

Proposition 13

Let $((x_1,\dots , x_{n-1}), (y_1,\dots , y_{n-1}))$ be a feasible solution of SQN. If there exists some $i > 1$ such that $x_k = y_k$ for all $k \le i$, then $((x_1,\dots , x_{n-1}), (y_1,\dots , y_{n-1}))$ is an optimal solution of SQN if and only if $((x_1,\dots ,x_i,y_{i+1},\dots , y_{n-1}), (y_1,\dots ,y_i,x_{i+1},\dots x_{n-1}))$ is an optimal solution.

Proof

Clearly $((x_1,\dots ,x_i,y_{i+1},\dots , y_{n-1}), (y_1,\dots ,y_i,x_{i+1},\dots , x_{n-1}))$ is feasible. Let o denote the objective function of $((x_1,\dots , x_{n-1}), (y_1,\dots , y_{n-1}))$ and ${\hat{o}}$ the objective function of $((x_1,\dots ,x_i,y_{i+1},\dots , y_{n-1}), (y_1,\dots ,y_i,x_{i+1},\dots , x_{n-1}))$.

$$\begin{aligned} {\hat{o}} =&\sum _{\begin{array}{c} 1 \le j \le i \\ n - j> i \end{array}}x_jx_{n-j} + \sum _{\begin{array}{c} 1 \le j \le i \\ n - j \le i \end{array}} x_j y_{n-j} + \sum _{\begin{array}{c} i< j< n \\ n - j> i \end{array}} y_j x_{n-j} + \sum _{\begin{array}{c} i< j< n \\ n - j \le i \end{array}} y_jy_{n-j} \\ =&\sum _{\begin{array}{c} 1 \le j \le i \\ n - j> i \end{array}}y_jx_{n-j} + \sum _{\begin{array}{c} 1 \le j \le i \\ n - j \le i \end{array}} y_j x_{n-j} + \sum _{\begin{array}{c} i< j< n \\ n - j > i \end{array}} y_j x_{n-j} + \sum _{\begin{array}{c} i< j< n \\ n - j \le i \end{array}} y_jx_{n-j} \\ =&\sum _{1 \le j \le i} y_jx_{n-j} + \sum _{i< j < n} y_jx_{n-j} = o. \end{aligned}$$

$\square $
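A small numerical instance may help to see the transformation of Proposition 13 at work. The sketch below uses a hand-picked feasible solution for $q = 4$, $n = 6$ with $x_k = y_k$ for $k \le 2$ (the concrete numbers are an assumption chosen for illustration, not taken from the paper) and exchanges the tails of x and y after index $i = 2$. The helper names are hypothetical and lists are 0-based.

```python
def is_feasible(x, y, q):
    """Check the SQN constraints for lists x, y of length n-1."""
    if x[0] + y[0] != q or x[0] <= 0 or y[0] <= 0:
        return False
    for i in range(1, len(x)):
        total = sum(x[j] * y[i - 1 - j] for j in range(i))
        if x[i] < 0 or y[i] < 0 or x[i] + y[i] != total:
            return False
    return True

def objective(x, y):
    """sum_{j=1}^{n-1} x_j * y_{n-j} for lists of length n-1."""
    m = len(x)
    return sum(x[j] * y[m - 1 - j] for j in range(m))

def swap_tail(x, y, i):
    """Exchange x_k and y_k for all k > i (1-based i), as in Proposition 13."""
    return x[:i] + y[i:], y[:i] + x[i:]

q = 4
x = [2, 2, 3, 7, 10]
y = [2, 2, 5, 13, 46]
xs, ys = swap_tail(x, y, 2)

print(is_feasible(x, y, q), objective(x, y))
print(is_feasible(xs, ys, q), objective(xs, ys))
```

Both solutions are feasible and give the same objective value, so only one representative of each such pair needs to be evaluated.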

Before providing a formula for expressing the sizes of the sets $ L_{i+1}, R_{i+1}, \dots , L_{n-1}, R_{n-1}$ in terms of $|L_1|, |R_1|, \dots , |L_i|, |R_i|$ when property (1) holds, let us first introduce a few more definitions.

Definition 9

Let i be an integer such that $n> i > \frac{n}{2}$ and $|L_{k}||R_{k}|= 0$ for all $k > i$.

(i)
Coefficients $\delta _{R,k}$ and $\delta _{L,k}$ for $k > i$ express which of the sets $R_k$ and $L_k$ is non-empty:
$$\begin{aligned} \delta _{R,k}&:= {\left\{ \begin{array}{ll} 1 &{} \text {if } L_k=\emptyset \\ 0 &{} \text {if } R_k=\emptyset , \end{array}\right. } \\ \delta _{L,k}&:= {\left\{ \begin{array}{ll} 1 &{} \text {if } R_k=\emptyset \\ 0 &{} \text {if } L_k=\emptyset . \end{array}\right. } \end{aligned}$$
(ii)
The coefficients $p_{jk}$ satisfy the following relations:
$$\begin{aligned} p_{jj}&:= 1, \\ p_{jk}&:= \sum _{m=k}^{j-1} \left( \delta _{L,i+m} |R_{j-m}|+ \delta _{R,i+m}|L_{j-m}|\right) p_{mk}. \end{aligned}$$
(iii)
Condition $c_i$ is defined as follows. We will use its value to determine which part of partition $(L_i, R_i)$ is empty.
$$\begin{aligned} c_i := \sum _{j=1}^{n-1-i} \left( \delta _{L,i+j}|R_{n-i-j}|+ \delta _{R,i+j}|L_{n-i-j}|\right) \left( \sum _{l=1}^j p_{jl}(|R_l|- |L_l|)\right) +|R_{n-i}|\\ - |L_{n-i}|. \end{aligned}$$

Proposition 14

Let i be an integer such that $n> i > \frac{n}{2}$ and let $|L_{i+j}||R_{i+j}|= 0$ for all $1 \le j < n - i$. Then

$$\begin{aligned} |L_{i+j}|= & {} \delta _{L, i+j} \sum _{l=1}^j\sum _{k=l}^i p_{jl} |L_k||R_{i+l-k}|\\ |R_{i+j}|= & {} \delta _{R, i+j} \sum _{l=1}^j\sum _{k=l}^i p_{jl} |L_k||R_{i+l-k}|. \end{aligned}$$

Proof

We will prove the proposition by induction on j. Let $j=1$. Because $|L_{i+1} ||R_{i+1}|= 0,$

$$\begin{aligned} |L_{i+1}|&= \delta _{L,i+1} \sum _{k=1}^{i} |L_{k}||R_{i+1-k}|\\&=\delta _{L,i+1} \sum _{k=1}^{i} p_{11} |L_{k}||R_{i+1-k}|\\&= \delta _{L,i+1} \sum _{l=1}^1 \sum _{k=1}^{i} p_{1l} |L_{k}||R_{i+1-k}|, \end{aligned}$$

and following the same procedure $|R_{i+1}|= \delta _{R,i+1} \sum _{l=1}^1 \sum _{k=1}^{i} p_{1l} |L_{k}||R_{i+1-k}|$. Now let us suppose that the statement holds for all positive integers smaller than j. The set $\bigcup _{k=1}^{i+j-1} \left( L_k R_{i+j-k}\right) $ can be expressed as a union of those words that end in $R_{i+k}$, words that start in $L_{i+k}$, and words that start and end in substrings of length at most i. Hence

$$\begin{aligned} \bigcup _{k=1}^{i+j-1} \left( L_k R_{i+j-k}\right) =&\bigcup _{k=i+1}^{i+j-1} \left( L_{i+j-k} R_k\right) \cup \bigcup _{k=i+1}^{i+j-1} \left( L_k R_{i+j-k}\right) \\&\cup \bigcup _{k=j}^i \left( L_{k}R_{i+j-k}\right) . \end{aligned}$$

Since $i > \frac{n}{2}$, these three sets never overlap, so the size of their union is equal to the sum of their sizes. The size of $\bigcup _{k=j}^i \left( L_{k}R_{i+j-k}\right) $ can be expressed using the coefficients $p_{jj}$ as follows.

$$\begin{aligned} |\bigcup _{k=j}^i \left( L_{k}R_{i+j-k}\right) |= \sum _{k=j}^i |L_{k}||R_{i+j-k}|= \sum _{k=j}^i p_{jj} |L_{k}||R_{i+j-k}|. \end{aligned}$$

To compute the size of the sets $\bigcup _{k=i+1}^{i+j-1} \left( L_{i+j-k} R_k\right) \cup \bigcup _{k=i+1}^{i+j-1} \left( L_k R_{i+j-k}\right) $, we apply the induction hypothesis, then change the order of summation, introduce ${\bar{k}}:= k - i$, and recognise the formula for $p_{jl}$.

$$\begin{aligned}&\sum _{k=i+1}^{i+j-1} |R_k||L_{i+j-k}|+ \sum _{k=i+1}^{i+j-1} |L_k||R_{i+j-k}|\\&\quad = \sum _{k=i+1}^{i+j-1} \sum _{l=1}^{k-i} \sum _{m=l}^i p_{k-i,l} |L_m||R_{i+l-m}|\left( |L_{i+j-k}|\delta _{R,k} + |R_{i+j-k}|\delta _{L,k}\right) \\&\quad = \sum _{l=1}^{j-1}\sum _{m=l}^{i}\sum _{k=i+l}^{i+j-1}p_{k-i,l} |L_m||R_{i+l-m}|\left( |L_{i+j-k}|\delta _{R,k} + |R_{i+j-k}|\delta _{L,k}\right) \\&\quad = \sum _{l=1}^{j-1}\sum _{m=l}^{i} \left( \sum _{{\bar{k}}=l}^{j-1}p_{{\bar{k}},l} \left( |L_{j-{\bar{k}}}|\delta _{R,i+{\bar{k}}} + |R_{j-{\bar{k}}}|\delta _{L,i+{\bar{k}}}\right) \right) |L_m||R_{i+l-m}|\\&\quad = \sum _{l=1}^{j-1}\sum _{m=l}^{i} p_{jl} |L_m||R_{i+l-m}|. \end{aligned}$$

We now combine the results to obtain

$$\begin{aligned} |L_{i+j}|&= \delta _{L,i+j} \sum _{k=1}^{i+j-1}|L_{k}||R_{i+j-k}|\\&= \delta _{L,i+j} \left( \sum _{k=j}^i p_{jj} |L_k||R_{i+j-k}|+ \sum _{l=1}^{j-1}\sum _{m=l}^i p_{jl} |L_m||R_{i+l-m}|\right) \\&= \delta _{L,i+j} \sum _{l=1}^{j}\sum _{m=l}^i p_{jl} |L_m||R_{i+l-m}|, \end{aligned}$$

and equivalently

$$\begin{aligned} |R_{i+j}|= \delta _{R,i+j} \sum _{l=1}^{j}\sum _{m=l}^i p_{jl} |L_m||R_{i+l-m}|. \end{aligned}$$

$\square $
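The closed form of Proposition 14 can be compared numerically with the defining recurrence $|L_t|+ |R_t|= \sum _{k<t} |L_k||R_{t-k}|$. The sketch below does this for one hand-picked instance with $q = 3$, $n = 7$ and $i = 4$; the sizes and the choice of which sets are empty are assumptions made for illustration. Here `x[k]` and `y[k]` stand for $|L_k|$ and $|R_k|$, `delta_L[k]` for $\delta _{L,k}$, and $\delta _{R,k}$ is taken as `1 - delta_L[k]`. All function names are hypothetical.

```python
def total_by_recurrence(x, y, t):
    """Number of words of length t: sum_{k=1}^{t-1} |L_k| * |R_{t-k}|."""
    return sum(x[k] * y[t - k] for k in range(1, t))

def p(j, k, i, x, y, delta_L):
    """Coefficient p_{jk} from Definition 9 (ii), computed recursively."""
    if j == k:
        return 1
    return sum(
        (delta_L[i + m] * y[j - m] + (1 - delta_L[i + m]) * x[j - m])
        * p(m, k, i, x, y, delta_L)
        for m in range(k, j)
    )

def total_by_closed_form(j, i, x, y, delta_L):
    """Double sum of Proposition 14 without the leading delta factor."""
    return sum(
        p(j, l, i, x, y, delta_L) * x[k] * y[i + l - k]
        for l in range(1, j + 1)
        for k in range(l, i + 1)
    )

n, i = 7, 4
x = {1: 1, 2: 1, 3: 1, 4: 2}   # |L_1|, ..., |L_4|
y = {1: 2, 2: 1, 3: 2, 4: 3}   # |R_1|, ..., |R_4|
delta_L = {5: 1, 6: 0}         # R_5 is empty, L_6 is empty

results = []
for j in range(1, n - i):
    closed = total_by_closed_form(j, i, x, y, delta_L)
    direct = total_by_recurrence(x, y, i + j)
    results.append((closed, direct))
    # All words of length i+j go to exactly one side of the partition.
    x[i + j] = closed * delta_L[i + j]
    y[i + j] = closed * (1 - delta_L[i + j])

print(results)
```

In this instance the two computations agree for both $j = 1$ and $j = 2$.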

Theorem 15

There exists a maximum non-overlapping code $C = \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) $ satisfying the property $\forall i > \frac{n}{2}: \; \vert L_i||R_i|= 0 $.

Moreover, when $i > \frac{n}{2}$ the condition $c_i$ from Definition 9 satisfies:

(i)
$c_i \ge 0$ if $R_i = \emptyset $ and
(ii)
$c_i \le 0$ if $L_i = \emptyset $.

Proof

Let $C = \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) $ be a maximum non-overlapping code satisfying the property that there exists an integer j, $\frac{n}{2} < j \le n - 1$, such that $|L_{j}||R_{j}|\ne 0$. Let us denote the largest such integer by m. Now rearrange the formula $|C|= \sum _{j=1}^{n-1} |L_j ||R_{n-j} |$ in order to apply Proposition 14 to it.

$$\begin{aligned} |C|=&\sum _{j=m+1}^{n-1} \left( |L_j||R_{n-j}|+ |R_j||L_{n-j}|\right) + \sum _{j=n-m}^m |L_j||R_{n-j}|\\ =&\sum _{j=1}^{n-1-m} \left( |L_{m+j}||R_{n-m-j}|+ |R_{m+j}||L_{n-m-j}|\right) + \sum _{j=n-m}^m |L_j||R_{n-j}|\\ =&\sum _{j=1}^{n-1-m} \left( \delta _{L,m+j}|R_{n-m-j}|+ \delta _{R,m+j}|L_{n-m-j}|\right) \left( \sum _{l=1}^j \sum _{k=l}^m p_{jl} |L_k||R_{m+l-k}|\right) \\&+ \sum _{j=n-m}^m |L_j||R_{n-j}|. \end{aligned}$$

Define ${\bar{C}}:= \bigcup _{i=1}^{n-1} \left( {\bar{L}}_i {\bar{R}}_{n-i} \right) $ such that

$$\begin{aligned} \forall i < m: {\bar{L}}_i:= L_i \text { and } {\bar{R}}_i:= R_i,\\{\bar{L}}_{m}:= {\left\{ \begin{array}{ll} \emptyset &{} \text {if } c_m \le 0 \\ \bigcup _{i=1}^{m-1} \left( {\bar{L}}_i {\bar{R}}_{m-i}\right) &{} \text {if } c_m> 0, \\ \end{array}\right. } \\{\bar{R}}_{m}:= {\left\{ \begin{array}{ll} \bigcup _{i=1}^{m-1} \left( {\bar{L}}_i {\bar{R}}_{m-i}\right) &{} \text {if } c_m \le 0 \\ \emptyset &{} \text {if } c_m > 0,\\ \end{array}\right. } \end{aligned}$$

and $\forall j \in \{1, \dots , n-1-m\}$:

$$\begin{aligned}{\bar{L}}_{m+j} = {\left\{ \begin{array}{ll} \emptyset &{} \text {if } L_{m+j} = \emptyset \\ \bigcup _{i=1}^{m+j-1} ({\bar{L}}_i{\bar{R}}_{m+j-i}) &{} \text {if } R_{m+j} = \emptyset , \end{array}\right. }\\{\bar{R}}_{m+j} = {\left\{ \begin{array}{ll} \bigcup _{i=1}^{m+j-1} ({\bar{L}}_i{\bar{R}}_{m+j-i}) &{} \text {if } L_{m+j} = \emptyset \\ \emptyset &{} \text {if } R_{m+j} = \emptyset . \end{array}\right. }\end{aligned}$$

${\bar{C}}$ is non-overlapping due to Theorem 3. Now we observe the size of the code ${\bar{C}}$. From Proposition 14 we get

$$\begin{aligned} |{\bar{L}}_{m+j}|:= \delta _{L, m+j} \sum _{l=1}^j\sum _{k=l}^m p_{jl} |{\bar{L}}_k||{\bar{R}}_{m+l-k}|, \\|{\bar{R}}_{m+j}|:= \delta _{R, m+j} \sum _{l=1}^j\sum _{k=l}^m p_{jl} |{\bar{L}}_k||{\bar{R}}_{m+l-k}|, \end{aligned}$$

and from Theorem 6

$$\begin{aligned} |{\bar{C}}|=&\sum _{j=1}^{n-1} |{\bar{L}}_j||{\bar{R}}_{n-j}|=\sum _{j=m+1}^{n-1} \left( |{\bar{L}}_j||{\bar{R}}_{n-j}|+ |{\bar{R}}_j||{\bar{L}}_{n-j}|\right) + \sum _{j=n-m}^m |{\bar{L}}_j||{\bar{R}}_{n-j}|\\ =&\sum _{j=1}^{n-1-m} \left( |{\bar{L}}_{m+j}||{\bar{R}}_{n-m-j}|+ |{\bar{R}}_{m+j}||{\bar{L}}_{n-m-j}|\right) + \sum _{j=n-m}^m |{\bar{L}}_j||{\bar{R}}_{n-j}|\\ =&\sum _{j=1}^{n-1-m} \left( \delta _{L,m+j}|{\bar{R}}_{n-m-j}|+ \delta _{R,m+j}|{\bar{L}}_{n-m-j}|\right) \left( \sum _{l=1}^j \sum _{k=l}^m p_{jl} |{\bar{L}}_k||{\bar{R}}_{m+l-k}|\right) \\&+ \sum _{j=n-m}^m |{\bar{L}}_j||{\bar{R}}_{n-j}|. \end{aligned}$$

We notice that in the above sum only ${\bar{L}}_i$ and ${\bar{R}}_i$ with $1 \le i \le m$ occur. We know that unless $i=m$, it holds that ${\bar{L}}_i = L_i$ and ${\bar{R}}_i = R_i$. Therefore

$$\begin{aligned} |{\bar{C}}|=&\sum _{j=1}^{n-1-m} \delta _{L,m+j}|R_{n-m-j}|\left( \sum _{l=1}^j p_{jl} \left( |{\bar{L}}_m|- |L_m|\right) |R_{l}|+ \sum _{l=1}^j p_{jl} |L_{l}|\left( |{\bar{R}}_m|- |R_m|\right) \right) \\&\quad + \sum _{j=1}^{n-1-m} \delta _{R,m+j}|L_{n-m-j}|\left( \sum _{l=1}^j p_{jl} \left( |{\bar{L}}_m|- |L_m|\right) |R_{l}|+ \sum _{l=1}^j p_{jl} |L_{l}|\left( |{\bar{R}}_m|- |R_m|\right) \right) \\&\quad + \left( |{\bar{L}}_m|- |L_m|\right) |R_{n-m}|+ |L_{n-m}|\left( |{\bar{R}}_m|- |R_m|\right) + |C|. \end{aligned}$$

We know that

$$\begin{aligned} |R_m |= \sum _{i=1}^{m-1} |L_i||R_{m-i}|- |L_m|,\\|{\bar{R}}_m |= \sum _{i=1}^{m-1} |L_i||R_{m-i}|- |{\bar{L}}_m|, \end{aligned}$$

and therefore

$$\begin{aligned} |{\bar{C}}|- |C|= \left( |{\bar{L}}_m|- |L_m|\right) c_m. \end{aligned}$$

Recall that we set

$$\begin{aligned} |{\bar{L}}_m |= 0< |L_m |\quad \text { if } \; c_m < 0, \end{aligned}$$

and

$$\begin{aligned} |{\bar{L}}_m |= \sum _{i=1}^{m-1} |L_i ||R_{m-i} |> |L_m |\quad \text {if } \; c_m > 0, \end{aligned}$$

so $|{\bar{C}}|> |C|,$ as soon as $c_m \ne 0$ which contradicts the fact that C is maximum. In the other case, when $c_m = 0$, we get $|{\bar{C}}|= |C|$. Notice that in this case both $|{\bar{C}}|$ and $|C|$ are expressed as the same function in $|L_1|,|R_1|, \dots , |L_{m-1}|, |R_{m-1}|$, and indeed $|{\bar{L}}_j||{\bar{R}}_j|= 0$ for all $\frac{n}{2}<j\le m-1$ if and only if $|L_j||R_j|= 0$ for all $\frac{n}{2}<j\le m-1$. The theorem therefore holds. $\square $

The proof revealed that it is easy to determine all maximum non-overlapping codes if we know all solutions that satisfy Property (1). The algorithm to generate them is summarized in the following corollary.

Corollary 4

Let $C = \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) $ be a maximum non-overlapping code with $c_m = 0$ for some $m > \frac{n}{2}$. Define ${\hat{L}}_i:= L_i$ for $i < m$, $({\hat{L}}_m, {\hat{R}}_m)$ a partition of $\bigcup _{i <m} (L_i R_{m-i})$, and for $i > m$

$$\begin{aligned} {\hat{L}}_i&:= {\left\{ \begin{array}{ll} \bigcup _{j<i } ({\hat{L}}_j {\hat{R}}_{i-j}) &{} \text {if } c_i \ge 0 \\ \emptyset &{} \text {if } c_i< 0, \end{array}\right. } \\ {\hat{R}}_i&:= {\left\{ \begin{array}{ll} \emptyset &{} \text {if } c_i \ge 0 \\ \bigcup _{j<i } ({\hat{L}}_j {\hat{R}}_{i-j}) &{} \text {if } c_i < 0. \end{array}\right. } \\ \end{aligned}$$

Then ${\hat{C}} = \bigcup _{i=1}^{n-1} \left( {\hat{L}}_i {\hat{R}}_{n-i} \right) $ is a maximum non-overlapping code.

5 Exact formulas for four-letter codes

Theorem 16

Let $q \ge 3$. The number of distinct maximal four-letter q-ary non-overlapping codes equals

$$\begin{aligned} \sum _{m=1}^{q-1} \left( {\begin{array}{c}q\\ m\end{array}}\right) \left( 2(2^{q-m} - 1)^{m(q-m)} + (2^m + 2^{q-m})^{m(q-m)} - 2^{m^2(q-m)} - 2^{m(q-m)^2} \right) . \end{aligned}$$

Proof

By Corollary 3 it is sufficient to count the number of distinct partitions $(L_1, R_1), (L_2, R_2), (L_3, R_3)$ that satisfy the requirements of Theorem 8. Sets $L_1$ and $R_1$ cannot be empty by definition. If $L_2$ is empty, then for every $y\in R_2$ there exists $x \in L_1$ such that $xy \in R_3$. $R_3$ is therefore non-empty, but $L_3$ can be empty as every $y \in R_1$ occurs as a suffix in $R_2 = (L_1R_1)$. Therefore there are

$$\begin{aligned} \sum _{m=1}^{q-1} \left( {\begin{array}{c}q\\ m\end{array}}\right) \prod _{j=1}^{m(q-m)} \left( \sum _{i=1}^{q-m} \left( {\begin{array}{c}q-m\\ i\end{array}}\right) \right)&= \sum _{m=1}^{q-1} \left( {\begin{array}{c}q\\ m\end{array}}\right) \prod _{j=1}^{m(q-m)} \left( 2^{q-m} - 1\right) \\&=\sum _{m=1}^{q-1} \left( {\begin{array}{c}q\\ m\end{array}}\right) \left( 2^{q-m} - 1\right) ^{m(q-m)} \end{aligned}$$

maximal non-overlapping codes with empty $L_2$. There are also as many maximal codes when $R_2$ is empty due to a symmetric observation.

Suppose $L_2$ and $R_2$ are both non-empty. If $L_3$ and $R_3$ are non-empty, then the code is clearly maximal. If $L_3$ is empty, then every $x \in R_1$ is a suffix in $(L_2R_1) \subseteq R_3$, so the code is maximal. If $R_3$ is empty, then every $x \in L_1$ is a prefix in $(L_2R_1) \subseteq L_3$, so the code is maximal. Therefore there are

$$\begin{aligned} \sum _{m=1}^{q-1} \left( {\begin{array}{c}q\\ m\end{array}}\right)&\sum _{i=1}^{m(q-m)-1} \left( {\begin{array}{c}m(q-m)\\ i\end{array}}\right) \sum _{j=0}^{m^2(q-m) + i(q-2m)} \left( {\begin{array}{c}m^2(q-m) + i(q-2m)\\ j\end{array}}\right) \\&= \sum _{m=1}^{q-1} \left( {\begin{array}{c}q\\ m\end{array}}\right) \sum _{i=1}^{m(q-m)-1} \left( {\begin{array}{c}m(q-m)\\ i\end{array}}\right) 2^{m^2(q-m) + i(q-2m)} \\&= \sum _{m=1}^{q-1} 2^{m^2(q-m)}\left( {\begin{array}{c}q\\ m\end{array}}\right) \sum _{i=1}^{m(q-m)-1} \left( {\begin{array}{c}m(q-m)\\ i\end{array}}\right) 2^{(q-2m)i} \\&= \sum _{m=1}^{q-1} 2^{m^2(q-m)}\left( {\begin{array}{c}q\\ m\end{array}}\right) \left( (1 + 2^{q-2m})^{m(q-m)} - 1 - 2^{m(q-m)(q-2m)}\right) \end{aligned}$$

maximal non-overlapping codes with $|L_2||R_2|\ne 0$. $\square $
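As a consistency check of the bookkeeping in this proof, one can evaluate the closed formula of Theorem 16 and compare it with the sum of the three cases counted above (empty $L_2$, empty $R_2$, both non-empty). The sketch below only verifies that the pieces add up to the stated formula; it does not enumerate codes. The inner index i is taken to run over the possible sizes $1, \dots , m(q-m)-1$ of $L_2$, and the function names are hypothetical.

```python
from math import comb

def closed_formula(q):
    """Right-hand side of Theorem 16."""
    total = 0
    for m in range(1, q):
        N = m * (q - m)  # number of two-letter words in (L_1 R_1)
        total += comb(q, m) * (
            2 * (2 ** (q - m) - 1) ** N
            + (2 ** m + 2 ** (q - m)) ** N
            - 2 ** (m * m * (q - m))
            - 2 ** (m * (q - m) ** 2)
        )
    return total

def sum_of_cases(q):
    """Twice the count for an empty L_2 (by symmetry) plus the count for L_2, R_2 both non-empty."""
    one_empty = sum(
        comb(q, m) * (2 ** (q - m) - 1) ** (m * (q - m)) for m in range(1, q)
    )
    both_nonempty = 0
    for m in range(1, q):
        N = m * (q - m)
        for i in range(1, N):  # i = |L_2|, strictly between 0 and N
            # the exponent equals m*(N - i) + i*(q - m), hence it is non-negative
            both_nonempty += comb(q, m) * comb(N, i) * 2 ** (m * m * (q - m) + i * (q - 2 * m))
    return 2 * one_empty + both_nonempty

print(closed_formula(3), sum_of_cases(3))
```

For $q = 3$ both expressions evaluate to 156.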

Blackburn [5] computed S(q, 2) and S(q, 3) by proving that Construction 4 with $k = 1$ and $k = 2$ respectively gives us codes of maximum sizes. Using the results from the previous section, we further prove that this construction with $k = 3$ always yields the optimum for four-letter codes. Before showing the result, we will provide Lemma 17 and Proposition 18 regarding the sizes of codes generated by Construction 4.

Lemma 17

Let $q, n, m \le \frac{n}{2}$ and $2 \le j \le n$ be positive integers, and f the following function

$$\begin{aligned} f(j)= (n-m-1) \left( \frac{(n-1)q - m}{n}\right) ^{n-j+1}\prod _{i=2}^{j-2}\frac{n-i}{i} \\+ \frac{n-q-m}{n}\sum _{i=0}^{n-j}\left( {\begin{array}{c}n-1\\ i\end{array}}\right) \left( \frac{(n-1)q-m}{n}\right) ^i. \end{aligned}$$

Then $f(j) \ge f(j+1)$.

Proof

The coefficient in front of $\left( \frac{(n-1)q-m}{n}\right) ^{n-j}$ in f(j) is $\left( \frac{(n-m-1)((n-1)q - m)}{n} + \frac{(n-q-m)(n-1)(n-j)}{nj}\right) \prod _{i=2}^{j-2}\frac{n-i}{i}$. Because $m \le \frac{n}{2}$, it holds that $n - m -1 \ge \frac{n - 2}{2} \ge \frac{n - j}{j}$, so the coefficient in front of $\left( \frac{(n-1)q-m}{n}\right) ^{n-j}$ in f(j) is greater or equal to $\left( \frac{(n-1)q - m}{n} + \frac{(n-q-m)(n-1)}{n}\right) \prod _{i=2}^{j-1}\frac{n-i}{i}$. We now compute the sum

$$\begin{aligned} \frac{(n-1)q - m}{n} + \frac{(n-q-m)(n-1)}{n} = n - m - 1, \end{aligned}$$

and see that the statement follows. $\square $

Proposition 18

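Let C be the largest q-ary non-overlapping code of length n obtained by Construction 4 with $k = n - 1$. Then the value of the parameter l equals:

$$\begin{aligned} l = {\left\{ \begin{array}{ll} \left\lfloor {\frac{(n-1)q}{n}}\right\rfloor &{} \text {if } (n-1)q \bmod n \le \frac{n}{2} \\ \left\lceil {\frac{(n-1)q}{n}}\right\rceil &{} \text {if } (n-1)q \bmod n = n-1. \end{array}\right. } \end{aligned}$$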

Remark 2

Note that we skip the case $\frac{n}{2}< (n - 1)q \bmod n < n - 1 $, as we do not need it throughout our paper. Although one would wish that in that case $l = \left\lceil {\frac{(n-1)q}{n}}\right\rceil $, this does not always hold. If we take for example $n=11$ and $q=16$, then the code with parameter $l=15$ contains 576 650 390 625 codewords, but the code with parameter $l=14$ contains 578 509 309 952 codewords.
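The numbers in this remark can be recomputed from the size formula $(q-l)\,l^{n-1}$ that the proof below uses for codes obtained by Construction 4 with $k = n-1$. The following is a minimal sketch; `construction_size` and `best_parameter` are hypothetical helper names.

```python
def construction_size(q, n, l):
    """Size (q - l) * l^(n-1) of the code with parameter l."""
    return (q - l) * l ** (n - 1)

def best_parameter(q, n):
    """Smallest l in {1, ..., q-1} that maximizes the size."""
    return max(range(1, q), key=lambda l: construction_size(q, n, l))

q, n = 16, 11
print(((n - 1) * q) % n)            # residue of (n-1)q modulo n
print(construction_size(q, n, 14))  # l = floor((n-1)q/n)
print(construction_size(q, n, 15))  # l = ceil((n-1)q/n)
print(best_parameter(q, n))
```

Here $(n-1)q \bmod n = 6$ lies strictly between $\frac{n}{2}$ and $n-1$, and the floor value $l = 14$ gives the larger code.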

Proof

The largest q-ary non-overlapping code of length n obtained by Construction 4 with $k = n - 1$ has size $\max _{l \in \{1,\dots , q-1\}} f(l),$ where $f(l) = (q - l) l^{n-1}$. First we observe the first derivative of f, $f'(l) = l^{n-2}((n-1)q - nl)$ and notice that f(l) is strictly increasing on the interval $\left[ 1, \frac{(n-1)q}{n}\right) $ and strictly decreasing on the interval $\left( \frac{(n-1)q}{n}, q-1\right] $. If n divides q, the maximum $\max _{l \in \{1,\dots , q-1\}} f(l)$ is achieved at $l=\frac{(n-1)q}{n}$.

Now, let us suppose q is not a multiple of n. The maximum $\max _{l \in \{1,\dots , q-1\}} f(l)$ is achieved either at $l = \left\lfloor {\frac{(n-1)q}{n}}\right\rfloor $ or at $l = \left\lceil {\frac{(n-1)q}{n}}\right\rceil $. We denote the values of f at these two points by $F = f\left( \left\lfloor {\frac{(n-1)q}{n}}\right\rfloor \right) $ and $C = f\left( \left\lceil {\frac{(n-1)q}{n}}\right\rceil \right) $. Using the binomial theorem we obtain

$$\begin{aligned} F - C = \sum _{i=0}^{n-1}\left( {\begin{array}{c}n-1\\ i\end{array}}\right) \left\lfloor {\frac{(n-1)q}{n}}\right\rfloor ^i - \left( q- \left\lfloor {\frac{(n-1)q}{n}}\right\rfloor \right) \sum _{i=0}^{n-2}\left( {\begin{array}{c}n-1\\ i\end{array}}\right) \left\lfloor {\frac{(n-1)q}{n}}\right\rfloor ^i. \end{aligned}$$

Now let us introduce m, such that $(n-1)q \equiv m \pmod {n}$, to rewrite the difference

$$\begin{aligned} F - C =&\left( \frac{(n-1)q - m}{n}\right) ^{n-1} + \frac{n - q- m}{n} \sum _{i=0}^{n-2}\left( {\begin{array}{c}n-1\\ i\end{array}}\right) \left( \frac{(n-1)q - m}{n}\right) ^i \\ =&(n-m-1)\left( \frac{(n-1)q - m}{n}\right) ^{n-2} +\frac{n - q -m}{n} \sum _{i=0}^{n-3} \left( {\begin{array}{c}n-1\\ i\end{array}}\right) \left( \frac{(n-1)q - m}{n}\right) ^i. \end{aligned}$$

First let us explain that the coefficient $\frac{n - q- m}{n}$ is non-positive. Suppose for contradiction that $q < n - m$. Then $\frac{(n-1)q}{n} < n - m - 1 + \frac{m}{n}$. Since $(n-1)q \equiv m \pmod {n}$, it follows $\frac{(n-1)q}{n} \le n - m - 1 + \frac{m}{n} - n = \frac{m}{n} - m - 1 < 0$ but $\frac{(n-1)q}{n} < 0$ cannot hold for $n > 1$ and $q > 1$. We also see that $\frac{n - q- m}{n} = 0$ if and only if $q = n - m$. In this case, $F - C > 0$ and therefore $\max _{i \in \{1,\dots , q-1\}} f(i) = F$. We immediately see that for $m = n-1$, $F < C$ since $q > 1$ always holds. To get a lower bound on $F - C$ when $m \le \frac{n}{2}$ we apply Lemma 17 for $j=3, \dots , n$ consecutively. We obtain $F - C \ge (n - m -1) \prod _{i=2}^{n-1}\frac{n-i}{i} > 0$. $\square $

Now we can compute the sizes of maximum four-letter non-overlapping codes. The proof of Theorem 19 reveals that they can indeed be generated using Construction 4.

Theorem 19

$$\begin{aligned} S(q,4) = \left[ {\frac{3q}{4}}\right] ^3 \left( q - \left[ {\frac{3q}{4}}\right] \right) , \end{aligned}$$

where $\left[ x\right] $ is the rounding of x to the closest natural number that rounds half down.

Proof

Let C be a maximum code that satisfies property (1). Without loss of generality we can assume $|R_1|\ge |L_1|$ (otherwise we can switch L’s and R’s and obtain a code of the same size), meaning that $R_3 = \emptyset $ and $|L_3|= |L_1||R_2|+ |L_2||R_1|$. We also know that $|R_2|= |L_1||R_1|- |L_2|$, so the size of C equals

$$\begin{aligned} |C|=&|L_1||R_3|+ |L_2||R_2|+ |L_3||R_1|\\ =&|L_2||R_2|+ |L_1||R_2||R_1|+ |L_2||R_1|^2 \\ =&- |L_2|^ 2 + |L_2||R_1 |^2 + |L_1|^2 |R_1|^2. \end{aligned}$$

Using the derivative test, we know that for fixed sets $L_1$ and $R_1$ with $|R_1|\le 2|L_1|$ the above function achieves its maximum when $|L_2|= \frac{|R_1|^2}{2}$. A feasible partition $(L_1, R_1)$ with $|R_1|\le 2|L_1|$ exists for every $q \ge 2$ (take $|R_1|= \lceil \frac{q}{2} \rceil $), so $|C |\le \frac{|R_1|^4}{4} + |R_1|^2 |L_1|^2 \le 8 |L_1|^4 \le \frac{8q^4}{3^4}$. If $|R_1|> 2 |L_1|$, the maximum code size is achieved when $|L_2|= |L_1||R_1|$. Then C is constructed using Construction 4 with $k = 3$. A feasible partition $(L_1, R_1)$ with $|R_1|> 2|L_1|$ exists for every $q > 3$ (take $|L_1 |= 1$).

If $q = 2$ then all feasible partitions $(L_1,R_1)$ satisfy $|L_1 |= |R_1 |= 1$. Therefore $|L_2 |\in \{0,1\}$ and $|L_2 |^2 = |L_2|$, so

$$\begin{aligned} S(2,4)&= - |L_2 |^2 + |L_2|+ 1 = 1 \\&= 1^3 \left( 2-1\right) \\&= \left[ \frac{3\cdot 2}{4}\right] ^3\left( 2 - \left[ \frac{3\cdot 2}{4}\right] \right) . \end{aligned}$$

If $q = 3$ then for all feasible partitions $(L_1,R_1)$ such that $|R_1 |\ge |L_1 |$, $|L_1 |= 1$ and $|R_1 |= 2$ holds. Therefore $|R_1 |= 2 |L_1 |$. A feasible partition $L_2 = \left( L_1R_1\right) $ with $|L_2 |= \frac{|R_1|^2}{2}$ exists, so

$$\begin{aligned} S(3,4)&= 8 |L_1 |= 8 \\&= 2^3 \left( 3 - 2 \right) \\&= \left[ \frac{3\cdot 3}{4}\right] ^3\left( 3 - \left[ \frac{3\cdot 3}{4}\right] \right) . \end{aligned}$$

We are left to prove that $\frac{8q^4}{3^4} \le \max _{i \in \{1, \dots , q-1\}} (q - i) i^3$ for $q \ge 4$. We use Proposition 18 to divide the problem into four subproblems.

(i)
If $4 \mid q$, we already know that Blackburn’s construction yields the optimum. Still we can check that $\frac{8q^4}{3^4} < \frac{27q^4}{4^4}$ or equivalently $2^{11} = 2048 < 2187 = 3^7$.
(ii)
If $ 3q \equiv 1 \pmod {4}$, we have to check that $\frac{\left( q + 1\right) \left( 3q-1\right) ^3}{4^4} > \frac{8q^4}{3^4}$ or equivalently $\frac{139}{2048}q^4 + \frac{81}{2048}\left( -18q^2 + 8q - 1\right) > 0$. The two real roots of this polynomial are $q \approx - 3.4482$ and $q = 3$. The strict inequality therefore holds for $q \ge 4$.
(iii)
If $ 3q \equiv 2 \pmod {4}$, we have to check that $\frac{\left( q + 2\right) \left( 3q-2\right) ^3}{4^4} \ge \frac{8q^4}{3^4}$ or equivalently $\frac{139}{2048}q^4 + \frac{81}{2048}\left( -72q^2 + 64q - 16\right) \ge 0$. The two real roots of this polynomial are $q \approx -6.896$ and $q = 6$, so the inequality holds for $q \ge 6$, which covers every $q \ge 4$ with $3q \equiv 2 \pmod {4}$. Equality is attained at $q = 6$, where both sides equal 128.
(iv)
If $ 3q \equiv 3 \pmod {4}$, $\frac{\left( q - 1\right) \left( 3q+1\right) ^3}{4^4} > \frac{8q^4}{3^4}$ or equivalently $\frac{139}{2048}q^4 - \frac{81}{2048}\left( 18q^2 + 8q + 1\right) > 0$. The two real roots of this polynomial are $q = -3$ and $q \approx 3.4482$, so the inequality holds for $q \ge 4$. $\square $
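The formula of Theorem 19 can be sanity-checked numerically. The sketch below is not the authors' code, and its function names are hypothetical. It assumes, as in the proof, that a code with $|R_1|\ge |L_1|$ has size $- |L_2|^2 + |L_2||R_1|^2 + |L_1|^2 |R_1|^2$ with $0 \le |L_2|\le |L_1||R_1|$, maximizes that expression by exhaustive search, and compares the result with the closed form. Rounding half down is implemented through the identity $\left[ \frac{3q}{4}\right] = \lfloor \frac{3q+1}{4} \rfloor $.

```python
def round_half_down_3q4(q: int) -> int:
    """[3q/4]: the integer nearest to 3q/4, with halves rounded down."""
    return (3 * q + 1) // 4


def closed_form_s4(q: int) -> int:
    """Right-hand side of the formula for S(q, 4)."""
    r = round_half_down_3q4(q)
    return r**3 * (q - r)


def exhaustive_s4(q: int) -> int:
    """Maximize the size expression from the proof over all feasible sizes."""
    best = 0
    for l1 in range(1, q):             # l1 = |L_1|
        r1 = q - l1                    # r1 = |R_1|
        if r1 < l1:                    # w.l.o.g. |R_1| >= |L_1|
            continue
        for l2 in range(l1 * r1 + 1):  # l2 = |L_2|, 0 <= l2 <= l1 * r1
            size = -l2 * l2 + l2 * r1 * r1 + l1 * l1 * r1 * r1
            best = max(best, size)
    return best


if __name__ == "__main__":
    for q in range(2, 13):
        print(q, closed_form_s4(q), exhaustive_s4(q))
```

The two columns printed by the script should coincide for every q. The same loop can be used to confirm that $\frac{8q^4}{3^4}$ never exceeds the closed form for $q \ge 4$.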

Theorem 20

$$\begin{aligned} N(q,4) = {\left\{ \begin{array}{ll} 2 \left( {\begin{array}{c}q\\ \left[ {\frac{3q}{4}}\right] \end{array}}\right) &{} q \ge 3 \\ 6 &{} q = 2. \end{array}\right. } \end{aligned}$$

Proof

The proof of Theorem 19 showed that for $q \ge 3$ all maximum non-overlapping codes with $|R_1 |\ge |L_1 |$ are given by partitions $\left( L_1, R_1 \right) $ with $|R_1 |= \left[ \frac{3q}{4}\right] $. There are $\left( {\begin{array}{c}q\\ \left[ {\frac{3q}{4}}\right] \end{array}}\right) $ such partitions. Since $|R_1 |> |L_1 |$, all maximum non-overlapping codes satisfy Property 1. For each partition $(L_1,R_1)$ the partitions $(L_2,R_2), (L_3,R_3)$ are uniquely determined. From Proposition 12 we know that there are also $\left( {\begin{array}{c}q\\ \left[ {\frac{3q}{4}}\right] \end{array}}\right) $ maximum non-overlapping codes with $|R_1 |< |L_1 |$.

If $q = 2$ then every four-letter non-overlapping code is maximum as we know from Theorem 19 that $S(2,4) = 1$. Hence $|L_1 |= |R_1 |= 1$ and $|L_i |+ |R_i |= 1$ for $i \in \{2,3\}$. Although there are 8 choices for the selection of partitions, we obtain only 6 distinct codes $\{0111\}$, $\{0001\}$, $\{0011\}$, $\{1000\}$, $\{1110\}$, and $\{1100\}$. Each of the codes $\{0011\}$ and $\{1100\}$ is given by two collections of partitions. $\square $
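For very small parameters the values of S(q, 4) and N(q, 4) can also be checked directly from the definition of a non-overlapping code (no prefix of a codeword occurs as a suffix of any codeword, including itself). The following brute-force sketch is an illustration only, not the search algorithm of this paper, and the helper names are hypothetical. It enumerates all cliques of the compatibility graph on self-non-overlapping words, so it is only practical for tiny q and n.

```python
from itertools import product


def overlaps(u, v):
    """True if a non-empty proper prefix of u equals a suffix of v."""
    return any(u[:k] == v[-k:] for k in range(1, len(u)))


def maximum_codes(q, n):
    """Return (maximum size, number of maximum codes) by exhaustive search."""
    # A codeword must not overlap with itself.
    words = [w for w in product(range(q), repeat=n) if not overlaps(w, w)]
    # Two distinct words may share a code iff neither overlaps the other.
    compatible = {
        u: {v for v in words
            if v != u and not overlaps(u, v) and not overlaps(v, u)}
        for u in words
    }
    best_size, best_count = 0, 0

    def extend(size, candidates):
        nonlocal best_size, best_count
        if size > best_size:
            best_size, best_count = size, 1
        elif size == best_size:
            best_count += 1
        # Extend only with later candidates so each code is visited once.
        for i, w in enumerate(candidates):
            extend(size + 1,
                   [v for v in candidates[i + 1:] if v in compatible[w]])

    extend(0, words)
    return best_size, best_count


if __name__ == "__main__":
    print(maximum_codes(2, 4))  # (1, 6)
```

For $q = 2$ the search finds the six one-word codes listed in the proof. For $q = 3$ the theorem predicts $2 \left( {\begin{array}{c}3\\ 2\end{array}}\right) = 6$ maximum codes of size 8, which `maximum_codes(3, 4)` can be used to compare against.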

Unfortunately Construction 4 fails to generate a maximum five-letter non-overlapping code, as we will see later.

6 Comparison to other constructions

At the moment we are not able to provide any simple formula to compute the size of a maximum non-overlapping code with a larger codeword length. However, we solved SQN for $5 \le n \le 30$ when $q = 2$, and $5 \le n \le 16$ when $3 \le q \le 6$ using a search algorithm that incorporates the observations from previous sections (see Appendix A). For each pair of parameter values q and n, a single optimal solution of SQN was found up to the transformations of Propositions 12, 13 and Corollary 4. We computed the full set of optimal solutions of SQN by applying those propositions, and N(q, n) from Proposition 11. The results are presented in Tables 1 and 2.

We checked that for $q=2, n \le 16$ the computed values of S(q, n) agree with the values obtained by Chee et al. [6], and that for $n > 16$ the determined values are not smaller than the size of the largest code obtained by Constructions 1 and 2. For $16 < n \le 30$ the gap between S(q, n) and the size of the largest code given by these two constructions is given in Table 2. Note that for this set of parameters Construction 1 always yields larger codes than Construction 2 [6].

Table 1 S(q, n) and N(q, n) for $2 \le q \le 6$ and $3 \le n \le 16$


Table 2 S(2, n), N(2, n) and the gap between S(2, n) and the largest non-overlapping code given by Construction 1 for $17 \le n \le 30$


For $3 \le q \le 6$ and $5 \le n \le 16$, we compared our values with the optimal values of Construction 1, Wang and Wang’s [14] modification of Construction 4, Wang and Wang’s [14] generalization of Construction 2, and Construction 5 that have been included as a supplement to [14]. The gaps between the optimal values of these constructions and the newly determined values of S(q, n) are given in Table 3. Our implementation of the algorithm always returned a value for S(q, n) that is greater than or equal to the size of the largest code given by any of those constructions. Whenever we got a strictly better solution for some q and n, we also checked that none of the maximum codes can be obtained using Construction 4. This can be done by inspecting the set of optimal solutions of SQN and applying Proposition 21.

Proposition 21

Let $C = \bigcup _{i=1}^{n-1} \left( L_i R_{n-i} \right) $ be a maximum non-overlapping code given by Construction 4 for some value of k. If $R_{k+1} = \emptyset $, then $S = I^k = L_1^k$ and C can be obtained by Wang and Wang’s modification.

Proof

Since $S \subseteq I^k$, $L_2=\cdots =L_k = \emptyset $ and $L_{k+1} = \bigcup _{i=1}^{k} \left( L_i R_{k+1-i} \right) = \left( L_1 R_k \right) $. Similarly, $R_k = \left( L_1^{k-1} R_1 \right) $, so $L_{k+1} = \left( L_1^{k} R_1 \right) $ and $S = L_1^k$. $\square $

Table 3 Gap between S(q, n) and the largest non-overlapping codes given by Construction 1 (C1), Wang and Wang’s generalization of Construction 2 (W2), Wang and Wang’s modification of Construction 4 (W4), and Construction 5 (C5)


In spite of the reduction of the optimization problem, finding all solutions to SQN still requires years of CPU time for some parameter values. We therefore analysed the sets of optimal solutions to find some common properties. The results lead to the following conjecture.

Conjecture 1

For every $q \ge 2$ and $n \ge 3$ there exists a maximum non-overlapping code $C = \bigcup _{i < n} \left( L_i R_{n-i} \right) \in {\mathcal {M}}_{q,n}$ that satisfies $R_i = \emptyset $ for every $i > \frac{n}{q+1} + 1$.

Remark 3

If Conjecture 1 holds, then Blackburn’s conjecture holds for $q \ge n$.
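To see what Conjecture 1 asserts in the range $q \ge n$ considered in Remark 3, the threshold can be evaluated directly:

$$\begin{aligned} q \ge n \implies \frac{n}{q+1} < 1 \implies \frac{n}{q+1} + 1 < 2. \end{aligned}$$

The condition $i > \frac{n}{q+1} + 1$ is then satisfied by every $i \ge 2$, so in this range the conjecture predicts a maximum code with $R_2 = \cdots = R_{n-1} = \emptyset $. For example, for $q = n = 4$ the threshold equals $\frac{4}{5} + 1 = 1.8$.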

For the parameter values for which we could not solve the original optimization problem, we reduced the set of feasible solutions according to Conjecture 1 in order to find a large non-overlapping code. As shown in Table 3, for all three parameter values a code was obtained that is much larger than the hitherto known ones. In the binary case with $n=30$, a code with exactly the same size as the one given by Construction 1 was found.

Data availability

The set of optimal solutions of SQN that were determined in this study is available at: https://github.com/magdevska/nono-codes.

References

1. Bajic D., Stojanovic J.: Distributed sequences and search process. In: 2004 IEEE International Conference on Communications (IEEE Cat. No. 04CH37577), vol. 1, pp. 514–518 (2004).
2. Bajic D., Loncar-Turukalo T.: A simple suboptimal construction of cross-bifix-free codes. Cryptogr. Commun. 6(1), 27–37 (2014).
3. Barcucci E., Bilotta S., Pergola E., Pinzani R., Succi J.: Cross-bifix-free sets generation via Motzkin paths. RAIRO-Theoret. Inf. Appl. 50(1), 81–91 (2016).
4. Bilotta S., Pergola E., Pinzani R.: A new approach to cross-bifix-free sets. IEEE Trans. Inf. Theory 58(6), 4058–4063 (2012).
5. Blackburn S.R.: Non-overlapping codes. IEEE Trans. Inf. Theory 61(9), 4890–4894 (2015).
6. Chee Y.M., Kiah H.M., Purkayastha P., Wang C.: Cross-bifix-free codes within a constant factor of optimality. IEEE Trans. Inf. Theory 59(7), 4668–4674 (2013).
7. Fimmel E., Michel C.J., Strüngmann L.: Strong comma-free codes in genetic information. Bull. Math. Biol. 79(8), 1796–1819 (2017).
8. Fimmel E., Michel C.J., Pirot F., Sereni J.-S., Strüngmann L.: Comma-free codes over finite alphabets. Preprint (2019). https://hal.archives-ouvertes.fr/hal-02376793.
9. Fimmel E., Michel C.J., Pirot F., Sereni J.-S., Strüngmann L.: Diletter and triletter comma-free codes over finite alphabets. Austral. J. Comb. 86(2), 233–270 (2023).
10. Levenshtein V.: Decoding automata, invariant with respect to the initial state. Problemy Kibernet 12(1), 125–136 (1964).
11. Levenshtein V.I.: Maximum number of words in codes without overlaps. Problemy Peredachi Informatsii 6(4), 88–90 (1970).
12. Levy M., Yaakobi E.: Mutually uncorrelated codes for DNA storage. IEEE Trans. Inf. Theory 65(6), 3671–3691 (2018).
13. Robson J.M.: Finding a maximum independent set in time $O(2^{n/4})$. Technical Report 1251-01, LaBRI, Université Bordeaux I (2001).
14. Wang G., Wang Q.: Q-ary non-overlapping codes: a generating function approach. IEEE Trans. Inf. Theory 68(8) (2022).
15. Yazdi S.T., Kiah H.M., Gabrys R., Milenkovic O.: Mutually uncorrelated primers for DNA-based data storage. IEEE Trans. Inf. Theory 64(9), 6283–6296 (2018).


Funding

The research was partially supported by the scientific-research program P2-0359 and by the basic research project J1–50024, both financed by the Slovenian Research and Innovation Agency, and by the infrastructure program ELIXIR-SI RI-SI-2 financed by the European Regional Development Fund and by the Ministry of Education, Science and Sport of Republic of Slovenia.

Author information

Authors and Affiliations

Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, Ljubljana, 1000, Slovenia
Lidija Stanovnik, Miha Moškon & Miha Mraz

Corresponding author

Correspondence to Lidija Stanovnik.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Code availability

All code is available for download at: https://github.com/magdevska/nono-codes.

Additional information

Communicated by K.-U. Schmidt.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: The algorithm for SQN

Here we provide pseudocode of an algorithm that searches for a maximum non-overlapping code. Procedure compute_size is due to Theorem 6. We obtain procedures compute_conditions and compute_parameters from Definition 9. Theorem 15 gives rise to procedure determine_upper_half. Procedure branch_at_level combines Construction 6, Propositions 12 and 13, and Theorem 15. The solutions are compared with procedure update_maximum. Procedure solve_sqn initializes the optimal solution and starts the branching algorithm.

In order to obtain the results in a reasonable amount of time, our implementation uses multi-threading. A depth of parallelization d is given as a parameter and should be selected depending on the available number of CPU cores and q. We divide the branching stage into subproblems as follows. If i is smaller than d, then the branching procedure creates a new thread for each feasible value of $x_i$. After the child threads are joined, the parent thread selects the maximum value $s^*$ among its children and the corresponding solution set $C^*$. If i equals d, then the branching stage runs sequentially.
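The depth-limited parallelization described above can be illustrated with a small sketch. It is not the authors' implementation: the names `branch`, `toy_feasible` and `toy_score` are hypothetical, levels are counted from 0, and the toy objective is the four-letter size expression from the proof of Theorem 19 rather than SQN. Note also that CPython threads do not execute Python bytecode in parallel, so the sketch shows only the control flow (spawn one thread per feasible value above depth d, join, keep the best child).

```python
import threading


def branch(prefix, levels, depth, feasible, score):
    """Return the best (score, assignment) among all completions of prefix."""
    i = len(prefix)
    if i == levels:                      # every variable is fixed: a leaf
        return score(prefix), tuple(prefix)
    values = list(feasible(prefix))
    results = [(float("-inf"), None)] * len(values)

    def explore(slot, value):
        results[slot] = branch(prefix + [value], levels, depth,
                               feasible, score)

    if i < depth:                        # parallel part: one thread per value
        threads = [threading.Thread(target=explore, args=(k, v))
                   for k, v in enumerate(values)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    else:                                # sequential part
        for k, v in enumerate(values):
            explore(k, v)
    # The parent keeps the best value and the matching solution.
    return max(results, key=lambda r: r[0], default=(float("-inf"), None))


def toy_feasible(q):
    """Feasible values for x_0 = |R_1| and x_1 = |L_2| (toy problem)."""
    def feasible(prefix):
        if not prefix:
            return [r for r in range(1, q) if r >= q - r]
        r = prefix[0]
        return range((q - r) * r + 1)
    return feasible


def toy_score(q):
    """Size expression -|L_2|^2 + |L_2||R_1|^2 + |L_1|^2|R_1|^2."""
    def score(x):
        r, l2 = x
        l1 = q - r
        return -l2 * l2 + l2 * r * r + l1 * l1 * r * r
    return score


if __name__ == "__main__":
    print(branch([], 2, 1, toy_feasible(6), toy_score(6)))  # (128, (4, 8))
```

The result does not depend on the chosen depth; only the number of threads created does, which is why d should be matched to the number of available cores.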

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Stanovnik, L., Moškon, M. & Mraz, M. In search of maximum non-overlapping codes. Des. Codes Cryptogr. (2024). https://doi.org/10.1007/s10623-023-01344-z


Received: 25 October 2022
Revised: 03 November 2023
Accepted: 22 November 2023
Published: 08 January 2024
DOI: https://doi.org/10.1007/s10623-023-01344-z

