Abstract
Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this article, we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way, we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
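To make the notion of an LZ77-compressed pattern concrete, the following is a minimal sketch of a greedy LZ77-style parse and its inverse. It assumes the common phrase format (source position, copy length, next character) and allows self-referential copies; the exact LZ77 variant used in the article may differ. The function names `lz77_parse` and `lz77_decompress` are illustrative only, and the quadratic-time search is chosen for clarity, not efficiency.

```python
def lz77_parse(s):
    """Greedy LZ77-style parse of s into (source, length, next_char) phrases.

    Assumed phrase format: copy `length` characters starting at position
    `source` of the text produced so far, then append `next_char`.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_src, best_len = 0, 0
        # Try every earlier start position; the copy may overlap position i.
        for src in range(i):
            length = 0
            # Stop one character early so an explicit next_char always remains.
            while i + length < n - 1 and s[src + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_src, best_len = src, length
        phrases.append((best_src, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decompress(phrases):
    """Rebuild the original string from (source, length, next_char) phrases."""
    out = []
    for src, length, ch in phrases:
        # Copy one character at a time so that overlapping copies work.
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    pattern = "abababab"
    z = lz77_parse(pattern)
    print(len(pattern), len(z), z)
    assert lz77_decompress(z) == pattern
```

A highly repetitive pattern yields far fewer phrases than characters, which is why query time bounded by the number of phrases can be much smaller than time bounded by the uncompressed pattern length.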