String Indexing with Compressed Patterns

Authors:
Philip Bille, Technical University of Denmark, DTU Compute, Denmark (ORCID: 0000-0002-1120-5154)
Inge Li Gørtz, Technical University of Denmark, DTU Compute, Denmark (ORCID: 0000-0002-8322-4952)
Teresa Anna Steiner, Technical University of Denmark, DTU Compute, Denmark (ORCID: 0000-0003-1078-4075)

ACM Transactions on Algorithms, Volume 19, Issue 4, Article No. 32, pp. 1–19
https://doi.org/10.1145/3607141

Published: 26 September 2023

Abstract

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this article, we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way, we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
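To make the setting concrete, the following is a minimal sketch of an LZ77-style parse of a pattern, together with the decompress-then-search baseline that the abstract describes as the approach to avoid. It is not the data structure from the article. The phrase format (source position, copy length, next character), the greedy quadratic-time parser, and the function names `lz77_parse`, `lz77_decode`, and `naive_query` are assumptions chosen for illustration; the article may use a different LZ77 variant.

```python
def lz77_parse(p):
    """Greedy LZ77-style parse of p into (src, length, next_char) phrases.

    Illustrative only: a naive quadratic-time parser, assuming the
    triple-based phrase format. Copies may overlap the current position.
    """
    phrases = []
    i = 0
    while i < len(p):
        best_src, best_len = 0, 0
        for src in range(i):  # every earlier starting position
            length = 0
            # Stop one character early so a next_char always exists.
            while i + length < len(p) - 1 and p[src + length] == p[i + length]:
                length += 1
            if length > best_len:
                best_src, best_len = src, length
        phrases.append((best_src, best_len, p[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the pattern from its phrases."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])  # one char at a time, so overlaps work
        out.append(ch)
    return "".join(out)


def naive_query(s, phrases):
    """Baseline: decompress the pattern, then search s for it.

    Runs in time that depends on the uncompressed pattern length,
    which is what a compressed-pattern index tries to avoid.
    """
    return s.find(lz77_decode(phrases))


pattern = "ab" * 16                 # 32 characters
compressed = lz77_parse(pattern)    # only a few phrases for a repetitive pattern
print(len(pattern), len(compressed))
print(naive_query("xx" + pattern + "yy", compressed))
```

For a highly repetitive pattern such as the one above, the number of phrases is far smaller than the pattern length, which is why query time in terms of the compressed size can be a substantial improvement over the baseline.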

Index Terms

String Indexing with Compressed Patterns
1. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Data compression
      2. Pattern matching

Published in
ACM Transactions on Algorithms Volume 19, Issue 4
October 2023
255 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/3614237
Editor:
Edith Cohen
Google Research, USA and Tel Aviv University, Israel
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 September 2023
- Online AM: 21 July 2023
- Accepted: 15 June 2023
- Revised: 3 February 2022
- Received: 19 March 2020
Author Tags
String indexing
Qualifiers
- research-article