research-article

String Indexing with Compressed Patterns

Published: 26 September 2023
Abstract

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this article, we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way, we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
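To make the setting concrete, here is a minimal sketch of a textbook greedy LZ77 parse and of the decompress-then-search baseline that the abstract contrasts with. It is not the article's data structure. The phrase format (source position, copy length, next character), the allowance for self-referential copies, and the function names `lz77_compress`, `lz77_decompress`, and `naive_server_query` are all assumptions made for illustration; the article may use a different LZ77 variant.

```python
from typing import List, Optional, Tuple

# Assumed phrase format: copy `length` characters starting at position `src`
# of the text decoded so far, then append the literal `nxt`.
# `nxt` is None only when the pattern ends right after the copy.
Phrase = Tuple[int, int, Optional[str]]


def lz77_compress(p: str) -> List[Phrase]:
    """Naive greedy LZ77 parse (quadratic or worse; for illustration only)."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(p):
        best_src, best_len = 0, 0
        for src in range(i):
            length = 0
            # The copy may run past position i (self-referential phrase).
            while i + length < len(p) and p[src + length] == p[i + length]:
                length += 1
            if length > best_len:
                best_src, best_len = src, length
        end = i + best_len
        nxt = p[end] if end < len(p) else None
        phrases.append((best_src, best_len, nxt))
        i = end + 1
    return phrases


def lz77_decompress(phrases: List[Phrase]) -> str:
    """Expand the phrases back into the plain pattern."""
    out: List[str] = []
    for src, length, nxt in phrases:
        for k in range(length):
            # Character-by-character copy so overlapping sources work.
            out.append(out[src + k])
        if nxt is not None:
            out.append(nxt)
    return "".join(out)


def naive_server_query(text: str, phrases: List[Phrase]) -> List[int]:
    """Baseline server: decompress the pattern, then scan the text.

    The cost depends on the uncompressed pattern length, which is the
    dependence the article aims to avoid.
    """
    pattern = lz77_decompress(phrases)
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


if __name__ == "__main__":
    z = lz77_compress("abababab")
    print(len(z), "phrases for a pattern of length 8")
    print(naive_server_query("xxababababxx", z))
```

In this sketch a repetitive pattern of length 8 is sent as 3 phrases, yet the baseline server still does work proportional to the full pattern length after decompression. The article's goal is query time that depends on the number of phrases instead.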

      • Published in

        ACM Transactions on Algorithms, Volume 19, Issue 4
        October 2023
        255 pages
        ISSN:1549-6325
        EISSN:1549-6333
        DOI:10.1145/3614237
        • Editor: Edith Cohen

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 26 September 2023
        • Online AM: 21 July 2023
        • Accepted: 15 June 2023
        • Revised: 3 February 2022
        • Received: 19 March 2020