Abstract
We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a character) and a node represents the string obtained by applying the sequence of edit operations on the path from the root to the node. The goal is to compactly represent the entire collection while supporting fast random access to any part of a string in the collection. This problem captures natural scenarios such as representing the past history of an edited document or representing highly-repetitive collections. Given a tree with n nodes, we show how to represent the corresponding collection in O(n) space and \(O(\log n/ \log \log n)\) query time. This improves the previous time-space trade-offs for the problem. Additionally, we show a lower bound proving that the query time is optimal for any solution using near-linear space. To achieve our bounds for random access in persistent strings we show how to reduce the problem to the following natural geometric selection problem on line segments. Consider a set of horizontal line segments in the plane. Given parameters i and j, a segment selection query returns the j th smallest segment (the segment with the j th smallest y-coordinate) among the segments crossing the vertical line through x-coordinate i. The segment selection problem is to preprocess a set of horizontal line segments into a compact data structure that supports fast segment selection queries. We present a solution that uses O(n) space and support segment selection queries in \(O(\log n/ \log \log n)\) time, where n is the number of segments. Furthermore, we prove that that this query time is also optimal for any solution using near-linear space.
Similar content being viewed by others
References
Agarwal, P.K., Arge, L., Kaplan, H., Molad, E., Tarjan, R.E., Yi, K.: An optimal dynamic data structure for stabbing-semigroup queries. SIAM J. Comput. 41(1), 104–127 (2012)
Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69(1), 232–268 (2014)
Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multi-labeled trees. In: Proc. 18Th SODA, pp. 680–689 (2007)
Belazzougui, D., Cording, P.H., Puglisi, S.J., Tabei, Y.: Access, rank, and select in grammar-compressed strings. In: Proc. 23Rd ESA, pp. 142–154 (2015)
Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans Algorithms 11(4), 1–21 (2015)
Bille, P., Christiansen, A.R., Cording, P.H., Gørtz, I.L., Skjoldjensen, F.R., Vildhøj, H.W., Vind, S.: Dynamic relative compression, dynamic partial sums, and substring concatenation. Algorithmica 80(11), 3207–3224 (2018). Announced at ISAAC 2016
Bille, P., Christiansen, A.R., Prezza, N., Skjoldjensen, F.R.: Succinct partial sums and fenwick trees. In: Proc. 24Th SPIRE, pp. 91–96 (2017)
Bille, P., Ettienne, M.B., Gørtz, I.L., Vildhøj, H.W.: Time–space trade-offs for lempel–ziv compressed indexing. Theoret. Comput. Sci. 713, 66–77 (2018)
Bille, P., Gørtz, I.L.: Random access in persistent strings. In: Proc. 31St ISAAC (2020)
Bille, P., Gørtz, I.L., Landau, G.M., Weimann, O.: Tree compression with top trees. Inform. Comput. 243, 166–177 (2015)
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015). Announced at SODA 2011
Chan, T.M.: Persistent predecessor search and orthogonal point location on the word RAM. ACM Trans. Algorithms 9(3), 1–22 (2013)
Chan, T.M., Pǎtraşcu, M.: Transdichotomous results in computational geometry, i: Point location in sublogarithmic time. SIAM J. Comput. 39(2), 703–729 (2009)
Chan, T.M., Tsakalidis, K.: Dynamic planar orthogonal point location in sublogarithmic time. In: Proc 34Th SoCG 2018 (2018)
Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Trans. Inform. Theory 51(7), 2554–2576 (2005)
Chazelle, B.: Filtering search: A new approach to query-answering. SIAM J. Comput. 15(3), 703–724 (1986)
Chern, B., Ochoa, I., Manolakos, A., No, A., Venkat, K., Weissman, T.: Reference based genome compression. In: Proc. 12Th ITW, pp. 427–431 (2012)
De Berg, M., Vankreveld, M., Snoeyink, J.: Two-dimensional and three-dimensional point location in rectangular subdivisions. J. Algorithms 18(2), 256–277 (1995)
Dietz, P.F.: Fully persistent arrays (extended array). In: Proceedings of the Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science, vol. 382, pp. 67–74 (1989)
Dietz, P.F.: Optimal algorithms for list indexing and subset rank. In: Proc. 1St WADS, pp. 39–46 (1989)
Do, H.H., Jansson, J., Sadakane, K., Sung, W.K.: Fast relative lempel–Ziv self-index for similar sequences. Theoret. Comput. Sci. 532, 14–30 (2014)
Driscoll, J., Sarnak, N., Sleator, D., Tarjan, R.: Making data structures persistent. J. Comput. System Sci. 38, 86–124 (1989)
Fenwick, P.M.: A new data structure for cumulative frequency tables. Software: Pract. Exper. 24(3), 327–336 (1994)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20 (2007)
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theoret. Comput. Sci. 372(1), 115–121 (2007)
Fredman, M., Saks, M.: The cell probe complexity of dynamic data structures. In: Proc. 21St STOC, pp. 345–354 (1989)
Fredman, M.L., Willard, D.E.: Surpassing the information theoretic bound with fusion trees. J. Comput. System Sci. 47(3), 424–436 (1993)
Fredman, M.L., Willard, D.E.: Trans-dichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. System Sci. 48(3), 533–551 (1994)
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Proc. 6Th LATA, pp. 240–251 (2012)
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Proc. 14Th LATIN, pp. 731–742 (2014)
Gagie, T., Gawrychowski, P., Puglisi, S.J.: Approximate pattern matching in lz77-compressed texts. J. Discrete Algorithms 32, 64–68 (2015)
Gagie, T., Karhu, K., Navarro, G., Puglisi, S.J., Sirén, J.: Document listing on repetitive collections. In: Proc. 24Th CPM, pp. 107–119 (2013)
Ganardi, M., Jez, A., Lohrey, M.: Balancing straight-line programs. In: Proc. 60Th FOCS, pp. 1169–1183 (2019)
Golynski, A., Munro, J.I., Rao, S.S.: Rank/Select operations on large alphabets: a tool for text indexing. In: Proc. 17Th SODA, pp. 368–373 (2006)
Golynski, A., Raman, R., Rao, S.S.: On the redundancy of succinct data structures. In: Proc. 11Th SWAT, pp. 148–159 (2008)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proc. 14Th SODA, pp. 841–850 (2003)
Grossi, R., Raman, R., Rao, S.S., Venturini, R.: Dynamic compressed strings with random access. In: Proc. 40Th ICALP, pp. 504–515 (2013)
Hon, W.K., Sadakane, K., Sung, W.K.: Succinct data structures for searchable partial sums with optimal worst-case performance. Theoret. Comput. Sci. 412(39), 5176–5186 (2011)
Hoobin, C., Puglisi, S.J., Zobel, J.: Relative lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endowment 5(3), 265–273 (2011)
Jørgensen, A.G., Larsen, K.G.: Range selection and median: tight cell probe lower bounds and adaptive data structures. In: Proc. 22Nd SODA, pp. 805–813 (2011)
Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proc. 50Th STOC, pp. 827–840 (2018)
Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative lempel-ziv compression of genomes for large-scale storage and retrieval. In: Proc. 17Th SPIRE, pp. 201–206 (2010)
Kuruppu, S., Puglisi, S.J., Zobel, J.: Optimized relative lempel-ziv compression of genomes. In: Proc. 34Th ACSC, pp. 91–98 (2011)
Liao, S.Y., Devadas, S., Keutzer, K.: A text-compression-based method for code size minimization in embedded systems. Trans. Design Autom. Electr. Syst. 4(1), 12–38 (1999)
Liao, S.Y., Devadas, S., Keutzer, K., Tjiang, S.W.K., Wang, A.: Code optimization techniques in embedded DSP microprocessors. Design. Autom. Emb. Sys. 3(1), 59–73 (1998)
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17 (3), 281–308 (2010)
Munro, J.I., Nekrich, Y.: Compressed data structures for dynamic sequences. In: Proc. 23Rd ESA, pp. 891–902 (2015)
Navarro, G.: Indexing highly repetitive collections. In: Proc. 23Rd IWOCA, pp. 274–279 (2012)
Navarro, G.: Document listing on repetitive collections with guaranteed performance. Theoret. Comput. Sci. 772, 58–72 (2019)
Nekrich, Y.: A dynamic stabbing-max data structure with sub-logarithmic query time. In: Proc. 22Nd ISAAC, pp. 170–179 (2011)
Pătraşcu, M., Thorup, M.: Dynamic integer sets with optimal rank, select, and predecessor search. In: Proc. 55Th FOCS, pp. 166–175 (2014)
Pǎtraşcu, M., Demaine, E.D.: Logarithmic lower bounds in the cell-probe model. SIAM J. Comput. 35(4), 932–963 (2006). Announced at SODA 2004
Raman, R., Raman, V., Rao, S.S.: Succinct dynamic data structures. In: Proc. 7Th WADS, pp. 426–437 (2001)
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci. 302(1-3), 211–222 (2003)
Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proc. 17Th SODA, pp. 1230–1239 (2006)
Sarnak, N., Tarjan, R.E.: Planar point location using persistent search trees. Commun. ACM 29(7), 669–679 (1986)
Storer, J.A., Szymanski, T.G.: The macro model for data compression. In: Proc. 10Th STOC, pp. 30–39 (1978)
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982)
Tarjan, R.E., Vishkin, U.: Finding biconnected componemts and computing tree functions in logarithmic parallel time. In: Proc. 25Th FOCS, pp. 12–20 (1984)
Verbin, E., Yu, W.: Data structure lower bounds on random access to grammar-compressed strings. In: Proc. 24Th CPM, pp. 247–258 (2013)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)
Acknowledgements
We thank Jesper Jansson for pointing out segment selection as a problem of independent interest and the anonymous reviewers for their helpful comments that improved the presentation of earlier versions of this paper.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
An extended abstract appeared at the 31st International Symposium on Algorithms and Computation [9]
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bille, P., Gørtz, I.L. Random Access in Persistent Strings and Segment Selection. Theory Comput Syst 67, 694–713 (2023). https://doi.org/10.1007/s00224-022-10109-5
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00224-022-10109-5