Generic Non-recursive Suffix Array Construction
ACM Transactions on Algorithms (IF 1.3), Pub Date: 2024-04-13, DOI: 10.1145/3641854
Jannik Olbrich, Enno Ohlebusch, Thomas Büchler

The suffix array is arguably one of the most important data structures in sequence analysis, and consequently there is a multitude of suffix sorting algorithms. However, to date the GSACA algorithm, introduced in 2015, is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, little effort has gone into improving GSACA’s uncompetitive real-world performance. There is a super-linear algorithm, DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this article is twofold. First, we analyse the sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm. Second, we show that it can be used very elegantly to compute both the original extended Burrows-Wheeler transform (eBWT) and a bijective version of the Burrows-Wheeler transform (BBWT) in linear time. We call the algorithm “generic” since it can be used to compute the regular suffix array as well as the variants used for the BBWT and eBWT. Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our BBWT algorithm is faster than or competitive with all other tested BBWT construction implementations on large or repetitive data, and our eBWT algorithm is faster than all other tested programs on data that is not extremely repetitive.
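
For readers unfamiliar with the objects named in the abstract, the following is a minimal definitional sketch in Python. It is not the article's algorithm, nor GSACA, DSH, or DivSufSort: it sorts naively and runs in super-linear time. All function names (`suffix_array`, `bwt`, `lyndon_factors`, `bbwt`) are hypothetical, and the sentinel `$` is assumed to be smaller than every character of the input. The BBWT part uses the standard definition via the Lyndon factorisation (computed here with Duval's algorithm) and sorts the rotations of the factors by comparing their infinite repetitions.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """BWT of text + sentinel, read off the suffix array.

    Assumption: the sentinel is smaller than every character in text.
    """
    s = text + sentinel
    # The BWT symbol for suffix i is the character preceding it (cyclically).
    return "".join(s[i - 1] for i in suffix_array(s))


def lyndon_factors(text):
    """Duval's algorithm: split text into a non-increasing run of Lyndon words."""
    factors, i, n = [], 0, len(text)
    while i < n:
        j, k = i + 1, i
        while j < n and text[k] <= text[j]:
            k = i if text[k] < text[j] else k + 1
            j += 1
        # The scanned prefix consists of repetitions of a word of length j - k.
        while i <= k:
            factors.append(text[i:i + j - k])
            i += j - k
    return factors


def bbwt(text):
    """Naive bijective BWT: last characters of the sorted factor rotations."""
    n = len(text)
    rotations = []
    for w in lyndon_factors(text):
        rotations.extend(w[r:] + w[:r] for r in range(len(w)))

    def omega_key(u):
        # A prefix of length 2n of u repeated forever is enough to separate
        # any two rotations whose infinite repetitions differ.
        return (u * (2 * n // len(u) + 1))[:2 * n]

    rotations.sort(key=omega_key)
    return "".join(u[-1] for u in rotations)


if __name__ == "__main__":
    print(suffix_array("banana"))
    print(bwt("banana"))
    print(lyndon_factors("banana"))
    print(bbwt("banana"))
```

Unlike `bwt`, the `bbwt` sketch needs no sentinel, which is what makes the transform bijective on plain strings. The eBWT applies the same rotation-sorting idea to a collection of strings instead of the Lyndon factors of a single text.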




Updated: 2024-04-13