FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets
Bioinformatics ( IF 5.8 ) Pub Date : 2024-01-11 , DOI: 10.1093/bioinformatics/btae014
Pinglu Zhang 1, 2 , Huan Liu 3 , Yanming Wei 4 , Yixiao Zhai 1, 2 , Qinzhong Tian 1, 2 , Quan Zou 1, 2

Motivation: In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle to align ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which uses the FM-index to extract common seeds and segment the sequences accordingly.

Results: FMAlign2 uses the suffix array to identify maximal exact matches, changing the approach of FMAlign from searching for global chains to searching for partial chains. With the vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of subMSA. Sequence-profile alignment and refinement are then used to concatenate the subsets and yield the final result. Compared to FMAlign, FMAlign2 markedly improves the segmentation of sequences and significantly reduces running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by enabling them to handle sequences reaching billions in length within an acceptable time frame.

Availability: Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.

Contact: pingluzhang@outlook.com

Supplementary information: Supplementary data are available at Bioinformatics online.
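To make the vertical division idea in the abstract concrete, the following is a minimal sketch and not FMAlign2's implementation. As a stated simplification, it uses fixed-length k-mers that occur exactly once in every sequence as anchors, instead of maximal exact matches found with a suffix array, and it chains them greedily. The function names (`unique_kmers`, `collinear_anchors`, `vertical_split`) and the parameter `k` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def unique_kmers(seq: str, k: int) -> Dict[str, int]:
    """Map each k-mer that occurs exactly once in seq to its start position."""
    first_pos: Dict[str, int] = {}
    repeated = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in first_pos:
            repeated.add(kmer)
        else:
            first_pos[kmer] = i
    return {kmer: pos for kmer, pos in first_pos.items() if kmer not in repeated}


def collinear_anchors(seqs: List[str], k: int) -> List[Tuple[int, ...]]:
    """Greedily chain non-overlapping k-mers shared by all sequences in the same order."""
    tables = [unique_kmers(s, k) for s in seqs]
    shared = set(tables[0]).intersection(*tables[1:])
    # One tuple of start positions per shared k-mer, sorted by position in seqs[0].
    candidates = sorted(tuple(t[kmer] for t in tables) for kmer in shared)
    chain: List[Tuple[int, ...]] = []
    for cand in candidates:
        # Keep the anchor only if it lies past the previous anchor in every sequence.
        if not chain or all(c >= p + k for c, p in zip(cand, chain[-1])):
            chain.append(cand)
    return chain


def vertical_split(seqs: List[str], k: int) -> List[List[str]]:
    """Cut all sequences at the chained anchors.

    Returns blocks in left-to-right order. Even-indexed blocks hold the
    unaligned regions between anchors; odd-indexed blocks hold the anchors,
    which are identical in every sequence and need no alignment.
    """
    blocks: List[List[str]] = []
    prev = [0] * len(seqs)
    for anchor in collinear_anchors(seqs, k):
        blocks.append([s[p:a] for s, p, a in zip(seqs, prev, anchor)])
        blocks.append([s[a:a + k] for s, a in zip(seqs, anchor)])
        prev = [a + k for a in anchor]
    blocks.append([s[p:] for s, p in zip(seqs, prev)])
    return blocks


if __name__ == "__main__":
    example = ["TTACGTACCGGATTT", "GACGTACCGGAC", "ACGTACCGGA"]
    for index, block in enumerate(vertical_split(example, 4)):
        kind = "anchor" if index % 2 else "between"
        print(kind, block)
```

In a sketch like this, each even-indexed block could be handed to any existing MSA tool independently, for example from a process pool, and the per-block alignments concatenated in order. The sequence-profile alignment and refinement steps mentioned in the abstract are not modelled here.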

Updated: 2024-01-11