Locality-sensitive bucketing functions for the edit distance, Algorithms for Molecular Biology



Locality-sensitive bucketing functions for the edit distance
Algorithms for Molecular Biology (IF 1) Pub Date: 2023-07-24, DOI: 10.1186/s13015-023-00234-2
Ke Chen¹, Mingfu Shao¹,²


Many bioinformatics applications involve bucketing a set of sequences, where each sequence may be assigned to multiple buckets. To achieve both high sensitivity and high precision, a bucketing method should assign similar sequences to the same bucket and dissimilar sequences to distinct buckets. Existing k-mer-based bucketing methods are efficient on sequencing data with low error rates, but their sensitivity drops sharply on data with high error rates. Locality-sensitive hashing (LSH) schemes can mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be $$(d_1, d_2)$$-sensitive if any two sequences within an edit distance of $$d_1$$ are mapped into at least one shared bucket, and any two sequences with distance at least $$d_2$$ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions for a variety of values of $$(d_1,d_2)$$ and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds on these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results lay the theoretical foundations for the practical use of LSB functions in analyzing sequences with high error rates, while also providing insights into the hardness of designing ungapped LSH functions.
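The sensitivity definition in the abstract can be checked by brute force on small inputs. The Python sketch below is an illustration only, not code or a construction taken from the paper: the names `is_sensitive`, `identity_buckets`, and `deletion_buckets` are hypothetical, and the two example bucketing functions were chosen here for simplicity. It enumerates every sequence of a fixed length over a small alphabet and tests both conditions of the definition.

```python
from itertools import product


def edit_distance(s, t):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a from s
                            curr[j - 1] + 1,           # insert b into s
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]


def is_sensitive(f, n, alphabet, d1, d2):
    """Brute-force check that f is (d1, d2)-sensitive on length-n sequences.

    f maps a sequence to an iterable of buckets (hypothetical interface).
    """
    seqs = ["".join(p) for p in product(alphabet, repeat=n)]
    buckets = {s: set(f(s)) for s in seqs}
    for s in seqs:
        for t in seqs:
            d = edit_distance(s, t)
            shared = buckets[s] & buckets[t]
            if d <= d1 and not shared:
                return False  # close pair without a common bucket
            if d >= d2 and shared:
                return False  # distant pair that collides in a bucket
    return True


def identity_buckets(s):
    """Assumed example: each sequence goes into a single bucket named by itself."""
    return {s}


def deletion_buckets(s):
    """Assumed example: one bucket per string obtained by deleting one character."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


print(is_sensitive(identity_buckets, 4, "AC", 0, 1))  # True
print(is_sensitive(identity_buckets, 4, "AC", 1, 2))  # False
print(is_sensitive(deletion_buckets, 4, "AC", 1, 3))  # True
print(is_sensitive(deletion_buckets, 4, "AC", 1, 2))  # False
```

In this sketch the deletion-based function passes the (1, 3) check because two equal-length sequences that differ by one substitution agree once the substituted position is deleted, while any two sequences that share a deletion bucket are at most one deletion plus one insertion apart. It fails the (1, 2) check, which shows how narrowing the gap between the two thresholds makes the requirement harder to meet. The check is exponential in the sequence length and is only practical for toy parameters.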


Updated: 2023-07-25
