MinJoin++: a fast algorithm for string similarity joins under edit distance, The VLDB Journal

The VLDB Journal (IF 4.2), Pub Date: 2023-08-21, DOI: 10.1007/s00778-023-00806-z
Nikolai Karpov, Haoyu Zhang, Qin Zhang

We study the problem of computing similarity joins under edit distance on a set of strings. The edit similarity join is a fundamental problem in databases, data mining and bioinformatics, with many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitude in running time and achieves perfect accuracy on most datasets that we have tested.
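To make the problem statement concrete, here is a minimal Python sketch of an edit similarity join done the naive way: compare every pair of strings and keep those within edit distance k. This is not the MinJoin++ algorithm and uses none of its partitioning ideas; it is only a quadratic baseline that illustrates what a join result looks like and what a per-pair verification check might involve. The function names and the threshold parameter `k` are assumptions made for illustration.

```python
from itertools import combinations


def edit_distance_within(a: str, b: str, k: int) -> bool:
    """Return True if the edit (Levenshtein) distance between a and b is <= k."""
    if abs(len(a) - len(b)) > k:
        return False  # length filter: at least |len(a) - len(b)| edits are needed
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        if min(cur) > k:
            return False  # row minima never decrease, so the result exceeds k
        prev = cur
    return prev[-1] <= k


def naive_edit_similarity_join(strings, k):
    """Return all index pairs (i, j), i < j, whose strings are within distance k.

    Illustrative brute-force baseline only: it checks every pair.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(strings)), 2)
        if edit_distance_within(strings[i], strings[j], k)
    ]


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "banana"]
    print(naive_edit_similarity_join(words, 2))
```

Because this baseline checks every pair, its cost grows quadratically with the number of strings, which is the kind of scaling limitation on long strings and large thresholds that the abstract refers to.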

Updated: 2023-08-21