Constant-Delay Enumeration for Nondeterministic Document Spanners
ACM Transactions on Database Systems (IF 1.8), Pub Date: 2021-04-14, DOI: 10.1145/3436487
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth

We consider the information extraction framework known as document spanners and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS’18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in the size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, particularly for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.
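To make the enumeration setting concrete, here is a minimal sketch of a naive baseline, not the algorithm from the paper. It assumes a one-variable extraction task given as an ordinary regular expression rather than a VA, and the names `Span` and `enumerate_spans` are hypothetical, chosen only for this illustration.

```python
import re
from typing import Iterator, Tuple

# A span is a half-open interval [start, end) of positions in the document.
Span = Tuple[int, int]


def enumerate_spans(document: str, pattern: str) -> Iterator[Span]:
    """Naively yield every span whose content matches `pattern` in full.

    Illustrative baseline only: it tries all O(n^2) candidate spans, so the
    time between two consecutive outputs can depend on the document size.
    """
    compiled = re.compile(pattern)  # the only "preprocessing" done here
    n = len(document)
    for start in range(n + 1):
        for end in range(start, n + 1):
            # fullmatch with pos/endpos tests exactly the slice [start, end)
            if compiled.fullmatch(document, start, end):
                yield (start, end)


if __name__ == "__main__":
    for span in enumerate_spans("a12b3", r"\d+"):
        print(span)
```

In this sketch the work between two consecutive outputs can grow with the document length, because many non-matching candidate spans may be tested before the next match is found. That dependence on the document is what the constant-delay guarantee in the abstract rules out, after a preprocessing phase that is linear in the document size.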

Updated: 2021-04-14