Sparsifying Count Sketch, Information Processing Letters


Sparsifying Count Sketch
Information Processing Letters (IF 0.5), Pub Date: 2024-02-29, DOI: 10.1016/j.ipl.2024.106490
Bhisham Dev Verma, Rameshwar Pratap, Punit Pankaj Dubey

The seminal work of Charikar et al., called Count-Sketch, suggests a sketching algorithm for real-valued vectors that has been used in frequency estimation for data streams, pairwise inner product estimation for real-valued vectors, and similar tasks. One of the major advantages of Count-Sketch over other similar sketching algorithms, such as random projection, is that its running time, as well as the sparsity of the sketch, depends on the sparsity of the input. Sparse datasets therefore yield space-efficient (sparse) sketches and faster running times. On dense datasets, however, these advantages of Count-Sketch over other baselines might be negligible. In this work, we address this challenge by suggesting a simple and effective approach that outputs an (asymptotically) sparser sketch than that obtained via Count-Sketch, and as a by-product, we also achieve a faster running time. At the same time, the quality of our estimate closely approximates that of Count-Sketch. For the frequency estimation and pairwise inner product estimation problems, our proposal provides unbiased estimates. These estimates, however, have slightly higher variances than the respective estimates obtained via Count-Sketch. To address this issue, we present improved estimators for these problems based on maximum likelihood estimation (MLE) that offer smaller variances even relative to Count-Sketch. We give a rigorous theoretical analysis of our proposal for frequency estimation for data streams and pairwise inner product estimation for real-valued vectors.
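For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal single-row version of the classic Count-Sketch, not the sparsified variant or the MLE estimators proposed in the paper, which the abstract does not specify. The function names, the use of Python's `random` module in place of pairwise-independent hash families, and the parameters `dim`, `width`, and `seed` are all assumptions made for illustration.

```python
import random


def make_hashes(dim, width, seed=0):
    """Draw a bucket index and a +/-1 sign for every input coordinate.

    Assumption: a seeded PRNG stands in for the hash functions.
    """
    rng = random.Random(seed)
    buckets = [rng.randrange(width) for _ in range(dim)]
    signs = [rng.choice((-1, 1)) for _ in range(dim)]
    return buckets, signs


def count_sketch(vec, buckets, signs, width):
    """Single-row Count-Sketch of vec.

    Only nonzero coordinates are touched, so the running time and the
    number of nonzero sketch entries are bounded by the input's sparsity.
    """
    sketch = [0.0] * width
    for i, value in enumerate(vec):
        if value != 0:
            sketch[buckets[i]] += signs[i] * value
    return sketch


def estimate_coordinate(sketch, i, buckets, signs):
    """Estimate of coordinate i (the frequency of item i in a stream)."""
    return signs[i] * sketch[buckets[i]]


def estimate_inner_product(sketch_u, sketch_v):
    """Estimate of <u, v> from two sketches built with the same hashes."""
    return sum(a * b for a, b in zip(sketch_u, sketch_v))


# Usage: a vector with a single nonzero entry is recovered exactly,
# because nothing else can collide with it in its bucket.
buckets, signs = make_hashes(dim=8, width=4, seed=1)
u = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
su = count_sketch(u, buckets, signs, 4)
print(estimate_coordinate(su, 2, buckets, signs))  # 3.0
```

On a dense input every coordinate contributes to some bucket, so the sketch is generally dense as well. That is the regime in which, per the abstract, the advantages of Count-Sketch over other baselines may become negligible.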


Updated: 2024-02-29
