Data distribution tailoring revisited: cost-efficient integration of representative data
The VLDB Journal (IF 4.2), Pub Date: 2024-04-12, DOI: 10.1007/s00778-024-00849-w
Jiwon Chang, Bohan Cui, Fatemeh Nargesian, Abolfazl Asudeh, H. V. Jagadish

Data scientists often build data sets for analysis by drawing on available data sources. A major challenge is ensuring that the resulting data set adequately represents relevant demographic groups or other variables. Whether data is obtained from an experiment or from a data provider, a single source may not meet the desired distribution requirements, so combining data from multiple sources is often necessary. The data distribution tailoring (DT) problem aims to cost-efficiently collect a unified data set from multiple sources. In this paper, we present major optimizations and generalizations to previous algorithms for this problem. When the group distributions in the sources are known, we present a novel algorithm, RatioColl, that outperforms the existing algorithm, based on the coupon collector’s problem. When the distributions are unknown, we propose decaying exploration rate multi-armed-bandit algorithms that, unlike the existing algorithm for unknown DT, do not require prior information. Through theoretical analysis and extensive experiments, we demonstrate the effectiveness of our proposed algorithms.
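
To make the unknown-distribution setting concrete, the following is a minimal sketch of a decaying-exploration (epsilon-greedy) bandit that collects a target number of records per group from several sources. It is not the paper's algorithm: the function name `collect_with_decaying_epsilon`, the exploration schedule `epsilon = min(1, n_sources / t)`, and the scoring rule (estimated probability of a still-needed group, divided by cost) are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def collect_with_decaying_epsilon(
    source_dists: Sequence[Sequence[float]],  # true group distribution per source (hidden from the learner)
    costs: Sequence[float],                   # cost of one draw from each source
    targets: Sequence[int],                   # required number of records per group
    seed: int = 0,
) -> Tuple[float, List[int]]:
    """Hypothetical sketch: epsilon-greedy source selection with a decaying rate."""
    rng = random.Random(seed)
    n_sources, n_groups = len(source_dists), len(targets)
    remaining = list(targets)
    collected = [0] * n_groups
    seen = [[0] * n_groups for _ in range(n_sources)]  # observed group counts per source
    pulls = [0] * n_sources
    total_cost = 0.0
    t = 0

    def score(i: int) -> float:
        # Estimated chance that source i yields a still-needed group, per unit cost.
        if pulls[i] == 0:
            return float("inf")  # force at least one draw from every source
        useful = sum(seen[i][g] for g in range(n_groups) if remaining[g] > 0)
        return useful / pulls[i] / costs[i]

    while any(r > 0 for r in remaining):
        t += 1
        epsilon = min(1.0, n_sources / t)  # assumed schedule: exploration decays as 1/t
        scores = [score(i) for i in range(n_sources)]
        if rng.random() < epsilon or max(scores) == 0.0:
            i = rng.randrange(n_sources)  # explore (also when no source looks useful)
        else:
            i = max(range(n_sources), key=scores.__getitem__)  # exploit
        g = rng.choices(range(n_groups), weights=source_dists[i])[0]
        pulls[i] += 1
        seen[i][g] += 1
        total_cost += costs[i]
        if remaining[g] > 0:  # keep the record only if its group is still needed
            remaining[g] -= 1
            collected[g] += 1
    return total_cost, collected


cost, got = collect_with_decaying_epsilon(
    [[0.9, 0.1], [0.2, 0.8]], [1.0, 1.0], [5, 5]
)
print(cost, got)
```

The sketch needs no prior knowledge of the source distributions; it only assumes that every required group has positive probability in at least one source, since otherwise the loop would not terminate.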


Updated: 2024-04-12