Tabular data synthesis with generative adversarial networks: design space and optimizations
The VLDB Journal (IF 4.2), Pub Date: 2023-08-15, DOI: 10.1007/s00778-023-00807-y
Tongyu Liu, Ju Fan, Guoliang Li, Nan Tang, Xiaoyong Du

The proliferation of big data has created an urgent demand for privacy-preserving data publishing. Traditional solutions struggle to balance the trade-off between the privacy and the utility of the released data. To address this problem, the database and machine learning communities have recently studied tabular data synthesis using generative adversarial networks (GANs) and proposed various algorithms. However, a comprehensive comparison between GAN-based methods and conventional approaches is still lacking, so it remains unclear why and how GANs can outperform conventional approaches in synthesizing tabular data. Moreover, it is difficult for practitioners to understand which components are necessary when building a GAN model for tabular data synthesis. To bridge this gap, we conduct a comprehensive experimental study of applying GANs to tabular data synthesis. We introduce a unified GAN-based framework and define a space of design solutions for each component in the framework, including neural network architectures and training strategies. We provide optimization techniques to handle the difficulties of training GANs in practice. We conduct extensive experiments to explore the design space and compare against traditional data synthesis approaches. From these experiments, we find that GANs are very promising for tabular data synthesis, and we provide guidance for selecting appropriate design choices. We also point out limitations of GANs and identify future research directions. We make all code and datasets public for future research.




Updated: 2023-08-15