Assisted design of data science pipelines, The VLDB Journal


Assisted design of data science pipelines
The VLDB Journal (IF 4.2), Pub Date: 2024-02-13, DOI: 10.1007/s00778-024-00835-2
Sergey Redyuk, Zoi Kaoudi, Sebastian Schelter, Volker Markl

When designing data science (DS) pipelines, end-users can get overwhelmed by the large and growing set of available data preprocessing and modeling techniques. Intelligent discovery assistants (IDAs) and automated machine learning (AutoML) solutions aim to assist end-users by (semi-)automating the process. However, they are computationally expensive and have limited applicability across the wide range of real-world use cases and application domains. This is due to (a) their need to execute thousands of pipelines to find the best one; (b) their limited support for DS tasks (e.g., supervised classification or regression only) and their small, static set of available data preprocessing and ML algorithms; and (c) their restriction to quantifiable evaluation processes and metrics, e.g., tenfold cross-validation using the ROC AUC score for classification. To overcome these limitations, we propose a human-in-the-loop approach for the assisted design of data science pipelines that reuses previously executed pipelines. Based on a user query, i.e., data and a DS task, our framework outputs a ranked list of pipeline candidates, which the user can execute or modify in real time. To recommend pipelines, it first identifies relevant datasets and pipelines using efficient similarity search. It then ranks the candidate pipelines using multi-objective sorting and takes user interactions into account to improve suggestions over time. In our experimental evaluation, the proposed framework significantly outperforms the state-of-the-art IDA tool and achieves predictive performance similar to that of state-of-the-art long-running AutoML solutions while being real-time, generic to any evaluation process and DS task, and extensible to new operators.
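The two recommendation steps named in the abstract (similarity search over past datasets, then multi-objective sorting of their pipelines) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how datasets are represented, which similarity measure is used, or which objectives are sorted. The code below assumes hypothetical meta-feature vectors per dataset, cosine similarity, and Pareto-front (non-dominated) sorting over two assumed objectives, predictive score and runtime. All names (`PipelineRun`, `recommend`, and so on) are made up for illustration, and the user-feedback step is left out.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class PipelineRun:
    """A previously executed pipeline (hypothetical record layout)."""
    name: str
    dataset: str
    score: float      # assumed objective 1: predictive quality, higher is better
    runtime_s: float  # assumed objective 2: runtime in seconds, lower is better


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_datasets(query_vec, dataset_vecs, k):
    """Return the names of the k stored datasets most similar to the query."""
    ranked = sorted(
        dataset_vecs,
        key=lambda name: cosine(query_vec, dataset_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def dominates(a, b):
    """True if run a is no worse than b on both objectives and better on one."""
    no_worse = a.score >= b.score and a.runtime_s <= b.runtime_s
    better = a.score > b.score or a.runtime_s < b.runtime_s
    return no_worse and better


def pareto_rank(runs):
    """Order runs front by front: non-dominated runs first, then the rest."""
    remaining = list(runs)
    ordered = []
    while remaining:
        front = [r for r in remaining
                 if not any(dominates(o, r) for o in remaining)]
        front.sort(key=lambda r: -r.score)  # tie-break inside a front by score
        ordered.extend(front)
        remaining = [r for r in remaining if r not in front]
    return ordered


def recommend(query_vec, dataset_vecs, history, k=2):
    """Collect pipelines run on the k nearest datasets and rank them."""
    nearest = set(similar_datasets(query_vec, dataset_vecs, k))
    candidates = [r for r in history if r.dataset in nearest]
    return pareto_rank(candidates)
```

A brute-force scan stands in for the "efficient similarity search" of the abstract; a real system would likely use an index. Because ranking only reads stored results, no pipeline is executed at recommendation time, which is what makes a real-time response plausible.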


Last updated: 2024-02-13
