A systematic evaluation of machine learning on serverless infrastructure
The VLDB Journal (IF 4.2), Pub Date: 2023-09-20, DOI: 10.1007/s00778-023-00813-0
Jiawei Jiang, Shaoduo Gan, Bo Du, Gustavo Alonso, Ana Klimovic, Ankit Singla, Wentao Wu, Sheng Wang, Ce Zhang

Recently, the serverless paradigm of computing has inspired research on its applicability to data-intensive tasks such as ETL, database query processing, and machine learning (ML) model training. Recent efforts have proposed multiple systems for training large-scale ML models in a distributed manner on top of serverless infrastructures (e.g., AWS Lambda). Yet, there is so far no consensus on the design space for such systems when compared with systems built on top of classical “serverful” infrastructures. Indeed, a variety of factors can impact the performance of training ML models in a distributed environment, such as the optimization algorithm used and the synchronization protocol followed by parallel executors, and these factors must be carefully considered when designing serverless ML systems. To clarify contradictory observations from previous work, in this paper we present a systematic comparative study of serverless and serverful systems for distributed ML training. We present a design space that covers the design choices made by previous systems on aspects such as optimization algorithms and synchronization protocols. We then implement a platform, LambdaML, that enables a fair comparison between serverless and serverful systems by navigating the aforementioned design space. We further extend LambdaML toward automatic support by designing a hyper-parameter tuning framework that leverages the capabilities of serverless infrastructure. We present empirical evaluation results using LambdaML on both single training jobs and multi-tenant workloads. Our results reveal that there is no “one size fits all” serverless solution given the current state of the art: one must choose different designs for different ML workloads. We also develop an analytic model, based on these empirical observations, that captures the cost/performance tradeoffs one has to consider when deciding between serverless and serverful designs for distributed ML training.
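To make the cost/performance tradeoff mentioned in the abstract concrete, here is a minimal back-of-the-envelope sketch in Python that compares a pay-per-GB-second serverless bill with a pay-per-VM-hour serverful bill for one distributed training job. All prices, resource sizes, and function names are illustrative assumptions for exposition only; this is not the analytic model developed in the paper.

# Illustrative sketch only (not the paper's analytic model): a
# back-of-the-envelope comparison of serverless (billed per GB-second of
# function execution) and serverful (billed per VM-hour) cost for a single
# distributed training job. All prices and job parameters are hypothetical.

def serverless_cost(num_workers: int, runtime_s: float, mem_gb: float,
                    price_per_gb_s: float = 0.0000167) -> float:
    """Cost of num_workers function instances running runtime_s seconds each,
    billed per GB-second (rate loosely modeled on AWS Lambda pricing)."""
    return num_workers * runtime_s * mem_gb * price_per_gb_s

def serverful_cost(num_vms: int, runtime_s: float,
                   price_per_vm_hour: float = 0.40) -> float:
    """Cost of num_vms always-on VMs kept up for the whole job,
    billed per VM-hour (hypothetical on-demand rate)."""
    return num_vms * (runtime_s / 3600.0) * price_per_vm_hour

if __name__ == "__main__":
    runtime = 20 * 60  # a hypothetical 20-minute training run
    print(f"serverless (32 x 3 GB workers): ${serverless_cost(32, runtime, 3.0):.2f}")
    print(f"serverful  (8 VMs)            : ${serverful_cost(8, runtime):.2f}")

Which side wins under such a model depends on how long the workers stay busy and how bursty the workload is, which is exactly the kind of tradeoff the paper's empirical study and analytic model quantify.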




Updated: 2023-09-20