A lightweight performance proxy for deep‐learning model training on Amazon SageMaker, Concurrency and Computation: Practice and Experience


Concurrency and Computation: Practice and Experience (IF 2), Pub Date: 2024-04-08, DOI: 10.1002/cpe.8104
Rafael Keller Tesser (1, 2, 3), Alvaro Marques (2), Edson Borin (2)


Summary

Cloud computing has become popular for training deep‐learning (DL) models, avoiding the costs of acquiring and maintaining on‐premise systems. SageMaker is a cloud service that automates the execution of DL workloads. Its features include automatic hyperparameter optimization and use of spot instances. Nonetheless, it does not assist in selecting the right instance type for a workload. In public clouds, rent price depends on the configuration of the chosen instance type. Advanced and faster instances are typically more expensive, but not always the best choice. To select the optimal instance type, users must compare the workload's relative performance (and hence cost) on several candidates. Building on the execution profiles of multiple DL applications, we model the performance and cost of training DL applications on SageMaker and propose a lightweight technique to estimate these at low temporal and monetary cost. This method is a performance proxy that can be used to replace more expensive performance measurement procedures. So, it could speed up any technique that relies on such measurements. We show how it can help cloud customers seeking suitable instance types to train DL models, and that it can accurately predict the performance of different instance types when training these models on SageMaker.
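
The trade-off between instance speed and rent price described in the summary can be illustrated with a small sketch. It is not the authors' model or proxy technique: it only assumes that training cost is roughly training time multiplied by an hourly price, and every candidate name, price, and epoch time below is a hypothetical placeholder rather than a real SageMaker figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical instance label, not a real instance type
    price_per_hour: float  # hypothetical rent price in USD per hour
    secs_per_epoch: float  # epoch time measured or estimated for this workload


def training_cost(c: Candidate, epochs: int) -> float:
    """Estimated cost in USD: training hours multiplied by the hourly price."""
    hours = c.secs_per_epoch * epochs / 3600.0
    return hours * c.price_per_hour


def cheapest(candidates: list[Candidate], epochs: int) -> Candidate:
    """Return the candidate with the lowest estimated training cost."""
    return min(candidates, key=lambda c: training_cost(c, epochs))


# Placeholder values chosen only to illustrate the comparison.
candidates = [
    Candidate("small", price_per_hour=1.0, secs_per_epoch=360.0),
    Candidate("medium", price_per_hour=3.0, secs_per_epoch=180.0),
    Candidate("large", price_per_hour=4.0, secs_per_epoch=72.0),
]

for c in candidates:
    print(f"{c.name}: {training_cost(c, epochs=10):.2f} USD")
```

With these placeholder numbers, the mid-priced candidate is faster than the cheapest one yet costs more overall, while the most expensive candidate per hour ends up cheapest for the whole run. A low-cost estimate of per-instance performance, which is the role the summary assigns to the proposed proxy, would supply the `secs_per_epoch` values without running full training jobs on every candidate.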


Updated: 2024-04-08
