A Stochastic Approach for Scheduling AI Training Jobs in GPU-Based Systems, IEEE Transactions on Cloud Computing


IEEE Transactions on Cloud Computing (IF 6.5), Pub Date: 2023-11-24, DOI: 10.1109/tcc.2023.3336540
Federica Filippini₁, Jonatha Anselmi₂, Danilo Ardagna₁, Bruno Gaujal₂


In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which efficiently selects resources for the execution of each job to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Because computing an optimal solution is prohibitively expensive, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. An extensive experimental evaluation shows that STS achieves significantly better results than other methods in the literature, effectively avoiding due date violations and reducing the total cost by between 32% and 80% on average. We also demonstrate the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluate the effectiveness of GPU sharing, i.e., running multiple jobs on a single GPU. The results show that, depending on the workload and GPU memory, this further reduces the energy cost by 17–29% on average.
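The idea of weighing energy cost by an early-termination distribution can be illustrated with a toy calculation. The sketch below is not the paper's STS algorithm or its MINLP model. It is a minimal illustration that assumes a single job, a discrete distribution over the epoch at which training stops, and two hypothetical GPU configurations with made-up epoch times and power draws. All names (`Config`, `expected_energy_j`, `pick_config`) and all numbers are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """Hypothetical resource configuration (values are illustrative only)."""
    name: str
    epoch_time_s: float  # assumed time to run one epoch on this configuration
    power_w: float       # assumed average power draw while the job runs


def expected_energy_j(config: Config, stop_probs: Sequence[float]) -> float:
    """Expected energy in joules.

    stop_probs[k-1] is the assumed probability that training terminates
    right after epoch k, so the job runs for k epochs with that probability.
    """
    return sum(p * k * config.epoch_time_s * config.power_w
               for k, p in enumerate(stop_probs, start=1))


def worst_case_time_s(config: Config, stop_probs: Sequence[float]) -> float:
    """Run time if the job never stops early (it runs every epoch)."""
    return len(stop_probs) * config.epoch_time_s


def pick_config(configs: Sequence[Config],
                stop_probs: Sequence[float],
                due_s: float) -> Optional[Config]:
    """Cheapest configuration in expectation that still meets the due date
    in the worst case; None if no configuration is feasible."""
    feasible = [c for c in configs if worst_case_time_s(c, stop_probs) <= due_s]
    if not feasible:
        return None
    return min(feasible, key=lambda c: expected_energy_j(c, stop_probs))


# Illustrative data: the job stops after epoch 1, 2, or 3.
stop_probs = [0.2, 0.3, 0.5]
configs = [
    Config("slow", epoch_time_s=100.0, power_w=200.0),
    Config("fast", epoch_time_s=40.0, power_w=600.0),
]

for due in (350.0, 200.0, 100.0):
    chosen = pick_config(configs, stop_probs, due)
    print(due, chosen.name if chosen else "no feasible configuration")
```

With these made-up numbers the slower configuration is cheaper in expectation, so it is preferred whenever the due date allows it, and the faster one is chosen only when the deadline is tight. A real scheduler would also have to handle many concurrent jobs, shared nodes, and reassignment during execution, none of which this sketch attempts.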


Updated: 2023-11-24
