Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2023-11-20, DOI: 10.1109/tpds.2023.3334519
Tiangang Li 1 , Shi Ying 1 , Yishi Zhao 2 , Jianga Shang 3

In cloud computing, allocating computing resources to batch jobs so that dynamic clusters remain load-balanced and user requests are met is an important and challenging task. Most existing studies are based on the deep Q-network (DQN), which uses neural networks to estimate the expected value of the cumulative return in the scheduling process. Value-based DQN algorithms ignore the full information contained in the value distribution and adapt poorly to time-varying batch jobs and dynamic cluster resources. Therefore, to capture the stochasticity that the environment induces in the scheduling process, we use Distributional Reinforcement Learning to model the value distribution of the cumulative return. Specifically, we formalize load balancing scheduling as a multi-objective optimization problem and construct a Distributional Reinforcement Learning model. We then introduce quantile regression to learn the value distribution of the cumulative return during scheduling and propose a dynamic load balancing scheduling algorithm based on Distributional Reinforcement Learning. In addition, we develop a cluster environment for real-time processing of batch jobs to simulate the arrival of batch jobs and train the Distributional Reinforcement Learning-based scheduling agent. We conduct empirical experiments and detailed analysis using the real Alibaba cluster traces v2018 and v2020. The results show that, compared to the baseline algorithms, the proposed algorithm performs better in terms of cluster load balancing, success rate of instance creation, and average completion time of the tasks. The experimental results on different trace datasets also indicate that the proposed algorithm exhibits excellent scalability.
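The abstract does not give the loss function or network details, so the following is only a minimal sketch of how quantile regression is commonly used to learn a return distribution, in the style of QR-DQN. The function names `quantile_huber_loss` and `greedy_action`, the threshold `kappa`, and the use of quantile midpoints are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def quantile_huber_loss(pred, target, kappa=1.0):
    """Quantile Huber loss between predicted quantiles and target samples.

    pred:   shape (N,), predicted quantile values of the return for one
            state-action pair, at midpoints tau_i = (i + 0.5) / N.
    target: shape (M,), samples of the target return distribution.
    kappa:  Huber threshold (assumed hyperparameter).
    """
    n = pred.shape[0]
    tau = (np.arange(n) + 0.5) / n                 # quantile midpoints
    u = target[None, :] - pred[:, None]            # pairwise errors, shape (N, M)
    abs_u = np.abs(u)
    # Huber loss: quadratic near zero, linear beyond kappa.
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    # Asymmetric weight: overestimates and underestimates are penalized
    # differently depending on which quantile is being fitted.
    weight = np.abs(tau[:, None] - (u < 0).astype(float))
    # Average over target samples, sum over predicted quantiles.
    return float((weight * huber).mean(axis=1).sum())


def greedy_action(quantiles):
    """Pick the action whose quantile estimates have the highest mean.

    quantiles: shape (num_actions, N), one row of quantile values per action.
    """
    return int(np.argmax(quantiles.mean(axis=1)))
```

In this sketch the agent still acts on the mean of each action's quantiles, as an expected-value method would, but the training signal comes from the whole set of quantiles rather than a single expected return.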

Updated: 2023-11-20