Dynamically Interrupting Deadlocks in Game Learning Using Multisampling Multiarmed Bandits
IEEE Transactions on Games (IF 2.3), Pub Date: 2022-05-24, DOI: 10.1109/tg.2022.3177598
Rendong Chen, Fa Wu

In many reinforcement learning (RL) game tasks, an episode should be interrupted after a certain time, because the agent can fall into a deadlock state. The learning process is sensitive to the interruption length, so it is hard to determine an optimal interruption value. This article presents a novel multiarmed bandit (MAB) model that dynamically interrupts deadlock states in reinforcement game learning, under the assumption that there is neither prior knowledge of the optimal interruption setting nor prior knowledge that can be used to improve the performance of the RL agent. The proposed MAB model is a nonoblivious adversarial MAB problem with a multisampling process in each round. An efficient algorithm named Exp3.P.MS is proposed for the new bandit setting, achieving an expected regret bound of $\mathcal {O}(\sqrt{nK\ln {(K)}})$. We run the algorithm on Sokoban. The experimental results show that the dynamic interruptions can adapt to the weak-to-strong performance of the RL agent and spur fast learning in game training.
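The abstract does not give the Exp3.P.MS update rules, so the sketch below only illustrates the general setup it describes: an Exp3.P-style adversarial bandit whose arms are candidate episode-interruption lengths and which draws several arms per round (the multisampling step). The exploration rate, confidence bonus, weight update, and the placeholder episode reward are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np


class Exp3PStyleMultisampleBandit:
    """Exp3.P-style adversarial bandit that draws several arms per round.

    Sketch only: the exact Exp3.P.MS updates are not in the abstract, so the
    parameters and update rule follow the classic Exp3.P recipe as an assumption.
    """

    def __init__(self, arms, horizon, samples_per_round=4, delta=0.05):
        self.arms = list(arms)                              # candidate interruption lengths
        K, n = len(self.arms), horizon
        self.gamma = min(0.6, np.sqrt(K * np.log(K) / n))   # exploration rate (assumed)
        self.alpha = 2.0 * np.sqrt(np.log(K * n / delta))   # optimism bonus (assumed)
        self.weights = np.ones(K)
        self.m = samples_per_round

    def _probs(self):
        K = len(self.arms)
        w = self.weights / self.weights.sum()
        return (1.0 - self.gamma) * w + self.gamma / K

    def select(self):
        """Multisampling step: draw m arm indices from the current distribution."""
        p = self._probs()
        idx = np.random.choice(len(self.arms), size=self.m, p=p)
        return idx, p

    def update(self, chosen, rewards, p):
        """Exp3.P-style update with importance-weighted rewards averaged over m samples."""
        K = len(self.arms)
        xhat = np.zeros(K)
        for i, r in zip(chosen, rewards):
            xhat[i] += r / (self.m * p[i])
        bonus = self.alpha / (p * np.sqrt(K * self.m))
        self.weights *= np.exp(self.gamma / (3.0 * K) * (xhat + bonus))
        self.weights /= self.weights.max()                  # rescale to avoid overflow


# Hypothetical use in an RL loop: each sampled arm caps one episode's length,
# and the episode return (rescaled to [0, 1]) is fed back as the bandit reward.
bandit = Exp3PStyleMultisampleBandit(arms=[50, 100, 200, 400], horizon=10_000)
for round_id in range(10_000):
    chosen, p = bandit.select()
    # Placeholder rewards; in practice these would come from running Sokoban
    # episodes with max length bandit.arms[i] for each sampled arm i.
    rewards = [np.random.rand() for _ in chosen]
    bandit.update(chosen, rewards, p)
```

In this reading, the bandit's feedback loop lets the interruption length track the agent's weak-to-strong performance during training, rather than fixing a single cutoff in advance.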

Update date: 2022-05-24