Best Response Shaping
arXiv - CS - Multiagent Systems · Pub Date: 2024-04-05 · DOI: arxiv-2404.06519 · Milad Aghajohari, Tim Cooijmans, Juan Agustin Duque, Shunichi Akatsuka, Aaron Courville
We investigate the challenge of multi-agent deep reinforcement learning in
partially competitive environments, where traditional methods struggle to
foster reciprocity-based cooperation. LOLA and POLA agents learn
reciprocity-based cooperative policies by differentiation through a few
look-ahead optimization steps of their opponent. However, these techniques share
a key limitation: because they consider only a few optimization steps, a
learning opponent that takes many steps to optimize its return may exploit
them. In response, we introduce a novel approach, Best Response Shaping (BRS),
which differentiates through an opponent approximating the best response,
termed the "detective." To condition the detective on the agent's policy in
complex games, we propose a state-aware differentiable conditioning mechanism,
facilitated by a question answering (QA) method that extracts a representation
of the agent based on its behaviour on specific environment states. To
empirically validate our method, we showcase its enhanced performance against a
Monte Carlo Tree Search (MCTS) opponent, which serves as an approximation to
the best response in the Coin Game. This work expands the applicability of
multi-agent RL in partially competitive environments and provides a new pathway
towards achieving improved social welfare in general sum games.
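The core idea can be sketched in a toy model (illustrative only, not the paper's implementation): the agent commits to a reciprocal policy parameterized by a cooperation probability p, a "detective" soft best-responds to p via a tempered sigmoid over its payoff gap, and the agent ascends its own return by differentiating through that response. The payoffs below are the standard prisoner's-dilemma values; the finite-difference gradient stands in for autodiff.

```python
import math

# Toy sketch of Best Response Shaping (BRS) -- an illustrative model, not the
# authors' implementation. The agent plays a reciprocal strategy: cooperate
# with probability p if the opponent cooperated last, otherwise defect.
# Prisoner's-dilemma payoffs: temptation 4, reward 3, punishment 1, sucker 0.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detective_coop_prob(p, temp=0.5):
    # Detective's expected payoff is 3p for cooperating (met with
    # cooperation w.p. p) vs. 1 for defecting (agent retaliates).
    # A tempered sigmoid over the gap is a differentiable proxy for
    # the best response, analogous to the paper's "detective".
    return sigmoid((3.0 * p - 1.0) / temp)

def agent_return(p):
    q = detective_coop_prob(p)
    # Detective cooperates: agent earns 3 w.p. p, 4 w.p. (1 - p).
    # Detective defects: mutual defection, payoff 1.
    return q * (p * 3.0 + (1.0 - p) * 4.0) + (1.0 - q) * 1.0

# BRS-style update: ascend the agent's return *through* the detective's
# response (central finite differences stand in for autodiff here).
p, lr, eps = 0.5, 0.05, 1e-5
for _ in range(500):
    grad = (agent_return(p + eps) - agent_return(p - eps)) / (2.0 * eps)
    p = min(1.0, max(0.0, p + lr * grad))
```

In this toy, p settles well above the 1/3 threshold at which cooperation becomes the detective's best response, so shaping through the best response rewards conditional cooperation rather than unconditional defection.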
Updated: 2024-04-05