Best Response Shaping
arXiv - CS - Multiagent Systems · Pub Date: 2024-04-05 · DOI: arxiv-2404.06519 · Milad Aghajohari, Tim Cooijmans, Juan Agustin Duque, Shunichi Akatsuka, Aaron Courville
We investigate the challenge of multi-agent deep reinforcement learning in
partially competitive environments, where traditional methods struggle to
foster reciprocity-based cooperation. LOLA and POLA agents learn
reciprocity-based cooperative policies by differentiation through a few
look-ahead optimization steps of their opponent. However, these techniques share
a key limitation: because they consider only a few optimization steps, a
learning opponent that takes many steps to optimize its return may exploit
them. In response, we introduce a novel approach, Best Response Shaping (BRS),
which differentiates through an opponent approximating the best response,
termed the "detective." To condition the detective on the agent's policy in
complex games, we propose a state-aware differentiable conditioning mechanism,
facilitated by a question answering (QA) method that extracts a representation
of the agent based on its behaviour on specific environment states. To
empirically validate our method, we showcase its enhanced performance against a
Monte Carlo Tree Search (MCTS) opponent, which serves as an approximation to
the best response in the Coin Game. This work expands the applicability of
multi-agent RL in partially competitive environments and provides a new pathway
towards achieving improved social welfare in general sum games.
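The core idea can be sketched in a toy model (illustrative only, not the paper's implementation): the agent commits to a reciprocal policy parameterized by a cooperation probability p, a "detective" soft best-responds to p via a tempered sigmoid over its payoff gap, and the agent ascends its own return by differentiating through that response. The payoffs below are the standard prisoner's-dilemma values; the finite-difference gradient stands in for autodiff.

```python
import math

# Toy sketch of Best Response Shaping (BRS) -- an illustrative model, not the
# authors' implementation. The agent plays a reciprocal strategy: cooperate
# with probability p if the opponent cooperated last, otherwise defect.
# Prisoner's-dilemma payoffs: temptation 4, reward 3, punishment 1, sucker 0.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detective_coop_prob(p, temp=0.5):
    # Detective's expected payoff is 3p for cooperating (met with
    # cooperation w.p. p) vs. 1 for defecting (agent retaliates).
    # A tempered sigmoid over the gap is a differentiable proxy for
    # the best response, analogous to the paper's "detective".
    return sigmoid((3.0 * p - 1.0) / temp)

def agent_return(p):
    q = detective_coop_prob(p)
    # Detective cooperates: agent earns 3 w.p. p, 4 w.p. (1 - p).
    # Detective defects: mutual defection, payoff 1.
    return q * (p * 3.0 + (1.0 - p) * 4.0) + (1.0 - q) * 1.0

# BRS-style update: ascend the agent's return *through* the detective's
# response (central finite differences stand in for autodiff here).
p, lr, eps = 0.5, 0.05, 1e-5
for _ in range(500):
    grad = (agent_return(p + eps) - agent_return(p - eps)) / (2.0 * eps)
    p = min(1.0, max(0.0, p + lr * grad))
```

In this toy, p settles well above the 1/3 threshold at which cooperation becomes the detective's best response, so shaping through the best response rewards conditional cooperation rather than unconditional defection.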
Updated: 2024-04-05