Modeling and reinforcement learning in partially observable many-agent systems
Autonomous Agents and Multi-Agent Systems (IF 1.9) Pub Date: 2024-03-26, DOI: 10.1007/s10458-024-09640-1
Keyang He, Prashant Doshi, Bikramjit Banerjee

Abstract

Many multiagent reinforcement learning (MARL) methods engage in centralized training. These methods rely on all the agents sharing various types of information, such as their actions or gradients, with a centralized trainer or with each other during learning. Consequently, the methods produce agent policies whose prescriptions and performance are contingent on the other agents behaving as assumed by the centralized training. But in many contexts, such as mixed or adversarial settings, this assumption may not hold. In this article, we present a new line of methods that relaxes this assumption and engages in decentralized training to produce each agent's individual policy. The interactive advantage actor-critic (IA2C) maintains and updates beliefs over other agents' candidate behaviors based on (noisy) observations, thus enabling learning at the agent's own level. We also address MARL's prohibitive curse of dimensionality due to the presence of many agents in the system. Under assumptions of action anonymity and population homogeneity, which are often exhibited in practice, large numbers of other agents can be modeled aggregately by the count vectors of their actions instead of by individual agent models. More importantly, we may model the distribution of these vectors and its update using the Dirichlet-multinomial model, which offers an elegant way to scale IA2C to many-agent systems. We evaluate the performance of the fully decentralized IA2C and several known baselines on a novel Organization domain, which we introduce, and on instances of two existing domains. Experimental comparisons with prominent and recent baselines show that IA2C is more sample efficient, more robust to noise, and can scale to learning in systems with up to a hundred agents.
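
To make the aggregation idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of how a Dirichlet-multinomial model can summarize many other agents under action anonymity: the population is represented by a count vector over actions, a Dirichlet prior over the per-agent action distribution is updated conjugately from (possibly noisy) observed counts, and future count vectors are predicted by sampling. All names (alpha, observed_counts, predict_count_vector) and the specific numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

num_actions = 3            # assumed size of the action set
num_other_agents = 100     # population modeled aggregately, not individually

# Dirichlet concentration parameters: one pseudo-count per action (uniform prior).
alpha = np.ones(num_actions)

def update_belief(alpha, observed_counts):
    """Conjugate update: add observed action counts to the Dirichlet parameters."""
    return alpha + observed_counts

def expected_action_distribution(alpha):
    """Posterior mean of the population's per-agent action probabilities."""
    return alpha / alpha.sum()

def predict_count_vector(alpha, n_agents, rng):
    """Sample a plausible action count vector for n_agents (Dirichlet-multinomial)."""
    p = rng.dirichlet(alpha)
    return rng.multinomial(n_agents, p)

rng = np.random.default_rng(0)
# Suppose at one step we observe how many of the 100 agents took each action.
observed_counts = np.array([60, 30, 10])
alpha = update_belief(alpha, observed_counts)
print(expected_action_distribution(alpha))                  # posterior mean over actions
print(predict_count_vector(alpha, num_other_agents, rng))   # predicted counts for the next step
```

In a learner like IA2C, such a predicted count vector could stand in for explicit models of each of the other agents when estimating values, which is what keeps the approach tractable as the population grows; the sketch above only illustrates the belief-update arithmetic, not the actor-critic itself.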



Updated: 2024-03-27