Learning Stationary Nash Equilibrium Policies in $n$-Player Stochastic Games with Independent Chains
SIAM Journal on Control and Optimization (IF 2.2), Pub Date: 2024-03-01, DOI: 10.1137/22m1512880
S. Rasoul Etesami

SIAM Journal on Control and Optimization, Volume 62, Issue 2, Page 799-825, April 2024.
Abstract. We consider a subclass of $n$-player stochastic games, in which players have their own internal state/action spaces while they are coupled through their payoff functions. It is assumed that players' internal chains are driven by independent transition probabilities. Moreover, players can receive only realizations of their payoffs, not the actual functions, and cannot observe each other's states/actions. For this class of games, we first show that finding a stationary Nash equilibrium (NE) policy without any assumption on the reward functions is intractable. However, for general reward functions, we develop polynomial-time learning algorithms based on dual averaging and dual mirror descent, which converge in terms of the averaged Nikaido–Isoda distance to the set of $\epsilon$-NE policies almost surely or in expectation. In particular, under extra assumptions on the reward functions such as social concavity, we derive polynomial upper bounds on the number of iterates to achieve an $\epsilon$-NE policy with high probability. Finally, we evaluate the effectiveness of the proposed algorithms in learning $\epsilon$-NE policies using numerical experiments for energy management in smart grids.
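To make the convergence criterion concrete, a standard form of the Nikaido–Isoda (NI) gap for an $n$-player game with payoff functions $u_i$ and joint policy $\pi = (\pi_1, \dots, \pi_n)$ is

\[ \mathrm{NI}(\pi) \;=\; \sum_{i=1}^{n} \Big( \max_{\pi_i' \in \Delta_i} u_i(\pi_i', \pi_{-i}) \;-\; u_i(\pi_i, \pi_{-i}) \Big) \;\ge\; 0, \]

where $\Delta_i$ denotes player $i$'s set of stationary policies. Since every summand is nonnegative, $\mathrm{NI}(\pi) \le \epsilon$ certifies that $\pi$ is an $\epsilon$-NE: no player can gain more than $\epsilon$ by a unilateral deviation. The "averaged" NI distance in the paper refers to a specific averaging of this quantity over iterates, which is not reproduced here. For intuition about the learning dynamics, the generic dual-averaging template for player $i$ takes the form

\[ \pi_i^{t+1} \;=\; \arg\max_{\pi_i \in \Delta_i} \Big\{ \Big\langle \textstyle\sum_{s=1}^{t} \hat g_i^{\,s},\, \pi_i \Big\rangle \;-\; \tfrac{1}{\eta_t}\, \psi(\pi_i) \Big\}, \]

where $\hat g_i^{\,s}$ is a payoff-gradient estimate built from realized payoffs, $\psi$ is a strongly convex regularizer, and $\eta_t$ is a step size; choosing $\psi$ as the negative entropy gives the mirror-descent-style counterpart. Both displays are sketches of the standard constructions, not the paper's exact definitions or its payoff-based estimator.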


Updated: 2024-03-01