Convergence of Policy Gradient Methods for Finite-Horizon Exploratory Linear-Quadratic Control Problems
SIAM Journal on Control and Optimization (IF 2.2), Pub Date: 2024-03-22, DOI: 10.1137/22m1533517
Michael Giegrich, Christoph Reisinger, Yufei Zhang
SIAM Journal on Control and Optimization, Volume 62, Issue 2, Page 1060-1092, April 2024.
Abstract. We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularizers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. Contrary to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures–Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a priori bound, and converge globally to the optimal policy with a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis, and achieves a robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
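The abstract's geometry-aware updates can be illustrated on a toy surrogate cost (this sketch is not the paper's exact scheme; the cost, the feedback gain `K`, the covariance `Sigma`, and the step size `eta` are all hypothetical). The mean update preconditions the Euclidean gradient with the policy covariance, as the Fisher geometry of a Gaussian mean suggests, while the covariance update uses the Bures–Wasserstein step `Sigma+ = (I - eta*G) Sigma (I - eta*G)`, which keeps the iterate symmetric positive definite.

```python
import numpy as np

def cost(K, Sigma, Q, R):
    """Toy surrogate: quadratic state-feedback cost plus an
    entropy-regularized covariance penalty trace(R Sigma) - log det Sigma."""
    return (np.trace(K.T @ R @ K @ Q)
            + np.trace(R @ Sigma)
            - np.log(np.linalg.det(Sigma)))

def fisher_step_mean(K, Sigma, Q, R, eta):
    """Fisher-preconditioned step for the mean parameter: the Euclidean
    gradient in K is multiplied by Sigma (the inverse Fisher metric of
    the Gaussian mean)."""
    grad_K = 2 * R @ K @ Q
    return K - eta * Sigma @ grad_K

def bw_step_cov(Sigma, R, eta):
    """Bures-Wasserstein gradient step for the covariance:
    Sigma+ = (I - eta G) Sigma (I - eta G), with G the Euclidean
    gradient of the surrogate in Sigma; preserves symmetry and
    positive definiteness."""
    G = R - np.linalg.inv(Sigma)  # gradient of trace(R Sigma) - log det Sigma
    M = np.eye(len(Sigma)) - eta * G
    return M @ Sigma @ M.T
```

Running both updates jointly on this surrogate drives the cost down while every covariance iterate remains a valid (positive-definite) Gaussian covariance, mirroring the a priori boundedness that the paper establishes for its policy iterates.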


Updated: 2024-03-23