Convergence of Policy Gradient Methods for Finite-Horizon Exploratory Linear-Quadratic Control Problems
SIAM Journal on Control and Optimization (IF 2.2), Pub Date: 2024-03-22, DOI: 10.1137/22m1533517
Michael Giegrich, Christoph Reisinger, Yufei Zhang
SIAM Journal on Control and Optimization, Volume 62, Issue 2, Page 1060-1092, April 2024.
Abstract. We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularizers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. Contrary to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures–Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a priori bound, and converge globally to the optimal policy with a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis, and achieves a robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
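The abstract's geometry-aware updates can be illustrated on a toy surrogate cost (this sketch is not the paper's exact scheme; the cost, the feedback gain `K`, the covariance `Sigma`, and the step size `eta` are all hypothetical). The mean update preconditions the Euclidean gradient with the policy covariance, as the Fisher geometry of a Gaussian mean suggests, while the covariance update uses the Bures–Wasserstein step `Sigma+ = (I - eta*G) Sigma (I - eta*G)`, which keeps the iterate symmetric positive definite.

```python
import numpy as np

def cost(K, Sigma, Q, R):
    """Toy surrogate: quadratic state-feedback cost plus an
    entropy-regularized covariance penalty trace(R Sigma) - log det Sigma."""
    return (np.trace(K.T @ R @ K @ Q)
            + np.trace(R @ Sigma)
            - np.log(np.linalg.det(Sigma)))

def fisher_step_mean(K, Sigma, Q, R, eta):
    """Fisher-preconditioned step for the mean parameter: the Euclidean
    gradient in K is multiplied by Sigma (the inverse Fisher metric of
    the Gaussian mean)."""
    grad_K = 2 * R @ K @ Q
    return K - eta * Sigma @ grad_K

def bw_step_cov(Sigma, R, eta):
    """Bures-Wasserstein gradient step for the covariance:
    Sigma+ = (I - eta G) Sigma (I - eta G), with G the Euclidean
    gradient of the surrogate in Sigma; preserves symmetry and
    positive definiteness."""
    G = R - np.linalg.inv(Sigma)  # gradient of trace(R Sigma) - log det Sigma
    M = np.eye(len(Sigma)) - eta * G
    return M @ Sigma @ M.T
```

Running both updates jointly on this surrogate drives the cost down while every covariance iterate remains a valid (positive-definite) Gaussian covariance, mirroring the a priori boundedness that the paper establishes for its policy iterates.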


Updated: 2024-03-23