Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning
SIAM Journal on Control and Optimization (IF 2.2) Pub Date: 2024-01-17, DOI: 10.1137/22m1515744
Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang

SIAM Journal on Control and Optimization, Volume 62, Issue 1, Page 135-166, February 2024.
Abstract. This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning, but, on the other hand, they introduce bias by assigning a positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite horizon continuous-time linear-quadratic (LQ) RL problem, where a linear dynamics with unknown drift coefficients is controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order √N (up to a logarithmic factor) over N episodes, matching the best known result from the literature.
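To make the execution model concrete, the sketch below shows one way to roll out a Gaussian relaxed policy on a discretized scalar LQ system, drawing a fresh control noise at every time step (the independence across time emphasized in the abstract) and decaying the entropy-regularization strength across episodes. It is a minimal illustration, not the authors' algorithm: the feedback gain, the dynamics coefficients (a, b, sigma), the cost weights, and the regularization schedule are all hypothetical placeholders.

```python
# Minimal, hypothetical sketch: executing a Gaussian relaxed policy on a
# discretized scalar LQ system. All numerical values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Scalar LQ dynamics dX_t = (a X_t + b u_t) dt + sigma dW_t with quadratic costs.
a, b, sigma = 0.5, 1.0, 0.2
T, n_steps = 1.0, 200
dt = T / n_steps

def gaussian_relaxed_policy(x, rho):
    """Entropy-regularized relaxed policy: a Gaussian whose mean is a linear
    feedback -k * x and whose variance scales with the regularization
    strength rho (constant placeholder gain, purely for illustration)."""
    k = 0.8
    return -k * x, rho

def run_episode(rho):
    """Roll out one episode, sampling an independent control noise at each
    time step, as the regret analysis requires."""
    x, cost = 1.0, 0.0
    for _ in range(n_steps):
        mean, var = gaussian_relaxed_policy(x, rho)
        u = mean + np.sqrt(var) * rng.standard_normal()    # fresh noise each step
        cost += (x**2 + u**2) * dt                         # running quadratic cost
        x += (a * x + b * u) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return cost + x**2                                     # terminal cost

# Schedule the regularization strength over episodes: strong exploration early,
# decaying so the bias from nonoptimal actions vanishes over time.
for episode, rho in enumerate([1.0, 0.5, 0.1, 0.01]):
    print(f"episode {episode}: rho={rho:.2f}, cost={run_episode(rho):.3f}")
```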


Updated: 2024-01-18