Deep reinforcement learning for approximate policy iteration: convergence analysis and a post-earthquake disaster response case study
Optimization Letters (IF 1.6), Pub Date: 2023-09-23, DOI: 10.1007/s11590-023-02062-0
A. Gosavi, L. H. Sneed, L. A. Spearing

Approximate policy iteration (API) is a class of reinforcement learning (RL) algorithms that seek to solve the long-run discounted-reward Markov decision process (MDP) via the policy iteration paradigm, without learning the transition model in the underlying Bellman equation. Unfortunately, these algorithms suffer from a defect known as chattering, in which the solution (policy) delivered in each iteration of the algorithm oscillates between improved and worsened policies, leading to sub-optimal behavior. Two causes for this, traced to the crucial policy improvement step, are: (i) inaccuracies in the policy improvement function and (ii) the exploration/exploitation tradeoff integral to this step, which generates variability in performance. Both of these defects are amplified by simulation noise. Deep RL belongs to a newer class of algorithms in which the resolution of the learning process is refined via mechanisms such as experience replay and/or deep neural networks for improved performance. In this paper, a new deep learning approach is developed for API that employs a more accurate policy improvement function, via an enhanced-resolution Bellman equation, thereby reducing chattering and eliminating the need for exploration in the policy improvement step. Versions of the new algorithm are presented for both the long-run discounted MDP and the semi-MDP. Convergence properties of the new algorithm are studied mathematically, and a post-earthquake disaster response case study is employed to demonstrate the algorithm's efficacy numerically.
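For context, the sketch below shows the standard Bellman equation for a fixed policy in a long-run discounted-reward MDP, together with the greedy policy improvement step that API approximates from simulation data. The notation (states i, j; actions a; transition probabilities p(i,a,j); rewards r(i,a,j); discount factor λ; policy μ) is the generic textbook formulation and is assumed here for illustration only; it is not the paper's enhanced-resolution variant.

\[
h^{\mu}(i) \;=\; \sum_{j} p\bigl(i,\mu(i),j\bigr)\,\Bigl[r\bigl(i,\mu(i),j\bigr) + \lambda\, h^{\mu}(j)\Bigr] \quad \text{for all states } i \quad \text{(policy evaluation)},
\]
\[
\mu_{k+1}(i) \;\in\; \arg\max_{a} \sum_{j} p(i,a,j)\,\Bigl[r(i,a,j) + \lambda\, h^{\mu_k}(j)\Bigr] \quad \text{(policy improvement)}.
\]

In model-free API the transition probabilities p(i,a,j) are unavailable, so both the evaluation and improvement quantities must be estimated from simulated transitions; this is where the inaccuracies and exploration-induced variability described in the abstract enter.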



Updated: 2023-09-24