
Deep reinforcement learning for approximate policy iteration: convergence analysis and a post-earthquake disaster response case study

  • Original Paper
  • Optimization Letters

Abstract

Approximate policy iteration (API) is a class of reinforcement learning (RL) algorithms that seek to solve the long-run discounted reward Markov decision process (MDP), via the policy iteration paradigm, without learning the transition model in the underlying Bellman equation. Unfortunately, these algorithms suffer from a defect known as chattering in which the solution (policy) delivered in each iteration of the algorithm oscillates between improved and worsened policies, leading to sub-optimal behavior. Two causes for this that have been traced to the crucial policy improvement step are: (i) the inaccuracies in the policy improvement function and (ii) the exploration/exploitation tradeoff integral to this step, which generates variability in performance. Both of these defects are amplified by simulation noise. Deep RL belongs to a newer class of algorithms in which the resolution of the learning process is refined via mechanisms such as experience replay and/or deep neural networks for improved performance. In this paper, a new deep learning approach is developed for API which employs a more accurate policy improvement function, via an enhanced resolution Bellman equation, thereby reducing chattering and eliminating the need for exploration in the policy improvement step. Versions of the new algorithm for both the long-run discounted MDP and semi-MDP are presented. Convergence properties of the new algorithm are studied mathematically, and a post-earthquake disaster response case study is employed to demonstrate numerically the algorithm’s efficacy.


References

  1. Bertsekas, D.P.: Dynamic Programming and Optimal Control, 4th edn. Athena, Belmont (2012)

  2. Bertsekas, D.P.: Feature-based aggregation and deep reinforcement learning: a survey and some new implementations. IEEE/CAA J Autom Sin 6(1), 1–31 (2018)

  3. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena, Belmont (2019)

  4. Bertsekas, D.P.: Rollout, Policy Iteration, and Distributed Reinforcement Learning. Athena, Belmont (2021)

  5. Bertsekas, D.P., Castanon, D.A.: Rollout algorithms for stochastic scheduling problems. J Heuristics 5(1), 89–108 (1999)

  6. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena, Belmont (1996)

  7. Bradtke, S.J., Duff, M.: Reinforcement learning methods for continuous-time Markov decision problems. In: Advances in Neural Information Processing Systems 7. MIT Press, Cambridge (1995)

  8. Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton (2010)

  9. Cao, X.R.: Stochastic Learning and Optimization: A Sensitivity-Based View. Springer, Berlin (2007)

  10. Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I.: Simulation-Based Algorithms for Markov Decision Processes, 2nd edn. Springer, NY (2013)

  11. Chang, H.S., Lee, H.-G., Fu, M.C., Marcus, S.: Evolutionary policy iteration for solving Markov decision processes. IEEE Trans. Autom. Control 50(11), 1804–1808 (2005)

  12. de la Torre, L.E., Dolinskaya, I.S., Smilowitz, K.R.: Disaster relief routing: integrating research and practice. Socioecon. Plann. Sci. 46(1), 88–97 (2012)

  13. FEMA: 2022-2026 strategic plan. https://www.fema.gov/about/strategic-plan (2023)

  14. Fern, A., Yoon, S., Givan, R.: Approximate policy iteration with a policy language bias: solving relational Markov decision processes. J Artif Intell Res 25, 75–118 (2006)

  15. Fraioli, G., Gosavi, A., Sneed, L.H.: Strategic implications for civil infrastructure and logistical support systems in postearthquake disaster management: the case of St. Louis. IEEE Eng Manag Rev 49(1), 165–173 (2020)

  16. Ghosh, S., Gosavi, A.: A semi-Markov model for post-earthquake emergency response in a smart city. Control Theory Technol 15(1), 13–25 (2017)

  17. Gosavi, A.: A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis. Mach. Learn. 55, 5–29 (2004)

  18. Gosavi, A.: Boundedness of iterates in \({Q}\)-learning. Syst. Control Lett. 55, 347–349 (2006)

  19. Gosavi, A.: On step-sizes, stochastic paths, and survival probabilities in reinforcement learning. In: Proceedings of the Winter Simulation Conference. IEEE (2008)

  20. Gosavi, A.: Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2nd edn. Springer, NY (2015)

  21. Gosavi, A., Fraioli, G., Sneed, L.H., Tasker, N.: Discrete-event-based simulation model for performance evaluation of post-earthquake restoration in a smart city. IEEE Trans. Eng. Manag. 67(3), 582–592 (2019)

  22. Hoffman, M., de Freitas, N.: Inference Strategies for Solving Semi-Markov Decision Processes. In Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, pp. 82–96. IGI Global, Pennsylvania (2012)

  23. Howard, R.: Dynamic Programming and Markov Processes. MIT Press, Cambridge (1960)

  24. Matsui, T., Goto, T., Izumi, K., Chen, Y.: Compound reinforcement learning: theory and an application to finance. In European workshop on reinforcement learning, pp. 321–332. Springer (2011)

  25. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.C., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  26. Puterman, M.L.: Markov Decision Processes. Wiley, NY (1994)

  27. Puterman, M.L., Shin, M.C.: Modified policy iteration for discounted Markov decision problems. Manage. Sci. 24, 1127–1137 (1978)

  28. Scherrer, B.: Performance bounds for \(\lambda \) policy iteration and application to the game of tetris. J. Mach. Learn. Res. 14(4), (2013)

  29. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T.: A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)

  30. Sutton, R., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. The MIT Press, Cambridge (2018)

  31. Tadepalli, P., Ok, D.: Model-based average reward reinforcement learning algorithms. Artif. Intell. 100, 177–224 (1998)

  32. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double \(Q\)-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30(1) (2016)

  33. van Nunen, J.A.E.E.: A set of successive approximation methods for discounted Markovian decision problems. Z. Operat. Res. 20, 203–208 (1976)

  34. van Seijen, H., Whiteson, S., van Hasselt, H., Wiering, M.: Exploiting best-match equations for efficient reinforcement learning. J. Mach. Learn. Res. 12, 2045–2094 (2011)

  35. Yoshida, W., Ishii, S.: Model-based reinforcement learning: a computational model and an fMRI study. Neurocomputing 63, 253–269 (2005)

Acknowledgements

The paper has benefitted significantly from the suggestions of the Associate Editor and the reviewers. The authors thank the first reviewer for revisions leading to performance comparisons with CAP, the second reviewer for the Robbins-Monro analysis, and the third reviewer for the added numerical details in the first computational test.

Corresponding author

Correspondence to A. Gosavi.


Appendix

Proof

(Lemma 1) Note that since \(\gamma >0\) and \(t(.,.,.)>0\), there exists a \(\lambda \in (0,1)\) such that:

$$\begin{aligned} \max _{i,j\in \mathcal {S}; a\in \mathcal {A}(i)}\exp (-\gamma t(i,a,j))\le \lambda . \end{aligned}$$
(11)

From the definition of \(G(\cdot )\) in Eq. (6), for any two vectors \(\textbf{J}^k\) and \(\mathbf {\overline{J}}^k\):

$$\begin{aligned} G(J^k)(i)-G(\overline{J}^k)(i)=\sum _{j \in \mathcal {S}}p(i,\mu (i),j) \exp (-\gamma t(i,\mu (i),j))\left[ J^k(j)-\overline{J}^k(j)\right] ~~~\forall i. \end{aligned}$$
$$\begin{aligned} \text{ Then, }\ \forall i:\left| G(J^k)(i)-G(\overline{J}^k)(i)\right|&\le \sum _{j \in \mathcal {S}}p(i,\mu (i),j) \exp (-\gamma t(i,\mu (i),j))\max _{j\in \mathcal {S}}\left| J^k(j)-\overline{J}^k(j)\right| \\&\le \sum _{j \in \mathcal {S}}p(i,\mu (i),j)\,\lambda \max _{j \in \mathcal {S}}\left| J^k(j) -\overline{J}^k(j)\right| \quad \text{ from } \text{ Eq. } (11)\\&= \sum _{j \in \mathcal {S}}p(i,\mu (i),j)\,\lambda \,||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty }\\&=\lambda ||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty } \sum _{j \in \mathcal {S}}p(i,\mu (i),j) =\lambda ||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty }. \end{aligned}$$

Then, \(\max _{i \in \mathcal {S}}\left| G(J^k)(i)-G(\overline{J}^k)(i)\right| =||G(\textbf{J}^k)-G(\mathbf {\overline{J}}^k)||_{\infty } \le \lambda ||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty }\). \(\square \)
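As a quick illustration of Lemma 1 (a sketch added here, not part of the paper), the contraction bound can be checked numerically on a small, randomly generated discounted SMDP with an arbitrary fixed policy \(\mu \). The operator \(G\) is assumed to have the policy-evaluation form implied by the difference used in the proof above; all problem data below are synthetic.

```python
# Numerical sanity check of Lemma 1 (hedged sketch): for a random SMDP and a
# fixed policy mu, the operator assumed here,
#   G(J)(i) = sum_j p(i,mu(i),j) [ r(i,mu(i),j) + exp(-gamma t(i,mu(i),j)) J(j) ],
# should satisfy ||G(J) - G(Jbar)||_inf <= lambda ||J - Jbar||_inf,
# with lambda = max_{i,a,j} exp(-gamma t(i,a,j)) as in Eq. (11).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.1

# random transition probabilities, rewards, and strictly positive sojourn times
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(A, S, S))
T = rng.uniform(0.5, 5.0, size=(A, S, S))

mu = rng.integers(0, A, size=S)          # an arbitrary fixed policy
lam = np.exp(-gamma * T).max()           # contraction modulus from Eq. (11)

def G(J):
    """Policy-evaluation operator for the fixed policy mu (assumed form)."""
    out = np.empty(S)
    for i in range(S):
        a = mu[i]
        out[i] = np.sum(P[a, i] * (R[a, i] + np.exp(-gamma * T[a, i]) * J))
    return out

J1, J2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(G(J1) - G(J2)))
rhs = lam * np.max(np.abs(J1 - J2))
print(f"||G(J)-G(Jbar)||_inf = {lhs:.4f} <= lambda*||J-Jbar||_inf = {rhs:.4f}: {lhs <= rhs}")
```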

Proof

(Lemma 2) It is claimed that for every \(i\in \mathcal {S}\):

$$\begin{aligned} |J^k(i)|\le M(1+\lambda +\lambda ^2+\cdots +\lambda ^k), \end{aligned}$$
(12)
$$\begin{aligned} \text{ where } M=\max \left\{ \max _{i,j\in \mathcal{S}, a\in \mathcal{A}(i)}|r(i,a,j)|,\max _{i\in \mathcal{S}} |J^1(i)|\right\} \end{aligned}$$

and \(\lambda \) is defined via Eq. (11). Then, the result follows from the fact that:

$$\begin{aligned} \limsup _{k\rightarrow \infty }|J^k(i)|\le M \frac{1}{1-\lambda } \text{ for } \text{ all } i\in \mathcal {S}. \end{aligned}$$

The claim in (12) is proved via induction. In asynchronous updating, two cases can occur:

Case 1 The value for a state visited in the kth iteration is updated: \(J^{k+1}(i)=(1-\eta ^k)J^k(i)+\eta ^k\left[ r(i,a,j)+\exp (-\gamma t(i,a,j))J^k(j)\right] .\)

Case 2 The value for a state not visited in the kth iteration is not updated: \(J^{k+1}(i)=J^k(i)\).

When the update is carried out as in Case 1:

$$ \begin{aligned} |J^{2}(i)|&\le (1-\eta ^{1})|J^{1}(i)| + \eta ^{1}\left| r(i,a,j) + \exp (-\gamma t(i,a,j))J^{1}(j)\right| \\&\le (1-\eta ^{1})|J^{1}(i)| + \eta ^{1}\left( |r(i,a,j)| + \lambda |J^{1}(j)|\right) \quad \text{ from } (11)\\&\le (1-\eta ^{1})M + \eta ^{1}M + \eta ^{1}\lambda M \quad \text{ from } M\text{'s definition}\\&< (1-\eta ^{1})M + \eta ^{1}M + \lambda M = M(1+\lambda ) \quad \text{ since } \eta ^{1} < 1. \end{aligned} $$

When the update is carried out as in Case 2: \( |J^2(i)| = |J^1(i)|\le M \le M(1+\lambda ).\) The claim thus holds for \(k=1\). When the claim holds for \(k=m\), one has that:

$$\begin{aligned}|J^{m}(i)|\le M(1+\lambda +\lambda ^2+\cdots +\lambda ^{m})~~~~\forall i. \end{aligned}$$

Under Case 1: \(\forall i\):

$$ \begin{aligned} |J^{m+1}(i)|&\le (1-\eta ^{m})|J^{m}(i)| + \eta ^{m}\left| r(i,a,j) + \exp (-\gamma t(i,a,j))J^{m}(j)\right| \\&\le (1-\eta ^{m})|J^{m}(i)| + \eta ^{m}\left( |r(i,a,j)| + \lambda |J^{m}(j)|\right) \\&\le (1-\eta ^{m})M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) + \eta ^{m}M + \eta ^{m}\lambda M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) \\&= M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) + \eta ^{m}M\lambda ^{m+1} \\&\le M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) + M\lambda ^{m+1} = M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}+\lambda ^{m+1}). \end{aligned} $$

Under Case 2: \(\forall i\):

$$\begin{aligned} |J^{m+1}(i)| = |J^m(i)| \le M(1+\lambda +\lambda ^2+\cdots +\lambda ^m) \le M(1+\lambda +\lambda ^2+\cdots +\lambda ^m+\lambda ^{m+1}). \end{aligned}$$

\(\square \)
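To make the asynchronous setting of Lemma 2 concrete, the following sketch (an illustration on assumed synthetic data, not the paper's experiment) runs the Case 1/Case 2 update along a simulated trajectory with step sizes \(\eta ^k<1\) and checks that the iterates stay below the bound \(M/(1-\lambda )\).

```python
# Hedged sketch of the asynchronous update analysed in Lemma 2: in each
# iteration only the visited state's value is updated (Case 1); all other
# states keep their value (Case 2). With step sizes eta^k < 1 the iterates
# should remain bounded by M/(1 - lambda). All data here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, n_iters = 4, 2, 0.1, 20000

P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-10, 10, size=(A, S, S))
T = rng.uniform(0.5, 5.0, size=(A, S, S))

lam = np.exp(-gamma * T).max()               # lambda from Eq. (11)
J = rng.uniform(-1, 1, size=S)               # J^1
M = max(np.abs(R).max(), np.abs(J).max())    # M as defined above Eq. (12)

i = 0                                        # current (visited) state
for k in range(1, n_iters + 1):
    a = rng.integers(A)                      # exploratory behaviour action
    j = rng.choice(S, p=P[a, i])             # simulated next state
    eta = 1.0 / (1.0 + k / 100.0)            # a diminishing step size < 1
    # Case 1 update for the visited state i; every other state is Case 2 (unchanged)
    J[i] = (1 - eta) * J[i] + eta * (R[a, i, j] + np.exp(-gamma * T[a, i, j]) * J[j])
    i = j

print("max |J^k(i)| =", np.abs(J).max(), "<= M/(1-lambda) =", M / (1 - lam))
```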

Data for discounted SMDP with known transition model The transition probability, reward, and time matrices for action \(a\), denoted \(\textbf{P}_a\), \(\textbf{R}_a\), and \(\textbf{T}_a\), respectively, were set to:

$$\begin{aligned} \textbf{P}_1 = \begin{bmatrix} 0.6 & 0.2 & 0.1 & 0.1\\ 0.2 & 0.3 & 0.1 & 0.4\\ 0.1 & 0.6 & 0.2 & 0.1\\ 0.2 & 0.4 & 0.2 & 0.2 \end{bmatrix};\quad \textbf{P}_2 = \begin{bmatrix} 0.5 & 0.1 & 0.2 & 0.2\\ 0.2 & 0.4 & 0 & 0.4\\ 0.2 & 0.5 & 0.1 & 0.2\\ 0.6 & 0.2 & 0.1 & 0.1 \end{bmatrix};\\ \textbf{R}_1 = \begin{bmatrix} 6 & -5 & 0 & 12\\ 7 & 120 & 3 & 1\\ 5 & 4 & 12 & 3\\ 7 & 48 & 10 & 10 \end{bmatrix};\quad \textbf{R}_2 = \begin{bmatrix} 100 & 17 & 0 & 10\\ -14 & 13 & 0 & 1\\ 9 & 40 & 7 & 12\\ 12 & 12 & 10 & 14 \end{bmatrix};\\ \textbf{T}_1 = \begin{bmatrix} 1 & 5 & 2 & 5\\ 120 & 60 & 30 & 30\\ 2 & 7 & 12 & 14\\ 135 & 55 & 45 & 30 \end{bmatrix};\quad \textbf{T}_2 = \begin{bmatrix} 50 & 75 & 20 & 12\\ 7 & 2 & 7 & 2\\ 35 & 65 & 30 & 20\\ 50 & 30 & 50 & 70 \end{bmatrix}. \end{aligned}$$

Other constants in MAP were set to the following values: \(\gamma =0.1\), \(m_{\max }=k_{\max }=n_{\max }=100\), and \(K=5\).
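For readers who wish to reproduce this test instance, the sketch below encodes \(\textbf{P}_a\), \(\textbf{R}_a\), and \(\textbf{T}_a\) and solves the instance by plain value iteration on the discounted SMDP Bellman optimality equation. This baseline solve is an assumption added for illustration; it is not the MAP algorithm of the paper.

```python
# Hedged sketch: the small SMDP instance above, encoded directly, solved by
# ordinary value iteration on the discounted SMDP Bellman optimality equation
#   J(i) = max_a sum_j p(i,a,j) [ r(i,a,j) + exp(-gamma t(i,a,j)) J(j) ].
# This reproduces the test data only; it is not the paper's MAP algorithm.
import numpy as np

P = np.array([[[0.6, 0.2, 0.1, 0.1], [0.2, 0.3, 0.1, 0.4],
               [0.1, 0.6, 0.2, 0.1], [0.2, 0.4, 0.2, 0.2]],
              [[0.5, 0.1, 0.2, 0.2], [0.2, 0.4, 0.0, 0.4],
               [0.2, 0.5, 0.1, 0.2], [0.6, 0.2, 0.1, 0.1]]])
R = np.array([[[6, -5, 0, 12], [7, 120, 3, 1], [5, 4, 12, 3], [7, 48, 10, 10]],
              [[100, 17, 0, 10], [-14, 13, 0, 1], [9, 40, 7, 12], [12, 12, 10, 14]]],
             dtype=float)
T = np.array([[[1, 5, 2, 5], [120, 60, 30, 30], [2, 7, 12, 14], [135, 55, 45, 30]],
              [[50, 75, 20, 12], [7, 2, 7, 2], [35, 65, 30, 20], [50, 30, 50, 70]]],
             dtype=float)
gamma = 0.1

disc = np.exp(-gamma * T)                    # per-transition discount factors
J = np.zeros(4)
for _ in range(1000):                        # value iteration to (near) convergence
    Q = np.einsum('aij,aij->ai', P, R + disc * J)   # Q indexed as (action, state)
    J_new = Q.max(axis=0)
    if np.max(np.abs(J_new - J)) < 1e-10:
        J = J_new
        break
    J = J_new

policy = Q.argmax(axis=0) + 1                # 1-indexed greedy actions
print("J* =", np.round(J, 3), " greedy policy =", policy)
```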

Data for post-earthquake disaster response SMDP \(\textbf{PP}=[0.375, 0.2, 0.35, 0.075]\), where \(PP(s)\) denotes the probability of transitioning from the stable state to primary state \(s\). The following values were used for the \(\textbf{SP}\) matrix:

$$\begin{aligned} \textbf{SP} = \begin{bmatrix} 0.1 & 0.3 & 0 & 0.5 & 0 & 0 & 0.1\\ 0 & 0.1 & 0 & 0.35 & 0 & 0.35 & 0.2\\ 0 & 0 & 0.1 & 0 & 0.5 & 0.2 & 0.2\\ 0 & 0 & 0 & 0.1 & 0 & 0 & 0.9 \end{bmatrix}, \end{aligned}$$

where \(SP(s_1,s_2)\) denotes the probability of transitioning from a state, \(s_1\), in the set of primary states to a state, \(s_2\), in the union of the sets of primary and secondary states. The following values were used for the hazard scores: \(HS(1)=\log (2)\), \(HS(2)=\log (4)\), \(HS(3)=\log (8)\), \(HS(4)=\log (6)\), \(HS(5)=\log (10)\), \(HS(6)=\log (12)\), and \(HS(7)=\log (14)\). Natural logarithms were used for the costs because RL algorithms can suffer from computer overflow when costs have large absolute values, especially in discounted problems [24]. The response time for a state \(s\) is defined following [21]: \(RT_c(s)=\sum _{d\in \mathcal {I}(s)}RT(d)\,\phi \), where \(\mathcal {I}(s)\) denotes the set of incidents present in state \(s\), \(RT(d)\) denotes the response time for an incident \(d\), and \(\phi \) is a correction factor that equals 1 for a state containing one incident, 1.2 for a state containing two incidents, and 1.3 for a state containing three incidents. The following values were used (all in hours): \(RT(G)=7+\frac{5}{X}\), \(RT(F)=21+\frac{15}{X}\), and \(RT(BC)=35 +\frac{25}{X}\), where \(X=1\) for the local agency and \(X=2\) for the federal agency. Finally, the travel times for the local and federal agencies were TRIA(0.5, 1, 1.5) and TRIA(4, 5, 6) hours, respectively, where TRIA denotes the triangular distribution. Other constants in MAP were set as follows: \(\gamma =0.1\), \(m_{\max }=k_{\max }=n_{\max }=10{,}000\), and \(K=10\).
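The cost ingredients above can be assembled as in the following sketch. The incident labels G, F, and BC and the agency index \(X\) are taken from the text; the particular state-to-incident assignment in the example at the end is a hypothetical illustration only.

```python
# Hedged illustration of the cost ingredients described above: hazard scores as
# natural logs, per-incident response times with the agency-dependent term,
# the correction factor phi that depends on how many incidents a state contains,
# and triangular travel times. The example state is an assumption.
import math, random

HS = {1: math.log(2), 2: math.log(4), 3: math.log(8), 4: math.log(6),
      5: math.log(10), 6: math.log(12), 7: math.log(14)}

def RT(incident, X):
    """Response time (hours) per incident type; X=1 local agency, X=2 federal."""
    base = {'G': (7, 5), 'F': (21, 15), 'BC': (35, 25)}[incident]
    return base[0] + base[1] / X

def RT_c(incidents, X):
    """Corrected response time RT_c(s) = sum_d RT(d) * phi for a state s."""
    phi = {1: 1.0, 2: 1.2, 3: 1.3}[len(incidents)]
    return sum(RT(d, X) for d in incidents) * phi

def travel_time(X):
    """Triangular travel time (hours): TRIA(0.5,1,1.5) local, TRIA(4,5,6) federal."""
    lo, mode, hi = (0.5, 1.0, 1.5) if X == 1 else (4.0, 5.0, 6.0)
    return random.triangular(lo, hi, mode)

# Example: a hypothetical state containing incidents F and G, handled federally.
print("HS(3) =", round(HS[3], 3),
      "; RT_c =", RT_c(['F', 'G'], X=2),
      "hrs; travel =", round(travel_time(2), 2), "hrs")
```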

Steps in conservative approximate policy iteration (CAP) CAP uses Q-values, rather than the future state values, and bypasses model building. It has the same structure as MAP (see Algorithm 1 in the main text) with the following exceptions. Step A (model building) is skipped. Step B2 is replaced by evaluation of the Q-values, i.e., Eq. (3) is replaced by the following update of the Q-value for \((i,a)\):

$$\begin{aligned} Q^{n+1}(i,a)\leftarrow (1-\beta ^n)Q^n(i,a)+\beta ^n\left[ r(i,a,j)+\exp \left( -\gamma t(i,a,j)\right) J^{\infty }(j)\right] ; \end{aligned}$$

In Step C, the policy is improved, i.e., for each \(i\in \mathcal{S}\), select \(\mu '(i)\in \arg \max _{a\in \mathcal{A}(i)}Q^{\infty }(i,a)\), where \(Q^{\infty }(\cdot ,\cdot )\) denotes the final Q-values obtained in Step B2.
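A minimal sketch of these two CAP steps is given below. The function names and the way the fixed future-value estimate \(J^{\infty }\) is supplied are illustrative assumptions; the update itself follows the equation above.

```python
# Hedged sketch of the CAP policy-evaluation update (Step B2) and the greedy
# policy improvement (Step C). J_inf stands for the fixed estimate J^infinity(j)
# used in the update; how it is obtained follows the main text (not reproduced).
import numpy as np

def cap_q_update(Q, i, a, j, r, t, J_inf, beta, gamma):
    """One CAP update of Q(i,a) from a simulated transition (i,a) -> j."""
    target = r + np.exp(-gamma * t) * J_inf[j]
    Q[i, a] = (1 - beta) * Q[i, a] + beta * target
    return Q

def improve_policy(Q, feasible_actions):
    """Step C: mu'(i) in argmax over the feasible actions A(i) of Q(i,a)."""
    return {i: max(acts, key=lambda a: Q[i, a]) for i, acts in feasible_actions.items()}

# Tiny illustrative use (all numbers are assumptions):
Q = np.zeros((2, 2)); J_inf = np.array([1.0, 2.0])
Q = cap_q_update(Q, i=0, a=1, j=1, r=10.0, t=3.0, J_inf=J_inf, beta=0.1, gamma=0.1)
mu = improve_policy(Q, {0: [0, 1], 1: [0, 1]})
print(Q, mu)
```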


About this article

Cite this article

Gosavi, A., Sneed, L.H. & Spearing, L.A. Deep reinforcement learning for approximate policy iteration: convergence analysis and a post-earthquake disaster response case study. Optim Lett (2023). https://doi.org/10.1007/s11590-023-02062-0
