
Deep reinforcement learning for approximate policy iteration: convergence analysis and a post-earthquake disaster response case study

  • Original Paper
  • Optimization Letters

Abstract

Approximate policy iteration (API) is a class of reinforcement learning (RL) algorithms that seek to solve the long-run discounted reward Markov decision process (MDP), via the policy iteration paradigm, without learning the transition model in the underlying Bellman equation. Unfortunately, these algorithms suffer from a defect known as chattering in which the solution (policy) delivered in each iteration of the algorithm oscillates between improved and worsened policies, leading to sub-optimal behavior. Two causes for this that have been traced to the crucial policy improvement step are: (i) the inaccuracies in the policy improvement function and (ii) the exploration/exploitation tradeoff integral to this step, which generates variability in performance. Both of these defects are amplified by simulation noise. Deep RL belongs to a newer class of algorithms in which the resolution of the learning process is refined via mechanisms such as experience replay and/or deep neural networks for improved performance. In this paper, a new deep learning approach is developed for API which employs a more accurate policy improvement function, via an enhanced resolution Bellman equation, thereby reducing chattering and eliminating the need for exploration in the policy improvement step. Versions of the new algorithm for both the long-run discounted MDP and semi-MDP are presented. Convergence properties of the new algorithm are studied mathematically, and a post-earthquake disaster response case study is employed to demonstrate numerically the algorithm’s efficacy.


References

  1. Bertsekas, D.P.: Dynamic Programming and Optimal Control, 4th edn. Athena, Belmont (2012)

  2. Bertsekas, D.P.: Feature-based aggregation and deep reinforcement learning: a survey and some new implementations. IEEE/CAA J Autom Sin 6(1), 1–31 (2018)

  3. Bertsekas, D.P.: Reinforcement Learning and Optimal Control. Athena, Belmont (2019)

  4. Bertsekas, D.P.: Rollout, Policy Iteration, and Distributed Reinforcement Learning. Athena, Belmont (2021)

  5. Bertsekas, D.P., Castanon, D.A.: Rollout algorithms for stochastic scheduling problems. J Heuristics 5(1), 89–108 (1999)

  6. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena, Belmont (1996)

  7. Bradtke, S.J., Duff, M.: Reinforcement learning methods for continuous-time Markov decision problems. In: Advances in Neural Information Processing Systems 7. MIT Press, Cambridge (1995)

  8. Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton (2010)

  9. Cao, X.R.: Stochastic Learning and Optimization: A Sensitivity-Based View. Springer, Berlin (2007)

  10. Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I.: Simulation-Based Algorithms for Markov Decision Processes, 2nd edn. Springer, NY (2013)

  11. Chang, H.S., Lee, H.-G., Fu, M.C., Marcus, S.: Evolutionary policy iteration for solving Markov decision processes. IEEE Trans. Autom. Control 50(11), 1804–1808 (2005)

  12. de la Torre, L.E., Dolinskaya, I.S., Smilowitz, K.R.: Disaster relief routing: integrating research and practice. Socioecon. Plann. Sci. 46(1), 88–97 (2012)

  13. FEMA: 2022-2026 strategic plan. https://www.fema.gov/about/strategic-plan (2023)

  14. Fern, A., Yoon, S., Givan, R.: Approximate policy iteration with a policy language bias: solving relational Markov decision processes. J Artif Intell Res 25, 75–118 (2006)

  15. Fraioli, G., Gosavi, A., Sneed, L.H.: Strategic implications for civil infrastructure and logistical support systems in postearthquake disaster management: the case of St. Louis. IEEE Eng Manag Rev 49(1), 165–173 (2020)

  16. Ghosh, S., Gosavi, A.: A semi-Markov model for post-earthquake emergency response in a smart city. Control Theory Technol 15(1), 13–25 (2017)

  17. Gosavi, A.: A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis. Mach. Learn. 55, 5–29 (2004)

  18. Gosavi, A.: Boundedness of iterates in \({Q}\)-learning. Syst. Control Lett. 55, 347–349 (2006)

  19. Gosavi, A.: On step-sizes, stochastic paths, and survival probabilities in reinforcement learning. In: Proceedings of the Winter Simulation Conference. IEEE (2008)

  20. Gosavi, A.: Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2nd edn. Springer, NY (2015)

  21. Gosavi, A., Fraioli, G., Sneed, L.H., Tasker, N.: Discrete-event-based simulation model for performance evaluation of post-earthquake restoration in a smart city. IEEE Trans. Eng. Manag. 67(3), 582–592 (2019)

  22. Hoffman, M., de Freitas, N.: Inference Strategies for Solving Semi-Markov Decision Processes. In Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, pp. 82–96. IGI Global, Pennsylvania (2012)

  23. Howard, R.: Dynamic Programming and Markov Processes. MIT Press, Cambridge (1960)

  24. Matsui, T., Goto, T., Izumi, K., Chen, Y.: Compound reinforcement learning: theory and an application to finance. In European workshop on reinforcement learning, pp. 321–332. Springer (2011)

  25. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.C., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  26. Puterman, M.L.: Markov Decision Processes. Wiley, NY (1994)

  27. Puterman, M.L., Shin, M.C.: Modified policy iteration for discounted Markov decision problems. Manage. Sci. 24, 1127–1137 (1978)

  28. Scherrer, B.: Performance bounds for \(\lambda \) policy iteration and application to the game of tetris. J. Mach. Learn. Res. 14(4), (2013)

  29. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T.: A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)

  30. Sutton, R., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. The MIT Press, Cambridge (2018)

  31. Tadepalli, P., Ok, D.: Model-based average reward reinforcement learning algorithms. Artif. Intell. 100, 177–224 (1998)

  32. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double \(Q\)-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30(1) (2016)

  33. van Nunen, J.A.E.E.: A set of successive approximation methods for discounted Markovian decision problems. Z. Operat. Res. 20, 203–208 (1976)

  34. van Seijen, H., Whiteson, S., van Hasselt, H., Wiering, M.: Exploiting best-match equations for efficient reinforcement learning. J. Mach. Learn. Res. 12, 2045–2094 (2011)

  35. Yoshida, W., Ishii, S.: Model-based reinforcement learning: a computational model and an fMRI study. Neurocomputing 63, 253–269 (2005)

Acknowledgements

The paper has benefitted significantly from the suggestions of the Associate Editor and the reviewers. The authors thank the first reviewer for revisions leading to performance comparisons with CAP, the second reviewer for the Robbins-Monro analysis, and the third reviewer for the added numerical details in the first computational test.

Corresponding author

Correspondence to A. Gosavi.


Appendix

Proof

(Lemma 1) Note that since \(\gamma >0\) and \(t(.,.,.)>0\), there exists a \(\lambda \in (0,1)\) such that:

$$\begin{aligned} \max _{i,j\in \mathcal {S}; a\in \mathcal {A}(i)}\exp (-\gamma t(i,a,j))\le \lambda . \end{aligned}$$
(11)

From the definition of \(G(\cdot )\) in Eq. (6), for any two vectors \(\textbf{J}^k\) and \(\mathbf {\overline{J}}^k\):

$$\begin{aligned} G(J^k)(i)-G(\overline{J}^k)(i)=\sum _{j \in \mathcal {S}}p(i,\mu (i),j) \exp (-\gamma t(i,\mu (i),j))\left[ J^k(j)-\overline{J}^k(j)\right] ~~~\forall i. \end{aligned}$$
$$\begin{aligned} \text{ Then, }\ \forall i:\left| G(J^k)(i)-G(\overline{J}^k)(i)\right|&\le \sum _{j \in \mathcal {S}}p(i,\mu (i),j) \exp (-\gamma t(i,\mu (i),j))\max _{j\in \mathcal {S}}\left| J^k(j)-\overline{J}^k(j)\right| \\&\le \sum _{j \in \mathcal {S}}p(i,\mu (i),j)\,\lambda \max _{j \in \mathcal {S}}\left| J^k(j) -\overline{J}^k(j)\right| \quad \text{ from } \text{ Eq. } (11)\\&= \sum _{j \in \mathcal {S}}p(i,\mu (i),j)\,\lambda \,||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty }\\&=\lambda ||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty } \sum _{j \in \mathcal {S}}p(i,\mu (i),j) =\lambda ||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty }. \end{aligned}$$

Then, \(\max _{i \in \mathcal {S}}\left| G(J^k)(i)-G(\overline{J}^k)(i)\right| =||G(\textbf{J}^k)-G(\mathbf {\overline{J}}^k)||_{\infty } \le \lambda ||\textbf{J}^k-\mathbf {\overline{J}}^k||_{\infty }\). \(\square \)
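As a quick illustration of Lemma 1 (a sketch added here, not part of the paper), the contraction bound can be checked numerically on a small, randomly generated discounted SMDP with an arbitrary fixed policy \(\mu \). The operator \(G\) is assumed to have the policy-evaluation form implied by the difference used in the proof above; all problem data below are synthetic.

```python
# Numerical sanity check of Lemma 1 (hedged sketch): for a random SMDP and a
# fixed policy mu, the operator assumed here,
#   G(J)(i) = sum_j p(i,mu(i),j) [ r(i,mu(i),j) + exp(-gamma t(i,mu(i),j)) J(j) ],
# should satisfy ||G(J) - G(Jbar)||_inf <= lambda ||J - Jbar||_inf,
# with lambda = max_{i,a,j} exp(-gamma t(i,a,j)) as in Eq. (11).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.1

# random transition probabilities, rewards, and strictly positive sojourn times
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(A, S, S))
T = rng.uniform(0.5, 5.0, size=(A, S, S))

mu = rng.integers(0, A, size=S)          # an arbitrary fixed policy
lam = np.exp(-gamma * T).max()           # contraction modulus from Eq. (11)

def G(J):
    """Policy-evaluation operator for the fixed policy mu (assumed form)."""
    out = np.empty(S)
    for i in range(S):
        a = mu[i]
        out[i] = np.sum(P[a, i] * (R[a, i] + np.exp(-gamma * T[a, i]) * J))
    return out

J1, J2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(G(J1) - G(J2)))
rhs = lam * np.max(np.abs(J1 - J2))
print(f"||G(J)-G(Jbar)||_inf = {lhs:.4f} <= lambda*||J-Jbar||_inf = {rhs:.4f}: {lhs <= rhs}")
```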

Proof

(Lemma 2) It is claimed that for every \(i\in \mathcal {S}\):

$$\begin{aligned} |J^k(i)|\le M(1+\lambda +\lambda ^2+\cdots +\lambda ^k), \end{aligned}$$
(12)
$$\begin{aligned} \text{ where } M=\max \left\{ \max _{i,j\in \mathcal{S}, a\in \mathcal{A}(i)}|r(i,a,j)|,\max _{i\in \mathcal{S}} |J^1(i)|\right\} \end{aligned}$$

and \(\lambda \) is defined via Eq. (11). Then, the result follows from the fact that:

$$\begin{aligned} \limsup _{k\rightarrow \infty }|J^k(i)|\le M \frac{1}{1-\lambda } \text{ for } \text{ all } i\in \mathcal {S}. \end{aligned}$$

The claim in (12) is proved via induction. In asynchronous updating, two cases can occur:

Case 1 The value for a state visited in the kth iteration is updated: \(J^{k+1}(i)=(1-\eta ^k)J^k(i)+\eta ^k\left[ r(i,a,j)+\exp (-\gamma t(i,a,j))J^k(j)\right] .\)

Case 2 The value for a state not visited in the kth iteration is not updated: \(J^{k+1}(i)=J^k(i)\).

When the update is carried out as in Case 1:

$$ \begin{aligned} |J^{2}(i)|&\le (1-\eta ^{1})|J^{1}(i)| + \eta ^{1}\left| r(i,a,j) + \exp (-\gamma t(i,a,j))J^{1}(j)\right| \\&\le (1-\eta ^{1})|J^{1}(i)| + \eta ^{1}\left( |r(i,a,j)| + \lambda |J^{1}(j)|\right) \quad \text{ from } (11)\\&\le (1-\eta ^{1})M + \eta ^{1}M + \eta ^{1}\lambda M \quad \text{ from } M\text{'s definition}\\&< (1-\eta ^{1})M + \eta ^{1}M + \lambda M = M(1+\lambda ) \quad \text{ since } \eta ^{1} < 1. \end{aligned} $$

When the update is carried out as in Case 2: \( |J^2(i)| = |J^1(i)|\le M \le M(1+\lambda ).\) The claim thus holds for \(k=1\). When the claim holds for \(k=m\), one has that:

$$\begin{aligned}|J^{m}(i)|\le M(1+\lambda +\lambda ^2+\cdots +\lambda ^{m})~~~~\forall i. \end{aligned}$$

Under Case 1: \(\forall i\):

$$ \begin{aligned} |J^{m+1}(i)|&\le (1-\eta ^{m})|J^{m}(i)| + \eta ^{m}\left| r(i,a,j) + \exp (-\gamma t(i,a,j))J^{m}(j)\right| \\&\le (1-\eta ^{m})|J^{m}(i)| + \eta ^{m}\left( |r(i,a,j)| + \lambda |J^{m}(j)|\right) \\&\le (1-\eta ^{m})M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) + \eta ^{m}M + \eta ^{m}\lambda M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) \\&= M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) + \eta ^{m}M\lambda ^{m+1} \\&\le M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}) + M\lambda ^{m+1} = M(1+\lambda +\lambda ^{2}+\cdots +\lambda ^{m}+\lambda ^{m+1}). \end{aligned} $$

Under Case 2: \(\forall i\):

$$\begin{aligned} |J^{m+1}(i)| = |J^m(i)| \le M(1+\lambda +\lambda ^2+\cdots +\lambda ^m) \le M(1+\lambda +\lambda ^2+\cdots +\lambda ^m+\lambda ^{m+1}). \end{aligned}$$

\(\square \)
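To make the asynchronous setting of Lemma 2 concrete, the following sketch (an illustration on assumed synthetic data, not the paper's experiment) runs the Case 1/Case 2 update along a simulated trajectory with step sizes \(\eta ^k<1\) and checks that the iterates stay below the bound \(M/(1-\lambda )\).

```python
# Hedged sketch of the asynchronous update analysed in Lemma 2: in each
# iteration only the visited state's value is updated (Case 1); all other
# states keep their value (Case 2). With step sizes eta^k < 1 the iterates
# should remain bounded by M/(1 - lambda). All data here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, n_iters = 4, 2, 0.1, 20000

P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-10, 10, size=(A, S, S))
T = rng.uniform(0.5, 5.0, size=(A, S, S))

lam = np.exp(-gamma * T).max()               # lambda from Eq. (11)
J = rng.uniform(-1, 1, size=S)               # J^1
M = max(np.abs(R).max(), np.abs(J).max())    # M as defined above Eq. (12)

i = 0                                        # current (visited) state
for k in range(1, n_iters + 1):
    a = rng.integers(A)                      # exploratory behaviour action
    j = rng.choice(S, p=P[a, i])             # simulated next state
    eta = 1.0 / (1.0 + k / 100.0)            # a diminishing step size < 1
    # Case 1 update for the visited state i; every other state is Case 2 (unchanged)
    J[i] = (1 - eta) * J[i] + eta * (R[a, i, j] + np.exp(-gamma * T[a, i, j]) * J[j])
    i = j

print("max |J^k(i)| =", np.abs(J).max(), "<= M/(1-lambda) =", M / (1 - lam))
```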

Data for discounted SMDP with known transition model The transition probability, reward, and time matrices for action \(a\), denoted \(\textbf{P}_a\), \(\textbf{R}_a\), and \(\textbf{T}_a\), respectively, were set to:

$$\begin{aligned} \textbf{P}_1 = \begin{bmatrix} 0.6 & 0.2 & 0.1 & 0.1\\ 0.2 & 0.3 & 0.1 & 0.4\\ 0.1 & 0.6 & 0.2 & 0.1\\ 0.2 & 0.4 & 0.2 & 0.2 \end{bmatrix};\quad \textbf{P}_2 = \begin{bmatrix} 0.5 & 0.1 & 0.2 & 0.2\\ 0.2 & 0.4 & 0 & 0.4\\ 0.2 & 0.5 & 0.1 & 0.2\\ 0.6 & 0.2 & 0.1 & 0.1 \end{bmatrix};\\ \textbf{R}_1 = \begin{bmatrix} 6 & -5 & 0 & 12\\ 7 & 120 & 3 & 1\\ 5 & 4 & 12 & 3\\ 7 & 48 & 10 & 10 \end{bmatrix};\quad \textbf{R}_2 = \begin{bmatrix} 100 & 17 & 0 & 10\\ -14 & 13 & 0 & 1\\ 9 & 40 & 7 & 12\\ 12 & 12 & 10 & 14 \end{bmatrix};\\ \textbf{T}_1 = \begin{bmatrix} 1 & 5 & 2 & 5\\ 120 & 60 & 30 & 30\\ 2 & 7 & 12 & 14\\ 135 & 55 & 45 & 30 \end{bmatrix};\quad \textbf{T}_2 = \begin{bmatrix} 50 & 75 & 20 & 12\\ 7 & 2 & 7 & 2\\ 35 & 65 & 30 & 20\\ 50 & 30 & 50 & 70 \end{bmatrix}. \end{aligned}$$

Other constants in MAP were set to the following values: \(\gamma =0.1\), \(m_{\max }=k_{\max }=n_{\max }=100\), and \(K=5\).
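For readers who wish to reproduce this test instance, the sketch below encodes \(\textbf{P}_a\), \(\textbf{R}_a\), and \(\textbf{T}_a\) and solves the instance by plain value iteration on the discounted SMDP Bellman optimality equation. This baseline solve is an assumption added for illustration; it is not the MAP algorithm of the paper.

```python
# Hedged sketch: the small SMDP instance above, encoded directly, solved by
# ordinary value iteration on the discounted SMDP Bellman optimality equation
#   J(i) = max_a sum_j p(i,a,j) [ r(i,a,j) + exp(-gamma t(i,a,j)) J(j) ].
# This reproduces the test data only; it is not the paper's MAP algorithm.
import numpy as np

P = np.array([[[0.6, 0.2, 0.1, 0.1], [0.2, 0.3, 0.1, 0.4],
               [0.1, 0.6, 0.2, 0.1], [0.2, 0.4, 0.2, 0.2]],
              [[0.5, 0.1, 0.2, 0.2], [0.2, 0.4, 0.0, 0.4],
               [0.2, 0.5, 0.1, 0.2], [0.6, 0.2, 0.1, 0.1]]])
R = np.array([[[6, -5, 0, 12], [7, 120, 3, 1], [5, 4, 12, 3], [7, 48, 10, 10]],
              [[100, 17, 0, 10], [-14, 13, 0, 1], [9, 40, 7, 12], [12, 12, 10, 14]]],
             dtype=float)
T = np.array([[[1, 5, 2, 5], [120, 60, 30, 30], [2, 7, 12, 14], [135, 55, 45, 30]],
              [[50, 75, 20, 12], [7, 2, 7, 2], [35, 65, 30, 20], [50, 30, 50, 70]]],
             dtype=float)
gamma = 0.1

disc = np.exp(-gamma * T)                    # per-transition discount factors
J = np.zeros(4)
for _ in range(1000):                        # value iteration to (near) convergence
    Q = np.einsum('aij,aij->ai', P, R + disc * J)   # Q indexed as (action, state)
    J_new = Q.max(axis=0)
    if np.max(np.abs(J_new - J)) < 1e-10:
        J = J_new
        break
    J = J_new

policy = Q.argmax(axis=0) + 1                # 1-indexed greedy actions
print("J* =", np.round(J, 3), " greedy policy =", policy)
```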

Data for post-earthquake disaster response SMDP \(\textbf{PP}=[0.375, 0.2, 0.35, 0.075]\), where \(PP(s)\) denotes the probability of transitioning from the stable state to primary state \(s\). The following values were used for the \(\textbf{SP}\) matrix:

$$\begin{aligned} \textbf{SP} = \begin{bmatrix} 0.1 & 0.3 & 0 & 0.5 & 0 & 0 & 0.1\\ 0 & 0.1 & 0 & 0.35 & 0 & 0.35 & 0.2\\ 0 & 0 & 0.1 & 0 & 0.5 & 0.2 & 0.2\\ 0 & 0 & 0 & 0.1 & 0 & 0 & 0.9 \end{bmatrix}, \end{aligned}$$

where \(SP(s_1,s_2)\) denotes the probability of transitioning from a state, \(s_1\), in the set of primary states to a state, \(s_2\), in the union of the sets of primary and secondary states. The following values were used for the hazard scores: \(HS(1)=\log (2)\), \(HS(2)=\log (4)\), \(HS(3)=\log (8)\), \(HS(4)=\log (6)\), \(HS(5)=\log (10)\), \(HS(6)=\log (12)\), and \(HS(7)=\log (14)\). Natural logarithms were used for the costs because RL algorithms can suffer from computer overflow when costs have large absolute values, especially in discounted problems [24]. The response time for a state \(s\) is defined following [21]: \(RT_c(s)=\sum _{d\in \mathcal {I}(s)}RT(d)\,\phi \), where \(\mathcal {I}(s)\) denotes the set of incidents present in state \(s\), \(RT(d)\) denotes the response time for an incident \(d\), and \(\phi \) is a correction factor that equals 1 for a state containing one incident, 1.2 for a state containing two incidents, and 1.3 for a state containing three incidents. The following values were used (all in hours): \(RT(G)=7+\frac{5}{X}\), \(RT(F)=21+\frac{15}{X}\), and \(RT(BC)=35 +\frac{25}{X}\), where \(X=1\) for the local agency and \(X=2\) for the federal agency. Finally, the travel times for the local and federal agencies were TRIA(0.5, 1, 1.5) and TRIA(4, 5, 6) hours, respectively, where TRIA denotes the triangular distribution. Other constants in MAP were set as follows: \(\gamma =0.1\), \(m_{\max }=k_{\max }=n_{\max }=10{,}000\), and \(K=10\).
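The cost ingredients above can be assembled as in the following sketch. The incident labels G, F, and BC and the agency index \(X\) are taken from the text; the particular state-to-incident assignment in the example at the end is a hypothetical illustration only.

```python
# Hedged illustration of the cost ingredients described above: hazard scores as
# natural logs, per-incident response times with the agency-dependent term,
# the correction factor phi that depends on how many incidents a state contains,
# and triangular travel times. The example state is an assumption.
import math, random

HS = {1: math.log(2), 2: math.log(4), 3: math.log(8), 4: math.log(6),
      5: math.log(10), 6: math.log(12), 7: math.log(14)}

def RT(incident, X):
    """Response time (hours) per incident type; X=1 local agency, X=2 federal."""
    base = {'G': (7, 5), 'F': (21, 15), 'BC': (35, 25)}[incident]
    return base[0] + base[1] / X

def RT_c(incidents, X):
    """Corrected response time RT_c(s) = sum_d RT(d) * phi for a state s."""
    phi = {1: 1.0, 2: 1.2, 3: 1.3}[len(incidents)]
    return sum(RT(d, X) for d in incidents) * phi

def travel_time(X):
    """Triangular travel time (hours): TRIA(0.5,1,1.5) local, TRIA(4,5,6) federal."""
    lo, mode, hi = (0.5, 1.0, 1.5) if X == 1 else (4.0, 5.0, 6.0)
    return random.triangular(lo, hi, mode)

# Example: a hypothetical state containing incidents F and G, handled federally.
print("HS(3) =", round(HS[3], 3),
      "; RT_c =", RT_c(['F', 'G'], X=2),
      "hrs; travel =", round(travel_time(2), 2), "hrs")
```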

Steps in conservative approximate policy iteration (CAP) CAP uses Q-values, rather than the future state values, and bypasses model building. It has the same structure as MAP (see Algorithm 1 in the main text) with the following exceptions. Step A (model building) is skipped. Step B2 is replaced by evaluation of the Q-values, i.e., Eq. (3) is replaced by the following update of the Q-value for \((i,a)\):

$$\begin{aligned} Q^{n+1}(i,a)\leftarrow (1-\beta ^n)Q^n(i,a)+\beta ^n\left[ r(i,a,j)+\exp \left( -\gamma t(i,a,j)\right) J^{\infty }(j)\right] ; \end{aligned}$$

In Step C, the policy is improved, i.e., for each \(i\in \mathcal{S}\), select \(\mu '(i)\in \arg \max _{a\in \mathcal{A}(i)}Q^{\infty }(i,a)\), where \(Q^{\infty }(\cdot ,\cdot )\) denotes the final Q-values obtained in Step B2.
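A minimal sketch of these two CAP steps is given below. The function names and the way the fixed future-value estimate \(J^{\infty }\) is supplied are illustrative assumptions; the update itself follows the equation above.

```python
# Hedged sketch of the CAP policy-evaluation update (Step B2) and the greedy
# policy improvement (Step C). J_inf stands for the fixed estimate J^infinity(j)
# used in the update; how it is obtained follows the main text (not reproduced).
import numpy as np

def cap_q_update(Q, i, a, j, r, t, J_inf, beta, gamma):
    """One CAP update of Q(i,a) from a simulated transition (i,a) -> j."""
    target = r + np.exp(-gamma * t) * J_inf[j]
    Q[i, a] = (1 - beta) * Q[i, a] + beta * target
    return Q

def improve_policy(Q, feasible_actions):
    """Step C: mu'(i) in argmax over the feasible actions A(i) of Q(i,a)."""
    return {i: max(acts, key=lambda a: Q[i, a]) for i, acts in feasible_actions.items()}

# Tiny illustrative use (all numbers are assumptions):
Q = np.zeros((2, 2)); J_inf = np.array([1.0, 2.0])
Q = cap_q_update(Q, i=0, a=1, j=1, r=10.0, t=3.0, J_inf=J_inf, beta=0.1, gamma=0.1)
mu = improve_policy(Q, {0: [0, 1], 1: [0, 1]})
print(Q, mu)
```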


About this article

Cite this article

Gosavi, A., Sneed, L.H. & Spearing, L.A. Deep reinforcement learning for approximate policy iteration: convergence analysis and a post-earthquake disaster response case study. Optim Lett (2023). https://doi.org/10.1007/s11590-023-02062-0
