
DSMC Evaluation Stages: Fostering Robust and Safe Behavior in Deep Reinforcement Learning – Extended Version

Published: 26 October 2023


Abstract

Neural networks (NN) are gaining importance in sequential decision-making. Deep reinforcement learning (DRL), in particular, is extremely successful in learning action policies in complex and dynamic environments. Despite this success, however, DRL technology is not without its failures, especially in safety-critical applications: (i) the training objective maximizes average rewards, which may disregard rare but critical situations and hence lack local robustness; (ii) optimization objectives targeting safety typically yield degenerated reward structures, which, for DRL to work, must be replaced with proxy objectives. Here, we introduce a methodology that can help to address both deficiencies. We incorporate evaluation stages (ES) into DRL, leveraging recent work on deep statistical model checking (DSMC), which verifies NN policies in Markov decision processes. Our ES apply DSMC at regular intervals to determine state space regions with weak performance. We adapt the subsequent DRL training priorities based on the outcome, (i) focusing DRL on critical situations and (ii) making it possible to foster arbitrary objectives.

We run case studies on two benchmarks. One of them is the Racetrack, an abstraction of autonomous driving that requires navigating a map without crashing into a wall. The other is MiniGrid, a widely used benchmark in the AI community. Our results show that DSMC-based ES can significantly improve both (i) and (ii).


1 INTRODUCTION

In recent years, neural networks (NN), especially deep neural networks, have accomplished major successes across many computer science domains, such as image classification [36], natural language processing [31], and game-playing [55]. The latter was especially accomplished by combining reinforcement learning (RL) and deep neural networks, so-called deep reinforcement learning (DRL). DRL was used successfully for sequential decision-making, e.g., mastering Atari games [40, 41], playing the games Go and chess [54, 55, 56], or solving the Rubik’s cube [1], and is beginning to be used in real-world (motivated) examples, such as vehicle routing [42], robotics [25], and autonomous driving [49].

Despite this success, however, DRL technology is not without failure, especially in safety-critical applications. While neural network action policies achieve good performance in many sequential decision-making processes, that performance pertains to average rewards as optimized by DRL training. That objective, however, may average out poor local behavior, and thus disregard rare but critical situations (e.g., a child running in front of a car). In other words, we do not get system-level guarantees, even in the ideal case where the learned policy is near-optimal with respect to its training objective. We refer to this deficiency as a lack of local robustness. Dedicated exploration strategies have been developed to ensure the inclusion of rare experiences during training [15, 16]. These focus on reducing the variance of the accumulated reward, e.g., by importance sampling, but they are not flexible enough to enforce desirable behavior robustly across the whole state space.

This problem is exacerbated by the fact that optimization objectives specifically targeting safety typically yield degenerated reward structures. This is true, in particular, for the natural objective to maximize the probability of reaching a goal condition—without getting stuck in an unsafe (terminal) state. That objective yields reward 1 in goal states and 0 elsewhere, an extremely sparse reward structure not suited for (D)RL training in large state spaces (a widely known fact; see, e.g., References [3, 27, 35, 48, 52]). Hence, for (D)RL training to be able to identify a useful policy, proxy objectives are used, such as discounted cumulative reward giving positive feedback for goal states and (highly) negative feedback for unsafe states.

In summary, two deficiencies of current DRL methods in safety-critical systems are that (i) training for average reward lacks local robustness, and (ii) safety objectives like goal reachability probability cannot be used for effective training. Let us illustrate these points in an example, taken from the Racetrack benchmark, which we will also use in our case studies later on. Racetrack is a commonly used benchmark for Markov decision process (MDP) solution algorithms in AI [7, 9, 45, 58]. The challenge is to navigate a car to a goal line on a discrete map without crashing into a wall, where actions accelerate/decelerate the car. Racetrack is thus a simple (but highly extensible [6]) abstraction of autonomous driving. Consider Figure 1, which measures the performance of an NN policy trained with DRL, using a discounted reward structure with \(+100\) reward for goal states and \(-50\) reward for crashes.

Fig. 1. Example performance measures of a DRL policy on a Racetrack example map.

Figure 1(a) evaluates policy performance according to the reward structure it is trained on; whereas Figure 1(b) evaluates goal reachability probability, which is the objective we ideally want to optimize. Both heat maps visualize the performance when starting the policy from each map cell. We clearly see deficiency (i) from the high variance in colors, in particular, red and orange areas with (very) low expected reward (a)/goal reachability probability (b). Regarding deficiency (ii), while expected reward correlates with goal reachability probability, crashes are “more tolerable” in the reward structure than for goal reachability probability (if we set high negative rewards for crashes, then the policy learns to drive in circles). This is difficult to see in the heat maps, as the reward scale in (a) cannot be directly compared to the probability scale in (b). Figure 1(c) hence complements this picture by showing the average goal reachability probability in the critical areas of the map, as achieved by the standard DRL method deep Q-network, vs. EPR\(^{G}_{\mathit {DQN}}\), which is one of the new methods we introduce here. EPR\(^{G}_{\mathit {DQN}}\) takes goal reachability probability into account directly, which clearly pays off.

We address deficiencies (i) and (ii) through incorporating evaluation stages (ES) into DRL, conducted at regular intervals during training (i.e., periodically after a given number of training episodes) to determine state space regions with weak performance. The “performance” evaluation here is flexible and can be done either (i) with respect to the training objective or (ii) with respect to the true objective (for example: goal reachability probability in EPR\(^{G}\) above) in case a proxy objective is used for training.

To design such flexible ES, we leverage recent work on deep statistical model checking (DSMC) [21], an approach to explicitly examine properties of NN action policies. The approach assumes that the NN policy resolves the nondeterminism in an MDP, resulting in a Markov chain that is analyzed by statistical model checking [11]. This provides flexible methodology for evaluating policy performance starting from individual states.

As the target of an evaluation stage is to identify “weak regions,” the question arises to which individual states DSMC should be applied. Our answer to this question, at present, is based on the assumption that the possible initial states for the problem at hand (the states from which policy execution may start) can be partitioned into a feasibly small set of state space regions. In Racetrack, regions are identified by the location of the car. The approach we propose is to sample a single representative state s from each region and evaluate s through DSMC.

Upon termination of an ES, we adapt the subsequent DRL training priorities based on the outcome. Specifically, we introduce two alternative methods, of which one adapts the probabilities with which new training experiences are generated, and the other adapts the probabilities with which the accumulated training experiences are taken into consideration within individual learning steps. Overall, this approach results in an iterative feedback loop between DRL training and DSMC model checking. It addresses (i) through focusing the DRL on critical situations and addresses (ii), as DSMC can evaluate arbitrary temporal properties.

We implement this approach on top of deep Q-learning [7, 41], a single-threaded advantage actor-critic [39], and Dueling DQN (DDQN) [61], and we run experiments in case studies from Racetrack and MiniGrid [14]. The results show that DSMC-based ES can indeed (i) make policy reward more robust across complex maps and (ii) improve goal reachability probability when using a discounted-reward proxy for DRL training.

In summary, our contributions are as follows:

  • We introduce evaluation stages as an idea to improve local robustness and goal reachability probability training in DRL.

  • We design and implement two variants of this approach, adapting state-of-the-art DRL algorithms.

  • We evaluate the approach on two benchmarks and show that it can indeed have beneficial effects regarding deficiencies (i) and (ii).

Related Work: Recent work of Hasanbeig et al. [29] proposes a method to include a property encoded as an LTL formula and to synthesize policies that maximize the probability of that LTL property. However, while this allows us to specify complex tasks, their approach inherits both of our deficiencies.

Similarly, Hasanbeig et al. [30] use LTL properties to construct meaningful reward functions for unknown environments. Applying said method to our property of interest (optimizing goal reachability probability without getting stuck in an unsafe state) leads to the exact same reward function we use: positive when reaching the goal, negative when harming safety, and zero else (with the possible addition of a penalty for standing still). As shown in Figure 1, both deficiencies occur when using this reward function. Thus, this otherwise promising approach is of no help here.

Further, our work relates to the area of safe reinforcement learning. Several works investigate the usage of shields [2, 5, 32] or permissive schedulers [34] to restrict the agent from entering unsafe states, even during training. However, these approaches can only be applied if a shield/permissive scheduler was computed beforehand, which is a model-based task. In contrast, our approach is model-free; it does not need to compute a shield or permissive scheduler beforehand and does not restrict the action (and thus also state-) space. Instead, the task is learned entirely through self-play and Monte Carlo-based evaluation runs. Moreover, our approach is also applicable in more general scenarios, when there are not just safe and unsafe states but more fine-grained state distinctions.


2 BACKGROUND

We briefly introduce the necessary background on Markov decision processes, deep Q-learning, and deep statistical model checking.

2.1 Markov Decision Processes

The underlying model of both DSMC and DRL is that of a (state-discrete) Markov decision process in discrete time. Let \(\mathcal {D}(S)\) denote the set of probability distributions over S for any non-empty set S.

Definition 2.1

(Markov Decision Process).

An MDP is a tuple \(\mathcal {M}= \left\langle \mathcal {S}, \mathcal {A}, \mathcal {T}, \mu \right\rangle\) consisting of a finite set of states \(\mathcal {S}\), a finite set of actions \(\mathcal {A}\), a partial transition probability function \(\mathcal {T}:\mathcal {S}\times \mathcal {A}\rightharpoonup \mathcal {D}(\mathcal {S})\), and an initial distribution \(\mu \in \mathcal {D}(\mathcal {S})\). We say that an action \(a \in \mathcal {A}\) is applicable in state \(s \in \mathcal {S}\) if \(\mathcal {T}(s,a)\) is defined. We denote by \(\mathcal {A}(s) \subseteq \mathcal {A}\) the set of actions applicable in s.

MDPs are typically associated with a reward structure r, specifying numerical rewards that are obtained when following a transition, i.e., \(r : \mathcal {S}\times \mathcal {A}\times \mathcal {S}\rightarrow \mathbb {R}\). In the following, we call the support of \(\mu\) the set of initial states I, i.e., \(I=\left\lbrace s\in \mathcal {S}\mid \mu (s)\gt 0\right\rbrace\).

Usually, an MDP’s behavior is considered jointly with an entity resolving the otherwise non-deterministic choices in a state. Given a state, a so-called action policy (or scheduler, or adversary) determines which of the applicable actions to apply.

Definition 2.2

(Action Policy).

A (history-independent) action policy is a function \(\pi :\mathcal {S}\times \mathcal {A}\rightarrow [0,1]\) such that \(\pi (s,\cdot)\) is a probability distribution over \(\mathcal {A}\) and, for all \(s \in \mathcal {S}\), \(\pi (s,a)\gt 0\) implies that \(a \in \mathcal {A}(s)\).

We remark that history-independent action policies are often also called memoryless, because their decisions depend only on the given state and not on the history of formerly visited states. We call an action policy deterministic if in each state s, \(\pi\) selects an action with probability one. We then simply write \(\pi (s)\) for the corresponding action.

In the sequel, for a given MDP \(\mathcal {M}\) and action policy \(\pi\), we will write \(S_0,S_1,S_2,\ldots\) for the states visited at steps \(t=0,1,2,\ldots\). Let \(A_t\) be the action selected by policy \(\pi\) in state \(S_t\) and \(R_{t+1}=r(S_t,A_t,S_{t+1})\) the reward obtained when transitioning from \(S_t\) to \(S_{t+1}\) with action \(A_t\). Note that—as we are dealing with finite-state MDPs—the probability measure associated with these random variables is well defined and \(\lbrace S_t\rbrace _{t\in \mathbb {N}_0}\) is a Markov chain with state space \(\mathcal {S}\) induced by policy \(\pi\). For further details, we refer to Puterman [46].

The induced Markov chain can be analyzed using statistical model checking [53, 62]. For statistical model checking of MDPs, different approaches have been proposed to handle nondeterminism [8, 11].

2.2 Deep Q-learning

In the following, let (1) \(\begin{equation} G_t=\sum _{k=t+1}^{T}\gamma ^{k-t-1} R_{k} \end{equation}\) denote the discounted, accumulated reward, also called return, from time t on, where \(\gamma \in [0,1]\) is a discount factor, and T is the final timestep [58]. The discount factor determines the relative importance of short- and long-term rewards: if \(\gamma = 0\), then the return equals the reward accumulated in a single step; if \(\gamma = 1\), then all future rewards are worth the same; and if \(\gamma \in (0,1)\), then long-term rewards are less important than short-term ones.
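For concreteness, the following minimal Python sketch (illustrative only, not taken from our implementation) computes the return of Equation (1) for a finite episode via the usual backward recursion \(G_t = R_{t+1} + \gamma G_{t+1}\):

```python
def discounted_return(rewards, gamma, t=0):
    """Return G_t = sum_{k=t+1}^{T} gamma^(k-t-1) * R_k for a finite episode.

    `rewards` is the list [R_1, ..., R_T]; index i holds R_{i+1}.
    """
    g = 0.0
    # iterate backwards so that G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g

# example: three rewards, gamma = 0.9
assert abs(discounted_return([1.0, 0.0, 10.0], 0.9) - (1.0 + 0.9**2 * 10.0)) < 1e-9
```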

Q-learning is a well-known algorithm to approximate action policies that maximize said accumulated reward [7]. For a fixed policy \(\pi\), the so-called action-value or q-value \(q_\pi (s,a)\) at time t is defined as the expected return \(G_t\) that is achieved by taking an action \(a \in \mathcal {A}(s)\) in state s and following the policy \(\pi\) afterward, i.e., (2) \(\begin{equation} q_\pi (s,a) = \mathbb {E}_\pi \left[G_t \,\middle |\,S_t = s, A_t = a\right] = \mathbb {E}_\pi \left[\sum \limits _{k=0}^{\infty } \gamma ^k R_{t+k+1} \,\middle |\,S_t = s, A_t = a\right]. \end{equation}\)

Policy \(\pi\) is optimal if it maximizes the expected return. We write \(q_*(s,a)\) for the corresponding optimal action-value. Intuitively, the optimal action-value \(q_*(s,a)\) is equal to the expected sum of the reward that we receive when taking action a from state s and the (discounted) highest optimal action-value that we receive afterward. For optimal \(\pi\), the Bellman optimality equation [58] gives (3) \(\begin{equation} q_*(s,a) = \mathbb {E}_\pi \left[R_{t+1} + \gamma \cdot \max _{a^{\prime }}q_*\left(S_{t+1},a^{\prime }\right) \,\middle |\,S_t = s, A_t = a\right]. \end{equation}\)

Vice versa, one can evidently obtain the optimal policy if the optimal action-values are known by selecting \(\pi (s) = \mathop {\mathrm{argmax}}\nolimits _{a \in \mathcal {A}(s)} q_*(s,a)\).

By estimating the optimal q-values, one can obtain (an approximation of) an optimal policy. During tabular Q-learning, the action-values are approximated separately for each state-action pair [7]. In the case of large state spaces, deep Q-learning can be used to replace the Q-table with a neural network (NN) as a function approximator [41]. NNs can learn low-dimensional feature representations and express complex non-linear relationships. Deep reinforcement learning is based on training deep neural networks to approximate optimal policies. Here, we consider a neural network with weights \(\theta\) estimating the Q-value function as a DQN [40]. We denote this Q-value approximation by \(Q_\theta (s,a)\) and optimize the network w.r.t. the target (4) \(\begin{equation} y_\theta (s,a)=\mathbb {E}_\theta \left[R_{t+1} + \gamma \cdot \max _{a^{\prime }} Q_\theta (S_{t+1},a^{\prime })\mid S_t = s, A_t = a \right], \end{equation}\) where the expectation is taken over trajectories induced by the policy represented by the parameters \(\theta\). The corresponding loss function in iteration i of the learning process is (5) \(\begin{equation} L(\theta _i) = \mathbb {E}_{\theta _i}\left[\left(y_{\theta ^{\prime }}(S_t,A_t) - Q_{\theta _i}(S_t,A_t) \right)^2\right]. \end{equation}\) Here, the so-called fixed target means that in Equation (5) \(\theta ^{\prime }\) does not depend on the current iteration’s weights of the (so-called local) neural network \(\theta _i\) but on weights that were stored in earlier iterations (so-called target network), to avoid an unstable training procedure [41]. We approximate \(\nabla L(\theta _i)\) and optimize the loss function by stochastic gradient descent.
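The following sketch illustrates how the fixed-target loss of Equations (5) and (6) can be computed for one mini-batch. It assumes PyTorch and two illustrative networks, `q_net` (the local network \(\theta _i\)) and `target_net` (the target network \(\theta ^{\prime }\)); it is a sketch, not the implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma):
    """Fixed-target DQN loss for one sampled mini-batch.

    `batch` is a tuple of tensors (states, actions, rewards, next_states, dones),
    with `dones` a float tensor marking terminal transitions.
    """
    states, actions, rewards, next_states, dones = batch
    # Q_{theta_i}(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # fixed target: r + gamma * max_a' Q_{theta'}(s', a'), zero after terminal states
        max_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next * (1.0 - dones)
    return F.mse_loss(q_sa, target)
```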

In contrast to Mnih et al. [41], we do not update the target network after a fixed number of stochastic gradient descent update steps but perform a soft update instead, i.e., whenever we update the local network in iteration i, the weights of the target network are given by \(\theta ^{\prime } = (1 - \tau) \cdot \theta _i+ \tau \cdot \theta ^{\prime }\) with \(\tau \in (0,1)\) [17, 57].

Stochastic gradient descent assumes independent and identically distributed samples. However, when directly learning from self-play, this assumption is disrupted, as the next state depends on the current decision. To mitigate this problem, we do not directly learn from observations but store them in an experience replay buffer [41]. Whenever a learning step is performed, we uniformly sample from this replay buffer to consider (approximately) uncorrelated tuples. Thus, the loss is given by (6) \(\begin{equation} L(\theta _i) = \mathbb {E}_{(s,a,r,s^{\prime }) \sim U(D)}\left[ \left(r + \gamma \cdot \max \limits _{a^{\prime }} Q_{\theta ^{\prime }}(s^{\prime },a^{\prime }) - Q_{\theta _i}(s,a) \right)^2\right]. \end{equation}\) We generate our experience tuples by exploring the state space epsilon-greedily, i.e., during the Monte Carlo simulation, we follow the policy that is implied by the current network weights with a chance of \((1 - \epsilon)\) and otherwise choose a random action. We start with a high exploration coefficient \(\epsilon = \epsilon _\mathit {start}\) and exponentially decay it, i.e., for every iteration, we set \(\epsilon = \epsilon \cdot \epsilon _\mathit {decay}\) with \(\epsilon _\mathit {decay} \lt 1\), until a certain threshold \(\epsilon _\mathit {end}\) is met. Afterward, we constantly use \(\epsilon = \epsilon _\mathit {end}\). Common termination criteria for the learning process are fixing the number of episodes or using a threshold on the expected return achieved by the current policy. The overall algorithm is displayed in Section 3 (Algorithm 1), together with the extensions and changes we will introduce.
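A minimal sketch of the two ingredients just described, a uniform replay buffer and epsilon-greedy action selection with exponential decay, is given below. The buffer capacity is illustrative; the epsilon values are the ones listed in Appendix A.

```python
import random
from collections import deque

class ReplayBuffer:
    """Plain (uniform) experience replay as described above."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling approximates the i.i.d. tuples U(D) in Equation (6)
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon, n_actions):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_values[a])

# exponential decay of the exploration coefficient until eps_end is reached
eps, eps_end, eps_decay = 0.99, 0.05, 0.999
eps = max(eps * eps_decay, eps_end)
```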

A common improvement to the DQN algorithm sketched above, which we will also consider in this article, is the so-called prioritized replay buffer [50]. Not all samples are equally useful for improving the policy. In particular, those samples with a relatively small individual loss do not contribute to the learning process as much as those with a high loss. Thus, the idea of prioritized experience replay is to sample from the aforementioned replay buffer with a probability that reflects the loss. Specifically, the priority \(\delta\) of a sample \((s,a,s^{\prime }, r)\) in iteration i is given by (7) \(\begin{equation} \delta = \left({\left(Q_{\theta ^{\prime }}(s,a) - \left(r + \gamma \cdot \max _{a^{\prime }} Q_{\theta _i}(s^{\prime },a^{\prime }) \right) \right) + \epsilon _p } \right)^\alpha , \end{equation}\) where \(\epsilon _p\) is a hyperparameter to ensure that all samples have non-zero probability, and \(\alpha\) is used to control the amount of prioritization. \(\alpha = 0\) means that there is no prioritization, \(\alpha = 1\) means full prioritization, \(\alpha \in (0,1)\) defines a balance. In Equation (6), instead of sampling uniformly, the probability at which a sample is picked from the buffer is then proportional to its priority, i.e., we divide the samples’ priority by the sum of all priorities. In the following, we will abbreviate DQN with such prioritized experience replay as DQNPR.
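The sketch below turns per-sample temporal-difference errors into sampling probabilities in the spirit of Equation (7). Following the common prioritized-replay formulation, it uses the absolute TD error; all names are illustrative.

```python
import numpy as np

def replay_probabilities(td_errors, eps_p=1e-6, alpha=1.0):
    """Sampling probabilities from per-sample TD errors.

    eps_p keeps every sample's probability non-zero; alpha interpolates between
    uniform sampling (alpha = 0) and full prioritization (alpha = 1).
    """
    priorities = (np.abs(td_errors) + eps_p) ** alpha
    return priorities / priorities.sum()

# a sample with a large TD error is drawn far more often than one already fit well
probs = replay_probabilities(np.array([0.01, 2.0, 0.5]))
idx = np.random.choice(len(probs), p=probs)
```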

2.3 Advantage Actor-critics

Popular alternatives to deep Q-learning are given by policy-gradient methods [58], which approximate the optimal policy \(\pi ^{*} = \mathop {\mathrm{argmax}}_{\pi } v_\pi (S_0)\) directly with a parameterized policy \(\pi _\theta\) that is trained by stochastic gradient ascent on estimates of the policy-gradient (8) \(\begin{equation} g_\theta = \nabla _\theta ~v_{\pi _\theta }(S_0), \end{equation}\) where the state-value can be defined as \(v_\pi (s) = \sum _a \pi (s,a)q_\pi (s,a)\) or equivalently \(v_\pi (s) = \mathbb {E}_\pi \left[G_t \mid S_t = s\right]\). Note that, in contrast to deep Q-learning, we here consider a stochastic instead of a deterministic policy.

Intuitively, policy-based methods treat RL as a classification task. Rather than predicting the exact value of actions, they only need to be able to discriminate actions based on their relative value. Providing a loss function that facilitates learning in this setting entails that actions can be labeled based on their value. Using Monte Carlo estimates for this purpose often leads to unstable learning due to their high variance. Actor-critics combine policy-based methods with value-based methods, whereby the learned action-values are used to create a learning signal for the policy-based agent, i.e., critique the choices of the actor.

Recent advantage actor-critics (A2C) [39] instantiate this approach by implementing \(\pi _\theta\), the so-called actor, as a deep neural network and using policy-gradient estimates of the form (9) \(\begin{equation} g_\theta \approx \sum _{t=0}^{T-1} \nabla _\theta \log \pi _\theta (S_t, A_t) a_{\pi _\theta }(S_t, A_t), \end{equation}\) where \(a_\pi (s, a) = q_\pi (s, a) - v_\pi (s)\) denotes the action-advantage-value of action a in state s. In theory, one could also use \(q_{\pi _\theta }\) in place of \(a_{\pi _\theta }\), but in practice, the latter family of estimators is typically of lower variance, which helps to make the optimization process more stable, ultimately leading to decreased training times [39].

Similarly to Mnih et al. [39], to approximate \(v_{\pi _\theta }\), we train a separate critic network \(V_\phi (s)\) with parameters \(\phi\) by stochastic gradient descent on (10) \(\begin{equation} \nabla _\phi \frac{1}{T} \sum _{t=0}^{T-1} {(V_\phi (S_t) - G_t)}^2. \end{equation}\) In place of bootstrapped n-step temporal-difference errors [58], we use generalized advantage estimation (GAE) [51], i.e., (11) \(\begin{equation} A_t^{\phi ,\lambda } = \sum _{k=t}^{T-1}(\gamma \lambda)^{k-t} (R_{k+1} + \gamma V_\phi (S_{k+1}) - V_\phi (S_k)), \end{equation}\) to finally obtain estimates of \(a_{\pi _\theta }(S_t, A_t)\). The benefit of using GAE is that the hyperparameter \(\lambda \in [0,1]\) enables us to trade off bias and variance, whereby \(A_t^{\phi , 0}\) minimizes variance at the cost of high bias and \(A_t^{\phi , 1}\) minimizes bias at the cost of higher variance.
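As an illustration, GAE according to Equation (11) can be computed for one finite episode by a backward recursion over the one-step temporal-difference errors. The following sketch assumes `values` also contains the (zero) value of the terminal state; it is not taken from our implementation.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates A_t for one finite episode.

    `rewards` holds R_1..R_T and `values` holds V_phi(S_0)..V_phi(S_T);
    the value of a terminal state is expected to be 0.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    # backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```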

2.4 Dueling DQN

Dueling deep Q-networks (DDQN) [61] make use of an improved neural network architecture that is based on a decomposition of \(q_\pi (s,a)\). The action-advantage values defined in the previous section can be used to rewrite action-values as (12) \(\begin{equation} q_\pi (s,a) = v_\pi (s) + a_\pi (s,a). \end{equation}\)

Rather than using a single feed-forward network \(Q_{\theta }\), DDQNs compute their Q-estimates by aggregating the scalar and vector output of two separate networks \(V_\beta\) and \(A_\alpha\), i.e., (13) \(\begin{equation} Q_{(\alpha , \beta)}(s,a) = V_\beta (s) + A_\alpha (s,a). \end{equation}\) Intuitively, this allows the agent to learn a decoupled estimate \(V_\beta (s)\) for state-values, thereby reducing the need to learn precise advantages \(A_{\alpha }(s,a)\) for actions in less valuable states. The DDQN algorithm is identical to the DQN algorithm except for the fact that the deep Q-networks are replaced with dueling deep Q-networks, which entails that during training, the agent now optimizes over two sets of parameters \(\alpha\) and \(\beta\) instead of a single set of parameters \(\theta\), which we express by setting \(\theta = (\alpha , \beta)\).
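The sketch below shows one possible PyTorch realization of Equation (13), here with a small shared trunk feeding the two streams \(V_\beta\) and \(A_\alpha\); the layer sizes are illustrative and do not correspond to the networks used in our experiments.

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Two-stream network computing Q(s, a) = V(s) + A(s, a) as in Equation (13)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V_beta(s), one scalar per state
        self.advantage = nn.Linear(hidden, n_actions)  # A_alpha(s, a), one value per action

    def forward(self, s):
        h = self.shared(s)
        # note: the original dueling architecture additionally subtracts the mean
        # advantage for identifiability; Equation (13) uses the plain sum
        return self.value(h) + self.advantage(h)
```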

2.5 Deep Statistical Model Checking

Deep statistical model checking [21] is a method to analyze an NN-represented policy \(\pi\) taking action decisions (resolving the nondeterminism) in an MDP \(\mathcal {M}\). Namely, the induced Markov chain \(\mathcal {C}\) is examined by statistical model checking. Given an MDP \(\mathcal {M}\), DSMC assumes that the policy \(\pi\) has been trained based on \(\mathcal {M}\) completely prior to the analysis without influencing the training process at all. This approach is promising in terms of scalability, as the analysis of \(\mathcal {C}\) merely requires evaluating the NN on input states: There is no need for other deeper and more complex NN analyses. Gros et al. [21] implemented this approach for the statistical model checker modes [11] in the Modest Toolset [28].


3 RL WITH EVALUATION STAGES

We now introduce our approach of DRL with evaluation stages, addressing the DRL deficiencies discussed in the introduction: (i) training for average reward lacks local robustness; (ii) safety objectives like goal reachability probability cannot be used for effective training. We next discuss a basic design decision, then describe our two alternative methods, and then specify how they are realized on top of deep Q-learning.

3.1 Initial State Partitioning and Notations

Recall that I denotes the initial states of the MDP, i.e., the support of the initial distribution \(\mu\). As already mentioned, an important premise of our work is that I can be partitioned into a manageable number of regions. We denote that partition by \(\mathbb {P}= \lbrace J_1, J_2, \ldots , J_k \rbrace ,\) where the regions are non-empty \(J_i \ne \emptyset\), cover the set of all initial states \(\bigcup _{i \in {1,2,\ldots , k}} J_i = I\), and are disjoint \(J_i \cap J_j = \emptyset\) for \(i \ne j\). During the evaluation stages, we consider one representative \(s_i \in J_i\) from each region. The underlying assumption is that the representatives are sufficiently meaningful to identify important deficiencies in policy behavior.

The evaluation stages may, in principle, consider arbitrary optimization objectives and use arbitrary methods to measure the objective values of the states \(s_i\). We denote the outcome of the evaluation as an evaluation function, i.e., a function \(E: \mathbb {P}\rightarrow [0,1]\) mapping each region \(J_i\) to the evaluation value of its representative state \(s_i\). Here, we compute E using DSMC, measuring either expected reward or goal reachability probability. For optimization objectives that are not probabilities, we assume a normalization step into the interval \([0,1]\), with 0 being the worst value and 1 the best. In particular, for expected rewards, the natural method we use in our experiments is to set \(E(r^{\mbox{min}}) = 0\) and \(E(r^{\mbox{max}}) = 1\) and interpolate linearly in between.
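For expected-reward objectives, this normalization amounts to a simple linear rescaling; a minimal sketch (with illustrative names) is:

```python
def normalize_eval(value, r_min, r_max):
    """Map an expected-reward estimate into [0, 1] so that E(r_min) = 0 and
    E(r_max) = 1; clipping guards against estimates slightly outside the range."""
    e = (value - r_min) / (r_max - r_min)
    return min(max(e, 0.0), 1.0)
```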

We also use the representative states to define an initial probability distribution over the regions \(J_i\): (14) \(\begin{equation} \beta (J_i)= \mu (s_i)/\sum _{j=1}^k \mu (s_j). \end{equation}\)

3.2 Evaluation-based Initial Distribution (EID)

Given the initial distribution \(\mu\) of the MDP, with the insights gained through the DSMC evaluation stages, we can adapt the initial distribution to guide the training process after an evaluation stage. Recall that \(\beta\) is the initial distribution of a region in the original MDP. The probability of starting in a region \(J_i\) for the EID method is then given by (15) \(\begin{equation} p(J_i) = \frac{(1 - E(J_i) + \epsilon _p)\cdot \beta (J_i)}{\sum \nolimits _{j}{(1 - E(J_j) + \epsilon _p)\cdot \beta (J_j)}}, \end{equation}\) i.e., we shift the initial distribution for the regions such that we start with a higher probability in areas with low quality and vice versa. Once region \(J_i\) is selected, we uniformly sample a starting state from \(J_i\).
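A minimal sketch of the EID sampling step according to Equation (15) follows; it assumes, purely for illustration, that the evaluation values E, the region weights \(\beta\), and the region contents are given as plain Python lists.

```python
import random

def eid_region_probabilities(E, beta, eps_p=1e-6):
    """Shifted initial distribution of Equation (15): regions with a poor
    evaluation value E(J_i) receive a proportionally higher start probability."""
    weights = [(1.0 - E[i] + eps_p) * beta[i] for i in range(len(E))]
    total = sum(weights)
    return [w / total for w in weights]

def sample_initial_state(regions, E, beta):
    """Pick a region according to the EID distribution, then a start state
    uniformly inside it; `regions[i]` is the list of states forming J_i."""
    p = eid_region_probabilities(E, beta)
    i = random.choices(range(len(regions)), weights=p)[0]
    return random.choice(regions[i])
```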

The idea of EID is that by generating experiences from regions with poor behavior, we improve the robustness of the policy, as the NN will learn to select the most appropriate actions in these regions.

In contrast to prior work [23], we introduce a hyperparameter \(\epsilon _p\) to ensure that all regions have a non-zero probability. We experimentally found that this increases the stability of the training procedure and prevents catastrophic forgetting [4], i.e., forgetting already learned parts of the task.

3.3 Evaluation-based Prioritized Replay (EPR)

As discussed, the principle of prioritized experience replay buffers is to sample states according to their loss, i.e., we more often sample states where the loss is high and less often where the loss is low (see Equation (7)). Here, our idea is to base the priorities on the outcome of the evaluation instead.

The samples \((s,a,r,s^{\prime })\) in the replay buffer may be arbitrary and, in particular, may not contain possible initial states. Yet, the evaluation is done for initial states only. To be able to judge individual transition samples, we evaluate each sample in terms of the initial state \(s_0 \in I\) from which it was generated, i.e., from which the respective training episode started. This arrangement is meaningful, as improving the policy for \(s_0\) necessarily involves further training on its successor states. For each transition sample, we store the region \(J_i\) of the initial state \(s_0\) in the replay buffer. The replay priority \(\delta\) is then set to (16) \(\begin{equation} \delta = (1 - E(J_i) + \epsilon _p)^\alpha , \end{equation}\) where \(s_0 \in J_i\) is the initial state of the training episode, and \(\epsilon _p\) and \(\alpha\) have the same functionality as in Equation (7). After every evaluation stage, we update the priorities of the replay buffer according to Equation (16). The probability of picking experience \((s,a,r,s^{\prime })\) during training from the buffer is then proportional to the above replay priority.
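The corresponding EPR update is a one-line formula per sample. The sketch below assumes, purely for illustration, that each buffer entry is a dict carrying the index of its episode's initial region alongside \((s,a,r,s^{\prime })\).

```python
def epr_priority(E_region, eps_p=1e-6, alpha=1.0):
    """Replay priority of Equation (16) for a transition whose training episode
    started in a region with evaluation value E_region."""
    return (1.0 - E_region + eps_p) ** alpha

def refresh_priorities(buffer, E):
    """After an evaluation stage, recompute all priorities in the replay buffer.

    Each sample is assumed to store the index `region` of its episode's initial
    region J_i; E maps region indices to evaluation values in [0, 1].
    """
    for sample in buffer:
        sample["priority"] = epr_priority(E[sample["region"]])
```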

3.4 Deep Reinforcement Learning with Evaluation Stages

EID is applicable to any (deep) reinforcement learning algorithm and EPR to any such algorithm using a replay buffer. In this section, we will present our method using three different algorithms, namely, DQN, A2C, and DDQN.

The unmarked lines in both Algorithm 1 and Algorithm 2 are inherited from the original algorithm and are applied in all versions. Lines that are marked differently are only applied in the versions they are marked with, e.g., line 2 in Algorithm 1 is part of DQN, DQNPR, and EPR but not of EID. The colored lines mark the extensions of EID (e.g., Algorithm 1, line 3, blue) and the extensions of both EID and EPR (e.g., Algorithm 1, lines 17–21, green). The DSMC-based evaluation stages are inserted after a threshold P of pre-training episodes was met (Algorithm 1, line 17; Algorithm 2, line 23) and then are repeated every L episodes (Algorithm 1, line 18; Algorithm 2, line 24). Thus, the total number of training episodes M is given by \(M = P + N \cdot L\), where N is the number of performed evaluation stages.
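Schematically, the resulting training loop looks as follows. The three callables are placeholders for one DRL training episode, the DSMC-based evaluation of all region representatives, and the EID/EPR update, respectively; this is a sketch of the control flow, not the actual Algorithm 1/2 pseudocode.

```python
def train_with_evaluation_stages(run_episode, evaluate, apply_evaluation, P, L, N):
    """Outer loop: P pre-training episodes, then N evaluation stages, each
    followed by L further training episodes (M = P + N * L in total)."""
    for _ in range(P):
        run_episode()              # plain DQN / DDQN / A2C training
    for _ in range(N):
        E = evaluate()             # evaluation stage: E maps regions to [0, 1]
        apply_evaluation(E)        # shift initial distribution (EID) or replay priorities (EPR)
        for _ in range(L):
            run_episode()
```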

The adaptation of our methods to other reinforcement learning algorithms is straightforward.

3.4.1 Deep Q-learning with Evaluation Stages.

Here, we introduce both methods on top of deep Q-learning [41]. Algorithm 1 shows pseudocode for deep Q-learning with soft updates (denoted DQN) and its previously discussed variant DQNPR, as well as the extensions for EID and EPR.

The priority \(\delta\) (marked in orange, line 8) depends on the algorithm:

  • Both original deep Q-learning and EID sample uniformly from the replay buffer, so \(\delta\) is set to a constant value.

  • For DQNPR [50], \(\delta\) is initialized with the maximal temporal difference loss observed throughout the training procedure and updated in every learning step according to Equation (7).

  • EPR sets the priority to a constant prior to the first ES and afterward according to Equation (16).

3.4.2 A2C with Evaluation Stages.

Algorithm 2 shows pseudocode for A2C as well as the extension for EID. (Note that EPR is not applicable here, as A2C does not use a replay buffer.)

3.4.3 DDQN with Evaluation Stages.

Dueling DQN makes use of an improved neural network architecture. Using that architecture with Algorithm 1 gives us DDQN with evaluation stages.


4 CASE STUDIES

We next describe the Racetrack and MiniGrid benchmarks, which we use to evaluate our approach, as well as our experimental setup.

4.1 Racetrack

Racetrack is originally a pen-and-paper game, adopted as a benchmark in the AI community [7, 10, 38, 44, 45], particularly for reinforcement learning [6, 20, 24]. The task is to steer a car on a map towards the goal line without crashing into walls. The map is given by a two-dimensional grid, where each map cell is either free, part of the goal line, or a wall. We assume that, initially, the car may start on any free map cell with velocity 0 with equal probability (i.e., \(\mu\) is uniform, and I is the set of all non-wall positions with zero velocity).

Figure 2 shows the three maps that we consider in the following: Barto-big (Figure 2(a)) was originally introduced by Barto et al. [7]. We designed the other two maps, Maze (Figure 2(b)) and River (Figure 2(c)), as examples with a more localized structure highlighting the problem of local robustness.

The position and the velocity of the car are each a pair of integers, one for the x- and one for the y-dimension. In each step, the agent can accelerate the car by at most one unit in each dimension, i.e., the agent can add an element of \(\lbrace -1, 0, 1\rbrace\) to each of x and y, resulting in nine different actions. The ground is slippery, meaning that the action might fail, in which case the acceleration/deceleration does not happen, and the car’s velocity remains unchanged. Each action application fails with a fixed probability that we will refer to as noise.

The velocity after applying an action defines the car’s new position. The car then moves in a straight line from the old position to the new position. If that line intersects with a wall cell, then the car crashes, and the game is lost. If that line intersects with a goal cell, then the game is won. In both cases, the game terminates.

We use the following simple reward function: (17) \(\begin{equation} r\left(s \xrightarrow {\left(\mathit {ax},\mathit {ay}\right)}s^{\prime }\right) = {\left\lbrace \begin{array}{ll} \hfill 100 & \text{if } s^{\prime } = \top \\ \hfill -50 & \text{if } s^{\prime } = \bot \\ \hfill -5 & \text{if } s^{\prime } = s \wedge \mathit {ax}= \mathit {ay}= 0 \\ \hfill 0 & \text{otherwise.} \end{array}\right.} \end{equation}\) This reward function is positive if the game was won (\(\top\)), negative if the game was lost (\(\bot\)), and slightly negative if the state did not change (incentivizing the agent to not stand still); otherwise, no reward signal is given. The incentive to reach the goal as quickly as possible is given through the discount factor \(\gamma\), which is chosen to be smaller than 1, making short-term rewards more important than long-term ones (see Equation (2)). This reward function encodes the objective to reach the goal as quickly as possible and to not crash into a wall; the concrete values were found experimentally by optimizing the performance of the vanilla DQN algorithm. We remark that one can view the above reward structure as a proxy for the probability of reaching the goal. We will consider both perspectives in our experiments, as described in the next section.
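Expressed as code, the reward structure of Equation (17) reads as follows (a direct transcription; the flag names are illustrative):

```python
def racetrack_reward(reached_goal, crashed, stood_still):
    """Reward of Equation (17); the flags describe the outcome of applying
    acceleration (ax, ay) in state s, leading to s'."""
    if reached_goal:
        return 100.0
    if crashed:
        return -50.0
    if stood_still:      # s' == s and ax == ay == 0
        return -5.0
    return 0.0
```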

Fig. 2. Three Racetrack maps, where the goal line is marked in green and wall cells are colored gray.

In addition to the overall performance, we will analyze critical regions of the map that are particularly hard to solve. These regions, as shown in Figure 3, are the “dead-end streets” that the agent will need to back out from, i.e., temporarily increase the distance to the goal.

Fig. 3. Selected regions (yellow) of the Maze and River maps.

4.2 MiniGrid

The MiniGrid library [14] is a widely used benchmark in the AI and especially the DRL community [13, 18, 33, 47, 60]. The task is 2D navigation with several optional features, e.g., lava, keys and doors, or moving obstacles.

Fig. 4. Used MiniGrid with randomly moving obstacles (blue), where the goal is marked in green, and the selected regions in yellow.

We consider the MiniGrid environment displayed in Figure 4, where the task is to navigate to a pre-defined goal position while avoiding randomly moving obstacles. Each grid cell has a type that can either be empty, a goal tile, a wall cell, occupied by the agent, or occupied by exactly one obstacle. Initially, the six obstacles are placed at random, empty positions. At each step, they move to a randomly chosen empty position adjacent to their current position.

In contrast to Racetrack, the agent is not subject to a self-imposed velocity that is applied to its position at each step. Instead, the agent is associated with a movement direction (up, right, left, down) and can choose to alter this direction by \(\pm 90\)° by turning left or right or to move one unit in that direction. The default action space thus consists of only three different actions. Also note that there is no noop-action, i.e., an action that does nothing.

If the agent moves into a wall or into a (moving) obstacle, then the game is lost and the episode is terminated.

We use the following simple reward function, similar to the one used in Racetrack: (18) \(\begin{equation} r\left(s \xrightarrow {}s^{\prime }\right) = {\left\lbrace \begin{array}{ll} \hfill 1 & \text{if } s^{\prime } = \top \\ \hfill -1 & \text{if } s^{\prime } = \bot \\ \hfill 0 & \text{otherwise.} \end{array}\right.} \end{equation}\) Again, this reward function is positive if the game was won (\(\top\)), negative if the game was lost (\(\bot\)), and zero otherwise. In contrast to Racetrack, there is no noop-action that could be penalized. We take over the reward function’s concrete values from the MiniGrid benchmark’s default values with the slight alteration that we motivate reaching the goal quickly by using a discount factor and not by penalizing steps.

In our MiniGrid, the state is fully observable, i.e., the agent can see the full grid. Each cell is represented by a numerical value encoding its type, e.g., empty, wall, or occupied by an agent. Additionally, we include the number of steps taken in the state description.

Similarly to the Racetrack, we define critical regions that are difficult to solve. Here, these regions, as shown in Figure 4, are given by the narrow parts of the map, where it is particularly hard to pass by the moving obstacles, and by the cells that are farthest from the goal.

4.3 Experimental Setup

The policies (also: agents) in our experiments are trained using the different variants of Algorithm 1 and Algorithm 2. Specifically from Algorithm 1, we run DQN and DQNPR, as well as two variants of our DSMC-based algorithms (EID and EPR) each. Additionally, we use the DDQN algorithm (see Section 3.4.3) with the same EID and EPR variants. From Algorithm 2, we run A2C and two variants of EID.

The evaluation stage variants arise from two different optimization objectives in EID and EPR: the expected discounted accumulated reward, which is the same as DRL is trained upon; vs. the probability of reaching the goal, as an idealized evaluation objective not suited for training. We denote our algorithms using these objectives with EID\(^{R}\) and EPR\(^{R}\) for the former and with EID\(^{G}\) and EPR\(^{G}\) for the latter.

Additionally, we denote the corresponding base algorithm with a subscript, e.g., EID\(^{R}_{\mathit {DQN}}\) or EID\(^{R}_{\mathit {A2C}}\).

For the evaluation stages, we use DSMC with an error bound \(P(\mathit {error} \gt \epsilon _\mathit {err}) \lt \kappa\). For our comparison to be as fair as possible, all approaches use the same number of training episodes as the DSMC-based methods, i.e., \(M = P + L \cdot N\) (cf. Section 3.4.1).
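In our experiments, these estimates are produced with DSMC (cf. Section 2.5). Purely as an illustration of how such an error bound translates into a number of simulation runs, the following sketch uses the Chernoff-Hoeffding bound \(n \ge \ln (2/\kappa)/(2\epsilon _\mathit {err}^2)\); the callable `run_policy_from` is a stand-in for a single simulated episode under the fixed policy and is not part of our tool chain.

```python
import math

def estimate_goal_probability(run_policy_from, state, eps_err, kappa):
    """Monte Carlo estimate p_hat with P(|p_hat - p| > eps_err) < kappa,
    using the Chernoff-Hoeffding bound on the required number of runs.

    `run_policy_from(state)` must execute one episode and return True
    iff the goal was reached.
    """
    n = math.ceil(math.log(2.0 / kappa) / (2.0 * eps_err ** 2))
    hits = sum(run_policy_from(state) for _ in range(n))
    return hits / n
```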

For Racetrack, our partition \(\mathbb {P}\) of the initial states I considers each map cell with zero velocity to be a region on its own. We use a high noise level, namely, \(50\%\), for the Barto-big (Figure 2(a)) and River (Figure 2(c)) maps, to make the decision-making problems challenging. The Maze map (Figure 2(b)), with its long and narrow paths, is already challenging with much less uncertainty, so we set the noise to \(10\%\) there. Also, all compared approaches use the same neural network structure. We consider multilayer perceptrons (MLPs), a.k.a. feed-forward networks, with a ReLU activation function for every single neuron. We specifically consider the same NN structure as Reference [19], with input and output layers fixed by Racetrack, and two hidden layers with 64 neurons each.

For MiniGrid, our partition of the initial states considers each grid cell as a region. Thus, all initial states where the agent occupies said grid cell, including all of the agent’s possible directions and all possible placements of the moving obstacles, form one region. We use a high number of six obstacles, which makes passing them especially difficult in the narrow parts of the grid. Again, all compared approaches make use of the same neural network by using the same network structure as Gros [19], but with input and output layers fixed by MiniGrid and two hidden layers with 128 neurons each. The increased number of neurons in the hidden layers compared to Racetrack is due to the larger input to the NN, i.e., the larger state representation.
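For illustration, such a feed-forward Q-network can be defined in a few lines of PyTorch; the helper below is a sketch covering both settings (hidden size 64 for Racetrack, 128 for MiniGrid), with input and output dimensions supplied by the respective benchmark.

```python
import torch.nn as nn

def make_mlp(input_dim, n_actions, hidden=64):
    """Feed-forward Q-network as described above: two hidden layers with
    ReLU activations; input and output sizes are fixed by the benchmark."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )
```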

As deep reinforcement learning is known to be sensitive to different random seeds (affecting the exploration of the state space), we perform multiple training runs and report the best result achieved. Moreover, we fix the random seeds across algorithms in individual runs, such that the first P episodes are equal. The detailed hyperparameter settings can be found in Appendix A.


5 LOCAL ROBUSTNESS (DEFICIENCY (I))

We now analyze whether the inclusion of evaluation stages in the EID and EPR algorithms can improve the first deficiency of local robustness compared to standard algorithms. We thus set the evaluation objective to be identical to the expected-reward training objective (EID\(^{R}\) and EPR\(^{R}\)) and analyze whether local robustness is improved.

We use DQN, A2C, and DDQN as baselines and use two different benchmarks: the Racetrack and the MiniGrid.

5.1 Racetrack

5.1.1 DQN.

We start by using the Racetrack benchmark to analyze DQN (and EID and EPR), which was already included in prior work [22]. Consider the heat maps in Figure 5. For each cell on the map, we plot the expected cumulative discounted reward—the return—when starting from that map cell with zero velocity. In other words, the heat maps have one colored entry for every initial state \(s_0 \in I\). We compute the return value for each \(s_0\) using DSMC, with \(\epsilon _\mathit {err} = 1\) and \(\kappa = 0.05\), i.e., with a confidence of \(95\%\) that the error is at most 1.

Fig. 5. Return per map cell on the Maze map.

Clearly, the intended improvement of local robustness is achieved by EID\(^{R}_{\mathit {DQN}}\) and EPR\(^{R}_{\mathit {DQN}}\) compared to DQN and DQNPR: The return of the algorithms with evaluation stages is much better in specific areas of the map. This pertains foremost to both right corners of the map; far away from the goal at the top left; and to the dead-ends colored black in (a) and (b), where there is no direct connection to the nearest goal and the agents have to temporarily increase the distance to the goal. While the return of EID\(^{R}_{\mathit {DQN}}\) and EPR\(^{R}_{\mathit {DQN}}\) may still seem low in these critical parts, recall that the environment is noisy and it is impossible to navigate through this map without a high crash risk.

Figure 6(a) summarizes these findings, for all maps, in terms of the variance of the return across the map (the variance of the return per map cell).

Fig. 6. Variance and average of return.

The variances of our approaches are mostly smaller than that of DQN and DQNPR, confirming their improved local robustness. The variance reduction reaches up to about \(40 \%\) compared to DQN/DQNPR. On Maze, which has the most complex structure and thus is the most difficult, this variance reduction becomes the clearest.

Additionally, Figure 6(b) shows that the improved local robustness also results in slightly improved average return for EID\(^{R}_{\mathit {DQN}}\) and EPR\(^{R}_{\mathit {DQN}}\). This shows that evaluation stages can also help the overall performance when challenging local sub-tasks are frequent.

Note that most of the policies achieved through the DQN algorithm act near-optimally, which makes even slight improvements hard to achieve.

Finally, consider Figure 7, which on the River map exemplarily demonstrates that the improvements observed above are indeed due to more intense training in critical parts of the map. For each map cell, we show how often reinforcement learning considered a state from that cell. The training intensity of DQN is spread fairly homogeneously across the map (positions in the dead-end street are seen more often merely because, in any run traversing those, the car needs to turn around). In contrast, EID and EPR have a clear focus on the critical parts of the map.

Fig. 7. Number of times each cell was encountered during training on the River map.

Fig. 8. Return per map cell on the Maze map.

5.1.2 DDQN.

We here consider DDQN as our baseline and compare it with its EID and EPR extensions. We restrict ourselves to the Maze map, as we have seen in Section 5.1.1 that DQN already performs quite well on the other maps and, thus, there is only limited potential for further improvement.

First, consider Figure 8, plotting the return for every initial state \(s_0 \in I\) computed by DSMC. The intended improvement becomes evident when focusing on the selected regions of the map, especially the dead-ends and the top right corner. Clearly, these are areas where DDQN (see Figure 8(a)) can be improved and where EID\(^{R}_{\mathit {DDQN}}\) (Figure 8(b)) and EPR\(^{R}_{\mathit {DDQN}}\) (Figure 8(c)) perform better.

Figure 9 strengthens these findings by considering the average return. Figure 9(a) displays the average return throughout the whole map, where DDQN, EID\(^{R}_{\mathit {DDQN}}\), and EPR\(^{R}_{\mathit {DDQN}}\) perform roughly equally, even though we just found that DDQN lacks local robustness. In contrast, Figure 9(b) confirms that the average return of the selected regions (see Figure 3) is clearly increased for both ES-based methods.

Fig. 9. Average of return.

Fig. 10. Return per map cell on the Maze map.

5.1.3 A2C.

With A2C, we here consider an actor-critic approach. As our A2C approach does not make use of a replay buffer, we can only compare between A2C and EID\(^{R}_{\mathit {A2C}}\).

First, consider Figure 10(a), which depicts the return for every initial state \(s_0 \in I\) for A2C computed by DSMC. The agent is able to perfectly solve the map close to the goal line but not able to find the goal at all from most of the map’s positions. Note that policy-based approaches, in general, perform poorly when paired with degenerated reward structures as used here, which explains the decreased performance compared to the value-based approaches. Figure 10(b) shows a clear improvement of EID\(^{R}_{\mathit {A2C}}\) compared to plain A2C, as the agent is able to solve a larger part of the map. However, our method is not able to completely compensate for the lackluster performance of the actor-critic approaches.

These findings are confirmed by Figure 11, which displays the average return across the entire map. Given that most of the map cannot be solved, we only find a slightly increased average return of EID\(^{R}_{\mathit {A2C}}\) compared to A2C.

Fig. 11. Average of return across entire map.

5.2 MiniGrid

Since it became clear in Section 5.1 that the value-based approaches are more promising, we also evaluate the two value-based approaches considered (DQN and DDQN) and their extensions with EID and EPR on the MiniGrid benchmark.

5.2.1 DQN.

Consider the heat maps in Figure 12. For each cell on the map, we plot the return when starting from that map cell with any direction and the opponents placed randomly anywhere. In other words, the heat maps have one colored entry that represents all initial states \(s_0 \in I\) with the same grid position. We compute the return value for each cell using DSMC, with \(\epsilon _\mathit {err} = 0.01\) and \(\kappa = 0.05\), i.e., with a confidence of \(95\%\) that the error is at most 0.01.

Fig. 12. Return per map cell on the MiniGrid.

Although the DQN approach already performs quite well, EID\(^{R}_{\mathit {DQN}}\) and EPR\(^{R}_{\mathit {DQN}}\) perform notably better at the top of the grid, where the distance to the goal is the largest and where one still needs to pass most of the moving obstacles.

Figure 13 strengthens that finding in terms of the achieved return. Already for the return across the whole grid (Figure 13(a)), but, in particular, for the selected regions (Figure 13(b)), both ES-based approaches outperform DQN and DQNPR, confirming their improved local robustness.

Fig. 13. Average of return across the whole grid and for the selected regions.

5.2.2 DDQN.

Similar to the Racetrack, we additionally examine DDQN in combination with our ES-based approaches for MiniGrid. First, consider Figure 14, where we again plot the return when starting from each cell.

Fig. 14. Return per map cell on the MiniGrid.

All of the considered approaches perform similarly. While EID\(^{R}_{\mathit {DDQN}}\) seems to come with a slight improvement, EPR\(^{R}_{\mathit {DDQN}}\) looks slightly worse than DDQN.

However, Figure 15, which shows (a) the average return across the entire map and (b) the average return for the selected regions, reveals slight improvements for both EID\(^{R}_{\mathit {DDQN}}\) and EPR\(^{R}_{\mathit {DDQN}}\) compared to DDQN. Especially considering the selected regions, which come with a higher difficulty, Figure 15(b) indicates a significantly increased return, especially for EPR\(^{R}_{\mathit {DDQN}}\), and therefore an improved local robustness for the ES-based approaches.

Fig. 15. Average of return.


6 FOSTERING GOAL REACHABILITY PROBABILITY (DEFICIENCY (II))

We now turn to deficiency (ii), goal reachability probability performance when training on expected reward. As discussed above, the reward structure is such that it rewards reaching the goal but also punishes crashing into a wall. We now show that, indeed, goal reachability performance can be improved by introducing evaluation stages. EID\(^{G}\) and EPR\(^{G}\) improve the learning signal w.r.t. this objective.

In what follows, we compute all goal reachability probabilities using DSMC, with \(\epsilon _\mathit {err} = 0.01\) and \(\kappa = 0.05\), i.e., with a confidence of \(95\%\) that the error is at most 0.01.

6.1 Racetrack

6.1.1 DQN.

As already included in prior work [22], we start by using the Racetrack benchmark to analyze DQN (and EID and EPR).

Figure 16 shows the corresponding goal reachability probabilities, (a) across the entire map for all maps and (b) for the critical regions of the Maze and River maps only. In Figure 16(a), we see that, again, on the Maze and River maps, our proposed methods increase the average goal reachability probability. On Barto-big, this does not happen due to the simpler structure of that map.

As one would expect, the improvement is mostly higher in critical areas of the maps. To illustrate this, Figure 16(b) shows the average goal reachability probability for selected regions of the Maze and River maps, as shown in Figure 3.

Fig. 16. Average goal reachability probability when training on expected reward, without (DQN and DQNPR) vs. with (EID \(^{G}_{\mathit {DQN}}\) and EPR \(^{G}_{\mathit {DQN}}\) ) goal reachability probability evaluation stages.

6.1.2 DDQN.

Now, we take DDQN into account (Figure 17), but consider only Maze, the most difficult map.

Figure 17(a) shows the average goal reachability probability across the entire map, and Figure 17(b) only for the critical regions. While all three approaches seem to perform similarly across the map, EID\(^{G}_{\mathit {DDQN}}\) and EPR\(^{G}_{\mathit {DDQN}}\) both clearly improve the goal reachability probability in the selected regions, confirming that evaluation stages can be used to improve learning w.r.t. said objective.

Fig. 17. Average goal reachability probability when training on expected reward, without (DDQN) vs. with (EID \(^{G}_{\mathit {DDQN}}\) and EPR \(^{G}_{\mathit {DDQN}}\) ) goal reachability probability evaluation stages.

6.1.3 A2C.

For the policy-based approach, Figure 18 shows that our findings from the value-based approaches are confirmed once more. For Barto-big, the goal reachability probability is significantly increased. For River, both approaches perform roughly equally. However, all of the A2C-based algorithms once again cannot compete with the value-based approaches.

Fig. 18. Average goal reachability probability when training on expected reward, without (A2C) vs. with (EID \(^{G}_{\mathit {A2C}}\) ) goal reachability probability evaluation stages.

The improvements can be explained by considering Figure 19, which depicts the number of training episodes started in each grid cell. While A2C (Figure 19(a)) starts uniformly throughout the whole map, EID\(^{G}_{\mathit {A2C}}\) (Figure 19(b)) starts significantly less often where the task has already been learned: close to the goal line. Unfortunately, the area that has already been learned is small (similar to Figure 10), which is why the A2C approaches perform worse than the value-based ones.

Fig. 19. Number of episodes started in each grid cell without (A2C) vs. with (EID \(^{G}_{\mathit {A2C}}\) ) goal reachability probability evaluation stages.

6.2 MiniGrid

Just as in Section 5, we additionally evaluate the more promising value-based approaches on the MiniGrid benchmark.

6.2.1 DQN.

Consider Figure 20, depicting the goal reachability probability of all four considered approaches. Using evaluation stages not only improves the goal reachability probability across the entire grid (Figure 20(a)) but especially in the selected regions, where the task is more difficult (Figure 20(b)).

Fig. 20. Average goal reachability probability when training on expected reward, without (DQN and DQNPR) vs. with (EID \(^{G}_{\mathit {DQN}}\) and EPR \(^{G}_{\mathit {DQN}}\) ) goal reachability probability evaluation stages.

6.2.2 DDQN.

We again use DDQN as our baseline and compare it to its extensions with evaluation stages. Figure 21 displays the goal reachability probability (a) across the entire map and (b) for the selected regions only. In Figure 21(a), we can observe a roughly equal performance and even a small decrease for EID\(^{G}_{\mathit {DDQN}}\) compared to DDQN, while EPR\(^{G}_{\mathit {DDQN}}\) performs significantly better. Figure 21(b) clearly shows the fostered goal reachability probability for the interesting regions, especially for EPR\(^{G}_{\mathit {DDQN}}\).


7 CONCLUSION AND FUTURE WORK

Despite its enormous successes, deep reinforcement learning suffers from important deficiencies in safety-critical systems. Apart from the general inscrutability of neural networks, these include that (i) training on average performance measures lacks local robustness, and that (ii) safety-related objectives like goal reachability probability are sparse and hence not themselves suited for training. We propose to address (i) and (ii) through the incorporation of evaluation stages, which focus the reinforcement learning process on areas of the state space where performance according to an evaluation objective is poor. We observe that such evaluation stages can be readily implemented based on a recently introduced tool for deep statistical model checking [21]. Our experiments on Racetrack [7, 10, 38, 44, 45] and MiniGrid [13, 18, 33, 47, 60], both frequently used benchmarks for AI sequential decision-making algorithms, confirm that this approach can work.

Fig. 21. Average goal reachability probability when training on expected reward, without (DDQN) vs. with (EID\(^{G}_{\mathit{DDQN}}\) and EPR\(^{G}_{\mathit{DDQN}}\)) goal reachability probability evaluation stages.

On the algorithmic side, the poor performance of the actor-critic approaches in particular remains unsolved. To address this, their combination with both evaluation stages and dedicated exploration techniques [12, 26] is of interest.

Our approaches cannot straightforwardly be applied to problems that have a single initial state instead of a set of initial states. We plan to extend our algorithm to that class of problems.

Apart from that, an even broader empirical exploration of our approach is an important direction. A straightforward possibility is to extend the Racetrack with obstacles, traffic, fuel, and so on, on a roadmap towards more realistic abstractions of autonomous driving as outlined by Reference [6]. Of course, our approach is, in principle, applicable in arbitrary contexts where deep reinforcement learning is used. We believe that safety-critical cyber-physical systems should be the prime target, seeing that (i) and (ii) are key in that context and that the initial state partition required by our approach can be naturally obtained from (coarse discretizations of) physical location. In this context, a particular question to address will be the tradeoff in partition granularity between the amount of information available during evaluation stages and the overhead of conducting them.

APPENDIX

A HYPERPARAMETERS

Parameter | Description | Value

DQN/DDQN:

\(\epsilon_\mathit{start}\) | exploration coefficient at the beginning of training | 0.99
\(\epsilon_\mathit{end}\) | exploration coefficient at the end of training | 0.05
\(\epsilon_\mathit{decay}\) | exponential decay factor of the exploration coefficient, applied every episode until \(\epsilon_\mathit{end}\) is reached | 0.999
\(\gamma\) | discount factor | 0.99
M | number of training episodes (Racetrack) | \(110{,}000\)
M | number of training episodes (MiniGrid) | \(40{,}000\)
T | maximal episode length | 100
- | learning rate in gradient descent optimization (Racetrack) | 0.0008
- | learning rate in gradient descent optimization (MiniGrid) | 0.0001
\(\tau\) | soft update coefficient | 0.001
- | size of replay buffer | \(10^8\)

DQNPR:

\(\alpha\) | prioritization coefficient | 1
\(\epsilon_p\) | minimal priority | \(10^{-6}\)

ES-based:

P | number of pre-training episodes | \(10{,}000\)
N | number of evaluation stages (Racetrack) | 10
N | number of evaluation stages (MiniGrid) | 3
L | number of episodes between evaluation stages | \(10{,}000\)
\(\kappa\) | error rate / half-width parameter | 0.05
\(\epsilon_\mathit{err}\) | error probability / confidence | 0.05
\(\epsilon_p\) | minimal priority | 0.2

A2C:

\(\gamma\) | discount factor | 0.99
M | number of training episodes | \(110{,}000\)
T | maximal episode length | 100
- | actor learning rate | 0.0005
- | critic learning rate | 0.001
\(\lambda\) | GAE bias-variance tradeoff coefficient | 0.9
- | entropy regularization weight [39] | 0.001
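To illustrate how some of these values interact, the sketch below shows the per-episode exponential decay of the exploration coefficient, the soft target-network update governed by \(\tau\), and the number of simulation runs per region implied by \(\kappa\) and \(\epsilon_\mathit{err}\) under the usual Chernoff-Hoeffding bound. This is a schematic illustration only; the actual training code and the statistical test implemented in the model checker may differ.

```python
import math

# Hyperparameter values copied from the table above; the update rules are
# standard DQN ingredients, shown only to illustrate how these values are used.
EPS_START, EPS_END, EPS_DECAY = 0.99, 0.05, 0.999
TAU = 0.001

def exploration_coefficient(episode):
    """Epsilon for a given episode: exponential decay applied every episode,
    clipped once eps_end is reached."""
    return max(EPS_END, EPS_START * EPS_DECAY ** episode)

def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging of the target-network parameters (framework-agnostic)."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target_params, online_params)]

# Simulation runs per region implied by kappa and eps_err, assuming the usual
# Chernoff-Hoeffding bound n >= ln(2 / eps_err) / (2 * kappa^2).
KAPPA, EPS_ERR = 0.05, 0.05
RUNS_PER_REGION = math.ceil(math.log(2 / EPS_ERR) / (2 * KAPPA ** 2))  # = 738
```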

Footnotes

1. One can combine such a proxy with the goal reachability probability objective, though multiple objectives are difficult to achieve with a one-dimensional reward signal and standard backpropagation algorithms for neural nets [37]; in any case, the training objective and the ideal objective are still not identical. Reward shaping is an alternative option that can in principle preserve the optimal policy [43], but this is not always possible, and manual work is needed for individual learning tasks (sometimes substantial work; see, e.g., Reference [59]).
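For concreteness, the sketch below shows a potential-based shaping term in the sense of Reference [43]. The distance-based potential and the goal coordinates are purely illustrative assumptions; any state-dependent potential yields a shaping term that preserves the optimal policy.

```python
GAMMA = 0.99
GOAL = (10.0, 4.0)  # hypothetical goal coordinates, for illustration only

def potential(state):
    """Illustrative potential: negative Euclidean distance to the goal cell."""
    x, y = state[0], state[1]
    return -((x - GOAL[0]) ** 2 + (y - GOAL[1]) ** 2) ** 0.5

def shaped_reward(reward, state, next_state, gamma=GAMMA):
    """Potential-based shaping, r + gamma * Phi(s') - Phi(s), which leaves the
    optimal policy unchanged [43]."""
    return reward + gamma * potential(next_state) - potential(state)
```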

2. The benefit of our proposed ES thus hinges, in particular, on how meaningful these representative states are for policy performance. While this is a limitation, partitioning by physical location as in Racetrack could be a canonical candidate in many scenarios.

3. In our Racetrack case studies, we use the map cells as the basis of \(\mathbb {P}\), i.e., states sharing the same physical location. We believe that this partitioning method may work for many application scenarios involving physical space. Alternatively, one may, for example, partition state-variable ranges into intervals.
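For illustration, the following sketch contrasts the two partitioning options mentioned in this footnote. The state layout (position plus velocity components) and the interval widths are hypothetical and chosen only to make the example self-contained.

```python
def partition_by_cell(state):
    """Racetrack-style partition: states sharing the same map cell (x, y) are
    grouped together; velocities are abstracted away. The state layout
    (x, y, dx, dy) is an illustrative assumption."""
    x, y, _dx, _dy = state
    return (x, y)

def partition_by_intervals(state, widths):
    """Alternative partition: discretize every state-variable range into
    intervals of the given widths and use the index tuple as the key."""
    return tuple(int(v // w) for v, w in zip(state, widths))

# Example: a continuous state mapped to its interval indices.
key = partition_by_intervals((3.7, 1.2, -0.4), widths=(1.0, 1.0, 0.5))  # (3, 1, -1)
```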


REFERENCES

[1] Agostinelli Forest, McAleer Stephen, Shmakov Alexander, and Baldi Pierre. 2019. Solving the Rubik's cube with deep reinforcement learning and search. Nat. Mach. Intell. 1, 8 (2019), 356-363.
[2] Alshiekh Mohammed, Bloem Roderick, Ehlers Rüdiger, Könighofer Bettina, Niekum Scott, and Topcu Ufuk. 2018. Safe reinforcement learning via shielding. In 32nd AAAI Conference on Artificial Intelligence.
[3] Amit Ron, Meir Ron, and Ciosek Kamil. 2020. Discount factor as a regularizer in reinforcement learning. In International Conference on Machine Learning. PMLR, 269-278.
[4] Atkinson Craig, McCane Brendan, Szymanski Lech, and Robins Anthony. 2021. Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting. Neurocomputing 428 (2021), 291-307.
[5] Avni Guy, Bloem Roderick, Chatterjee Krishnendu, Henzinger Thomas A., Könighofer Bettina, and Pranger Stefan. 2019. Run-time optimization for learned controllers through quantitative games. In International Conference on Computer Aided Verification. Springer, 630-649.
[6] Baier Christel, Christakis Maria, Gros Timo P., Groß David, Gumhold Stefan, Hermanns Holger, Hoffmann Jörg, and Klauck Michaela. 2020. Lab conditions for research on explainable automated decisions. In Trustworthy AI–Integrating Learning, Optimization and Reasoning: First International Workshop, TAILOR 2020. Springer Nature, 83.
[7] Barto Andrew G., Bradtke Steven J., and Singh Satinder P. 1995. Learning to act using real-time dynamic programming. Artif. Intell. 72, 1-2 (1995), 81-138.
[8] Bogdoll Jonathan, Hartmanns Arnd, and Hermanns Holger. 2012. Simulation and statistical model checking for modestly nondeterministic models. In International GI/ITG Conference on Measurement, Modelling, and Evaluation of Computing Systems and Dependability and Fault Tolerance. Springer, 249-252.
[9] Bonet Blai and Geffner Hector. 2001. GPT: A tool for planning with uncertainty and partial information. In IJCAI Workshop on Planning with Uncertainty and Incomplete Information. 82-87.
[10] Bonet Blai and Geffner Hector. 2003. Labeled RTDP: Improving the convergence of real-time dynamic programming. In International Conference on Automated Planning and Scheduling. 12-21.
[11] Budde Carlos E., D'Argenio Pedro R., Hartmanns Arnd, and Sedwards Sean. 2018. A statistical model checker for nondeterminism and rare events. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 340-358.
[12] Burda Yuri, Edwards Harrison, Storkey Amos, and Klimov Oleg. 2018. Exploration by random network distillation. arXiv preprint arXiv:1810.12894 (2018).
[13] Chevalier-Boisvert Maxime, Bahdanau Dzmitry, Lahlou Salem, Willems Lucas, Saharia Chitwan, Nguyen Thien Huu, and Bengio Yoshua. 2019. BabyAI: First steps towards grounded language learning with a human in the loop. In International Conference on Learning Representations, Vol. 105.
[14] Chevalier-Boisvert Maxime, Dai Bolun, Towers Mark, de Lazcano Rodrigo, Willems Lucas, Lahlou Salem, Pal Suman, Castro Pablo Samuel, and Terry Jordan. 2023. Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. CoRR abs/2306.13831 (2023).
[15] Ciosek Kamil and Whiteson Shimon. 2017. OFFER: Off-environment reinforcement learning. In AAAI Conference on Artificial Intelligence, Vol. 31.
[16] Frank Jordan, Mannor Shie, and Precup Doina. 2008. Reinforcement learning in the presence of rare events. In 25th International Conference on Machine Learning. 336-343.
[17] Fujita Yasuhiro, Nagarajan Prabhat, Kataoka Toshiki, and Ishikawa Takahiro. 2021. ChainerRL: A deep reinforcement learning library. J. Mach. Learn. Res. 22, 77 (2021), 1-14.
[18] Goyal Anirudh, Sodhani Shagun, Binas Jonathan, Peng Xue Bin, Levine Sergey, and Bengio Yoshua. 2019. Reinforcement learning with competitive ensembles of information-constrained primitives. arXiv preprint arXiv:1906.10667 (2019).
[19] Gros Timo P. 2021. Tracking the Race: Analyzing Racetrack Agents Trained with Imitation Learning and Deep Reinforcement Learning. Master's thesis. Saarland University, Saarland Informatics Campus, 66123 Saarbrücken.
[20] Gros Timo P., Groß David, Gumhold Stefan, Hoffmann Jörg, Klauck Michaela, and Steinmetz Marcel. 2020. TraceVis: Towards visualization for deep statistical model checking. In 9th International Symposium on Leveraging Applications of Formal Methods, Verification and Validation. From Verification to Explanation.
[21] Gros Timo P., Hermanns Holger, Hoffmann Jörg, Klauck Michaela, and Steinmetz Marcel. 2020. Deep statistical model checking. In 40th International Conference on Formal Techniques for Distributed Objects, Components, and Systems (FORTE'20).
[22] Gros Timo P., Hermanns Holger, Hoffmann Jörg, Klauck Michaela, and Steinmetz Marcel. 2022. Analyzing neural network behavior through deep statistical model checking. Int. J. Softw. Tools Technol. Transf. (2022), 1-20.
[23] Gros Timo P., Höller Daniel, Hoffmann Jörg, Klauck Michaela, Meerkamp Hendrik, and Wolf Verena. 2021. DSMC evaluation stages: Fostering robust and safe behavior in deep reinforcement learning. In International Conference on Quantitative Evaluation of Systems. Springer, 197-216.
[24] Gros Timo P., Höller Daniel, Hoffmann Jörg, and Wolf Verena. 2020. Tracking the race between deep reinforcement learning and imitation learning. In International Conference on Quantitative Evaluation of Systems. Springer, 11-17.
[25] Gu Shixiang, Holly Ethan, Lillicrap Timothy, and Levine Sergey. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA'17). IEEE, 3389-3396.
[26] Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning. PMLR, 1861-1870.
[27] Hare Joshua. 2019. Dealing with sparse rewards in reinforcement learning. arXiv preprint arXiv:1910.09281 (2019).
[28] Hartmanns Arnd and Hermanns Holger. 2014. The Modest Toolset: An integrated environment for quantitative modelling and verification. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (LNCS 8413). 593-598.
[29] Hasanbeig Mohammadhosein, Abate Alessandro, and Kroening Daniel. 2018. Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099 (2018).
[30] Hasanbeig Mohammadhosein, Kroening Daniel, and Abate Alessandro. 2020. Deep reinforcement learning with temporal logics. In International Conference on Formal Modeling and Analysis of Timed Systems. Springer, 1-22.
[31] Hinton G., Deng L., Yu D., Dahl G. E., Mohamed A., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T. N., and Kingsbury B. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Sig. Process. Mag. 29, 6 (2012), 82-97.
[32] Könighofer Bettina, Bloem Roderick, Junges Sebastian, Jansen Nils, and Serban Alex. 2020. Safe reinforcement learning using probabilistic shields. In 31st International Conference on Concurrency Theory (CONCUR'20).
[33] Jiang Minqi, Grefenstette Edward, and Rocktäschel Tim. 2021. Prioritized level replay. In International Conference on Machine Learning. PMLR, 4940-4950.
[34] Junges Sebastian, Jansen Nils, Dehnert Christian, Topcu Ufuk, and Katoen Joost-Pieter. 2016. Safety-constrained reinforcement learning for MDPs. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 130-146.
[35] Knox W. Bradley and Stone Peter. 2012. Reinforcement learning from human reward: Discounting in episodic tasks. In 21st IEEE International Symposium on Robot and Human Interactive Communication. 878-885.
[36] Krizhevsky Alex, Sutskever Ilya, and Hinton Geoffrey E. 2012. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems Conference. 1097-1105.
[37] Liu Chunming, Xu Xin, and Hu Dewen. 2014. Multiobjective reinforcement learning: A comprehensive overview. IEEE Trans. Syst., Man, Cybern.: Syst. 45, 3 (2014), 385-398.
[38] McMahan H. Brendan and Gordon Geoffrey J. 2005. Fast exact planning in Markov decision processes. In International Conference on Automated Planning and Scheduling. 151-160.
[39] Mnih Volodymyr, Badia Adria Puigdomenech, Mirza Mehdi, Graves Alex, Lillicrap Timothy, Harley Tim, Silver David, and Kavukcuoglu Koray. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. PMLR, 1928-1937.
[40] Mnih Volodymyr, Kavukcuoglu Koray, Silver David, Graves Alex, Antonoglou Ioannis, Wierstra Daan, and Riedmiller Martin. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[41] Mnih Volodymyr, Kavukcuoglu Koray, Silver David, Rusu Andrei A., Veness Joel, Bellemare Marc G., Graves Alex, Riedmiller Martin A., Fidjeland Andreas, Ostrovski Georg, Petersen Stig, Beattie Charles, Sadik Amir, Antonoglou Ioannis, King Helen, Kumaran Dharshan, Wierstra Daan, Legg Shane, and Hassabis Demis. 2015. Human-level control through deep reinforcement learning. Nature 518 (2015), 529-533.
[42] Nazari MohammadReza, Oroojlooy Afshin, Snyder Lawrence, and Takac Martin. 2018. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems 31, Bengio S., Wallach H., Larochelle H., Grauman K., Cesa-Bianchi N., and Garnett R. (Eds.). Curran Associates, Inc., 9839-9849.
[43] Ng Andrew Y., Harada Daishi, and Russell Stuart J. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In 16th International Conference on Machine Learning (ICML'99). 278-287.
[44] Pineda Luis Enrique, Lu Yi, Zilberstein Shlomo, and Goldman Claudia V. 2013. Fault-tolerant planning under uncertainty. In 23rd International Joint Conference on Artificial Intelligence. 2350-2356.
[45] Pineda Luis Enrique and Zilberstein Shlomo. 2014. Planning under uncertainty using reduced models: Revisiting determinization. In International Conference on Automated Planning and Scheduling, Vol. 24.
[46] Puterman Martin L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming (1st ed.). John Wiley & Sons, Inc., New York, NY.
[47] Raileanu Roberta and Rocktäschel Tim. 2020. RIDE: Rewarding Impact-Driven Exploration for procedurally-generated environments. arXiv preprint arXiv:2002.12292 (2020).
[48] Riedmiller Martin, Hafner Roland, Lampe Thomas, Neunert Michael, Degrave Jonas, Wiele Tom, Mnih Vlad, Heess Nicolas, and Springenberg Jost Tobias. 2018. Learning by playing: Solving sparse reward tasks from scratch. In International Conference on Machine Learning. PMLR, 4344-4353.
[49] Sallab Ahmad El, Abdou Mohammed, Perot Etienne, and Yogamani Senthil. 2017. Deep reinforcement learning framework for autonomous driving. Electron. Imag. 2017, 19 (2017), 70-76.
[50] Schaul Tom, Quan John, Antonoglou Ioannis, and Silver David. 2016. Prioritized experience replay. In 4th International Conference on Learning Representations, Bengio Yoshua and LeCun Yann (Eds.).
[51] Schulman John, Moritz Philipp, Levine Sergey, Jordan Michael, and Abbeel Pieter. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
[52] Schwartz Anton. 1993. A reinforcement learning method for maximizing undiscounted rewards. In 10th International Conference on Machine Learning, Vol. 298. 298-305.
[53] Sen Koushik, Viswanathan Mahesh, and Agha Gul. 2005. On statistical model checking of stochastic systems. In International Conference on Computer Aided Verification. 266-280.
[54] Silver David, Huang Aja, Maddison Chris J., Guez Arthur, Sifre Laurent, van den Driessche George, Schrittwieser Julian, Antonoglou Ioannis, Panneershelvam Veda, Lanctot Marc, Dieleman Sander, Grewe Dominik, Nham John, Kalchbrenner Nal, Sutskever Ilya, Lillicrap Timothy, Leach Madeleine, Kavukcuoglu Koray, Graepel Thore, and Hassabis Demis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484-489.
[55] Silver David, Hubert Thomas, Schrittwieser Julian, Antonoglou Ioannis, Lai Matthew, Guez Arthur, Lanctot Marc, Sifre Laurent, Kumaran Dharshan, Graepel Thore, Lillicrap Timothy, Simonyan Karen, and Hassabis Demis. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (2018), 1140-1144.
[56] Silver David, Schrittwieser Julian, Simonyan Karen, Antonoglou Ioannis, Huang Aja, Guez Arthur, Hubert Thomas, Baker Lucas, Lai Matthew, Bolton Adrian, Chen Yutian, Lillicrap Timothy, Hui Fan, Sifre Laurent, van den Driessche George, Graepel Thore, and Hassabis Demis. 2017. Mastering the game of Go without human knowledge. Nature 550, 7676 (2017), 354-359.
[57] Stooke Adam and Abbeel Pieter. 2019. rlpyt: A research code base for deep reinforcement learning in PyTorch. arXiv preprint arXiv:1909.01500 (2019).
[58] Sutton Richard S. and Barto Andrew G. 2018. Reinforcement Learning: An Introduction (2nd ed.). The MIT Press.
[59] Vinyals Oriol, Babuschkin Igor, Czarnecki Wojciech M., Mathieu Michaël, Dudzik Andrew, Chung Junyoung, Choi David H., Powell Richard, Ewalds Timo, Georgiev Petko, Oh Junhyuk, Horgan Dan, Kroiss Manuel, Danihelka Ivo, Huang Aja, Sifre Laurent, Cai Trevor, Agapiou John P., Jaderberg Max, Vezhnevets Alexander S., Leblond Remi, Pohlen Tobias, Dalibard Valentin, Budden David, Sulsky Yury, Molloy James, Paine Tom L., Gulcehre Caglar, Wang Ziyu, Pfaff Tobias, Wu Yuhuai, Ring Roman, Yogatama Dani, Wünsch Dario, McKinney Katrina, Smith Oliver, Schaul Tom, Lillicrap Timothy, Kavukcuoglu Koray, Hassabis Demis, Apps Chris, and Silver David. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (2019).
[60] Wachi Akifumi, Wei Yunyue, and Sui Yanan. 2021. Safe policy optimization with local generalized linear function approximations. Adv. Neural Inf. Process. Syst. 34 (2021), 20759-20771.
[61] Wang Ziyu, Schaul Tom, Hessel Matteo, van Hasselt Hado, Lanctot Marc, and de Freitas Nando. 2015. Dueling network architectures for deep reinforcement learning. (2015).
[62] Younes Håkan L. S., Kwiatkowska Marta, Norman Gethin, and Parker David. 2004. Numerical vs. statistical probabilistic model checking: An empirical study. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 46-60.
