1 Introduction

In real-world decision making, a policy is often only executed once. For example, consider a government planning to build an off-shore wind farm to generate electricity. To ensure electricity generation is maximised, the wind farm must be located in an area with sufficient wind while not interfering with any fishing routes or protected marine life. Given the off-shore wind farm will only be constructed once, the government must consider each potential outcome and its likelihood to ensure an optimal decision can be made.

In reinforcement learning (RL), the expected return is used to make decisions [71]. However, in many scenarios the utility of a user is derived from a single execution of a policy, and, therefore, the utility of the returns must be optimised [61]. For example, in a medical setting a patient may only have one opportunity to select a treatment, and will aim to cure their illness based on a single course of treatment. As such, the user’s utility is derived from the single execution of a policy. Computing a policy by applying a utility function to the expected return is incompatible with how the user’s utility is derived, because the expected return averages over multiple policy executions; such an approach computes the utility of the expectation. In contrast, a policy that maximises the utility of the returns considers the utility obtained from each individual outcome, which is compatible with how the user’s utility is derived. Therefore, the expected utility must be maximised (see Sect. 2.4).

When optimising for expected utility, the underlying distribution of the returns must be taken into account rather than only its expectation, and decisions must be made using the utility of the returns of a full policy. Under these conditions, an agent must be able to sample from the underlying return distribution to calculate the future returns, and must also be able to calculate the returns accrued at each timestep. Therefore, we posit that, to have sufficient critical information at decision time, the agent must apply the utility function to the cumulative returns, which is the sum of the accrued and future returns [61].

To calculate the utility, we apply the user’s utility function, which is assumed to be known a priori, to the returns. In other words, in the taxonomy of multi-objective sequential decision making [62], we are in the known utility function scenario. When optimising under the expected utility, it is critical to only apply the utility function to the returns of a full execution of a policy [61] because nonlinear utility functions do not distribute across the sum of immediate and future returns [31, 61]. In this case, the agent must know the returns it has already accrued and the future returns before applying the utility function. For example, before the 2008 financial crash, many investment bankers were guaranteed their base salaries regardless of their losses, but their bonuses were dependent on their returns from investments. In the case of an investor incurring a loss, the only policies that could still yield a bonus were those that executed increasingly risky strategies to win back the losses.

Learning the utility of the returns is thus naturally risk-aware. Optimising the utility of the sum of the accrued and future returns to make decisions enables an agent to avoid certain undesirable outcomes. Without knowing the accrued returns, an agent cannot understand how future actions could affect the cumulative return. To make decisions that maximise the user’s utility, the agent must have information about both the accrued and the future returns.

A further complicating factor is that, in the real world, decision making often involves trade-offs based on multiple conflicting objectives [17, 60, 75]. For example, we may want to maximise the power output of coal-burning electrical generators while minimising \(CO_{2}\) emissions. Many approaches to multi-objective decision making only consider linear utility functions; this limitation severely restricts the real-world applicability of these methods [76], given that utility in many real-world problems is derived in a nonlinear manner.

In the multi-objective case, optimising under the expected utility is known as optimising the expected scalarised returns (ESR) criterion. For multi-objective reinforcement learning (MORL), the utility function expresses the user’s preferences over objectives. If the utility function is linear and is known a priori, it is possible to translate a multi-objective decision problem to its single-objective equivalent and then apply single-objective methods to solve the decision problem. However, if the utility function is nonlinear, as human preferences often are, explicitly multi-objective methods are required to find optimal solutions [62]. The majority of MORL algorithms focus on the scalarised expected returns (SER) criterion. It has been shown that for nonlinear utility functions the policies learned under the ESR criterion and the SER criterion can be different [64]. Furthermore, nonlinear utility functions invalidate the Bellman equation, given nonlinear utility functions do not distribute across the sum of the immediate and future returns, which restricts the number of usable algorithms in this setting [31]. Therefore, to increase MORL’s usability in the real world, dedicated multi-objective algorithms for the ESR criterion and the SER criterion that can learn policies for nonlinear utility functions must be formulated. We note that the MORL literature focuses almost exclusively on the SER criterion, leaving the ESR criterion largely understudied with a few exceptions [29, 30, 32, 33, 42, 61, 74].Footnote 1

We propose a novel algorithm that can optimise nonlinear utility functions under expected utility by taking both the accrued and future returns into consideration. To do so, we define a nonlinear utility function Monte Carlo tree search (NLU-MCTS) algorithm that performs Monte Carlo roll-outs to calculate the future returns while also tracking the accrued returns. Therefore, NLU-MCTS can make decisions using the expected utility of the returns, estimated over multiple roll-outs. NLU-MCTS builds upon Monte Carlo tree search and uses UCB to explore during planning.

In sequential decision making settings, one of the fundamental challenges is the exploration versus exploitation dilemma [71]. Thompson sampling is an algorithm that has been shown to address this dilemma in bandit settings [15]. Thompson sampling selects actions based on the probability matching principle, where actions are selected stochastically based on the probability of the action being optimal. Thompson sampling has been shown to empirically outperform UCB in bandit settings, and under a wide range of problems shows a more robust convergence compared to UCB [7]. Therefore, to exploit the potential performance gains of Thompson sampling we propose a new algorithm known as distributional Monte Carlo tree search (DMCTS), which computes an approximate posterior distribution over the expected utility of the returns. Using this approximate posterior distribution, it is possible to use Thompson sampling methods to explore during planning.

Both NLU-MCTS and DMCTS overcome the issues present when making decisions solely with the expected return [11, 38, 53, 70, 80]. As we will show, computing the utility of the returns of a policy is useful when optimising for risk-aware RL and under the MORL ESR criterion, given the utility of the returns of a policy contains more information about the range of potential negative and positive outcomes during planning and at decision time. NLU-MCTS achieves good performance in both risk-aware and multi-objective settings, while DMCTS achieves good performance in risk-aware settings and state-of-the-art performance under multi-objective ESR settings.

2 Background

In this section we introduce necessary background material, including multi-objective Markov decision processes, the known utility function scenario, commonly used optimality criteria in multi-objective decision making, risk-aware utility functions, Bootstrap Thompson Sampling, Expected Utility Policy Gradient and Monte Carlo tree search.

2.1 Multi-objective reinforcement learning

In multi-objective reinforcement learning (MORL), we deal with decision problems with multiple objectives [44, 62], often modelled as a multi-objective Markov decision process (MOMDP). A MOMDP is a tuple, \(\mathcal {M} = (\mathcal {S}, \mathcal {A}, \mathcal {T}, \gamma , \mathcal {R})\), where \(\mathcal {S}\) and \(\mathcal {A}\) are the state and action spaces, \(\mathcal {T} :\mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow \left[ 0, 1 \right]\) is a probabilistic transition function, \(\gamma\) is a discount factor determining the relative importance of future rewards and \(\mathcal {R} :\mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow \mathbb {R}^n\) is an n-dimensional vector-valued immediate reward function. In MORL, \(n>1\).
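As an illustration, the elements of this tuple can be represented directly in code. The following minimal Python sketch is our own illustrative container (not tied to any particular MORL library) for a MOMDP with \(n = 2\) objectives.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class MOMDP:
    """Minimal container mirroring the tuple (S, A, T, gamma, R)."""
    states: List[int]                              # S
    actions: List[int]                             # A
    transition: Callable[[int, int, int], float]   # T(s, a, s') -> probability
    gamma: float                                   # discount factor
    reward: Callable[[int, int, int], np.ndarray]  # R(s, a, s') -> n-dimensional reward vector

# Example instance: two states, two actions, two objectives (n = 2 > 1).
momdp = MOMDP(
    states=[0, 1],
    actions=[0, 1],
    transition=lambda s, a, s2: 0.5,               # uniform over the two successor states
    gamma=1.0,
    reward=lambda s, a, s2: np.array([float(a), 1.0 - a]),
)
```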

2.2 The known utility function scenario

In MORL, an agent seeks to maximise a user’s utility function, where a user’s utility function describes their preferences over objectives. In certain scenarios the utility function of a user can be known at the time of learning or planning; in the taxonomy of MORL this is the known utility function scenario [31, 62]. When the utility function of a user is known, a single optimal policy can be computed. Figure 1 describes the two phases of the known utility function scenario: the planning or learning phase and the execution phase. During the planning or learning phase a multi-objective reinforcement learning or planning algorithm is deployed in the MOMDP to compute a single optimal policy for the known utility function. The computed policy is then executed during the execution phase.

Fig. 1: The known utility function scenario [31]

2.3 Risk-aware utility functions

In single-objective decision making under uncertainty, utility functions are often utilised [79]. In scenarios where risk is considered, utility functions are often used to represent a user’s preference for risk. In risk-aware settings the utility function is applied to the returns and the expected utility is maximised. When making decisions in scenarios with risk, a user can be described as risk-averse, risk-seeking or risk-neutral.

A user’s preference for risk can be described by the shape of their utility function [5, 21]. A risk-seeking utility function is convex; for example, the nonlinear utility function \(u(x) = x^{2}\) is risk-seeking given its shape is convex [21, 41]. A risk-averse utility function is concave; for example, the nonlinear utility function \(u(x) = x^{0.5}\) is risk-averse given its concave shape [21, 36]. In contrast to risk-averse and risk-seeking utility functions, risk-neutral utility functions are linear [36]. A user who has a risk-neutral utility function has no preference for risk and therefore the utility is a linear function of the returns; for example, \(u(x) = x\) is a risk-neutral, linear utility function. In this paper we only consider nonlinear utility functions, and therefore we focus on risk-seeking and risk-averse utility functions.
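To make the connection between curvature and risk concrete, the short Python sketch below (with illustrative numbers of our own choosing) compares the expected utility of a fair gamble with the utility of the gamble's certain mean outcome for each class of utility function.

```python
import numpy as np

risk_seeking = lambda x: x ** 2      # convex
risk_averse = lambda x: x ** 0.5     # concave
risk_neutral = lambda x: x           # linear

# Fair gamble: returns 0 or 10 with equal probability (mean 5).
outcomes, probs = np.array([0.0, 10.0]), np.array([0.5, 0.5])

for name, u in [("seeking", risk_seeking), ("averse", risk_averse), ("neutral", risk_neutral)]:
    eu_gamble = float(np.dot(probs, u(outcomes)))  # expected utility of the gamble, E[u(X)]
    u_certain = float(u(5.0))                      # utility of the certain mean outcome, u(E[X])
    print(f"{name:8s} E[u(gamble)] = {eu_gamble:5.2f}  u(mean) = {u_certain:5.2f}")

# Convex u:  E[u(gamble)] > u(mean) -> the user prefers the gamble (risk-seeking).
# Concave u: E[u(gamble)] < u(mean) -> the user prefers the certain outcome (risk-averse).
```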

2.4 Scalarised expected returns versus expected scalarised returns

In MORL, the user’s utility derives from the vector-valued outcomes (returns). This is typically modelled as a utility function that needs to be applied to these outcomes in one way or another. For this, we consider two choices [31, 62]. Calculating the expected value of the return of a policy before applying the utility function leads to the scalarised expected returns (SER) optimisation criterion:

$$\begin{aligned} V_{u}^{\pi } = u\left( \mathbb{E} \left[ \sum \limits ^\infty _{t=0} \gamma ^t \textbf{r}_t \,|\, \pi , \mu _0 \right] \right) . \end{aligned}$$
(1)

SER is the most commonly used criterion in the multi-objective (single agent) planning and reinforcement learning literature [62]. For SER, a coverage set is defined as a set of optimal policies for all possible utility functions.

In contrast to the SER criterion, if the utility function is applied before computing the expectation, then the expected scalarised returns (ESR) criterion is being optimised [61]:

$$\begin{aligned} V_{u}^{\pi } = \mathbb {E} \left[ u\left( \sum \limits ^\infty _{t=0} \gamma ^t \textbf{r}_t \right) \,|\, \pi , \mu _0 \right] . \end{aligned}$$
(2)

Similar to risk-aware settings for single objectives (see Sect. 2.3), the ESR criterion maximises the expected utility. Therefore, the ESR criterion is naturally risk-aware while considering multiple objectives. ESR is the most commonly used criterion in the game theory literature on multi-objective games [59], with some exceptions (e.g. [64]).
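The difference between the two criteria can be seen on a small example. The Python sketch below uses a toy return distribution of our own construction and a nonlinear utility \(u(\textbf{r}) = r_1 r_2\), applied after (SER) or before (ESR) taking the expectation.

```python
import numpy as np

# Two equally likely vector-valued returns of a single policy.
returns = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
probs = np.array([0.5, 0.5])

u = lambda r: r[0] * r[1]   # nonlinear utility over a return vector

# SER: utility of the expected return vector, u(E[R]).
ser = u(probs @ returns)                           # E[R] = [0.5, 0.5] -> u = 0.25

# ESR: expected utility of the returns, E[u(R)].
esr = probs @ np.array([u(r) for r in returns])    # u([1,0]) = u([0,1]) = 0 -> 0.0

print(ser, esr)   # 0.25 vs 0.0: the two criteria disagree for this nonlinear utility
```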

2.5 Monte Carlo tree search

One way of approaching a decision problem is to use tree search. Perhaps the most popular of such methods is Monte Carlo tree search (MCTS) [16], which employs heuristic exploration to construct its search tree. MCTS builds a search tree of nodes, where each node has a number of children. Each child node corresponds to an action available to the agent. MCTS has two phases: the planning phase and the execution phase.

In the planning phase the agent implements the following four steps [11]: selection, expansion, simulation and backpropagation. Selection: the agent traverses the search tree until it reaches a node for which not all of its possible child nodes have been explored. Expansion: at a node whose children have not all been expanded, the agent creates a random child node and then simulates the environment for the newly created child node. Simulation: the agent executes a random policy through Monte Carlo simulations until a terminal state of the environment is reached, at which point the agent receives the returns. Backpropagation: the agent backpropagates the returns received at a terminal state to each node visited during selection, where a predefined algorithm statistic, e.g. UCB [16, 38], is updated. Each step is repeated a specified number of times, which incrementally builds the search tree or, as we will discuss in the next subsection, a posterior belief over the returns from which we can draw actions using Thompson sampling [8].

During the execution phase the agent must select a child node, corresponding to an action and associated state transition, to traverse to next. The agent evaluates the statistic at each node that is reachable from the root node and moves to the node which returns the maximum value. Once the execution phase has completed, the agent repeats the planning phase.

As already highlighted MCTS makes decisions and explores based on a predefined algorithm statistic. One such version of MCTS is UCT [38] which uses the following formula to derive the optimal action at decision time while also incorporating exploration during learning:

$$\begin{aligned} v_{i} + C \times \sqrt{\frac{\ln (N)}{n_{i}}}, \end{aligned}$$
(3)

where \(v_{i}\) is the approximated value of the node i, \(n_{i}\) is the number of times the node i has been visited and N is the total number of times that the parent of node i has been visited. C is a hyperparameter that can be tuned for exploration; however, C is often set to \(\sqrt{2}\).
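A minimal implementation of this selection rule might look as follows (the node bookkeeping and function names are our own, and we assume every child has been visited at least once, since unvisited children are expanded first):

```python
import math

def uct_value(v_i: float, n_i: int, N: int, C: float = math.sqrt(2)) -> float:
    """UCT score of Eq. (3): value estimate plus an exploration bonus."""
    return v_i + C * math.sqrt(math.log(N) / n_i)

def select_child(children):
    """Pick the index of the child with the highest UCT score.

    `children` is a list of (value_estimate, visit_count) pairs; the parent's
    visit count N is taken as the sum of its children's visit counts.
    """
    N = sum(n for _, n in children)
    return max(range(len(children)),
               key=lambda i: uct_value(children[i][0], children[i][1], N))

# A well-valued, frequently visited child vs. a rarely tried one: the
# exploration bonus makes the under-explored child the one selected here.
print(select_child([(0.8, 50), (0.5, 2)]))   # -> 1
```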

2.6 (Bootstrap) Thompson sampling

As previously mentioned, during the planning phase of MCTS, we can use Thompson sampling to take exploring actions [8]. However, it is not always possible to get an exact posterior. In this case a bootstrap distribution over means can be used to approximate a posterior distribution [20, 49]. Eckles et al. [18, 19] use a bootstrap distribution to replace the posterior distribution used in Thompson Sampling. This method is known as Bootstrap Thompson Sampling (BTS) [18] and was proposed in the multi-arm bandit setting. The bootstrap distribution contains a number of bootstrap replicates, \(j \in \{1,\ldots , J\}\), where J is a hyper-parameter that can be tuned for exploration. For a small J, BTS can become greedy. A larger J value increases exploration, but at a computational cost [18].

Each bootstrap replicate, j, in the bootstrap distribution contains two parameters, \(\alpha _{j}\) and \(\beta _{j}\), where \(\frac{\alpha _{j}}{\beta _{j}}\) is an estimate of replicate j’s expected utility. At decision time, to determine the optimal action, the bootstrap distribution for each arm, i, is sampled: the observation for the corresponding bootstrap replicate, j, is retrieved and the arm with the maximum sampled expected utility is pulled [18].

The distribution which corresponds to the maximum arm is randomly re-weighted by simulating a coin-flip (commonly known as sampling from a Bernoulli bandit) for each bootstrap replicate, j, in the bootstrap distribution (see Algorithm 1). If the coin-flip is heads, the \(\alpha\) and \(\beta\) parameters for j are re-weighted.Footnote 2 To do so, the return is added to the \(\alpha _{j}\) value and 1 is added to \(\beta _{j}\) [18].

[Algorithm 1]

Bootstrap methods with random re-weighting [65] are more computationally appealing as they can be conducted online rather than re-sampling data [52]. BTS addresses problems of scalability and robustness when compared to Thompson Sampling [18]. Furthermore, bootstrap distributions can approximate posteriors that are difficult to represent exactly.
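A compact sketch of the per-arm bookkeeping described above is given below; it follows our reading of the BTS update, and the class and variable names are our own rather than those used in [18].

```python
import random

class BTSArm:
    """One arm's bootstrap distribution: J replicates, each with (alpha_j, beta_j)."""

    def __init__(self, J: int = 100):
        self.alpha = [1.0] * J   # running sums of observed utilities per replicate
        self.beta = [1.0] * J    # running (re-weighted) observation counts per replicate

    def sample_expected_utility(self) -> float:
        """Thompson-style draw: pick one replicate j and return alpha_j / beta_j."""
        j = random.randrange(len(self.alpha))
        return self.alpha[j] / self.beta[j]

    def update(self, utility: float) -> None:
        """Re-weight each replicate with probability 1/2 (the simulated coin flip)."""
        for j in range(len(self.alpha)):
            if random.random() < 0.5:   # heads
                self.alpha[j] += utility
                self.beta[j] += 1.0

# Action selection over several arms: sample each arm once and pull the maximum.
arms = [BTSArm() for _ in range(3)]
chosen = max(range(3), key=lambda i: arms[i].sample_expected_utility())
arms[chosen].update(utility=0.7)   # only the pulled arm's distribution is re-weighted
```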

2.7 Expected utility policy gradient

We now introduce Expected Utility Policy Gradient (EUPG) [61], a state-of-the-art MORL algorithm for ESR that we will use as a benchmark algorithm in our experiments. EUPG is an extension of Policy Gradient [72, 83], where Monte Carlo simulations are used to compute the returns and optimise the policy. EUPG calculates the accrued returns, \({\textbf {R}}^-_{t}\), which is the sum of the immediate returns received up to the current timestep, t. EUPG also calculates the future returns, \({\textbf {R}}^+_{t}\), which is the sum of the immediate returns from the current timestep, t, to the terminal state. Using both the accrued and future returns enables EUPG to optimise over the utility of the full returns of an episode, where the utility function is applied to the sum of \({\textbf {R}}^-_{t}\) and \({\textbf {R}}^+_{t}\).

In policy gradient the policies are adapted towards the attained utility by gradient descent. For EUPG the utility of the sum of the accrued and future returns is calculated inside the loss function, which results in the following:

$$\begin{aligned} \mathcal {L}(\pi ) = -\sum _{t=0}^{T} u({\textbf {R}}^-_{t} + {\textbf {R}}^+_{t}) \log (\pi _\theta (a|s, {\textbf {R}}^-_{t}, t)). \end{aligned}$$
(4)

Roijers et al. [61] demonstrated that, under the ESR criterion, the accrued and future returns must both be considered during learning in order to learn a good policy. By applying this consideration, EUPG achieves state-of-the-art performance under the ESR criterion. In this paper, we use the same method of adding past and future returns together before applying the utility inside the search scheme of our novel DMCTS algorithm.
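As an illustration of Eq. (4), the PyTorch-style sketch below computes the loss for a single episode from precomputed accrued and future returns; the function signature and tensor layout are our own assumptions and not the authors’ implementation.

```python
import torch

def eupg_episode_loss(log_probs, accrued, future, utility_fn):
    """Policy-gradient loss of Eq. (4) for one episode.

    log_probs : shape (T,), log pi_theta(a_t | s_t, R^-_t, t) for the actions taken
    accrued   : shape (T, n), accrued returns R^-_t at each timestep
    future    : shape (T, n), future returns R^+_t at each timestep
    utility_fn: maps an n-dimensional return vector to a scalar utility
    """
    utilities = torch.stack([utility_fn(accrued[t] + future[t])
                             for t in range(log_probs.shape[0])])
    # The utilities act as fixed weights on the log-probabilities.
    return -(utilities.detach() * log_probs).sum()

# Toy check with two objectives and the nonlinear utility u(r) = r_0 * r_1.
T = 3
log_probs = torch.log(torch.full((T,), 0.5, requires_grad=True))
accrued = torch.tensor([[0., 0.], [1., 0.], [1., 1.]])
future = torch.tensor([[1., 1.], [0., 1.], [0., 0.]])
loss = eupg_episode_loss(log_probs, accrued, future, lambda r: r[0] * r[1])
loss.backward()   # gradients flow back into the (here, dummy) policy parameters
```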

3 Expected scalarised returns

In the known utility function scenario, there are two phases: the planning (or learning) phase and the execution phase. During the planning or learning phase a policy is computed and returned to the user. After planning has completed the user executes the computed policy during the execution phase. In scenarios where the utility of a user is derived from the single execution of a policy, the expected scalarised returns (ESR) criterion must be optimised. Under the ESR criterion, the user will only execute the computed policy once in the execution phase.Footnote 3 The majority of RL research focuses on the SER criterion [54, 58], while the ESR criterion has been largely overlooked with some exceptions [32, 34, 42, 61, 74]. Additionally, the majority of RL research only considers linear utility functions. However, in the real world, utility functions can be nonlinear. A potential reason why the RL community has focused on linear utility functions is that nonlinear utility functions invalidate the Bellman equation, given nonlinear utility functions do not distribute across the sum of the immediate and future returns [61],

$$\begin{aligned} \begin{aligned} \max _\pi \ {}&\mathbb {E} \left[ u\left( \textbf{R}_t^- + \sum _{i=t}^{\infty } \gamma ^i \textbf{r}_i\right) \ \bigg |\ \pi , s_t \right] \not = \\&u(\textbf{R}_t^-) + \max _\pi \mathbb {E}\left[ u\left( \sum _{i=t}^{\infty } \gamma ^i \textbf{r}_i\right) \ \bigg |\ \pi , s_t \right] , \end{aligned} \end{aligned}$$
(5)

where u is a nonlinear utility functionFootnote 4 and \(\mathbf {R}^{-}_{t} = \sum _{i=0}^{t - 1} \gamma ^i \textbf{r}_i\). It has also been shown that for nonlinear utility functions the policies computed under the ESR criterion and the SER criterion can be different [59]. Therefore, to enhance RL’s usability in real-world problem domains, new methods must be formulated that can compute policies for the ESR criterion and the SER criterion for nonlinear utility functions.
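As a concrete scalar illustration (with numbers chosen purely for exposition), take \(u(x) = x^2\), accrued returns \(R^-_t = 2\) and future returns summing to 3; then

$$\begin{aligned} u(2 + 3) = 25 \ne u(2) + u(3) = 4 + 9 = 13, \end{aligned}$$

so the utility of the full return cannot be decomposed into separate accrued and future terms, which is precisely what breaks the standard Bellman recursion.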

When making decisions under the SER criterion, the expected returns are computed before the utility function is applied, and a decision is then selected based on the scalar utility of that expectation [77]. SER methods learn policies that optimise a user’s utility function over multiple policy executions; a user will therefore execute a policy computed under the SER criterion multiple times during the execution phase, and making decisions on an expected value vector is optimal [62]. However, a user optimising for the ESR criterion may only have one opportunity to execute a policy. In this case, making decisions based on a single expected value vector is not sufficient, because the user must have sufficient critical information about each potential return vector and its likelihood [32, 33]. Applying the utility function to each return vector before computing the expectation ensures the user has taken into consideration each potential outcome a policy may have and the associated utility. We outline two new algorithms that compute policies under the ESR criterion in Sects. 4 and 5.

4 Monte Carlo tree search for nonlinear utility functions

To compute policies for the ESR criterion when the utility function is nonlinear and known a priori [31], we present a novel Monte Carlo tree search algorithm, known as NLU-MCTS. As shown by Roijers et al. [61], in order to compute optimal policies under the ESR criterion, both the accrued and future returns must be taken into consideration before applying the utility function. Therefore, an algorithm must either maintain a distribution over the returns or have some method which allows the agent to sample from the underlying return distribution of the environment. NLU-MCTS utilises the latter, by performing Monte Carlo simulations to compute the future returns. Usually, in single-objective MCTS an expectation of the returns is maintained at each chance node and the agent seeks to maximise this expectation. When the utility function is nonlinear, making decisions based on the expected returns does not account for the potential undesired outcomes a decision may have. For risk-aware RL and MORL under the ESR criterion, we need to be able to make decisions with sufficient information to avoid undesirable outcomes and exploit positive outcomes. Our key insight is that the utility of the cumulative returns, i.e. the returns received from executing a full policy, can replace the expected future returns (of vanilla MCTS) at each node. We outline our algorithm for single-objective risk-aware RL and MORL that can compute policies for the ESR criterion.

Before we outline how the accrued and future returns are computed, we must describe the structure of the search tree constructed by NLU-MCTS. Under the ESR criterion the environment is assumed to be stochastic, i.e. the state transitions or the reward function are stochastic. To handle this uncertainty, NLU-MCTS builds an expectimax search tree using the same planning phase as MCTS (see Sect. 2.5). A search tree is a representation of the state-action space that is incrementally built via the steps of the underlying MCTS algorithm. An expectimax search tree [78] uses both decision and chance nodes. Figure 2 describes a search tree constructed by NLU-MCTS which contains both decision and chance nodes. Each decision node represents a state, action and reward of a MOMDP, where each decision node has a child chance node per action. In this paper we examine environments with stochastic rewards. Each chance node represents the state and action of a MOMDP. At each chance node, the environment is sampled; if a new observation-reward combination is generated when sampling the environment, a new child decision node is created. This process repeats as the agent traverses the search tree. It is important to note that each chance node and its parent decision node share the same state and action. To build and traverse a search tree similar to MCTS, NLU-MCTS uses the following phases: selection, expansion, simulation and backpropagation (Sect. 2.5).

Fig. 2: A representation of a search tree constructed using NLU-MCTS for a problem with stochastic rewards and two actions. The search tree contains both decision nodes, represented by circular nodes, and chance nodes, represented by octagons

Now that the structure of the underlying search tree has been outlined it is possible to describe how the cumulative returns and future returns are calculated. The accrued returns is the sum of returns the NLU-MCTS algorithm receives during the execution phase from timestep 0, \(t_{0}\), to timestep, \(t-1\), where \({\textbf {r}}_t\) is the reward vector received at each timestep,

$$\begin{aligned} {\textbf {R}}^-_{t} = \sum _{i=t_{0}}^{t-1} {\textbf {r}}_{i}. \end{aligned}$$

Given we utilise the underlying planning phases of Monte Carlo tree search, we can use the simulation phase to compute the future returns. As already mentioned, during the simulation phase the agent performs a random rollout, also known as a Monte Carlo simulation, until a terminal state is reached. Therefore, the future returns can be computed from the Monte Carlo simulations performed at each node during planning. Taking this into consideration, the future returns, \({\textbf {R}}^{+}_{t}\), is the sum of the rewards received when traversing the search tree during the planning phase and during the Monte Carlo simulation, from timestep, t, to a terminal node, \(t_{n}\),

$$\begin{aligned} {\textbf {R}}^+_{t} = \sum _{i=t}^{t_{n}} {\textbf {r}}_{i}. \end{aligned}$$
(6)

Finally, before the utility function is applied the cumulative returns must be calculated. The cumulative returns, \({\textbf {R}}_t\), is the sum of the accrued returns, \({\textbf {R}}^-_{t}\), and the future returns, \({\textbf {R}}^+_{t}\),

$$\begin{aligned} {\textbf {R}}_{t} = {\textbf {R}}^-_{t} + {\textbf {R}}^+_{t}. \end{aligned}$$
(7)

In other words, the cumulative returns is the returns received from a full policy execution. Once the cumulative returns, \({\textbf {R}}_{t}\), have been calculated, it is possible to compute the utility of the returns, \(u({\textbf {R}}_{t})\), to optimise for the ESR criterion.
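The bookkeeping above can be written down directly. The Python sketch below uses our own simplified interface: the accrued rewards come from the execution phase, while the tree and rollout rewards come from one planning iteration.

```python
import numpy as np

def cumulative_utility(accrued, tree_rewards, rollout_rewards, utility_fn):
    """u(R_t) = u(R^-_t + R^+_t) for one planning iteration.

    accrued        : rewards already received during the execution phase (R^-_t)
    tree_rewards   : rewards collected while traversing the search tree
    rollout_rewards: rewards from the random Monte Carlo rollout to a terminal state
    """
    R_minus = np.sum(accrued, axis=0) if len(accrued) else 0.0
    R_plus = np.sum(tree_rewards + rollout_rewards, axis=0)   # R^+_t, Eq. (6)
    return utility_fn(R_minus + R_plus)                       # u(R_t), Eq. (7)

# Two-objective example with u(r) = r_0 * r_1.
utility = cumulative_utility(
    accrued=[np.array([1.0, 0.0])],
    tree_rewards=[np.array([0.0, 1.0])],
    rollout_rewards=[np.array([1.0, 1.0])],
    utility_fn=lambda r: r[0] * r[1],
)
print(utility)   # (1 + 0 + 1) * (0 + 1 + 1) = 4.0
```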

As already highlighted, NLU-MCTS builds an expectimax search tree and utilises both decision and chance nodes. Over multiple iterations of the planning phase, NLU-MCTS constructs a search tree using the selection, expansion, simulation and backpropagation phases used by traditional MCTS [70]. We outline the NLU-MCTS algorithm in Algorithm 2.

Firstly, NLU-MCTS utilises the selection phase (Algorithm 3, see Fig. 3), where the agent traverses the search tree starting at the current root decision node [68]. During the selection phase, we utilise outcome selection for chance nodes and action selection for decision nodes. When the agent arrives at a chance node, we perform outcome selection where the agent simulates the environment model (Algorithm 5). The agent then moves to the child decision node corresponding to the observation-reward combination received from the simulation [68]. When the agent arrives at a decision node, \(n_{d}\), the agent must decide which of its child chance nodes, \(C_{n_d}\), to select. Therefore, NLU-MCTS selects the chance node, \(n_{c}\), which maximises the UCB term:

$$\begin{aligned} \text {BestChild} = \arg \max _{n_{c} \in C_{n_d}} \text {UCB}(n_{d}, n_{c}), \end{aligned}$$
(8)

where the UCB term is defined as follows:

$$\begin{aligned} \text {UCB}(n_{d}, n_{c}) = \frac{v_{n_{c}}}{N_{n_{c}}} + C \times \sqrt{\frac{\ln (N_{n_{d}})}{N_{n_{c}}}}, \end{aligned}$$
(9)

where \(v_{n_{c}}\) is the total utility of the child chance node \(n_{c}\), \(N_{n_{d}}\) and \(N_{n_{c}}\) are the number of times \(n_{d}\) and \(n_{c}\) have been visited respectively, \(\frac{v_{n_{c}}}{N_{n_{c}}}\) is therefore the expected utility of the child node \(n_{c}\), and C is an exploration constant. Equation 9 ensures that the agent explores areas of the tree which have not been visited often while also ensuring that the agent exploits nodes which have good returns. The agent then traverses to the chance node corresponding to the best action. The agent continues to traverse the search tree until a decision node is encountered which has not had all of its children expanded. The agent then progresses to the expansion phase (Algorithm 4), where the selected decision node is utilised. It is important to note that, as the agent traverses the search tree, the future returns, \({\textbf {R}}_{t}^{+}\), is being computed incrementally.

Fig. 3: During the selection phase, NLU-MCTS starts at the root node and traverses down the search tree (nodes highlighted in red). The agent traverses the search tree until a leaf decision node is found

Fig. 4: During the expansion phase of NLU-MCTS (nodes highlighted in red), a child chance node is created. The newly generated chance node simulates the environment and creates a child decision node for the corresponding reward received

During the expansion phase (Algorithm 4, see Fig. 4), the agent considers a decision node selected during the previous phase which has not had all of its children expanded. There are three steps to the expansion phase. Firstly, an action that has not yet been expanded is selected at random and a child chance node corresponding to that action is created. Secondly, the agent simulates the environment model for the newly created chance node. Finally, for the previously created chance node, the agent creates a child decision node corresponding to the observation-reward combination received. It is important to note that both a chance node and a decision node are generated during the expansion phase. The newly created decision node is then utilised in the next phase, known as the simulation phase.

Fig. 5: During the simulation phase of NLU-MCTS (nodes highlighted in red), the decision node generated in the expansion phase executes a random policy until a terminal state. Finally, the cumulative returns \({\textbf {R}}_{t}\) is computed

After expansion, the created decision node must be simulated. Figure 5 highlights the simulation phase (Algorithm 6) for NLU-MCTS. When a decision node is simulated, a random rollout is executed. During the rollout, a random policy is followed until a terminal state is reached. Once the simulation has completed, the cumulative returns, \({\textbf {R}}_{t}\), can be computed. The future returns, \({\textbf {R}}_{t}^{+}\), is equal to the sum of the rewards received when traversing the search tree and the returns from the random rollout in the simulation phase. The cumulative returns, \({\textbf {R}}_{t}\), is then computed by adding both the accrued returns, \({\textbf {R}}^{-}_{t}\), and the future returns, \({\textbf {R}}^{+}_{t}\). We note that \({\textbf {R}}_t\) is the same for every node during backpropagation.

Fig. 6: During the backpropagation phase, the cumulative returns, \({\textbf {R}}_{t}\), is backpropagated to each node visited during the planning phase

Figure 6 and Algorithm 7 outline the backpropagation phase of NLU-MCTS. Once the simulation phase has completed, the cumulative returns, \({\textbf {R}}_{t}\), is backpropagated to each node visited during the previous phases of the search tree. As the agent backpropagates the cumulative returns, the agent updates the required statistic for each node.

Under the ESR criterion, the utility of the cumulative returns, u(\({\textbf {R}}_{t}\)), is computed during the backpropagation phaseFootnote 5 by applying the known utility function, u, to the cumulative returns, \({\textbf {R}}_{t}\). Therefore, during backpropagation, the statistics at each chance node are updated by updating the total utility, v, of the node as follows:

$$\begin{aligned} v_{n_{c}} \leftarrow v_{n_{c}} + u({\textbf {R}}_{t}). \end{aligned}$$
(10)

The visit counts for both chance nodes and decision nodes are also updated as follows:

$$\begin{aligned} N_{n_{c}} \leftarrow N_{n_{c}} + 1, \end{aligned}$$
(11)
$$\begin{aligned} N_{n_{d}} \leftarrow N_{n_{d}} + 1. \end{aligned}$$
(12)

The NLU-MCTS algorithm runs each step of the planning phase (selection, expansion, simulation and backpropagation) a specified number of times. We denote the number of times the planning phase is run as \(n_{exec}\). Once the NLU-MCTS algorithm has run the planning phase \(n_{exec}\) times, the algorithm returns the best action to take from the current root node, \(n_{r}\). Under the ESR criterion, the best action, \(a^{*}\), can be calculated by evaluating the expected utility of each of the current root node’s, \(n_{r}\), children, \(C_{n_{r}}\), and taking the action which returns the maximum expected utility as follows:

$$\begin{aligned} a^{*} = \arg \max _{n \in C_{n_{r}}} \frac{v_{n}}{N_{n}}. \end{aligned}$$
(13)
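Putting the backpropagation updates (Eqs. 10–12) and the final action choice (Eq. 13) together, each chance node only needs two statistics; the Python sketch below uses our own minimal node structure to illustrate this bookkeeping.

```python
class ChanceNode:
    def __init__(self, action):
        self.action = action
        self.total_utility = 0.0   # v_{n_c}
        self.visits = 0            # N_{n_c}

    def backpropagate(self, utility):
        """Eqs. (10) and (11): add the utility of the cumulative returns, count the visit."""
        self.total_utility += utility
        self.visits += 1

    def expected_utility(self):
        return self.total_utility / self.visits if self.visits else float("-inf")

def best_action(root_children):
    """Eq. (13): the action of the child chance node with the highest expected utility."""
    return max(root_children, key=lambda c: c.expected_utility()).action

# After n_exec planning iterations, the accumulated statistics decide the action.
children = [ChanceNode("a0"), ChanceNode("a1")]
children[0].backpropagate(0.2)
children[0].backpropagate(0.4)
children[1].backpropagate(0.9)
print(best_action(children))   # "a1"
```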
[Algorithms 2–7: NLU-MCTS pseudocode]

5 Distributional Monte Carlo tree search

Monte Carlo tree search for nonlinear utility functions (NLU-MCTS) utilises the UCB statistic to explore during planning. However, Thompson sampling methods have been shown to outperform UCB methods in bandit settings [15, 66]. Therefore, to exploit the potential performance increases associated with Thompson sampling methods, we present a novel distributional Monte Carlo tree search algorithm (DMCTS) that learns a posterior distribution over the expected utility of the returns.

[Algorithms 8–9: DMCTS pseudocode]

Firstly, it is important to discuss how DMCTS (Algorithm 8) builds an underlying search tree. DMCTS builds an expectimax search tree using the same planning phase as NLU-MCTS (see Sect. 4). However, DMCTS takes a distributional approach to decision making.

DMCTS aims to maintain a posterior distribution over the expected utility of the returns at each chance node. However, because the utility function may be nonlinear, a parametric form of the posterior distribution may not exist. Since a bootstrap distribution can be used to approximate a posterior [20, 49], it is much more suitable to maintain a bootstrap distribution over the expected utility of the returns at each chance node.

Each bootstrap distribution contains a number of bootstrap replicates, \(j \in \{1,\ldots , J\}\) [18] (see Sect. 2.6). It is important to note the number of bootstrap replicates, J, is a hyperparameter that can be tuned for exploration [18]. Each bootstrap replicate, j, in the bootstrap distribution has two parameters, \(\alpha _{j}\)Footnote 6 and \(\beta _{j}\), where \(\frac{\alpha _{j}}{\beta _{j}}\) is the expected utility for replicate j. On initialisation of a new node, for each bootstrap replicate, j, the parameters \(\alpha _{j}\) and \(\beta _{j}\) are both set to 1. Moreover, \(\alpha _{j}\) can be set to positive or negative values to increase initial exploration without a computational cost. Figure 7 outlines a bootstrap distribution learned by the DMCTS algorithm. For ESR settings, the expected utility of each bootstrap replicate, j, can be computed as follows:

$$\begin{aligned} \mathbb {E}(u(j)) = \frac{\alpha _{j}}{\beta _{j}}. \end{aligned}$$
(14)

It is important to note that, similarly to NLU-MCTS, DMCTS requires the utility function of the user to be known a priori. The bootstrap distribution is updated during the backpropagation phase of the DMCTS algorithm.

Fig. 7: A bootstrap distribution learned by DMCTS with the number of bootstrap replicates, J, set to 8. The expected utility for each bootstrap replicate, j, can be calculated by \(\frac{\alpha _{j}}{\beta _{j}}\). For example, the expected utility for bootstrap replicate \(j_{4}\) can be calculated as follows: \(\mathbb {E}(u(j_{4})) = \frac{\alpha _{j_{4}}}{\beta _{j_{4}}}\)

[Algorithm 10]

During the backpropagation phase (Algorithm 10) the cumulative returns is backpropagated and the bootstrap distribution at each chance node is updated. Algorithm 11 outlines how a bootstrap distribution for a node is updated for the ESR criterion. In this paper, we do not use discounting as we perform evaluations only on finite horizon tasks. We note that DMCTS can easily be adapted to discounted settings. At chance node, i, for each bootstrap replicate, j, a coin flip is simulated (See Algorithm 11, Line 4). If the result of the coin flip is equal to 1 (heads), \(\alpha _{ij}\) and \(\beta _{ij}\) are updated:

$$\begin{aligned} \alpha _{ij} \leftarrow \alpha _{ij} + u({\textbf {R}}_t) \\ \beta _{ij} \leftarrow \beta _{ij} + 1 \end{aligned}$$

To select actions while planning (Algorithm 9), we use the previously computed statistics. At each timestep the agent must choose which action to execute in order to traverse the search tree (as outlined in Algorithm 12). At decision node n, we select an action by sampling the bootstrap distribution at each child chance node, i. For each sampled bootstrap replicate, j, the \(\alpha _{ij}\) and \(\beta _{ij}\) values are retrieved and \(\frac{\alpha _{ij}}{\beta _{ij}}\) is computed. Since the following approximation holds,

$$\begin{aligned} \frac{\alpha _{ij}}{\beta _{ij}} \approx \mathbb {E}[u({\textbf {R}}^-_{t} + {\textbf {R}}^+_{t})], \end{aligned}$$
(15)

maximising over i in Eq. 15 selects each action approximately in proportion to the probability of that action being optimal, as per the Bootstrap Thompson Sampling exploration strategy. The agent then executes the action, \(a^*\), which corresponds to the following:

$$\begin{aligned} a^* = \arg \max _{i} \frac{\alpha _{ij}}{\beta _{ij}}. \end{aligned}$$
(16)
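As a minimal sketch of this selection and update logic (the function names and data layout are our own, and J is kept small purely for illustration):

```python
import random

def select_child_thompson(children):
    """Eq. (16): sample one replicate per child chance node and take the arg max.

    `children` maps an action to that child's replicate lists (alpha, beta).
    """
    def sampled_expected_utility(alpha, beta):
        j = random.randrange(len(alpha))   # draw one bootstrap replicate
        return alpha[j] / beta[j]          # Eq. (15): approximates E[u(R^-_t + R^+_t)]

    return max(children, key=lambda a: sampled_expected_utility(*children[a]))

def backpropagate_bts(alpha, beta, utility):
    """Coin-flip re-weighting of each replicate with the utility of the cumulative returns."""
    for j in range(len(alpha)):
        if random.random() < 0.5:   # heads
            alpha[j] += utility
            beta[j] += 1.0

# Two child chance nodes with J = 4 replicates each (all parameters initialised to 1).
children = {"a0": ([1.0] * 4, [1.0] * 4), "a1": ([1.0] * 4, [1.0] * 4)}
backpropagate_bts(*children["a1"], utility=0.8)
print(select_child_thompson(children))
```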

We note that, at execution time, we can calculate the best action (Algorithm 8, Line 11) by averaging over all of the acquired data and selecting the action that maximises this average, thereby maximising the ESR criterion:

$$\begin{aligned} ESR = \mathbb {E}[u({\textbf {R}}^-_{t} + {\textbf {R}}^+_{t})]. \end{aligned}$$
(17)
[Algorithms 11–12]

Using the outlined algorithm, DMCTS is able to learn policies for risk-aware settings and under ESR for multi-objective settings. In Sect. 6, we evaluate DMCTS for risk-aware settings and multi-objective settings for the ESR criterion.

6 Experiments

In order to evaluate NLU-MCTS and DMCTS, we test both algorithms in multiple settings. Firstly, we perform an ablative study to outline the effect on computation and performance the J parameter has when computing the BTS distribution for DMCTS. We then evaluate NLU-MCTS and DMCTS in a risk-aware setting. Finally, we evaluate both algorithms in multi-objective settings under the ESR criterion. In multi-objective settings, we test our algorithms on variants of standard benchmark problems from the MORL literature.

We also evaluate NLU-MCTS and DMCTS against two other state-of-the-art RL algorithms: Expected Utility Policy Gradient (EUPG) [61] and C51 [10]. EUPG is the only prior MORL algorithm that can compute policies under the ESR criterion and therefore represents the state of the art in this setting [61]. We use C51 as a baseline algorithm for our evaluation of DMCTS given that C51 is a distributional RL algorithm and has achieved state-of-the-art performance [10].

At each timestep for NLU-MCTS and DMCTS, the planning phase is performed multiple times before an action is selected during the execution phase. To fairly evaluate all other algorithms against NLU-MCTS and DMCTS, we have altered each benchmark algorithm to have the same number of policy executions of each environment at each timestep as NLU-MCTS and DMCTS. At each timestep, each algorithm gets \(n_{exec}\) full policy executions worth of learning from that state and timestep onward. Therefore, if \(n_{exec} = 10\), NLU-MCTS and DMCTS perform the planning phase ten times before selecting an action. To ensure that C51 and EUPG get the same opportunity to learn, both algorithms are altered to execute a policy \(n_{exec}\) times from the current state. For the other algorithms (except NLU-MCTS and DMCTS) this has the effect of increasing the learning speed. The number of policy executions \(n_{exec}\) varies for each problem domain. All experiments are averaged over 10 runs.

6.1 Ablation study

Before we evaluate both NLU-MCTS and DMCTS in risk-aware and multi-objective sequential decision making problems, we empirically evaluate how the Bootstrap Thompson Sampling (BTS) parameter settings affect performance and run time. We also provide a visualisation that shows how a BTS distribution is updated over time to estimate the underlying posterior distribution over the expected utility. Finally, we evaluate the performance of DMCTS under different J values in a MOMDP, to highlight how the selection of the J value can affect performance in sequential settings.

6.1.1 Bootstrap Thompson sampling J values and runtime

To illustrate how a BTS distribution evolves over time, we update a single BTS distribution based on the returns of a simple multi-objective bandit. In this setting the bandit has one arm, where there is a 0.5 chance of receiving the following return: \({\textbf {r}} = [1, 1]\), and a 0.5 chance of receiving the following return: \({\textbf {r}} = [0, 0]\). The returns are then scalarised using the following utility function:

$$\begin{aligned} u = r_{1} r_{2}, \end{aligned}$$
(18)

where \(r_{1}\) and \(r_{2}\) are the returns for objective 1 and objective 2 respectively. In this example, the expected utility is 0.5, since the utility is 1 with probability 0.5 and 0 otherwise.

Using this bandit we update a single BTS distribution and show how the distribution evolves over a number of updates using the utility of the returns. Figure 8 outlines how a BTS distribution with 25 bootstrap replicates evolves after 1, 8, 32, 128, 250 and 500 updates.

Fig. 8: A BTS distribution after 1, 8, 32, 128, 250 and 500 updates. After 500 updates the distribution converges to the correct expected utility, where expected utility is on the x-axis

Next, we investigate the computational run time for a BTS distribution with a varying number of replicates, J. To evaluate the run time for each chosen J value we compute the time in seconds taken to perform 1,000 updates of a BTS distribution. This experiment was performed 10 times for each J value and the average run time was computed. To evaluate the run time we use the following J values: 10, 100, 200, 300, 400, 500, 600, 700, 800, 900 and 1000, and present the results in Fig. 9.

Fig. 9: The run time in seconds required to complete 1000 updates of a BTS distribution for different J values. The run time required increases linearly with the increase in the J value

Figure 9 shows that the run time in seconds increases linearly with the number of replicates J. The hyperparameter J can therefore have an impact on the run time of the algorithm and should be taken into consideration in order to optimise performance. Next we evaluate the performance of a BTS distribution for multiple J values in a multi-objective multi-armed bandit setting. By comparing run time and performance it should be possible to determine which J values can be selected for good performance and efficiency.

6.1.2 Bootstrap Thompson sampling J values & performance

To investigate the effect of the hyperparameter J on the performance of DMCTS, we consider a multi-objective multi-armed bandit (MOMAB) setting where a BTS distribution is utilised per arm to determine which arm is optimal for a given utility function.

We utilise a MOMAB setting from the literature [63], with 5 arms and the following ground truth mean vectors: (0, 0.8), (0.4, 0.4), (0.8, 0.0) and (0.9, 0.1). Each reward distribution is multi-variate Gaussian with correlations 0 and in-objective variance 0.0005 [63]. We utilise the following utility function:

$$\begin{aligned} u = 6.25 \max (r_{0}, 0) \max (r_{1}, 0). \end{aligned}$$
(19)

In this setting, the arm with mean vector (0.4, 0.4) is optimal and returns an expected utility of 1, since \(6.25 \times 0.4 \times 0.4 = 1\). We run BTS for 10,000 trials for the following J values: 10, 100, 500 and 1000.

Figure 10 presents the results for each J value in the MOMAB setting. In this case it is clear that the choice of J value has little impact on the algorithm’s ability to compute the optimal utility. Therefore, given the computational results presented in Fig. 9, a lower J value may be preferred.

Fig. 10: The performance of the BTS algorithm in a multi-objective multi-armed bandit setting for different J values. For each J value the algorithm converges to the optimal utility of 1

Although the results presented in Fig. 10 show that the choice of J has little impact on the BTS distributions, it has been shown by Eckles and Kaptein [18] that J values lower than 100 lead to higher levels of regret in single-objective bandit settings. Therefore, we aim to utilise a J value of around 100 for DMCTS, given the run time for J values of around 100 is relatively low. Such values also provide good performance while avoiding the limitations highlighted by Eckles and Kaptein [18]. However, we acknowledge that the J value is problem dependent and certain problems may require a higher J.

6.1.3 Bootstrap Thompson sampling J values in MOMDPs

To evaluate the effect the selection of the J parameter for the BTS distribution has on the performance of DMCTS, we run DMCTS using different J values in a MOMDP. To do so, we have utilised a random MOMDP from the literature [61]. The random MOMDP is configurable based on the requirements of the experiments, where the numbers of states, actions, objectives, timesteps, and possible successor states can be determined a priori. The random MOMDP can then be initialised for each experiment by selecting a consistent seed. We generate a random MOMDP with 20 states, 2 actions, and 2 objectives. The transition function \(T(s, a, s')\) is generated using \(N = 8\) possible successor states per action, with random probabilities drawn from a uniform distribution [61]. We use the following nonlinear utility function:

$$\begin{aligned} u = r_{1}^{2} + r_{2}^{2}. \end{aligned}$$
(20)

Fig. 11: Evaluation of different J values in a random MOMDP with 20 states, 2 actions, 2 objectives and 8 successor states reachable from each state

We evaluate DMCTS using J values of 1, 2, 10, 100, 500 and 1000 for the BTS distributions. Figure 11 outlines the results from the random MOMDP. Utilising a J value of 1 has a clear impact on performance, given DMCTS with J set to 1 achieves a lower utility compared to the other parameter settings. As we increase the J value to 2, performance begins to improve. However, for a very low J value (\(J=1\) or \(J=2\)) DMCTS selects actions greedily and does not explore the environment enough to obtain a good utility. As we increase the J value further, performance continues to improve: once the J value is set to 10, DMCTS shows a large increase in performance, and once the J value increases to 100 we see even better performance. However, we do not see any further performance increases for DMCTS in the random MOMDP for higher J values; when the J value is set to 500 or 1000 the performance does not increase relative to \(J = 100\), while the computational cost of updating the BTS distributions increases. Therefore, it is important to ensure that the J value is set sufficiently high for exploration, while also avoiding J values with a high computational cost, and it may be necessary to tune the J value depending on the evaluation setting.

6.2 Risk-Aware MDP

Before testing NLU-MCTS and DMCTS on benchmark problems from the MORL literature, we evaluate both algorithms in a risk-aware problem domain under the ESR criterion. Shen et al. [69] define a Risk-Aware MDP where an agent must decide in which of a number of stocks to invest. The underlying MDP has 4 actions (each action is a monetary amount, in Euros, of investment) and 7 states. At each timestep the agent must select a monetary amount to invest in the stock for a given state: the agent can invest €0, €1, €2 or €3 in a stock at each timestep. Each stock has a probability of making a profit and a probability of making a loss, where the agent’s return is the action multiplied by the stock price. All remaining implementation details can be found in the work of Shen et al. [69]. In risk-aware decision making a user can be risk-seeking, risk-averse or risk-neutral. A user’s preference for risk is described by their individual utility function, which can often be nonlinear. Given that risk-based decision making scenarios are ubiquitous in the real world [23], it is important that algorithms can compute policies for risk-aware nonlinear utility functions. Therefore, to highlight the usability of NLU-MCTS and DMCTS in risk-aware decision making scenarios, we evaluate them on the outlined Risk-Aware MDP using nonlinear risk-seeking and risk-averse utility functions.

6.2.1 Risk-seeking utility function

Firstly, we evaluate DMCTS and NLU-MCTS in the Risk-Aware MDP using the following risk-seeking utility function:

$$\begin{aligned} u(x) = (\max (0, x))^2. \end{aligned}$$
(21)

In the Risk-Aware MDP, utilising the risk-seeking utility function presented in Sect. 2.3 would reward the agent with a positive utility when the returns are negative. Therefore, we have used the utility function in Eq. 21 to ensure that negative returns are not seen as a positive outcome by the agent. The nonlinear utility function outlined in Eq. 21 is risk-seeking given the shape of the utility function is convex.

Fig. 12: Results from the Risk-Aware MDP environment where DMCTS is evaluated against NLU-MCTS using a risk-seeking utility function. DMCTS achieves a higher utility compared to NLU-MCTS for a risk-seeking utility function

For all experiments in the Risk-Aware MDP with the risk-seeking utility function, the parameter \(n_{exec}\) is set to 10 for each algorithm and each experiment lasts for 1000 episodes. For DMCTS we set the number of bootstrap replicates, J, for the bootstrap distribution as follows: \(J = 500\). For NLU-MCTS we set \(C = \sqrt{2}\).

Figure 12 describes the experimental results for the risk-seeking utility function in the Risk-Aware MDP. While both algorithms learn good stable policies, DMCTS achieves a higher utility when compared to NLU-MCTS for the risk-seeking utility function.

6.2.2 Risk-averse utility function

Secondly, we evaluate DMCTS and NLU-MCTS using the following risk-averse utility function:

$$\begin{aligned} u(x) = x^{\frac{1}{2}}. \end{aligned}$$
(22)

The utility function in Eq. 22 is risk-averse given the shape of the utility function is concave.

For all experiments in the Risk-Aware MDP with the risk-averse utility function, the parameter \(n_{exec}\) is set to 10 for each algorithm and each experiment lasts for 1000 episodes. For DMCTS we set the number of bootstrap replicates, J, for the bootstrap distribution as follows: \(J = 500\). For NLU-MCTS we set \(C = \sqrt{2}\). In the Risk-Aware MDP the returns can be negative. Therefore, for the risk-averse utility function, we add 150 to the returns, which offsets the minimum return the agent can achieve and ensures the utility function is applied to non-negative values.

Figure 13 shows that both NLU-MCTS and DMCTS can compute good policies for the risk-averse utility function. DMCTS achieves a higher utility when compared to NLU-MCTS.

Fig. 13: Results from the Risk-Aware MDP environment where DMCTS is evaluated against NLU-MCTS using a risk-averse utility function. DMCTS achieves a higher utility compared to NLU-MCTS for a risk-averse utility function

6.2.3 Discussion of experimental results for risk-aware MDP

Both NLU-MCTS and DMCTS learn good policies for the risk-seeking and risk-averse utility functions. However, DMCTS achieves a higher utility in both settings. It is important to note that both the risk-seeking and risk-averse utility functions are nonlinear. Therefore, for any algorithm to compute good policies, the cumulative returns must be taken into consideration. Both NLU-MCTS and DMCTS compute policies based on the expected utility by computing the cumulative returns. If the cumulative returns were not taken into consideration we would expect the utility obtained by both algorithms to be lower, given that nonlinear utility functions do not distribute across the sum of the future and immediate returns. Therefore, for risk-aware settings it is important that the cumulative returns are calculated before the utility function is applied. Taking this approach ensures that an agent can make decisions with knowledge of how future outcomes may affect utility. This is important in risk-aware settings given an agent may only have one opportunity to execute a policy, and having access to the expected utility of a policy ensures sufficient information is available to the agent so utility can effectively be optimised.

While both NLU-MCTS and DMCTS learn good policies for both risk-aware utility functions, DMCTS achieves a higher utility. The key difference between NLU-MCTS and DMCTS is how each algorithm explores during planning. NLU-MCTS utilises UCB while DMCTS uses the Thompson sampling method, Bootstrap Thompson sampling. In the bandit literature Thompson sampling methods have been shown to empirically outperform UCB [15]. Thompson sampling selects actions proportional to the probability of an action being optimal [67]. Therefore, by maintaining an approximate posterior distribution via a bootstrap distribution at each chance node, and using Thompson sampling to sample from each of these distributions to select actions, DMCTS can exploit the performance gains of Thompson sampling to achieve a higher utility than NLU-MCTS.

6.3 Multi-objective MDPs

To evaluate NLU-MCTS and DMCTS in multi-objective settings under the ESR criterion, we use a number of problem domains. Firstly, we evaluate NLU-MCTS and DMCTS in the Fishwood problem [61], given this is one of the very few domains for which ESR results have been published. Secondly, we evaluate NLU-MCTS and DMCTS in the Renewable Energy Dynamic Economic Emissions Dispatch (REDEED) problem domainFootnote 7.

6.3.1 Fishwood

Fishwood is a multi-objective benchmark problem proposed by Roijers et al. [61]. In Fishwood the agent can be in one of two states: in the woods or at the river. The goal of the agent is to catch fish and collect wood. The Fishwood environment is parameterised by the probabilities of successfully obtaining fish and wood at these respective states. In this paper we use the following values: at the river the agent has a 0.25 chance of catching a fish and in the woods the agent has a 0.65 chance of acquiring wood. For every fish caught, two pieces of wood are required to cook the fish, which results in a utility of 1. The goal in this setting is to maximise the following nonlinear utility function:

$$\begin{aligned} u = \min \left( \texttt {fish}, \left\lfloor \frac{\texttt {wood}}{2} \right\rfloor \right) . \end{aligned}$$
(23)

As demonstrated by Roijers et al. [61], to maximise utility in Fishwood it is essential that both past and future returns are taken into consideration when learning. For example, if there are 5 timesteps remaining and the agent has received 2 pieces of wood, the agent should go to the river and try to catch a fish to ensure a utility of 1 [61].
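
As a small worked example (illustrative only), the snippet below applies Eq. 23 to the cumulative returns and shows why, with two pieces of wood already accrued, heading to the river is the better choice.

def fishwood_utility(fish, wood):
    # Eq. 23: each fish requires two pieces of wood to be cooked.
    return min(fish, wood // 2)

# Hypothetical situation from the text: 2 pieces of wood already accrued.
accrued_fish, accrued_wood = 0, 2

# Catching one more fish at the river yields a utility of 1 ...
print(fishwood_utility(accrued_fish + 1, accrued_wood))  # -> 1
# ... whereas gathering more wood without a fish still yields a utility of 0.
print(fishwood_utility(accrued_fish, accrued_wood + 3))  # -> 0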

We evaluate NLU-MCTS and DMCTS in the Fishwood domain against Expected Utility Policy Gradient (EUPG) [61] and C51 [10]. EUPG achieves state-of-the-art results in the Fishwood problem under ESR [61]. C51 [10] is a distributional deep reinforcement learning algorithm that achieved state-of-the-art results in the Atari game problem domain.

For C51 the learning parameters were set as follows: \(V_{min} = 0\), \(V_{max} = 2\), \(\epsilon = 0.01\), \(\gamma = 1\) and \(\alpha = 0.0001\). For DMCTS we set the number of bootstrap replicates, J, in the bootstrap distribution as follows: \(J = 100\). For NLU-MCTS we set \(C = \sqrt{2}\). EUPG is conditioned on the accrued returns and the current timestep, t. We set \(n_{exec} = 2\) and run each experiment for 10,000 episodes where each episode has 13 timesteps.

Fig. 14 Results from the Fishwood environment where DMCTS achieves state-of-the-art performance in a multi-objective setting over EUPG

As shown in Fig. 14, the utility for C51 fluctuates throughout experimentation and it fails to learn a consistent policy. Given C51 does not take the accrued returns into consideration during learning, the utility function is applied directly to the reward received by the agent. The reward received by an agent in the Fishwood domain can be [0, 1] or [1, 0]. Applying the utility function in Eq. 23 directly to such a reward means the C51 agent can only ever receive a utility of 0. DMCTS, NLU-MCTS and EUPG all take the accrued and future returns into consideration and can therefore learn better policies than C51.

DMCTS, NLU-MCTS and EUPG outperform C51, and DMCTS and NLU-MCTS achieve a higher utility than EUPG. All algorithms, except C51, use Monte Carlo simulations of the environment and optimise over the expected utility of the returns of a full episode. Although EUPG uses Monte Carlo simulations of the environment, policy gradient algorithms are sample inefficient. DMCTS and NLU-MCTS are more sample efficient given both algorithms utilise a planning phase, which has been shown to be sample efficient [4, 14].

In the Fishwood environment, the agent is not guaranteed to obtain a fish or a piece of wood. For an action in a particular state the agent may need multiple simulations to understand the underlying distribution of the stochastic rewards. Both DMCTS and NLU-MCTS build a search tree, which enables the agent to re-sample the environment at each chance node during learning. However, DMCTS achieves an overall higher utility when compared with NLU-MCTS despite both algorithms utilising repeated sampling at each chance node and Monte Carlo simulations.

6.3.2 Renewable energy dynamic economic emissions dispatch

Next, we evaluate NLU-MCTS and DMCTS in a complex problem domain with a large state-action space. Renewable Energy DEED (REDEED) is a variation of the traditional DEED problem [9]. In REDEED, the power demand for a city must be met over 24 h. To supply the city with sufficient power, a number of generators are required: 9 fossil-fuel-powered generators, including a slack generator, and 1 generator powered by renewable energy from a wind turbine. The optimal power output for each generator was derived by Mannion et al. [43], and the derived values are used for both the fossil fuel generators and the renewable energy generator. In this example, Generator 3 is controlled by an agent, Generator 1 is a slack generator and Generator 4 is powered by a wind turbine.

In this setting we imagine a period of 24 h, and for each hour we receive a weather forecast for a city. For hours 1–15, the weather is predictable and the optimal power values derived by Mannion et al. [43] can be used to generate power. For hours 16–24, a storm is forecast for the city. During the storm, both high and low levels of wind are expected, and the weather forecast impacts how much power the wind turbine can generate. At each hour during the storm, there is a 0.15 chance the wind turbine will produce \(25\%\) less power than optimal, a 0.7 chance the wind turbine will produce optimal power and a 0.15 chance the wind turbine will produce \(25\%\) more power than optimal. In the REDEED problem we aim to learn a policy that meets the required power demand over the entire day while reducing both the cost and the emissions created by all generators.
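
The storm model above can be summarised by the following sketch, where optimal_power is the optimal turbine output for the given hour derived by Mannion et al. [43]; the function name and structure are ours and purely illustrative.

import random

def turbine_output(hour, optimal_power):
    # Hours 1-15: predictable weather, the turbine produces its optimal power.
    if hour <= 15:
        return optimal_power
    # Hours 16-24: storm. 0.15 chance of 25% less power, 0.7 chance of optimal
    # power, and 0.15 chance of 25% more power.
    factor = random.choices([0.75, 1.0, 1.25], weights=[0.15, 0.7, 0.15])[0]
    return factor * optimal_power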

The goal is to maximise the following nonlinear utility function under the ESR criterion,

$$\begin{aligned} R_{+} = \prod _{o=1}^{O} f_{o}, \end{aligned}$$
(24)

where \(f_{o}\) is the objective function for each objective, \(o \in O\) [26, 43].

The following equation calculates the local cost for each generator n, at each hour m:

$$\begin{aligned} f_c^L(n,m) = a_n + b_n P_{nm} + c_n (P_{nm})^2 + |d_n \sin \{ e_n (P_{n}^{min} - P_{nm}) \}|. \end{aligned}$$
(25)

Therefore the global cost for all generators can be defined as:

$$\begin{aligned} f_c^G(m) = \sum \limits _{n=1}^N f_c^L(n,m). \end{aligned}$$
(26)

The local emissions for each generator, n, at each hour, m, is calculated using the following equation:

$$\begin{aligned} f_e^L(n,m) = E(a_n + b_n P_{nm} + \gamma _n(P_{nm})^2 + \eta \exp (\delta P_{nm})). \end{aligned}$$
(27)

Therefore the global emissions for all generators can be defined as:

$$\begin{aligned} f_e^G(m) = \sum \limits _{n=1}^N f_e^L(n,m). \end{aligned}$$
(28)

It is important to note that the emissions for the generator powered by the wind turbine are set to 0.

If the agent exceeds the ramp and power limits, a penalty is received. A global penalty function \(f_p^G\) is defined to capture violations of these constraints,

$$\begin{aligned} f_{p}^G(m) = \sum \limits _{v=1}^V C(|h_{v} + 1| \delta _v). \end{aligned}$$
(29)

Along with cost and emissions, the penalty function is an additional objective that must be optimised. Some parameters for this problem domain have not been included here; all equations and parameters absent from this paper that are required to implement the problem domain can be found in the works of Basu [9] and Mannion et al. [43].
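
For reference, the objective structure of Eqs. 24–28 can be sketched as follows; the dictionary layout, key names and helper functions are ours, and the generator coefficients are assumed to come from Basu [9] and Mannion et al. [43] (the emission coefficients of Eq. 27 are stored under separate keys to avoid clashing with the cost coefficients of Eq. 25).

import math

def local_cost(gen, P):
    # Eq. 25: fuel cost of generator `gen` at power output P, using the
    # coefficients a, b, c, d, e and the minimum output P_min.
    return (gen["a"] + gen["b"] * P + gen["c"] * P ** 2
            + abs(gen["d"] * math.sin(gen["e"] * (gen["P_min"] - P))))

def local_emissions(gen, P):
    # Eq. 27: emissions of generator `gen` at power output P; emissions of the
    # wind-powered generator are set to 0.
    if gen["renewable"]:
        return 0.0
    return gen["E"] * (gen["a_e"] + gen["b_e"] * P + gen["gamma"] * P ** 2
                       + gen["eta"] * math.exp(gen["delta"] * P))

def global_objectives(generators, outputs):
    # Eqs. 26 and 28: global cost and emissions are the sums of the local
    # values over all generators for a given hour.
    cost = sum(local_cost(g, P) for g, P in zip(generators, outputs))
    emissions = sum(local_emissions(g, P) for g, P in zip(generators, outputs))
    return cost, emissions

def esr_utility(cost, emissions, penalty):
    # Eq. 24: the utility is the product of the objective values; under ESR it
    # is applied to the returns of a full episode, with sign and scaling
    # conventions following Basu [9] and Mannion et al. [43].
    return cost * emissions * penalty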

To evaluate NLU-MCTS and DMCTS in the REDEED domain, we compare against EUPG and C51. For DMCTS we set the number of bootstrap replicates, J, for the bootstrap distribution as follows: \(J = 100\). For NLU-MCTS we set \(C = \sqrt{2}\). For C51 the learning parameters were set as follows: \(V_{min} = -8 \times 10^{22}\), \(V_{max} = 0\), \(\epsilon = 0.01\), \(\gamma = 1\) and \(\alpha = 0.0001\). For the REDEED problem the agent learns for 10,000 episodes and \(n_{exec} = 2\) for each algorithm.

Fig. 15 Results from the REDEED environment, where DMCTS outperforms EUPG, C51 and NLU-MCTS. DMCTS achieves a higher utility than the other algorithms used throughout experimentation in the REDEED domain under the ESR criterion

As seen in Fig. 15, DMCTS outperforms EUPG, NLU-MCTS and C51 in the REDEED domain. C51 struggles to learn a consistent policy, and its utility fluctuates throughout experimentation. The hyper-parameters chosen for C51 provide the best performance we could obtain, but they are difficult to tune. Although the learning speed of EUPG is slow, EUPG achieves a higher utility than C51.

Both DMCTS and NLU-MCTS learn good policies faster than EUPG. MCTS algorithms are much more sample efficient than policy gradient algorithms like EUPG. Figure 15 highlights the difference in sample efficiency between DMCTS, NLU-MCTS and EUPG, given the differences in the number of episodes each algorithm requires to compute a stable policy for the defined nonlinear utility function.

DMCTS, NLU-MCTS and EUPG all learn stable policies. However, DMCTS achieves a higher utility than NLU-MCTS and EUPG. DMCTS converges to a policy with an average utility of \(-1.54\times 10^{21}\). In comparison, NLU-MCTS converges to a policy with an average utility of \(-1.80\times 10^{21}\), while EUPG converges to a stable policy with an average utility of \(-1.75\times 10^{21}\). Given the scale of the utility computed throughout REDEED experimentation, it is difficult to see the final difference in utility in Fig. 15. Therefore, to highlight the difference in utility between DMCTS, NLU-MCTS and EUPG, we have plotted the final 4,000 episodes in Fig. 16. It is important to note that, given C51 performed poorly in the REDEED domain, we have not included it in Fig. 16. For the highlighted episodes in Fig. 16, it is clear that DMCTS achieves a higher utility than both NLU-MCTS and EUPG.

Fig. 16 Results from the final 4000 episodes of the REDEED environment to highlight how DMCTS outperforms NLU-MCTS, EUPG and C51

The REDEED environment has a large state-action space with complex returns. Although C51 has achieved state-of-the-art results in the Atari environment [10], C51 fails to learn any meaningful policy for REDEED. We hypothesise that C51's poor performance stems from its inability to learn a distribution over the full returns and from the level of discretisation of its categorical distribution, which uses 51 bins. Bellemare et al. [10] set the number of bins to 51; while this provides good performance, they highlight that increasing the number of parameters may lead to increased performance. However, we fix the number of bins to 51 to remain consistent with the literature, given the potential performance gain from increasing the number of bins has not been thoroughly explored. The results presented in this paper for C51 show this parameter setting is sub-optimal in scenarios where the returns are not simple scalars over small ranges, and Fig. 15 shows that C51 struggles to scale to large problem domains with complex returns over large ranges.
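
To illustrate the discretisation issue, the sketch below builds the kind of fixed categorical support C51 uses for its value distribution; the helper is illustrative, but the \(V_{min}\) and \(V_{max}\) values match those used in our experiments.

N_ATOMS = 51

def categorical_support(v_min, v_max, n_atoms=N_ATOMS):
    # C51 represents the return distribution as probabilities over a fixed set
    # of atoms z_i, evenly spaced between v_min and v_max.
    delta_z = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta_z for i in range(n_atoms)]

# Fishwood setting (Sect. 6.3.1): neighbouring atoms are 0.04 apart.
fishwood_atoms = categorical_support(0.0, 2.0)

# REDEED setting: neighbouring atoms are roughly 1.6e21 apart, so returns that
# differ by less than that are mapped onto the same atom.
redeed_atoms = categorical_support(-8e22, 0.0)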

6.3.3 Discussion of experimental results for multi-objective MDPs

In multi-objective settings both NLU-MCTS and DMCTS learn good policies for the specified nonlinear utility functions. Similarly to the Risk-Aware MDP, applying the utility function to the cumulative returns (rather than just the expected future return) ensures that both NLU-MCTS and DMCTS can learn good policies. It is clear from the performance of C51 that applying the utility function to the cumulative returns is crucial for good performance in multi-objective settings when the utility function is nonlinear.

As previously highlighted, the difference between NLU-MCTS and DMCTS is the method used to explore during planning. NLU-MCTS uses UCB to determine which action to take during planning; UCB selects actions deterministically based on the expected utility and an exploration bonus [6, 66]. In contrast, DMCTS selects actions using Thompson sampling, which stochastically samples from the underlying approximate posterior distribution (the BTS distribution) and selects each action proportionally to the probability of that action being optimal [15, 18]. In the bandit literature, Thompson sampling has been shown to empirically outperform UCB [15, 66, 67]. Monte Carlo tree search methods utilise independent nodes; therefore, each node can itself be considered a bandit, and we expect Thompson sampling methods to also outperform UCB in sequential settings. Our findings in risk-aware and multi-objective sequential settings are consistent with the prior bandit literature suggesting that Thompson sampling can outperform UCB [15].
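
Concretely, the standard UCB1 rule on which NLU-MCTS's action selection is based chooses, at a node visited \(N\) times,

$$\begin{aligned} a^{*} = \arg \max _{a} \left[ \bar{u}_{a} + C \sqrt{\frac{\ln N}{n_{a}}} \right] , \end{aligned}$$

where \(\bar{u}_{a}\) is the mean utility observed for action a, \(n_{a}\) is the number of times action a has been selected and \(C\) is the exploration constant (\(C = \sqrt{2}\) in our experiments); Thompson sampling replaces this deterministic bonus with a random draw from an approximate posterior.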

Additionally, UCB makes highly pessimistic assumptions about the underlying reward/return distributions in order to guarantee a bound on the regret of its action selection procedure [66]. For known parametric distributions, tighter bounds have been proven using tighter upper confidence bounds (e.g. for Gaussian reward distributions [66]). However, doing something similar in our setting is not practical: even if the return distributions are nicely parametric, the nonlinear transformation resulting from the application of the utility function no longer admits a closed-form distribution [66]. As such, we are either stuck with highly pessimistic assumptions (and therefore suboptimal performance) or we need a different method. Bootstrap distributions, and the resulting Bootstrap Thompson sampling algorithm for action selection, are able to approximate, and effectively exploit, knowledge about the utility distribution, regardless of the shape of the underlying distribution.
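
As a small illustration of this point (with a placeholder utility function), Gaussian returns pushed through a nonlinear utility no longer follow any convenient parametric family, while resampling-based approximations remain applicable:

import random
import statistics

def utility(x):
    # Placeholder nonlinear (risk-averse) utility function.
    return -(x ** 2) if x < 0 else x ** 0.5

# Returns drawn from a Gaussian distribution ...
returns = [random.gauss(1.0, 2.0) for _ in range(10000)]
# ... are pushed through the nonlinear utility, leaving a skewed distribution
# with no convenient closed form; a bootstrap over sampled utilities can still
# approximate it.
utilities = [utility(r) for r in returns]
print(statistics.mean(utilities), statistics.median(utilities))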

6.4 Nonlinear utility functions

During experimentation, DMCTS has been evaluated using previously defined utility functions for each experimental benchmark. To show that DMCTS is not restricted to these utility functions, we also evaluate DMCTS in the Fishwood problem domain using four additional nonlinear utility functions under the ESR criterion. The following nonlinear utility functions are used to evaluate DMCTS in the Fishwood domain:

$$\begin{aligned} u_{1}&= \max \left( \frac{r_{1}}{2}, \frac{r_{2}}{2} \right) , \\ u_{2}&= \frac{r_{1}}{2} + r_{2}^{4}, \\ u_{3}&= \min \left( \frac{r_{1}}{2}, \frac{r_{2}}{4} \right) , \\ u_{4}&= r_{1}^{2} + r_{2}^{2}, \end{aligned}$$

where \(r_{1}\) is the return received for the fish objective and \(r_{2}\) is the return received for the wood objective.
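
Written directly as functions of the cumulative returns, these are (a straightforward transcription, with r1 and r2 as above):

# The four nonlinear utility functions, applied to the cumulative fish (r1)
# and wood (r2) returns of an episode.
def u1(r1, r2):
    return max(r1 / 2, r2 / 2)

def u2(r1, r2):
    return r1 / 2 + r2 ** 4

def u3(r1, r2):
    return min(r1 / 2, r2 / 4)

def u4(r1, r2):
    return r1 ** 2 + r2 ** 2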

For this demonstration, we set \(n_{exec} = 2\) and each experiment lasts 10,000 episodes. For DMCTS we set the number of bootstrap replicates, J, for the bootstrap distribution as follows: \(J = 100\).

In Fig. 17, the utility for each utility function is scaled between 0 and 1, where 1 represents the maximum utility and 0 represents the minimum utility obtained by DMCTS for that function. The utility is scaled so that the performance of DMCTS for each utility function can be shown on a single plot.
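
That is, the plotted value for each utility function is min-max scaled as

$$\begin{aligned} \tilde{u} = \frac{u - u_{min}}{u_{max} - u_{min}}, \end{aligned}$$

where \(u_{min}\) and \(u_{max}\) are the minimum and maximum utilities obtained by DMCTS for that utility function.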

Fig. 17 Results from the Fishwood environment where DMCTS is evaluated against multiple nonlinear utility functions

Figure 17 outlines the performance of DMCTS when optimising for each nonlinear utility function. It is clear from Fig. 17 that DMCTS converges to a good policy for each utility function. Therefore, DMCTS can learn a good policy for each of the outlined nonlinear utility functions and is not limited to the utility functions associated with predefined benchmark problems. The ability of DMCTS to learn a good policy for a range of nonlinear utility functions showcases how DMCTS could potentially be used in real-world scenarios, where different decision makers may have very different nonlinear utility functions for the same problem.

7 Related work

Many risk-aware RL approaches seek to learn policies that maximise the expected return. Some research in this area focuses on learning policies which maximise the expected exponential utility [46]. Other approaches take a weighted sum of the return and risk into consideration when learning policies [22, 24]. Although most risk-aware RL approaches aim to maximise the expected utility, they often do not take into consideration the utility of the returns of a full episode. It is also important to note that little research exists for risk-aware RL where decisions are made based on a learned distribution over the expected returns [47, 48].

Many MCTS methods have been developed for situations involving reward uncertainty. For example, Tesauro et al. [73] take a Bayesian approach to UCT with a Gaussian approximation; their method backpropagates probability distributions over rewards, and actions are selected with UCB1 while taking the distributions into consideration. Cazenave and Saffidine [13] define an MCTS algorithm that takes into account the bounds on the possible values of a node to select nodes for exploration; they apply their algorithm to problems that have more than two outcomes and show that taking the bounds into consideration can increase performance. Kaufmann and Koolen [37] and Huang et al. [35] have also developed MCTS algorithms which can compute policies for settings with reward uncertainty.

As previously highlighted, the majority of RL research focuses on the SER criterion. Multi-objective MCTS (MOMCTS) [80] was shown to be able to learn a coverage set under SER. However, MOMCTS can only learn a coverage set in deterministic environments. Convex Hull MCTS [53] is able to learn the convex hull of the Pareto front but focuses solely on linear utility functions. A number of other multi-objective MCTS methods exist [39, 56, 57], but no method has previously been shown to learn the Pareto front for both deterministic and stochastic environments for any unknown utility function. In contrast to the SER criterion, no method exists that can learn a set of optimal policies under the ESR criterion in sequential settings. Hayes et al. [33] compute a set of ESR non-dominated return distributions, known as the ESR set, in a multi-objective multi-armed bandit setting. However, the method proposed by Hayes et al. [33] cannot be applied to sequential decision making problems. An interesting opportunity for future work is the possibility of building on the methods of Wang and Sebag [80] and Painter et al. [53] to extend DMCTS to learn the optimal coverage set under both SER and ESR for any unknown utility function.

Zhang et al. [87] compute a multi-variate distribution over the returns for RL settings. While this work considers reward vectors, we believe this algorithm will suffer from similar limitations to traditional RL algorithms when applied to nonlinear utility functions. The method proposed by Zhang et al. [87] does not take the accrued returns into consideration. Therefore we believe that such an approach would fail to achieve a high utility [61]. However, this method could be an interesting starting point for developing model-free multi-objective distributional RL algorithms.

A key argument in this paper is that the expected utility of the future returns under ESR must be replaced with a posterior distribution over the expected utility of the returns. Bai et al. [8] extend MCTS to maintain a distribution at each node, using Thompson sampling as an exploration strategy. However, the work presented in this paper differs significantly: Bai et al. [8] do not learn a posterior distribution over the expected utility of the returns, do not apply their method to multi-objective settings, and do not incorporate the accrued returns into their algorithm. It is also important to note that the C51 algorithm proposed by Bellemare et al. [10] achieves state-of-the-art performance in single-objective settings and learns a distribution over the future returns. Abdolmaleki et al. [1] learn a distribution over actions based on constraints set per objective. This approach ignores the utility-based approach [62] and uses constraints set by the user to learn a coverage set of policies, where the value of the constraints is dependent on the scale of the objectives. Abdolmaleki et al. [1] claim that setting these constraints is more intuitive than setting weights for a linear utility function. We theorise that if the user's utility function is nonlinear, this approach would fail to learn a coverage set.

8 Conclusion and future work

In this paper we propose a novel Monte Carlo tree search algorithm (NLU-MCTS) that can compute good policies for nonlinear utility functions. We then extend NLU-MCTS to define a new distributional Monte Carlo tree search (DMCTS) algorithm. Both NLU-MCTS and DMCTS are able to learn good policies in MORL settings, under the ESR criterion, for nonlinear utility functions in problem domains with stochastic rewards. DMCTS replaces the expected utility of the future returns with a bootstrap distribution over the utility of the returns, and achieves state-of-the-art performance in MORL domains under the ESR criterion. We achieve this by using a bootstrap distribution as an approximate posterior over the expected utility of the returns of the episode. It is our hope that this paper will inspire further work on algorithms that replace the expected returns with a distribution over the expected utility of the returns for risk-aware and ESR settings.

Although DMCTS achieves state-of-the-art performance under the ESR criterion, as the size of the problem domain increases we expect the bootstrap distribution to encounter limitations. In order to apply DMCTS to real-world problem domains like [3], distributions such as the Dirichlet [50] may be better suited to high-dimensional state spaces, especially in multi-variate scenarios when learning policies for the SER criterion.

We also aim to extend both NLU-MCTS and DMCTS to learn policies in multi-objective environments with continuous state spaces. For example, Abels et al. [2] define a multi-objective benchmark problem known as Minecart that has a continuous state space. In the future we also plan to extend DMCTS to learn policies in problem domains with continuous action spaces [85].

DMCTS initialises the \(\alpha\) and \(\beta\) values for each bootstrap replicate to 1. However, not much is known about the impact of this initialisation, and it is possible that varying these values could increase performance. Therefore, we plan to investigate the potential performance gains from altering this parameter on initialisation.

In the work of Martin et al. [45], stochastic dominance [25, 40, 84] is used to determine optimal actions by comparing learned categorical distributions over the returns, and good performance is demonstrated in risk-based scenarios. Hayes et al. [32, 33] present ESR dominance as a dominance criterion for distributional multi-objective decision making under the ESR criterion. In future work we aim to utilise stochastic dominance with DMCTS for single-objective risk-based problems, and to extend DMCTS to use ESR dominance as a dominance criterion for multi-objective decision making problems under the ESR criterion.

In this paper, the utility function is known a priori. In other MORL scenarios, the utility function can be unknown at the time of learning or planning [31, 59, 62]. In these scenarios, an algorithm must recover a coverage set of optimal policies. Under the SER criterion, many methods have been developed that compute sets of optimal policies [12, 55, 81, 82, 86]; for example, multi-objective MCTS [80] can learn a coverage set for deterministic environments under SER. Currently, no method exists that can compute a set of optimal policies for the ESR criterion by interacting with the environment in a sequential setting. In future work, we aim to extend our DMCTS algorithm to learn coverage sets for unknown utility functions under both the ESR criterion and the SER criterion for stochastic environments.