
Bayesian Optimisation vs. Input Uncertainty Reduction

Published: 25 July 2022


Abstract

Simulators often require calibration inputs estimated from real-world data, and the estimate can significantly affect simulation output. In particular, when performing simulation optimisation to find an optimal solution, the uncertainty in the inputs significantly affects the quality of the found solution. One remedy is to search for the solution that has the best performance on average over the uncertain range of inputs, yielding an optimal compromise solution. We consider the more general setting where a user may choose between either running simulations or querying an external data source, improving the input estimate and enabling the search for a more targeted, less compromised solution. We explicitly examine the trade-off between simulation and real data collection to find the optimal solution of the simulator with the true inputs. Using a value of information procedure, we propose a novel unified simulation optimisation procedure called Bayesian Information Collection and Optimisation that, in each iteration, automatically determines which of the two actions (running simulations or collecting data) is more beneficial. We theoretically prove convergence in the infinite budget limit and perform numerical experiments demonstrating that the proposed algorithm is able to automatically determine an appropriate balance between optimisation and data collection.


1 INTRODUCTION

Simulation optimisation is the problem of identifying the best solution, when solution qualities can only be estimated via sampling—that is, running a computationally expensive simulation and obtaining a stochastic output value. In many cases, the simulation model has additional parameters that need to be set, such as the mean arrival rate of customers or the mean and variance of the demand distribution.

In reality, such input parameters are either chosen by expert opinion or set to values estimated from historical data. If the chosen values for the input parameters differ significantly from the true parameters, the solution found by optimising the simulation model may be far from optimal in the real world. This problem is generally known as simulation optimisation with input uncertainty and has received much attention in recent years [Lam et al. 2016]. Much work has focused on explicitly modeling the uncertainty of the input parameters and seeking a robust solution that performs well on average (or worst case) over this distribution.

In this article, we extend our previous work on Bayesian optimisation (BO) aiming to identify the solution with the best expected performance given the input uncertainty [Pearce and Branke 2017]. In particular, we assume that the user has access to real-world data that can help to inform the parameters required by the simulator. Given finite resources to spend on simulation and/or data collection, an algorithm must carefully determine which of the two possible actions to perform.

Devoting too much effort to data collection may not leave sufficient resources for optimisation and an algorithm would return a sub-optimal solution to an accurate simulator. However, devoting too little effort to data collection may lead to learning a good compromise solution that performs well on average across a variety of possible input parameters but may be sub-optimal under the true input parameters. In this work, we propose a BO algorithm that can intelligently trade off simulation and data collection.

This applies to simulation optimisation problems where additional external input data can be collected incrementally, albeit at a cost. For example, sales demand may be estimated from physical sales records that need to be manually sorted and entered into a database to reduce uncertainty about the true demand, or external data such as traffic flow or user choices may require time-consuming physical measurements by observers.

This work makes the following novel contributions.

  • We consider a new variant of input uncertainty problems where external data can be sampled to reduce input uncertainty and it is possible to decide how much effort to put into data collection vs. simulation. This differs from previous work where new data acquisition is not an option or data comes from a streaming process and automatically arrives over time without being requested.

  • We propose the Bayesian Information Collection and Optimisation (BICO) algorithm, which effectively balances the trade-off between simulation and real data collection when the aim is to find the best solution under the true inputs. The algorithm is an extension of the Value of Information (VoI) with input uncertainty idea proposed by Pearce and Branke [2017].

  • We prove consistency of BICO and show that BICO automatically detects whether an external data source is relevant for optimisation.

  • We demonstrate the effectiveness of BICO on several artificial and practical problems.

We start with an overview of related work in Section 2, followed by a formal definition of the problem in Section 3. Section 4 explains the statistical models, derives the suggested sampling procedures, and discusses their theoretical properties and practical computation. We perform numerical experiments in Section 5. Finally, the article concludes with a summary and some suggestions for future work in Section 6.


2 LITERATURE REVIEW

Running a simulation model may be computationally expensive. Response surfaces, or metamodels, can be built to predict the output of the computationally expensive simulation much more efficiently. Various metamodeling techniques have been proposed in the literature [Barton and Meckesheimer 2006]. Most relevant for this work are Gaussian processes, which belong to the category of Gaussian random fields (GRF) [Staum 2009; Vanmarcke 2010]. Gaussian processes are non-parametric regression models that provide not only a predicted value for each point in the domain but also a confidence interval.

This is exploited by BO, a global optimisation method. BO builds a Gaussian process, or Kriging, surrogate model of the simulator response surface based on a few initial samples, then uses an acquisition function, or infill criterion, to sequentially decide where to sample next to improve the model and find better solutions. For a brief introduction, refer to the work of Shahriari et al. [2016].

Several BO algorithms have been proposed in the literature. The most popular is the Efficient Global Optimisation (EGO) algorithm of Jones et al. [1998] that combines a Gaussian process with an expected improvement criterion for deciding where to sample next. The Knowledge Gradient (KG) policy for continuous parameters [Scott et al. 2011] is another myopic acquisition function that aims to maximise the new predicted optimal performance after one new sample. Unlike EGO, KG accounts for covariance when judging the value of a sample and can be applied directly to noisy functions.

Conventional simulation optimisation approaches, including BO, assume that the auxiliary input parameters are known, when often this is not the case. Therefore, investigating the effect of input uncertainty has recently gained significant interest in the simulation community (for a general introduction, see, e.g., the work of Lam et al. [2016] and Corlu et al. [2020]). Currently, there are several proposed methods to assess the input uncertainty and its impact on the mean value of the simulation output. Barton and Schruben [2001] build an empirical distribution given historical data and sample from it using direct and bootstrap techniques to assess the impact of input uncertainty. Yi and Xie [2017] further improve on this by proposing a budget allocation approach that can efficiently employ the simulation resource. Chick [2001] uses a Bayesian posterior distribution to estimate the input distributions for the same purpose. Cheng and Holland [1997] estimate the simulation output variability by decomposing it into random variations within the simulation model (simulation uncertainty) and input parameter uncertainty. Barton et al. [2014] replace the expensive simulation by metamodel-assisted bootstrapping using a stochastic Kriging response surface to estimate the impact of input uncertainty on the simulation output. Similarly, Yuan and Ng [2020] propose a Gaussian process based Bayesian approach to simultaneously calibrate the simulation model and select the most influential set of uncertain parameters.

The aforementioned methods assume the data used to estimate input uncertainty is given. For the case when additional input data can be collected, Freimer and Schruben [2002] examine the question of how much data to collect, and for which parameters. They suggest running an initial experimental design with the endpoints of the confidence interval of the input uncertainty and then using ANOVA to test whether the parameter effects are significant. If they are, then more information should be collected to reduce the uncertainty of the parameter. For a simplified setting considering only main effects, Song and Nelson [2015] propose a more efficient method that approximates the impact of input uncertainty on the overall variance in the simulation output with the help of a mean-variance metamodel depending on the means and variances of the input distributions. Similar work has also considered the trade-off between running more simulation evaluations and collecting more field data to reduce overall output performance uncertainty. Ng and Chick [2006] propose a method based on asymptotic approximations to quantify the impact of data collection. A budget may then be allocated in a way that minimises the overall performance uncertainty given sampling allocation constraints. Yuan and Ng [2013], similar to Yuan and Ng [2020], propose a Gaussian process based Bayesian approach to study how to allocate sampling resources by directly comparing the impact of each sampling decision using the integrated mean squared prediction error. Liu and Zhou [2020] suggest a method for quantifying the input uncertainty from streaming data that arrives sequentially over time.

When input uncertainty estimation is considered in the optimisation process, Song et al. [2015] explore the impact of model risk due to input uncertainty on indifference zone (IZ) ranking and selection. Wu and Zhou [2017] use ranking and selection in a two-stage allocation of a finite budget, where the first stage consists of estimating the input parameters, followed by a budget allocation scheme to perform simulation runs in the second stage. Xiao and Gao [2018] consider taking the input uncertainty into account, but the optimisation is focused on the worst-case performance given a fixed finite number of input models. Zhou and Xie [2015] propose a formulation that allows adapting the optimisation to one’s risk preference.

Pearce and Branke [2017] proposed extensions to EGO and KG with continuous parameters to account for input uncertainty, essentially treating optimisation with input uncertainty as optimising an integrated expensive-to-evaluate function. Similar extensions to KG have been proposed by Toscano-Palmerin and Frazier [2018], and to the Informational Approach to Global Optimization (IAGO) algorithm by Wang et al. [2018].

Only very few papers consider the case where additional information can be gathered during the optimisation process. Song and Shanbhag [2019] consider the case of optimisation under input uncertainty when additional data is received from an uncontrolled streaming data process during optimisation. They propose a stochastic approximation framework that prescribes the number of gradient descent steps to be conducted in every timestep. For the discrete ranking and selection problem, Wu and Zhou [2019] study the impact of input uncertainty assuming new data becomes available in each iteration. They propose a technique that discards the oldest simulation outputs in the estimation of the means, an elimination of designs according to their confidence bounds, and a stopping criterion that has a guaranteed probability of correct selection.

In this work, we explicitly look at the trade-off between either running more simulations or instead collecting more input parameter data, all with the aim of finding the optimal solution to a simulator with accurate input parameters. Our methodology mainly builds on preliminary work by Pearce and Branke [2017]. We consider the more general case where the algorithm is allowed to choose between external data and simulation data, and provide a theoretical analysis as well as an empirical evaluation on multiple test problems.


3 PROBLEM FORMULATION

For simulation data, we assume solutions are given by vectors in a solution space, \( {x}\in X \subset \mathbb {R}^D \). The simulator may have multiple inputs for different purposes, and we refer to the concatenated vector as parameters in parameter space, \( {a}\in A \subset \mathbb {R}^J \). The simulator is an arbitrary stochastic black box function we denote as \( \begin{equation*} f: X \times A \rightarrow \mathbb {R}, \end{equation*} \) which takes as arguments a solution and parameters and returns a noisy scalar valued performance \( y = \theta (x,a) + \epsilon \), where \( \epsilon \) is independent and identically distributed as \( {\it \text{N}}(0, \sigma _{\epsilon }^{2}) \). Finally, the expectation of noisy performance is referred to as the expected output denoted \( \theta ({x},{a}) = \mathbb {E}[f({x},{a})] \).

For parameter data collection, we let N be the number of parameter data sources indexed by \( s\in S = \lbrace 1, \ldots , N\rbrace \). Querying a data source s returns a parameter data point \( r\sim \mathbb {P}[r|a^*,s], \) where \( {a}^* \) is the true parameter vector. Note that N may or may not equal parameter dimension J, as one data source may inform several input parameters (e.g., mean and variance in the newsvendor example in the following). If we denote m as the number of data samples collected so far, and \( r^i \) the i-th value observed from the parameter data source \( s^{i} \in S \), then \( \mathscr{R}^{m}= \lbrace (s, r)^1, \ldots , (s, r)^m\rbrace \) is the set of m queried data pairs. Therefore, \( {a}^* \) may be inferred using the likelihood of the data (1) \( \begin{equation} \prod _{i=1}^m\mathbb {P}[r^i|a,s^{i}]. \end{equation} \) The likelihood is defined by the application at hand, and therefore we assume it is given and may be used by any algorithm. For the goal of optimisation, both simulation triplets \( ({x},{a},y) \) and parameter data pairs \( (s, r) \) must be collected to infer both \( \theta (x,a) \) and a*, respectively. The objective function we want to optimise is the expected output given the true input parameter, \( \theta (x,a^*) \), and the aim is thus to identify \( \begin{equation*} {x}^* = \text{arg max}_{{x}}\theta ({x}, {a}^*). \end{equation*} \) Figure 1 illustrates an example.

Fig. 1. The expected output \( \theta ({x},{a}) \) with true parameter a* and optimal solution \( {x}^{*} \). The goal is to learn \( x^* = \text{arg max}\theta (x, a^*), \) which requires learning both the true expected output \( \theta (x,a) \) and the true parameter \( a^* \) to be able to optimise the objective \( \max _{x} \theta (x,a^*) \).

There is a budget of \( {B } \) units that can be spent either by choosing \( ({x}, {a}) \) and calling \( f({x},{a}) \) costing \( c_f \), or by choosing \( s\in \lbrace 1, \ldots , N\rbrace \) and querying \( \mathbb {P}[r| a^*,s] \) costing \( c_{s}\in \lbrace c_1, \ldots , c_{N}\rbrace \). After consuming the budget, a solution \( {x}_{r} \) is returned to the user and its quality is determined by the difference in true performance between \( {x}_{r} \) and the best solution \( {x}^{*} \), or opportunity cost (OC), (2) \( \begin{align} OC({x}_{r}) = \theta ({x}^{*}, a^{*})-\theta ({x}_{r},a^{*}). \end{align} \) For example, in Section 5 we consider the newsvendor problem. A newsvendor aims to maximise profit by choosing the optimal number of newspapers to stock, \( x \in \lbrace 0, 1, 2, \dots \!\rbrace \), under uncertain demand \( r \sim \text{Normal}(a^{*}) \), where the true parameter, a*, is composed of the mean demand for newspapers, \( \Gamma \in \mathbb {R}^+ \), and the demand variance, \( \sigma ^{2}_{r} \). Both parameters are unknown and significantly affect the optimal number of newspapers to stock. We have a stochastic simulator to evaluate any chosen stock level with any set demand, \( f(x, r) \), which costs \( c_f=1 \) to run. We also have access to \( N=1 \) data source, which can be queried for \( c_s=1 \) to obtain demand sales data for a past day, to infer the true parameters \( a^{*}=[\Gamma , \sigma ^{2}_{r}] \). During optimisation, we have a budget of \( B=100 \), and we can collect either more simulator data, \( y=f(x, r) \), or more past sales data, r, to find the true optimal stock level, \( x^* \), that maximises expected profit for the true demand level, \( \max _x \mathbb {E}[f(x, a^*)] = \max _x \theta (x, a^*) \).
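This newsvendor setup can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the prices, and the true parameter values are our own illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters a* = [mean demand, demand variance],
# unknown to the optimisation algorithm.
A_STAR = (40.0, 25.0)

PRICE, COST = 2.0, 1.0  # illustrative sell price and purchase cost per paper

def simulate_profit(x, a, rng=rng):
    """Noisy simulator f(x, a): one day's profit from stocking x papers,
    with demand drawn from N(mean, var) given by parameters a."""
    mean, var = a
    demand = max(0.0, rng.normal(mean, np.sqrt(var)))
    sold = min(x, demand)
    return PRICE * sold - COST * x

def query_data_source(rng=rng):
    """Query the single (N = 1) data source: one past day's demand
    r ~ P[r | a*, s], at cost c_s."""
    mean, var = A_STAR
    return rng.normal(mean, np.sqrt(var))
```

At each step an algorithm would spend one budget unit on either `simulate_profit` (a simulation triplet) or `query_data_source` (a parameter data pair).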
Note that the costs \( c_s \) and \( c_f \) describe the relative cost incurred by running additional simulations or collecting additional input data, respectively, and they may correspond to time, money, computational power or other forms of resources required. In some cases, both might be monetary and measured in the same units—for example, if simulations are run on the cloud and input data is collected via a crowdsourcing platform, each would have a known monetary cost. In many other cases, the relative cost has to be set by a domain expert.


4 THE BICO ALGORITHM

We propose BICO, which automatically decides whether to conduct additional simulation experiments to find better solutions or to collect additional parameter data to reduce parameter uncertainty. In Sections 4.1 and 4.2, we describe the statistical models for inferring the expected output \( \theta (x,a) \) and the true parameters \( a^* \), respectively. Section 4.3 derives the general VoI procedure, and Sections 4.5 and 4.6 apply it to value the collection of simulation data and of parameter data, respectively. At each iteration, the action is simply determined by whichever has the highest value. Together, the modelling and automated value-based data collection form the BICO algorithm, summarised as Algorithm 1 in Section 4.7. We then prove properties of BICO's behaviour in Section 4.8.

4.1 Statistical Model for the Expected Simulator Output

Let us denote the n-th simulation point by \( ({x},{a})^{n} \), performance by \( y^{n} = f({x}^n, {a}^n) \) and the set of points up to n as \( \mathscr{F}^n= \lbrace (x,a,y)^1,\dots ,(x,a,y)^n \rbrace \). For convenience, we define the concatenated simulator arguments \( \tilde{X}^n = \lbrace (x,a)^{1},\dots ,(x,a)^{n}\rbrace \) with \( \tilde{{x}}= (x,a) \) and vector of outputs \( Y^{n} = (y^1,\dots , y^n) \). Then, we propose to use a Gaussian process to model \( \theta ({x},{a}) \).

A Gaussian process is defined by a mean function \( \mu ^0(\tilde{{x}}):X\times A \rightarrow \mathbb {R} \) and a covariance function \( k^0(\tilde{{x}}, \tilde{{x}}^{\prime }):(X\times A) \times (X\times A) \rightarrow \mathbb {R} \). Given the simulator dataset \( \mathscr{F}^n \), predictions of the expected output \( \theta (x, a) \) at new locations \( ({x}, {a}) \) are given by (3) \( \begin{align} \begin{split} \mathbb {E}[\theta ({x},{a})|\mathscr{F}^n] &= \mu ^n({x},{a}) \\ &=\mu ^0({x},{a})+k^0(({x},{a}),\tilde{X}^n)(k^0(\tilde{X}^n,\tilde{X}^n)+I\sigma _{\epsilon }^{2})^{-1}(Y^n-\mu ^0(\tilde{X}^n)), \end{split} \end{align} \) (4) \( \begin{align} \begin{split}\text{Cov}\,[\theta ({x},{a}),\theta ({x}^{\prime },{a}^{\prime })|\mathscr{F}^n] &= k^n(({x},{a}),({x}^{\prime },{a}^{\prime }))\\ &= k^0(({x},{a}),({x}^{\prime },{a}^{\prime }))-k^0(({x},{a}),\tilde{X}^n)(k^0(\tilde{X}^n,\tilde{X}^n)+I\sigma _{\epsilon }^{2})^{-1}k^0(\tilde{X}^n,({x}^{\prime },{a}^{\prime })). \end{split} \end{align} \) This model is also referred to as Kriging with a nugget effect [Yin et al. 2011] or stochastic Kriging with (constant, diagonal) intrinsic uncertainty. Although it may be used for the general heteroscedastic noise case (e.g., [Ankenman et al. 2008]), we only consider homogeneous noise in this article. The prior mean \( \mu ^0({x},{a}) \) is typically set to \( \mu ^0(x,a)=0 \), and the covariance function \( k^0(\tilde{{x}}, \tilde{{x}}^{\prime }) \) allows the user to encode known properties of the expected output \( \theta ({x},{a}) \) such as smoothness and periodicity.
In Section 5, we use the popular squared exponential kernel that assumes \( \theta (x,a) \) is a smooth function such that nearby \( (x,a) \) have similar outputs while widely separated points have unrelated outputs, (5) \( \begin{align} k^{0}(({x},{a}),({x}^{\prime },{a}^{\prime })) = \sigma _{0}^{2}\exp \left(-\frac{||({x},{a})-({x}^{\prime },{a}^{\prime })||^{2}}{2l_{XA}^{2}}\right), \end{align} \) where \( \sigma _{0} \ge 0 \) and \( l_{XA} \gt 0 \) are hyper-parameters estimated from the data \( \mathscr{F}^n \) by maximum marginal likelihood described in the appendix. Further details can be found in the work of Rasmussen and Williams [2006].
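Equations (3) through (5) can be sketched directly with NumPy. This is a minimal zero-prior-mean sketch with fixed hyper-parameters; in the article the hyper-parameters are estimated by maximum marginal likelihood.

```python
import numpy as np

def sq_exp_kernel(XA1, XA2, sigma0=1.0, l=1.0):
    """Squared exponential kernel k^0 (Eq. (5)) on rows of concatenated (x, a)."""
    d2 = ((XA1[:, None, :] - XA2[None, :, :]) ** 2).sum(-1)
    return sigma0**2 * np.exp(-d2 / (2 * l**2))

def gp_posterior(XA_new, XA_n, Y_n, noise_var=0.1, **kw):
    """Posterior mean mu^n (Eq. (3)) and covariance k^n (Eq. (4)),
    with zero prior mean and a constant nugget noise_var = sigma_eps^2."""
    K = sq_exp_kernel(XA_n, XA_n, **kw) + noise_var * np.eye(len(XA_n))
    K_star = sq_exp_kernel(XA_new, XA_n, **kw)
    mu = K_star @ np.linalg.solve(K, Y_n)
    cov = sq_exp_kernel(XA_new, XA_new, **kw) - K_star @ np.linalg.solve(K, K_star.T)
    return mu, cov
```

With a single observation y = 1 at the origin, the posterior mean there shrinks to 1/(1 + noise_var), reflecting the nugget effect.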

Note that we use a Gaussian process to model the output over the space \( X\times A \). Durrande et al. [2012] show that the number of data points required by a Gaussian process increases exponentially with the dimension of the input space. Therefore, this approach may not work for a high number of uncertain parameters or dimensions of the solution space.

4.2 Statistical Model for the True Parameters

We use a Bayesian approach to estimate the true parameters a*. The sources \( s^1, \ldots , s^m \in S \) are deterministically chosen by the algorithm, and the observed \( r^1, \ldots , r^m \) are each independently generated from the corresponding source and have a likelihood given by Equation (1). To supplement data with expert knowledge, we can combine this with a prior distribution \( \mathbb {P}[a] \) resulting in a posterior distribution \( \begin{equation*} \mathbb {P}[a| \mathscr{R}^{m}] \propto \mathbb {P}[a]\prod _{i=1}^{m}\mathbb {P}[r^i| a, s^i]. \end{equation*} \) By assuming a convenient and intuitive prior distribution, the posterior distribution \( \mathbb {P}[{a}|\mathscr{R}^m] \) can be computed analytically and updated as new sources are queried. In practice, prior information may be modelled based on historical data on the distribution of parameter values, or data from experiments done prior to the one being undertaken; in the case of no prior information, a non-informative prior such as the Jeffreys prior can be used. Using conjugate priors facilitates closed-form posterior distributions, which increases sampling efficiency.

In this work, we assume a uniform prior \( \mathbb {P}[a] \) over the box-constrained space A, thereby restricting \( a^* \) to realistic values. In our experiments in Section 5, we work with Gaussian distributed data \( \mathbb {P}[r|a^*, s] \); therefore, the posterior \( \mathbb {P}[a| \mathscr{R}^{m}] \) is a truncated Gaussian, which is analytically tractable. Figure 2 shows that the true expected output can be evaluated by taking a slice through the surface, \( \theta ({x},{a}={a}^{*}) \). However, we can only estimate a distribution \( \mathbb {P}[{a}|\mathscr{R}^m] \) from the collected data.
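A box-truncated Gaussian posterior of this kind can be sketched with plain NumPy. This is a simplified illustration for a single mean parameter with known data variance (the article also infers the variance); the function name and box bounds are our own assumptions.

```python
import numpy as np

def posterior_over_mean(r, sigma_r2, box=(0.0, 100.0), n_samples=1000, rng=None):
    """Sample from P[a | R^m] for a scalar mean parameter a, assuming Gaussian
    data r_i ~ N(a, sigma_r2) with KNOWN variance and a uniform prior over `box`.
    The posterior is then a Gaussian truncated to `box`; we draw from it by
    rejection sampling against the box."""
    rng = rng or np.random.default_rng(0)
    r = np.asarray(r, dtype=float)
    loc = r.mean()                      # centre of the Gaussian likelihood
    scale = np.sqrt(sigma_r2 / len(r))  # standard error of the mean
    lo, hi = box
    samples = []
    while len(samples) < n_samples:     # reject draws falling outside the prior box
        z = rng.normal(loc, scale, size=n_samples)
        samples.extend(z[(z >= lo) & (z <= hi)])
    return np.array(samples[:n_samples]), loc, scale
```

Each new queried data point r simply extends `r`, tightening the posterior, which is how the model is updated as sources are queried.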

Fig. 2. Surface \( \theta ({x},{a}) \) sliced by the unknown true parameter \( {a}^{*} \) (red) and uncertainty distribution \( \mathbb {P}[{a}|\mathscr{R}^m] \) (blue) over the possible input values.

4.3 Action Space

At any iteration \( t=m+n \), the algorithm can choose a simulation point \( ({x},{a})\in X\times A \) and observe \( y=f({x},{a}) \), or it may choose a parameter data source \( s \in S = \lbrace 1, \ldots , N\rbrace \) and observe \( r \sim \mathbb {P}[r|a^*, s] \). Therefore, the set of actions available to the algorithm is \( \lbrace X\times A, S\rbrace \). In the following, we follow the VoI procedure to derive the expected realised benefit of performing a given action (i.e., an acquisition function over the action set). The algorithm, in each iteration, then selects the action with the largest value.

4.4 Predicted Performance

We first consider the output at the end of executing the algorithm. After exhausting the budget \( {B } \), the algorithm must return a recommended solution \( {x}_r \) to the user. The true value of any given solution \( {x} \) is the expected output of the perfect simulator, the objective, \( \theta ({x}, {a}^*) \). However, both \( \theta ({x},{a}) \) and \( {a}^* \) are unknown, hence we need to make two approximations. First, we approximate \( \theta (x,a) \) with the Gaussian process prediction \( \mu ^n(x,a) \). Second, we replace the fixed point \( a^* \) with the posterior \( \mathbb {P}[a|\mathscr{R}^{m}] \). Thus, the best estimate of the true quality of solution x, \( \theta (x, a^*) \), given the data so far \( \mathscr{F}^n, \mathscr{R}^{m} \) is denoted as \( G({x}; \mathscr{R}^m, \mathscr{F}^n) \) and given by (6) \( \begin{equation} G({x}; \mathscr{R}^m, \mathscr{F}^n)= \mathbb {E}_{a}\left[\mathbb {E}[\theta ({x},{a})|\mathscr{F}^n]\big |\mathscr{R}^{m}\right] =\int _{A}\mu ^n({x},{a})\mathbb {P}[{a}|\mathscr{R}^{m}]d{a}. \end{equation} \) Then, the best solution to recommend, \( {x}_{r} \), is the solution that maximises the model’s current prediction of true output (7) \( \begin{align} \begin{split} x_r(\mathscr{R}^m,\mathscr{F}^n) &= \text{arg max}_{x} G({x}; \mathscr{R}^m, \mathscr{F}^n). \end{split} \end{align} \) By using the preceding \( {x}_{r} \), the corresponding predicted true output is the maximum of \( G(\cdot) \) which we denote as (8) \( \begin{equation} G^*(\mathscr{R}^m, \mathscr{F}^n) = \max _{x}G({x}; \mathscr{R}^m, \mathscr{F}^n). \end{equation} \) We use \( G^*(\mathscr{R}^m, \mathscr{F}^n) \) as the measure of value or quality of the data we currently have. A VoI procedure quantifies the value of an action by computing the one-step look-ahead future expectation of this value and performing the action with maximum value.
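Equations (6) through (8) can be sketched as a small Monte Carlo routine: average the posterior mean over parameter samples and recommend the maximiser over a discretised grid. The grid and the callable `mu_n` are illustrative stand-ins for the fitted Gaussian process, not the article's implementation.

```python
import numpy as np

def recommend(mu_n, X_grid, a_samples):
    """Monte Carlo sketch of Eqs. (6)-(8): estimate G(x) by averaging the
    posterior mean mu^n(x, a) over parameter samples a_k ~ P[a | R^m],
    then recommend x_r = argmax_x G(x) over the discretised grid X_grid.
    `mu_n` is any callable (x, a) -> predicted output."""
    G = np.array([np.mean([mu_n(x, a) for a in a_samples]) for x in X_grid])
    i = int(np.argmax(G))
    return X_grid[i], G[i]  # (x_r, G*)
```

For instance, with a toy posterior mean peaked at each sampled parameter value, the recommendation lands between the peaks, illustrating the compromise character of x_r.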

The difference between using the true parameter \( {a}^{*} \) and the parameter distribution can be seen in Figure 3. The predicted solution quality \( G(x) \) with the recommended solution \( {x}_{r} \) and the true quality \( \theta (x,a^*) \) with true best solution \( {x}^{*} \) may differ substantially. Simulation data helps \( \mu ^n(x,a) \) converge towards \( \theta (x,a) \). However, even with full simulator information, \( \mu ^n(x,a)= \theta (x,a) \), the predicted output \( G(x) \) must marginalise over a by Equation (6), which is still imperfect, so in general \( x_r\ne x^* \).

Fig. 3. True expected output defined using true parameter \( {a}^{*} \) (red) and estimated performance using the parameter distribution \( \mathbb {P}[{a}|\mathscr{R}^m] \).

We next derive the VoI of performing any action. This is computed by assuming an action is taken and considering the hypothetical predicted performance at the next timestep, either \( G^*(\mathscr{R}^{m+1}, \mathscr{F}^n) \) or \( G^*(\mathscr{R}^m, \mathscr{F}^{n+1}) \).

4.5 VoI for Simulation Data

If a simulation point \( ({x},{a},y)^{n+1} \) were to be collected thereby augmenting \( \mathscr{F}^{n+1} = \mathscr{F}^n\cup \lbrace ({x},{a},y)^{n+1}\rbrace \), then the updated predicted performance would be \( G(\mathscr{R}^m, \mathscr{F}^{n+1}) \). At time \( t=m+n \), given the next simulation point \( ({x},{a})^{n+1} \) and before collecting the new performance \( y^{n+1} \), we may compute the one-step look-ahead incremental increase in predicted performance which is the VoI of taking the action \( ({x},{a})^{n+1} \), (9) \( \begin{equation} \text{VoI}(({x},{a}); \mathscr{R}^m, \mathscr{F}^n) = \mathbb {E}_{y^{n+1}}\left[\frac{ G^*(\mathscr{R}^m, \mathscr{F}^{n+1}) - G^*(\mathscr{R}^m, \mathscr{F}^n)}{c_{f}}\Big | ({x},{a})^{n+1}=(x,a), \mathscr{F}^n\right] , \end{equation} \) where \( c_f \) is the cost of running a simulation. Assuming the datasets are given, \( \text{VoI}(({x},{a})^{n+1}; \cdot):X\times A\rightarrow \mathbb {R} \) is a scalar valued function over the domain of the simulator. It returns the expected increase in the peak of the predicted simulator output, \( G({x}; \mathscr{R}^m, \mathscr{F}^n) \), per unit cost of running the simulator.

To evaluate \( \text{VoI}(({x},{a})^{n+1}; \cdot) \), we next derive the predictive distribution of \( G({x};\mathscr{R}^m, \mathscr{F}^{n+1}) \) given data at time \( t=n+m \). This requires an updating formula for the posterior mean \( \mu ^{n+1}(x,a) \). By setting the posterior mean and covariance after n samples, \( \mu ^{n}({x},{a}) \), \( k^{n}(({x},{a}),({x}^{\prime },{a}^{\prime })) \), as the prior mean and covariance in Equation (3), we can write the formula for the mean after the \( (n+1) \)-th sample as (10) \( \begin{align} \begin{split}\mu ^{n+1}({x},{a}) = \mu ^{n}({x},{a}) + \frac{k^{n}(({x},{a}),({x},{a})^{n+1})}{k^{n}(({x},{a})^{n+1},({x},{a})^{n+1}) + \sigma ^{2}_{\epsilon }}(y^{n+1} - \mu ^{n}({x},{a})), \end{split} \end{align} \) where \( ({x},{a})^{n+1} \) is a given argument to \( \text{VoI}(\cdot) \) and \( y^{n+1} \) is unknown. The Gaussian process model provides a predictive distribution for the new function value (11) \( \begin{align} \begin{split}y^{n+1} \sim N(\mu ^{n}({x},{a})^{n+1},k^{n}(({x},{a})^{n+1},({x},{a})^{n+1}) + \sigma ^{2}_{\epsilon }). \end{split} \end{align} \) By writing \( y^{n+1} = \mu ^n(x,a) + \sqrt {k^{n}(({x},{a})^{n+1},({x},{a})^{n+1}) + \sigma ^{2}_{\epsilon }}Z \) with \( Z\sim N(0,1) \), substituting into Equation (10) and simplifying leads to the following parametrisation of \( \mu ^{n+1}({x},{a}) \), (12) \( \begin{align} \begin{split}\mu ^{n+1}({x},{a}) = \mu ^{n}({x},{a}) + \tilde{\sigma }^{n}(({x},{a}),({x},{a})^{n+1})Z, \end{split} \end{align} \) where \( \tilde{\sigma }^{n}((x,a),(x,a)^{n+1}) \) is a deterministic function parametrised by \( (x,a)^{n+1} \) that is the additive update to the posterior mean scaled by Z (13) \( \begin{align} \begin{split}\tilde{\sigma }^{n}((x,a),(x,a)^{n+1}) = \frac{k^{n}(({x},{a}),({x},{a})^{n+1})}{\sqrt {k^{n}(({x},{a})^{n+1},({x},{a})^{n+1}) + \sigma ^{2}_{\epsilon }}}. 
\end{split} \end{align} \) Therefore, the predictive distribution of the new posterior mean is given by (14) \( \begin{align} \begin{split}\mu ^{n+1}({x},{a}) \sim N(\mu ^{n}({x},{a}),\tilde{\sigma }^{n}((x,a),(x,a)^{n+1})^{2}), \end{split} \end{align} \) and the predicted performance after a new sample \( ({x},{a})^{n+1} \) can then be written as (15) \( \begin{align} \begin{split}G({x}; \mathscr{R}^m, \mathscr{F}^{n+1}) &= \int _{A}\mu ^{n+1}({x},{a})\mathbb {P}[{a}|\mathscr{R}^m]d{a}, \end{split} \end{align} \) (16) \( \begin{align} \begin{split}&= \int _{A}\mu ^{n}({x},{a})\mathbb {P}[{a}|\mathscr{R}^m]d{a}+Z\int _{A}\tilde{\sigma }^{n}((x,a),(x,a)^{n+1})\mathbb {P}[{a}|\mathscr{R}^m]d{a}, \end{split} \end{align} \) (17) \( \begin{align} \begin{split}&= G({x}; \mathscr{R}^m, \mathscr{F}^n) + Z\tilde{\Sigma }^{n}({x},({x},{a})^{n+1}), \end{split} \end{align} \) where \( \tilde{\Sigma }^{n}({x},({x},{a})^{n+1}) \) is the final term in Equation (16). The predictive distribution of a new observation after evaluating \( (x,a)^{n+1} \) is then given by (18) \( \begin{align} \begin{split}G({x}; \mathscr{R}^m, \mathscr{F}^{n+1}) \sim N(G({x}; \mathscr{R}^{m}, \mathscr{F}^{n}),\tilde{\Sigma }^{n}({x},({x},{a})^{n+1})^{2}). \end{split} \end{align} \) The new sample at \( (x,a)^{n+1} \) causes the posterior mean to change at other solutions and inputs according to the additive update \( Z\tilde{\Sigma }^{n}(x,(x,a)^{n+1}) \). 
Substituting the derived \( G({x}; \mathscr{R}^m, \mathscr{F}^{n+1}) \) (Equation (18)) into the VoI of acquiring a new simulation point \( ({x},{a}) \) (Equation (9)) results in (19) \( \begin{align} \begin{split}\text{VoI}((x,a)^{n+1}; \mathscr{R}^m, \mathscr{F}^n)&=\frac{1}{c_f}\mathbb {E}_{y^{n+1}}\left[ G^*(\mathscr{R}^m, \mathscr{F}^{n+1})- G^*(\mathscr{R}^m, \mathscr{F}^{n}) \Big | ({x},{a})^{n+1}\right], \end{split} \end{align} \) (20) \( \begin{align} \begin{split}&= \frac{1}{c_f}\mathbb {E}_{Z} \Big [\max _{x}\big \lbrace G({x}; \mathscr{R}^m, \mathscr{F}^n) + Z\tilde{\Sigma }^{n}({x},({x},{a})^{n+1})\big \rbrace -G^*(\mathscr{R}^m, \mathscr{F}^{n})\Big |({x},{a})^{n+1}\Big ]. \end{split} \end{align} \) The final expectation is identical to KG under input uncertainty [Pearce and Branke 2017; Toscano-Palmerin and Frazier 2018]. Following these works, the expectation can be evaluated by traditional KG for continuous parameters using Gaussian processes [Frazier et al. 2009], where the maximisations over \( x\in X \) embedded within the expectation and within \( G^*(\cdot) \) are replaced with maximisations over a discretised set \( x\in X_D\subset X \). With this replacement, the expectation over Z can be evaluated analytically. The \( \text{VoI}(({x},{a}); \mathscr{R}^m,\mathscr{F}^n) \) acquisition function may be optimised over the joint solution-input space to find the most beneficial \( ({x},{a})^{n+1} \) and corresponding \( \max \text{VoI}(\cdot) \).
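To make the expectation in Equation (20) concrete, the following sketch approximates it by plain Monte Carlo over Z on a discretisation \( X_D \). This is for illustration only: the paper evaluates the expectation analytically via the knowledge-gradient computation of Frazier et al. [2009], and the function name and the antithetic-sampling choice here are our own assumptions.

```python
import numpy as np

def kg_voi_mc(mu, sigma_tilde, cost=1.0, n_z=512, rng=None):
    """Monte Carlo sketch of Eq. (20).

    mu          : G(x; R^m, F^n) evaluated at each point of a discretisation X_D
    sigma_tilde : Sigma~^n(x, (x,a)^{n+1}) at the same points
    Returns (E_Z[max_x {mu + Z*sigma_tilde}] - max_x mu) / cost.
    """
    rng = np.random.default_rng(rng)
    z = rng.standard_normal(n_z)
    z = np.concatenate([z, -z])  # antithetic pairs: mean(Z) = 0 exactly
    # inner maximisation over x in X_D for each realisation of Z
    best = np.max(mu[None, :] + z[:, None] * sigma_tilde[None, :], axis=1)
    return float(best.mean() - mu.max()) / cost
```

With antithetic pairs the estimate is guaranteed non-negative, mirroring the non-negativity of the exact VoI; if \( \tilde{\Sigma }^{n} \) is identically zero, the estimate is exactly zero.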

The preceding VoI involves an integral over a, \( G(x;\mathscr{R}^{m}, \mathscr{F}^n) \), which must be approximated through Monte Carlo (special cases can be handled analytically, although we do not consider them here). We use \( N_{A} \) sampled parameter values, \( a_k \sim \mathbb {P}[{a}|\mathscr{R}^{m}] \) for \( k \in \lbrace 1,\dots , N_A\rbrace \), and we may write \( \begin{equation*} G(x, \mathscr{R}^{m}, \mathscr{F}^n) \,\approx \, \hat{G} (x, \mathscr{R}^{m}, \mathscr{F}^n) \,=\, \frac{1}{N_{A}} \sum _{k=1}^{N_{A}} \mu ^{n}(x, a_{k}), \end{equation*} \) with an analogous expression (averaging \( \tilde{\sigma }^{n} \) instead of \( \mu ^{n} \)) for \( \tilde{\Sigma }^n(x, (x,a)^{n+1}) \).
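As a concrete illustration of this estimator (a sketch only: the callable `mu_n` stands in for the Gaussian process posterior mean and is an assumption of this example):

```python
import numpy as np

def g_hat(mu_n, x, a_samples):
    """Monte Carlo estimate of G(x; R^m, F^n): average the posterior mean
    mu^n(x, a) over N_A samples a_k ~ P[a | R^m]."""
    return float(np.mean([mu_n(x, a) for a in a_samples]))
```

The same averaging over `a_samples`, with \( \tilde{\sigma }^{n} \) in place of \( \mu ^{n} \), gives the estimate of \( \tilde{\Sigma }^{n} \).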

Figure 4 illustrates KG with input uncertainty [Pearce and Branke 2017]. At the start of sampling, initial samples are allocated by Latin hypercube sampling; the Gaussian process prediction of \( \theta ({x},{a}) \) and \( G({x}; \mathscr{R}^{0}, \mathscr{F}^{10}) \) after the initial allocation are shown in Figure 4(b) and (e), assuming a uniform distribution for \( \mathbb {P}[{a}] \). A budget of \( {B } \) samples is then allocated sequentially according to Equation (20) (Figure 4(c)). Once all samples have been allocated, based on the learned Gaussian process model, the design \( {x} \) with the largest predicted performance, according to Equation (7), is recommended to the user (Figure 4(f)).

Fig. 4.

Fig. 4. In all plots, small points represent function evaluations. (a) shows the expected output \( \theta ({x},{a}) \), and (d) shows \( G(x) \) computed from the expected output \( \theta ({x},{a}) \) and a uniform parameter distribution. After 10 initial samples, (b) shows the surface \( \mu ^{10}(x,a) \) and (e) shows \( G(x, \mathscr{R}^{0},\mathscr{F}^{10}) \). After 90 further samples allocated by Equation (20), (c) shows the surface given by \( \mu ^{100}(x,a) \), and (f) shows \( G(x,\mathscr{R}^{0},\mathscr{F}^{100}) \).

4.6 VoI of Data from External Sources

Instead of collecting simulation data, we may collect data from a parameter data source \( r^{m+1}\sim \mathbb {P}[r|{a}^*,s^{m+1}] \), thereby augmenting the corresponding dataset \( \mathscr{R}^{m+1} = \mathscr{R}^m\cup \lbrace (s,r)^{m+1}\rbrace \). This also generates an improvement in predicted performance that we refer to as the VoI of collecting additional external data, given by (21) \( \begin{eqnarray} \text{VoI}(s; \mathscr{R}^m, \mathscr{F}^n) &=& \mathbb {E}_{r^{m+1}}\left[\frac{ G^*(\mathscr{R}^{m+1}, \mathscr{F}^n)- G^*(\mathscr{R}^m, \mathscr{F}^n)}{c_{s}} \big | s^{m+1}=s, \mathscr{R}^m\right], \end{eqnarray} \) with \( c_s \) being the cost of sampling external data source s. \( \text{VoI}(s;\mathscr{R}^m,\mathscr{F}^n) \) is computed using Monte Carlo with samples \( r^{m+1} \) generated from the predictive density \( \mathbb {P}[r^{m+1}|s^{m+1}, \mathscr{R}^{m}] \). Each sample leads to a realisation of \( G^{*}(\mathscr{R}^{m+1},\mathscr{F}^n) \). The difference between the average of these future \( G^*(\mathscr{R}^{m+1}, \mathscr{F}^n) \) realisations and the current \( G^*(\mathscr{R}^m, \mathscr{F}^n) \) estimates the (theoretically non-negative) benefit of sampling parameter data source s.

However, naively using Monte Carlo integration can lead to high variance and expensive computation. We instead recommend using importance sampling, common random numbers, and control variates, as follows.

We use the same Monte Carlo set \( A_{MC} = \lbrace a_1,\dots , a_{N_A}\rbrace \) as used to compute the benefit of simulation data \( \text{VoI}((x,a); \cdot) \), so that the two types of actions, simulator \( (x, a) \) and external data s, are compared with common random numbers. Given a hypothetical \( (s,r)^{m+1} \) (where \( s^{m+1} \) is chosen and \( r^{m+1} \) is a sample), the one-step look-ahead expected performance function \( G(x, \mathscr{R}^{m+1}, \mathscr{F}^n) \) may be estimated using \( A_{MC} \) with importance sampling \( \begin{equation*} G(x, \mathscr{R}^{m+1}, \mathscr{F}^n) \,\approx \, \hat{G} (x, \mathscr{R}^{m+1}, \mathscr{F}^n) \,=\, \frac{1}{N_{A}} \sum _{k=1}^{N_{A}} \mu ^{n}(x, a_{k}) \frac{\mathbb {P}[{a}_k|\mathscr{R}^{m+1}]}{\mathbb {P}[{a}_k|\mathscr{R}^{m}]}, \end{equation*} \) and the corresponding \( \hat{G}^*(\mathscr{R}^{m+1}, \mathscr{F}^n) \) is found by a Nelder-Mead optimiser over \( x\in X \). Thus, to evaluate \( \text{VoI}(s; \cdot) \), given a source \( s^{m+1}=s \), we generate \( N_R \) samples \( r_{l}^{m+1}\sim \mathbb {P}[r^{m+1}|s^{m+1};\mathscr{R}^{m}] \) for \( l \in \lbrace 1, \dots , N_R\rbrace \). Each sample \( r_l \) produces a realisation of a future dataset, \( \mathscr{R}^{m+1}_l \), a future posterior density, \( \mathbb {P}[a|\mathscr{R}^{m+1}_l] \), and importance weights, \( \frac{\mathbb {P}[{a}_{k}|\mathscr{R}_l^{m+1}]}{\mathbb {P}[{a}_{k}|\mathscr{R}^{m}]} \). The weights define a function \( \hat{G}(x, \mathscr{R}^{m+1}_l, \mathscr{F}^n) \) with peak \( \hat{G}^*(\mathscr{R}^{m+1}_l, \mathscr{F}^n) \). The VoI of collecting external data from source s is the average of the new peaks minus the current peak (22) \( \begin{equation} \text{VoI}(s;\mathscr{R}^{m}, \mathscr{F}^n) = \frac{1}{c_s} \left(\left(\frac{1}{N_{R}} \sum _{l=1}^{N_{R}} \hat{G}^{*}(\mathscr{R}^{m+1}_{l}, \mathscr{F}^n)\right) - \hat{G}^{*}(\mathscr{R}^{m},\mathscr{F}^n)\right). 
\end{equation} \) See Figure 5 (centre left, centre right) illustrating the densities and the one-step look-ahead update.
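A minimal sketch of this importance-sampled estimator follows. It replaces the Nelder-Mead maximisation over \( x \) with a fixed grid purely for illustration, and all function names and the density callables are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def voi_external_mc(mu_n, x_grid, a_samples, dens_old, dens_new_list, c_s=1.0):
    """Sketch of Eq. (22): reweight common a-samples (importance sampling with
    common random numbers) for each hypothetical dataset R^{m+1}_l and compare
    the average new peak of G-hat with the current peak.

    mu_n          : posterior mean callable mu^n(x, a)
    a_samples     : a_k ~ P[a | R^m], shared across actions (CRN)
    dens_old      : a -> P[a | R^m]
    dens_new_list : one a -> P[a | R^{m+1}_l] callable per sampled r_l
    """
    M = np.array([[mu_n(x, a) for a in a_samples] for x in x_grid])  # |X_D| x N_A
    p_old = np.array([dens_old(a) for a in a_samples])
    g_now = M.mean(axis=1).max()                  # current peak G-hat*
    peaks = []
    for dens_new in dens_new_list:
        w = np.array([dens_new(a) for a in a_samples]) / p_old  # IS weights
        peaks.append((M * w).mean(axis=1).max())  # peak of G-hat(., R^{m+1}_l)
    return (float(np.mean(peaks)) - g_now) / c_s
```

If a hypothetical observation leaves the posterior density unchanged, every importance weight equals one and the estimated VoI is exactly zero.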

Fig. 5.

Fig. 5. Left: The simulation metamodel given 10 points. Centre left: Black shows \( \mathbb {P}[a|\mathscr{R}^{10}], \) and green shows four realisations of \( \mathbb {P}[a|\mathscr{R}_l^{11}] \) . Centre right: Black shows \( G(x, \mathscr{R}^{10}, \mathscr{F}^{10}), \) and green shows the four realisations of \( G(x, \mathscr{R}_l^{11}, \mathscr{F}^{10}) \) . Right: Using the control variate \( G(x_r^{10,10}, \mathscr{R}_l^{11}, \mathscr{F}^{10}) \) reduces variance and enforces non-negativity.

Finally, we may further reduce variance by modifying \( \hat{G}^{*}(\mathscr{R}^{m},\mathscr{F}^n) \) to act as a control variate. This is achieved by decomposing the distribution over a as follows: (23) \( \begin{eqnarray} \mathbb {P}[a|\mathscr{R}^m] = \int _{r^{m+1}} \underbrace{\mathbb {P}[a|\mathscr{R}^{m+1}]\mathbb {P}[r^{m+1}|s^{m+1}, \mathscr{R}^m]}_{\mathbb {P}[a, r^{m+1}|s^{m+1}, \mathscr{R}^{m}]}dr^{m+1}, \end{eqnarray} \) which may be substituted into the expression for \( G^*(\mathscr{R}^m, \mathscr{F}^n) \) and after simplification yields \( \begin{equation*} G^*(\mathscr{R}^m, \mathscr{F}^n) = \mathbb {E}_{r^{m+1}}\left[ G\left(x^{m,n}_r,\, \mathscr{R}^{m+1}, \mathscr{F}^n\right) \big |s^{m+1}, \mathscr{R}^m \right], \end{equation*} \) where the argument \( x_r^{m,n} = x_r(\mathscr{R}^{m}, \mathscr{F}^n) \) is the current optimal solution. The final expression for the VoI of external data is therefore \( \begin{equation*} \text{VoI}(s;\mathscr{R}^{m}, \mathscr{F}^n) = \frac{1}{c_s} \frac{1}{N_{R}} \sum _{l=1}^{N_{R}} \left[\max _x \hat{G}(x, \mathscr{R}^{m+1}_{l}, \mathscr{F}^n) - \hat{G}(x^{m,n}_r,\mathscr{R}^{m+1}_l,\mathscr{F}^n)\right]. \end{equation*} \) Note that each term in the summation is non-negative by construction, in contrast with Equation (22), whose summation terms may be positive or negative. See Figure 5 (right) illustrating realisations of the summation argument \( \hat{G}(x, \mathscr{R}^{m+1}_{l}, \mathscr{F}^n) - \hat{G}(x^{m,n}_r,\mathscr{R}^{m+1}_l,\mathscr{F}^n) \), whose peaks are all non-negative.
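The control-variate form can be sketched analogously, again with an illustrative grid over \( x \) in place of a continuous optimiser and with hypothetical callables for the densities:

```python
import numpy as np

def voi_external_cv(mu_n, x_grid, a_samples, dens_old, dens_new_list, c_s=1.0):
    """Control-variate sketch: evaluate each look-ahead G-hat both at its own
    maximiser and at the current optimum x_r^{m,n}, so every summand is
    non-negative (names and the x-grid maximisation are illustrative)."""
    M = np.array([[mu_n(x, a) for a in a_samples] for x in x_grid])
    p_old = np.array([dens_old(a) for a in a_samples])
    i_star = int(np.argmax(M.mean(axis=1)))       # index of x_r^{m,n}
    gaps = []
    for dens_new in dens_new_list:
        w = np.array([dens_new(a) for a in a_samples]) / p_old
        g_l = (M * w).mean(axis=1)                # G-hat(x, R^{m+1}_l, F^n)
        gaps.append(g_l.max() - g_l[i_star])      # non-negative by construction
    return float(np.mean(gaps)) / c_s
```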

For the remainder of this work, we will use the shorthand \( \text{VoI}^t(\cdot) = \text{VoI}(\cdot ; \mathscr{R}^m, \mathscr{F}^n) \) to refer to the VoI at iteration \( m + n=t \). We note that extending the method to account for multiple parameter data sources is simply a case of computing \( \text{VoI}^t(\cdot) \) for each individual parameter data source \( s \in S \) and taking the maximum.

As external input parameter \( (s,r) \) pairs are collected, the distribution over input parameter a is updated, in turn updating \( G(x, \mathscr{R}^m, \mathscr{F}^n) \). Figure 6 illustrates the model updating with increasing external data.

Fig. 6.

Fig. 6. Iterations of BICO collecting only external data: the simulation metamodel \( \mu ^n(x,a) \) (left) is unchanged and only the input parameter distribution, \( \mathbb {P}[a|\mathscr{R}^{m}] \), is updated. The remaining top plots show \( \mathbb {P}[a|\mathscr{R}^5] \), \( \mathbb {P}[a|\mathscr{R}^{10}] \) and \( \mathbb {P}[a|\mathscr{R}^{30}] \). The bottom plots show the corresponding \( G(x, \mathscr{R}^5, \mathscr{F}^{10}) \), \( G(x, \mathscr{R}^{10}, \mathscr{F}^{10}) \) and \( G(x, \mathscr{R}^{30}, \mathscr{F}^{10}) \), which are found by integrating the simulation metamodel over the (vertical) \( a\in A \) dimension with the corresponding \( \mathbb {P}[a|\cdot ] \). Dashed lines represent the confidence interval.

4.7 The Overall Algorithm

BICO is outlined in Algorithm 1. On line 1, the algorithm begins by fitting a Gaussian process model to a set of initial simulation points \( \mathscr{F}^n \) chosen by a Latin hypercube sampling (LHS) ‘space-filling’ experimental design. In addition, we compute the posterior parameter distribution from any collected parameter data source points \( \mathscr{R}^m \). After initialisation, the algorithm continues in an optimisation loop until the budget \( {B } \) has been consumed. During each iteration, we compute the VoI of collecting a new simulation point \( ({x}^{n+1},{a}^{n+1},y^{n+1}) \) according to \( \text{VoI}^{t}(({x},{a})) \) (line 3) and the VoI of collecting a new sample from each of the parameter data sources \( s \in S \), \( \text{VoI}^{t}(s) \) (line 4). Figure 7(a) shows a Gaussian process simulation metamodel with the input parameter distribution, and the corresponding VoI is displayed in Figure 7(b): the multi-modal surface shows the value of collecting a simulation point, \( \text{VoI}^t(x,a) \), while the value of collecting external data, \( \text{VoI}^t(s) \), is overlaid as a constant plane. The action with the greatest value determines whether we collect a sample \( ({x},{a},y)^{n+1} \) or \( r^{m+1} \). In the first case, the Gaussian process model is updated with the new simulation sample (lines 6–9); in the second case, the posterior parameter distribution is updated with the new parameter data source sample (lines 11–14). After \( {B } \) samples, the design \( {x} \) with the largest predicted performance \( G({x}) \) is recommended to the user (line 15). The source code can be downloaded from https://github.com/JuanUngredda/Input-Uncertainty.
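The main loop of Algorithm 1 can be sketched as follows. All callables are placeholders for the components described above (this is our illustrative skeleton, not the authors' released code):

```python
def bico_loop(voi_sim_fn, voi_src_fns, do_sim, do_src, budget):
    """High-level sketch of BICO (Algorithm 1).

    voi_sim_fn()  -> (value, (x, a))   best simulation action, cf. Eq. (20)
    voi_src_fns   : {s: callable}      VoI of each data source, cf. Eq. (22)
    do_sim(x, a)  : run the simulator and refit the Gaussian process
    do_src(s)     : query source s and update P[a | R^m]
    """
    history = []
    for _ in range(budget):
        v_sim, xa = voi_sim_fn()
        # best external-data action over all sources s in S
        s_best, v_src = max(((s, f()) for s, f in voi_src_fns.items()),
                            key=lambda t: t[1])
        if v_sim >= v_src:
            do_sim(*xa)
            history.append(("sim", xa))
        else:
            do_src(s_best)
            history.append(("src", s_best))
    return history
```

In the full algorithm, `voi_sim_fn` and each entry of `voi_src_fns` are re-evaluated every iteration because both posteriors change as data arrive.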

Fig. 7.

Fig. 7. (a) The two inferred posterior distributions within BICO. The surface shows the Gaussian process simulation metamodel, \( \mu ^n(x,a) \), after nine simulation points, while the gray bell curve shows the estimated input parameter distribution, \( \mathbb {P}[a|\mathscr{R}^m] \), after 10 samples. (b) The VoI used within BICO. The multi-modal surface shows the value of collecting a simulation point, \( \text{VoI}^t(x,a) \), while the value of collecting external data, \( \text{VoI}^t(s) \), is overlaid as a constant plane. The optimal action is to collect simulation data \( y^{n+1}=f(30, 0) \).

4.8 Theoretical Properties of BICO

We first outline the proof of asymptotic consistency of BICO in the infinite sampling budget limit and leave full details to the appendix. Specifically, we show that if X is discrete and \( A \subset \mathbb {R}^{d} \) is continuous, the BICO algorithm will find the true optimal solution \( x^* \) as well as the true parameters \( a^* \) in the limit. This builds on a previous proof by Toscano-Palmerin and Frazier [2018] that shows consistency under input uncertainty when only simulation points are collected. In the proof, we assume a discrete solution space, which simplifies matters significantly; however, the discretisation may be arbitrarily fine.

Proposition 3 shows that if a single action is performed infinitely often, then the value of performing that action again vanishes. Since BICO always selects the action with the largest VoI, this implies that the value of all actions eventually vanishes.

Proposition 3.

Let \( (x^{\prime }, a^{\prime }) \in X \times A \) and suppose that \( f(x^{\prime },a^{\prime }) \) is repeatedly observed, then \( \text{VoI}^{t}((x^{\prime }, a^{\prime })) \rightarrow 0 \) as \( t \rightarrow \infty \). Let \( s^{\prime } \in S \) and suppose that \( \mathbb {P}[r|a^*, s^{\prime }] \) is repeatedly observed, then \( \text{VoI}^{t}(s^{\prime }) \rightarrow 0 \) as \( t \rightarrow \infty \).

Furthermore, if the VoI of all actions is zero, this implies that \( x^* \) is known and \( \mathbb {P}[a|\mathscr{R}^m] \) is a point-mass distribution on \( a^{*} \).

Proposition 4.

Assume a squared exponential kernel with finite length scale \( l_{a}\lt \infty \). If \( \text{VoI}((x,a); \mathscr{R}^m,\mathscr{F}^n)= 0 \) and \( \text{VoI}(s;\mathscr{R}^m,\mathscr{F}^n)= 0 \) for all \( (x,a) \) and s, then \( \text{arg max}_{x} G(x;\mathscr{R}^m,\mathscr{F}^{\infty }) =\text{arg max}_{x} \int _{A}\theta (x,a)\mathbb {P}[a|\mathscr{R}^m]da \) and \( \mathbb {P}[a|\mathscr{R}^m] = \delta _{a=a^{*}} \).

We next prove that BICO will not sample from irrelevant information sources. If an input parameter a does not influence the simulator output, then an algorithm should not waste budget collecting external data to reduce its uncertainty. With a squared exponential kernel, the length scale of each input parameter dimension \( a\in A \) determines the relevance of that parameter: a long length scale indicates that the model output is insensitive to changes in that parameter, so learning more about it has little impact, and an algorithm should not sample the corresponding data source. Remark 1 proves that BICO behaves in exactly this way and will not sample irrelevant external information sources.

Remark 1.

Let \( (x,a) \in X \times A \) where \( A\subset \mathbb {R} \) and assume a single information source \( N=1 \). Let the Gaussian process model use a squared exponential kernel, (24) \( \begin{align} \begin{split}k^{0}((x,a),(x,a)^{\prime })&=\sigma ^{2}_{f}e^{-\frac{1}{2}\big (\frac{(x-x^{\prime })^{2}}{l_{x}}+\frac{(a-a^{\prime })^{2}}{l_{a}} \big) }. \end{split} \end{align} \) Then \( \text{VoI}^t(s; \mathscr{R}^m,\mathscr{F}^n) \rightarrow 0 \) as \( l_{a} \rightarrow \infty \) for any m and n.

Skip 5RESULTS AND DISCUSSION Section

5 RESULTS AND DISCUSSION

To demonstrate the performance of BICO, we compare it against a two-stage algorithm which first collects m data source samples to update the input posterior distribution. Then, in stage two, the remaining budget is dedicated to sequential sampling from the simulator. If there are two or more input distributions, we distribute the initial portion m equally over the different inputs. Since in practice it is not possible to know in advance how many data source samples m should be collected, we show results for different settings of m. The best setting of m can then be regarded as an ideal baseline that cannot be attained in real problems.

For simplicity and without loss of generality, in all figures, we consider the same acquisition cost, \( c_{s}=c_{f}=1 \). In practice, the only requirement is that \( c_{s} \) and \( c_{f} \) are comparable measures of ‘cost’ for data collection. They are application specific and often depend on the user’s preferences. Results for BICO are shown in orange, with OC error bars (see Equation (2)) as vertical lines and the average number of samples and its error bar as horizontal lines.

5.1 Gaussian Process Generated Experiments

We consider a test function generated from a Gaussian process with a squared exponential kernel with known hyper-parameters \( l_{XA}=10 \), \( \sigma ^2_{0}=1 \), \( \sigma ^2_{\epsilon }=(0.1)^2 \), with solution space \( X = [0,100] \) and either one parameter in \( A=[0,100] \) or two parameters with \( A=[0,100]^2 \). The total budget in both cases was set to \( {B }=100 \). To model input uncertainty, we assume a uniform prior \( \mathbb {P}[a]=\frac{1}{100} \) and normally distributed data source samples with unknown mean and known variance \( \sigma ^{2}_{s}=10 \) for all data sources.

Results are shown in semi-log scale in Figure 8, on the left for the case of a single parameter and on the right for the case of two parameters with equal variance. The horizontal axis shows the number of samples m allocated to the parameter data source to update \( \mathbb {P}[a|\mathscr{R}^m] \), whereas the vertical axis shows the confidence interval of the OC after the budget \( {B } \) has been completely allocated. In both cases, BICO balances the sampling allocation effort in a sensible way, finding results comparable to taking the optimal initial number of parameter data source samples. Somewhat surprisingly, it seems more effort should be allocated to external data collection if there is only one data source. A possible reason is that with two data sources, the space over which the simulator is defined is of higher dimension, requiring more simulation effort to build a credible Gaussian process model.

Fig. 8.

Fig. 8. Results for Gaussian process generated test functions. Mean and 95% confidence interval for the OC plotted in a semi-log scale for \( {B }=100 \) . (a) With one-dimensional solution space and one parameter. (b) With one-dimensional solution space and two parameters. In each experiment, each parameter has one parameter data source. Each confidence interval is generated using 20 replications.

Figure 9 shows the case of two parameter data sources with different variances. More specifically, we consider two sources \( S=\lbrace 1,2\rbrace \) with \( \sigma ^{2}_{s=1}=5 \) and \( \sigma ^{2}_{s=2}=10 \), where the horizontal and vertical axes show the number of samples allocated to parameter data source 1 and parameter data source 2, respectively. In both cases, for BICO (orange) and the fixed initial allocation (dotted yellow box), the best budget allocation is located around 13 samples for parameter data source 1 and 18 samples for parameter data source 2. Since all length scales are equal for this experiment, differences in the number of samples collected by BICO are mainly due to the difference in variance of the parameter data sources, and so, as expected, it is best to allocate more samples to the data source with higher variance. The colour bar in Figure 9 compares the confidence interval of the OC for BICO with the benchmark.

Fig. 9.

Fig. 9. Grid plot: Number of samples allocated to the parameter data source 1 (horizontal axis) and the parameter data source 2 (vertical axis), where the BICO sample allocation (orange) is represented by a confidence interval for source 1 (horizontal) and source 2 (vertical). Colours inside the grid show the mean OC for the fixed initial proportion approach according to the colour bar for each allocation combination. The best OC for the fixed initial allocation is shown in yellow (dotted line). Colour bar: OC value in a log scale where BICO (orange) is represented as a vertical confidence interval. BICO and each allocation combination are averaged over 20 replications.

5.2 Newsvendor Simulation Optimisation

Here, we consider the problem of a newsvendor, who must decide how many copies of the day’s paper to stock in the face of uncertain demand, where any unsold copies are worthless at the end of the day. The solution x is the number of newspapers ordered, and demand is a random variable \( r \sim \text{Normal}(\Gamma ,\sigma _{r}^{2}) \) with true mean \( \Gamma =40 \) and \( \sigma _{r}^{2}=\sqrt {10} \). In this experiment, we assume the solution space is continuous. Both parameters, the expected demand \( \Gamma \) and the variance \( \sigma _{r}^{2} \), are unknown in the simulation and must be estimated from real data. The profit, \( f(x,r) \), can be calculated as \( \begin{equation*} f(x,r) = p \min (x,r) - lx, \end{equation*} \) where p is the selling price and l the production/purchase cost of a newspaper, with \( p\gt l \). For this experiment, we set \( p=5 \), \( l=3 \) and \( x \in [0, 100] \). We use an initial allocation of 10 samples to train the Gaussian process model from an overall budget of \( {B }=100 \). There is only one data source, from which the unknown mean and variance have to be estimated. To update input uncertainty, we assume a non-informative prior that is conjugate to the normal distribution, so \( \mathbb {P}[\Gamma , \sigma ^{2}_{r}|\mathscr{R}^{m}] \) can be sampled hierarchically using \( \Gamma | \sigma _{r}^{2}, \mathscr{R}^{m} \sim \text{Normal}(\bar{r},\sigma _{r}^{2}/m) \) and \( 1/\sigma _{r}^{2}| \mathscr{R}^{m} \sim \text{Gamma}((m-1)/{2},s^{2}(m-1)/2) \). \( \mathbb {P}[r^{m+1}|\mathscr{R}^{m}] \) follows a Student t distribution \( t_{m-1}(\bar{r}, s^{2}(1+1/m)) \), where \( \bar{r} \) and \( s^{2} \) are the sample mean and unbiased sample variance of the collected data \( \mathscr{R}^{m} \).
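The profit function and the hierarchical posterior sampling above can be sketched as follows. NumPy's Gamma generator is parameterised by shape and scale, so the stated rate \( s^{2}(m-1)/2 \) is inverted; function names are our own.

```python
import numpy as np

def newsvendor_profit(x, r, p=5.0, l=3.0):
    """Per-day profit f(x, r) = p * min(x, r) - l * x."""
    return p * np.minimum(x, r) - l * x

def sample_posterior(data, n_samples, rng=None):
    """Hierarchical draw from P[Gamma, sigma_r^2 | R^m]:
        1/sigma^2 | R^m        ~ Gamma((m-1)/2, rate = s^2 (m-1)/2)
        Gamma | sigma^2, R^m   ~ Normal(r_bar, sigma^2 / m)
    """
    rng = np.random.default_rng(rng)
    m, r_bar, s2 = len(data), float(np.mean(data)), float(np.var(data, ddof=1))
    prec = rng.gamma(shape=(m - 1) / 2, scale=2.0 / (s2 * (m - 1)), size=n_samples)
    sigma2 = 1.0 / prec
    mean = rng.normal(r_bar, np.sqrt(sigma2 / m))
    return mean, sigma2
```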

Figure 10 shows that BICO (orange) again manages to allocate the budget \( {B } \) close to an optimal fixed initial number of samples (blue).

Fig. 10.

Fig. 10. Newsvendor with unknown mean and variance of demand. Mean and 95% confidence interval of OC where \( B=100 \) . Blue points: At the start, m input parameter samples are collected to estimate a*, and thereafter standard KG for (fixed) input uncertainty is applied to allocate the remaining budget to simulation points. There is no automatic trade-off between data types; m must be user specified. Orange: BICO algorithm, the horizontal confidence interval showing the range of sample sizes m chosen by BICO. BICO automatically collects an appropriate number of data source samples. Each confidence interval is generated using 25 replications.

5.3 Production Line Optimisation

We consider a production line of a manufacturing plant from the Simulation Optimisation Library [Pasupathy and Henderson 2011]. The production line consists of three finite-capacity service queues in sequence, and parts leaving a machine after service are immediately transferred to the next queue. If the next queue is full, the current machine is blocked and must stop working until room becomes available. Parts arrive at queue 1 according to a Poisson process with true rate \( \lambda =0.5 \), and the service time of each machine is exponentially distributed with rate \( \xi _{i}, i=1,2,3 \). The arrival rate is an unknown parameter in the simulation and must be estimated through real-data acquisition. The decision variables to be optimised are the (continuous) service rates of the machines, \( \vec{\xi }^{T}=[\xi _{1},\xi _{2},\xi _{3}] \in [0,2]^{3} \).

The goal is to maximise the revenue function, \( \begin{equation*} R(\vec{\xi }, \lambda) = \frac{10,000 \rho (\vec{\xi }, \lambda)}{1+\vec{c}^{T}\vec{\xi }} - 400, \end{equation*} \) over a time horizon of \( t=1,000 \) timesteps. In the preceding revenue function, \( \rho (\vec{\xi }, \lambda) \) denotes the throughput of the production line and is defined as the time-averaged number of parts leaving the last queue, which is a function of the service times and the unknown rate \( \lambda \). \( \vec{c}^{T}=[1,5,9] \) are cost factors related to the chosen service times of the machines.

To update the input uncertainty distribution, we assume a non-informative Gamma prior, which is conjugate to the exponentially distributed inter-arrival data, so \( \mathbb {P}[\lambda |\mathscr{R}^{m}] \) can be sampled using a Gamma distribution with shape \( 1/2 + m \) and rate \( \bar{r}m \). \( \mathbb {P}[r^{m+1}|\mathscr{R}^{m}] \) follows a Pareto density with minimum value parameter 0, shape parameter \( 1/2 + m \) and scale parameter \( \bar{r}m \), where \( \bar{r} \) is the sample mean of the collected data \( \mathscr{R}^m \). Figure 11 shows an adequate allocation for both methods, which coincides with around \( 25\% \) of the overall budget \( (B=100) \).
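Sampling \( \mathbb {P}[\lambda |\mathscr{R}^{m}] \) as stated above can be sketched in a few lines. NumPy's Gamma generator takes shape and scale, so the rate \( \bar{r}m \) is inverted; the assumption that the data r are inter-arrival times is ours.

```python
import numpy as np

def sample_lambda_posterior(data, n_samples, rng=None):
    """Draw lambda ~ P[lambda | R^m]: Gamma with shape 1/2 + m and rate
    r_bar * m, the non-informative conjugate update for exponentially
    distributed inter-arrival data (Section 5.3)."""
    rng = np.random.default_rng(rng)
    m, r_bar = len(data), float(np.mean(data))
    return rng.gamma(shape=0.5 + m, scale=1.0 / (r_bar * m), size=n_samples)
```

For example, m = 10 observations with mean 2 give a posterior mean of \( (1/2+10)/(2\cdot 10) = 0.525 \).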

Fig. 11.

Fig. 11. Production line optimisation experiment plot with mean and 95% confidence interval for the OC plotted in a semi-log scale for \( {B }=100 \) . Each confidence interval is generated using 50 replications.

Skip 6CONCLUSION Section

6 CONCLUSION

In this article, we proposed a novel unified algorithm for simulation optimisation under input uncertainty. In each iteration, it automatically determines whether to perform more simulation experiments or instead collect more real-world data to reduce the uncertainty about the input parameters. A comparison with an algorithm that allocates a fixed, pre-determined portion m of the available budget to external data collection demonstrated that BICO's allocation mechanism is very effective: it achieves a solution performance, and a fraction of budget allocated to external data collection, similar to what can be achieved with the optimal fixed allocation, which is not known in practice.

Several interesting extensions of this work with concrete practical applications are possible. One example is the extension to multi-objective optimisation, where the uncertainty about a user's preferences over objectives can be reduced by querying the user. Further computational efficiency may be attained by transfer learning, carrying over information from one stage to the next to speed up the optimisation, since the problems solved at every stage are very similar. Lastly, the scalability of BICO with respect to parameter and solution space dimensionality should be investigated.

A APPENDICES

Skip A.1Monte Carlo Convergence of External Data Source Section

A.1 Monte Carlo Convergence of External Data Source

In this section, we present convergence rates for the Monte Carlo approximation used to estimate \( \text{VoI}(s; \mathscr{R}^m, \mathscr{F}^n) \). Finding an overall error convergence rate, \( \phi _{N_{R}, N_{A}} \), involves two applications of Monte Carlo integration: the inner expectation over a, which approximates \( G^{*}(\cdot) \approx \hat{G}^{*}(\cdot) \), and the outer expectation over \( r^{m+1} \), which approximates the next timestep. We define the error as the difference between the (theoretical) truth and the Monte Carlo integrated approximation, \( \begin{align*} \phi _{N_{R}, N_{A}} = \mathbb {E}_{r^{m+1}}\left[G^*(\mathscr{R}^{m+1}, \mathscr{F}^n)- G^*(\mathscr{R}^m, \mathscr{F}^n) \big | s, \mathscr{R}^m\right] - \frac{1}{N_{R}} \frac{1}{N_{A}} \sum _{l=1}^{N_{R}}\bigg [\sum _{k=1}^{N_{A}}\big [\mu ^{n}({x}^{m+1,n}_{r},{a}_{k}) - \mu ^{n}({x}^{m,n}_{r},{a}_{k})\big ]\bigg ]. \end{align*} \)

We may decompose \( \phi _{N_{R}, N_{A}} \) into the approximation errors of the two Monte Carlo steps by considering the intermediate quantity in which only the inner integral is approximated, \( \begin{equation*} \mathbb {E}_{r^{m+1}}\left[\frac{1}{N_{A}}\sum _{k=1}^{N_{A}}\big [\mu ^{n}({x}^{m+1,n}_{r},{a}_{k}) - \mu ^{n}({x}^{m,n}_{r},{a}_{k})\big ]\right]. \end{equation*} \)

For example, we may write \( \phi _{N_{A}} \) that represents error dependent only on the inner integration over a given by \( \begin{align*} \phi _{N_{A}} = \mathbb {E}_{r^{m+1}}\left[G^*(\mathscr{R}^{m+1}, \mathscr{F}^n)- G^*(\mathscr{R}^m, \mathscr{F}^n) \big | s, \mathscr{R}^m\right] - \mathbb {E}_{r^{m+1}}\left[\frac{1}{N_{A}}\sum _{k=1}^{N_{A}}\big [\mu ^{n}({x}^{m+1,n}_{r},{a}_{k}) - \mu ^{n}({x}^{m,n}_{r},{a}_{k})\big ]\right]. \end{align*} \)

We may then also write \( \phi _{N_{R}} \) that represents error dependent only on the outer integration over \( r^{m+1} \) given by \( \begin{align*} \phi _{N_{R}} = \mathbb {E}_{r^{m+1}}\left[\frac{1}{N_{A}}\sum _{k=1}^{N_{A}}\big [\mu ^{n}({x}^{m+1,n}_{r},{a}_{k}) - \mu ^{n}({x}^{m,n}_{r},{a}_{k})\big ]\right] - \frac{1}{N_{R}} \frac{1}{N_{A}} \sum _{l=1}^{N_{R}}\bigg [\sum _{k=1}^{N_{A}}\big [\mu ^{n}({x}^{m+1,n}_{r},{a}_{k}) - \mu ^{n}({x}^{m,n}_{r},{a}_{k})\big ]\bigg ]. \end{align*} \)

Finally, note that \( \phi _{N_{R},N_A} = \phi _{N_{R}} + \phi _{N_{A}} \), as the intermediate term cancels out.

Each sum is composed of independent and identically distributed random samples; thus, the error rate for \( \phi _{N_{A}} \) is \( O(N_{A}^{-1/2}) \) and that for \( \phi _{N_{R}} \) is \( O(N_{R}^{-1/2}) \), which yields an overall error rate \( \phi _{N_{R}, N_{A}} \) of \( O(N_{A}^{-1/2}+ N_{R}^{-1/2}) \). In some cases, \( G^{*}(\cdot) \) may be derived in closed form (e.g., when \( \mathbb {P}[a|\mathscr{R}^{m}] \) is normally distributed [Le and Branke 2020]), in which case \( \phi _{N_A}=0 \).

We computed all Monte Carlo estimations with a fixed size \( N_{A}=150 \) and varying \( N_R \) on the newsvendor simulation optimisation problem (see Section 5.2). Mean and standard error are presented in Figure 12.

Fig. 12.

Fig. 12. At a given iteration of BICO, we fix \( N_{A} \) samples as follows. (a) \( \text{VoI}(s; \mathscr{R}^m,\mathscr{F}^n) \) mean and error bars with 95% confidence interval over number of predictive distribution samples \( N_{R} \) . (b) The Monte Carlo standard error over number of predictive distribution samples \( N_{R} \) .

Skip A.2Computational Complexity of BICO Section

A.2 Computational Complexity of BICO

We start with the complexity of fitting the Gaussian process model, \( \mu ^n(x,a) \), \( k^n((x,a), (x^{\prime },a^{\prime })) \). Each iteration of BICO starts by fitting the Gaussian process hyper-parameters by maximising the marginal likelihood. Each marginal likelihood evaluation requires computing the inverse of the covariance matrix \( k^0(\tilde{X}^n,\tilde{X}^n)+I\sigma \), costing \( O(n^{3}) \). By caching this inverted matrix, all subsequent posterior mean and variance computations are reduced to \( O(n) \) and \( O(n^{2}) \), respectively.
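The caching scheme can be sketched with a Cholesky factorisation, a numerically stabler equivalent of caching the inverse (class and method names are ours; the kernel is supplied as a callable):

```python
import numpy as np

class CachedGP:
    """Sketch of the caching in Section A.2: factorise
    K = k0(X~^n, X~^n) + sigma^2 I once per iteration (O(n^3)); thereafter
    each posterior mean costs O(n) and each posterior variance O(n^2)."""

    def __init__(self, X, y, kernel, noise):
        self.X, self.kernel = X, kernel
        K = kernel(X, X) + noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)            # cached factorisation, O(n^3)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

    def mean(self, x):
        """Posterior mean at x: one dot product, O(n)."""
        return float(self.kernel(x[None, :], self.X)[0] @ self.alpha)

    def var(self, x):
        """Posterior variance at x: one triangular solve, O(n^2)."""
        kx = self.kernel(self.X, x[None, :])[:, 0]
        v = np.linalg.solve(self.L, kx)
        return float(self.kernel(x[None, :], x[None, :])[0, 0]) - float(v @ v)
```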

Next we must compute the VoI for simulation data \( \text{VoI}^t(x,a) \). This can be explicitly computed using a finite discretisation of the space \( X\times A \). We discretise X and A by generating \( N_{X} \) Latin hypercube samples, \( X_{MC} \), for X and \( N_A \) posterior samples \( A_{MC}\sim \mathbb {P}[{a}|\mathscr{R}^m] \) for A. Therefore, the size of the finite discretisation of the product space \( X \times A \) is \( N_{A} N_{X} \). However, we also include the candidate simulation point \( (x,a)^{n+1} \) from the optimiser, which enlarges the number of mean and covariance computations to \( N_{A} (N_{X} + 1) \). Since \( \tilde{X}_{MC} = X_{MC}\times A_{MC} \) is fixed during the optimisation runs, the posterior mean over these points, \( \mu ^n(\tilde{X}_{MC}) \), may be computed once and stored. Posterior covariance computations may be reduced as follows: \( \begin{equation*} k^n(\tilde{X}_{MC};({x},{a})^{n+1}) = k^0(\tilde{X}_{MC};({x},{a})^{n+1})-k^0(\tilde{X}_{MC};\tilde{X}^n)(k^0(\tilde{X}^n,\tilde{X}^n)+I\sigma)^{-1}k^0(\tilde{X}^n;({x},{a})^{n+1}). \end{equation*} \) Notice that \( k^0(\tilde{X}_{MC};\tilde{X}^n)(k^0(\tilde{X}^n,\tilde{X}^n)+I\sigma)^{-1} \) does not change with the new sample \( (x,a)^{n+1} \) and need only be computed once and stored. Therefore, we reduce the mean and covariance computation to \( O(N_{A} N_{X} n) \) for these \( N_{A} N_{X} \) points. The remaining points, \( (x^{n+1},A_{MC}) \), result in a posterior mean and covariance computation cost of \( O(N_{A} n^{2}) \). We then apply the algorithm developed in the work of Scott et al. [2011] that takes \( O((N_{X}+1)\log (N_{X}+1)) \) to compute \( \text{VoI}^t(x,a) \). Finally, let K be the number of times \( \text{VoI}^t(x,a) \) is evaluated during its optimisation. Therefore, the overall optimisation cost is \( O(K(N_{A} N_{X} n + N_{A} n^{2} + N_{X}\log (N_{X}))) \).

Finally, we must compute the VoI of data from external sources, \( \text{VoI}(s; \cdot) \). Let \( N_{R} \) be the number of predictive distribution samples from \( \mathbb {P}[r^{m+1}|s^{m+1};\mathscr{R}^{m}] \). For each sample, we perform continuous optimisation over \( x \in X \); each optimisation call requires computing the posterior mean at the \( (x, A_{MC}) \) locations, costing \( O(N_{A} n) \). This computation is repeated for each of the K optimiser iterations, for each of the \( N_R \) samples, for each of the N sources. Thus, the resulting cost is \( O(K N N_{A} N_{R} n) \).
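The cost analysis above corresponds to a nested Monte Carlo loop, which the sketch below makes explicit. It is a schematic only: `sample_r`, `posterior_a_given_r`, and `mu_n` are problem-specific stand-ins we introduce for illustration, and the inner maximisation runs over a grid rather than a continuous optimiser.

```python
import numpy as np

def voi_external_sketch(sample_r, posterior_a_given_r, mu_n, X_grid, A_mc,
                        n_r=200, rng=None):
    """Schematic Monte Carlo estimate of VoI(s):
    E_r[ max_x E_a[mu^n(x,a) | R^m, r] ] - max_x E_a[mu^n(x,a) | R^m]."""
    if rng is None:
        rng = np.random.default_rng()
    # current best expected performance under P[a | R^m]
    current = max(np.mean([mu_n(x, a) for a in A_mc]) for x in X_grid)
    vals = []
    for _ in range(n_r):                       # N_R predictive samples
        r = sample_r(rng)                      # hypothetical new observation
        A_new = posterior_a_given_r(r, rng)    # samples from P[a | R^m, r]
        vals.append(max(np.mean([mu_n(x, a) for a in A_new]) for x in X_grid))
    return float(np.mean(vals) - current)
```

Replacing the grid search by a K-step continuous optimiser, and each `mu_n` call by an \( O(n) \) posterior mean evaluation over the \( N_A \) samples, recovers the stated \( O(K N_A N_R n) \) cost per source.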

The computational time of a BICO iteration in the production line optimisation problem (see Section 5.3) with \( n=50 \), \( N_{A}=200 \), and \( N_{R}=200 \) is approximately 50 seconds, of which 20 seconds are spent evaluating VoI(s). Time was measured on a computer with an Intel Core i7-10610U 1.80-GHz processor and 16 GB of RAM, running Ubuntu 20.04.1, with Python as the programming language.


A.3 BICO Asymptotic Consistency

In this section, we show that if X is discrete and \( A \subset \mathbb {R}^{d} \) is continuous, then given an infinite sampling budget B, i.e., \( t\rightarrow \infty \), the BICO algorithm finds the true optimal solution \( x^* \) as well as the true parameter a*.

The proof is composed of three parts. Firstly, Proposition 1 shows that \( \text{VoI}(\cdot ; \mathscr{R}^m,\mathscr{F}^n) \) is non-negative for any action. Secondly, for any single action that is performed infinitely often, the value of performing the action again vanishes, \( \lim _{t\rightarrow \infty }\text{VoI}(\cdot ; \mathscr{R}^m,\mathscr{F}^n)= 0 \). Together, these imply that any action repeated infinitely often eventually becomes a minimiser of the \( \text{VoI}(\cdot) \) function. As BICO always performs the action that maximises \( \text{VoI}(\cdot) \), the value of all actions eventually vanishes. Thirdly, if the value of all actions is zero, this implies that x* is known.

The first proposition shows that \( \text{VoI}(\cdot ; \mathscr{R}^m,\mathscr{F}^n) \) is non-negative, meaning that there is always a benefit in collecting more data. This result follows naturally from Jensen’s inequality.

Proposition 1.

\( \text{VoI}^{t}(\cdot ; \mathscr{R}^m, \mathscr{F}^n) \ge 0 \) for all \( (x,a) \in X \times A \) and all \( s \in \lbrace 1,\dots ,N\rbrace \).

Proof of Proposition 1

The proof for both types of action follows from the tower property and Jensen’s inequality. We first prove the result for simulation data \( \text{VoI}^t((x,a))\ge 0 \). (25) \( \begin{align} \begin{split}\mathbb {E}_{y^{n+1}}\left[ \max _{x}G(x;\mathscr{R}^m, \mathscr{F}^{n+1}) \big | (x,a), \mathscr{F}^n\right] &\ge \max _{x}\mathbb {E}_{y^{n+1}}\left[G(x;\mathscr{R}^m, \mathscr{F}^{n+1}) \big | (x,a), \mathscr{F}^n\right] \end{split} \end{align} \) (26) \( \begin{align} \begin{split} &=\max _{x}\mathbb {E}_{y^{n+1}}[\mathbb {E}_{a}[\mu ^{n+1}(x,a)]] \end{split} \end{align} \) (27) \( \begin{align} \begin{split} &=\max _{x}\mathbb {E}_{a}[\mathbb {E}_{y^{n+1}}[\mu ^{n+1}(x,a)]] \end{split} \end{align} \) (28) \( \begin{align} \begin{split} &=\max _{x}\mathbb {E}_{a}[\mu ^{n}(x,a)] \end{split} \end{align} \) (29) \( \begin{align} \begin{split} &=\max _{x}G(x;\mathscr{F}^n,\mathscr{R}^m) \end{split} \end{align} \) Therefore, \( \text{VoI}^{t}(x,a; \cdot) =\mathbb {E}_{y^{n+1}}\left[ \max _{x}G(x;\mathscr{R}^m, \mathscr{F}^{n+1}) \big | (x,a), \mathscr{F}^n\right] -\max _{x}G(x;\mathscr{F}^n,\mathscr{R}^m)\ge 0 \) and there is always some benefit in measuring simulation data. For external data, we may adopt a similar argument to show \( \text{VoI}^{t}(s)\ge 0 \). 
(30) \( \begin{align} \begin{split}\mathbb {E}_{r^{m+1}}\left[ \max _{x}G(x;\mathscr{R}^{m+1}, \mathscr{F}^n) \big | s, \mathscr{R}^m\right] &\ge \max _{x}\mathbb {E}_{r^{m+1}}\left[G(x;\mathscr{R}^{m+1}, \mathscr{F}^n) \big | s, \mathscr{R}^m\right] \end{split} \end{align} \) (31) \( \begin{align} \begin{split} &=\max _{x}\int _{r^{m+1}}\int _{A}\mu ^{n}({x},{a})\mathbb {P}[{a}|\mathscr{R}^{m+1}]\mathbb {P}[r^{m+1}|\mathscr{R}^m]d{a}dr^{m+1} \end{split} \end{align} \) (32) \( \begin{align} \begin{split} &=\max _{x}\int _{A}\mu ^{n}({x},{a})\int _{r^{m+1}}\mathbb {P}[{a}|\mathscr{R}^{m+1}]\mathbb {P}[r^{m+1}|\mathscr{R}^m]dr^{m+1} d{a}\end{split} \end{align} \) (33) \( \begin{align} \begin{split} &=\max _{x}\int _{A}\mu ^{n}({x},{a})\mathbb {P}[{a}|\mathscr{R}^m]d{a}\end{split} \end{align} \) (34) \( \begin{align} \begin{split} &=\max _{x}G(x;\mathscr{R}^m, \mathscr{F}^n) \end{split} \end{align} \)

For the second part of the proof, we show that if an action is performed infinitely often, the value of performing the action again tends to zero. The information gained by repeating an action decreases, and the action eventually becomes a minimiser of the \( \text{VoI}^t(\cdot) \) function. Again, this is shown for each type of action.

We first present required preliminary results regarding the limit of collecting infinite simulation data (Proposition 2) or infinite external data (Theorem 1).

Proposition 2.

Let \( x,x^{\prime } \in X \); \( a,a^{\prime } \in A \), and \( n\in \mathbb {N} \). The limits of the series \( (\mu ^{n}(x,a)) \) and \( (V^{n}((x,a),(x^{\prime },a^{\prime }))) \) (shown in the following) exist. (35) \( \begin{align} \begin{split} \mu ^{n}(x,a) &= \mathbb {E}_{n}[f(x,a)] \end{split} \end{align} \) (36) \( \begin{align} \begin{split} V^{n}((x,a),(x^{\prime },a^{\prime })) &=\mathbb {E}_{n}[f(x,a)\cdot f(x^{\prime },a^{\prime })] \end{split} \end{align} \) (37) \( \begin{align} \begin{split} &= k^{n}((x,a),(x^{\prime },a^{\prime }))+ \mu ^{n}(x,a)\cdot \mu ^{n}(x^{\prime },a^{\prime }) \end{split} \end{align} \)

Denote their limits by \( \mu ^{\infty }(x,a) \) and \( V^{\infty }((x,a),(x^{\prime },a^{\prime })) \), respectively. (38) \( \begin{align} \begin{split} \lim _{n \rightarrow \infty } \mu ^{n}(x,a)&=\mu ^{\infty }(x,a) \end{split} \end{align} \) (39) \( \begin{align} \begin{split} \lim _{n \rightarrow \infty } V^{n}((x,a),(x^{\prime },a^{\prime }))&=V^{\infty }((x,a),(x^{\prime },a^{\prime })) \end{split} \end{align} \)

If \( (x^{\prime },a^{\prime }) \) is sampled infinitely often, then \( \lim _{n\rightarrow \infty } V^{n}((x,a),(x^{\prime },a^{\prime })) = \mu ^{\infty }(x,a)\cdot \mu ^{\infty }(x^{\prime },a^{\prime }) \) holds almost surely.

Proof of Proposition 2

Cinlar [2011, Proposition 2.8] states that the sequence of conditional expectations of an integrable random variable with respect to an increasing filtration is a uniformly integrable martingale. Thus, both sequences converge almost surely to their respective limits. If \( (x^{\prime }, a^{\prime }) \) is sampled infinitely often, then its posterior variance goes to zero, and \( \mathbb {E}_{n}\big [f(x,a)\cdot f(x^{\prime },a^{\prime }) \big ] \rightarrow \mu ^{\infty }(x,a) \cdot \mu ^{\infty }(x^{\prime },a^{\prime }) \).□

Theorem 1.

If \( {a} \) is defined on a compact set and C is a neighbourhood of a* with nonzero prior probability, then \( \mathbb {P}(a \in C | \mathscr{R}^m) \rightarrow 1 \) as \( m \rightarrow \infty \), where a* is the parameter value that minimises the Kullback-Leibler divergence to the true data-generating distribution.

A proof is given in Appendix B in the work of Gelman et al. [2014] and is omitted here for brevity.

Proposition 3.

Let \( (x^{\prime }, a^{\prime }) \in X \times A \) and suppose that \( f(x^{\prime },a^{\prime }) \) is repeatedly observed, then \( \text{VoI}^{t}((x^{\prime }, a^{\prime })) \rightarrow 0 \) as \( t \rightarrow \infty \). Let \( s \in S \) and suppose that \( \mathbb {P}[r|a^*, s] \) is repeatedly observed, then \( \text{VoI}^{t}(s) \rightarrow 0 \) as \( t \rightarrow \infty \).

Proof of Proposition 3

First, assume \( (x^{\prime }, a^{\prime }) \) is observed infinitely often, and consider the Gaussian process one-step look-ahead update (40) \( \begin{align} \begin{split} \lim _{n \rightarrow \infty }\tilde{\Sigma }^{n}(x;(x^{\prime }, a^{\prime })) &= \lim _{n \rightarrow \infty }\int _{A}\tilde{\sigma }^{n}((x,a);(x^{\prime },a^{\prime }))\mathbb {P}[a|\mathscr{R}^m]da. \end{split} \end{align} \)

Since the integrand is bounded, we may exchange the limit and the integral: (41) \( \begin{align} \begin{split} \lim _{n \rightarrow \infty }\int _{A}\tilde{\sigma }^{n}((x,a);(x^{\prime }, a^{\prime }))\mathbb {P}[a|\mathscr{R}^m]da&= \int _{A}\lim _{n \rightarrow \infty }\tilde{\sigma }^{n}((x,a);(x^{\prime },a^{\prime }))\mathbb {P}[a|\mathscr{R}^m]da, \end{split} \end{align} \) (42) \( \begin{align} \begin{split} &= \int _{A}\lim _{n \rightarrow \infty }\frac{k^{n}((x,a);(x^{\prime },a^{\prime }))}{\sqrt {k^{n}((x^{\prime },a^{\prime });(x^{\prime },a^{\prime })) + \sigma ^{2}_{\epsilon }}}\mathbb {P}[a|\mathscr{R}^m]da, \end{split} \end{align} \) (43) \( \begin{align} \begin{split} &=0. \end{split} \end{align} \)

where the last line follows by noting that \( \lim _{n \rightarrow \infty }k^n((x,a), (x^{\prime }, a^{\prime })) = 0 \) (see Pearce et al. [2019], Lemma 11), and therefore \( \tilde{\Sigma }^{\infty }(x;(x^{\prime },a^{\prime }))=0 \). Thus, we may write (44) \( \begin{align} \begin{split} \lim _{n \rightarrow \infty } \text{VoI}((x,a);\mathscr{F}^n,\mathscr{R}^m) &= \frac{\int _{-\infty }^{\infty }\phi (Z)\max _{x}\lbrace G(x;\mathscr{F}^{\infty }, \mathscr{R}^m)+ \overbrace{\tilde{\Sigma }^{\infty }(x;(x^{\prime }, a^{\prime }))}^{=0}Z\rbrace dZ - \max _{x}\lbrace G(x;\mathscr{F}^{\infty }, \mathscr{R}^m)\rbrace }{c_{f}}, \end{split} \end{align} \) where, since the overbraced term is zero, the integral of \( \phi (Z) \) over Z equates to unity and the two maxima cancel. Therefore, \( \begin{equation*} \lim _{n \rightarrow \infty } \text{VoI}((x,a);\mathscr{F}^n,\mathscr{R}^m)=0. \end{equation*} \)

For external data, we rely on Theorem 1 that states that the parameter distribution \( \mathbb {P}[a|\mathscr{R}^m] \) converges to \( \delta _{a=a^{*}} \) as m increases.

For the case when s is observed infinitely often, as shown in Theorem 1, \( \mathbb {P}[a|\mathscr{R}^m] \rightarrow \delta _{a=a^{*}} \) as \( m \rightarrow \infty \); therefore (45) \( \begin{align} \begin{split} G({x}; \mathscr{R}^{\infty }, \mathscr{F}^n) &= \int _{A}\mu ^{n}({x},{a})\delta _{a=a^{*}}da, \end{split} \end{align} \) (46) \( \begin{align} \begin{split} &= \mu ^{n}({x},{a}^{*}). \end{split} \end{align} \)

Replacing \( G({x}; \mathscr{R}^{\infty }, \mathscr{F}^n) \) in \( \text{VoI}(s;\mathscr{R}^{\infty },\mathscr{F}^n) \) results in (47) \( \begin{align} \begin{split}\lim _{m}\text{VoI}(s; \mathscr{R}^m,\mathscr{F}^n)&= \lim _{m}\mathbb {E}_{r^{m+1}}\left[ \max _{x}G(x;\mathscr{R}^{m+1}, \mathscr{F}^n) \big | s, \mathscr{R}^m\right] - \max _{x}G(x;\mathscr{R}^m, \mathscr{F}^n), \end{split} \end{align} \) (48) \( \begin{align} \begin{split}&= \mathbb {E}_{r^{\infty }}\left[ \max _{x}\mu ^{n}({x},{a}^{*})\big | s, \mathscr{R}^{\infty } \right] - \max _{x}\mu ^{n}({x},{a}^{*}), \end{split} \end{align} \) (49) \( \begin{align} \begin{split}&= \max _{x}\mu ^{n}({x},{a}^{*}) - \max _{x}\mu ^{n}({x},{a}^{*}), \end{split} \end{align} \) (50) \( \begin{align} \begin{split}&= 0. \end{split} \end{align} \)

Proposition 4.

Let \( l_{a}\lt \infty \). If \( \text{VoI}((x,a);\mathscr{R}^m,\mathscr{F}^n)= 0 \) and \( \text{VoI}(s;\mathscr{R}^m,\mathscr{F}^n)= 0 \) for all \( (x,a) \) and s, then \( \text{arg max}_{x \in X} G(x;\mathscr{R}^m,\mathscr{F}^{\infty }) =\text{arg max}_{x \in X}\int _{A}\theta (x,a)\mathbb {P}[a|\mathscr{R}^m]da \) and \( \mathbb {P}[a|\mathscr{R}^m] = \delta _{a=a^{*}} \).

Proof of Proposition 4

Firstly, let us consider the case when \( \text{VoI}((x,a);\mathscr{R}^m,\mathscr{F}^n)= 0 \) for all \( (x,a) \). By Proposition 2, \( \lim _{n \rightarrow \infty } \tilde{k}^{n}((x,a),(x^{\prime },a^{\prime }))=\tilde{k}^{\infty }((x,a),(x^{\prime },a^{\prime })) \) almost surely for all \( x, x^{\prime } \in X \) and \( a,a^{\prime } \in A \). If the posterior variance \( \tilde{k}^{\infty }((x,a),(x,a))=0 \) for all \( (x,a) \in X \times A \), then the global optimiser is known. Now, let us define \( (\hat{x},\hat{a}) \in \hat{X} = \lbrace (x,a) \in X \times A \mid \tilde{k}^{\infty }((x,a),(x,a))\gt 0\rbrace \); then \( \begin{equation*} \tilde{\Sigma }^{\infty }(x;(\hat{x},\hat{a}))=\frac{\int _{A}k^{\infty }((x,a),(\hat{x},\hat{a}))\mathbb {P}[a|\mathscr{R}^m]da}{\sqrt {k^{\infty }((\hat{x},\hat{a}) ,(\hat{x},\hat{a}))+\sigma _{\epsilon }^{2}}}\gt 0. \end{equation*} \)

Let us first assume \( \tilde{\Sigma }^{\infty }(x_{1};(\hat{x},\hat{a}))\ne \tilde{\Sigma }^{\infty }(x_{2};(\hat{x},\hat{a})) \) for some \( x_{1},x_{2} \in X \). Then \( \text{VoI}((x,a);\mathscr{R}^m,\mathscr{F}^{\infty }) \) must be strictly positive, since there exists \( Z_{0} \) such that \( G(x_{1};\mathscr{R}^m,\mathscr{F}^{\infty }) + \tilde{\Sigma }^{\infty }(x_{1};(\hat{x},\hat{a}))Z \gt G(x_{2};\mathscr{R}^m,\mathscr{F}^{\infty }) + \tilde{\Sigma }^{\infty }(x_{2};(\hat{x},\hat{a}))Z \) for \( Z\gt Z_{0} \) and vice versa. Therefore, for \( \text{VoI}((x,a))=0 \) to hold, \( \tilde{\Sigma }^{\infty }(x^{\prime \prime \prime };(\hat{x},\hat{a}))= \tilde{\Sigma }^{\infty }(x^{\prime \prime };(\hat{x},\hat{a})) \) must hold for any \( x^{\prime \prime \prime }, x^{\prime \prime } \in X \), which results in \( \begin{equation*} \frac{\int _{A}k^{\infty }((x^{\prime \prime \prime },a),(\hat{x},\hat{a}))\mathbb {P}[a|\mathscr{R}^m]da}{\sqrt {k^{\infty }((\hat{x},\hat{a}) ,(\hat{x},\hat{a}))+\sigma _{\epsilon }^{2}}} = \frac{\int _{A}k^{\infty }((x^{\prime \prime },a),(\hat{x},\hat{a}))\mathbb {P}[a|\mathscr{R}^m]da}{\sqrt {k^{\infty }((\hat{x},\hat{a}) ,(\hat{x},\hat{a}))+\sigma _{\epsilon }^{2}}} . \end{equation*} \)

Since \( \sigma _{\epsilon }^{2}\gt 0 \), \( \begin{equation*} \int _{A}\big [k^{\infty }((x^{\prime \prime \prime },a),(\hat{x},\hat{a}))-k^{\infty }((x^{\prime \prime },a),(\hat{x},\hat{a}))\big ]\mathbb {P}[a|\mathscr{R}^m]da = 0. \end{equation*} \)

So \( \tilde{\Sigma }^{\infty }(x;(\hat{x},\hat{a})) \) does not change with \( x \in X \). Moreover, integrating with respect to \( \hat{a} \), the resulting kernel \( \tilde{K}(x;\hat{x}) = \int \tilde{\Sigma }^{\infty }(x;(\hat{x},\hat{a})) d\hat{a} \) does not vary with respect to x, and it must be positive semidefinite and symmetric. Therefore, by symmetry, \( \tilde{K}(x;\hat{x}) \) does not change with respect to \( \hat{x} \) either, so the covariance matrix \( \tilde{K}(x;\hat{x}) \) is proportional to an all-ones matrix. Hence the optimiser is known, \( \text{arg max}_{x\in X}G(x) = \text{arg max}_{x\in X}\int _{A}\theta (x,a)\mathbb {P}[a|\mathscr{R}^m]da \), but not necessarily its true value.

Let us now consider the case when \( \text{VoI}(s;\mathscr{R}^m,\mathscr{F}^n)= 0 \) for all s, that is, (51) \( \begin{eqnarray} 0 = \text{VoI}(s;\mathscr{R}^m,\mathscr{F}^n) = \mathbb {E}_{r^{m+1}}\left[ \max _x \int _{a^{\prime }} \mu ^n(x,a^{\prime })\mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}]da^{\prime } \right] - \max _x \int _{a} \mu ^n(x,a)\mathbb {P}[a|\mathscr{R}^{m}]da. \end{eqnarray} \)

Denote the current recommended solution as \( x_r^t = \text{arg max}_{x}\int _{a} \mu ^n(x,a)\mathbb {P}[a|\mathscr{R}^{m}]da \); then \( \text{VoI}^t(s) \) can be rewritten as (52) \( \begin{eqnarray} 0 &=& \mathbb {E}_{r^{m+1}}\left[ \max _x \int _{a^{\prime }} \mu ^n(x,a^{\prime })\mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}]da^{\prime } \right] - \int _{a} \mu ^n(x_r^t,a)\mathbb {P}[a|\mathscr{R}^{m}]da, \end{eqnarray} \) (53) \( \begin{eqnarray} &=& \mathbb {E}_{r^{m+1}}\left[ \max _x \int _{a^{\prime }} \mu ^n(x,a^{\prime })\mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}]da^{\prime } - \int _{a} \mu ^n(x_r^t,a)\mathbb {P}[a|\mathscr{R}^{m+1}]da \right], \end{eqnarray} \) (54) \( \begin{eqnarray} &=& \mathbb {E}_{r^{m+1}}\left[ \max _x \int _{a^{\prime }} \big (\mu ^n(x,a^{\prime })-\mu ^n(x_r^t,a^{\prime })\big)\mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}]da^{\prime } \right]. \end{eqnarray} \) Note that the random variable within the expectation is non-negative for all \( r^{m+1} \). Since the expectation of this non-negative random variable is zero, every realisation of the random variable must be zero; that is, for all \( r^{m+1} \), (55) \( \begin{eqnarray} \max _x \int _{a^{\prime }} \big (\mu ^n(x,a^{\prime })- \mu ^n(x_r^t,a^{\prime })\big)\mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}]da^{\prime } = 0. \end{eqnarray} \) If we denote the maximiser (which is a function of \( r^{m+1} \)) as \( x_r^{t+1}(r^{m+1}) \), the preceding equality may be written as (56) \( \begin{eqnarray} \int _{a^{\prime }} \big (\mu ^n(x_r^{t+1}(r^{m+1}),a^{\prime })- \mu ^n(x_r^t,a^{\prime })\big)\mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}]da^{\prime } = 0. \end{eqnarray} \) This equality holds if (57) \( \begin{eqnarray} \mu ^n(x_r^{t+1}(r^{m+1}),a^{\prime }) = \mu ^n(x_r^t,a^{\prime }), \end{eqnarray} \) or equivalently \( x_r^{t+1}(r^{m+1}) = x^{t}_{r} \). Thus, the new maximiser does not depend on \( r^{m+1} \), and
(58) \( \begin{eqnarray} \max _{x} \int _{a^{\prime }} \mu ^n(x,a^{\prime })\mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}]da^{\prime } &=& \max _{x} \int _{a^{\prime }} \mu ^n(x,a^{\prime })\mathbb {P}[a^{\prime }|\mathscr{R}^{m}]da^{\prime } \end{eqnarray} \) for all \( r^{m+1} \) and for all \( \mu ^n \). The left-hand side thus also does not depend on \( r^{m+1} \), and therefore \( \mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}] \) does not depend on \( r^{m+1} \), so \( \mathbb {P}[a^{\prime }|\mathscr{R}^{m+1}] = \mathbb {P}[a^{\prime }|\mathscr{R}^{m}] \).

Under some regularity conditions, as \( m \rightarrow \infty \), the posterior distribution of a approaches normality with mean \( a^{*} \) and variance \( (mJ(a^{*}))^{-1} \), where \( a^{*} \) is the value that minimises the Kullback-Leibler divergence and J is the Fisher information. Therefore, since the variance is reduced at a rate of \( m^{-1} \), equality of the posterior distributions at m and \( m+1 \) is reached only when the posterior distribution is concentrated around the true parameter, \( \mathbb {P}[a^{\prime }|\mathscr{R}^{m}] = \delta _{a=a^{*}} \), where \( \delta _{a=a^{*}} \) is a Dirac delta function at a = a*.□

Therefore, BICO converges to the true parameter a* and the true optimal solution x* as t increases. However, Section 4.3 in the work of Gelman et al. [2004] describes cases in which the regularity conditions may not hold, including underspecified models, non-identified parameters, a number of parameters growing with the sample size, unbounded likelihood functions, improper posterior distributions, and convergence to a boundary of the parameter space.


A.4 BICO Relevance Determination

In this section, we show that if we use a squared exponential kernel, then the hyper-parameters determine the relevance of parameter data sources. Therefore, non-relevant parameter data sources will not be sampled by BICO.

Remark 1.

Assume a squared exponential kernel, (59) \( \begin{align} \begin{split}k^{0}((x,a),(x^{\prime },a^{\prime }))&=\sigma ^{2}_{f}e^{-\frac{1}{2}\big (\frac{(x-x^{\prime })^{2}}{l_{x}}+\frac{(a-a^{\prime })^{2}}{l_{a}} \big) }, \end{split} \end{align} \) where, without loss of generality, \( a \in A \) is a single parameter and \( x \in X \) a single solution variable. Then, \( \text{VoI}^t(s, \mathscr{R}^m,\mathscr{F}^n) = 0 \) as \( l_{a} \rightarrow \infty \) for any m and n.

Proof of Remark 1

As \( l_a \rightarrow \infty \), the posterior mean \( \mu ^n({x},{a}) \) depends only on the solution x, since (60) \( \begin{align} \begin{split}\lim _{l_a \rightarrow \infty }k^{0}((x,a),(x^{\prime },a^{\prime })) = \sigma ^{2}_{f}e^{-\frac{1}{2}\big (\frac{(x-x^{\prime })^{2}}{l_{x}} \big)} = k^{0}(x;x^{\prime }). \end{split} \end{align} \)

Let us denote \( \tilde{X}_{x}^{n}=\lbrace x^1,\dots ,x^n\rbrace \) and assume \( \mu ^{0}(x,a)=0 \); then it follows from (60) that (61) \( \begin{align} \begin{split}\mu ^n(x,a) &=k^0((x,a),\tilde{X}^n)(k^0(\tilde{X}^n,\tilde{X}^n)+I\sigma)^{-1}Y^n, \end{split} \end{align} \) (62) \( \begin{align} \begin{split} &= k^0(x,\tilde{X}_{x}^n)(k^0(\tilde{X}_{x}^n,\tilde{X}_{x}^n)+I\sigma)^{-1}Y^n, \end{split} \end{align} \) (63) \( \begin{align} \begin{split} &=\mu ^n(x). \end{split} \end{align} \)

Since \( \mu ^{n} \) does not depend on a, \( G(x;\mathscr{R}^m, \mathscr{F}^n)= \mu ^{n}(x) \), and the \( VoI^{t}(\cdot) \) for \( s \in \lbrace 1,\dots ,N\rbrace \) is (64) \( \begin{align} \begin{split}\text{VoI}^t(\cdot) &= \text{VoI}(s; \mathscr{R}^m, \mathscr{F}^n), \end{split} \end{align} \) (65) \( \begin{align} \begin{split}&= \mathbb {E}_{r^{m+1}}\left[ \max _{x}G(x;\mathscr{R}^{m+1}, \mathscr{F}^n) \big | s, \mathscr{R}^m\right] - \max _{x}G(x;\mathscr{R}^m, \mathscr{F}^n), \end{split} \end{align} \) (66) \( \begin{align} \begin{split}&= \mathbb {E}_{r^{m+1}}\left[ \max _{x}\mu ^{n}(x)\big | s, \mathscr{R}^m\right] - \max _{x}\mu ^{n}(x), \end{split} \end{align} \) (67) \( \begin{align} \begin{split}&= \max _{x}\mu ^{n}(x) - \max _{x}\mu ^{n}(x), \end{split} \end{align} \) (68) \( \begin{align} \begin{split}&= 0. \end{split} \end{align} \)

Therefore, external data is never collected if a is not ‘influential’ on the predicted simulation output \( \mu ^{n}(x,a) \).
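The remark can be illustrated numerically. The toy example below (our own construction, not from the paper) computes a GP posterior mean with a per-dimension lengthscale kernel and an effectively infinite \( l_a \); the posterior mean then no longer varies with a, so, as shown above, \( \text{VoI}(s) = 0 \).

```python
import numpy as np

def posterior_mean(query, X, y, ls, noise=1e-2):
    """GP posterior mean with per-dimension lengthscales ls (ARD kernel)."""
    def k(A, B):
        d2 = (((A[:, None, :] - B[None, :, :]) / ls) ** 2).sum(-1)
        return np.exp(-0.5 * d2)
    K = k(X, X) + noise * np.eye(len(X))
    return k(np.atleast_2d(np.asarray(query, float)), X) @ np.linalg.solve(K, y)

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))        # columns: (x, a)
y = np.sin(3 * X[:, 0])              # the output ignores a entirely
ls = np.array([1.0, 1e6])            # l_a effectively infinite
m1 = posterior_mean([0.5, 0.1], X, y, ls)
m2 = posterior_mean([0.5, 0.9], X, y, ls)
# m1 and m2 agree: with l_a -> infinity, mu^n depends on x only,
# so the external source for a would never be queried
```

In practice, the hyper-parameters are fitted by marginal likelihood, so a data source for an irrelevant parameter is screened out automatically once the fitted \( l_a \) becomes large.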

REFERENCES

  1. Ankenman B., Nelson B. L., and Staum J. 2008. Stochastic kriging for simulation metamodeling. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 362–370.
  2. Barton R. and Meckesheimer M. 2006. Metamodel-based simulation optimization. In Simulation, S. G. Henderson and B. L. Nelson (Eds.). Handbooks in Operations Research and Management Science, Vol. 13. Elsevier, 535–574.
  3. Barton R. and Schruben L. 2001. Resampling methods for input modeling. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 372–378.
  4. Barton R. R., Nelson B. L., and Xie W. 2014. Quantifying input uncertainty via simulation confidence intervals. INFORMS Journal on Computing 26, 1 (2014), 74–87.
  5. Cheng R. C. H. and Holland W. 1997. Sensitivity of computer simulation experiments to errors in input data. Journal of Statistical Computation and Simulation 57, 1–4 (1997), 219–241.
  6. Chick S. E. 2001. Input distribution selection for simulation experiments: Accounting for input uncertainty. Operations Research 49, 5 (2001), 744–758.
  7. Cinlar E. 2011. Probability and Stochastics. Graduate Texts in Mathematics, Vol. 261. Springer.
  8. Corlu C. G., Akcay A., and Xie W. 2020. Stochastic simulation under input uncertainty: A review. Operations Research Perspectives 7 (2020), 100162.
  9. Durrande N., Ginsbourger D., and Roustant O. 2012. Additive covariance kernels for high-dimensional Gaussian process modeling. Annales de la Faculté des Sciences de Toulouse: Mathématiques Ser. 6, 21, 3 (2012), 481–499.
  10. Frazier P., Powell W., and Dayanik S. 2009. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing 21, 4 (2009), 599–613.
  11. Freimer M. and Schruben L. 2002. Simulation input analysis: Collecting data and estimating parameters for input distributions. In Proceedings of the Winter Simulation Conference. 393–399.
  12. Gelman A., Carlin J., Stern H., Dunson D., Vehtari A., and Rubin D. 2014. Bayesian Data Analysis. CRC Press, Boca Raton, FL.
  13. Gelman A., Carlin J. B., Stern H. S., and Rubin D. B. 2004. Bayesian Data Analysis (2nd ed.). Chapman & Hall/CRC.
  14. Jones D. R., Schonlau M., and Welch W. J. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, 4 (1998), 455–492.
  15. Lam H., Roeder T. M. K., Frazier P. I., Szechtman R., Zhou E., Huschka T., and Chick S. E. 2016. Advanced tutorial: Input uncertainty and robust analysis in stochastic simulation. In Proceedings of the Winter Simulation Conference. 178–192.
  16. Le H. and Branke J. 2020. Bayesian optimization searching for robust solutions. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 362–370.
  17. Liu T. and Zhou E. 2020. Online quantification of input model uncertainty by two-layer importance sampling. arXiv:1912.11172 (2020).
  18. Ng S. H. and Chick S. E. 2006. Reducing parameter uncertainty for stochastic systems. ACM Transactions on Modeling and Computer Simulation 16, 1 (Jan. 2006), 26–51.
  19. Pasupathy R. and Henderson S. G. 2011. Simulation Optimization Library. Retrieved February 22, 2022 from http://www.simopt.org.
  20. Pearce M. and Branke J. 2017. Bayesian simulation optimization with input uncertainty. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 2268–2278.
  21. Pearce M., Poloczek M., and Branke J. 2019. Bayesian optimization allowing for common random numbers. arXiv:1910.09259 (2019).
  22. Rasmussen C. E. and Williams C. K. I. 2006. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
  23. Scott W., Frazier P., and Powell W. 2011. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization 21, 3 (2011), 996–1026.
  24. Shahriari B., Swersky K., Wang Z., Adams R. P., and de Freitas N. 2016. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104, 1 (2016), 148–175.
  25. Song E. and Nelson B. L. 2015. Quickly assessing contributions to input uncertainty. IIE Transactions 47 (2015), 893–909.
  26. Song E., Nelson B. L., and Hong L. J. 2015. Input uncertainty and indifference-zone ranking & selection. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 414–424.
  27. Song E. and Shanbhag U. V. 2019. Stochastic approximation for simulation optimization under input uncertainty with streaming data. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 3597–3608.
  28. Staum J. 2009. Better simulation metamodeling: The why, what, and how of stochastic kriging. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 119–133.
  29. Toscano-Palmerin S. and Frazier P. 2018. Bayesian optimization with expensive integrands. arXiv:1803.08661 [cs.LG] (2018).
  30. Vanmarcke E. 2010. Random Fields. World Scientific.
  31. Wang H., Yuan J., and Ng S. H. 2018. Informational approach to global optimization with input uncertainty for homoscedastic stochastic simulation. In Proceedings of the International Conference on Industrial Engineering and Engineering Management. IEEE, Los Alamitos, CA, 1396–1400.
  32. Wu D. and Zhou E. 2017. Ranking and selection under input uncertainty: A budget allocation formulation. In Proceedings of the Winter Simulation Conference. Article 179, 12 pages.
  33. Wu D. and Zhou E. 2019. Fixed confidence ranking and selection under input uncertainty. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 3717–3727.
  34. Xiao H. and Gao S. 2018. Simulation budget allocation for selecting the top-m designs with input uncertainty. IEEE Transactions on Automatic Control 63, 9 (2018), 3127–3134.
  35. Yi Y. and Xie W. 2017. An efficient budget allocation approach for quantifying the impact of input uncertainty in stochastic simulation. ACM Transactions on Modeling and Computer Simulation 27, 4 (2017), Article 25, 23 pages.
  36. Yin J., Ng S. H., and Ng K. M. 2011. Kriging metamodel with modified nugget-effect: The heteroscedastic variance case. Computers and Industrial Engineering 61, 3 (2011), 760–777.
  37. Yuan J. and Ng S. H. 2013. A sequential approach for stochastic computer model calibration and prediction. Reliability Engineering and System Safety 111 (2013), 273–286.
  38. Yuan J. and Ng S. H. 2020. An integrated method for simultaneous calibration and parameter selection in computer models. ACM Transactions on Modeling and Computer Simulation 30, 1 (2020), Article 7, 23 pages.
  39. Zhou E. and Xie W. 2015. Simulation optimization when facing input uncertainty. In Proceedings of the Winter Simulation Conference. IEEE, Los Alamitos, CA, 3714–3724.


Published in ACM Transactions on Modeling and Computer Simulation, Volume 32, Issue 3 (July 2022), 119 pages. ISSN: 1049-3301, EISSN: 1558-1195, Issue DOI: 10.1145/3514182.

Copyright © 2022 held by the owner/author(s). This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication history: Received 1 May 2020; revised 1 December 2021; accepted 1 January 2022; Online AM 9 February 2022; published 25 July 2022.
