1 Introduction

Bayesian Optimization (BO) [1,2,3] optimizes an expensive-to-evaluate black-box function where derivative information cannot be obtained. BO uses a surrogate model, e.g. Gaussian Processes (GP) [4], to learn the response surface and designs an acquisition function to trade-off exploration vs. exploitation. BO maximizes the acquisition function at each iteration to locate the next sample. Classic BO studies unconstrained problems, but practical problems may involve constraints, e.g., a machine learning model may yield poor results for certain hyper-parameter combinations [5]. In chemistry, ingredient concentrations and environmental variables should lie inside safe intervals and/or satisfy extra constraints [6,7,8]. If these constraints are known a priori, they can be handled in the acquisition function [9,10,11,12]. Otherwise, constrained BO (cBO) can handle black-box constraints.

A popular treatment for cBO is using surrogate models to learn constraints and constructing an acquisition function that could lead to feasible region(s). Constrained Expected Improvement (cEI) [13], a widely-used acquisition function, multiplies Expected Improvement (EI) [14] and the Probability of Feasibility (PoF), i.e., the probability of a point satisfying all constraints. Variants of cEI may use alternatives to EI and/or different ways to calculate PoF. For example, a classic way to compute PoF is training a classifier [11, 15,16,17,18,19,20]. Training this classifier allows one to handle multiple constraints with binary or continuous outputs since the only information needed to train a classification model is whether a point is feasible or not. Alternatively, Boukouvala and Ierapetritou [9] applied a regression model to predict their proposed feasibility function. These type of approaches do not require prior knowledge and is flexible to types of constraints. As the number and complexity of constraints increases, however, a single model is insufficient to learn the feasible region(s), which could be multiple-piece, non-convex, and very small. One solution is equipping each constraint with a surrogate [5, 21,22,23,24,25]. Some papers propose acquisition functions beyond cEI [8, 22, 24]. By incorporating constraints into the objective, some works transform constrained problems to unconstrained ones [23, 25, 26].

Table 1 Different methods in cBO

Table 1 compares the aforementioned methods: note that combining an acquisition function from BO with PoF is common for cBO. For multiple models, most research assumes independence, which brings simplicity but loses correlations between constraints. The first cEI paper justified the independence assumption as the covariance between predictions for pairs of responses are not so tractable [13]. Other papers also mention dependence [8, 21, 25, 26]. Dependence certainly has merit, e.g. Multi-task BO [31] gains information about expensive tasks from cheap tasks by capturing the correlations between tasks. When Gelbart et al. [5] considered decoupled cases, in which scenario the objective and multiple constraints could be evaluated independently, they regarded all these functions as tasks and solved it using Multi-task BO. Our setting is different than Gelbart et al. [5] in two ways: we remove their independence assumption and assume that the objective and all the constraints are evaluated simultaneously. We ask: What happens if we do not assume independence between multiple surrogate models?

To remove the widely-used independence assumption in cBO, we implement Multiple Output Gaussian Processes (MOGPs) [32] to capture the correlations between constraints, and use Expectation Propagation for Multivariate Gaussian Probabilities (EPMGP) [33] to calculate PoF. Our modified, dependent PoF, or ’Dep-PoF’, replaces the commonly-used PoF in any acquisition function without disturbing other parts in classical cBO. To verify its effectiveness, we substitute Dep-PoF into two acquisition functions and test them on a variety of case studies.


Paper structure. Section 2 introduces the background. Section 3 defines Dep-PoF and the acquisition functions and then provides convergence analysis of Dep-PoF and complexity analysis. Section 4 shows the experimental results. We further discuss Dep-PoF in Sect. 5 and conclude in Sect. 6.


Supplementary materials. Open Resource 1 gives implementation details for all methods and models involved in this paper. Open Resource 2 includes more experimental results for model consideration. Open Resource 3 provides detailed information for all benchmarks used in this paper.

2 Background

2.1 Multiple output Gaussian processes (MOGPs)

A Gaussian Process (GP) generalizes the Gaussian probability distribution to infinite variables, any finite subset of which owns a joint Gaussian distribution [4]. Mathematically, a GP is a distribution \(f(\cdot )\sim {\mathcal {G}}{\mathcal {P}} (\mu (\cdot ), k(\cdot ,\cdot '))\) over a scalar function \(f:{\mathcal {X}}\rightarrow {\mathbb {R}}\) on input domain \({\mathcal {X}}\). All properties of a GP is fully specified by its mean function \(\mu (\cdot )\) and covariance function, or kernel \(k(\cdot ,\cdot ')\). Conditioned on an observed dataset \(\{(\varvec{x_i},f(\varvec{x_i}))\}_{i=1}^n\), a GP computes the predictive mean and covariance for any finite number of query points in closed form. GPs are powerful tools for scalar functions. For a multi-output, or vector-valued function, a naive and common-used treatment is using equal number of GPs to separately model each output. The correlations between related outputs are thereby ignored. Multiple Output Gaussian Processes (MOGPs) [32] can represent the output correlations.

Consider a multi-output function \(\varvec{f}(\cdot ):{\mathcal {X}}\rightarrow {\mathbb {R}}^p\) with input domain \({\mathcal {X}}={\mathbb {R}}^d\). A prior over this function is a MOGP if the distribution of any vector with the form \(\{f_{j_i}({\varvec{x}}_i)\}_{i=1}^n\in {\mathbb {R}}^n\) is Gaussian distributed, where \(f_{j_i}({\varvec{x}}_i)\) denotes the \(j_i\)th output at the ith input \({\varvec{x}}_i\). Like a GP, a MOGP is determined by its mean function and kernel (W.l.o.g., assume zero mean functions). For homotopic data, where we could observe all output dimensions for each input, MOGP kernels can be viewed as matrix-valued [34, 35]:

$$\begin{aligned} k: {\mathcal {X}}\times {\mathcal {X}}&\rightarrow \quad {\mathbb {R}}^{p\times p}\\ ({\varvec{x}},{\varvec{x}}')&\mapsto k({\varvec{x}},{\varvec{x}}') ={\mathbb {C}}_{{\varvec{f}}(\cdot )}[{\varvec{f}}({\varvec{x}}),{\varvec{f}}({\varvec{x}}')]. \end{aligned}$$

For heterotopic data from where only a subset of outputs is observed each time, a common treatment is including the indexes of outputs in input space, i.e., ’output as input’ [32, 35, 36]. In this paper, we consider homotopic data and inherit relevant notations of MOGP from [35] with corresponding implementation in GPflow [37].

2.2 Expectation propagation for multivariate Gaussian probabilities (EPMGP)

Gaussian cumulative probabilities are difficult to compute for dimensions greater than one. Numerical integration can achieve an accurate approximation but is computationally expensive [38]. Instead, we use Expectation Propagation for Multivariate Gaussian Probabilities (EPMGP) [33] to calculate PoF. Given a mean vector \({\varvec{m}}\in {\mathbb {R}}^n\) and a covariance matrix \({\varvec{K}}\in {\mathbb {R}}^{n\times n}\), a multivariate Gaussian distribution \(p({\varvec{x}})={\mathcal {N}}({\varvec{x}};{\varvec{m}},{\varvec{K}}), {\varvec{x}}\in {\mathbb {R}}^n\) is defined as:

$$\begin{aligned} p({\varvec{x}})=\frac{1}{(2\pi )^{\frac{n}{2}}|{\varvec{K}}|^{\frac{1}{2}}} \exp \left\{ -\frac{1}{2}({\varvec{x}}-{\varvec{m}})^T{\varvec{K}}^{-1} ({\varvec{x}}-{\varvec{m}})\right\} . \end{aligned}$$

We are interested in the probability that a sample from \(p({\varvec{x}})\) falls in a region \({\mathcal {A}}\subset {\mathbb {R}}^n\), that is:

$$\begin{aligned} F({\mathcal {A}})=\int _{{\mathcal {A}}}p({\varvec{x}})\textrm{d}{\varvec{x}} =\int _{l_1({\varvec{x}})}^{u_1({\varvec{x}})}\cdots \int _{l_n({\varvec{x}})}^{u_n({\varvec{x}})}p({\varvec{x}})\textrm{d}{\varvec{x}_n} \cdots \textrm{d}{\varvec{x}_1}, \end{aligned}$$

where \(l_i\) and \(u_i\) denote the lower and upper bounds of \({\mathcal {A}}\). Cunningham et al. [33] analytically approximate \(F({\mathcal {A}})\) using Expectation Propagation [39]. Their EPMGP algorithm approximates accurately for hyperrectangular \({\mathcal {A}}\), which admits its application to most common cases including computing cumulative density function (cdf).

2.3 The acquisition function

2.3.1 Constrained expected improvement (cEI)

cEI, which incorporates constraint uncertainty into EI [14], holds many names, e.g., Integrated Expected Conditional Improvement [15], Constraint-Weighted EI [5], and Expected Constrained Improvement [21]. cEI is defined by:

$$\begin{aligned} cEI({\varvec{x}})=EI({\varvec{x}})PoF({\varvec{x}}), \end{aligned}$$

where \(PoF({\varvec{x}})\) is the probability of \({\varvec{x}}\) satisfying all constraints, and the target of EI is the optimal objective value for observed feasible points so far. Note cEI is not well-defined if all current samples are infeasible. In this case, the EI term is not considered until a feasible point is sampled. Gelbart [40] fully analyzes cEI.

2.3.2 Constrained adaptive sampling (cAS)

Acquisition functions using PoF commonly have a product form, about which we concern: (i) the probability term and improvement term directly influence each other, and (ii) the probability term has little exploration nature. Therefore, we consider constrained Adaptive Sampling (cAS) [8] with a sum form and compare its performance with cEI. cAS is a weighted sum of three terms [8]: (i) a border-find term, (ii) a PoF term for a single constraint (the authors do not consider multiple constraints), (iii) an exploration term. Open Resource 1 describes cAS.

Remark

There are many state-of-the-art acquisition functions, for example those in Table 1. Our idea is to compare PoF versus Dep-PoF, so we chose acquisition functions cEI and cAS because they directly incorporate PoF. Most acquisition functions, including IECI [15], EV [22], and PESC [24], incorporate the assumption of independence between the constraints. But the independence assumption in other methods is used in more sophisticated ways than cEI and cAS, for example approximating the acquisition function in PESC and integrating in IECI and EV. Initially, we considered a comparison incorporating additional acquisition functions. However, Eriksson and Poloczek [41] have already provided a detailed comparison between state-of-the-art methods including cEI and PESC. Our cEI approach may be compared, for instance, to the cEI line in [41].

3 Method

3.1 The model

We optimize a black-box objective with multiple (\(p>1\)) black-box constraints:

$$\begin{aligned} \min \limits _{{\varvec{x}}\in \Omega }\;f({\varvec{x}}) \quad \mathrm{s.t.}\;c_i({\varvec{x}})\le 0, 1\le i\le p, \end{aligned}$$

over a compact set \(\Omega \subset {\mathbb {R}}^d\) where \(f,c_i: \Omega \rightarrow {\mathbb {R}},1\le i\le p\). For any input \({\varvec{x}}\in \Omega\), we can evaluate the objective and constraints with/without noise, but cannot get gradient information. Each black-box function evaluation is expensive. Under the independence assumption, we observe an i.i.d. \((p+1)\)-dimensional vector \((y_0({\varvec{x}}),y_1({\varvec{x}}),\dots ,y_p({\varvec{x}}))\) given by \(y_0(x)\sim {\mathcal {N}}(f({\varvec{x}}), \sigma _0^2({\varvec{x}}))\) and \(y_i({\varvec{x}})\sim {\mathcal {N}}(c_i({\varvec{x}}),\sigma _i^2({\varvec{x}}))\), \(i \in \{1, \ldots , p \}\). Without assuming independence between constraints, the observed vector is \(y_0(x)\sim {\mathcal {N}}(f({\varvec{x}}), \sigma _0^2({\varvec{x}}))\) and \((y_1({\varvec{x}}),\dots ,y_p({\varvec{x}}))\sim {\mathcal {N}}({\varvec{m}}({\varvec{x}}),{\varvec{K}}({\varvec{x}}))\) with \({\varvec{m}}({\varvec{x}})=(c_1({\varvec{x}}),\dots ,c_p({\varvec{x}}))\). The diagonal elements in \({\varvec{K}}({\varvec{x}})\) are \(\sigma _1^2({\varvec{x}}),\dots ,\sigma _p^2({\varvec{x}})\), so we ask: When are off-diagonal \({\varvec{K}}({\varvec{x}})\) elements important?

In practice, the input domain \(\Omega\) is rescaled to \([0,1]^d\). Under independence assumption, we train \(p+1\) GPs for the \(p+1\) black-boxes. Otherwise, we train one GP for objective and one MOGP for constraints. For noise-free cases, we tend to find a feasible point with minimal objective value. For noisy cases, one can relax feasible region(s) with tolerance \(\epsilon _i\) and confidence \(\delta _i,1\le i\le p\) [5, 24]:

$$\begin{aligned} \{{\varvec{x}}\in \Omega ~|~P(y_1({\varvec{x}})\le \epsilon _1) \ge 1-\delta _1,\dots ,P(y_p({\varvec{x}})\le \epsilon _p)\ge 1-\delta _p\}. \end{aligned}$$
(1)

3.2 Probability of feasibility with dependence (Dep-PoF)

By independence/dependence in the paper we refer to the prior distributions \(y_i\) over constraints \(c_i\) instead of constraints themselves. Before discussing PoF with/without dependence, we clarify the concept of PoF. If we could easily evaluate the constraints at \({\varvec{x}}\), PoF reduces to a binary value:

$$\begin{aligned} PoF({\varvec{x}})=I(c_1({\varvec{x}})\le 0, \cdots , c_p({\varvec{x}})\le 0). \end{aligned}$$

Here, there is no need to consider dependence or non-trivial probability as:

$$\begin{aligned} PoF({\varvec{x}})=I(c_1({\varvec{x}})\le 0)\cdots I(c_p({\varvec{x}})\le 0), \end{aligned}$$

where \(I(\cdot )\) is indicator function. However, since \(c_i\) are black-boxes, we cannot afford checking the feasibility for any queried input. PoF cheaply approximates the constraint feasibility. Mathematically, Dep-PoF is defined:

$$\begin{aligned} PoF_{\textrm{dep}}({\varvec{x}})=P(y_1({\varvec{x}}) \le 0,\dots , y_p({\varvec{x}})\le 0). \end{aligned}$$

With the independence assumption, PoF is the product of univariate Gaussian distributions’ cdf:

$$\begin{aligned} PoF({\varvec{x}})=\prod \limits _{i=1}^pP(y_i({\varvec{x}})\le 0). \end{aligned}$$

Otherwise, calculating Dep-PoF involves the cdf for the multivariate Gaussian distribution \((y_1({\varvec{x}}),\dots ,y_p({\varvec{x}}))\sim {\mathcal {N}}({\varvec{m}}({\varvec{x}}),{\varvec{K}}({\varvec{x}}))\), which is complicated. This partly explains why most works assume independence. EPMGP (see Sect. 2.2) approximates Dep-PoF (\(l_i=-\infty ,u_i=0\)) with an acceptable additional computational cost, and allows us to compute Dep-PoF and test whether dependence helps.

3.3 Constrained acquisition function

3.3.1 Constrained expected improvement with dependence (Dep-cEI)

Implementing Dep-PoF for cEI is straightforward. Replacing the PoF term in cEI with Dep-PoF, we define Dep-cEI as:

$$\begin{aligned} {cEI}_{\textrm{dep}}({\varvec{x}})=EI({\varvec{x}}){PoF}_{\textrm{dep}}({\varvec{x}}), \end{aligned}$$

where \(EI({\varvec{x}})\) is calculated by the predicted mean and variance from the trained GP model, \(PoF_{\textrm{dep}}({\varvec{x}})\) is computed by using EPMGP algorithm given the predicted mean and covariance matrix from the trained MOGP model.

3.3.2 Constrained adaptive sampling with dependence (Dep-cAS)

This section implements cAS to our problem setting. To optimize our continuous objective function, we replace the Ludl et al. [8] border-find term with the optimization term in Adaptive Sampling (AS) strategy proposed by [42]. We still call our modified version ’cAS’, which consists of: (i) optimization, (ii) constraint, and (iii) exploration terms. Dep-cAS replaces the PoF in constraint term with Dep-PoF. Open Resource 1 gives details of cAS and Dep-cAS.

Note that Dep-cAS adds extra parameters, which is not preferred in machine learning even with the discussion of selecting these parameters [8] and the flexibility to different focus brought by weight parameters. The reason this paper introduces cAS and modifies it into Dep-cAS is neither to recommend nor criticize this acquisition function. By considering Dep-PoF in both cEI and cAS, we want to check if dependence helps in different settings from different fields.

3.4 Convergence analysis of Dep-PoF

The section is motivated by experimental results, where we found Dep-PoF and PoF share similar performance even in cases that the covariance matrix \({\varvec{K}}({\varvec{x}})\) has non-zero non-diagonal elements (which means that dependence exists). The following analysis only involves constraints and corresponding surrogate models as they are the only differences between Dep-PoF and PoF.

Intuitively, when we have sufficient samples and suitable hyper-parameters, GPs/MOGP well-represent the underlying functions. Mathematically, the mean \(y_i({\varvec{x}})\) converges to true function \(c_i({\varvec{x}})\) and the covariance (\(\sigma _i^2({\varvec{x}})\) for GPs, \({\varvec{K}}({\varvec{x}})\) for MOGP) converges to 0 as the number of samples \(n\rightarrow \infty\). Teckentrup [43] proved the convergence of GP with estimated hyper-parameters and provided explicit bounds for mean and covariance.

We analyze the performance of PoF and Dep-PoF in an ideal setting: the predictive mean functions match our constraints, i.e. \({\varvec{m}}({\varvec{x}})=(c_1({\varvec{x}}),\) \(\dots , c_p({\varvec{x}}))\). In this way, we can exclude the influence of mean and focus on the covariance matrix, which is the key difference between PoF and Dep-PoF. Numerically, we report the predictive accuracy for both models in Open Resource 2 to support this setting. Under this setting, Proposition 1 analyzes the convergence of Dep-PoF to PoF on non-boundary domain. Proposition 2 shows a ’decay’ behaviour of PoF on the boundary.

Proposition 1

(Non-boundary domain) For any non-boundary point \(\varvec{x}\), we have \(PoF_\mathrm{{dep}}({\varvec{x}})\rightarrow PoF({\varvec{x}}), \text {as }n\rightarrow \infty\).

Proof

For a non-boundary feasible point \(\varvec{x}\), we have \(c_i({\varvec{x}})<0\) for \(1\le i\le p\). Using the continuity of the cdf function of multivariate Gaussian distribution, for any \(\delta >0\), there exists \(\epsilon >0\) so that if \(\Vert {\varvec{K}}({\varvec{x}})\Vert _2<\epsilon\),

$$\begin{aligned} PoF_{\textrm{dep}}({\varvec{x}})=P(y_1(\varvec{x})\le 0, \dots , y_p({\varvec{x}})\le 0)>1-\delta . \end{aligned}$$

Recall that for any \({\varvec{x}}\in \Omega\), we have \(\Vert {\varvec{K}}({\varvec{x}})\Vert _2\rightarrow 0, \text {as }n\rightarrow \infty\), which means that there exists \(N_\epsilon \in {\mathbb {N}}\) such that \(\Vert {\varvec{K}}({\varvec{x}})\Vert _2<\epsilon\) as long as \(n>N_\epsilon\), \(\forall {\varvec{x}}\in \Omega\).

Since \(P(y_i({\varvec{x}})\le 0)\ge PoF_{\textrm{dep}}({\varvec{x}}),1\le i\le p\), then \(PoF({\varvec{x}})>(1-\delta )^p\). Both \(PoF({\varvec{x}})\) and \(PoF_{\textrm{dep}}({\varvec{x}})\) are bounded by \((1-\delta )^p\) (since \(1-\delta \ge (1-\delta )^p\)) and 1. Therefore, we have

$$\begin{aligned} |PoF_{\textrm{dep}}({\varvec{x}})-PoF({\varvec{x}})|< 1-(1-\delta )^p\rightarrow 0, \text {as } \delta \rightarrow 0. \end{aligned}$$

Same conclusion also holds for a non-boundary infeasible point \(\varvec{x}\). \(\square\)

Proposition 2

(Boundary) For any boundary point \(\varvec{x}\), denote its number of active constraints is q, then we have \(PoF({\varvec{x}})\rightarrow \frac{1}{2^q}, \text {as } n\rightarrow \infty .\)

Proof

For a point \({\varvec{x}}\) on the boundary, without loss of generality, assume that the first q constraints are active, that is, \(c_i({\varvec{x}})=0\) for \(1\le i\le q\) and \(c_j({\varvec{x}})<0\) for \(q<j\le p\). Following the proof of Proposition 1, we can show that

$$\begin{aligned} P(y_{q+1}({\varvec{x}})\le 0,\dots , y_{p}({\varvec{x}})\le 0)\rightarrow 1, \text {as }n\rightarrow \infty . \end{aligned}$$

Since \({\varvec{m}}({\varvec{x}})=(c_1({\varvec{x}})\dots , c_p({\varvec{x}}))\) and \(c_i({\varvec{x}})=0\) for \(1\le i\le q\), we obtain that \(P(y_i({\varvec{x}})\le 0)= \frac{1}{2}\) for \(1\le i\le q\). Thus, as \(n\rightarrow \infty\), we have

$$\begin{aligned} PoF({\varvec{x}})=\prod \limits _{i=1}^qP(y_i({\varvec{x}})\le 0)\cdot P(y_{q+1}({\varvec{x}})\le 0,\dots , y_{p}({\varvec{x}})\le 0)\rightarrow \frac{1}{2^q}. \end{aligned}$$
(2)

\(\square\)

The ’decay’ shown in (2) slows down the optimization process if the global optimum is a boundary point with multiple active constraints. A way to handle it is relaxing feasible regions (even for noise-free cases) as mentioned in (1), which turns the boundary optimum into an interior one. However, this relxation requires more hyper-parameters and possibly finds an infeasible ’optimum’.

Dep-PoF also suffers from the decay of probability near boundary. When the dependence between constraints is slight, its decay is closed to (2). However, for some cases, dependence could relieve the decay and then speed up the process to find a boundary optimum. For instance, in Sect. 4.1 we construct a case with boundary optimum and repeated constraints. In such setting, Dep-PoF outperforms PoF since Dep-PoF is ideally equivalent to PoF for one constraint. Although this case seems to overestimate the importance of dependence, we regard it as an upper bound for the improvement that Dep-PoF may bring. Practically, the correlations between unknown constraints are also unknown, so as the location of optimum. If Dep-PoF shares similar performance to PoF for general cases but outperforms PoF for some special cases, then it still has merit.

3.5 Time complexity analysis

From a modeling perspective, the time complexity for training p independent GPs is \(O(pn^3)\), while for training a MOGP is \(O(p^3n^3)\), where n is the number of samples and p is the number of constraints. The complexity for MOGP could be reduced through approximated inference, e.g., by introducing (latent) inducing points [35]. This paper implements exact inference for both choices of model to eliminate their possibility of influencing the comparison between Dep-PoF and PoF. From a probability perspective, calculating PoF is O(p) as computing the cdf for univariate Normal distribution is O(1), while calculating Dep-PoF using EPMGP [33] requires \(O(p^3+mp^2)\), where m is the number of iterations within EPMGP. Obviously, the time complexity of EPMGP is incomparable with univariate case. For moderate number of constraints, nevertheless, EPMGP is efficient.

4 Experimental results

To consider the performance of Dep-PoF, we compare cEI, Dep-cEI, cAS, Dep-cAS, and Random Search for all benchmarks. All benchmarks are observed without noise. All methods begin with 10 initial points given by a Latin hypercube design and have a budget of 100 evaluations. The plots also show the optimal objective value for each benchmark. For any infeasible point, set its value to the maximal objective value among all evaluated feasible points. All experimental results are realized by using GPflow [37] and Scipy [44], and we report the mean with one standard error over 50 replications. Open Resource 1 provides full implementation details and justifies using GPflow. In this section, d means the input dimension, and p means the number of constraints.

Fig. 1
figure 1

Results for 2D Gardner function with repeated constraints, i.e. \(p = \{2, \dots , 5\}\)

4.1 A toy example with repeated constraints

Although Sect. 3.2 discusses dependence between constraints, we do not know how to measure it for general cases. Therefore, to see the difference between Dep-PoF and PoF, we construct an extreme case based on the 2D Gardner [21] problem with 1 black-box constraint (see Open Resource 3 for details), where we repeat this single constraint several times. 2D Gardner problem itself is challenging due to two small feasible regions and two boundary optimum (one of which is a global optimum). After repeating its constraint p times, denote \(p_0({\varvec{x}})=P(c_1({\varvec{x}})\le 0)\), then \(PoF_{\textrm{dep}}({\varvec{x}})=p_0({\varvec{x}})\) while PoF becomes \(PoF({\varvec{x}})=p_0^p({\varvec{x}})\). Intuitively, as p increases, more samples are needed to increase the confidence for \({\varvec{x}}\) being feasible. Experimental results in Fig. 1 verify this intuition, from which we notice:

Dep-PoF helps exploiting feasible regions. When the samples are scarce, Dep-PoF improves the objective earlier since it calculates the true probabilities which admit it samples points more confidently. For PoF, however, since the probabilities decay near the boundary, it needs more samples to gain more information about feasible regions.

Dep-PoF performs better when p increases. Ideally, Dep-PoF for different p should be identical in this example. Their different performances seem to be counterintuitive. However, note that these (repeated) constraints will share information during training MOGP, which may explain better performance for larger p. This explanation is also supported by our experimental results in Open Resource 2.

4.2 Synthetic benchmarks

This section tests synthetic benchmarks from literature. Table 2 introduces these benchmarks and Fig. 2 diagrams part of our experimental results. Open Resource 3 gives full details about these problems and results for 2D benchmarks.

Fig. 2
figure 2

Results for synthetic benchmarks

As shown in Fig. 2, cEI almost always outperforms cAS no matter what kind of PoF is used. This is not surprising since cAS only cares about immediate improvement based on samples with currently optimal values. Then we consider the impact of considering dependence between constraints (i.e., comparison between PoF and Dep-PoF). Among these benchmarks, Dep-PoF outperforms PoF in G7 for both acquisition functions and in G10 for cAS, probably because that both G7 and G10 have very small feasible region(s) (When computing G7’s feasible ratio in Table 2, we only sample 1 feasible point in \(10^6\) uniform samples!).

In the benchmarks other than G7 and G10 problem, the performances of Dep-PoF and PoF are hard to differentiate. We identify several possible explanations: (i) as shown in Table 2, Gramacy, Sasena, and G4 problems have large feasible region(s). BO probably focuses on improving the objective instead of finding feasible solutions, (ii) as shown in Open Resource 2, some problems such as G8 and G9 problems have slight correlations between constraints, (iii) the strong correlations between constraints are negative, for example, G4 and G6 problems. If two constraints are negatively correlated, a larger probability of satisfying one constraint implies a smaller probability of satisfying another one. In this case, considering dependence can barely help find the feasible domain(s).

Table 2 Benchmarks. Recall that d is the input dimension and p is the number of constraints. For each problem, we also give the type of optimal solution and the active set that consists of the indexes of active constraints at optimum. To roughly show the size of feasible region(s), we uniformly sample \(10^6\) points and check their feasibility

4.3 Real-world applications

Practically, we compare PoF and Dep-PoF on four real-world scenarios:

3D tension-compression string design [41, 47, 50, 51]: minimize the weight of a tension or compression spring under 4 mechanical constraints.

4D pressure vessel design [41, 47, 50, 51]: minimize the cost of designing a cylindrical vessel subject to 4 constraints.

4D welded beam design [41, 47, 50, 51]: minimize the cost of design a welded beam satisfying 5 constraints.

7D speed reducer [41, 50, 51]: minimize the weight of a speed reducer constrained by 11 mechanical constraints.

Open Resource 3 provides more details about these applications. Figure 3 reports numerical results. Although Dep-PoF stills shares similar performance as PoF, both of them perform well in these practical problems with real constraints instead of artificial ones. It is noteworthy that these applications are also considered in [41], where the proposed SCBO algorithms and several state-of-art methods including cEI are tested and compared. The cEI’s performance reported in [41] is usually not satisfying w.r.t. the speed and the extent of approaching the optimum. The differences between our results and theirs may result from our practical treatments (see Open Resource 1), whose effectiveness is shown here.

Fig. 3
figure 3

Results for engineering benchmarks. Since the ranges of objective values for Pressure Vessel and Welded Beam are too wide, we set the maximal value to 30000 and 30, respectively

5 Discussion

Based on the convergence analysis and empirical results, we observe:


Dep-PoF is not worse than PoF. Theoretically, Dep-PoF represents accurate probability while PoF is a special case for diagonal covariance matrix \({\varvec{K}}\). They are equivalent for cases with very little dependence. For cases with dependence, Dep-PoF is more likely to positively influence the results because \(\varvec{K}\) is more accurate calculation. Practically, to conclude that Dep-PoF is not worse than PoF, several assumptions should be satisfied: (i) both models fit well, (ii) dependence is captured and learned, (iii) Dep-PoF is calculated correctly. Open Resource 2 verifies the first two assumptions are verified in Open Resource 2, and the results in [33] guarantee the last assumption.


Dep-PoF is very similar to PoF after sufficient samples. Intuitively, when surrogate models well-approximate the underlying constraints, Dep-PoF is largely determined by the predicted mean (equivalent to PoF) rather than the predicted covariance matrix (where dependence matters). This ’convergence’ indicates that Dep-PoF only helps in early stages of the optimization process.


Improvement versus Additional cost. Though EPMGP is effective, it cannot be as quick as computing the cdf for univariate case. Moreover, training a MOGP model is more expensive than training p GPs as MOGP involves more parameters. The improvement of Dep-PoF, if exists, may be limited compared to the extra computational effort. The reason we still consider dependence is for real expensive black-box functions requiring days to evaluate: If we truly know nothing about the underlying constraints, then Dep-PoF may help. Otherwise, if the dependence does not help or helps little, a user will not feel guilty to assume independence.

6 Conclusions

This paper removes the independence assumption between multiple black-box constraints and implements the Probability of Feasibility with dependence (Dep-PoF) via MOGP and EPMGP. Theoretically, we show that Dep-PoF converges to PoF with enough samples for a non-boundary region and we highlight their difference on/near the boundary. The former implies that Dep-PoF finally shares similar performance to PoF, while the latter explores the potential benefit of Dep-PoF for cases with boundary optimum and constraints with strong correlations.

The experimental results are largely negative: Dep-PoF and PoF typically perform similarly. These (negative) results are useful because they allow us to make a simplifying assumption without concerns. If the dependence between constraints is slight, it hardly helps. For stronger dependence, the surrogate model cannot fully capture it when the samples are scarce. After we have enough samples, the dependence is learned well but is not so important since Dep-PoF converges to PoF. These results encourage modeling simplicity and optimization efficiency.