1 Introduction

Out-of-distribution (OOD) detection (Yang et al., 2021), i.e. identifying input data samples that do not belong to the distribution that a model was trained on, is a task that is receiving an increasing amount of attention in the domain of deep learning (Liang et al., 2018; Liu et al., 2020b; Du et al., 2022; Hendrycks & Gimpel, 2017; Hendrycks & Dietterich, 2019; Fort et al., 2021; Hsu et al., 2020; Techapanurak et al., 2020; Sun et al., 2021; Sun et al., 2022; Wang et al., 2022; Huang & Li, 2021; Lee et al., 2018; Pearce et al., 2021; Yang et al., 2021; Zhang et al., 2021; Nalisnick et al., 2019). The task is often motivated by safety-critical applications of deep learning, such as healthcare and autonomous driving. For these scenarios, there may be a large cost associated with sending a prediction on OOD data downstream. For example, it could be potentially dangerous for a self-driving car to unknowingly classify a grizzly bear as one of the classes in its training set.Footnote 1

However, in spite of a plethora of existing research, the literature generally lacks focus on the specific motivation behind OOD detection, beyond the fact that it is often performed as part of the pipeline of another primary task, e.g. image classification. As such, OOD detection tends to be evaluated in isolation, formulated as binary classification between in-distribution (ID) and OOD data.

In this work, we consider the question: why exactly do we want to perform OOD detection during deployment? We focus on the problem setting where the primary objective is classification, and we are motivated to detect and then reject OOD data, as predictions on those samples will incur a cost. That is to say, the task is selective classification (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017) where OOD data is present within the input samples. Kim et al. (2021) term this problem setting unknown detection. However, we prefer to use Selective Classification in the presence of Out-of-Distribution data (SCOD), as we would like to emphasise the downstream classification task as the primary objective, and will refer to the task as such in the remainder of this paper.

The key difference between this problem setting and OOD detection is that both OOD data and incorrect predictions on ID data will incur a cost (Kim et al., 2021). It does not matter if we reject an ID sample if it would be incorrectly classified anyway. As such we can view the task as separating correctly predicted ID samples (ID✓) from misclassified ID samples (ID✗) and OOD samples. This reveals a potential blind spot in designing approaches solely for OOD detection, as the cost of ID misclassifications is ignored if the aim is only to separate OOD|ID.

The key contributions of this work are:

  1. Building on initial results reported by Kim et al. (2021) that show poor SCOD performance for existing methods designed for OOD detection, we provide novel insight into the behaviour of different post-hoc (after-training) detection methods for the task of SCOD. Improved OOD detection often comes directly at the expense of SCOD performance, through the conflation of ID✗ and ID✓. Moreover, the relative SCOD performance of different methods varies with the proportion of OOD data found in the test distribution, the relative cost of accepting ID✗ vs OOD, as well as the distribution from which the OOD data samples are drawn.

  2. We propose Softmax Information Retaining Combination (SIRC), a novel method targeting SCOD. Our approach aims to improve the OOD|ID✓ separation of softmax-based confidence scores by combining them with a secondary, class-agnostic confidence score, whilst retaining their ability to identify ID✗. It consistently outperforms or matches the baseline maximum softmax probability (MSP) approach over a wide variety of OOD datasets and convolutional neural network (CNN) architectures. On the other hand, existing OOD detection methods fail to achieve this.

  3. We find that the secondary scores investigated for SIRC perform inconsistently over different OOD datasets. That is to say, a given secondary score may improve SCOD for some OOD datasets, but not for others. Also, different scores appear to be better suited to detecting different distribution shifts. Thus, we extend SIRC to incorporate a combination of multiple secondary scores (SIRC+). This results in generally even better SCOD performance, as well as more consistent performance gains over a wider range of OOD data.

A preliminary version of this work has been published in ACCV 2022 (Xia & Bouganis, 2022a), which covers points {1, 2}. In this work, we extend the aforementioned preliminary version through:

  • more detailed discussion of {1, 2},

  • the inclusion of an additional secondary confidence score—KNN (Sun et al., 2022),

  • evaluation on an additional OOD dataset–SpaceNet (Etten et al., 2018),

  • the novel developments described in point 3 (SIRC+).

2 Preliminaries

Neural Network Classifier For a K-class classification problem we learn the parameters \(\varvec{\theta }\) of a discriminative model \(P(y\mid \varvec{x};\varvec{\theta })\) over labels \(y \in \mathcal Y = \{\omega _k\}_{k=1}^K\) given inputs \(\varvec{x} \in \mathcal X = \mathbb R^D\), using finite training dataset \(\mathcal D_\text {tr} = \{y^{(n)}, \varvec{x}^{(n)}\}_{n=1}^{N}\) sampled independently from true joint data distribution \(p_\text {tr}(y, \varvec{x})\). This is done in order to make predictions \(\hat{y}\) given new inputs \(\varvec{x}^* \sim p_\text {tr}(\varvec{x})\) with unknown labels,

$$\begin{aligned} \hat{y} = f(\varvec{x}^*) = \mathop {\mathrm {arg\,max}}\limits _\omega P(\omega \mid \varvec{x}^*;\varvec{\theta })~, \end{aligned}$$
(1)

where f refers to the classifier function. In our case, the parameters \(\varvec{\theta }\) belong to a deep neural network with categorical softmax output \(\varvec{\pi }\in [0, 1]^K\),

$$\begin{aligned} P(\omega _i\mid \varvec{x};\varvec{\theta }) = \pi _i(\varvec{x};\varvec{\theta }) = \frac{\exp v_i(\varvec{x})}{\sum _{k=1}^K \exp v_k(\varvec{x})}~, \end{aligned}$$
(2)

where the logits \(\varvec{v} = \varvec{W} \varvec{z} + \varvec{b} \quad (\in \mathbb R^K)\) are the output of the final fully-connected layer with weights \(\varvec{W} \in \mathbb R^{K\times L}\), bias \(\varvec{b} \in \mathbb R^K\), and final hidden layer features \(\varvec{z} \in \mathbb R^L\) as inputs. Typically \(\varvec{\theta }\) are learnt by minimising the cross entropy loss, such that the model approximates the true conditional distribution \(P_\text {tr}(y\mid \varvec{x})\),

$$\begin{aligned} \mathcal L_\text {CE}(\varvec{\theta })&= -\frac{1}{N}\sum _{n=1}^{N}\sum _{k=1}^K \delta (y^{(n)}, \omega _k)\log P(\omega _k\mid \varvec{x}^{(n)};\varvec{\theta }) \nonumber \\&\approx -\mathbb E_{p_\text {tr}(\varvec{x})}\left[ \sum _{k=1}^K P_\text {tr}(\omega _k\mid \varvec{x})\log P(\omega _k\mid \varvec{x};\varvec{\theta })\right] \nonumber \\&= \mathbb {E}_{p_\text {tr}(\varvec{x})}\left[ \text {KL}\left[ P_\text {tr}(\omega _k\mid \varvec{x})\mid \mid P(\omega _k\mid \varvec{x};\varvec{\theta })\right] \right] + A, \end{aligned}$$
(3)

where \(\delta (\cdot , \cdot )\) is the Kronecker delta, A is a constant with respect to \(\varvec{\theta }\) and KL\([\cdot \Vert \cdot ]\) is the Kullback–Leibler divergence.
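
As a concrete illustration of Eqs. (1)–(3), the following PyTorch-style sketch (our own; the dimensions, batch size and variable names are illustrative assumptions, not taken from the paper) computes logits, softmax probabilities and the predicted class for a batch of feature vectors.

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions: K classes, L-dimensional final hidden features.
K, L = 200, 2048

fc = torch.nn.Linear(L, K)            # final fully-connected layer: logits v = W z + b
z = torch.randn(16, L)                # a batch of backbone features (placeholder values)

v = fc(z)                             # logits, shape (16, K)
pi = F.softmax(v, dim=-1)             # softmax probabilities, Eq. (2)
y_hat = pi.argmax(dim=-1)             # predicted class indices, Eq. (1)

# Training would minimise the cross-entropy loss of Eq. (3):
labels = torch.randint(0, K, (16,))   # placeholder targets
loss = F.cross_entropy(v, labels)
```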

Selective Classification A selective classifier (El-Yaniv & Wiener, 2010) can be formulated as a pair of functions, the aforementioned classifier \(f(\varvec{x})\) [in our case given by Eq. (1)] that produces a prediction \(\hat{y}\), and a binary rejection function

$$\begin{aligned} g(\varvec{x};t) = {\left\{ \begin{array}{ll} 0\text { (reject prediction)}, &{}\text {if }S(\varvec{x}) < t\\ 1\text { (accept prediction)}, &{}\text {if }S(\varvec{x}) \ge t~, \end{array}\right. } \end{aligned}$$
(4)

where t is an operating threshold and S is a scoring function which is typically a measure of predictive confidence (or \(-S\) measures uncertainty). Intuitively, a selective classifier chooses to reject if it is uncertain about a prediction.
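
A minimal sketch of the rejection function in Eq. (4), assuming MSP as the confidence score S (the function names are ours):

```python
import torch

def msp_score(pi: torch.Tensor) -> torch.Tensor:
    # Maximum softmax probability, a common choice for the confidence score S.
    return pi.max(dim=-1).values

def accept(pi: torch.Tensor, t: float) -> torch.Tensor:
    # g(x; t) from Eq. (4): True where the prediction is accepted, False where rejected.
    return msp_score(pi) >= t

# Example usage with random softmax outputs and an arbitrarily chosen threshold.
pi = torch.softmax(torch.randn(16, 200), dim=-1)
accepted = accept(pi, t=0.7)
```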

2.1 Problem Setting: Selective Classification with OOD Data (SCOD)

We consider a scenario where, during deployment, classifier inputs \(\varvec{x}^*\) may be drawn from either the training distribution \(p_\text {tr}(\varvec{x})\) (ID) or another distribution \(p_\text {OOD}(\varvec{x})\) (OOD). That is to say,

$$\begin{aligned}&\varvec{x}^* \sim p_\text {mix}(\varvec{x}) \nonumber \\&p_\text {mix}(\varvec{x}) = \alpha p_\text {tr}(\varvec{x}) + (1-\alpha )p_\text {OOD}(\varvec{x})~, \end{aligned}$$
(5)

where \(\alpha \in [0, 1]\) reflects the proportion of ID to OOD data found in the wild. Here “Out-of-Distribution” inputs are defined as those drawn from a distribution with label space that does not intersect with the training label space \(\mathcal Y\) (Yang et al., 2021). For example, an image of a car is considered OOD for a CNN classifier trained to discriminate between different types of pets. We use this definition as it means that OOD samples are fundamentally incompatible with the primary classifier, and any classification predictions made on them will be automatically invalid. Note that in our case we assume no knowledge of \(p_\text {OOD}\) before deployment.
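
For intuition, the deployment-time mixture of Eq. (5) could be simulated as below; this toy sketch (ours) simply draws each test sample from an ID pool with probability \(\alpha \) and from an OOD pool otherwise.

```python
import numpy as np

def sample_mix(id_pool, ood_pool, alpha, n, seed=0):
    # Draw n test samples from p_mix: ID with probability alpha, OOD otherwise.
    rng = np.random.default_rng(seed)
    is_id = rng.random(n) < alpha
    samples = [id_pool[rng.integers(len(id_pool))] if flag
               else ood_pool[rng.integers(len(ood_pool))]
               for flag in is_id]
    return samples, is_id
```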

We now define the predictive loss on an accepted sample as

$$\begin{aligned} \mathcal L_\text {pred}(f(\varvec{x})) = {\left\{ \begin{array}{ll} 0, &{}\text {if } f(\varvec{x}) = y, (y, \varvec{x}) \sim p_\text {tr}\\ \beta , &{}\text {if } f(\varvec{x}) \ne y, (y, \varvec{x}) \sim p_\text {tr}\\ 1-\beta , &{}\text {if } \varvec{x} \sim p_\text {OOD} \end{array}\right. } \end{aligned}$$
(6)

for classifier \(f(\varvec{x})\) [Eq. (1)], where \(\beta \in [0, 1]\). We define the selective risk as in (Geifman & El-Yaniv, 2017),

$$\begin{aligned} R(f, g;t) = \frac{\mathbb E_{p_\text {mix}(\varvec{x})}[g(\varvec{x};t)\mathcal L_\text {pred}(f(\varvec{x}))]}{\mathbb E_{p_\text {mix}(\varvec{x})}[g(\varvec{x};t)]}~, \end{aligned}$$
(7)

which can be intuitively understood as the average loss of only the accepted samples, when using rejection function \(g(\varvec{x};t)\) [Eq. (4)]. We are only concerned with the relative cost of ID✗ and OOD samples, so we use a single parameter \(\beta \).

The objective is to find a classifier and rejection function (f, g) that minimise R(f, g; t) for some given setting of t. We focus on comparing post-hoc (after-training) methods in this work, where g (or equivalently S) is varied with f fixed. This removes confounding factors that may arise from the interactions of different training-based and post-hoc methods, as they can often be freely combined. In practice, both \(\alpha \) and \(\beta \) will depend on the deployment scenario. However, whilst \(\beta \) can be set freely by the practitioner depending on their own evaluation of costs, \(\alpha \) is outside of the practitioner’s control and their knowledge of it is likely to be very limited.
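
To make the objective concrete, a sketch (our own, with illustrative array names) of the empirical selective risk of Eq. (7) on a pooled ID/OOD test set:

```python
import numpy as np

def selective_risk(scores, correct, is_ood, t, beta):
    # scores:  confidence S(x) for every pooled test sample.
    # correct: True where an ID sample is classified correctly (ignored for OOD).
    # is_ood:  True for OOD samples.
    # t, beta: rejection threshold and relative cost, Eqs. (4) and (6).
    accepted = scores >= t
    # Per-sample loss of Eq. (6): 0 for ID-correct, beta for ID-wrong, 1 - beta for OOD.
    loss = np.where(is_ood, 1.0 - beta, np.where(correct, 0.0, beta))
    if accepted.sum() == 0:
        return 0.0
    return float((loss * accepted).sum() / accepted.sum())

# Toy usage with random placeholder data (alpha = 0.5, 80% ID accuracy assumed).
rng = np.random.default_rng(0)
scores = rng.random(1000)
is_ood = rng.random(1000) < 0.5
correct = rng.random(1000) < 0.8
print(selective_risk(scores, correct, is_ood, t=0.5, beta=0.5))
```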

It is worth contrasting the SCOD problem setting with OOD detection. SCOD aims to separate OOD, ID✗ |ID✓, whilst for OOD detection the data is grouped as OOD|ID✗, ID✓ (see Fig. 1). The key difference is in the categorisation of ID✗.

Fig. 1: Illustrative sketch showing how SCOD differs from OOD detection. Densities of OOD samples, misclassifications (ID✗) and correct predictions (ID✓) are shown with respect to confidence score S. For OOD detection the aim is to separate OOD|ID✗, ID✓, whilst for SCOD the data is grouped as OOD, ID✗ |ID✓

SCOD and Types of Uncertainty We note that previous work (Kendall & Gal, 2017; Malinin & Gales, 2018; Malinin et al., 2020; Mukhoti et al., 2021; Pearce et al., 2021) refers to different types of predictive uncertainty, namely aleatoric and epistemic. The former arises from uncertainty inherent in the data (i.e. the true conditional distribution \(P_\text {tr}(y\mid \varvec{x})\)) and as such is irreducible, whilst the latter can be reduced by having the model learn from additional data. Typically, it is argued that it is useful to distinguish these types of uncertainty at prediction time. Epistemic uncertainty estimates should indicate distributional shift away from the training distribution, i.e. whether a test input \(\varvec{x}^*\) is OOD. On the other hand, aleatoric uncertainty estimates should reflect the level of class ambiguity of an ID input. An interesting result within our problem setting is that the conflation of these different types of uncertainty may not be an issue, since there is no need to separate ID✗ from OOD: both should be rejected.

3 Existing OOD Detectors Applied to SCOD

As the explicit objective of OOD detection is different to SCOD, it is of interest to understand how existing detection methods behave for SCOD. Previous work (Kim et al., 2021) has empirically shown that some existing OOD detection approaches do not perform well on this task, and in this section we shed additional light on why this is the case.

Fig. 2: Illustrations of how a detection method can improve over a baseline. Top: for OOD detection we can either have OOD further away from ID✓ or ID✗ closer to ID✓. Bottom: for SCOD we want both OOD and ID✗ to be further away from ID✓. Thus, we can see how improving OOD detection may in fact be at odds with SCOD

Improving Performance: OOD Detection vs SCOD In order to build an intuition, we can consider, qualitatively, how detection methods can improve performance over a baseline, with respect to the distributions of OOD and ID✗ relative to ID✓. This is illustrated in Fig. 2.

  • For OOD detection the objective is to better separate the distributions of ID and OOD data. Thus, we can find a confidence score S that, compared to the baseline, has OOD distributed further away from ID✓, and/or has ID✗ distributed closer to ID✓.

  • For improving SCOD, we want both OOD and ID✗ to be distributed further away from ID✓ than the baseline.

Thus there is a conflict between the two tasks: the desired behaviour of the confidence score S with respect to the distribution of ID✗ differs between them.

Existing Approaches Sacrifice SCOD by Conflating ID✗ and ID✓

Fig. 3: Left: false positive rate (FPR\(\downarrow \)) of OOD samples (negative class) plotted against true positive rate (TPR) of ID (✓+✗) samples (positive class), i.e. how well each confidence score distinguishes OOD from ID. Energy performs better (lower) for OOD detection relative to the MSP baseline. Right: FPR\(\downarrow \) of ID✗ and OOD samples (negative classes) against TPR of ID✓ (positive class). Energy is worse than the baseline at separating ID✗|ID✓ and no better for OOD|ID✓, meaning it is worse for SCOD. Energy’s improved OOD detection arises from pushing ID✗ closer to ID✓ (Fig. 2). The ID dataset is ImageNet-200, OOD data is iNaturalist and the model is ResNet-50

Considering post-hoc methods, the generally accepted baseline approach for both selective classification and OOD detection is the Maximum Softmax Probability (MSP) (Hendrycks & Gimpel, 2017; Geifman & El-Yaniv, 2017) confidence score. Improvements in OOD detection are often achieved by moving away from the softmax \(\varvec{\pi }\) in order to better capture the differences between ID and OOD data. Confidence scores such as Energy (Liu et al., 2020b) and Max Logit (Hendrycks et al., 2022) consider the logits \(\varvec{v}\) directly, whereas the Mahalanobis detector (Lee et al., 2018) and DDU (Mukhoti et al., 2021) build generative models using Gaussians over the features \(\varvec{z}\). ViM (Wang et al., 2022) and Gradnorm (Huang et al., 2021) incorporate class-agnostic, feature-based information into their scores.

Recall that typically a neural network classifier learns a model \(P(y\mid \varvec{x};\varvec{\theta })\) to approximate the true conditional distribution \(P_\text {tr}(y\mid \varvec{x})\) of the training data [Eqs. (2) and (3), Sect. 2]. As such, scores S extracted from the softmax outputs \(\varvec{\pi }\) should best reflect how likely a classifier prediction on ID data is going to be correct or not (and this is indeed the case in our experiments in Sect. 5). As the above (post-hoc) OOD detection approaches all involve moving away from the modelled \(P(y\mid \varvec{x};\varvec{\theta })\), we would expect worse separation between ID✗ and ID✓ even if overall OOD is better distinguished from ID.

Figure 3 shows empirically how well different types of data are separated using MSP (\(\pi _\text {max}\)) and Energy (\(\log \sum _k\exp v_k\)), by plotting false positive rate (FPR) against true positive rate (TPR). Lower FPR indicates better separation of the negative class away from the positive class.

Although Energy has better OOD detection performance compared to MSP, this is actually because the separation between ID✗ and ID✓ is much less for Energy, so ID as a whole is better separated from OOD. On the other hand the behaviour of OOD relative to ID✓ is not meaningfully different to the MSP baseline. Therefore, SCOD performance for Energy is worse in this case. Another way of looking at it would be that for OOD detection, MSP does worse as it conflates ID with OOD. However, this doesn’t harm SCOD performance as much, as those ID samples that are confused with OOD are mostly incorrect anyway. The ID dataset is ImageNet-200 (Kim et al., 2021), OOD dataset is iNaturalist (Huang & Li, 2021) and the model is ResNet-50 (He et al., 2016).
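
Both scores compared here are simple functions of the logits; for reference, a sketch (ours):

```python
import torch

def msp(v: torch.Tensor) -> torch.Tensor:
    # Maximum softmax probability, pi_max (Hendrycks & Gimpel, 2017).
    return torch.softmax(v, dim=-1).max(dim=-1).values

def energy(v: torch.Tensor) -> torch.Tensor:
    # Energy score, log sum_k exp(v_k) (Liu et al., 2020b).
    return torch.logsumexp(v, dim=-1)

v = torch.randn(8, 200)   # placeholder logits
print(msp(v), energy(v))
```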

4 Targeting SCOD—Retaining Softmax Information

We would now like to develop an approach that is tailored to the task of SCOD. We have discussed how we expect softmax-based methods, such as MSP, to perform best for distinguishing ID✗ from ID✓, and how existing approaches for OOD detection improve over the baseline, in part, by sacrificing this. As such, to improve over the baseline for SCOD, we will aim to retain the ability to separate ID✗ from ID✓ whilst increasing the separation between OOD and ID✓.

Combining Confidence Scores Inspired by Gradnorm (Huang et al., 2021) and ViM (Wang et al., 2022), we consider the combination of two different confidence scores \(S_1, S_2\). We shall consider \(S_1\) our primary score, which we wish to augment by incorporating \(S_2\). For \(S_1\) we investigate scores that are strong for selective classification on ID data, but are also capable of detecting OOD data—MSP and (the negative of) softmax entropy, \((-)\mathcal H[\varvec{\pi }]\). For \(S_2\), the score should be useful in addition to \(S_1\) in determining whether data is OOD or not. We should consider scores that capture different information about OOD data from the post-softmax \(S_1\) if we want to improve OOD|ID✓. We choose to examine the \(l_1\)-norm of the feature vector \(\Vert \varvec{z}\Vert _1\) (Huang et al., 2021), the negative of the ResidualFootnote 2 score \(-\Vert \varvec{z}^{P^\bot }\Vert _2\) (Wang et al., 2022) and the negative of the k-th nearest neighbour distanceFootnote 3 (KNN) (Sun et al., 2022). These scores were chosen as they capture class-agnostic information at the feature level. Note that although \(\Vert \varvec{z}\Vert _1\), Residual and KNN have previously been shown to be useful for OOD detection (Huang et al., 2021; Wang et al., 2022; Sun et al., 2022), we do not expect them to be useful for identifying misclassifications. They are separate from the classification layer defined by \((\varvec{W}, \varvec{b})\), so they are far removed from the categorical \(P(y\mid \varvec{x};\varvec{\theta })\) explicitly modelled by the softmax.
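
The three secondary scores can be sketched as below. This is our own simplified illustration of the cited scores, not the reference implementations: the construction of the principal subspace used by Residual and the exact feature normalisation and search index used by KNN follow the original papers, and here we only indicate the general form under stated assumptions.

```python
import numpy as np

def l1_norm_score(z: np.ndarray) -> np.ndarray:
    # l1-norm of the feature vector, ||z||_1 (Huang et al., 2021).
    return np.abs(z).sum(axis=-1)

def residual_score(z: np.ndarray, principal: np.ndarray) -> np.ndarray:
    # Negative norm of the feature component outside the principal subspace
    # (the Residual of Wang et al., 2022). `principal` is assumed to hold
    # orthonormal columns spanning a subspace estimated from ID training features.
    z_proj = z @ principal @ principal.T          # projection onto the subspace
    return -np.linalg.norm(z - z_proj, axis=-1)

def knn_score(z: np.ndarray, train_feats: np.ndarray, k: int = 10) -> np.ndarray:
    # Negative distance to the k-th nearest ID training feature (Sun et al., 2022),
    # with l2-normalised features and a brute-force search for illustration.
    z_n = z / np.linalg.norm(z, axis=-1, keepdims=True)
    tr_n = train_feats / np.linalg.norm(train_feats, axis=-1, keepdims=True)
    dists = np.linalg.norm(z_n[:, None, :] - tr_n[None, :, :], axis=-1)
    return -np.sort(dists, axis=-1)[:, k - 1]
```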

Softmax Information Retaining Combination (SIRC) We want to create a combined confidence score \(C(S_1, S_2)\) that retains \(S_1\)’s ability to distinguish ID✗ |ID✓ but is also able to incorporate \(S_2\) in order to augment OOD|ID✓. We develop our approach based on the following set of assumptions about the behaviour of \(S_1\) and \(S_2\):

  • \(S_1\) will be higher for ID✓ and lower for ID✗ and OOD.

  • \(S_1\) is bounded by maximum value \(S_1^\text {max}\).Footnote 4

  • \(S_2\) is unable to distinguish ID✗ |ID✓ well, but is lower for OOD compared to ID.

  • \(S_2\) is useful in addition to \(S_1\) for separating OOD|ID.

These assumptions are illustrated roughly in Fig. 4. We expect our choices of \(S_1\) (MSP, \(-\mathcal {H}\)) and \(S_2\) (\(\Vert \varvec{z}\Vert _1\), Res., KNN) to conform to these assumptions for the reasons stated earlier. Moreover, future choices of confidence score should conform as well.

Fig. 4: Illustration on the (\(S_1, S_2\))-plane that satisfies the assumptions behind SIRC. (1) \(S_1\) is higher for ID✓ and lower for ID✗ and OOD. (2) \(S_1\) has maximum value \(S_1^{\text {max}}\). (3) \(S_2\) is not useful for ID✗ |ID✓ but is lower for OOD. (4) \(S_2\) is useful in addition to \(S_1\) for detecting OOD

Given the aforementioned assumptions, we propose to combine \(S_1\) and \(S_2\) using

$$\begin{aligned} C(S_1, S_2) = -(S^{\max }_1-S_1)\left( 1+\exp (-b[S_2-a])\right) , \end{aligned}$$
(8)

or equivalently taking logs,Footnote 5

$$\begin{aligned} C(S_1, S_2)&= -\log (S^{\max }_1-S_1) \nonumber \\&\quad -\, \log \left( 1+\exp (-b[S_2-a])\right) ~, \end{aligned}$$
(9)

where a, b are parameters chosen by the practitioner. The idea is for the accept/reject decision boundary of C to be in the shape of a sigmoid on the \((S_1, S_2)\)-plane (see Figs. 5, 6). As such, the behaviour of only using the softmax-based \(S_1\) is recovered for ID✗ |ID✓ at high \(S_2\), as the decision boundary tends to a vertical line. However, C becomes increasingly sensitive to \(S_2\) as \(S_2\) decreases, and less sensitive to \(S_1\) as \(S_1\) decreases (Fig. 5). This allows for improved OOD|ID✓ as \(S_2\) is “activated” towards the bottom left of the (\(S_1, S_2\))-plane. We term this approach Softmax Information Retaining Combination (SIRC).
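
Eq. (9) translates directly into code; a sketch (ours), using np.logaddexp for a numerically stable \(\log (1+\exp (\cdot ))\) and a small epsilon to guard against \(S_1 = S_1^\text {max}\):

```python
import numpy as np

def sirc(s1, s2, a, b, s1_max=1.0, eps=1e-12):
    # SIRC combined confidence, Eq. (9).
    # s1_max is the upper bound of S_1: 1 for MSP, 0 for negative softmax entropy.
    # np.logaddexp(0, x) computes log(1 + exp(x)) stably.
    return -np.log(s1_max - s1 + eps) - np.logaddexp(0.0, -b * (s2 - a))
```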

Fig. 5: Left: SIRC isocontours on the (\(S_1, S_2\))-plane—they are sigmoids. Centre: plot of how the first term in SIRC [Eq. (9)] varies with \(S_1\) – its sensitivity to \(S_1\) (gradient) is high close to \(S_1^\text {max}\) and gradually decreases with \(S_1\). Right: plot of how the second term in SIRC varies with \(S_2\)—its sensitivity increases from zero as \(S_2\) approaches a from above, eventually tending to a linear relationship proportional to b

Fig. 6: Comparison of different methods of combining confidence scores \(S_1, S_2\) for SCOD. OOD, ID✗ and ID✓ distributions are displayed using kernel density estimate contours. Graded contours for the different combination methods are then overlaid (lighter means higher combined score). We see that our method, SIRC (centre right), is able to better retain ID✗|ID✓ whilst improving OOD|ID✓. An alternate parameter setting for SIRC, with a stricter adherence to \(S_1\), is also shown (far right). The ID dataset is ImageNet-200, the OOD dataset iNaturalist and the model ResNet-50. SIRC parameters are found using ID training data; the plotted distributions are test data

The parameters a, b allow the method to be adjusted to different distributional properties of \(S_2\). Rearranging Eq. (8),

$$\begin{aligned} S_1 = S_1^\text {max} + C/[1+\exp (-b[S_2-a])]~, \end{aligned}$$
(10)

we see that a controls the placement of the sigmoid with respect to \(S_2\), and b the sensitivity of the sigmoid to \(S_2\). Figure 5 shows that the sensitivity of SIRC to \(S_2\) (gradient) increases from zero as \(S_2\) approaches a from above, and then tends to a linear relationship (constant sensitivity proportional to b).

We use the empirical mean and standard deviation of \(S_2\), \(\mu _{S_2}, \sigma _{S_2}\), on ID data (training or validation) to set the parameters. We choose \(a = \mu _{S_2}-3\sigma _{S_2}\) so the centre of the sigmoid is below the ID distribution of \(S_2\), and we set \(b=1/\sigma _{S_2}\) to match the ID variations of \(S_2\). We find the above approach to be empirically effective; however, other parameter settings are of course possible. Practitioners are free to tune a, b however they see fit. This may be done using only ID data (training or validation) as we have, or by additionally using synthetic validation OOD data (Hendrycks et al., 2019; Sun et al., 2022).
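
The parameter choice described above uses only ID statistics of \(S_2\); a short sketch (ours), meant to be used with the sirc() function sketched earlier:

```python
import numpy as np

def fit_sirc_params(s2_id):
    # a = mu - 3*sigma places the sigmoid centre below the ID distribution of S_2;
    # b = 1/sigma matches its sensitivity to the ID variation of S_2.
    mu, sigma = float(np.mean(s2_id)), float(np.std(s2_id))
    return mu - 3.0 * sigma, 1.0 / sigma

# Usage (assuming s2_train / s1_test / s2_test are precomputed score arrays):
# a, b = fit_sirc_params(s2_train)
# combined = sirc(s1_test, s2_test, a, b, s1_max=1.0)
```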

SIRC Compared to Other Combination Approaches Fig. 6 compares different methods of combination by plotting ID✓, ID✗ and OOD data densities on the \((S_1, S_2)\)-plane. Other than SIRC, we consider the combination methods used in ViM, \(C=S_1 + cS_2\), where c is a user-set parameter, and in Gradnorm, \(C=S_1 S_2\). The overlaid contours of C represent decision boundaries for values of t [Eq. (4)].

We see that the linear decision boundary of \(C=S_1 + cS_2\) must trade off significant ID✗ |ID✓ performance in order to gain OOD|ID✓ (through varying c), whilst \(C=S_1 S_2\) sacrifices the ability to separate ID✗ |ID✓ well for higher values of \(S_1\). We also note that \(C=S_1S_2\) is not robust to different ID means of \(S_2\). For example, arbitrarily adding a constant D to \(S_2\) will completely change the behaviour of the combined score. On the other hand, SIRC is designed to be robust to this sort of variation between different \(S_2\). Figure 6 also shows an alternative parameter setting for SIRC, where a is lower and b is higher. The sigmoid is shifted down and steeper. Here more of the behaviour of only using \(S_1\) is preserved, but \(S_2\) contributes less. It is also empirically observable that the assumption that \(S_2\) (in this case \(\Vert \varvec{z}\Vert _1\)) is not useful for distinguishing ID✓ from ID✗ holds; in practice this can be verified on ID validation data when selecting \(S_2\).

We also note that although we have chosen specific \(S_1, S_2\) in this work, SIRC can be applied to any S that satisfy the above assumptions. It is a combination method, rather than a specific confidence score. As such it has the potential to improve beyond the results we present, especially given the rapid pace of development of new confidence scores for uncertainty estimation.

Limitations We note that one limitation of SIRC is that it does not aim to improve ID✗ |ID✓, only OOD|ID✓. Moreover, although the approach aims to limit this effect, we expect some inevitable minor degradation in ID✗ |ID✓ as a result of the inclusion of \(S_2\).

5 Experimental Results—SIRC

We present experiments across a range of CNN architectures and ImageNet-scale OOD datasets. Extended results can be found in Appendix B.

Data For our ID dataset we use ImageNet-200 (Kim et al., 2021), which contains a subset of 200 ImageNet-1k (Russakovsky et al., 2015) classes. It has separate training, validation and test sets. We use a variety of OOD datasets for our evaluation that display a wide range of semantics and difficulty in being identified. Near-ImageNet-200 (Near-IN-200) (Kim et al., 2021) is constructed from remaining ImageNet-1k classes semantically similar to ImageNet-200, so it is especially challenging to detect. Caltech-45 (Kim et al., 2021) is a subset of the Caltech-256 (Griffin et al., 2007) dataset with classes that do not overlap with ImageNet-200. Openimage-O (Wang et al., 2022) is a subset of the Open Images V3 (Krasin et al., 2017) dataset selected to be OOD with respect to ImageNet-1k. iNaturalist (Huang & Li, 2021) and Textures (Wang et al., 2022) are analogously constructed subsets of their respective source datasets (Van Horn et al., 2017; Cimpoi et al., 2014). SpaceNet (Etten et al., 2018) contains satellite images of Rio De Janeiro. Colorectal (Kather et al., 2016) is a collection of histological images of human colorectal cancer, whilst Colonoscopy is a dataset of frames taken from colonoscopic video of gastrointestinal lesions (Mesejo et al., 2016). Noise is a dataset of square images where the resolution, contrast and pixel values are randomly generated (for details see Appendix A.2). Finally, ImageNet-O (Hendrycks et al., 2021) is a dataset OOD to ImageNet-1k that is adversarially constructed using a trained ResNet. Note that we exclude a number of OOD datasets from Kim et al. (2021) and Huang and Li (2021) as a result of discovering samples within said datasets that match ID labels.

Models and Training We train ResNet-50 (He et al., 2016), DenseNet-121 (Huang et al., 2017) and MobileNetV2 (Sandler et al., 2018) using hyperparameters based around standard ImageNet settings.Footnote 6 Full training details can be found in Appendix A.1. For each architecture, we train 5 models independently using random seeds \(\{1, \dots , 5\}\) and report the mean result over the runs. Appendix B additionally contains results on single pre-trained ImageNet-1k models, BiT ResNetV2-101 (Kolesnikov et al., 2020) and PyTorch DenseNet-121.

Detection Methods for SCOD We consider six variations of SIRC using the components {MSP, \(\mathcal H\)} \(\times \) {\(\Vert \varvec{z}\Vert _1, \)Residual, KNN}, as well as the components individually. We additionally evaluate various existing post-hoc methods: MSP (Hendrycks & Gimpel, 2017), Energy (Liu et al., 2020b), ViM (Wang et al., 2022) and Gradnorm (Huang et al., 2021). For the Residual score (used in SIRC and ViM) we use the full ID ImageNet-200 train set to determine parameters. For KNN we sample 12,500 feature vectors from the training set and use \(k=10\). Results for additional approaches, as well as further details pertaining to the methods, can be found in Appendices B and A.3.

5.1 Evaluation Metrics

For evaluating different scoring functions S for the SCOD problem setting we consider a number of metrics. Arrows (\(\uparrow \downarrow \)) indicate whether higher/lower is better (For graphical illustrations and additional metrics see Appendix A.4).

Area Under the Risk-Recall curve (AURR)\(\downarrow \) We consider how empirical risk [Eq. (7)] varies with recall of ID✓, and aggregate performance over different t by calculating the area under the curve. As recall is only measured over ID✓, the base accuracy of f is not properly taken into account. Thus, this metric is only suitable for comparing different g with f fixed. To give an illustrative example, an (f, g) pair where the classifier f is only able to produce a single correct prediction will have perfect AURR as long as S assigns that correct prediction the highest confidence (lowest uncertainty) score. Note that results for the AURC metric (Kim et al., 2021; Geifman et al., 2019) can be found in Appendix B, although we omit them from the main paper as they are not notably different to AURR.

Risk@Recall=0.95 (Risk@95)\(\downarrow \) Since a rejection threshold t must be selected at deployment, we also consider a particular setting of t such that 95% of ID✓ is recalled. In practice, the corresponding value of t could be found on a labelled ID validation set before deployment, without the use of any OOD data. It is worth noting that differences tend to be greater for this metric between different S as it operates around the tail of the positive class.

Area Under the ROC Curve (AUROC)\(\uparrow \) Since we are interested in rejecting both ID✗ and OOD, we can consider ID✓ as the positive class, and ID✗, OOD as separate negative classes. Then we can evaluate the AUROC of OOD|ID✓ and ID✗ |ID✓ independently. The AUROC for a specific value of \(\alpha \) would then be a weighted average of the two different AUROCs. This is not a direct measure of risk, but it does measure the separation between different empirical distributions. Note that, for reasons similar to AURR, this metric is only valid for fixed f.

False Positive Rate@Recall=0.95 (FPR@95)\(\downarrow \) FPR@95 is similar to AUROC, but is taken at a specific t. It measures the proportion of the negative class accepted when the recall of the positive class (or true positive rate) is 0.95.
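
The latter two metrics are straightforward to compute; a sketch (ours) using scikit-learn, with ID✓ scores as the positive class and ID✗ or OOD scores as the negative class:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(scores_pos, scores_neg):
    # AUROC with ID-correct as the positive class.
    labels = np.concatenate([np.ones(len(scores_pos)), np.zeros(len(scores_neg))])
    scores = np.concatenate([scores_pos, scores_neg])
    return roc_auc_score(labels, scores)

def fpr_at_recall(scores_pos, scores_neg, recall=0.95):
    # FPR@95: fraction of negatives accepted when 95% of positives are recalled.
    t = np.quantile(scores_pos, 1.0 - recall)   # threshold recalling `recall` of positives
    return float((scores_neg >= t).mean())
```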

5.2 Separation of ID✗ |ID✓ and OOD|ID✓ Independently

Table 1 shows %AUROC and %FPR@95 with ID✓ as the positive class and ID✗, OOD independently as different negative classes (see Sect. 5.1). It is important for a confidence score to have strong ID✗ |ID✓ performance as ID✗ will always be presentFootnote 7 regardless of the volume or type of OOD data. It is also important for a confidence score to perform consistently over different OOD data, as we assume no knowledge at the time of deployment of what distribution shifts may occur.

Fig. 7: The change in %FPR@95\(\downarrow \) relative to the MSP baseline across different OOD datasets. SIRC is able to consistently match or improve over the baseline, whilst ViM is inconsistent depending on the OOD dataset

In general, we see that SIRC, compared to \(S_1\), is able to improve OOD|ID✓ whilst incurring only a small (\(<0.2\) %AUROC) reduction in the ability to distinguish ID✗ |ID✓, across all 3 architectures. On the other hand, non-softmax methods designed for OOD detection show poor ability to identify ID✗, with performance ranging from \(\sim 8\) %AUROC worse than MSP down to \(\sim 50\)% AUROC (random guessing). Furthermore, they cannot consistently outperform the baseline when separating OOD|ID✓, in line with the discussion in Sect. 3.

We note that in some cases SIRC slightly improves ID✗ |ID✓, however, the impact is minimal and inconsistent over model architectures and \(S_2\). We provide some additional empirical analysis in Appendix B.1.1.

SIRC is Robust to Weak \(S_2\) Although for the majority of OOD datasets in Table 1 SIRC is able to outperform \(S_1\), this is not always the case. When SIRC does not provide a boost over \(S_1\), we can see that \(S_2\) individually is not useful for OOD|ID✓. For example, for ResNet-50 on Colonoscopy, Residual performs worse than random guessing. However, in cases like this the performance is still close to that of \(S_1\). As \(S_2\) will tend to be higher for these OOD datasets, the behaviour of SIRC is similar to that of \(S_1\) for ID✗ |ID✓, with the decision boundaries close to vertical (see Figs. 5, 8). As such SIRC is robust to \(S_2\) performing poorly, but is able to improve on \(S_1\) when \(S_2\) is of use. In comparison, ViM, which linearly combines Energy and Residual, is more sensitive to the failure of the latter. This is shown in Fig. 8. On iNaturalist ViM has \(\sim 25\) %FPR@95 worse than Energy, whereas SIRC (\(-\mathcal H\), Res.) loses \(<0.5\)% compared to \(-\mathcal H\). Note that the issue of \(S_2\) being inconsistent is addressed in Sect. 6, where we further extend SIRC.

Fig. 8: Comparison (similar to Fig. 5) between SIRC and ViM with OOD data iNaturalist. For this OOD dataset \(S_2=\)Res. cannot distinguish OOD|ID✓. SIRC mostly ignores \(S_2\) in this case (close-to-vertical decision boundaries) leading to performance very close to \(S_1\). On the other hand, ViM incurs a large penalty in FPR@95 from relying on \(S_2\). The ID dataset is ImageNet-200 and the model is ResNet-50

Table 1 %AUROC and %FPR@95 with ID✓ as the positive class, considering ID✗ and each OOD dataset separately. Full results are for ResNet-50 trained on ImageNet-200

We additionally remark that regardless of the choice of \(S_2\), there is little to no improvement for Near-ImageNet-200. This suggests that softmax-based scores are best suited to capturing this type of distributional shift. For Near-ImageNet-200 the semantic shift from ImageNet-200 is purposely very small (e.g. “cricket” vs “grasshopper”), and there is no higher level overarching shift (e.g. photographs vs cartoons).

OOD Detection Methods are Inconsistent Over Different Data In Table 1 the performance of existing methods for OOD detection relative to the MSP baseline varies considerably from dataset to dataset. This is directly illustrated in Fig. 7. Even though ViM is able to perform very well on Textures, Noise and ImageNet-O (>50 better %FPR@95 on Noise), it does worse than the baseline on many other OOD datasets (>20 worse %FPR@95 for Near-ImageNet-200 and iNaturalist). This suggests that the inductive biases incorporated, and assumptions made, when designing existing OOD detection methods may prevent them from generalising across a wider variety of OOD data. This behaviour is problematic as we assume no knowledge of the OOD data prior to deployment. In this case, a practitioner may be “unlucky” with the OOD data encountered and incur significant additional loss for choosing ViM over MSP.

In contrast, SIRC more consistently, albeit modestly, improves over the baseline (Fig. 7), due to its aforementioned robustness. These results suggest that methods designed to deal with OOD data should be evaluated on benchmarks that represent a wider range of distributional shifts than what is currently commonly found in the literature.

5.3 Varying the Importance of OOD Data Through \(\alpha \) and \(\beta \)

At deployment, there will be a specific ratio of ID:OOD data exposed to the model. Thus, it is of interest to investigate the risk over different values of \(\alpha \) (Eq. 5). Similarly, an incorrect ID prediction may or may not be more costly than a prediction on OOD data, so we investigate different values of \(\beta \) (Eq. 6). Figure 9 shows how AURR and Risk@95 are affected as \(\alpha \) and \(\beta \) are varied independently (with the other fixed to 0.5). We use the full test set of ImageNet-200, pool the OOD datasets together, and randomly sample different quantities of OOD data in order to achieve different values of \(\alpha \). We use 3 different groupings of OOD data: All, “Close” {Near-ImageNet-200, Caltech-45, Openimage-O, iNaturalist} and “Far” {Textures, SpaceNet, Colonoscopy, Colorectal, Noise}. These groupings are based on relative qualitative semantic difference to the ID dataset (see Appendix A.2 for example images from each dataset). Although the grouping is not formal, it serves to illustrate OOD-data-dependent differences in SCOD performance.

Fig. 9: AURR\(\downarrow \) and Risk@95\(\downarrow \) (\(\times 10^2\)) for different methods as \(\alpha \) and \(\beta \) vary [Eqs. (5), (6)] on a mixture of all the OOD data. We also split the OOD data into qualitatively “Close” and “Far” subsets (Sect. 5.3). For high \(\alpha , \beta \), where ID✗ dominates in the risk, the MSP baseline is the best. As \(\alpha , \beta \) decrease, increasing the effect of OOD data, other methods improve relative to the baseline. SIRC is able to most consistently improve over the baseline. OOD detection methods perform better on “Far” OOD. The ID dataset is ImageNet-200, and the model is ResNet-50. We show the mean over 5 independent training runs. We multiply all values by \(10^2\) for readability

Fig. 10: The change in %FPR@95\(\downarrow \) relative to the MSP baseline of different methods. Different data classes are shown negative|positive. Although OOD detection methods are able to improve OOD|ID, they do so mainly at the expense of ID✗ |ID✓ rather than improving OOD|ID✓. SIRC is able to improve OOD|ID✓ with minimal loss to ID✗ |ID✓, alongside modest improvements for OOD|ID. Results for OOD are averaged over all OOD datasets. The ID dataset is ImageNet-200 and the model is ResNet-50

Relative Performance of Methods Changes with \(\alpha \) and \(\beta \) At high \(\alpha \) and \(\beta \), where ID✗ dominates the risk, the MSP baseline performs best. However, as \(\alpha \) and \(\beta \) are decreased, and OOD data is introduced, we see that other methods improve relative to the baseline. There may be a crossover after which the ability to better distinguish OOD|ID✓ allows a method to surpass the baseline. Thus, which method to choose for deployment will depend on the practitioner’s setting of \(\beta \) and (if they have any knowledge of it at all) of \(\alpha \).

SIRC Most Consistently Improves Over the Baseline SIRC \((-\mathcal H, \text {Res.})\) is able to outperform the baseline most consistently over the different scenarios and settings of \(\alpha , \beta \), only doing worse for ID✗ dominated cases (\(\alpha , \beta \) close to 1). This is because SIRC has close to baseline ID✗ |ID✓ performance and is superior for OOD|ID✓ (Table 1). In comparison, ViM and Energy, which conflate ID✗ and ID✓, are often worse than the baseline for most (if not all) values of \(\alpha , \beta \). Their behaviour on the different groupings of data illustrates how these methods may be biased towards different OOD datasets, as they significantly outperform the baseline at lower \(\alpha \) for the “Far” grouping, but always do worse on “Close” OOD data.

5.4 Comparison Between SCOD and OOD Detection

Figure 10 shows the difference in %FPR@95 relative to the MSP baseline for different combinations of negative|positive data classes (ID✗ |ID✓, OOD|ID✓, OOD|ID), where OOD results are averaged over all datasets and training runs. In line with the discussion in Sect. 3, we observe that the non-softmax OOD detection methods are able to improve over the baseline for OOD|ID. However, this comes at the cost of significantly degraded ID✗ |ID✓, with only small improvements in OOD|ID✓. Thus their SCOD performance is poor compared to the MSP baseline. SIRC on the other hand is able to retain much more ID✗ |ID✓ performance whilst improving on OOD|ID✓, allowing it to have better OOD detection and SCOD performance compared to the baseline.

6 Extending SIRC—Improving Performance over Diverse Distribution Shifts

A salient result from the previous section is that for certain OOD datasets, certain \(S_2\) fail to improve the OOD|ID✓ performance of SIRC compared to \(S_1\) by itself (e.g. Residual on iNaturalist in Table 1). SIRC is robust to scenarios where \(S_2\) fails, as its behaviour defaults to being similar to only using \(S_1\) (Sect. 5.2). However, ideally we want performance improvements over as wide a range of distribution shifts as possible. Furthermore, it appears that different \(S_2\) are better suited for different OOD datasets, so there is not necessarily a “best overall choice” for \(S_2\). This is further illustrated in Fig. 11, which shows the improvement of SIRC vs only \(S_1\) for different \(S_2\) and OOD datasets.

Fig. 11: Comparison of SIRC performance (\(\Delta \)FPR@95\(\downarrow \)) compared to only using \(S_1 (-\mathcal {H})\), over a range of different OOD datasets and \(S_2\). Performance improvements are inconsistent over different distributional shifts (e.g. Residual does not contribute at all for iNaturalist). Moreover, different \(S_2\) seem better suited to different OOD datasets, with no single score being best in all cases. The ID dataset is ImageNet-200, and the model is ResNet-50

Additionally, each choice of secondary score {\(\Vert \varvec{z}\Vert _1, \)Residual, KNN} captures information about distributional shift in a different way. This suggests that by choosing only one, we are leaving information that could be used to further improve SCOD performance on the table. Consequently, we suggest an extension to SIRC, in order to:

  1. improve the consistency of performance over a wider range of distribution shifts,

  2. generally boost SCOD performance.

Using Multiple Secondary Scores Given we have access to a selection of options to use as \(S_2\), a natural question to ask is, can we combine the information from multiple secondary scores, in order to achieve the above aims? We propose to extend Eq. 8,

$$\begin{aligned} C(S_1, \dots , S_M) = -(S^{\max }_1-S_1)\prod _{m=2}^M\left[ 1+\exp (-b_m[S_m-a_m])\right] ~, \end{aligned}$$
(11)

and the log version Eq. 9,

$$\begin{aligned} C(S_1, \dots , S_M)&= -\log (S^{\max }_1-S_1) \nonumber \\&\quad -\, \sum _{m=2}^M\log \left( 1+\exp (-b_m[S_m-a_m])\right) ~, \end{aligned}$$
(12)

to include \(M-1\) secondary scores.Footnote 8 Fig. 5 can help with intuition for how the different components contribute in Eq. (12). Multiple secondary scores (right-hand plot) contribute additively. We refer to this extended version of SIRC as SIRC+.
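
Extending the earlier sirc() sketch to Eq. (12), SIRC+ simply accumulates one sigmoidal term per secondary score (again our own sketch):

```python
import numpy as np

def sirc_plus(s1, secondary_scores, params, s1_max=1.0, eps=1e-12):
    # SIRC+, Eq. (12). `secondary_scores` is a list of S_m arrays and `params`
    # the matching list of (a_m, b_m) pairs, each fitted on ID data as before.
    c = -np.log(s1_max - s1 + eps)
    for s_m, (a_m, b_m) in zip(secondary_scores, params):
        c -= np.logaddexp(0.0, -b_m * (s_m - a_m))
    return c
```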

Fig. 12: Conditional plots of KNN given Residual. Left and centre: conditional histograms showing empirical distributions. Right: conditional means. KNN is useful in addition to Residual for detecting OOD Openimage-O

More Consistent Improvements over Different Distribution Shifts By incorporating multiple secondary scores as in Eq. (12), the idea is that only a single secondary score in SIRC+ needs to contribute usefully in order for OOD|ID✓ to improve. As long as a single score moves into the “sensitive zone” past or around a (Fig. 5) for OOD data samples, then SCOD should improve compared to only using \(S_1\).

Thus, different secondary scores may be able to compensate for each other’s failures, resulting in more consistent improvements in SCOD over different OOD data. We aim to increase the likelihood of SIRC responding to an unknown distribution shift. In a sense, this approach is an attempt to “safeguard” against as wide a range of distribution shifts as possible, where we do not trust any single secondary score to be able to detect all shifts. This is illustrated for the Colonoscopy OOD dataset in Fig. 13. It shows how the additional useful information from the KNN score can be exploited to improve SCOD even if the Residual score fails to distinguish OOD from ID.

Generally Improved OOD|ID✓ Additionally, when multiple secondary scores react to a distribution shift, we intuitively expect the OOD|ID✓ performance of SIRC+ to be better than using the scores individually. If the different secondary scores provide different information about the distribution shift, then they should contribute in a complementary manner, further improving detection. This is illustrated for the Openimage-O OOD dataset in Fig. 12. KNN is lower for OOD given the value of Residual is known, meaning it is additionally useful for detection. Figure 13 then shows how SIRC+ is able to utilise the information in both scores together.

Note that by including more secondary scores in SIRC+, we do expect increased degradation in ID✗ |ID✓. Although SIRC is insensitive to secondary scores for ID✗ |ID✓ (for which we do not expect them to contribute useful information), we still expect the (slight) negative effects to add up as M [Eq. (12)] increases.

Fig. 13: Visualisation of the combination of multiple secondary scores in SIRC+. OOD, ID✗ and ID✓ distributions are displayed using kernel density estimate contours. Graded contours reflect equidistant values of the second term in Eq. (12). We show two different OOD datasets that illustrate different scenarios. Colonoscopy: Residual is not useful for detecting OOD, but KNN is. By considering both scores we are more likely to improve SCOD performance for an unknown distributional shift. Openimage-O: both scores are useful and intuitively capture different information about OOD data. We expect to improve OOD|ID✓ vs using either score individually. The ID dataset is ImageNet-200 and the model is ResNet-50. SIRC parameters are found using ID training data; the plotted distributions are test data

7 Experimental Results—SIRC+

We extend the evaluation in Sect. 5.2, where we consider ID✗ |ID✓ and OOD|ID✓ separately, to include SIRC+ where all 3 secondary scores are used together (\(-\mathcal H\), KNN, Res., \(\Vert \varvec{z}\Vert _1\)). Figure 14 shows, for ResNet-50, the difference in SCOD performance between \(-\mathcal H\) (only using \(S_1\)) and different variants of SIRC over the full range of OOD datasets. Full results for other architectures can be found in Appendix B, as well as tables in the format of Table 1 including SIRC+.

Fig. 14: \(\Delta \)%AUROC and \(\Delta \)%FPR@95 (where ID✓ is the positive class) with respect to \(-\mathcal H\) (\(S_1\) only). Results are for ResNet-50 trained on ImageNet-200. We show the mean over models from 5 independent training runs. SIRC+ is able to provide more consistent improvements over \(-\mathcal {H}\) over the different OOD datasets compared to SIRC with a single secondary score. Additionally, on a number of OOD datasets, SIRC+ is able to further improve SCOD performance compared to SIRC

SIRC+ Improves over \(S_1\) More Consistently than SIRC Fig. 14 shows that, compared to SIRC with each individual \(S_2\), SIRC+ is able to more consistently boost SCOD performance over the whole range of OOD datasets. For example, for the two OOD datasets iNaturalist and Colonoscopy, SIRC with a single score (\(-\mathcal {H}\), Res.) is unable to improve over \(-\mathcal {H}\). This is because the Residual score fails to recognise samples from these two datasets as OOD. On the other hand, SIRC+ is able to leverage the information in the other two scores, KNN and \(\Vert \varvec{z}\Vert _1\), leading to better SCOD performance even when the Residual score fails.

SIRC+ Generally Improves SCOD Compared to SIRC For a number of OOD datasets (e.g. Openimage-O), Fig. 14 also shows that SIRC+ is able to achieve better SCOD performance compared to using any of the secondary scores by themselves. This is in line with the discussion in Sect. 6, supporting the idea that even better OOD|ID✓ performance can be achieved by combining multiple secondary scores.

We note that we also observe a slight increase in the degradation of ID✗ |ID✓ as expected. However, it is small compared to the improvements in OOD|ID✓, which we believe justifies this trade-off. This is shown in Fig. 15, which reproduces part of Fig. 9 and shows that SIRC+ is able to further improve SCOD over SIRC for the scenarios considered in Sect. 5.3.

Fig. 15: Part of Fig. 9 reproduced to include SIRC+. SIRC+ is able to further improve SCOD performance compared to SIRC, especially on the “Far” OOD data

8 Related Work

OOD Detection There is extensive existing research into OOD detection, a survey of which can be found in Yang et al. (2021). To improve over the MSP baseline in Hendrycks and Gimpel (2017), early post-hoc approaches, primarily experimenting on CIFAR-scale data, such as ODIN (Liang et al., 2018), Mahalanobis (Lee et al., 2018) and Energy (Liu et al., 2020b) explore how to extract non-softmax information from a trained network. They investigate the use of logits and features, as well as the idea of using input perturbations (inspired by the adversarial attacks literature (Goodfellow et al., 2015)).

More recent work has moved to larger-scale, higher-resolution image datasets (Huang & Li, 2021; Hendrycks et al., 2022; Wang et al., 2022), designed to reflect more realistic computer vision applications. Gradnorm (Huang et al., 2021), although motivated by the information in gradients, at its core combines information from the softmax and features together. Similarly, ViM (Wang et al., 2022) linearly combines Energy with the class-agnostic Residual score. ReAct (Sun et al., 2021) aims to improve logit/softmax-based scores by clamping the magnitude of final layer features. KNN (Sun et al., 2022) takes a non-parametric approach, using the distance to the k-th nearest ID neighbour of a test feature vector.

There are also many training-based approaches. Outlier Exposure (Hendrycks et al., 2019) explores training networks to be uncertain on “known” existing OOD data, so that this behaviour generalises to unseen test OOD data. On the other hand, VOS (Du et al., 2022) instead generates virtual outliers during training for this purpose. Hsu et al. (2020) and Techapanurak et al. (2020) propose that the network explicitly learn a scaling factor for the logits to improve softmax behaviour. There also exists a line of research that explores the use of generative models, \(p(\varvec{x};\varvec{\theta })\), for OOD detection (Caterini & Loaiza-Ganem, 2021; Zhang et al., 2021; Ren et al., 2019; Nalisnick et al., 2019). These approaches are separate from classification, however, so are less relevant to this work.

Selective Classification Selective classification, or misclassification detection, has also been investigated for deep learning scenarios. The task was initially examined in Geifman and El-Yaniv (2017) and Hendrycks and Gimpel (2017), and a number of approaches target the classifier f through novel training losses and/or architectural adjustments (Moon et al., 2020; Corbière et al., 2019; Geifman & El-Yaniv, 2019). Post-hoc approaches are fewer. DOCTOR (Granese et al., 2021) provides theoretical justification for using the \(l_2\)-norm of the softmax output \(\Vert \varvec{\pi }\Vert _2\) as a confidence score for detecting misclassifications; however, we find its behaviour similar to MSP and \(\mathcal H\) (see Appendix B). The comparatively smaller advancement in the selective classification literature, compared to OOD detection, suggests that improving performance on this task is much more challenging. This makes sense given the discussion in Sect. 3. The MSP baseline works well for detecting ID✗ as the softmax directly models \(P(y\mid \varvec{x})\), but is inherently ill-suited to OOD detection as it tends to conflate ID✗ with OOD.

General Methods for Uncertainty Estimation There also exist general approaches for uncertainty estimation. These approaches are typically more broadly motivated and aim to improve the quality of uncertainties over a wider range of potential downstream objectives. Earlier methods place neural networks in a Bayesian framework (MacKay, 1995; Jospin et al., 2022), of which a popular and simple-to-implement approach is MC-Dropout (Gal & Ghahramani, 2016). Deep Ensembles (Lakshminarayanan et al., 2017), where multiple models are trained independently using different random seeds, can also be viewed as Bayesian (Wilson, 2020). They offer consistent, and therefore compelling improvements in downstream tasks (Ovadia et al., 2019; Xia & Bouganis, 2022b, 2023; Malinin & Gales, 2021), however, their costs scale linearly with the number of ensemble members. Dirichlet Networks (Malinin & Gales, 2018; Malinin et al., 2020; Ulmer et al., 2023) model a distribution over categorical distributions in order to capture different types of uncertainty. SNGP (Liu et al., 2020a) and DDU (Mukhoti et al., 2021) use spectral normalisation so that shifts in the input space better correspond to shifts in the output space.

Selective Classification with Distribution Shift Here we discuss work that is most closely related to this work (some of which was published after the preliminary version of this paper (Xia & Bouganis, 2022a)). Kamath et al. (2020) investigate selective classification under covariate shift for the natural language processing task of question and answering. In the case of covariate shift, valid predictions can still be produced on the shifted data, which by our definition is not possible for OOD data (see Sect. 2). Thus the problem setting here is different to our work. They propose that g be a random forest classifier trained on a mixture of ID and covariate-shifted data, after f is fully trained.

Kim et al. (2021) introduce the idea that ID✗ and OOD data should be rejected together and investigate the performance of a range of existing approaches on an image-classification-based benchmark. They examine both training and post-hoc methods (comparing different f and g) on SCOD (which they term unknown detection). They also evaluate performance on misclassification detection and OOD detection independently. They find that Deep Ensembles (Lakshminarayanan et al., 2017) perform best overall. They do not provide a novel approach targeting SCOD, and consider only a single setting of (\(\alpha , \beta \)), where \(\alpha \) is not specified and \(\beta = 0.5\).

Jaeger et al. (2023) echo a similar sentiment to Kim et al. (2021), presenting a unified evaluation of selective classification with both OOD data and covariate-shifted data for image classification, without presenting a novel approach.

Cen et al. (2023) evaluate the SCOD performance of many approaches under different training regimes. They also propose a SIRC-inspired approach for a “few-shot” problem scenario, where a few OOD samples are available before deployment. They in fact benchmark SIRC and report strong results (see their Table 5). We note that whilst both (Cen et al., 2023; Jaeger et al., 2023) are concurrent work to ours, they do not propose any methods that directly compete with SIRC(+), and they perform classification-based experiments similar to those in our work [and (Kim et al., 2021)].

9 Future Work

In the future, it would be valuable to explore the ideas in SCOD in problem settings such as Object Detection and Semantic Segmentation that include classification as a sub-task. These scenarios are more complex compared to our definition of SCOD in Sect. 2 for vanilla classification. For example, in the case of Object Detection with OOD objects (Dhamija et al., 2020; Du et al., 2022), one can imagine a scenario where it is desirable to reject OOD objects as non-objects alongside low-confidence class predictions (just like SCOD), for which a SIRC-like approach may be suitable. However, it may alternatively be desirable to specifically detect OOD objects as unknown objects with a corresponding bounding box, which would require a different style of approach. In the case of semantic segmentation with OOD objects (Hendrycks et al., 2022), there are complications arising from the need to separate uncertainty relating to the edges of objects and uncertainty relating to the overall class of an object. One can easily imagine a SCOD-like problem setting where incorrect pixel predictions on edges would be irrelevant, whereas object-level misclassifications/OOD samples need to be detected.

Additionally, selective prediction for regression problems under distributional shift (Malinin & Gales, 2021) is an underexplored problem setting currently. It could also be possible in this case to leverage methods similar to SIRC, that combine multiple confidence scores together.

10 Concluding Remarks

In this work, we consider the performance of existing methods for OOD detection on selective classification in the presence of out-of-distribution data (SCOD). We show how their improved OOD detection vs the MSP baseline often comes at the cost of inferior SCOD performance. Furthermore, we find their performance is inconsistent over different OOD datasets.

In order to improve SCOD performance over the baseline, we develop SIRC. Our approach aims to retain information useful for detecting misclassifications from a softmax-based confidence score, whilst incorporating additional information useful for identifying OOD samples from a secondary score. Experiments show that SIRC is able to consistently match or improve over the baseline approach for a wide range of datasets, CNN architectures and problem scenarios. Moreover, by extending SIRC to include information from multiple secondary scores, we are able to further improve overall SCOD performance, as well as the consistency of SIRC over different distribution shifts.

We hope this work encourages the further investigation of SCOD or other new problem settings that involve detecting or distinguishing distributional shifts during deployment.