Abstract
Detecting out-of-distribution (OOD) data is a task that is receiving an increasing amount of research attention in the domain of deep learning for computer vision. However, the performance of detection methods is generally evaluated on the task in isolation, rather than also considering potential downstream tasks in tandem. In this work, we examine selective classification in the presence of OOD data (SCOD). That is to say, the motivation for detecting OOD samples is to reject them so their impact on the quality of predictions is reduced. We show that, under this task specification, existing post-hoc methods perform quite differently compared to when they are evaluated only on OOD detection. This is because it is no longer an issue to conflate in-distribution (ID) data with OOD data if the ID data is going to be misclassified. However, the conflation within ID data of correct and incorrect predictions becomes undesirable. We also propose a novel method for SCOD, Softmax Information Retaining Combination (SIRC), that augments a softmax-based confidence score with a secondary class-agnostic feature-based score. Thus, the ability to identify OOD samples is improved without sacrificing separation between correct and incorrect ID predictions. Experiments on a wide variety of ImageNet-scale datasets and convolutional neural network architectures show that SIRC is able to consistently match or outperform the baseline for SCOD, whilst existing OOD detection methods fail to do so. Interestingly, we find that the secondary scores investigated for SIRC do not consistently improve performance on all tested OOD datasets. To address this issue, we further extend SIRC to incorporate multiple secondary scores (SIRC+). This further improves SCOD performance, both generally, and in terms of consistency over diverse distribution shifts. Code is available at https://github.com/Guoxoug/SIRC.
1 Introduction
Out-of-distribution (OOD) detection (Yang et al., 2021), i.e. identifying input data samples that do not belong to the distribution that a model was trained on, is a task that is receiving an increasing amount of attention in the domain of deep learning (Liang et al., 2018; Liu et al., 2020b; Du et al., 2022; Hendrycks & Gimpel, 2017; Hendrycks & Dietterich, 2019; Fort et al., 2021; Hsu et al., 2020; Techapanurak et al., 2020; Sun et al., 2021; Sun et al., 2022; Wang et al., 2022; Huang & Li, 2021; Lee et al., 2018; Pearce et al., 2021; Yang et al., 2021; Zhang et al., 2021; Nalisnick et al., 2019). The task is often motivated by safety-critical applications of deep learning, such as healthcare and autonomous driving. For these scenarios, there may be a large cost associated with sending a prediction on OOD data downstream. For example, it could be potentially dangerous for a self-driving car to unknowingly classify a grizzly bear as one of the classes in its training set.
However, despite a plethora of existing research, the literature generally lacks focus on the specific motivation behind OOD detection, beyond the fact that it is often performed as part of the pipeline of another primary task, e.g. image classification. As such, OOD detection tends to be evaluated in isolation, formulated as binary classification between in-distribution (ID) and OOD data.
In this work, we consider the question: why exactly do we want to perform OOD detection during deployment? We focus on the problem setting where the primary objective is classification, and we are motivated to detect and then reject OOD data, as predictions on those samples will incur a cost. That is to say, the task is selective classification (El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017) where OOD data is present within the input samples. Kim et al. (2021) term this problem setting unknown detection. However, we prefer to use Selective Classification in the presence of Out-of-Distribution data (SCOD) as we would like to emphasise the downstream classification task as the primary objective, and will refer to the task as such in the remainder of this paper.
The key difference between this problem setting and OOD detection is that both OOD data and incorrect predictions on ID data will incur a cost (Kim et al., 2021). It does not matter if we reject an ID sample if it would be incorrectly classified anyway. As such we can view the task as separating correctly predicted ID samples (ID✓) from misclassified ID samples (ID✗) and OOD samples. This reveals a potential blind spot in designing approaches solely for OOD detection, as the cost of ID misclassifications is ignored if the aim is only to separate OOD|ID.
The key contributions of this work are:
1. Building on initial results reported by Kim et al. (2021) that show poor SCOD performance for existing methods designed for OOD detection, we provide novel insight into the behaviour of different post-hoc (after-training) detection methods for the task of SCOD. Improved OOD detection often comes directly at the expense of SCOD performance, through the conflation of ID✗ and ID✓. Moreover, the relative SCOD performance of different methods varies with the proportion of OOD data found in the test distribution, the relative cost of accepting ID✗ vs OOD, as well as the distribution from which the OOD data samples are drawn.

2. We propose a novel method targeting SCOD, Softmax Information Retaining Combination (SIRC). Our approach aims to improve the OOD|ID✓ separation of softmax-based confidence scores, by combining them with a secondary, class-agnostic confidence score, whilst retaining their ability to identify ID✗. It consistently outperforms or matches the baseline maximum softmax probability (MSP) approach over a wide variety of OOD datasets and convolutional neural network (CNN) architectures. On the other hand, existing OOD detection methods fail to achieve this.

3. We find that the secondary scores investigated for SIRC perform inconsistently over different OOD datasets. That is to say, a given secondary score may improve SCOD for some OOD datasets, but will not help on others. Also, different scores appear to be better suited to detecting different distribution shifts. Thus, we extend SIRC to incorporate a combination of multiple secondary scores (SIRC+). This results in generally even better SCOD performance, as well as more consistent performance gains over a wider range of OOD data.
A preliminary version of this work has been published in ACCV 2022 (Xia & Bouganis, 2022a), which covers points 1 and 2. In this work, we extend the aforementioned preliminary version, most notably through SIRC+, which incorporates multiple secondary scores (point 3).
2 Preliminaries
Neural Network Classifier For a K-class classification problem we learn the parameters \(\varvec{\theta }\) of a discriminative model \(P(y\mid \varvec{x};\varvec{\theta })\) over labels \(y \in \mathcal Y = \{\omega _k\}_{k=1}^K\) given inputs \(\varvec{x} \in \mathcal X = \mathbb R^D\), using finite training dataset \(\mathcal D_\text {tr} = \{y^{(n)}, \varvec{x}^{(n)}\}_{n=1}^{N}\) sampled independently from true joint data distribution \(p_\text {tr}(y, \varvec{x})\). This is done in order to make predictions \(\hat{y}\) given new inputs \(\varvec{x}^* \sim p_\text {tr}(\varvec{x})\) with unknown labels,
where f refers to the classifier function. In our case, the parameters \(\varvec{\theta }\) belong to a deep neural network with categorical softmax output \(\varvec{\pi }\in [0, 1]^K\),
where the logits \(\varvec{v} = \varvec{W} \varvec{z} + \varvec{b} \quad (\in \mathbb R^K)\) are the output of the final fully-connected layer with weights \(\varvec{W} \in \mathbb R^{K\times L}\), bias \(\varvec{b} \in \mathbb R^K\), and final hidden layer features \(\varvec{z} \in \mathbb R^L\) as inputs. Typically \(\varvec{\theta }\) are learnt by minimising the cross entropy loss, such that the model approximates the true conditional distribution \(P_\text {tr}(y\mid \varvec{x})\),
where \(\delta (\cdot , \cdot )\) is the Kronecker delta, A is a constant with respect to \(\varvec{\theta }\) and KL\([\cdot \Vert \cdot ]\) is the Kullback–Leibler divergence.
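To make the setup above concrete, the classification head (logits \(\varvec{v} = \varvec{W}\varvec{z} + \varvec{b}\) followed by a softmax and an argmax prediction) can be sketched in a few lines of NumPy; the function names and toy dimensions below are illustrative, not from the paper:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over logits v in R^K."""
    e = np.exp(v - v.max())
    return e / e.sum()

def classify(z, W, b):
    """Final fully-connected layer plus argmax prediction.
    z: features in R^L, W: (K, L) weights, b: (K,) bias."""
    v = W @ z + b       # logits
    pi = softmax(v)     # categorical softmax output, sums to 1
    return int(np.argmax(pi)), pi
```

Here the returned class index plays the role of the prediction \(\hat{y}\) in Eq. (1).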
Selective Classification A selective classifier (El-Yaniv & Wiener, 2010) can be formulated as a pair of functions, the aforementioned classifier \(f(\varvec{x})\) [in our case given by Eq. (1)] that produces a prediction \(\hat{y}\), and a binary rejection function
where t is an operating threshold and S is a scoring function which is typically a measure of predictive confidence (or \(-S\) measures uncertainty). Intuitively, a selective classifier chooses to reject if it is uncertain about a prediction.
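The selective classifier \((f, g)\) defined above is then just a thin wrapper pairing the classifier with the rejection function (a hypothetical sketch; names are illustrative):

```python
def selective_classify(x, f, S, t):
    """Selective classifier (f, g): return the prediction f(x) if the
    confidence score S(x) clears the operating threshold t; otherwise
    reject the sample (g = 1), signalled here by returning None."""
    return f(x) if S(x) >= t else None
```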
2.1 Problem Setting: Selective Classification with OOD Data (SCOD)
We consider a scenario where, during deployment, classifier inputs \(\varvec{x}^*\) may be drawn from either the training distribution \(p_\text {tr}(\varvec{x})\) (ID) or another distribution \(p_\text {OOD}(\varvec{x})\) (OOD). That is to say,
where \(\alpha \in [0, 1]\) reflects the proportion of ID to OOD data found in the wild. Here “Out-of-Distribution” inputs are defined as those drawn from a distribution with a label space that does not intersect the training label space \(\mathcal Y\) (Yang et al., 2021). For example, an image of a car is considered OOD for a CNN classifier trained to discriminate between different types of pets. We use this definition as it means that OOD samples are fundamentally incompatible with the primary classifier, and any classification predictions made on them will be automatically invalid. Note that in our case we assume no knowledge of \(p_\text {OOD}\) before deployment.
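As a toy illustration of this mixture, each deployment input can be drawn from the ID pool with probability \(\alpha \) and from the OOD pool otherwise (the function and its array-based setup are hypothetical, for illustration only):

```python
import numpy as np

def sample_deployment(id_pool, ood_pool, alpha, n, rng=None):
    """Simulate n inputs from the mixture alpha * p_tr + (1 - alpha) * p_OOD."""
    rng = rng or np.random.default_rng(0)
    is_id = rng.random(n) < alpha                  # True -> draw from the ID pool
    id_idx = rng.integers(0, len(id_pool), n)
    ood_idx = rng.integers(0, len(ood_pool), n)
    samples = [id_pool[i] if flag else ood_pool[j]
               for flag, i, j in zip(is_id, id_idx, ood_idx)]
    return samples, is_id
```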
We now define the predictive loss on an accepted sample as
for classifier \(f(\varvec{x})\) [Eq. (1)], where \(\beta \in [0, 1]\). We define the selective risk as in (Geifman & El-Yaniv, 2017),
which can be intuitively understood as the average loss of only the accepted samples, when using rejection function \(g(\varvec{x};t)\) [Eq. (4)]. We are only concerned with the relative cost of ID✗ and OOD samples, so we use a single parameter \(\beta \).
The objective is to find a classifier and rejection function (f, g) that minimise R(f, g; t) for some given setting of t. We focus on comparing post-hoc (after-training) methods in this work, where g (or equivalently S) is varied with f fixed. This removes confounding factors that may arise from the interactions of different training-based and post-hoc methods, as they can often be freely combined. In practice, both \(\alpha \) and \(\beta \) will depend on the deployment scenario. However, whilst \(\beta \) can be set freely by the practitioner depending on their own evaluation of costs, \(\alpha \) is outside of the practitioner’s control and their knowledge of it is likely to be very limited.
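The empirical selective risk can be estimated as sketched below, assuming (consistent with the single-parameter \(\beta \) above) a per-sample cost of \(\beta \) for an accepted ID misclassification, \(1-\beta \) for an accepted OOD sample, and 0 for an accepted correct prediction; the integer label encoding is hypothetical:

```python
import numpy as np

def selective_risk(scores, kinds, t, beta=0.5):
    """Average loss over accepted samples only (scores >= t are accepted).
    kinds: 0 = ID correct, 1 = ID misclassified, 2 = OOD (assumed encoding)."""
    scores, kinds = np.asarray(scores), np.asarray(kinds)
    accepted = scores >= t
    if not accepted.any():
        return 0.0  # nothing accepted -> no incurred loss
    loss = np.where(kinds == 1, beta, np.where(kinds == 2, 1.0 - beta, 0.0))
    return float(loss[accepted].mean())
```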
It is worth contrasting the SCOD problem setting with OOD detection. SCOD aims to separate OOD, ID✗ |ID✓, whilst for OOD detection the data is grouped as OOD|ID✗, ID✓ (see Fig. 1). The key difference is in the categorisation of ID✗.
SCOD and Types of Uncertainty We note that previous work (Kendall & Gal, 2017; Malinin & Gales, 2018; Malinin et al., 2020; Mukhoti et al., 2021; Pearce et al., 2021) refer to different types of predictive uncertainty, namely aleatoric and epistemic. The former arises from uncertainty inherent in the data (i.e. the true conditional distribution \(P_\text {tr}(y\mid \varvec{x})\)) and as such is irreducible, whilst the latter can be reduced by having the model learn from additional data. Typically, it is argued that it is useful to distinguish these types of uncertainty at prediction time. Epistemic uncertainty estimates should indicate distributional shift away from the training distribution, i.e. whether a test input \(\varvec{x}^*\) is OOD. On the other hand, aleatoric uncertainty estimates should reflect the level of class ambiguity of an ID input. An interesting result within our problem setting is that the conflation of these different types of uncertainties may not be an issue, as there is no need to separate ID✗ from OOD, as both should be rejected.
3 Existing OOD Detectors Applied to SCOD
As the explicit objective of OOD detection differs from that of SCOD, it is of interest to understand how existing detection methods behave for SCOD. Previous work (Kim et al., 2021) has empirically shown that some existing OOD detection approaches do not perform well for SCOD, and in this section we shed additional light on why this is the case.
Improving Performance: OOD Detection vs SCOD In order to build an intuition, we can consider, qualitatively, how detection methods can improve performance over a baseline, with respect to the distributions of OOD and ID✗ relative to ID✓. This is illustrated in Fig. 2.
- For OOD detection the objective is to better separate the distributions of ID and OOD data. Thus, we can either find a confidence score S that, compared to the baseline, has OOD distributed further away from ID✓, and/or has ID✗ distributed closer to ID✓.
- For improving SCOD, we want both OOD and ID✗ to be distributed further away from ID✓ than the baseline.
Thus, there is a conflict between the two tasks: the desired behaviour of the confidence score S with respect to the distribution of ID✗ differs between them.
Existing Approaches Sacrifice SCOD by Conflating ID✗ and ID✓
Considering post-hoc methods, the generally accepted baseline approach for both selective classification and OOD detection is the Maximum Softmax Probability (MSP) (Hendrycks & Gimpel, 2017; Geifman & El-Yaniv, 2017) confidence score. Improvements in OOD detection are often achieved by moving away from the softmax \(\varvec{\pi }\) in order to better capture the differences between ID and OOD data. Confidence scores such as Energy (Liu et al., 2020b) and Max Logit (Hendrycks et al., 2022) consider the logits \(\varvec{v}\) directly, whereas the Mahalanobis detector (Lee et al., 2018) and DDU (Mukhoti et al., 2021) build generative models using Gaussians over the features \(\varvec{z}\). ViM (Wang et al., 2022) and Gradnorm (Huang et al., 2021) incorporate class-agnostic, feature-based information into their scores.
Recall that typically a neural network classifier learns a model \(P(y\mid \varvec{x};\varvec{\theta })\) to approximate the true conditional distribution \(P_\text {tr}(y\mid \varvec{x})\) of the training data [Eqs. (2) and (3), Sect. 2]. As such, scores S extracted from the softmax outputs \(\varvec{\pi }\) should best reflect how likely a classifier prediction on ID data is going to be correct or not (and this is indeed the case in our experiments in Sect. 5). As the above (post-hoc) OOD detection approaches all involve moving away from the modelled \(P(y\mid \varvec{x};\varvec{\theta })\), we would expect worse separation between ID✗ and ID✓ even if overall OOD is better distinguished from ID.
Figure 3 shows empirically how well different types of data are separated using MSP (\(\pi _\text {max}\)) and Energy (\(\log \sum _k\exp v_k\)), by plotting false positive rate (FPR) against true positive rate (TPR). Lower FPR indicates better separation of the negative class away from the positive class.
Although Energy has better OOD detection performance than MSP, this is actually because the separation between ID✗ and ID✓ is much smaller for Energy, so ID as a whole is better separated from OOD. On the other hand, the behaviour of OOD relative to ID✓ is not meaningfully different to the MSP baseline. Therefore, SCOD performance for Energy is worse in this case. Another way of looking at it would be that for OOD detection, MSP does worse as it conflates ID with OOD. However, this does not harm SCOD performance as much, as those ID samples that are confused with OOD are mostly incorrect anyway. The ID dataset is ImageNet-200 (Kim et al., 2021), the OOD dataset is iNaturalist (Huang & Li, 2021) and the model is ResNet-50 (He et al., 2016).
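Both scores compared above are cheap post-hoc functions of the logits; a minimal, numerically stabilised sketch (function names illustrative):

```python
import numpy as np

def msp_score(v):
    """Maximum Softmax Probability: pi_max computed from logits v."""
    e = np.exp(v - v.max())
    return float((e / e.sum()).max())

def energy_score(v):
    """Energy confidence score log(sum_k exp(v_k)), shifted by the max
    logit for numerical stability (higher = more confident)."""
    m = float(v.max())
    return m + float(np.log(np.exp(v - m).sum()))
```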
4 Targeting SCOD—Retaining Softmax Information
We would now like to develop an approach that is tailored to the task of SCOD. We have discussed how we expect softmax-based methods, such as MSP, to perform best for distinguishing ID✗ from ID✓, and how existing approaches for OOD detection improve over the baseline, in part, by sacrificing this. As such, to improve over the baseline for SCOD, we will aim to retain the ability to separate ID✗ from ID✓ whilst increasing the separation between OOD and ID✓.
Combining Confidence Scores Inspired by Gradnorm (Huang et al., 2021) and ViM (Wang et al., 2022) we consider the combination of two different confidence scores \(S_1, S_2\). We shall consider \(S_1\) our primary score, which we wish to augment by incorporating \(S_2\). For \(S_1\) we investigate scores that are strong for selective classification on ID data, but are also capable of detecting OOD data—MSP and (the negative of) softmax entropy, \((-)\mathcal H[\varvec{\pi }]\). For \(S_2\), the score should be useful in addition to \(S_1\) in determining whether data is OOD or not. We should consider scores that capture different information about OOD data to the post-softmax \(S_1\) if we want to improve OOD|ID✓. We choose to examine the \(l_1\)-norm of the feature vector \(\Vert \varvec{z}\Vert _1\) (Huang et al., 2021), the negative of the Residual score \(-\Vert \varvec{z}^{P^\bot }\Vert _2\) (Wang et al., 2022) and the negative of the k-th nearest neighbour distance (KNN) (Sun et al., 2022). These scores were chosen as they capture class-agnostic information at the feature level. Note that although \(\Vert \varvec{z}\Vert _1\), Residual and KNN have previously been shown to be useful for OOD detection (Huang et al., 2021; Wang et al., 2022; Sun et al., 2022), we do not expect them to be useful for identifying misclassifications. They are separate from the classification layer defined by \((\varvec{W}, \varvec{b})\), so they are far removed from the categorical \(P(y\mid \varvec{x};\varvec{\theta })\) explicitly modelled by the softmax.
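Two of the class-agnostic secondary scores \(S_2\) considered above can be sketched as follows. Note this is a simplified illustration: in particular, the original KNN method of Sun et al. (2022) \(l_2\)-normalises features before computing distances, which is omitted here for brevity:

```python
import numpy as np

def l1_norm_score(z):
    """Feature-norm score ||z||_1 (Huang et al., 2021)."""
    return float(np.abs(z).sum())

def knn_score(z, train_feats, k=10):
    """Negative distance to the k-th nearest ID training feature
    (Sun et al., 2022), simplified: no feature normalisation here.
    train_feats: (N, L) array of ID training features."""
    d = np.linalg.norm(train_feats - z, axis=1)
    return -float(np.sort(d)[k - 1])
```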
Softmax Information Retaining Combination (SIRC) We want to create a combined confidence score \(C(S_1, S_2)\) that retains \(S_1\)’s ability to distinguish ID✗ |ID✓ but is also able to incorporate \(S_2\) in order to augment OOD|ID✓. We develop our approach based on the following set of assumptions about the behaviour of \(S_1\) and \(S_2\):
- \(S_1\) will be higher for ID✓ and lower for ID✗ and OOD.
- \(S_1\) is bounded above by a maximum value \(S_1^\text {max}\).
- \(S_2\) is unable to distinguish ID✗ |ID✓ well, but is lower for OOD compared to ID.
- \(S_2\) is useful in addition to \(S_1\) for separating OOD|ID.
These assumptions are illustrated roughly in Fig. 4. We expect our choices of \(S_1\) (MSP, \(-\mathcal {H}\)) and \(S_2\) (\(\Vert \varvec{z}\Vert _1\), Res., KNN) to conform to these assumptions for the reasons stated earlier. Moreover, future choices of confidence score should conform as well.
Given the aforementioned assumptions, we propose to combine \(S_1\) and \(S_2\) using
or equivalently, taking logs,
where a, b are parameters chosen by the practitioner. The idea is for the accept/reject decision boundary of C to be in the shape of a sigmoid on the \((S_1, S_2)\)-plane (see Figs. 5, 6). As such the behaviour of only using the softmax-based \(S_1\) is recovered for ID✗ |ID✓ for high \(S_2\), as the decision boundary tends to a vertical line. However, C becomes increasingly sensitive to \(S_2\) as \(S_2\) decreases, and less sensitive to \(S_1\) as \(S_1\) decreases (Fig. 5). This allows for improved OOD|ID✓ as \(S_2\) is “activated” towards the bottom left of the (\(S_1, S_2\))-plane. We term this approach Softmax Information Retaining Combination (SIRC).
The parameters a, b allow the method to be adjusted to different distributional properties of \(S_2\). Rearranging Eq. (8),
we see that a controls the placement of the sigmoid with respect to \(S_2\), and b the sensitivity of the sigmoid to \(S_2\). Figure 5 shows that the sensitivity of SIRC to \(S_2\) (gradient) increases from zero as \(S_2\) approaches a from above, and then tends to a linear relationship (constant sensitivity proportional to b).
We use the empirical mean and standard deviation of \(S_2\), \(\mu _{S_2}, \sigma _{S_2}\) on ID data (training or validation) to set the parameters. We choose \(a = \mu _{S_2}-3\sigma _{S_2}\) so the centre of the sigmoid is below the ID distribution of \(S_2\), and we set \(b=1/\sigma _{S_2}\), to match the ID variations of \(S_2\). We find the above approach to be empirically effective, however, other parameter settings are of course possible. Practitioners are free to tune a, b however they see fit. This may be done using only ID data (training or validation) as we have, or by additionally using synthetic validation OOD data (Hendrycks et al., 2019; Sun et al., 2022).
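A sketch of SIRC and the parameter-setting rule just described. The functional form below, \(C(S_1, S_2) = -(S_1^\text {max} - S_1)(1 + \exp (-b[S_2 - a]))\), follows the published SIRC formulation; treat the code as illustrative rather than a reference implementation:

```python
import numpy as np

def sirc(s1, s2, a, b, s1_max=1.0):
    """Softmax Information Retaining Combination.
    For large s2 the sigmoid gate vanishes and C ~ -(s1_max - s1),
    recovering the ranking of the primary score s1; as s2 drops towards a,
    the combined score becomes increasingly sensitive to s2."""
    return -(s1_max - s1) * (1.0 + np.exp(-b * (s2 - a)))

def sirc_params(s2_id):
    """Set (a, b) from ID statistics of s2: a = mean - 3*std places the
    sigmoid centre below the ID distribution of s2, b = 1/std matches
    the ID variation of s2."""
    mu, sigma = float(np.mean(s2_id)), float(np.std(s2_id))
    return mu - 3.0 * sigma, 1.0 / sigma
```

For MSP, \(S_1^\text {max} = 1\); for negative softmax entropy, \(S_1^\text {max} = 0\).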
SIRC Compared to Other Combination Approaches Fig. 6 compares different methods of combination by plotting ID✓, ID✗ and OOD data densities on the \((S_1, S_2)\)-plane. Other than SIRC we consider the combination methods used in ViM, \(C=S_1 + cS_2\), where c is a user-set parameter, and in Gradnorm, \(C=S_1 S_2\). The overlaid contours of C represent decision boundaries for values of t [Eq. (4)].
We see that the linear decision boundary of \(C=S_1 + cS_2\) must trade off significant performance in ID✗ |ID✓ in order to gain OOD|ID✓ (through varying c), whilst \(C=S_1 S_2\) sacrifices the ability to separate ID✗ |ID✓ well for higher values of \(S_1\). We also note that \(C=S_1S_2\) is not robust to different ID means of \(S_2\). For example, arbitrarily adding a constant D to \(S_2\) will completely change the behaviour of the combined score. On the other hand, SIRC is designed to be robust to this sort of variation between different \(S_2\). Figure 6 also shows an alternative parameter setting for SIRC, where a is lower and b is higher: the sigmoid is shifted down and is steeper. Here more of the behaviour of only using \(S_1\) is preserved, but \(S_2\) contributes less. It is also empirically observable that the assumption that \(S_2\) (in this case \(\Vert \varvec{z}\Vert _1\)) is not useful for distinguishing ID✓ from ID✗ holds, and in practice this can be verified on ID validation data when selecting \(S_2\).
We also note that although we have chosen specific \(S_1, S_2\) in this work, SIRC can be applied to any S that satisfy the above assumptions. It is a combination method, rather than a specific confidence score. As such it has the potential to improve beyond the results we present, especially given the rapid pace of development of new confidence scores for uncertainty estimation.
Limitations We note that one limitation of SIRC is that it does not aim to improve ID✗ |ID✓, only OOD|ID✓. Moreover, although the approach aims to limit this effect, we expect inevitable minor degradation in ID✗ |ID✓ as a result of including \(S_2\).
5 Experimental Results—SIRC
We present experiments across a range of CNN architectures and ImageNet-scale OOD datasets. Extended results can be found in Appendix B.
Data For our ID dataset we use ImageNet-200 (Kim et al., 2021), which contains a subset of 200 ImageNet-1k (Russakovsky et al., 2015) classes. It has separate training, validation and test sets. We use a variety of OOD datasets for our evaluation that display a wide range of semantics and difficulty in being identified. Near-ImageNet-200 (Near-IN-200) (Kim et al., 2021) is constructed from remaining ImageNet-1k classes semantically similar to ImageNet-200, so it is especially challenging to detect. Caltech-45 (Kim et al., 2021) is a subset of the Caltech-256 (Griffin et al., 2007) dataset with non-overlapping classes to ImageNet-200. Openimage-O (Wang et al., 2022) is a subset of the Open Images V3 (Krasin et al., 2017) dataset selected to be OOD with respect to ImageNet-1k. iNaturalist (Huang & Li, 2021) and Textures (Wang et al., 2022) are analogously constructed from their respective source datasets (Van Horn et al., 2017; Cimpoi et al., 2014). SpaceNet (Etten et al., 2018) contains satellite images of Rio De Janeiro. Colorectal (Kather et al., 2016) is a collection of histological images of human colorectal cancer, whilst Colonoscopy is a dataset of frames taken from colonoscopic video of gastrointestinal lesions (Mesejo et al., 2016). Noise is a dataset of square images where the resolution, contrast and pixel values are randomly generated (for details see Appendix A.2). Finally, ImageNet-O (Hendrycks et al., 2021) is a dataset OOD to ImageNet-1k that is adversarially constructed using a trained ResNet. Note that we exclude a number of OOD datasets from Kim et al. (2021) and Huang and Li (2021) as a result of discovering samples within said datasets that match ID labels.
Models and Training We train ResNet-50 (He et al., 2016), DenseNet-121 (Huang et al., 2017) and MobileNetV2 (Sandler et al., 2018) using hyperparameters based around standard ImageNet settings. Full training details can be found in Appendix A.1. For each architecture, we train 5 models independently using random seeds \(\{1, \dots , 5\}\) and report the mean result over the runs. Appendix B additionally contains results on single pre-trained ImageNet-1k models, BiT ResNetV2-101 (Kolesnikov et al., 2020) and PyTorch DenseNet-121.
Detection Methods for SCOD We consider six variations of SIRC using the components {MSP, \(\mathcal H\)} \(\times \) {\(\Vert \varvec{z}\Vert _1, \)Residual, KNN}, as well as the components individually. We additionally evaluate various existing post-hoc methods: MSP (Hendrycks & Gimpel, 2017), Energy (Liu et al., 2020b), ViM (Wang et al., 2022) and Gradnorm (Huang et al., 2021). For the Residual score (used in SIRC and ViM) we use the full ID ImageNet-200 train set to determine parameters. For KNN we sample 12,500 feature vectors from the training set and use \(k=10\). Results for additional approaches, as well as further details pertaining to the methods, can be found in Appendices B and A.3.
5.1 Evaluation Metrics
For evaluating different scoring functions S for the SCOD problem setting we consider a number of metrics. Arrows (\(\uparrow \downarrow \)) indicate whether higher/lower is better (For graphical illustrations and additional metrics see Appendix A.4).
Area Under the Risk-Recall curve (AURR)\(\downarrow \) We consider how empirical risk [Eq. (7)] varies with recall of ID✓, and aggregate performance over different t by calculating the area under the curve. As recall is only measured over ID✓, the base accuracy of f is not properly taken into account. Thus, this metric is only suitable for comparing different g with f fixed. To give an illustrative example, an (f, g) pair where the classifier f is only able to produce a single correct prediction will have perfect AURR, as long as S assigns that correct prediction the highest confidence (lowest uncertainty) score. Note that results for the AURC metric (Kim et al., 2021; Geifman et al., 2019) can be found in Appendix B, although we omit them from the main paper as they are not notably different to AURR.
Risk@Recall=0.95 (Risk@95)\(\downarrow \) Since a rejection threshold t must be selected at deployment, we also consider a particular setting of t such that 95% of ID✓ is recalled. In practice, the corresponding value of t could be found on a labelled ID validation set before deployment, without the use of any OOD data. It is worth noting that differences tend to be greater for this metric between different S as it operates around the tail of the positive class.
Area Under the ROC Curve (AUROC)\(\uparrow \) Since we are interested in rejecting both ID✗ and OOD, we can consider ID✓ as the positive class, and ID✗, OOD as separate negative classes. Then we can evaluate the AUROC of OOD|ID✓ and ID✗ |ID✓ independently. The AUROC for a specific value of \(\alpha \) would then be a weighted average of the two different AUROCs. This is not a direct measure of risk, but does measure the separation between different empirical distributions. Note that due to similar reasons to AURR this method is only valid for fixed f.
False Positive Rate@Recall=0.95 (FPR@95)\(\downarrow \) FPR@95 is similar to AUROC, but is taken at a specific t. It measures the proportion of the negative class accepted when the recall of the positive class (or true positive rate) is 0.95.
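FPR@95 can be computed by thresholding at the score that recalls 95% of the positive class; a minimal sketch (function name illustrative):

```python
import numpy as np

def fpr_at_recall(pos_scores, neg_scores, recall=0.95):
    """Fraction of negatives accepted at the threshold that accepts
    `recall` of the positives (lower is better)."""
    t = np.quantile(np.asarray(pos_scores), 1.0 - recall)  # threshold recalling 95% of positives
    return float(np.mean(np.asarray(neg_scores) >= t))
```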
5.2 Separation of ID✗ |ID✓ and OOD|ID✓ Independently
Table 1 shows %AUROC and %FPR@0.95 with ID✓ as the positive class and ID✗, OOD independently as different negative classes (see Sect. 5.1). It is important for a confidence score to have strong ID✗ |ID✓ performance, as ID✗ will always be present regardless of the volume or type of OOD data. It is also important for a confidence score to perform consistently over different OOD data, as we assume no knowledge at the time of deployment of what distribution shifts may occur.
In general, we see that SIRC, compared to \(S_1\), is able to improve OOD|ID✓ whilst incurring only a small (\(<0.2\)%AUROC) reduction in the ability to distinguish ID✗ |ID✓, across all 3 architectures. On the other hand, non-softmax methods designed for OOD detection show poor ability to identify ID✗, with performance ranging from \(\sim \)8 %AUROC worse than MSP down to \(\sim \)50% AUROC (equivalent to random guessing). Furthermore, they cannot consistently outperform the baseline when separating OOD|ID✓, in line with the discussion in Sect. 3.
We note that in some cases SIRC slightly improves ID✗ |ID✓, however, the impact is minimal and inconsistent over model architectures and \(S_2\). We provide some additional empirical analysis in Appendix B.1.1.
SIRC is Robust to Weak \(S_2\) Although for the majority of OOD datasets in Table 1 SIRC is able to outperform \(S_1\), this is not always the case. When SIRC does not provide a boost over \(S_1\), we can see that \(S_2\) individually is not useful for OOD|ID✓. For example, for ResNet-50 on Colonoscopy, Residual performs worse than random guessing. However, in cases like this the performance is still close to that of \(S_1\). As \(S_2\) will tend to be higher for these OOD datasets, the behaviour of SIRC is similar to that of \(S_1\) for ID✗ |ID✓, with the decision boundaries close to vertical (see Figs. 5, 8). As such SIRC is robust to \(S_2\) performing poorly, but is able to improve on \(S_1\) when \(S_2\) is of use. In comparison, ViM, which linearly combines Energy and Residual, is more sensitive to when the latter stumbles. This is shown in Fig. 8. On iNaturalist ViM has \(\sim \)25 %FPR@95 worse than Energy, whereas SIRC (\(-\mathcal H\), Res.) loses \(<0.5\)% compared to \(-\mathcal H\). Note that the issue of \(S_2\) being inconsistent is addressed in Sect. 6, where we further extend SIRC.
We additionally remark that regardless of the choice of \(S_2\), there is little to no improvement for Near-ImageNet-200. This suggests that softmax-based scores are best suited to capturing this type of distributional shift. For Near-ImageNet-200 the semantic shift from ImageNet-200 is purposely very small (e.g. “cricket” vs “grasshopper”), and there is no higher level overarching shift (e.g. photographs vs cartoons).
OOD Detection Methods are Inconsistent Over Different Data In Table 1 the performance of existing OOD detection methods relative to the MSP baseline varies considerably from dataset to dataset. This is directly illustrated in Fig. 7. Even though ViM is able to perform very well on Textures, Noise and ImageNet-O (>50 %FPR@95 better on Noise), it does worse than the baseline on many other OOD datasets (>20 %FPR@95 worse for Near-ImageNet-200 and iNaturalist). This suggests that the inductive biases incorporated, and assumptions made, when designing existing OOD detection methods may prevent them from generalising across a wider variety of OOD data. This behaviour is problematic as we assume no knowledge of the OOD data prior to deployment. In this case, a practitioner may be “unlucky” with the OOD data encountered and incur significant additional loss for choosing ViM over MSP.
In contrast, SIRC more consistently, albeit modestly, improves over the baseline (Fig. 7), due to its aforementioned robustness. These results suggest that methods designed to deal with OOD data should be evaluated on benchmarks that represent a wider range of distributional shifts than what is currently commonly found in the literature.
5.3 Varying the Importance of OOD Data Through \(\alpha \) and \(\beta \)
At deployment, there will be a specific ratio of ID:OOD data exposed to the model. Thus, it is of interest to investigate the risk over different values of \(\alpha \) (Eq. 5). Similarly, an incorrect ID prediction may or may not be more costly than a prediction on OOD data, so we also investigate different values of \(\beta \) (Eq. 6). Figure 9 shows how AURR and Risk@95 are affected as \(\alpha \) and \(\beta \) are varied independently (with the other fixed to 0.5). We use the full test set of ImageNet-200, and pool the OOD datasets together, sampling different quantities of data uniformly at random in order to achieve different values of \(\alpha \). We use 3 different groupings of OOD data: All, “Close” {Near-ImageNet-200, Caltech-45, Openimage-O, iNaturalist} and “Far” {Textures, SpaceNet, Colonoscopy, Colorectal, Noise}. These groupings are based on relative qualitative semantic difference to the ID dataset (see Appendix A.2 for example images from each dataset). Although the grouping is not formal, it serves to illustrate OOD-data-dependent differences in SCOD performance.
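To make the evaluation concrete, the following is a minimal sketch of how such a \(\beta \)-weighted empirical risk might be estimated at a fixed coverage. The function name and the exact weighting here are illustrative assumptions and should be checked against Eqs. (5) and (6); \(\alpha \) is realised implicitly by the ID:OOD composition of the arrays passed in.

```python
import numpy as np

def scod_risk_at_coverage(conf, is_ood, is_err, beta, coverage=0.95):
    """Illustrative sketch: accept the most confident `coverage`
    fraction of samples, then charge beta per accepted ID
    misclassification and (1 - beta) per accepted OOD prediction.
    alpha (the ID:OOD ratio) is set by how the inputs are composed."""
    thresh = np.quantile(conf, 1.0 - coverage)
    accepted = conf >= thresh
    id_err = accepted & ~is_ood & is_err   # accepted ID✗ predictions
    ood = accepted & is_ood                # accepted OOD predictions
    n_acc = max(int(accepted.sum()), 1)
    return (beta * id_err.sum() + (1.0 - beta) * ood.sum()) / n_acc
```

Sweeping `beta` (or re-sampling the OOD portion of the inputs to change the effective \(\alpha \)) then traces out curves of the kind shown in Fig. 9.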
Relative Performance of Methods Changes with \(\alpha \) and \(\beta \) At high \(\alpha \) and \(\beta \), where ID✗ dominates the risk, the MSP baseline performs best. However, as \(\alpha \) and \(\beta \) are decreased, and OOD data is introduced, we see that other methods improve relative to the baseline. There may be a crossover after which the ability to better distinguish OOD|ID✓ allows a method to surpass the baseline. Thus, which method to choose for deployment will depend on the practitioner’s setting of \(\beta \) and (if they have any knowledge of it at all) of \(\alpha \).
SIRC Most Consistently Improves Over the Baseline SIRC \((-\mathcal H, \text {Res.})\) is able to outperform the baseline most consistently over the different scenarios and settings of \(\alpha , \beta \), only doing worse for ID✗ dominated cases (\(\alpha , \beta \) close to 1). This is because SIRC has close to baseline ID✗ |ID✓ performance and is superior for OOD|ID✓ (Table 1). In comparison, ViM and Energy, which conflate ID✗ and ID✓, are often worse than the baseline for most (if not all) values of \(\alpha , \beta \). Their behaviour on the different groupings of data illustrates how these methods may be biased towards different OOD datasets, as they significantly outperform the baseline at lower \(\alpha \) for the “Far” grouping, but always do worse on “Close” OOD data.
5.4 Comparison Between SCOD and OOD Detection
Figure 10 shows the difference in %FPR@95 relative to the MSP baseline for different combinations of negative|positive data classes (ID✗ |ID✓, OOD|ID✓, OOD|ID), where OOD results are averaged over all datasets and training runs. In line with the discussion in Sect. 3, we observe that the non-softmax OOD detection methods are able to improve over the baseline for OOD|ID. However, this comes at the cost of significantly degraded ID✗ |ID✓, with only small improvements in OOD|ID✓. Thus their SCOD performance is poor compared to the MSP baseline. SIRC on the other hand is able to retain much more ID✗ |ID✓ performance whilst improving on OOD|ID✓, allowing it to have better OOD detection and SCOD performance compared to the baseline.
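Each of the pairings above (ID✗ |ID✓, OOD|ID✓, OOD|ID) reduces to the same binary detection problem, so one %FPR@95 routine can serve all three with the positive and negative sets swapped in as needed. A minimal sketch, assuming the standard convention that the threshold is set to accept 95% of the positive class:

```python
import numpy as np

def fpr_at_95_tpr(pos_scores, neg_scores):
    """False positive rate at the confidence threshold that accepts
    95% of the positive class (higher score = accept)."""
    # 5th percentile of positive scores -> 95% of positives accepted
    thresh = np.percentile(pos_scores, 5)
    return float(np.mean(neg_scores >= thresh))

# e.g. ID✗|ID✓ uses correct ID predictions as positives and ID
# misclassifications as negatives; OOD|ID✓ swaps in OOD samples
# as the negative set.
```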
6 Extending SIRC—Improving Performance over Diverse Distribution Shifts
A salient result from the previous section is that for certain OOD datasets, certain \(S_2\) fail to improve the OOD|ID✓ performance of SIRC compared to \(S_1\) by itself (e.g. Residual on iNaturalist in Table 1). SIRC is robust to scenarios where \(S_2\) fails, as its behaviour defaults to being similar to only using \(S_1\) (Sect. 5.2). However, ideally we want performance improvements over as wide a range of distribution shifts as possible. Furthermore, it appears that different \(S_2\) are better suited for different OOD datasets, so there is not necessarily a “best overall choice” for \(S_2\). This is further illustrated in Fig. 11, which shows the improvement of SIRC vs only \(S_1\) for different \(S_2\) and OOD datasets.
Additionally, each choice of secondary score {\(\Vert \varvec{z}\Vert _1, \)Residual, KNN} captures information about distributional shift in a different way. This suggests that by choosing only one, we are leaving information on the table that could be used to further improve SCOD performance. Consequently, we suggest an extension to SIRC, in order to:
1. improve the consistency of performance over a wider range of distribution shifts;
2. generally boost SCOD performance.
Using Multiple Secondary Scores Given we have access to a selection of options to use as \(S_2\), a natural question to ask is: can we combine the information from multiple secondary scores in order to achieve the above aims? We propose to extend Eq. 8, and the log version Eq. 9, to include \(M-1\) secondary scores,Footnote 8
$$\begin{aligned} \bar{C}_{\text {SIRC+}}(S_1, \dots , S_M) = -\log (S_1^{\max } - S_1) - \sum _{m=2}^{M} \log \left( 1 + \exp (-b_m[S_m - a_m])\right) . \end{aligned}$$
(12)
Fig. 5 can help with intuition for how the different components contribute in Eq. (12). Multiple secondary scores (right-hand plot) contribute additively. We refer to this extended version of SIRC as SIRC+.
More Consistent Improvements over Different Distribution Shifts By incorporating multiple secondary scores as in Eq. (12), the idea is that only a single secondary score in SIRC+ needs to contribute usefully in order for OOD|ID✓ to improve. As long as a single score moves into the “sensitive zone” past or around a (Fig. 5) for OOD data samples, then SCOD should improve compared to only using \(S_1\).
Thus, different secondary scores may be able to compensate for each other’s failures, resulting in more consistent improvements in SCOD over different OOD data. We aim to increase the likelihood of SIRC responding to an unknown distribution shift. In a sense, this approach is an attempt to “safeguard” against as wide a range of distribution shifts as possible, where we do not trust any single secondary score to be able to detect all shifts. This is illustrated for the Colonoscopy OOD dataset in Fig. 13. It shows how the additional useful information from the KNN score can be exploited to improve SCOD even if the Residual score fails to distinguish OOD from ID.
Generally Improved OOD|ID✓ Additionally, when multiple secondary scores react to a distribution shift, we intuitively expect the OOD|ID✓ performance of SIRC+ to be better than using the scores individually. If the different secondary scores provide different information about the distribution shift, then they should contribute in a complementary manner, further improving detection. This is illustrated for the Openimage-O OOD dataset in Fig. 12. KNN is lower for OOD given the value of Residual is known, meaning it is additionally useful for detection. Figure 13 then shows how SIRC+ is able to utilise the information in both scores together.
Note that by including more secondary scores in SIRC+, we do expect increased degradation in ID✗ |ID✓. Although SIRC is insensitive to secondary scores for ID✗ |ID✓ (for which we do not expect them to contribute useful information), we still expect the (slight) negative effects to add up as M [Eq. (12)] increases.
7 Experimental Results—SIRC+
We extend the evaluation in Sect. 5.2, where we consider ID✗ |ID✓ and OOD|ID✓ separately, to include SIRC+, where all 3 secondary scores are used together (\(-\mathcal H\), KNN, Res., \(\Vert \varvec{z}\Vert _1\)). Figure 14 shows, for ResNet-50, the difference in SCOD performance between \(-\mathcal H\) (only using \(S_1\)) and different variants of SIRC over the full range of OOD datasets. Full results for other architectures can be found in Appendix B, as well as tables in the format of Table 1 including SIRC+.
SIRC+ Improves over \(S_1\) More Consistently than SIRC Fig. 14 shows that, compared to SIRC with each individual \(S_2\), SIRC+ is able to more consistently boost SCOD performance over the whole range of OOD datasets. For example, for the two OOD datasets iNaturalist and Colonoscopy, SIRC with a single score (\(-\mathcal {H}\), Res.) is unable to improve over \(-\mathcal H\). This is because the Residual score fails to recognise samples from these two datasets as OOD. On the other hand, SIRC+ is able to leverage the information in the other two scores KNN and \(\Vert \varvec{z}\Vert _1\), leading to better SCOD performance, even if the Residual score fails.
SIRC+ Generally Improves SCOD Compared to SIRC For a number of OOD datasets (e.g. Openimage-O), Fig. 14 also shows that SIRC+ is able to achieve better SCOD performance compared to using any of the secondary scores by themselves. This is in line with the discussion in Sect. 6, supporting the idea that even better OOD|ID✓ performance can be achieved by combining multiple secondary scores.
We note that we also observe a slight increase in the degradation of ID✗ |ID✓ as expected. However, it is small compared to the improvements in OOD|ID✓, which we believe justifies this trade-off. This is shown in Fig. 15, which reproduces part of Fig. 9 and shows that SIRC+ is able to further improve SCOD over SIRC for the scenarios considered in Sect. 5.3.
8 Related Work
OOD Detection There is extensive existing research into OOD detection, a survey of which can be found in Yang et al. (2021). To improve over the MSP baseline in Hendrycks and Gimpel (2017), early post-hoc approaches, primarily experimenting on CIFAR-scale data, such as ODIN (Liang et al., 2018), Mahalanobis (Lee et al., 2018) and Energy (Liu et al., 2020b) explore how to extract non-softmax information from a trained network. They investigate the use of logits and features, as well as the idea of using input perturbations (inspired by the adversarial attacks literature (Goodfellow et al., 2015)).
More recent work has moved to larger-scale, higher-resolution image datasets (Huang & Li, 2021; Hendrycks et al., 2022; Wang et al., 2022), designed to reflect more realistic computer vision applications. Gradnorm (Huang et al., 2021), although motivated by the information in gradients, at its core combines information from the softmax and features together. Similarly, ViM (Wang et al., 2022) linearly combines Energy with the class-agnostic Residual score. ReAct (Sun et al., 2021) aims to improve logit/softmax-based scores by clamping the magnitude of final layer features. KNN (Sun et al., 2022) takes a non-parametric approach, using the distance to the k-th nearest ID neighbour of a test feature vector.
There are also many training-based approaches. Outlier Exposure (Hendrycks et al., 2019) explores training networks to be uncertain on “known” existing OOD data, so that this behaviour generalises to unseen test OOD data. On the other hand VOS (Du et al., 2022) instead generates virtual outliers during training for this purpose. Hsu et al. (2020) and Techapanurak et al. (2020) propose the network explicitly learn a scaling factor for the logits to improve softmax behaviour. There also exists a line of research that explores the use of generative models, \(p(\varvec{x};\varvec{\theta })\), for OOD detection (Caterini & Loaiza-Ganem, 2021; Zhang et al., 2021; Ren et al., 2019; Nalisnick et al., 2019). These approaches are separate from classification, however, so are less relevant to this work.
Selective Classification Selective classification, or misclassification detection, has also been investigated for deep learning scenarios. Initially examined in Geifman and El-Yaniv (2017) and Hendrycks and Gimpel (2017), there are a number of approaches to the task that target the classifier f through novel training losses and/or architectural adjustments (Moon et al., 2020; Corbière et al., 2019; Geifman & El-Yaniv, 2019). Post-hoc approaches are fewer. DOCTOR (Granese et al., 2021) provides theoretical justification for using the \(l_2\)-norm of the softmax output \(\Vert \varvec{\pi }\Vert _2\) as a confidence score for detecting misclassifications; however, we find its behaviour similar to that of MSP and \(\mathcal H\) (see Appendix B). The smaller advancement in the selective classification literature compared to OOD detection suggests that improving performance on this task is much more challenging. This makes sense given the discussion in Sect. 3: the MSP baseline works well for detecting ID✗ as the softmax directly models \(P(y\mid \varvec{x})\), but is inherently ill-suited to OOD detection as it tends to conflate ID✗ with OOD.
General Methods for Uncertainty Estimation There also exist general approaches for uncertainty estimation. These approaches are typically more broadly motivated and aim to improve the quality of uncertainties over a wider range of potential downstream objectives. Earlier methods place neural networks in a Bayesian framework (MacKay, 1995; Jospin et al., 2022), of which a popular and simple-to-implement approach is MC-Dropout (Gal & Ghahramani, 2016). Deep Ensembles (Lakshminarayanan et al., 2017), where multiple models are trained independently using different random seeds, can also be viewed as Bayesian (Wilson, 2020). They offer consistent, and therefore compelling improvements in downstream tasks (Ovadia et al., 2019; Xia & Bouganis, 2022b, 2023; Malinin & Gales, 2021), however, their costs scale linearly with the number of ensemble members. Dirichlet Networks (Malinin & Gales, 2018; Malinin et al., 2020; Ulmer et al., 2023) model a distribution over categorical distributions in order to capture different types of uncertainty. SNGP (Liu et al., 2020a) and DDU (Mukhoti et al., 2021) use spectral normalisation so that shifts in the input space better correspond to shifts in the output space.
Selective Classification with Distribution Shift Here we discuss the work most closely related to ours (some of which was published after the preliminary version of this paper (Xia & Bouganis, 2022a)). Kamath et al. (2020) investigate selective classification under covariate shift for the natural language processing task of question answering. In the case of covariate shift, valid predictions can still be produced on the shifted data, which by our definition is not possible for OOD data (see Sect. 2), so their problem setting differs from ours. They propose that g be a random forest classifier trained on a mixture of ID and covariate-shifted data, after f is fully trained.
Kim et al. (2021) introduce the idea that ID✗ and OOD data should be rejected together and investigate the performance of a range of existing approaches on an image-classification-based benchmark. They examine both training and post-hoc methods (comparing different f and g) on SCOD (which they term unknown detection). They also evaluate performance on misclassification detection and OOD detection independently. They find that Deep Ensembles (Lakshminarayanan et al., 2017) perform best overall. They do not provide a novel approach targeting SCOD, and consider a single setting of (\(\alpha , \beta \)), where the \(\alpha \) is not specified and \(\beta = 0.5\).
Jaeger et al. (2023) echo a similar sentiment to Kim et al. (2021), presenting a unified evaluation of selective classification with both OOD data and covariate-shifted data for image classification, without presenting a novel approach.
Cen et al. (2023) evaluate the SCOD performance of many approaches under different training regimes. They also propose a SIRC-inspired approach for a “few-shot” problem scenario, where a few OOD samples are available before deployment. They in fact benchmark SIRC and report strong results (see their Table 5). We note that whilst both Cen et al. (2023) and Jaeger et al. (2023) are concurrent work to ours, they do not propose any methods that directly compete with SIRC(+), and perform similar classification-based experiments to those in our work [and (Kim et al., 2021)].
9 Future Work
In the future, it would be valuable to explore the ideas in SCOD in problem settings such as Object Detection and Semantic Segmentation that include classification as a sub-task. These scenarios are more complex compared to our definition of SCOD in Sect. 2 for vanilla classification. For example, in the case of Object Detection with OOD objects (Dhamija et al., 2020; Du et al., 2022), one can imagine a scenario where it is desirable to reject OOD objects as non-objects alongside low-confidence class predictions (just like SCOD), for which a SIRC-like approach may be suitable. However, it may alternatively be desirable to specifically detect OOD objects as unknown objects with a corresponding bounding box, which would require a different style of approach. In the case of semantic segmentation with OOD objects (Hendrycks et al., 2022), there are complications arising from the need to separate uncertainty relating to the edges of objects and uncertainty relating to the overall class of an object. One can easily imagine a SCOD-like problem setting where incorrect pixel predictions on edges would be irrelevant, whereas object-level misclassifications/OOD samples need to be detected.
Additionally, selective prediction for regression problems under distributional shift (Malinin & Gales, 2021) is an underexplored problem setting currently. It could also be possible in this case to leverage methods similar to SIRC, that combine multiple confidence scores together.
10 Concluding Remarks
In this work, we consider the performance of existing methods for OOD detection on selective classification in the presence of out-of-distribution data (SCOD). We show how their improved OOD detection vs the MSP baseline often comes at the cost of inferior SCOD performance. Furthermore, we find their performance is inconsistent over different OOD datasets.
In order to improve SCOD performance over the baseline, we develop SIRC. Our approach aims to retain information useful for detecting misclassifications from a softmax-based confidence score, whilst incorporating additional information useful for identifying OOD samples from a secondary score. Experiments show that SIRC is able to consistently match or improve over the baseline approach for a wide range of datasets, CNN architectures and problem scenarios. Moreover, by extending SIRC to include information from multiple secondary scores, we are able to further improve overall SCOD performance, as well as the consistency of SIRC over different distribution shifts.
We hope this work encourages the further investigation of SCOD or other new problem settings that involve detecting or distinguishing distributional shifts during deployment.
Data Availability Statement
The data analysed in the study are available from the referenced authors. Instructions for obtaining all datasets can be found here: https://github.com/Guoxoug/SIRC.
Notes
In this work OOD data is defined as being disjoint from the label space of the training distribution (Yang et al., 2021).
\(\varvec{z}^{P^\bot }\) is the component of the feature vector that lies outside of a principal subspace calculated using ID data. For more details see Wang et al. (2022)’s paper.
This is the Euclidean distance between a test feature vector and its k-th nearest neighbour from an ID dataset. Both features are \(l_2\)-normalised. For details see Sun et al. (2022)’s paper.
This holds for our chosen \(S_1\) of \(\pi _\text {max}\) and \(-\mathcal H\).
For SCOD, we are only concerned with the rank ordering of confidence scores, and log is a monotonic function. This version is more numerically stable. We implemented it using the logaddexp function in PyTorch (Paszke et al., 2019).
Assuming < 100% test accuracy of course.
Note that the parameters \(a_m, b_m\) are found on a per-score basis in the same way as described in Sect. 4.
References
Caterini, A. L., & Loaiza-Ganem, G. (2021). Entropic issues in likelihood-based ood detection. arXiv:2109.10794.
Cen, J., Luan, D., Zhang, S., Pei, Y., Zhang, Y., Zhao, D., Shen, S., & Chen, Q. (2023). The devil is in the wrongly-classified samples: Towards unified open-set recognition. In The 11th international conference on learning representations.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In 2014 IEEE conference on computer vision and pattern recognition (pp. 3606–3613).
Corbière, C., Thome, N., Bar-Hen, A., Cord, M., & Pérez, P. (2019). Addressing failure prediction by learning model confidence. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems 32 (pp. 2902–2913). Curran Associates Inc.
Dhamija, A., Gunther, M., Ventura, J., & Boult, T. (2020). The overlooked elephant of object detection: Open set. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (WACV).
Du, X., Wang, Z., Cai, M., & Li, Y. (2022). VOS: Learning what you don’t know by virtual outlier synthesis. In The 10th international conference on learning representations, ICLR 2022, virtual event, April 25–29, 2022. OpenReview.net.
El-Yaniv, R., & Wiener, Y. (2010). On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11, 1605–1641.
Etten, A. V., Lindenbaum, D., & Bacastow, T. M. (2018). Spacenet: A remote sensing dataset and challenge series. arXiv:1807.01232.
Fort, S., Ren, J., & Lakshminarayanan, B. (2021). Exploring the limits of out-of-distribution detection. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Advances in neural information processing systems 34: Annual conference on neural information processing systems 2021, NeurIPS 2021, December 6–14, 2021, virtual (pp. 7068–7081). Curran Associates, Inc.
Gal, Y., & Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In M. F. Balcan, & K. Q. Weinberger (Eds.), Proceedings of the 33rd international conference on machine learning, volume 48 of proceedings of machine learning research (pp. 1050–1059). PMLR.
Geifman, Y., & El-Yaniv, R. (2017). Selective classification for deep neural networks. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., & Garnett, R., (Eds.), Advances in neural information processing systems, (Vol. 30). Curran Associates, Inc.
Geifman, Y., & El-Yaniv, R. (2019). Selectivenet: A deep neural network with an integrated reject option. In International conference on machine learning (pp. 2151–2159). PMLR.
Geifman, Y., Uziel, G., & El-Yaniv, R. (2019). Bias-reduced uncertainty estimation for deep neural classifiers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Y. Bengio & Y. LeCun (Eds.), 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, conference track proceedings.
Granese, F., Romanelli, M., Gorla, D., Palamidessi, C., & Piantanida, P. (2021). DOCTOR: A simple method for detecting misclassification errors. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Advances in neural information processing systems 34: Annual conference on neural information processing systems 2021, NeurIPS 2021, December 6–14, 2021, virtual (pp. 5669–5681). Curran Associates, Inc.
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).
Hendrycks, D., & Dietterich, T. G. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net.
Hendrycks, D., & Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24–26, 2017, conference track proceedings. OpenReview.net.
Hendrycks, D., Basart, S., Mazeika, M., Zou, A., Kwon, J., Mostajabi, M., Steinhardt, J., & Song, D. (2022). Scaling out-of-distribution detection for real-world settings. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th international conference on machine learning, volume 162 of proceedings of machine learning research (pp. 8759–8773). PMLR.
Hendrycks, D., Mazeika, M., & Dietterich, T. G. (2019). Deep anomaly detection with outlier exposure. In 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net.
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. X. (2021). Natural adversarial examples. 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 15257–15266).
Hsu, Y.-C., Shen, Y., Jin, H., & Kira, Z. (2020). Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10948–10957).
Huang, R., & Li, Y. (2021). Mos: Towards scaling out-of-distribution detection for large semantic space. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 8706–8715).
Huang, R., Geng, A., & Li, Y. (2021). On the importance of gradients for detecting distributional shifts in the wild. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Advances in neural information processing systems 34: Annual conference on neural information processing systems 2021, NeurIPS 2021, December 6–14, 2021, virtual (pp. 677–689). Curran Associates, Inc.
Huang, G., Liu, Z., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2261–2269).
Jaeger, P. F., Lüth, C. T., Klein, L., & Bungert, T. J. (2023). A call to reflect on evaluation practices for failure detection in image classification. In The 11th international conference on learning representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net.
Jospin, L. V., Laga, H., Boussaid, F., Buntine, W., & Bennamoun, M. (2022). Hands-on Bayesian neural networks: A tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2), 29–48.
Kamath, A., Jia, R., & Liang, P. (2020). Selective question answering under domain shift. In D. Jurafsky, J. Chai, N. Schluter, & J. R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, July 5–10, 2020 (pp. 5684–5696). Association for Computational Linguistics.
Kather, J. N., Weis, C.-A., Bianconi, F., Melchers, S. M., Schad, L. R., Gaiser, T., Marx, A., & Zöllner, F. G. (2016). Multi-class texture analysis in colorectal cancer histology. Scientific Reports, 6.
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of the 31st international conference on neural information processing systems, NIPS’17 (pp. 5580–5590). Curran Associates Inc.
Kim, J., Koo, J., & Hwang, S. (2021). A unified benchmark for the unknown detection capability of deep neural networks. arXiv:2112.00337.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., & Houlsby, N. (2020). Big transfer (bit): General visual representation learning. In A. Vedaldi, H. Bischof, T. Brox, & J. Frahm (Eds.), Computer vision–ECCV 2020–16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, Part V. Lecture notes in computer science (Vol. 12350, pp. 491–507). Springer.
Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., Veit, A., Belongie, S., Gomes, V., Gupta, A., Sun, C., Chechik, G., Cai, D., Feng, Z., Narayanan, D., & Murphy, K. (2017). Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages.
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems 30: Annual conference on neural information processing systems 2017, December 4–9, 2017, Long Beach, CA, USA (pp. 6402–6413). Curran Associates, Inc.
Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems 31: Annual conference on neural information processing systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada (pp. 7167–7177). Curran Associates, Inc.
Liang, S., Li, Y., & Srikant, R. (2018). Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, conference track proceedings. OpenReview.net.
Liu, J., Lin, Z., Padhy, S., Tran, D., Bedrax Weiss, T., & Lakshminarayanan, B. (2020a). Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin, (Eds.), Advances in neural information processing systems, (Vol. 33, pp. 7498–7512). Curran Associates, Inc.
Liu, W., Wang, X., Owens, J., & Li, Y. (2020b). Energy-based out-of-distribution detection. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 21464–21475). Curran Associates, Inc.
MacKay, D. J. C. (1995). Probable networks and plausible predictions—a review of practical bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3), 469.
Malinin, A., & Gales, M. J. F. (2018). Predictive uncertainty estimation via prior networks. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett, (Eds.), Advances in neural information processing systems 31: Annual conference on neural information processing systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada (pp. 7047–7058). Curran Associates, Inc.
Malinin, A., & Gales, M. J. F. (2021). Uncertainty estimation in autoregressive structured prediction. In 9th international conference on learning representations, ICLR 2021, virtual event, Austria, May 3–7, 2021. OpenReview.net.
Malinin, A., Band, N., Gal, Y., Gales, M., Ganshin, A., Chesnokov, G., Noskov, A., Ploskonosov, A., Prokhorenkova, L., Provilkov, I., Raina, V., Raina, V., Roginskiy, D., Shmatova, M., Tigas, P., & Yangel, B. (2021). Shifts: A dataset of real distributional shift across multiple large-scale tasks. In 35th conference on neural information processing systems datasets and benchmarks track (round 2).
Malinin, A., Mlodozeniec, B., & Gales, M. J. F. (2020). Ensemble distribution distillation. In 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net.
Mesejo, P., Pizarro, D., Abergel, A., Rouquette, O., Beorchia, S., Poincloux, L., & Bartoli, A. (2016). Computer-aided classification of gastrointestinal lesions in regular colonoscopy. IEEE Transactions on Medical Imaging, 35(9), 2051–2063.
Moon, J., Kim, J., Shin, Y., & Hwang, S. (2020). Confidence-aware learning for deep neural networks. In Proceedings of the 37th international conference on machine learning, ICML 2020, 13–18 July 2020, virtual event, volume 119 of Proceedings of machine learning research (pp. 7034–7044). PMLR.
Mukhoti, J., Kirsch, A., van Amersfoort, J. R., Torr, P. H. S., & Gal, Y. (2021). Deterministic neural networks with appropriate inductive biases capture epistemic and aleatoric uncertainty. arXiv:2102.11582.
Nalisnick, E. T., Matsukawa, A., Teh, Y. W., Görür, D., & Lakshminarayanan, B. (2019). Do deep generative models know what they don’t know? In 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., & Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in neural information processing systems (Vol. 32). Curran Associates, Inc.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems 32 (pp. 8024–8035). Curran Associates, Inc.
Pearce, T., Brintrup, A., & Zhu, J. (2021). Understanding softmax confidence and uncertainty. arXiv:2106.04972.
Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., Depristo, M., Dillon, J., & Lakshminarayanan, B. (2019). Likelihood ratios for out-of-distribution detection. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 32). Curran Associates, Inc.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252.
Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF conference on computer vision and pattern recognition (pp. 4510–4520).
Sun, Y., Guo, C., & Li, Y. (2021). React: Out-of-distribution detection with rectified activations. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Advances in neural information processing systems 34: Annual conference on neural information processing systems 2021, NeurIPS 2021, December 6–14, 2021, virtual (pp. 144–157). Curran Associates, Inc.
Sun, Y., Ming, Y., Zhu, X., & Li, Y. (2022). Out-of-distribution detection with deep nearest neighbors. In Proceedings of the 39th international conference on machine learning, ICML 2022, volume 162 of Proceedings of machine learning research (pp. 20827–20840). PMLR.
Techapanurak, E., Suganuma, M., & Okatani, T. (2020). Hyperparameter-free out-of-distribution detection using cosine similarity. In Proceedings of the Asian conference on computer vision (ACCV).
Ulmer, D. T., Hardmeier, C., & Frellsen, J. (2023). Prior and posterior networks: A survey on evidential deep learning methods for uncertainty estimation. In Transactions on machine learning research: OpenReview.net.
Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., & Belongie, S. (2017). The iNaturalist species classification and detection dataset.
Wang, H., Li, Z., Feng, L., & Zhang, W. (2022). Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4921–4930).
Wilson, A. G. (2020). The case for bayesian deep learning. arXiv:2001.10995.
Xia, G., & Bouganis, C.-S. (2022a). Augmenting softmax information for selective classification with out-of-distribution data. In Proceedings of the Asian conference on computer vision (ACCV) (pp. 1995–2012).
Xia, G., & Bouganis, C.-S. (2022b). On the usefulness of deep ensemble diversity for out-of-distribution detection. arXiv:2207.07517.
Xia, G., & Bouganis, C.-S. (2023). Window-based early-exit cascades for uncertainty estimation: When deep ensembles are more efficient than single models. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV).
Yang, J., Zhou, K., Li, Y., & Liu, Z. (2021). Generalized out-of-distribution detection: A survey. arXiv:2110.11334.
Zhang, M., Zhang, A., & McDonagh, S. (2021). On the out-of-distribution generalization of probabilistic image modelling. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Advances in neural information processing systems 34: Annual conference on neural information processing systems 2021, NeurIPS 2021, December 6–14, 2021, virtual (pp. 3811–3823). Curran Associates, Inc.
Acknowledgements
Guoxuan Xia is jointly funded by UK Research and Innovation and Arm Ltd.
Communicated by Lei Wang.
Appendices
Experimental Details
We present detailed information about our experimental setup. Our code is available at https://github.com/Guoxoug/SIRC.
1.1 Models and Training
For the main results we train ResNet-50 (He et al., 2016) using the default hyperparameters found in PyTorch’s examples.Footnote 9 We train on ImageNet-200 for 90 epochs with a batch size of 256. Stochastic gradient descent is used with a weight decay of \(10^{-4}\), a momentum of 0.9 and an initial learning rate of 0.1 that steps down by a factor of 10 at epochs 30 and 60. Images are augmented using RandomResizedCrop and RandomHorizontalFlip. MobileNetV2 (Sandler et al., 2018) uses the same settings, but with an initial learning rate of 0.05. DenseNet-121 is trained with the same settings as ResNet-50 but with Nesterov momentum, as per Huang et al. (2017). We perform 5 independent training runs for each architecture, with random seeds \(\{1,..., 5\}\).
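The stepped learning-rate schedule described above can be written out explicitly. This is a small illustrative helper, not the actual training code:

```python
def learning_rate(epoch, base_lr=0.1):
    """Initial rate 0.1 (0.05 for MobileNetV2), stepped down by a
    factor of 10 at epochs 30 and 60 over 90 epochs of training."""
    return base_lr * 0.1 ** sum(epoch >= m for m in (30, 60))
```

In PyTorch this corresponds to `optim.lr_scheduler.MultiStepLR` with `milestones=[30, 60]` and `gamma=0.1`.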
We additionally test two pre-trained ImageNet-1k models: ResNetV2-101 from Google’s Big TransferFootnote 10 (Kolesnikov et al., 2020), specifically BiT-S-R101x1, and the DenseNet-121 provided by PyTorch.Footnote 11 Note that the BiT model takes \(480\times 480\) images as input, whereas all other models take standard ImageNet-scale \(224\times 224\) images. When evaluating these models we exclude Near-ImageNet-200 and Caltech-45 due to class overlap with ImageNet-1k.
1.2 ImageNet-Scale Datasets
Figure 16 shows a number of random examples from each dataset introduced in Sect. 5, alongside the number of samples in said dataset. Below we describe the methodology for constructing Colonoscopy and Noise. For the remaining datasets please refer to their original papers for details (Huang & Li, 2021; Wang et al., 2022; Kim et al., 2021; Hendrycks et al., 2021; Kather et al., 2016; Etten et al., 2018). We note that there is a slight discrepancy between the number of samples reported in Kim et al. (2021) for ImageNet-200 and in the authors’ provided datasets,Footnote 12 but we do not believe this affects the validity of our results.
Noise We randomly generate 10,000 square images. All samples are generated independently. Within each image, every value (across space and RGB channels) is sampled from the same Gaussian distribution with mean 0.5. The standard deviation of this Gaussian differs between images: it is generated by sampling from a unit Gaussian and squaring the sample. Pixel values are then clipped to [0, 1] and mapped to 8-bit integers. The width of each image is sampled uniformly from \(\{2,..., 256\}\), and all images are scaled to \(256\times 256\) using the Lanczos interpolation method in PIL.Footnote 13 The resulting data thus varies in both scale and contrast (see Fig. 16).
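The generation procedure above can be sketched as follows. To keep the sketch dependency-free, the final Lanczos resize to \(256\times 256\) (done with PIL in practice) is noted in a comment rather than performed:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noise_array():
    """One Noise sample: Gaussian pixel values with mean 0.5 and a
    per-image standard deviation, clipped and quantised to 8 bits."""
    sigma = rng.standard_normal() ** 2           # std = squared unit-Gaussian sample
    w = int(rng.integers(2, 257))                # width uniform over {2, ..., 256}
    img = rng.normal(0.5, sigma, size=(w, w, 3))
    img = (np.clip(img, 0.0, 1.0) * 255).astype(np.uint8)
    # in practice: Image.fromarray(img).resize((256, 256), Image.LANCZOS)
    return img
```

Because the width varies before the upscale and the standard deviation varies per image, the resulting samples differ in both scale and contrast.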
Colonoscopy We extract individual frames from the videos provided in Mesejo et al. (2016).Footnote 14 We download the first 10 narrow band imaging (NBI) videos for each class of lesion (hyperplasic, serrated, adenoma) and save each frame as a separate image. Although the frames of a video are not independent, we treat them as such for the purposes of our investigation.
1.3 Confidence Scores
Below we detail all confidence scores S implemented and evaluated in our investigation, including additional approaches that were omitted from the main paper for the sake of brevity.
-
SIRC(+): for a description of the score see Sects. 4 and 6 in the main paper. We use the whole of the ImageNet-200 training set to determine the values of \(\mu _{S_2}, \sigma _{S_2}\). For ImageNet-1k we randomly sample 250,000 images from the training set. Note that for all following methods that require ID data to find parameters, we use the same ID data as for SIRC. We investigate combinations of \(S_1, S_2\) from the Cartesian product {MSP, DOCTOR, \(\mathcal H\)}\(\times \){\(\Vert \varvec{z}\Vert _1, \)Residual, KNN}, as well as the use of all secondary scores together for SIRC+.
-
Maximum Softmax Probability (MSP) (Hendrycks & Gimpel, 2017): a baseline score that takes the max value from the softmax \(\pi _\text {max} = \max _k \pi _k\).
-
DOCTOR (Granese et al., 2021): the original paper does not directly present it as such, but the confidence score is equivalent to \(\Vert \varvec{\pi }\Vert _2\).
-
Softmax entropy (\(\mathcal H\)): measures softmax uncertainty, \(\mathcal H[\varvec{\pi }] = -\sum _k \pi _k\log \pi _k\). We use \(S = -\mathcal H[\varvec{\pi }]\) to change it to a measure of confidence.
-
\(l_1\)-norm of the features: used in Gradnorm (Huang et al., 2021), \(\Vert \varvec{z}\Vert _1\).
-
Residual: used in ViM (Wang et al., 2022), this score measures the component of the feature vector lying outside of a principal subspace defined using ID data, \(\Vert \varvec{z}^{P^\bot }\Vert _2\). We follow Wang et al. (2022) in setting the dimensionality of the subspace to 1000 if the dimensionality of \(\varvec{z}\), \(L>1500\), and to 512 otherwise. Like Entropy, we use the negative \(S = -\Vert \varvec{z}^{P^\bot }\Vert _2\), as the score is designed to be higher for OOD data. Please refer to Wang et al. (2022) for full details.
-
KNN (Sun et al., 2022): a non-parametric approach that uses the Euclidean distance between a test feature vector \(\varvec{z}^*\) and its kth nearest neighbour in the training set. Both vectors are \(L_2\)-normalised, so this is equivalent to cosine similarity. Sun et al. (2022) subsample the training dataset to reduce search costs at inference, and proportionally scale k. We use a training subset of similar size (12,500 samples) for ImageNet-scale data, with \(k=10\).
-
Max Logit (Hendrycks et al., 2022): Max Logit is similar to MSP, but the score is taken from the logits before the softmax \(v_\text {max} = \max _k v_k\).
-
Energy (Liu et al., 2020b): this score aggregates over all logit values as \(\log \sum _k \exp v_k\).
-
Gradnorm (Huang et al., 2021): although this score was originally motivated by gradients, we can view it simply as the combination of two scores, \(C = \Vert \varvec{\pi }- \varvec{1}/K\Vert _1\Vert \varvec{z}\Vert _1\).
-
ViM (Wang et al., 2022): this linearly combines Energy and Residual, \(C = \log \sum _k \exp v_k - c\Vert \varvec{z}^{P^\bot }\Vert _2\). The parameter c is given by the average value of Max Logit divided by the average value of Residual on ID data, which scales the importance of Residual to be similar to that of Energy in the combination.
-
Mahalanobis (Lee et al., 2018): this score involves building a classwise gaussian mixture model over the features with tied covariance matrix. The confidence is then calculated as \( -\min _k (\varvec{z} - \varvec{\mu }_k)^T \tilde{\varvec{\Sigma }} (\varvec{z} - \varvec{\mu }_k)\). We use the approach in Wang et al. (2022) and Fort et al. (2021) where only the final layer features are considered.
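Most of the scores listed above can be computed directly from a model’s logits \(\varvec{v}\) and features \(\varvec{z}\). The following is a minimal NumPy sketch of the fully specified ones; the SIRC combination uses the functional form from Xia and Bouganis (2022a), with \(a = \mu _{S_2} - 3\sigma _{S_2}\) and \(b = 1/\sigma _{S_2}\), which we treat here as an assumption (the authoritative definition is Sect. 4 of the main paper):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def posthoc_scores(v, z):
    """Post-hoc confidence scores from logits v (N, K) and features
    z (N, L). Higher always means more confident."""
    pi = softmax(v)
    K = v.shape[-1]
    return {
        "MSP": pi.max(axis=-1),
        "DOCTOR": np.linalg.norm(pi, ord=2, axis=-1),
        # negated softmax entropy, so higher = more confident
        "NegEntropy": (pi * np.log(pi + 1e-12)).sum(axis=-1),
        "MaxLogit": v.max(axis=-1),
        # Energy: log-sum-exp of the logits
        "Energy": v.max(axis=-1) + np.log(
            np.exp(v - v.max(axis=-1, keepdims=True)).sum(axis=-1)),
        "L1Feat": np.abs(z).sum(axis=-1),
        "Gradnorm": np.abs(pi - 1.0 / K).sum(axis=-1)
                    * np.abs(z).sum(axis=-1),
    }

def sirc(s1, s2, s1_max, mu2, sigma2):
    """SIRC combination (form assumed from Xia & Bouganis, 2022a);
    mu2, sigma2 are the ID mean and std of the secondary score S2."""
    a, b = mu2 - 3.0 * sigma2, 1.0 / sigma2
    return -(s1_max - s1) * (1.0 + np.exp(-b * (s2 - a)))
```

Note that `sirc` is increasing in both arguments, but the influence of \(S_2\) saturates for \(S_2 \gg a\), which is what retains the \(S_1\) ranking on confident ID data.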
1.4 Evaluation Metrics
Other than the metrics specified in Sect. 5.1, we additionally use the Area Under the Risk-Coverage Curve (AURC)\(\downarrow \), from Kim et al. (2021) and Geifman and El-Yaniv (2017). It aggregates risk over all values of coverage, the proportion of all input data accepted. For AURC there exists an oracle curve, where OOD and ID✗ are perfectly disjoint from ID✓. AURC can be reduced either by lowering the oracle curve, i.e. reducing the number of ID✗ (increasing the baseline accuracy of f), or by better separating OOD, ID✗ |ID✓ (a better choice of g), bringing the curve closer to the oracle. Thus the metric is suitable for both training-based and post-hoc approaches. Figure 17 graphically illustrates some of the metrics we use to evaluate SCOD.
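Concretely, AURC can be computed by accepting samples in order of descending confidence and averaging the empirical risk over all coverage levels. A minimal sketch (the evaluation code in the linked repository is authoritative):

```python
import numpy as np

def aurc(confidence, risk):
    """Area under the risk-coverage curve. `risk` is 1 for a sample
    that should be rejected (ID-misclassified or OOD) and 0 for
    ID-correct; samples are accepted in order of descending confidence."""
    order = np.argsort(-confidence, kind="stable")
    coverage_counts = np.arange(1, len(risk) + 1)
    cum_risk = np.cumsum(risk[order]) / coverage_counts  # risk at each coverage
    return cum_risk.mean()
```

A confidence score that ranks all risky samples below all ID✓ samples attains the oracle curve; a reversed ranking gives the largest AURC for the same underlying classifier accuracy.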
Additional Results
We provide more complete versions of the results presented in Sect. 5 of the main work across all architectures and datasets.
1.1 AUROC and FPR@95
We present results across all post-hoc confidence scores in Appendix A.3 for all architectures in Tables 2, 3, 4 and 5. We also include the mean \(\pm 2\) SD for experiments with multiple training runs. SIRC performs as expected in all cases: a small reduction in ID✗ |ID✓ in exchange for a meaningful uplift in OOD|ID✓ compared to only using \(S_1\). SIRC+ offers further improvements in OOD|ID✓ over SIRC.
DOCTOR in general performs somewhere in between MSP and \(-\mathcal H\), both individually and when used in SIRC, so we relegate it to the appendix. We note that Residual and Mahalanobis perform well only for ResNetV2-101 [these results are in line with Wang et al. (2022)]; elsewhere the Mahalanobis detector performs poorly. This may be because BiT uses Weight Standardisation and Group Normalisation during training, rather than standard Batch Normalisation. Mukhoti et al. (2021) show that limiting the Lipschitz constant of the network during training improves the OOD detection performance of Gaussian mixture models, which may also be what is occurring here. There is non-negligible variance between training runs on a number of OOD datasets, highlighting the need to perform multiple training runs. Some datasets (e.g. Noise, Colorectal) have especially high variation.
1.1.1 Additional Analysis for SIRC on ID✗ |ID✓
We note that in some cases for ID✗ |ID✓ SIRC is able to slightly outperform \(S_1\) by itself, even when \(S_2\) has \(\le 50\%\) AUROC on its own (e.g. \(S_2=\)Res. in Tables 1 and 2). This is counter-intuitive, as \(S_2\) should then be harmful to performance. We provide some analysis to show that in some cases \(S_2\) is indeed useful for ID✗ |ID✓. We train a series of linear logistic classifiersFootnote 15 with (MSP, Res.) as the input, with different class weightings, on the test set of ImageNet-200. Figure 18 shows that for ResNet-50, better ID✗ |ID✓ can be achieved by (slightly) considering the value of Res. alongside MSP. However, it also shows that for high MSP, the ID✓ distribution of Res. has a significant tail of low confidence values. This tail has little effect when Res. is considered together with MSP, since MSP ID✓ is high and heavily weighted, but it reduces AUROC when Res. is considered by itself for ID✗ |ID✓.
Figure 18 and Tables 1 and 4 show that for DenseNet-121, Res. provides no benefit at all for ID✗ |ID✓. Generally, over different architectures and choices of \(S_2\), no secondary score is able to consistently help for ID✗ |ID✓. Moreover, any benefit is minimal and within the range of \(\pm 2\) SD. Thus we believe that \(S_2\) should only be considered for its contribution to OOD|ID✓ when deploying SIRC.
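A class-weighted linear logistic classifier of the kind used in this analysis can be sketched with plain NumPy. The learning rate, step count and weightings below are illustrative placeholders, not the settings behind Fig. 18:

```python
import numpy as np

def weighted_logistic_fit(X, y, w_pos=1.0, w_neg=1.0, lr=0.1, steps=2000):
    """Fit a linear logistic classifier on inputs such as (MSP, Res.)
    by gradient descent; per-class weights trade off the two classes
    (e.g. ID-correct vs ID-misclassified)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    theta = np.zeros(Xb.shape[1])
    sw = np.where(y == 1, w_pos, w_neg)         # per-sample class weights
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        grad = Xb.T @ (sw * (p - y)) / len(y)   # weighted logistic gradient
        theta -= lr * grad
    return theta

def predict(theta, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ theta))
```

Sweeping the class weights traces out different operating points, which is how a curve over weightings such as the one in Fig. 18 can be produced.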
1.2 Varying \(\alpha \) and \(\beta \)
We plot versions of Fig. 9 for all 3 ImageNet-200 architectures (Figs. 19, 20, 21). We also present the mean ± SD. The ability of SIRC to perform consistently better than the baseline generalises across the 3 different CNN architectures. We note that differences in AURC are harder to distinguish, due to the metric considering the proportion of all input data accepted, rather than just the recall of ID✓. The behaviour, however, is similar to AURR in terms of relative performance to the baseline, so we omit AURC from the main results.
1.3 SCOD vs OOD Detection
Similar to the previous section we include versions of Fig. 10 for all architectures and confidence scores (Figs. 22, 23, 24, 25, 26). The behaviour is as discussed in Sect. 5.4, with methods designed for OOD detection achieving gains over the baseline for OOD detection by sacrificing their ability to separate ID✗ |ID✓.
1.4 Plotting \(S_2\) against \(S_1\)
In a similar vein to Fig. 6, we plot different SIRC combinations on the \(S_1, S_2\)-plane for different experimental configurations (Figs. 27, 28, 29, 30). If there are multiple training runs, we plot the distributions corresponding to the outputs of the 1st run. Decision contours corresponding to the default parameter setting for SIRC are also overlaid. The inconsistency of Residual can be observed here: in some cases the OOD distribution is much lower than ID, whilst in others there is almost complete overlap. In the case of MobileNetV2 on iNaturalist it is in fact higher for OOD than ID, although the nature of SIRC means it is robust to such \(S_2\) failure (as discussed in Sect. 5.2).
1.5 SIRC+ on Other Architectures
As in Fig. 14, we plot the change in SCOD performance relative to only using \(S_1\) (\(-\mathcal {H}\)) for the CNN architectures DenseNet-121 and MobileNetV2 (Figs. 31, 32). The results tell a similar story to Sect. 7, with SIRC+ providing more consistent and overall greater improvements. We note that for DenseNet-121 there is a slightly larger drop in ID✗ |ID✓ performance for SIRC+ compared to the other two architectures. From the perspective of a practitioner, this cost should be visible on a validation set, and so the trade-off between ID✗ and OOD should be considered when choosing which version of SIRC to deploy.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xia, G., Bouganis, CS. Augmenting the Softmax with Additional Confidence Scores for Improved Selective Classification with Out-of-Distribution Data. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02029-3