1 Introduction

Document analysis is a field of research that deals with automating the process of reading, analyzing, and understanding business documents. Modern businesses rely heavily on business documents to communicate details of their internal and external transactions, which is critical to their efficiency and productivity. As large volumes of documents are produced on a daily basis, there is an urgent need today to automate the processing of these documents to facilitate tasks such as search, retrieval, and information extraction. However, automated processing of documents can be particularly challenging for a number of reasons, including high levels of data complexity [1], large inter-class similarity and intra-class variance [2], and corruption of scanned document data with various types of distortions [3].

To address the aforementioned challenges, deep learning has been extensively explored in the field and has proven to be exceptionally effective in a wide range of document analysis tasks such as document image classification [4, 5], layout analysis [5], OCR [6], etc. However, deep learning presents some unique challenges of its own. One major disadvantage of deep learning-based approaches is that their performance is heavily dependent on the availability of large amounts of annotated training data. While most real-world tasks have a vast amount of data available that could be annotated, which is also true for document processing tasks, data annotation is generally a labor-intensive task and can become extremely costly for certain tasks that require domain experts’ knowledge to annotate. Furthermore, modern document processing pipelines are often continuously under development to accommodate new and evolving tasks and data requirements. Consequently, data annotation often becomes a routine process within these pipelines, resulting in an increased labor cost.

Fig. 1 Two samples of different classes with high similarity (top left) and two samples of the same class with high variance (top right) are shown

Active learning (AL) is a relatively new and emerging research topic that directly addresses the above-mentioned challenges of data annotation costs [7]. The goal of active learning is to maximize the performance of deep learning models while minimizing the costs associated with annotation. Generally, AL involves training a machine learning model on a small labeled dataset and then using it to extract the most informative samples from an unlabeled pool of data samples. The newly extracted samples are then sent to the Oracle for labeling and incorporated back into the labeled dataset. Lastly, the model is retrained on the updated labeled dataset and the process is repeated. Active learning has recently made remarkable advancements [8], particularly in the fields of image classification [9,10,11] and semantic segmentation [12], where it has been shown to significantly reduce annotation costs without adversely affecting model performance. However, despite the fact that AL can provide substantial benefits in terms of reducing the annotation costs associated with document analysis tasks, only limited literature has been published in this direction. This paper investigates AL for reducing data annotation costs specifically in the context of document image classification, which is one of the core elements of modern document processing pipelines.

The task of document image classification poses a significant challenge due to the high intra-class variance and inter-class similarity [2, 13]. An example of this is shown in Fig. 1. Not only does this make it difficult for deep learning models to distinguish between different document classes, but also for humans to annotate them, which in turn increases the possibility of high annotation noise in document classification datasets. Therefore, this paper explores not only the effectiveness of existing AL approaches for reducing annotation costs while maintaining high performance, but also their effectiveness in countering labeling noise in document image classification. Data imbalance is another prevalent issue in real-world document datasets, and therefore, this paper additionally investigates the performance of AL under different scenarios of data bias and imbalance. In summary, this paper offers two main contributions:

  1. This work shows for the first time that active learning can be used to significantly reduce data annotation costs while achieving competitive performance on document image classification benchmarks.

  2. This work investigates the potential of different AL strategies in countering annotation noise and data imbalance and presents a thorough comparative analysis of their effectiveness in such scenarios.

2 Related work

2.1 Active learning

Active learning has seen tremendous research growth in the past few years, with a wide range of approaches proposed in this area [8]. Current active learning methods primarily fall into two categories: membership query-synthesis [14, 15] and pool-based approaches [11, 12, 16]. Query-synthesis active learning methods not only look for informative samples in the unlabeled dataset but also generate their own informative samples using generative models. In contrast, the pool-based methods rely mainly on different sampling techniques in order to query the most informative samples from an unlabeled dataset. This work focuses primarily on investigating pool-based active learning strategies for classifying document images, and therefore, previous work in this area is reviewed in greater depth.

Several approaches to pool-based active learning have been proposed in the past, which can generally be divided into three main categories: uncertainty-based approaches [16, 17], representation-based approaches [11], and enhanced hybrid approaches [12, 18]. Uncertainty-based approaches aim to identify and select those samples from the unlabeled dataset on which the trained model exhibits the greatest degree of uncertainty. These approaches have been proposed in both Bayesian and non-Bayesian frameworks. In the non-Bayesian realm, various uncertainty measures are directly employed, such as entropy [19], distance from decision boundaries [17], and expected risk minimization. By contrast, Bayesian approaches estimate uncertainty using Gaussian processes. A study by Gal et al. [20] showed that neural networks with Dropout [21] applied before each weight layer approximate a probabilistic deep Gaussian process, and used Dropout to estimate uncertainty in predictions. In a slightly different direction, some works have also proposed model ensembles to compute uncertainty [22].

Representation-based approaches focus primarily on querying samples that increase the diversity of the batch being queried. One popular representation-based method is KMeans Sampling [8], which generates sample clusters from the unlabeled dataset using KMeans Clustering and then selects samples in proportion to their squared distances from the nearest centroid. CoreSets [11] is another popular representation-based approach that relies on reducing the distance between the queried samples and the labeled samples in feature space and has shown promising results in large-scale image classification applications.

Several enhanced or hybrid approaches have also been proposed in recent years. CEAL [18] is a hybrid active learning approach that may be used in conjunction with any existing active learning query method. CEAL first uses the underlying AL strategy to extract the samples from the unlabeled dataset and then extracts additional samples by assigning pseudo-labels to those samples that are confidently predicted by the model. Enhanced adversarial approaches such as DeepFool Active Learning (DFAL) [23] and the Adversarial Basic Iterative Method (AdvBIM) [24] have also recently become popular; these seek adversarial examples in unlabeled datasets to increase the diversity of the samples being queried. Sinha et al. [12] proposed a hybrid adversarial approach that combines variational auto-encoders with an adversarial discriminator to increase batch diversity. Similarly, Shui et al. [25] recently proposed the Wasserstein Adversarial Active Learning (WAAL) approach which trains an independent discriminator model to search for diverse unlabeled samples. Loss Prediction Loss (LPL) [26] is another recent hybrid approach that trains a separate network in parallel with the target model to predict the loss of inputs with respect to the target model and then queries the samples that result in the highest predicted loss. In a different direction, Ash et al. [27] proposed the Batch Active Learning by Diverse Gradient Embeddings (BADGE) approach that finds a tradeoff between uncertainty and diversity by computing gradient embeddings for unlabeled samples and clustering them with the KMeans++ algorithm. Ash et al. [28] also recently proposed Batch Active Learning via Information MaTrices (BAIT) as an improvement over BADGE, which uses gradient embeddings in combination with Fisher information to determine the optimal tradeoff between uncertainty and diversity.

2.2 Document image classification

The topic of document image classification has been extensively researched in the past. Early work in this area concentrated primarily on exploiting the structural similarity in documents [29], feature matching [30], or applying classical approaches such as K-nearest neighbors [31] or hidden Markov models [32] to distinguish between classes of documents.

Fig. 2 An overview of the active learning cycle. The model \(\mathcal {M}\) is first trained on the labeled dataset \(\mathcal {D}_\textrm{L}\) which is then utilized to query samples from the unlabeled dataset \(\mathcal {D}_\textrm{U}\). Samples are labeled by the oracle and aggregated back into the labeled set, and the process is repeated

Recent advances in deep learning have led to the development of numerous image-based and multi-modal approaches for the classification of document images [4, 5, 13]. A major contribution to this field was made by Kang et al. [33], who demonstrated that even a shallow neural network with just four layers could achieve dramatic performance improvements over traditional approaches. In the following years, the works of Harley et al. [2] and Afzal et al. [13] explored the potential of convolutional neural networks (CNNs) in combination with transfer learning and demonstrated exceptional performance improvements on popular benchmark datasets. Several CNN-based approaches have been proposed since then that have explored different directions such as transfer learning [13], parallel training [4], multi-view stacking [34], and inherent interpretability [35] in the context of document image classification. In recent studies, self-supervised pretraining has also been investigated for document classification both in the image domain [36, 37] and in the multi-modal domain [1, 5, 38] in order to leverage large-scale document datasets for training without incurring additional annotation costs. In such approaches, however, data annotation is still required in order to fine-tune the models on the downstream tasks.

3 Methods

This section describes our active learning setup and the different query strategies that were investigated in this study.

3.1 Active learning setup

Let \(\mathcal {D}_\textrm{L}=\{(x_1, y_1),(x_2, y_2),\dots ,(x_n, y_n)\}\) denote the labeled training dataset in a standard supervised learning setting, where \(x_i\) denotes a data sample and \(y_i\) denotes the corresponding class label, and let \(\mathcal {D}_\textrm{U}=\{x_1,x_2,\dots ,x_m\}\) denote a larger pool of unlabeled samples such that \(n \ll m\). The goal of active learning (AL) is then to iteratively select the b most informative samples from the unlabeled dataset \((x_\textrm{U} \sim \mathcal {D}_\textrm{U})\) using a query function f, such that when they are annotated and aggregated back into the labeled dataset \(\mathcal {D}_\textrm{L}\), the overall classification performance of the machine learning model \(\mathcal {M}\) trained on the updated labeled dataset \(\mathcal {D}_\textrm{L}\) is maximized.

In this study, the standard batch active learning (BAL) approach was used, which has previously been shown to be effective in training convolutional neural networks (CNNs) for image classification tasks [11]. In a standard supervised setting of BAL, each AL cycle (see Fig. 2) begins with training the deep learning model (\(\mathcal {M}\)) on the labeled dataset \(\mathcal {D}_\textrm{L}\). The trained model \(\mathcal {M}\) is then utilized in combination with a predefined query function f of choice to query a batch of samples of size b from the unlabeled dataset \(\mathcal {D}_\textrm{U}\). Newly selected samples are then sent to the Oracle for annotation and aggregated back into the labeled dataset \(\mathcal {D}_\textrm{L}\) for the next round of training. This cycle is repeated until either the total annotation budget \(\mathcal {B}\) has been exhausted or a predefined termination condition has been met. In this work, a fixed number of active learning rounds was used as a termination condition.
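For concreteness, the cycle described above can be sketched as a simple loop over sample indices (a minimal Python sketch, not the exact implementation used in this work; `train_fn`, `query_fn`, and `oracle_fn` are hypothetical callables standing in for model training, the query function f, and oracle annotation):

```python
def active_learning_loop(model, dataset, labeled_idx, unlabeled_idx,
                         train_fn, query_fn, oracle_fn, b, num_rounds):
    """Minimal sketch of batch active learning (BAL) over sample indices."""
    for _ in range(num_rounds):
        # 1. (Re)train the model M on the current labeled set D_L.
        model = train_fn(model, dataset, labeled_idx)

        # 2. Query the b most informative unlabeled samples with f.
        queried = query_fn(model, dataset, unlabeled_idx, b)

        # 3. The oracle annotates the queried samples (labels stored in dataset).
        oracle_fn(dataset, queried)

        # 4. Aggregate them into D_L and remove them from D_U.
        labeled_idx = labeled_idx + queried
        unlabeled_idx = [i for i in unlabeled_idx if i not in set(queried)]
    return model, labeled_idx, unlabeled_idx
```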

The query function f is the most important component of AL, which describes the criteria by which samples are selected from the unlabeled dataset \(\mathcal {D}_\textrm{U}\). In order to maximize the machine learning model’s performance with minimal annotation costs, it is necessary to select the most informative samples from \(\mathcal {D}_\textrm{U}\) during each AL round. A variety of query functions f have been proposed in the past, each defining the informativeness of a sample according to a different criterion. In this study, several existing pool-based query approaches were explored, including uncertainty-based approaches, representation-based approaches, and enhanced hybrid approaches.

3.1.1 Uncertainty-based approaches

Several uncertainty-based query functions have been investigated in this paper from both the Bayesian and Non-Bayesian realms. Non-Bayesian sampling techniques include Margin Sampling [8], Least Confidence Sampling [8], and Entropy Sampling [19]. In the Bayesian setting, the techniques Bayesian Active Learning Disagreement (BALD) [16], Margin Sampling, Least Confidence Sampling, and Entropy Sampling were explored in combination with Monte Carlo Dropout [20].
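To make the non-Bayesian measures concrete, the three query functions above can be written directly in terms of the model's softmax outputs on the unlabeled pool (a minimal NumPy sketch; `probs` is an assumed array of shape (num_unlabeled, num_classes)):

```python
import numpy as np

def entropy_query(probs, b):
    # Entropy Sampling: pick the b samples with the highest predictive entropy.
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-ent)[:b]

def margin_query(probs, b):
    # Margin Sampling: pick the b samples with the smallest gap between
    # the two most probable classes.
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]
    return np.argsort(margin)[:b]

def least_confidence_query(probs, b):
    # Least Confidence Sampling: pick the b samples whose top prediction
    # has the lowest probability.
    conf = probs.max(axis=1)
    return np.argsort(conf)[:b]
```

In the Bayesian variants, `probs` would be replaced by the average of multiple stochastic forward passes with Dropout enabled.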

Fig. 3 The distribution of classes in the Tobacco3482 training set

3.1.2 Representation-based approaches

In this domain, two approaches, namely CoreSets [11] and KMeans Sampling [8], were investigated. The CoreSets approach was implemented utilizing the KCenterGreedy algorithm, as originally proposed. However, due to the high dimensionality of the output embeddings of the model, it was not computationally feasible to apply the KCenterGreedy algorithm directly to the output of the model. To address this issue, principal component analysis (PCA) was used to reduce the dimensionality of the output embedding of the model before applying the querying algorithm.
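A minimal sketch of this PCA-plus-KCenterGreedy selection is given below, assuming the model's penultimate-layer embeddings for the labeled and unlabeled sets are available as NumPy arrays; the number of PCA components is illustrative, as the exact value is not specified here:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA

def kcenter_greedy_query(feat_labeled, feat_unlabeled, b, pca_dim=32):
    # Reduce the dimensionality of the embeddings first (pca_dim is an
    # illustrative value; the exact number of components is an assumption).
    pca = PCA(n_components=pca_dim).fit(
        np.concatenate([feat_labeled, feat_unlabeled], axis=0))
    z_l, z_u = pca.transform(feat_labeled), pca.transform(feat_unlabeled)

    # Distance of every unlabeled point to its nearest labeled "center".
    min_dist = cdist(z_u, z_l).min(axis=1)

    selected = []
    for _ in range(b):
        # Greedily pick the unlabeled point farthest from all current centers,
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # then update the nearest-center distances with the new center.
        min_dist = np.minimum(min_dist, np.linalg.norm(z_u - z_u[idx], axis=1))
    return np.array(selected)
```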

3.1.3 Enhanced/Hybrid approaches

A number of enhanced adversarial [23, 24] and hybrid [12, 18] approaches were investigated in this work, including Cost-Effective Active Learning (CEAL) [18], the Adversarial Basic Iterative Method (AdvBIM) [24], WAAL [25], LPL [26], BADGE [27], and BAIT [28]. CEAL [18] was used in combination with the Entropy uncertainty measure in this work.

4 Experiments and results

This section describes and presents the results of the experiments conducted in this paper.

4.1 Datasets

To evaluate the effectiveness and feasibility of different AL techniques, two publicly available document benchmark datasets were utilized, namely RVL-CDIP [2] and Tobacco3482, both of which have been extensively utilized for benchmarking document image classification in the past [2, 4, 13]. RVL-CDIP [2] is a large-scale dataset containing 400K labeled document images from 16 document classes, divided into training, testing, and validation sets of 320K, 40K, and 40K, respectively. Tobacco3482, in contrast, is a smaller dataset consisting of only 3482 images divided into 10 different document categories. It is important to note, however, that unlike RVL-CDIP, Tobacco3482 has an imbalanced class distribution as shown in Fig. 3. This makes it useful for investigating the performance of AL algorithms in the presence of class imbalance. In this work, the Tobacco3482 dataset was divided into training, test, and validation sets of 2504, 700, and 278 in size, respectively.

4.2 Implementation details

In order to simulate a realistic AL scenario, a small percentage of the original training set of each dataset was randomly sampled to create the initial labeled dataset \(\mathcal {D}_\textrm{L}\). This percentage was set to 10% for RVL-CDIP and to 5% for the Tobacco3482 dataset. The remaining samples in the training set were used to create the unlabeled pool \(\mathcal {D}_\textrm{U}\) from which samples were extracted for annotation by the Oracle. In each AL cycle, the query batch size was set to 2.5% of the full training set, which is equivalent to 8000 samples for RVL-CDIP and 63 for the Tobacco3482 dataset, and the AL cycle was repeated for each experiment until a total of \(40\%\) of the original dataset was annotated.
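A minimal sketch of how such a split can be constructed from the training indices is shown below (the helper name and random seed are illustrative):

```python
import numpy as np

def make_initial_split(num_train, init_frac, seed=0):
    # Randomly sample `init_frac` of the training indices as the initial
    # labeled set D_L; the remaining indices form the unlabeled pool D_U.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_train)
    n_init = int(init_frac * num_train)
    return perm[:n_init], perm[n_init:]

# e.g. RVL-CDIP: 10% initial labels, query batches of 2.5% of the training set
labeled_idx, unlabeled_idx = make_initial_split(320_000, 0.10)
query_batch_size = int(0.025 * 320_000)   # 8000 samples per AL round
```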

All the experiments were conducted using the standard ResNet-50 model [39] pretrained on the ImageNet-22k dataset [40], which has previously been demonstrated to perform exceptionally well on the aforementioned document datasets [13]. As has been done in previous works [4, 13], the input images for the model were down-scaled to a resolution of \(224\times 224\), converted to RGB color space, and normalized with the ImageNet mean of (0.485, 0.456, 0.406) and standard deviation of (0.229, 0.224, 0.225). Training was conducted using the standard Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, which was gradually reduced over the training cycle using the Cosine Decay Learning Rate Scheduler. In each AL cycle, 40 training epochs were used, a number previously determined to be sufficient for this task [4, 13]. For the RVL-CDIP dataset, a batch size of 256 was used, while for the Tobacco3482 dataset, a batch size of 64 was employed.
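The preprocessing and optimization setup described above roughly corresponds to the following PyTorch sketch (hedged: the torchvision weights shown are ImageNet-1k rather than the ImageNet-22k pretraining used here, and the momentum value is an assumption, as it is not stated above):

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Preprocessing as described: 224x224 RGB images with ImageNet normalization.
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

model = resnet50(weights="IMAGENET1K_V1")              # pretrained backbone
model.fc = torch.nn.Linear(model.fc.in_features, 16)   # e.g. 16 RVL-CDIP classes

# SGD with an initial learning rate of 0.01 and cosine decay over 40 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)
```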

For the Tobacco3482 dataset, two different training settings, namely Tobacco3482ImageNet and Tobacco3482RVL-CDIP, were additionally investigated. In the Tobacco3482ImageNet setting, models pretrained on ImageNet-22k were used, whereas in the Tobacco3482RVL-CDIP setting, models pretrained on the RVL-CDIP dataset were utilized in order to assess the effectiveness of active learning in combination with document-specific pretraining. It is important to note that for pretraining the models on the RVL-CDIP dataset, we also used the ImageNet-22k pretrained weights for model initialization. In addition, the results of all experiments on the Tobacco3482 dataset are reported as mean and standard deviation over 5 runs.

Some techniques were excluded from investigation on the larger RVL-CDIP dataset due to prohibitive computational requirements. KMeans Sampling [8] incurred high CPU computational costs, whereas for BADGE [27] and BAIT [28], the memory requirements scaled proportionally with the size of the dataset, which made it impossible to apply these techniques to RVL-CDIP. Moreover, WAAL uses a two-stage training strategy for discriminative learning and trains the network on both the labeled and unlabeled datasets in parallel. Compared to other active learning strategies, this results in very high training costs when the unlabeled pool \(\mathcal {D}_\textrm{U}\) is large. As a result, WAAL was only investigated on the RVL-CDIP dataset with a batch percentage of 5.0% in order to reduce the total computational cost of the active learning process.

4.3 Performance evaluation

This section presents the performance results of different active learning algorithms on the two document datasets. For each AL method, Table 1 presents both the average accuracy achieved on 40% of original training datasets and the area under the budget curve (AUBC) which is useful in comparing the overall performance of an AL method under varying budgets. Figure 4 illustrates the budget–accuracy curves for each AL strategy, under different dataset settings, which indicate the accuracy achieved by the model after each AL round until a total of 40% of the training dataset was annotated.
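Given the accuracy recorded after each AL round, the AUBC can be approximated with the trapezoidal rule, as in the following sketch (the normalization by the budget range is an assumption; the exact definition used for Table 1 is not stated here):

```python
import numpy as np

def aubc(budgets, accuracies):
    # budgets: fraction of the training set labeled after each AL round
    # accuracies: accuracy achieved at each of those budgets
    b = np.asarray(budgets, dtype=float)
    a = np.asarray(accuracies, dtype=float)
    # Trapezoidal approximation of the area under the accuracy-budget curve,
    # normalized by the budget range so values are comparable across runs.
    area = np.sum((a[1:] + a[:-1]) / 2.0 * np.diff(b))
    return area / (b[-1] - b[0])

# e.g. accuracy measured at 10%, 20%, 30%, and 40% labeled data
print(aubc([0.10, 0.20, 0.30, 0.40], [0.80, 0.85, 0.88, 0.90]))
```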

Table 1 ResNet-50 performance under supervised learning with fully annotated training datasets (top) and active learning with 40% annotated training datasets (bottom)
Fig. 4 Accuracy–budget curves for the different active learning strategies on RVL-CDIP and Tobacco3482 datasets

For comparison of the active learning performance with standard supervised training on the fully annotated dataset, the accuracy achieved by the model with 100% of the training data annotated is also presented in the table, denoted by ResNet-50Supervised. The ResNet-50Supervised performance reported by Afzal et al. [13] for the Tobacco3482 settings differs from ours in that they only used 100 randomly selected samples per class (a total of 1000 samples) for training, while we utilized the entire training dataset. The performance difference in both the Tobacco3482ImageNet and Tobacco3482RVL-CDIP settings is a result of this difference in approach.

In addition to standard supervised learning approaches, we also compared the results of our experiments with self-supervised learning approaches. Siddiqui et al. [36] recently examined two state-of-the-art self-supervised approaches in the context of document classification, namely BarlowTwins [41] and SimCLR [42]. The research demonstrated how self-supervised pretraining can assist in reducing annotation costs on both large and small datasets. As they also used the ResNet-50 model for their analysis, our results can be directly compared with theirs. The results presented in this paper are based on an experiment in which Siddiqui et al. [36] first pretrained a ResNet-50 model on the RVL-CDIP dataset using the SimCLR and BarlowTwins self-supervised approaches and then fine-tuned the model on the RVL-CDIP and Tobacco3482 datasets with only 50% of the training data annotated. Model performance accuracies resulting from these experiments are shown in Table 1 denoted by ResNet-50BarlowTwins and ResNet-50SimCLR.

4.3.1 Results on RVL-CDIP

As can be seen from Table 1, many of the AL techniques investigated in this work performed significantly better than the Random Sampling baseline. Furthermore, they were able to achieve a performance comparable to the model trained on the fully annotated dataset (ResNet-50Supervised) by using only 40% of the labeled training dataset. The enhanced LPL approach showed significantly better performance than the others, both in terms of accuracy and AUBC, which is also clearly visible in Fig. 4a. Uncertainty-based techniques such as Entropy and Margin also showed competitive performance despite their simplicity compared to enhanced approaches. One interesting observation is that while CEAL (Entropy) showed similar accuracy to Entropy, its AUBC was much higher in comparison. From Fig. 4a, it can be seen that CEAL (Entropy) results in consistently higher accuracy over varying budgets compared to Entropy. Some techniques such as BALD and AdvBIM performed even worse than the Random Sampling baseline in this case. KCenterGreedy also appeared to perform similarly to the top uncertainty-based methods in terms of accuracy; however, its AUBC remained significantly lower. WAAL, despite being a state-of-the-art (SotA) enhanced hybrid approach, also performed poorly compared to other, simpler approaches. These performance differences can also be observed in Fig. 4a, where the accuracy–budget curves of AdvBIM and BALD show a behavior very similar to the Random Sampling baseline, whereas the accuracy–budget curves of KCenterGreedy and WAAL stayed consistently lower than those of their competitors.

4.3.2 Results on Tobacco3482ImageNet

A slightly different trend was seen in this setting, where Entropy showed the highest mean accuracy of 82.54%, which is \(\approx \)2.7% higher than the Random Sampling baseline. Other techniques that performed comparatively well were Margin (Dropout) and BADGE. One noticeable observation in this scenario is that many of the techniques showed very high variance due to the small dataset size and large class imbalance. This variation could also explain why the Margin (Dropout) approach had the highest overall AUBC despite its lower accuracy compared to Entropy. BALD, KMeans, and BAIT performed even worse than the Random Sampling baseline in this scenario, which is surprising, as BAIT has been shown to perform much better than other approaches in the natural image classification domain [28]. One interesting observation from Fig. 4b is that BAIT performed better than other approaches in the first few rounds (up to approximately 15% of the dataset labeled); however, its performance degraded suddenly afterward. The SotA enhanced approaches LPL and WAAL also performed poorly compared to other approaches in this scenario. A possible explanation is that these techniques are highly sensitive to data imbalance. For example, for LPL, data imbalance may result in bias in the loss prediction, resulting in poor performance as a consequence. Interestingly, these observations are similar to those reported by [8]. Overall, it can be noted that AL strategies did not perform as well in this scenario as they did on RVL-CDIP. However, these results are not surprising, since AL methods have previously been shown to suffer from biased sampling in case of data imbalance [8, 43].

4.3.3 Results on Tobacco3482RVL-CDIP

It is evident from Table 1 that the document-specific pretraining resulted in a significant improvement in the performance of the AL algorithms. It is interesting to note that even the Random Sampling baseline outperformed the ResNet-50Supervised model trained on the full dataset, showing that active learning can be an effective training process even with just random querying in some scenarios. Uncertainty-based techniques performed better overall than other approaches (except AdvBIM) in this setting; however, the differences in their performance were minor. Figure 4c also illustrates an interesting observation that the majority of AL strategies under this scenario outperformed the ResNet-50Supervised model even with just 10–15% of the training data annotated. Another noteworthy observation from Fig. 4c is that AdvBIM started out the worst in the initial rounds but eventually performed significantly better than other approaches, achieving a mean accuracy of 92.80% at 40% of the training dataset labeled. Similar to the Tobacco3482ImageNet case, many enhanced approaches including BADGE, BAIT, LPL, and WAAL performed poorly in this scenario, sometimes performing even worse than the Random Sampling baseline.

4.3.4 Comparison with self-supervised approaches

From Table 1, it can be observed that the AL approaches consistently outperformed the self-supervised approaches ResNet-50BarlowTwins and ResNet-50SimCLR despite using 10% fewer annotated samples overall. In addition, it can be seen from Fig. 4a that the self-supervised performance on RVL-CDIP is surpassed by a number of AL strategies with only 25–30% of the training data labeled. It should also be noted that while AL strategies showed only modest improvements over the self-supervised approaches on the RVL-CDIP dataset, they demonstrated much more significant improvements on Tobacco3482. It can be seen from Fig. 4c that, when using standard RVL-CDIP pretraining, fine-tuning on the Tobacco3482 dataset, even with 5% annotated data, was more effective than fine-tuning from self-supervised pretraining with 50% annotated data. This suggests that task-specific (classification) pretraining on a larger dataset (RVL-CDIP) can provide significant performance boosts on smaller datasets in comparison with self-supervised pretraining. This does, however, mean that the larger pretraining dataset must be fully annotated itself, which may not always be possible, in which case self-supervised pretraining may be a more viable option. While the above is true, it can also be noticed from Fig. 4b that even without RVL-CDIP pretraining for the Tobacco3482 dataset, AL strategies with only 40% of the training dataset were able to achieve performances comparable to those achieved by the self-supervised pretraining approaches. This indicates that active learning can be as effective in reducing annotation costs as self-supervised pretraining even without document-specific pretraining.

Table 2 For each AL strategy, the average time spent on querying samples from the unlabeled pool \(\mathcal {D}_\textrm{U}\) and the total training time relative to the Random Sampling baseline are shown
Fig. 5 Accuracy–budget curves for the different active learning strategies on RVL-CDIP and Tobacco3482 datasets under varying query batch sizes

Table 3 ResNet-50 performance under supervised learning with fully annotated training datasets (top) and active learning with 40% annotated training datasets and with varying batch sizes (bottom)

4.4 Query time analysis

The time it takes to perform a querying operation is an important consideration when selecting an active learning strategy. Since Random Sampling is itself an effective AL method, the querying methods should ideally be comparably time-efficient. Table 2 provides the mean querying time (Avg. tquery) spent by each AL strategy and the total training time (Rel. ttrain) taken by each strategy per AL round relative to the Random Sampling baseline. As can be seen from the table, non-Bayesian uncertainty-based approaches were the fastest in terms of both Avg. tquery and Rel. ttrain. The Bayesian uncertainty-based methods, on the other hand, were significantly slower, which is expected since these methods use multiple forward passes with Dropout to compute uncertainty. CEAL (Entropy) also showed competitive querying times since it uses Entropy as the underlying querying method. However, its training time was comparatively much higher as it adds additional training samples to the labeled dataset \(\mathcal {D}_\textrm{L}\) per round. In order to overcome the large memory requirements of the KCenterGreedy algorithm for large datasets, batch processing was used in our work. As a consequence, its querying time increased proportionally with the size of the dataset. This is evident from Table 2, where its querying time is similar to uncertainty-based techniques on the small Tobacco3482 dataset but increases considerably on RVL-CDIP. Although BADGE was found to be as time-efficient as the non-Bayesian uncertainty-based approaches, BAIT required nearly ten times more computation time than BADGE. LPL, despite training an additional model, showed competitive training and querying times. The training time for WAAL, on the other hand, scaled considerably with the size of the dataset, taking about \(18\times \) more time for training the model on the RVL-CDIP dataset compared to the Random Sampling baseline. This is the reason why WAAL was only investigated with a batch percentage of 5.0% on the RVL-CDIP dataset.

4.5 Effect of varying batch size

One critical hyperparameter in active learning training scenarios is the query batch size. Previous studies have shown that smaller batches result in better AL performance since fewer redundant samples are queried [12]. We examined this effect for the document domain using batch percentages of \(\%b=1.25\%\), \(\%b=2.5\%\), and \(\%b=5.0\%\) on a subset of techniques, as presented in Table 3. Figure 5 also illustrates the accuracy–budget curves for four different techniques under varying batch sizes. Only four techniques are shown here for visual clarity. Although there were minor differences in accuracy and AUBC across different batch sizes, no major differences were observed that could be directly correlated with increasing or decreasing batch sizes for RVL-CDIP. For the Tobacco3482 settings, some minor differences in AUBC were seen across different batch sizes, which can also be observed in Fig. 5. For example, the effect of different batch sizes on the CEAL (Entropy) and AdvBIM approaches is clearly visible, with \(b=5.0\%\) resulting in comparatively worse performance. On the other hand, increasing the batch size also resulted in decreased variance across different runs for the Tobacco3482ImageNet case, as is evident from the mean standard deviations across different batch sizes. To confirm our findings, we also conducted a single-factor ANOVA test [44] to compare the AUBCs of all methods across the three batch size groups. For RVL-CDIP, no statistically significant difference was found (\(p > 0.05\)). However, a significant difference was found (\(p < 0.05\)) for the Tobacco3482 settings, in which it was observed that the AUBC of the AL strategies was generally lower for \(b=5.0\%\) than for other batch sizes. This is also evident from the mean AUBC values across different batch sizes in Table 3.
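A single-factor ANOVA of this kind can be reproduced, for example, with SciPy's `f_oneway` (the group values below are placeholders, not the measurements from Table 3):

```python
from scipy.stats import f_oneway

# AUBC values of all AL methods, grouped by query batch size
# (illustrative placeholder values, not the paper's measurements).
aubc_b125 = [0.79, 0.81, 0.80, 0.78]
aubc_b250 = [0.80, 0.82, 0.79, 0.79]
aubc_b500 = [0.76, 0.78, 0.77, 0.75]

f_stat, p_value = f_oneway(aubc_b125, aubc_b250, aubc_b500)
# A p-value below 0.05 indicates a statistically significant difference
# in AUBC across the three batch-size groups.
print(f_stat, p_value)
```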

4.6 Bias in initial labeled dataset

4.6.1 Experimental setup

This section describes a set of experiments that examined the effects of bias in the initial labeled dataset \(\mathcal {D}_{L0}\) on the overall performance of the AL query strategies. Data bias often occurs in real-world deployment scenarios when there is a relative shortage of labeled data for one class in comparison with another. Due to this bias, the model trained in the first round may not accurately represent the underlying distribution of the data. We simulate this bias by excluding samples of m randomly selected classes from the initial labeled dataset. This experiment was carried out for two cases, \(m=2\) and \(m=4\), and the results are presented in Table 4. The accuracy–budget curves for this scenario are illustrated in Fig. 6.
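One possible reading of this bias simulation, assuming integer class labels for the training set, is sketched below (the helper name and seed are illustrative):

```python
import numpy as np

def biased_initial_split(labels, init_frac, m, seed=0):
    # Simulate bias: draw the initial labeled set D_L0 as usual, but
    # exclude all samples belonging to m randomly chosen classes.
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    excluded = rng.choice(classes, size=m, replace=False)

    candidates = np.where(~np.isin(labels, excluded))[0]
    n_init = int(init_frac * len(labels))
    labeled_idx = rng.choice(candidates, size=n_init, replace=False)

    # Everything not in D_L0 (including the excluded classes) forms D_U.
    unlabeled_idx = np.setdiff1d(np.arange(len(labels)), labeled_idx)
    return labeled_idx, unlabeled_idx, excluded
```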

Table 4 ResNet-50 performance under supervised learning with fully annotated training datasets (top) and active learning with 40% annotated training datasets and data bias in the initial labeled dataset \(\mathcal {D}_{L0}\) (bottom)

4.6.2 Results on RVL-CDIP

In the case of the RVL-CDIP dataset with 16 classes, the effect of bias was negligible, since most techniques exhibited a performance trend similar to the no-bias case reported previously in Table 1. As can be seen in Fig. 6a, d, the models initially performed comparatively poorly due to a lack of data for some classes; however, their performance quickly improved in subsequent cycles. Overall, no significant performance drop was observed when m was increased from 2 to 4 either, with LPL, CEAL (Entropy), and the uncertainty-based approaches again performing comparatively the best in this scenario.

4.6.3 Results on Tobacco3482ImageNet

In the Tobacco3482ImageNet setting, AL strategies seemed especially effective in countering the data bias compared to the Random Sampling baseline for both the \(m=2\) and \(m=4\) cases, as they understandably targeted the missing classes in subsequent AL cycles. Margin and its Dropout variant showed the highest accuracy in this scenario, closely followed by KCenterGreedy and Least Confidence. Surprisingly, many strategies showed a higher accuracy with \(m=4\) than with \(m=2\). This may have been due to the removal of high-frequency classes from the initial dataset, making the model less susceptible to the overall class imbalance. The classes that were randomly selected for removal in the \(m=2\) and \(m=4\) cases were \(\{\)Memo, Letter\(\}\) and \(\{\)Memo, Letter, Form, ADVE\(\}\), respectively. As can be seen from Fig. 3, in the \(m=4\) case, Email was the only high-frequency class left after the sample removal from the initial labeled dataset. Another interesting observation can be made from Fig. 6e, where it is visible that KMeans Sampling, BAIT, and BADGE all performed better than other techniques in the first few AL cycles; however, their performance degraded after approximately 20% of the data was annotated. This behavior was also previously seen in the Tobacco3482ImageNet setting described in Sect. 4.3. A possible explanation for this behavior is that BAIT and BADGE are highly vulnerable to class imbalance in the data. WAAL and LPL showed a trend similar to the no-bias case discussed in Sect. 4.3, performing even worse than the Random Sampling baseline in some cases.

Fig. 6 Accuracy–budget curves for the different active learning strategies on RVL-CDIP and Tobacco3482 datasets under biased initial labeled dataset

Fig. 7 Accuracy–budget curves for the different active learning strategies on Tobacco3482 datasets with top 4 highest frequency classes removed (left) and top 4 lowest frequency classes removed (right)

4.6.4 Results on Tobacco3482RVL-CDIP

In the Tobacco3482RVL-CDIP scenario, the differences in accuracy among the AL strategies, including the Random Sampling baseline, at 40% annotated data were quite negligible, and all of them surpassed the performance of the ResNet-50Supervised model with just 15–20% of the data annotated. Uncertainty-based approaches, however, scored comparatively higher at 40% annotated data. From Fig. 6c, f, it can also be observed that the enhanced approaches BADGE and BAIT, as well as diversity-based techniques such as KCenterGreedy and KMeans, were able to handle the bias much better than the others, achieving higher performance in the first few cycles. On the other hand, the enhanced approaches AdvBIM and LPL seemed especially vulnerable to the bias, taking a number of iterations to reach the same level of performance as other methods.

4.6.5 Removing highest and lowest frequency classes

In this section, we present the results of another experiment in which, rather than removing the classes at random, we removed the top 4 highest frequency classes and the top 4 lowest frequency classes to determine its overall effect on the different AL strategies. This experiment was only performed for the Tobacco3482ImageNet and Tobacco3482RVL-CDIP settings as only those settings have class imbalance. The top 4 highest frequency classes that were removed from the dataset include Letter, Email, Memo, and Form. In contrast, the top 4 lowest frequency classes that were removed include Resume, News, Note, and ADVE.

In this experiment, the results are presented only as accuracy–budget curves, as shown in Fig. 7. A few interesting conclusions can be drawn from the figure. To begin with, it can be observed that removing the highest and lowest frequency classes had a significant effect only on the performance of representation-based AL strategies such as KMeans, BADGE, and BAIT. This effect was especially pronounced in the Tobacco3482ImageNet case. In both cases of the Tobacco3482ImageNet scenario, the removal of the highest frequency classes (Fig. 7a) and the removal of the lowest frequency classes (Fig. 7c), these techniques performed even worse than the Random Sampling baseline. Surprisingly, the KCenterGreedy approach, although it is also representation-based, remained unaffected in this experiment. Overall, we observed no significant effect on the performance of other approaches in this scenario, with the exception of CEAL (Entropy), which showed some instability in the last few AL rounds when the top four classes with the lowest frequency were removed.

4.7 Robustness to annotation noise

4.7.1 Experimental setup

The problem of annotation noise is very common in real-world deployment scenarios of machine learning models. Past studies have shown that even with human annotators, the amount of mislabeled samples in the training dataset can reach up to 10% of its size [45]. This section describes a set of experiments in which a realistic annotation noise scenario was created for document data by randomly switching labels of those document classes that exhibit similarity with each other based on predetermined weights \(W \in \mathbb {R}^{N\times N}\).

Algorithm 1 Noise Labeling Strategy

Fig. 8 Confusion matrices that show the degree of annotation noise added per class for the RVL-CDIP and Tobacco3482 datasets under noise percentage of \(\epsilon =10\%\). As illustrated, some classes were not subjected to annotation noise due to their distinct differences from others

Table 5 ResNet-50 performance under supervised learning with fully annotated training datasets (top) and active learning with 40% annotated training datasets and varying degrees of annotation noise (bottom)

This process is detailed in Algorithm 1. As shown, for each class l, we determine the probabilities of drawing labels for similar classes based on the predetermined weights \(W \in \mathbb {R}^{N\times N}\). Then, based on the class probabilities \(p \in \mathbb {R}^{1\times N}\), we draw n random labels and assign them to the existing samples of the class l, where n is the total number of samples to be updated. This process is repeated for each class and the training set is updated. Note that we used an in-place update to the dataset in this scenario, so the noise added to the dataset for classes that were mutually similar could be spread over both classes. For example, if two classes Letter and Memo were similar to each other, Algorithm 1 resulted in switching a total of 10% of the samples between them. Additionally, it allowed switching samples between classes that were only indirectly similar. For example, if Letter was similar to Memo and Memo was similar to Presentation, then some of the samples from Letter class were also converted to Presentation class.
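The following sketch reconstructs this procedure from the description above (a hedged reconstruction, not the authors' exact implementation; the sequential per-class updates emulate the in-place behavior that spreads noise across mutually and indirectly similar classes):

```python
import numpy as np

def add_annotation_noise(labels, W, eps, seed=0):
    # labels: integer class labels of the training set
    # W:      (N, N) similarity weights; W[l, k] > 0 if class k is similar to class l
    # eps:    fraction of samples per class whose label is switched
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    num_classes = W.shape[0]

    # Classes are processed sequentially on the same label array, so a sample
    # relabeled from Letter to Memo may later be relabeled again from Memo to
    # an indirectly similar class such as Presentation.
    for l in range(num_classes):
        weights = W[l].astype(float).copy()
        weights[l] = 0.0                     # never "switch" a label to itself
        if weights.sum() == 0:               # no similar classes -> no noise added
            continue
        p = weights / weights.sum()          # drawing probabilities per class

        idx = np.where(labels == l)[0]
        n = int(eps * len(idx))              # number of samples of class l to relabel
        if n == 0:
            continue
        chosen = rng.choice(idx, size=n, replace=False)
        labels[chosen] = rng.choice(num_classes, size=n, p=p)
    return labels
```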

Since document classes usually have very high intra-class variance and inter-class similarity, it was difficult to visually determine which classes should be considered similar in our scenario. Therefore, the weights \(W \in \mathbb {R}^{N\times N}\) for the similarity between classes were heuristically determined by inspecting the classes between which the fully trained ResNet-50Supervised model had shown the highest confusion. For example, the confusion matrix of the ResNet-50Supervised model generated on the test set of the RVL-CDIP dataset is depicted in Fig. 10a, which directly shows which document classes were confused with each other the most. We used this confusion to directly define the similarity: to generate the class similarity weights, we simply applied a threshold followed by normalization to the off-diagonal entries of the confusion matrix. The resulting similarity weights are presented in Fig. 10b, from which it can be seen that the classes Letter and Memo were determined to be mutually similar, both with weights of 1.0. In many cases, it was also possible for one class to be similar to another, but not vice versa. For example, the class Scientific Report was found to be similar to the class News Article with a weight of 0.4, but the opposite was not true.
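One possible reading of this weight-generation procedure is sketched below: the off-diagonal confusion entries are thresholded and then normalized row-wise so that the most-confused class pair receives a weight of 1.0 (the threshold value and the choice of row-max normalization are assumptions):

```python
import numpy as np

def similarity_weights(conf_mat, threshold=0.05):
    # conf_mat: (N, N) row-normalized confusion matrix of the fully
    # supervised model on the test set.
    C = conf_mat.astype(float).copy()
    np.fill_diagonal(C, 0.0)            # only off-diagonal confusion counts
    C[C < threshold] = 0.0              # drop negligible confusions

    # Row-wise normalization by the maximum off-diagonal entry, so the
    # most-confused class in each row gets a weight of 1.0.
    row_max = C.max(axis=1, keepdims=True)
    W = np.divide(C, row_max, out=np.zeros_like(C), where=row_max > 0)
    return W
```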

Fig. 9 Accuracy–budget curves for the different active learning strategies on RVL-CDIP and Tobacco3482 datasets in the presence of annotation noise of varying degrees

This experiment was conducted with two settings of percentage annotation noise (\(\epsilon \)) per class, \(\epsilon =10\%\) and \(\epsilon =20\%\). The resulting annotation noise confusion matrices for the two datasets RVL-CDIP and Tobacco3482 after applying Algorithm 1 with percentage noise \(\epsilon =10\%\) per class are illustrated in Fig. 8. The results of this experiment are presented in Table 5, and the accuracy–budget curves for each AL method are illustrated in Fig. 9. For a fair evaluation of the performance of AL methods, the ResNet-50Supervised models in this scenario were also trained on the fully annotated noisy datasets and their performances are reported for each case in the table.

4.7.2 Results on RVL-CDIP

For the RVL-CDIP dataset, it can be seen that for a noise percentage of \(\epsilon =10\%\), the performance of the Random Sampling baseline was severely affected. Most uncertainty-based approaches such as Entropy and Least Confidence, as well as some enhanced approaches including CEAL (Entropy) and LPL, were still able to counter the effects of noise significantly better in comparison, even surpassing the performance of the ResNet-50Supervised model with just 40% annotated data. While Entropy showed better accuracy, LPL showed better overall performance with a considerably higher AUBC, which can also be observed in Fig. 9a. A slightly different trend was seen for the noise percentage \(\epsilon =20\%\), where the performance of all the techniques was greatly affected by the noise. However, CEAL (Entropy) was still considerably effective in countering its effects, surpassing both the Random Sampling baseline and the fully trained ResNet-50Supervised model, as evident from both Table 5 and Fig. 9d. Besides CEAL (Entropy), the uncertainty-based techniques and the enhanced LPL approach also showed competitive performance in this case, both of which outperformed the fully trained ResNet-50Supervised model at 40% of the dataset annotated.

4.7.3 Results on Tobacco3482ImageNet

In the Tobacco3482ImageNet setting, only uncertainty-based approaches such as Entropy and Margin Sampling seemed to consistently perform well for both the \(\epsilon =10\%\) and \(\epsilon =20\%\) cases. KCenterGreedy and AdvBIM still seemed to perform slightly better than the Random Sampling baseline, but their performance deteriorated severely in the \(\epsilon =20\%\) case. Most enhanced approaches, including CEAL (Entropy), BADGE, LPL, and WAAL, were severely affected by the annotation noise and performed significantly worse than even the Random Sampling baseline in the \(\epsilon =20\%\) case.

Fig. 10 Confusion matrix of the ResNet-50Supervised on the test set of RVL-CDIP dataset (left) and the weights generated from it (right) are shown

4.7.4 Results on Tobacco3482RVL-CDIP

The Tobacco3482RVL-CDIP scenario showed a different trend, where WAAL performed slightly better than other techniques in both the \(\epsilon =10\%\) and \(\epsilon =20\%\) cases; however, the overall differences in performance between the different approaches were quite negligible. Similar to previous cases, many of the techniques again surpassed the baseline performance even with just 15–25% of the training set queried, as evident from Fig. 9c, f.

4.8 Generalization to other models

Table 6 ConvNeXt-B performance under supervised learning with fully annotated training datasets (top) and active learning with 40% annotated training datasets (bottom)
Fig. 11 Accuracy–budget curves of the ConvNeXt-B model for the different active learning strategies on RVL-CDIP and Tobacco3482 datasets

In this section, we investigate whether the AL strategies studied in this work are applicable to other deep networks that can be used for document classification. For this purpose, we apply the AL strategies explored in this paper to the recently introduced ConvNeXt [46] model, specifically its ConvNeXt-B variant. To train the model on the RVL-CDIP and Tobacco3482 datasets, we used the same training strategy as [35] and trained the model with an image resolution of \(224\times 224\), the Adam optimizer, label smoothing, and CutMix and Mixup augmentations. For the RVL-CDIP dataset, we compare the results of our experiments with the supervised learning performance achieved by [35] in the ConvNeXt-B/224 scenario, whereas for the Tobacco3482ImageNet and Tobacco3482RVL-CDIP cases, we separately trained the model on the full Tobacco3482 dataset for comparison. For a fair comparison, we used the same training strategies for both active learning and supervised learning. The results of these experiments are presented in Table 6, in which for each AL strategy, both the average accuracy achieved on 40% of the original training datasets and the area under the budget curve (AUBC) are given. Figure 11 illustrates the budget–accuracy curves for each AL strategy under different dataset settings, which indicate the accuracy achieved by the model after each AL round until a total of 40% of the training dataset was annotated.

It can be observed from both Table 6 and Fig. 11a that with just \(40\%\) of the training dataset annotated, the active learning approaches were able to achieve performances very close to that of the ConvNeXtSupervised [35] model. Similarly, for the Tobacco3482ImageNet and Tobacco3482RVL-CDIP cases, it can be seen from Fig. 11b, c that some active learning strategies were even able to outperform the fully trained supervised learning models. This suggests that active learning may have helped the model learn better distributions on the imbalanced dataset compared to supervised training. A noteworthy observation in this scenario was the instability of the LPL strategy in both the RVL-CDIP and Tobacco3482RVL-CDIP scenarios. While LPL worked quite well for the ResNet-50 model, for ConvNeXt-B we observed considerable instability in performance during training using the exact same LPL configuration. As shown in Fig. 11a, c, the LPL technique performed very poorly in this case, and we found it difficult to tune its hyperparameters to reach any satisfactory results. Similarly, we found it difficult to train ConvNeXt-B with WAAL on the RVL-CDIP dataset with the exact same configuration as in the case of ResNet-50, with the discriminator loss becoming unstable during training. Aside from these two enhanced approaches, we observed that most AL techniques generalized fairly well to this model and even produced exceptional performance on the datasets.

Table 7 An overview of the top three most effective approaches as well as the top three least effective approaches for each of the scenarios investigated in this study. The approach types Unc. Non-Bayesian, Unc. Bayesian, Representation-based, and Enhanced/Hybrid are highlighted in Blue, Seagreen, Yellow, and Orange, respectively

5 Practical implications

This section summarizes the overall results of our study and discusses its practical implications in the context of document classification. In Table 7, we present an overview of the performance of different AL strategies under different settings of datasets and experiments. In the top section, we present the top 3 highest performing AL approaches for each scenario, and in the bottom section, we present the top 3 worst performing AL approaches based on the experiments performed on the ResNet-50 model. The types of each approach are also highlighted with different colors in order to present an overall view of which types of approaches performed the best and the worst.

From the table, a few interesting observations can be drawn. First, we can observe that the non-Bayesian uncertainty-based approaches were not only computationally efficient, but also consistently produced the best results. On the larger RVL-CDIP dataset, the only two enhanced approaches that performed slightly better than the others were LPL and CEAL (Entropy), with CEAL (Entropy) performing slightly better in the case of heavier annotation noise. It is worth mentioning, however, that from our results on ConvNeXt-B, we also observed difficulty in extending LPL to other models. In contrast, the three approaches BALD (Dropout), AdvBIM, and WAAL consistently performed worse than the others on this dataset. It is interesting to note that both AdvBIM and WAAL are enhanced approaches that require significantly more computational resources than the others, yet they failed to produce any satisfactory results in this case. Overall, we can conclude that for large class-balanced document datasets, the AL strategies that are most practical in terms of both classification performance and computational cost are uncertainty-based approaches such as Entropy and Margin, as well as the enhanced techniques LPL and CEAL (Entropy). While LPL is difficult to train, it can result in slight performance gains. Thus, it is a reasonable choice if adequate training resources are available to tune its hyperparameters. On the other hand, CEAL (Entropy) can be particularly useful when dealing with severe labeling noise.

In the Tobacco3482ImageNet setting, both Bayesian and non-Bayesian uncertainty-based approaches showed the highest performance across different scenarios. While most representation-based approaches were severely affected in the case of bias in the initial labeled dataset, KCenterGreedy fared relatively much better and showed performances similar to uncertainty-based approaches. In contrast, enhanced approaches such as LPL, BAIT, and BADGE were among the worst performing approaches. In addition, BALD (Dropout) and KMeans also performed significantly worse than others in this scenario. Overall, we conclude that for small, imbalanced document datasets without document-specific pretraining, uncertainty-based approaches such as Entropy and Margin as well as the KCenterGreedy (CoreSets) approach are most appropriate to attain both better performance and efficiency.

A similar trend was observed in the Tobacco3482RVL-CDIP scenario, where uncertainty-based approaches outperformed the other approaches. However, because most approaches performed at a similar level in this scenario, it was difficult to identify any significant advantages of some approaches over others. Nevertheless, enhanced approaches such as LPL, BADGE, BAIT, and WAAL were generally among the least effective approaches, with the exception of the noisy datasets, in which WAAL and LPL appeared to be more effective. As a general rule, we again recommend that when using small datasets with document-specific pretraining, the simplest uncertainty-based approaches seem to be the most appropriate option, as they not only provide superior accuracy and computational performance, but also make the training process more convenient.

6 Conclusion

In this paper, we investigated the potential of active learning for reducing annotation costs while enabling machine learning models to perform competitively in document image classification. An analysis of different active learning strategies revealed that, with AL, deep learning models can achieve competitive performance with as little as 40% of the training data labeled by the oracle, thereby reducing annotation costs by up to 60%. Additionally, domain-specific pretraining was shown to significantly enhance AL performance on small datasets, allowing models to outperform those trained on fully annotated datasets with as little as 15% of the data annotated. We also demonstrated that, in comparison with self-supervised learning approaches, AL strategies result in better performance on partially annotated datasets. While enhanced approaches such as LPL and CEAL (Entropy) surpassed the simpler uncertainty- or representation-based approaches on the RVL-CDIP dataset, their performance was severely degraded under class imbalance on the Tobacco3482 dataset. On the other hand, uncertainty-based approaches such as Entropy and Margin performed more consistently on the Tobacco3482 dataset, showing better performance under multiple scenarios of class imbalance, annotation noise, and data bias. Overall, it was observed that class imbalance in the dataset severely affects the performance of various recent SotA techniques such as BADGE, BAIT, and LPL. To address the issue of data imbalance, recently introduced class-balanced active learning approaches [43] may be explored in the future. Another plausible direction for future work is to explore active learning for multi-modal document analysis tasks, where both image and textual data are utilized in the training process.