1 Introduction

In the last decade, deep neural networks have become the dominant technology for image analysis and recognition. Enormous progress has been reported for various tasks such as image classification [1], segmentation [2], and object detection [3]. This observation also applies to historical document image analysis, a research field that is gaining increasing importance, given the tremendous demand from historians and scholars to analyze large sets of digitized documents [4].

For document image analysis applications, layout analysis plays a central role. One of the most important tasks of layout analysis is text line and text block detection, as it is a useful preliminary step for automatic text recognition. Additionally, various scripts and text styles can occur in the same document, making it desirable to classify text lines correctly. Thus, this paper focuses on text line detection and classification in historical documents. An efficient way to address this goal is to consider it a semantic labeling task, which can be solved with deep neural networks of the U-Net family.

The text line detection task has been addressed by many researchers, and respectable results are regularly reported for experiments on datasets produced by the research community, for which enough ground-truth information is available. However, in practice, models trained on these datasets are hardly adaptable to other document classes for which only limited ground-truth annotations are available.

To overcome the above difficulties, we propose a novel approach based on transfer learning, using controlled data for pre-training. Two complementary strategies are investigated: the first uses artificial data to pre-train the networks for the real task; the second pre-trains the networks with a pretext task applied to real data, using self-supervised training.

The artificial data we use are synthetically generated, including the necessary annotations, and can be produced automatically in large quantities. Instead of producing synthetic data that imitate the real data, e.g., using generative adversarial networks [5], we create them with an entirely rule-based approach, where the rules reflect the knowledge the network should learn.

Regarding the pretext tasks applied to real data, we focus on morphological filters that reveal horizontal and vertical block structures. We conjecture that such a task will push the networks to find the relevant text line detection features and classify them according to the vertical context.

To assess the relevance of our strategies, we designed a series of experiments that address the line detection and classification challenges when only a few (in our case, ten) labeled pages are available for training. Our goal is not to achieve the best possible results but to compare the performance of a pre-trained and fine-tuned system to a baseline system trained from scratch. For these experiments, four different U-Net-based network architectures are used. We chose them for their well-known suitability for semantic labeling.

The main observations we report in this paper can be summarized as follows:

  1. In most cases, the pre-trained networks reach comparable or better performance than randomly initialized networks, and they reach it much faster, i.e., after only a quarter of the training epochs. Thus, even when the pre-training step is included, the learning process is globally faster and uses fewer resources.

  2. The performance gain of networks pre-trained with artificial data is larger for smaller network architectures, i.e., networks with fewer trainable parameters.

  3. The best performance is obtained with the smallest network, whose number of trainable parameters is two orders of magnitude smaller than that of the standard U-Net. This network, pre-trained with artificial data, outperforms all other networks.

The remainder of this paper is organized as follows: Sect. 2 presents existing related work. Section 3 describes the task we address. Section 4 presents the neural network architectures used in our experiments. In Sect. 5, we present the data used. Section 6 details the experiments and discusses the obtained results. Finally, our conclusion and future work are presented in Sect. 7.

2 Related work

With the recent rapid growth of deep learning, a wide variety of research on text line detection in historical document image analysis incorporates deep architectures. This section reviews the work most closely related to ours.

Moysset et al. [6] presented a new method for text line detection in document images based on a convolutional neural network (CNN) with multidimensional long short-term memory cells as a regressor. Based on the idea of the YOLO model [7], coordinates of the text lines and confidence scores were predicted. Two regression strategies were compared: directly predicting the coordinates, and predicting bottom-left and upper-right points separately, followed by pairing. To evaluate their method, the authors used the Maurdor database [8], containing unconstrained document images handwritten and/or printed in three languages (French, English, and Arabic).

Chen et al. [9] proposed a simple CNN with only one convolution layer. Text line segmentation was achieved using super-pixel labeling, where each pixel was classified as background, main text, decoration, or comment. The input to the classifier was a patch centered on the super-pixel. For comparison, the presented model was evaluated against conditional random field and multilayer perceptron pixel classifiers. The following datasets were used for the evaluation: G. Washington [10], Parzival [11], St. Gall [12], and DIVA-HisDB [13].

Oliveira et al. [14] described a generic CNN-based system, dhSegment, for addressing multiple tasks, e.g., layout analysis, page extraction, and baseline detection. The system was based on an FCN, where the encoder was a pre-trained ResNet-50 [15], followed by a post-processing block for pixel classification. The page extraction task was evaluated on the cBAD: ICDAR2017 Competition [16] dataset; the same dataset was used for baseline detection. The document layout analysis task aimed to assign each pixel to text regions, decorations, comments, or background and was evaluated on the DIVA-HisDB [13] dataset.

Renton et al. [17] implemented a variant of the U-Net model, using dilated convolutions instead of standard ones, for handwritten text line segmentation in historical document images. The model was applied to x-height labeling. For evaluation, the experiments were run on the cBAD dataset [16].

Grüning et al. [18] introduced ARU-Net, an extended variant of the U-Net model, for text line detection in historical documents. The method comprises two stages: the first labels each pixel as one of three classes (baseline, separator, or other); the second performs a bottom-up clustering to build baselines. ARU-Net has been tested on the cBAD [16] and DIVA-HisDB [13] datasets.

Mechi et al. [19] proposed an Adaptive U-Net model for text line detection in handwritten document images, a variant of the U-Net model [20]. Unlike the original U-Net model, which uses 64, 128, 256, 512, and 1024 filters at each block in the encoder, the Adaptive U-Net model uses 32, 64, 128, 256, and 512 filters. This leads to a reduction of the number of parameters and therefore reduces the memory requirements and the processing time. The presented model was evaluated on three datasets for x-height-based pixel-wise classification of text lines: cBAD [16], DIVA-HisDB [13], and the private ANT dataset.

Boillet et al. [21] presented the Doc-UFCN model, inspired by the dhSegment model [14], for text line detection. The difference between Doc-UFCN and dhSegment lies in the encoder used: the encoder of dhSegment was the ResNet-50 [15] architecture, pre-trained on natural scene images, whereas the encoder of Doc-UFCN was smaller and was fully trained on historical document images. The authors showed that pre-training an FCN model on a few data improved line detection without needing a huge amount of data to obtain good results. The Doc-UFCN model was evaluated on the Balsac [22], Horae [23], READ-BAD [24], and DIVA-HisDB [13] datasets.

Self-supervised learning techniques are common in natural image classification and segmentation to lower the demand for labeled data. Noroozi and Favaro [25] showed that with a simple jigsaw puzzle, a network can learn valuable object representations useful for different downstream tasks. Other simple pretext tasks include predicting the rotation of the image [26], inpainting [27], and colorizing gray-scale images [28]; more complex techniques include contrastive learning [29].

Little research has been done on self-supervised learning in document image analysis; one example is Cosma et al. [30], who translated the jigsaw puzzle approach to document images.

Fig. 1 Illustration of several semantic labeling tasks for an artificial document: a original image, b labeling for regions: drop cap (red), main text (yellow), gloss (cyan), highlight (magenta), and capital (green), c labeling for character strokes: drop cap (red), capital (yellow), and main text, by distinguishing median part (white), ascender (green), descender (cyan), d labeling for text line detection: main text (yellow), highlight (magenta), gloss (cyan) (color figure online)

3 Layout analysis task

Layout analysis aims at identifying and locating regions of interest in document images and is considered an essential step for many document analysis tasks. For instance, automatic text transcription requires isolated text blocks, which are later split into text lines. Drop caps and ornaments, which are useful to determine the document’s origin (location and date), need to be located and isolated. Forms, formulas, tables, or music scores must also be identified for other specialized analysis tasks.

There are several approaches to detecting regions of interest. Often a rectangular bounding box is expected; however, for complex shapes, a more precise description is usually required. Pixel labeling provides the most accurate way to delimit regions of interest. Additionally, various definitions of semantic labeling exist, depending on the analysis task.

In our research, we consider several semantic pixel labeling tasks:

  • Region classification: detection of dominant blocks such as highlight, main text, gloss, drop cap, initial, and other ornaments;

  • Typographic labeling: separation of text and non-text content, and classification of text strokes into the median zone (x-height), ascenders, and descenders;

  • Text line detection and classification: a combination of text line location and labeling for which we distinguish the classes highlight, main text, and gloss.

Figure 1 shows an example of a document with these three different labelings.

This paper focuses on text line detection and classification by semantic labeling. This task is of high interest not only because of its relevance for practical applications (automatic transcription, word spotting, script analysis, etc.) but also because of the required knowledge a neural network should learn by combining low-level features (stroke analysis) with high-level structural information of the global page.

4 Network descriptions

In the literature, classical fully convolutional network (FCN) architectures, variants of the classical deep CNN with convolutional layers but without dense layers, have significantly improved semantic segmentation tasks [31]. This design brings the major advantages of greatly reducing the number of parameters and handling images of variable input size. In the last five years, several FCN variants have exceeded the state of the art in text line detection [18]. Given this success, it is natural to apply these models to challenging historical document images.

In this section, we investigate four FCN architectures as references: the vanilla U-Net (U-Net-S) [20] and three variants of U-Net, named U-Net-16, Adaptive U-Net [19], and Doc-UFCN [21]. Table 1 lists the number of parameters of each architecture.

4.1 Standard U-Net and U-Net-16

The U-Net-S architecture, proposed by Ronneberger et al. [20], was initially designed for the semantic labeling of medical images. Recently, increased interest in applying U-Net-S to historical document image analysis tasks has been observed (see Sect. 2). The architecture operates as an encoder-decoder model: the encoder represents the downsampling (contracting) path, and the decoder the upsampling (expansive) path. The outputs of the downsampling path are forwarded to the upsampling path through horizontal skip connections between layers of the same level. This enables a more straightforward combination of high-level and low-level features and reduces the vanishing gradient problem [32]. The U-Net-S architecture is depicted in Fig. 2.

Table 1 Number of parameters of the different FCN architectures

Since we aim to minimize the networks, we propose a further optimization concerning the number of filters in the encoder path of U-Net-S. Unlike the standard U-Net-S model, which uses 64, 128, 256, 512, and 1024 filters at successive blocks, our variant uses a constant number of 16 filters at each block.
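For illustration, the following PyTorch sketch shows one way to build such a constant-width U-Net. The block composition (two 3×3 convolutions with ReLU per block, a depth of five levels) is our assumption and may differ in detail from the actual implementation; with these choices, the parameter count is of the same order as the figure reported in Table 1.

```python
# A minimal sketch of the U-Net-16 idea: standard U-Net topology with a
# constant 16 filters in every block. Layer details are assumptions, not
# the exact implementation used in our experiments.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the original U-Net blocks.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet16(nn.Module):
    def __init__(self, in_ch=3, n_classes=4, width=16, depth=5):
        super().__init__()
        chans = [in_ch] + [width] * depth
        self.encoders = nn.ModuleList(
            [conv_block(c_in, c_out) for c_in, c_out in zip(chans, chans[1:])]
        )
        self.pool = nn.MaxPool2d(2)
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(width, width, 2, stride=2) for _ in range(depth - 1)]
        )
        self.decoders = nn.ModuleList(
            [conv_block(2 * width, width) for _ in range(depth - 1)]  # 2x: skip concat
        )
        self.head = nn.Conv2d(width, n_classes, 1)  # per-pixel class logits

    def forward(self, x):
        skips = []
        for enc in self.encoders[:-1]:
            x = enc(x)
            skips.append(x)          # keep features for the skip connection
            x = self.pool(x)
        x = self.encoders[-1](x)     # bottleneck
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)
```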

4.2 Adaptive U-Net

Adaptive U-Net [19] is another variant of the original U-Net, designed for the text line detection task on historical document images. The features extracted along the contracting path are forwarded to the expanding path through horizontal skip connections. Unlike the original U-Net model, which uses 64, 128, 256, 512, and 1024 filters at each block in the contracting path, the Adaptive U-Net model uses 32, 64, 128, 256, and 512 filters. Each convolution block in the contracting path is followed by max-pooling, and the tensor passed to the expanding path forms the bottleneck. The expanding path consists of four convolution blocks, each composed of two simple convolutions followed by a transposed convolution.

4.3 Doc-UFCN

Doc-UFCN [21] is a state-of-the-art method for the text line detection task on historical document images. It is a U-shaped FCN composed of a contracting path, an expanding path, and a last convolution layer for classification. The contracting path is composed of four dilated blocks; each block consists of five dilated convolutions with dilations 1, 2, 4, 8, and 16, respectively. The blocks have 32, 64, 128, and 256 filters, respectively. A max-pooling layer follows each block except the last one. The expanding path consists of three convolution blocks, each represented by a simple convolution followed by a transposed convolution.
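A sketch of one such dilated block is given below; only the dilation rates and filter counts follow the description above, while the normalization and activation choices are our assumptions.

```python
# A sketch of a Doc-UFCN-style dilated block: five 3x3 convolutions with
# dilations 1, 2, 4, 8, 16. BatchNorm/ReLU are assumed, not confirmed.
import torch.nn as nn

def dilated_block(in_ch, out_ch):
    layers = []
    for d in (1, 2, 4, 8, 16):
        layers += [
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d),  # padding=d keeps size
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
        in_ch = out_ch
    return nn.Sequential(*layers)

# Contracting path as described: four blocks with 32, 64, 128, 256 filters.
encoder_blocks = [dilated_block(c_in, c_out)
                  for c_in, c_out in zip((3, 32, 64, 128), (32, 64, 128, 256))]
```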

Fig. 2 The original U-Net architecture as proposed by Ronneberger et al. [20]

5 Datasets

To investigate the performance of the four neural network architectures, our experiments are carried out on two types of data: real and artificial data. In this section, we present a qualitative description of these data.

5.1 Real data

The real data we use in our experiments are selected from the well-known DIVA-HisDB [13]. DIVA-HisDB is a historic manuscript dataset consisting of three subsets of medieval manuscripts: CSG18, CSG863, and CB55, with challenging layouts, diverse scripts, and degradations. In our work, we focus on the CB55 and CSG18 manuscripts.

Fig. 3 Two example pages of the CB55 and CSG18 real datasets with their ground truth, shown from left to right. In the ground-truth images, black represents the background, yellow the main text, cyan the glosses, and magenta the highlights (color figure online)

Fig. 4 An example of the creation of the ground truth produced with morphological filters. a Shows an original image from the CB55 dataset. In image b, we see the result of applying a \(1\times 45\) pixel filter on the binarized blue channel of the original image. c Is similar to b, but the filter size is \(25\times 25\). By using b as the binary mask of the red channel and c as the binary mask for the green channel, we get image d (color figure online)

The CB55 manuscript, Codex Guarneri, was written in the first half of the 14th century in Italy [33]. It consists of the Purgatorio and Inferno from Dante’s Divina Commedia. The CB55 manuscript is written in chancery script and comprises marginal Latin glosses. The CSG18 composite manuscript, the astronomical clock of Pacificus of Verona, was written from the 9th to the 12th centuries in St. Gallen [33]. It consists of a collection of liturgical works containing texts and an illustration of Pacificus of Verona’s star clock. Figure 3 shows some samples of CB55 and CSG18 images with their labels.

Our CB55 dataset is composed of 120 pages (40 labeled and 80 unlabeled). The images have been cropped and resized to \(960\times 1344\) pixels. This corresponds to a medium resolution considered appropriate to capture the layout structure of entire pages with sufficient precision. For our experiments, we created the ground truth for 40 pages. We created a new, more balanced split in which we reserved 20 images exclusively for testing; of the remaining 20 pages, 10 were used for training and 10 for validation. The two left images of Fig. 3 show an example of the CB55 dataset with the ground truth.

The CSG18 dataset consists of 40 pages from the original DIVA-HisDB dataset. We cropped parts of the black border and resized the images to \(960\times 1440\) pixels. With these 40 pages, we created a more balanced split in which we used 10 pages for training, 10 for validation, and 20 for testing. The two right images in Fig. 3 show an example of the CSG18 dataset and the generated ground truth.

We created the ground truth for both datasets, CB55 and CSG18, in a semi-automatic way, using the baselines from the existing DIVA-HisDB XML ground truth and adding an average x-height for text lines, glosses, and highlights. Afterward, we inspected and corrected these x-heights by hand, fixing wrong orientations and removing noise such as page numbers or bleed-through.
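The sketch below illustrates the automatic part of this step, assuming PAGE-style XML with baseline polylines; the namespace, the fixed x-height value, and the single-class simplification are placeholders, and the subsequent manual corrections are not reproduced.

```python
# A hedged sketch of the semi-automatic labeling step: draw a band of average
# x-height above each baseline read from a PAGE-style XML file. Schema details
# and parameters are illustrative assumptions.
import xml.etree.ElementTree as ET
from PIL import Image, ImageDraw

NS = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}
MAIN_TEXT = (255, 255, 0)  # yellow, as in our ground-truth images

def render_ground_truth(xml_path, width, height, x_height=20):
    gt = Image.new("RGB", (width, height), (0, 0, 0))  # black background
    draw = ImageDraw.Draw(gt)
    root = ET.parse(xml_path).getroot()
    for line in root.iter("{%s}TextLine" % NS["p"]):
        baseline = line.find("p:Baseline", NS)
        if baseline is None:
            continue
        pts = [tuple(map(int, p.split(","))) for p in baseline.get("points").split()]
        # In the real data, the class (main text / gloss / highlight) comes
        # from the surrounding region; we assume main text for simplicity.
        band = pts + [(x, y - x_height) for x, y in reversed(pts)]
        draw.polygon(band, fill=MAIN_TEXT)
    return gt
```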

5.2 Data for the pretext task

The pretext task we have chosen for self-supervised pre-training has been designed to reveal the natural structures of text, which consist of text lines and text blocks. To emulate these structures, we were inspired by morphological image processing filters that have proven their effectiveness in layout analysis.

Our experiments use a combination of two filters, each providing a binary image. As input, we use the result of Otsu binarization [34] applied to the blue image channel, which, by observation, provides the best contrast. Both filters consist of an opening operation followed by a closing operation with the same parameters. For CB55, the first filter uses a mask of size \(45\times 1\) (see Fig. 4b), and the second uses a square block mask of \(25\times 25\) pixels (see Fig. 4c). On these binary images, we remove the border by cropping 30 pixels from the left and right sides and 60 pixels from the top and bottom; the cropped-away parts are then set to value 1. The colors are obtained by using the result of the first filter as the red channel, the result of the second filter as the green channel, and zeroing the blue channel. Figure 4d shows the result of this workflow. The obtained result can be considered a four-class pixel labeling task.
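A compact OpenCV sketch of this workflow follows; the parameters match the text, while the binarization polarity and the exact API usage are our assumptions.

```python
# A sketch of the pretext-task label generation for CB55: Otsu-binarize the
# blue channel, apply two open-then-close morphological filters (a 45x1 line
# mask and a 25x25 block mask), set a border margin to 1, and stack the two
# masks into the red and green channels.
import cv2
import numpy as np

def pretext_labels(bgr_image):
    blue = bgr_image[:, :, 0]  # OpenCV channel order is B, G, R
    _, binary = cv2.threshold(blue, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    def open_close(img, w, h):
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (w, h))
        opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
        return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

    lines = open_close(binary, 45, 1)    # reveals horizontal line structure
    blocks = open_close(binary, 25, 25)  # reveals block structure
    for m in (lines, blocks):            # replace the border with value 1
        m[:60, :] = 1
        m[-60:, :] = 1
        m[:, :30] = 1
        m[:, -30:] = 1
    target = np.zeros_like(bgr_image)
    target[:, :, 2] = lines * 255        # red channel (index 2 in BGR)
    target[:, :, 1] = blocks * 255       # green channel; blue stays zero
    return target
```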

Fig. 5 Samples of artificial images used in our experiments. a, c, d Realistic document structures for the CB55 dataset with the fonts Carolus, Carolus, and Dummy, respectively. b Realistic document structure for the CSG18 dataset with the font Papyrus

5.3 Artificial data

As already explained, one of our learning strategies relies on artificial data. Unlike most other approaches published in the scientific literature, our synthetically generated documents do not necessarily try to imitate real documents; instead, they are designed to represent the rules we believe the network should learn. That is why we introduce the term controlled data for pre-training and speak of artificial data instead of synthetic data. Their design was mainly guided by a series of initial experiments analyzing the generalization behavior of deep learning.

First, we use a set of realistic documents with appropriate fonts and a layout reflecting the CB55 target dataset’s main characteristics: two ink colors and two different sizes of scripts. Additionally, we developed another set where the text was replaced by abstract geometrical shapes that reveal the typographic structure of text lines: a median zone with dense horizontal and vertical strokes, bordered by ascender and descender zones composed of much sparser vertical strokes. We call this dummy text. Finally, we designed another document class with an artificial layout and an over-sampled representation of specific item classes, such as highlights and capitals. We simulated page borders for left- and right-sided pages. All parameters, such as margins, column width, or text line height, are slightly randomized to capture a reasonable variability.

To produce these artificial data with various ground truths, we proceed as follows. First, we generate a template as an image with 27 classes represented by indexed colors. All these classes help generate data for many different structural analysis tasks, including ornament and block classification or character stroke labeling (with ascenders and descenders) (see Fig. 1).

From these templates, we automatically generate the original pages and their corresponding ground truth. Figure 5 illustrates a set of samples of such documents used in our experiments.
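Since every page is generated from the same indexed-color template, ground truth for any task can be obtained by remapping template indices to task classes. The sketch below shows one way this remapping could look; the index values in the mapping are invented for illustration.

```python
# A hedged sketch: derive task-specific ground truth from a 27-class template
# by remapping template indices to the four line-detection classes.
import numpy as np

# Hypothetical template indices -> line classes (0 = background,
# 1 = main text, 2 = gloss, 3 = highlight); everything else is background.
TEMPLATE_TO_LINE_CLASS = {5: 1, 6: 2, 7: 3}

def template_to_line_gt(template):
    gt = np.zeros_like(template)
    for t_idx, cls in TEMPLATE_TO_LINE_CLASS.items():
        gt[template == t_idx] = cls
    return gt
```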

6 Experiments

In this section, we conduct a set of experiments targeting the text line detection and classification task, where the classification is done pixel-wise (semantic labeling). The goal is to locate the median zone of text lines while distinguishing three classes: highlight, main text, and gloss. Our objective is to verify and measure to what extent pre-training, with either artificial data or a pretext task, contributes to a general improvement of the final training.

We evaluate our work on two datasets (CB55 and CSG18) and four network architectures (U-Net-S, U-Net-16, Doc-UFCN, and Adaptive U-Net). Our experiments focus on finding, among the four architectures, the neural network that offers the best compromise between performance and computational cost. We want to show the impact of the controlled pre-training data on the detection results while keeping the network small and the training time short. Additionally, we perform preliminary experiments with the morphological ground truth on two networks (U-Net-S and U-Net-16) with the CB55 dataset.

As a baseline, we evaluate the four neural network architectures from scratch using the labeled datasets. Then, we pre-train the neural networks on the synthetic documents (Synth) and the morphological ground truth (Morpho). Afterward, the pre-trained models (as a whole) are fine-tuned on the labeled datasets.

6.1 Metrics

To report the results obtained for the network architectures, we compute the following pixel-level metrics: precision (P), recall (R), F1-score (F), and macro-intersection over union (IoU), defined by Eqs. (1), (2), (3) and (4).

$$\begin{aligned} \textrm{Precision}&=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}} \end{aligned}$$
(1)
$$\begin{aligned} \textrm{Recall}&=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}} \end{aligned}$$
(2)
$$\begin{aligned} \text {F1-score}&=\frac{{2}\times {\textrm{TP}}}{{{2}\times {\textrm{TP}}}+\textrm{FP}+\textrm{FN}} \end{aligned}$$
(3)
$$\begin{aligned} \textrm{IoU}&=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}+\textrm{FN}} \end{aligned}$$
(4)

where:

TP: number of positive pixels correctly predicted;
FP: number of negative pixels predicted as positive;
FN: number of positive pixels predicted as negative.
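For concreteness, a small NumPy sketch of these per-class counts and the macro-averaged scores follows; taking the unweighted mean of the per-class IoU values as the macro-IoU is our reading of the definition.

```python
# A sketch of Eqs. (1)-(4) on integer label maps: per-class TP/FP/FN counts,
# then the unweighted (macro) mean over classes.
import numpy as np

def pixel_metrics(pred, target, n_classes=4):
    per_class = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        per_class.append((p, r, f1, iou))
    return np.mean(per_class, axis=0)  # macro precision, recall, F1, IoU
```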

6.2 Pre-training strategies

In this section, we investigate the influence of different synthetic datasets on the pre-training strategies. Each pre-training dataset consists of 60 pages for training and 30 pages for validation. The first set, termed SetD, comprises document images with Dummy text. The second set, SetF, combines Lorem ipsum text printed with two historical-looking fonts: Papyrus and Carolus. SetM is a mix of the Dummy, Carolus, and Papyrus texts. Table 2 contains the details of the training and validation sets of the three pre-training datasets.

We pre-train the U-Net-16 on the three pre-training datasets, and the obtained models are then fine-tuned on the real data containing 10 documents for training, 10 documents for validation, and 20 documents for testing. The purpose of this experiment is to find the best pre-training dataset to be used in the following experiments. As we can see in Table 3, SetM is the best-performing pre-training dataset.

Table 2 Details of the pre-training strategies
Table 3 Results obtained on the different pre-training strategies (CB55)

6.3 Experimental corpora

To analyze the performances of the four neural network architectures in this work, our experiments have been carried out on two different datasets: CB55 and CSG18 [13]. Each dataset contains 40 real document images: 10 for training, 10 for validation, and 20 for testing.

For each dataset, synthetic data are generated for the pre-training strategy. Each synthetic dataset contains 180 document images: 120 for training and 60 for validation. We chose a multiple of 20 (4 GPUs \(\times \) 5 batch size) for this split to take full advantage of the hardware and hyperparameters we are using.

The morphological dataset consists of 80 images randomly taken from the original e-codices CB55 book. Sixty of these images were used for training and 20 for validation. We ensured that none of the real test, training, or validation images were part of this dataset.

Table 4 Results obtained by the fine-tuned models, pre-trained on synthetic data, in comparison with the training from scratch (baseline) on CB55 dataset
Table 5 Results obtained by the fine-tuned models, pre-trained on synthetic data, in comparison with the training from scratch (baseline) on CSG18 dataset

6.4 Evaluation protocol

All of the experiments were done with the PyTorch-Lightning-based [35] DIVA-DAF framework [36]. They were performed on a GPU cluster with 4\(\times \) NVIDIA RTX 2080 Ti GPUs, 378 GB RAM, and 4\(\times \) Intel Xeon 6142 CPUs. The models were trained with a batch size of five. The ADAM optimizer [37] is used with the cross-entropy loss, an epsilon of \(10^{-5}\), and a learning rate of \(10^{-3}\). The best model was chosen based on the maximum IoU value on the validation set for the artificial data and on the minimum validation loss for the morphological data. We use all four GPUs during pre-training and two GPUs when fine-tuning, which ensures that all GPUs in use always run with a full batch.
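The following plain-PyTorch sketch condenses this setup (Adam with a learning rate of \(10^{-3}\) and an epsilon of \(10^{-5}\), cross-entropy loss, batch size five, pre-training followed by whole-model fine-tuning). It stands in for, and is not, the actual DIVA-DAF code; random tensors replace the real datasets.

```python
# A condensed stand-in for the training setup described above; the loop is
# plain PyTorch rather than the DIVA-DAF / PyTorch-Lightning implementation.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = UNet16(n_classes=4)  # from the sketch in Sect. 4.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Dummy stand-ins with the right shapes (images, per-pixel class indices).
data = TensorDataset(torch.randn(10, 3, 256, 256),
                     torch.randint(0, 4, (10, 256, 256)))
pretrain_loader = finetune_loader = DataLoader(data, batch_size=5, shuffle=True)

def run_epochs(loader, epochs):
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(images), targets).backward()
            optimizer.step()

run_epochs(pretrain_loader, epochs=100)           # pre-training phase
torch.save(model.state_dict(), "pretrained.pt")
run_epochs(finetune_loader, epochs=100)           # fine-tune the whole model
```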

For training from scratch, the baseline results are obtained by training the different neural networks for 800 epochs on the real data. The models achieving the highest validation IoU were saved and then used to predict the test set. The reported results are the trimmed mean (removing the best and the worst result) and the standard deviation over five runs.

For pre-training with artificial data, both the pre-training and the fine-tuning run for 100 epochs. We tested two configurations of multiple runs: the first runs a model once for pre-training and five times for fine-tuning; the second runs a model five times for pre-training and once for fine-tuning. Both configurations give almost the same results, so we chose the first one, which consumes less execution time. For each dataset, the trimmed mean and standard deviation are calculated. The results of the U-Net-16, Doc-UFCN, Adaptive U-Net, and U-Net-S architectures are reported on the CB55 and CSG18 datasets.

For the morphological strategy, the results are reported on the CB55 dataset for the U-Net-16 (lightest network) and U-Net-S (heaviest network) models. We performed three training runs of 200 epochs for each network on the morphological data and chose the best one based on the cross-entropy loss. The difference in the number of epochs between the morphological and synthetic pre-training stems from the different dataset sizes. Each network is then fine-tuned in the same fashion as for the synthetic data (100 epochs).

6.5 Quantitative evaluation

The four neural networks have been pre-trained on synthetic data and fine-tuned on the CB55 and CSG18 real datasets; the results are reported in Tables 4 and 5, respectively. The morphological pre-training has been done only for U-Net-16 and U-Net-S on the CB55 dataset; the results are reported in Table 7. In all these tables, the best F1-score and IoU values are shown in bold.

Table 6 Obtained results per class from scratch and fine-tuned strategies with the four neural networks

6.5.1 Pre-training strategy using artificial data

Generally, we observe a clear overall performance improvement when using pre-training with the U-Net-16 and Adaptive U-Net architectures. The U-Net-16 architecture achieves the best F1-score and IoU, which are boosted by 18.05 and 20.09% on the CB55 dataset and by 8.93 and 8.25% on the CSG18 dataset, respectively. For the U-Net-S architecture, in contrast, the baseline slightly outperforms the fine-tuned model, by 0.57% F1-score and 0.73% IoU on average over the two datasets. The Adaptive U-Net obtains the third-best performance, slightly behind U-Net-S by 1.34% F1-score and 1.00% IoU on the CB55 dataset and by 0.62% F1-score and 0.36% IoU on the CSG18 dataset. The Doc-UFCN, by contrast, does not benefit from the pre-training and has the worst results on both datasets. This architecture is dedicated to two-class text line segmentation (background and text), which may explain its difficulty with the four-class classification.

Table 6 compares the pre-training results with the baseline to show the impact per class. Among the five runs, we chose the one with the maximum IoU. We observe that the pre-training impacts the highlight class more than the other classes (background, main text, and gloss), which can be explained by the small amount of data available for the highlight class. This class is completely unrecognized by U-Net-16 in the baseline, whereas with fine-tuning it achieves the highest F1-score and IoU, at 84.21 and 72.72%, respectively. For the Doc-UFCN, the baseline outperforms the fine-tuning in all classes except the highlight, where the F1-score and the IoU are boosted by 7.89 and 9.23%, respectively. For the Adaptive U-Net, fine-tuning improves the results slightly for the background, main text, and gloss classes, while for the highlight class the F1-score and the IoU are boosted by 5.76 and 6.03%, respectively. For the U-Net-S, we obtained almost the same results for both the baseline and fine-tuning strategies, except for the highlight class, where we observed a degradation of the results.

Table 7 Results obtained by the fine-tuned models, pre-trained on morphological data, in comparison with the training from scratch (baseline) on CB55 dataset
Table 8 Run time for each neural network on CB55 with 10 training samples

6.5.2 Morphological pre-training strategy

As shown in Table 7, the U-Net-S architecture achieves the best performance, with an 89.64% F1-score and an 81.71% IoU on the morpho dataset. Although the U-Net-16 results are behind those of U-Net-S, they are significantly impacted by the morphological pre-training, which boosts the F1-score and the IoU by 6.74 and 6.27%, respectively. Because of time constraints, we could not conduct experiments on other datasets or network architectures.

6.5.3 Computational cost

Recall that the U-Net-16 model has only 53\(^{\prime }\)908 parameters, whereas U-Net-S, Doc-UFCN, and Adaptive U-Net have 31\(^{\prime }\)043\(^{\prime }\)716, 4\(^{\prime }\)096\(^{\prime }\)322, and 7\(^{\prime }\)760\(^{\prime }\)130 parameters, respectively. Dealing with such a low number of parameters leads to a dramatic reduction in computational complexity, memory requirements, and training time. Table 8 shows the run time (including inference time) for the baseline, pre-training, and fine-tuning of each evaluated neural network. We observe that the run times and the numbers of parameters are largely congruent, except for Doc-UFCN; its dilated convolutions increase computation without increasing the number of parameters. The U-Net-16 architecture, having the lowest number of parameters, also has the lowest computation time, more than two times faster than the U-Net-S.

6.5.4 Observations

The following observations are derived from the obtained results, weighing the achieved performance against the computation time.

  • In most cases, we observe an improvement in performance when the network was pre-trained, compared to the baseline;

  • The best results with controlled data for pre-training are obtained with the U-Net-16 model having a low number of parameters;

  • The morphological pre-training strategy has a positive impact on performance;

  • The highest impact of controlled data for pre-training is achieved when combining different artificial text types (SetM);

  • Fine-tuning a pre-trained model decreases the training time and improves the model’s performance;

  • When computation time is taken into consideration, the U-Net-16 model has lower complexity than the other models while offering comparable performance;

  • Evaluating and comparing the four neural network models based on the F1-score and IoU leads to choosing U-Net-16 as the best model, offering the best trade-off between performance and computational time.

6.6 Qualitative evaluation

Figure 6 illustrates the best predictions of the pre-trained U-Net-16 model (Fig. 6b), the U-Net-S trained from scratch (Fig. 6c), and the pre-trained U-Net-S (Fig. 6d) on the CB55 dataset. The standard U-Net suffers from several partially detected lines and other misclassified ones, and the highlight class is not well recognized under either the from-scratch or the pre-training strategy. Conversely, the U-Net-16 detects all lines and makes fewer classification mistakes than the U-Net-S; the highlight region is completely and correctly detected, except for a small number of misclassified pixels.

Fig. 6 Illustration of the qualitative results on the CB55 dataset: a Original image, b U-Net-16 pre-trained with synthetic data, c U-Net-S from scratch, d U-Net-S pre-trained with synthetic data

The results obtained from the qualitative and quantitative evaluations confirm our hypothesis that small networks, with fewer parameters, are well suited to the pre-training strategy when only a few real labeled samples are available. They achieve competitive results without being more time-consuming or losing information.

7 Conclusion

In this paper, we have introduced the concept of controlled data used for pre-training neural networks with the aim of boosting their performance after fine-tuning. Such a methodology is particularly interesting when only a few labeled data are available.

Two different strategies have been investigated for pre-training: first, the use of artificial data applied to the real task, and second, the use of a pretext task with real data. The artificial data created for this purpose were not inspired by classical data synthesis, but were designed to reflect the rules we believe the network should learn. For the pretext task, the idea was to teach the network relevant features already known by the community to be efficient for layout analysis.

Based on our experiments, we can summarize the findings as follows: using controlled data to pre-train neural networks reduces the resources needed to fine-tune the models while achieving comparable or better results.

The complexity of the neural network architecture seems to have an impact. According to our experiments, the benefit of artificial data is more important for small networks with fewer parameters than for larger networks. We even obtained the best result with the smallest network.

This last observation invites us to investigate the behavior of small networks further. The minimization of resources has become an important research topic, and it is legitimate to wonder if controlled data for pre-training could contribute to this momentum.

Currently, our training strategies have been validated for one task and on two data sets. However, we have some evidence that they can be generalized; therefore, in the future, we plan to consider more tasks and more complex data sets.