1 Introduction

Extensive collections of historical manuscripts holding important notarial documents occupy vast lengths of archive shelving worldwide. Many of these collections have been digitized into high-resolution digital images.

Typically, these images are organized sequentially into archival units such as folders, books, or boxes, here called “image bundles”. Each bundle can encompass thousands of individual page images, sequentially organized into several, often many, “image documents”, also known as “files”, “acts”, or, specifically for the notarial documents considered in this work, “deeds”.

For massive series of documents of this kind, it is often unfeasible for archives to provide detailed metadata that accurately describes the content of each bundle. In particular, when metadata is available at all, it often lacks information about the specific location of individual deeds within a bundle. Automated solutions are needed to assist archival specialists in the arduous task of cataloging these expansive series. Segmenting bundles into distinct deeds is one of the preliminary phases of this operation, and it is the primary focus of the present study.

Historically, efforts to address analogous challenges originated in the field of Layout Analysis (LA). This domain encompasses many document analysis tasks, from line detection and page layout segmentation to document understanding.

Numerous line detection and extraction techniques, a basic step for Handwritten Text Recognition (HTR) systems, rely on text baseline detection methods [6, 25] that employ diverse strategies. Some recent studies utilize convolutional networks [4] or employ encoder–decoder designs to concurrently achieve layout segmentation and line detection [3, 18, 27]. Others incorporate a spatial attention mechanism at various resolutions into existing architectures [9].

Region Proposal Networks (RPNs) are also used for single-page LA. Tools like Mask R-CNN [12] have proven effective for complex layout segmentations, as demonstrated in Refs. [2, 25]. Notably, in Ref. [1], the same methodology addresses the challenges posed by closely spaced lines, facilitating data extraction from tables.

Furthermore, the LA domain has achieved considerable advancements by implementing transformer-based designs, which occasionally integrate HTR. For instance, the DONUT approach introduced in Ref. [14] offers document understanding by analyzing entire pages in an end-to-end fashion. Another solution, DAN [7], introduces a segmentation-free model that determines the logical layout and simultaneously recognizes textual content without necessitating geometrical data from LA.

The interest in these kinds of holistic approaches notwithstanding, none of these works consider text-image processing beyond the realm of a single-page image, which is the problem we face here.

Automatic classification techniques have been developed for image documents (deeds) to determine their specific typological categories, like “Letter of Payment” or “Will”, with encouraging results documented in Refs. [8, 21, 23, 30]. However, these studies assume that the successive page images of each deed are already given. In contrast, real-world scenarios present deeds nestled within bundles, without an explicit separation of the page images each deed comprises.

Therefore, to provide practical solutions for the automated documentary management of these important series of manuscripts, a pending problem is how to segment a large bundle into its constituent deeds. This is the task considered in the present work. Previous approaches to this problem exist, but they remain rather tentative. In studies like Refs. [5, 6] and [20], the authors introduce text-line-oriented, page-level segmentation techniques using Hidden Markov Models (HMMs) and recurrent CTC-based systems to identify the deeds’ starting, middle, and ending sections within each page image. Further, Tarride et al. [29] outline a complete workflow for data extraction from registers. However, these studies do not report results on segmenting entire bundles, and they lack methodologies to ensure consistent detection throughout a whole bundle.

In this paper, we propose a pipeline based on a simple neural network that obtains the posterior probabilities of each image over a closed set of classes: initial, middle, and final. Then, relying on the sequential order of the book’s images, we apply a decoding approach that exploits the global book context over these image posteriors to segment the entire book. The book segmentation is evaluated in terms of two different metrics, which clearly show that far better results are obtained when considering the global book context rather than local image classification. One of the main advantages of this pipeline is that it does not require any kind of textual transcripts of the images for training; the cost of producing reference transcripts would be prohibitive for this task in practice. Unlike other approaches such as [7, 14], our method also does not require large amounts of training data or the creation of any kind of synthetic data. It is also worth mentioning that this pipeline is computationally lightweight and can be trained on a GPU with limited memory (6 GB or less).

This work extends the research started in Ref. [22]. Here, a detailed formalization of the problem and new ways of evaluating the results are proposed. Furthermore, we present more comprehensive experiments, with a larger and more complete dataset.

The presentation of our work is organized as follows. Section 2 describes the historical documents considered. Section 3 explains in detail the technical problem entailed by bundle segmentation and the solutions we propose. Section 4 introduces the metrics used to assess bundle segmentation results. Section 5 provides dataset and empirical setting details. Section 6 presents the experiments carried out and the results achieved. Finally, Sect. 7 summarizes the work and offers some final remarks.

2 The JMBD series of notarial record manuscripts

For this study, we have worked on a portion of the 16,849 “notarial protocol books” which are more than three hundred years old and are preserved in the Provincial Historical Archive of Cadiz (AHPC, by its acronym in Spanish). Each protocol book (or “bundle”) contains more than eight hundred pages, organized into two hundred and fifty deeds on average.

The portion considered in this work is a series of four bundles referred to as JMBD, which contain deeds written by the notary Jose Manuel Briones Delgado between 1712 and 1726. The deeds of each bundle are arranged chronologically, preceded by an onomastic and topographic index (about fifty pages) which has not been used in the experiments presented in this paper. Figure 1 shows a typical JMBD deed.

Fig. 1 A four-page deed from the JMBD4949 bundle of the AHPC

For the JMBD series, the deeds are page-aligned. Each deed always begins on a new recto page and can contain from one to dozens of pages, some of which may be almost blank or without any textual content. The first and last pages of each deed are often visually identifiable because of slight layout differences compared to other pages. However, separating the deeds of each bundle is not straightforward and remains a challenging task. The main difficulty is the similarity many regular pages share with the beginning or ending pages. Various “data-centric” and rule-based techniques have been attempted, but none of them have worked sufficiently well. Additionally, another feature of this collection (and of most others) is the lack of textual transcripts. The deed in Fig. 1 showcases the starting, ending, and two intermediary page images.

While these details might appear very specific to these manuscripts, it is important to highlight the sheer magnitude of this series, which warrants the development of ad hoc methods. Furthermore, innumerable notarial record series exhibit a similar bundle structuring, especially where each deed begins on a fresh page. However, it is pertinent to point out that other notarial record series exist where the deeds are not aligned at the page level. Among them, we can mention Chancery (HIMANIS) [20], Oficio de Hipotecas de Girona (OHG) [26, 28], and The Cartulary of the Seigneury of Nesle (Nesle) [13]. Though these particular manuscript types are not the focus of our current study, some of the concepts and methods introduced here might also be useful when addressing challenges associated with these kinds of bundles.

3 Problem statement and proposed approaches

The bundle segmentation problem is formalized in this section, along with the approaches we propose, based on individual page-image classification and whole-bundle consistency modeling.

3.1 Problem formalization

Let \(B=D^1,\dots ,D^K\) be a bundle which sequentially encompasses K deeds. Each deed \(D^k\), in turn, is a sequence of \(M(k)\!\ge \!{2}\,\) page images, denoted as \(D^k=G^k_1,\dots ,G^k_{M(k)}\). In the deed segmentation task, a bundle B is given just as a plain, unsegmented sequence of N page images \(B\!=\!G_1,\dots ,G_N\) and the problem is to find \(K\!+\!1\) boundaries \(b_k\), \(0\!\le \!k\!\le \!{K}\), such that \(b_0\!\!=\!0,\,b_{k-1}\!\!<\!b_k,\,b_K\!=\!N\), and B becomes described as a sequence of deeds \(D^1,\dots ,D^K\), where \(D^k=G_{b_{k-1}+1},\dots ,G_{b_k}\) and \(M(k)=b_k-b_{k-1}\), \(1\le {k}\le K\).

As discussed in Sect. 2, all the deeds within a bundle start on a new page and also end with a full page, even if part of the page is unused. Initial and final deed pages will be referred to as “Initial” (I) and “Final” (F), respectively. All the pages between I and F will be referred to as “Mid” (M). Let \(\mathcal {C}{{\,\mathrm{{\mathop {=}\limits ^{\tiny {\text {def}}}}}\,}}{\{\textsf {I},\textsf {M},\textsf {F}\}}\) be the set of page classes or labels. A sequence \(c_1,\ldots ,c_N,~c_j\in \mathcal {C},~1\le {j}\le {N}\) of page labels that follows these rules is said to be consistent.

Given a label sequence \(c_1,\ldots ,c_N\), the corresponding segmentation is readily obtained by successively setting the boundaries \(b_k\) to the positions of the pages labeled with F, as outlined by the following pseudocode:

$$\begin{aligned} b_0\!:=\!k\!:=\!0;\quad \text {for}\;(j\!:=\!1\ldots N)\;\text {if}\,(c_j\!=\!\textsf {F})\;\{k\!:=\!k\!+\!1;\, b_k\!:=\!j\};\quad K\!:=\!k \end{aligned}$$
(1)
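For concreteness, the following minimal Python sketch implements this label-to-boundary conversion; the function name is ours.

```python
def labels_to_boundaries(labels):
    """Turn a consistent I/M/F page-label sequence into deed boundaries
    b_0, ..., b_K, following Eq. (1): a boundary is set at every page
    labeled F (1-based page positions)."""
    boundaries = [0]                        # b_0 = 0
    for j, c in enumerate(labels, start=1):
        if c == "F":
            boundaries.append(j)            # b_k := j
    return boundaries                       # K = len(boundaries) - 1

# Example: a 6-page bundle made of a 4-page deed followed by a 2-page deed
print(labels_to_boundaries(["I", "M", "M", "F", "I", "F"]))   # [0, 4, 6]
```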

3.2 Proposed approaches

Three increasingly complex solutions are proposed. The common idea is to model the likelihood of a (consistent) page label sequence, and then try to obtain as a result a sequence with maximum likelihood. In general, the probability of a label sequence \(c_1,\dots ,c_N\) for a given bundle \(B\!=\!G_1,\dots ,G_N\) can be decomposed as:

$$\begin{aligned} P(c_1,\dots ,c_N\mid {B})&= P(c_1\mid {B})\, \prod _{j=2}^N P(c_j\mid {B},c_1,\dots ,c_{j-1}) \end{aligned}$$
(2)

Assuming naive Bayes page class independence and also that \(P(c_j\mid {B})\) only depends on the corresponding image \(G_j\):

$$\begin{aligned} P(c_1,\dots ,c_N\mid {B}) &\approx \prod _{j=1}^N P(c_j\mid {G_j}) \end{aligned}$$
(3)

Now, rather than naive Bayes, a more context-aware decomposition of the right part of Eq. (2) can be adopted (a derivation is given in “Appendix A”):

$$\begin{aligned} P(c_1,\dots ,c_N\mid {B})~\approx ~ P(c_1\mid {G_1})\, \prod _{j=2}^N P(c_j\mid {c_{j-1}})\,\frac{P(c_j\mid {G_j})}{P(c_j)} \end{aligned}$$
(4)

To obtain this expression, two independence assumptions have been made: page class dependency is first-order Markov; and \(G_j\) is conditionally dependent only on \(c_j\) and, conversely, \(c_j\) is conditionally dependent only on \(G_j\). This decomposition essentially corresponds to that of a Hidden Markov Model (HMM), as discussed below.

Our proposals to segment B into deeds will amount to obtaining a page class sequence with maximum probability; that is:

$$\begin{aligned} \hat{c}_1,\dots ,\hat{c}_N = {{\,\mathrm{arg\,max}\,}}_{c_1,\dots ,c_N} P(c_1,\dots ,c_N\mid {B}) \end{aligned}$$
(5)

where \(P(c_1,\dots ,c_N\mid {B})\) is approximated by Eqs. (3) or (4). This optimization process is called decoding.

3.2.1 Optical page class modeling

A classifier is required to estimate the class-posteriors \(P(c\mid {G})\), \(c\!\in \!\mathcal {C}\), needed in Eqs. (3) and (4) for each page image G of a bundle. This classifier is fully local in that it completely ignores the context of G (i.e., the preceding and succeeding pages) and relies only on “visual” or “optical” features of each individual page image. Possible optical classifiers to model these probabilities are proposed in Sect. 3.3, below.

Given a bundle \(B=G_1,\dots ,G_N\), the page classifier is used to estimate a sequence of class-posteriors \(P(c\mid {G_j})\), which will be referred to as the posteriorgram of B:

$$\begin{aligned} \textbf{g}_1,\dots ,\textbf{g}_N,\quad g_{jc}\,{{\,\mathrm{{\mathop {=}\limits ^{\tiny {\text {def}}}}}\,}}~P(c\mid {G_j}),\;\; c\in \mathcal {C},\;\; 1\!\le \!{j}\!\le \!{N} \end{aligned}$$
(6)

3.2.2 Consistency constraints model

The probability decomposition of Eq. (4) is that of a first-order HMM with a set of states \(\mathcal {Q}=\mathcal {C}=\{\textsf {I},\textsf {M},\textsf {F}\}\) and state transition probabilities \(P(c\mid {c'})\), \(c,c'\in \mathcal {Q}\). The state-emission probabilities would be the class-conditional likelihoods \(P(G\mid {c})\), where G is a page image and \(c\in \mathcal {C}\). These likelihoods are proportional to \(P(c\mid {G})/P(c)\), as used in Eq. (4), where the posterior \(P(c\mid {G})\) is given by the image classifier (Eq. 6), and P(c) is trivially estimated from Ground Truth (GT) segmented bundles.

Note that the ultimate goal of segmentation is to preserve the coherence of the textual information of each deed of a bundle. To this end, each deed segment \(D^k\), \(1\!\le {k}\!\le \!{K}\), must fulfil the following hard Consistency Constraints (CC):  \(G_{b_{k-1}+1}\) is an I-page, \(G_{b_k}\) is an F-page and, if \(M(k)\!>\!2\), \(G_{b_{k-1}+2},\dots ,G_{b_k-1}\) are all M-pages.

Correspondingly, only the state F can be final and the initial-state probability must be 1 for the state I and 0 for other states. In addition, \(P(\textsf {I}\!\mid \!\textsf {M})\!=\) \(P(\textsf {M}\!\mid \!\textsf {F})\!=\!P(\textsf {I}\!\mid \!\textsf {I})=\!P(\textsf {F}\!\mid \!\textsf {F})\!=\!0\). The other five transition probabilities can be straightforwardly estimated from GT segmented bundles. This HMM topology is depicted in Fig. 2.

Fig. 2 Topology of the Consistency Constraints HMM. For convenience, states are labeled with I, M, and F, respectively corresponding to the initial, middle, and final pages of a deed
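As an illustration, the free parameters of this HMM (the class priors and the five allowed transition probabilities) can be estimated by simple counting over GT page-label sequences. A minimal sketch, assuming the GT bundles are available as lists of I/M/F labels:

```python
from collections import Counter

def estimate_hmm_parameters(gt_label_sequences):
    """Estimate the class priors P(c) and the transition probabilities
    P(c | c') by relative frequencies over ground-truth page-label
    sequences.  Transitions forbidden by the consistency constraints
    never occur in consistent GT, so they receive probability 0."""
    prior, trans, from_count = Counter(), Counter(), Counter()
    for labels in gt_label_sequences:
        prior.update(labels)
        for prev, cur in zip(labels, labels[1:]):
            trans[(prev, cur)] += 1
            from_count[prev] += 1
    total = sum(prior.values())
    P_c = {c: prior[c] / total for c in "IMF"}
    P_c_given_prev = {(p, c): trans[(p, c)] / from_count[p]
                      for p in "IMF" for c in "IMF" if from_count[p]}
    return P_c, P_c_given_prev
```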

3.2.3 Unconstrained decoding

Using Eq. (3), the optimization (5) becomes trivial (but consistency is not guaranteed):

$$\begin{aligned} \hat{c}_j={{\,\mathrm{arg\,max}\,}}_{c\in \mathcal {C}} ~g_{jc}, \quad 1\le {j}\le {N} \end{aligned}$$
(7)
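A minimal sketch of this per-page argmax decoder (names and array layout are our assumptions); the resulting label sequence may violate the consistency constraints:

```python
import numpy as np

def unconstrained_decode(posteriorgram, classes="IMF"):
    """Eq. (7): choose, for each page, the class with maximum posterior.
    `posteriorgram` is an (N, 3) array with columns ordered I, M, F."""
    g = np.asarray(posteriorgram, dtype=float)
    return [classes[int(k)] for k in g.argmax(axis=1)]
```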

3.2.4 Greedy decoding

Before attempting to obtain a globally optimal solution to Eq. (5), a much simpler greedy decoder can be devised by locally applying the CCs as follows:

$$\begin{aligned} \hat{c}_1=\textsf {I};~~~ \hat{c}_j={{\,\mathrm{arg\,max}\,}}_{c\in \rho (\hat{c}_{j-1})} ~g_{jc}, ~ 2\le {j}\le {N-2};\quad \hat{c}_{N-1}=\pi (\hat{c}_{N-2});\quad \hat{c}_N=\textsf {F} \end{aligned}$$
(8)

where the function \(\rho :\mathcal {C}\rightarrow ~2^{\,\mathcal {C}}\) is defined as: \(\rho (\textsf {I})\!=\!\rho (\textsf {M})\!=\!\{\textsf {M},\textsf {F}\};~\rho (\textsf {F})\!=\!\{\textsf {I}\}\), and \(\pi \!:\mathcal {C}\rightarrow \mathcal {C}\) is a “previous label function” defined as: \(\pi (\textsf {I})=\pi (\textsf {M})=\textsf {M}; ~\pi (\textsf {F})=\textsf {I}\).

By construction, this greedy decoder yields a consistent label sequence. However, it does not guarantee that the probability of this sequence is maximum.
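The greedy decoder of Eq. (8) can be sketched as follows, assuming the same (N, 3) posteriorgram layout as above and a bundle with at least four pages:

```python
import numpy as np

CLASSES = "IMF"
RHO = {"I": "MF", "M": "MF", "F": "I"}    # allowed successor labels (rho)
PI  = {"I": "M", "M": "M", "F": "I"}      # "previous label" function (pi)

def greedy_decode(posteriorgram):
    """Greedy decoding of Eq. (8): fix c_1 = I and c_N = F, locally pick
    the best allowed label for pages 2..N-2, and derive c_{N-1} with pi.
    The result is consistent by construction, but not necessarily the
    most probable consistent sequence."""
    g = np.asarray(posteriorgram, dtype=float)
    n = len(g)
    labels = ["I"]                                          # c_1 = I
    for j in range(1, n - 2):                               # pages 2 .. N-2
        labels.append(max(RHO[labels[-1]],
                          key=lambda c: g[j, CLASSES.index(c)]))
    labels.append(PI[labels[-1]])                           # c_{N-1} = pi(c_{N-2})
    labels.append("F")                                      # c_N = F
    return labels
```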

3.2.5 Viterbi decoding

Following Eqs. (4) and (6) the optimization (5) becomes:

$$\begin{aligned} \hat{c}_1,\dots ,\hat{c}_N = {{\,\mathrm{arg\,max}\,}}_{c_1,\dots ,c_N}\, g_{{1}c_1}\prod _{j=2}^N \tilde{g}_{j{c_j}} P(c_j\mid {c_{j-1}}) \end{aligned}$$
(9)

where \(\tilde{g}_{j{c_j}}{{\,\mathrm{{\mathop {=}\limits ^{\tiny {\text {def}}}}}\,}}\,{g_{j{c_j}}}/{P(c_j)}\); \(g_{j{c_j}}\), \(1\!\le \!j\!\le \!N\), \(c_j\in \mathcal {C}\), are the components of the bundle posteriorgram, and \(P(c_j)\) are the page class priors.

To achieve a globally optimal solution to this equation, a Dynamic Programming decoder is needed. The Viterbi algorithm, one of the most popular decoders of this kind, can be outlined as follows.

Let \(V(j,q)\) denote the probability of a max-probability state sequence which ends in state q and generates the first j labels. Set \(V(1,\textsf {I})\!=\!g_{1,\textsf {I}},~V(1,\textsf {M})=V(1,\textsf {F})=0\). Then the following recurrence relation holds for \(2\le {j}\le {N}\):

$$\begin{aligned} V(j,q) = \max _{q'\in \mathcal {Q}} ~\tilde{g}_{jq}\, P(q\mid {q'})\, V(j-1,q'), \quad q\in \mathcal {Q} \end{aligned}$$
(10)

Once \(V(N,\textsf {F})\) is computed, backtracing yields a globally optimal consistent sequence of states and the corresponding sequence of I,M,F  labels.
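A minimal, log-domain sketch of this Viterbi decoder over the three-state HMM (array layouts and names are our assumptions):

```python
import numpy as np

def viterbi_decode(posteriorgram, priors, transitions):
    """Viterbi decoding of Eqs. (9)-(10).  `posteriorgram` is an (N, 3)
    array of P(c | G_j) with columns ordered I, M, F; `priors` holds the
    class priors P(c); `transitions[q', q]` is P(q | q'), with forbidden
    transitions set to 0.  Returns a globally optimal consistent labeling."""
    g = np.asarray(posteriorgram, dtype=float)
    n, S = g.shape
    eps = 1e-300                                   # avoid log(0)
    log_gt = np.log(g / np.asarray(priors) + eps)  # log of g-tilde in Eq. (9)
    log_A = np.log(np.asarray(transitions, dtype=float) + eps)
    I, F = 0, 2
    V = np.full((n, S), -np.inf)
    back = np.zeros((n, S), dtype=int)
    V[0, I] = np.log(g[0, I] + eps)                # only state I can start
    for j in range(1, n):
        for q in range(S):
            scores = V[j - 1] + log_A[:, q] + log_gt[j, q]
            back[j, q] = int(np.argmax(scores))
            V[j, q] = scores[back[j, q]]
    states = [F]                                   # only state F can be final
    for j in range(n - 1, 0, -1):                  # backtrace
        states.append(back[j, states[-1]])
    return ["IMF"[q] for q in reversed(states)]
```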

3.3 Individual page image classification

To compute the posteriorgram introduced in Eq. (6), a page-image classifier is needed to obtain the class posterior \(P(c\mid {G})\) for each page image G and each \(c\in \{\textsf {I},\textsf {M},\textsf {F}\}\).

In the present work, several classifiers have been tried; namely, ResNet-{18,50,101} [11] and ConvNeXt [16], as well as transformer-based image classifiers such as Swin [15]. All of them were pre-trained on ImageNet [32].

ResNet and ConvNeXt are both convolutional neural networks with a final linear classification layer. ResNet is a commonly used architecture for image classification tasks. It consists of multiple convolutional layers with residual connections between them. Residual connections enable the network to train much deeper architectures by mitigating the vanishing gradient problem. The specific architecture is determined by the number of ResNet blocks: the more blocks, the larger and deeper the network, and the greater the number of parameters to be trained.

ConvNeXt, as the authors explain in Ref. [16], is a “modernized” ResNet with the latest advances from training ConvNets and Vision Transformers. In the present work, only ConvNeXt base has been used. ConvNeXt tiny was tried too, but the results were worse.

We also tried to train Vision Transformers, such as Swin [15], but the model had convergence difficulties and the results were rather poor. A possible cause is the difference in image resolution with respect to the pre-training stage, where the original images were \(224\times {224}\) pixels, while we need to use an image size of \(1024\times {1024}\). A larger size is required because, in our classification task, the clues are usually word-sized visual features or relatively small boxes in specific parts of the image, and a higher resolution is needed to avoid losing this information.
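For illustration, the sketch below shows one way such a page classifier could be instantiated with an ImageNet-pretrained ResNet-50 in PyTorch/torchvision; the class ordering, preprocessing values, and function name are our assumptions, and the actual training settings are given in Sect. 5.2.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 3   # I, M, F

# ImageNet-pretrained backbone with a new 3-class linear head
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Page images are resized to 1024x1024 so that word-sized visual clues survive
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def page_posteriors(pil_image):
    """Return P(c | G) for one page image as a length-3 tensor (I, M, F)."""
    model.eval()
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0)
        return torch.softmax(model(x), dim=1).squeeze(0)
```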

4 Evaluation measures

In the approaches described in Sect. 3, several alternatives are proposed for the different components needed to build a complete bundle segmentation system.

To select the best system components, we first need to assess the performance of the different individual page image classifiers (Sect. 3.3), and then evaluate, end-to-end, the segmentation performance of the alternative decoders proposed in Sect. 3.2.

4.1 Assessing individual page classifiers

The performance of the page image classifiers discussed in Sect. 3.3 could be assessed just using the conventional classification error rate. However, the impact of these classifiers on the end-to-end performance of the different methods discussed in Sect. 3.2 is more dependent on the faithfulness of the class probability distribution than on their ability to make the best class choice without taking the context into account.

Therefore, we are interested in well-calibrated class probabilities, not just low classification errors. Even when context-free classification by maximum class posterior fails to yield the correct class hypotheses, decoding can still achieve a correct segmentation, provided the probabilities of the correct classes are not too low.

The faithfulness of a posterior distribution can be measured by the cross-entropy:

$$\begin{aligned} H(P_t,P) &= \, - \frac{1}{N} \sum _{j=1}^N \sum _{c\in \mathcal {C}} P_t(c\mid G_j)\log _2 P(c\mid G_j) = - \frac{1}{N} \sum _{j=1}^N \log _2 g_{j,c_j} \end{aligned}$$
(11)

where \(P_t\) is the target (0/1) reference distribution, which assigns probability 1 to the reference class of each page, N is the total number of samples (pages), and \(g_{j,c_j}\) is the class-posterior of the image \(G_j\) for the reference class \(c_j\), as defined in Sect. 3.2. Cross-entropy will be used to compare the page-image classifiers in terms of the divergence between the hypothesized and reference probability distributions, and the model with the lowest cross-entropy will be selected for later decoding and segmentation of the books. It should be noted that cross-entropy is preferable to classification error here, since our goal is to measure how well the image posterior probabilities provided by the model support a proper segmentation of the image sequence.
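A minimal sketch of this computation over a set of test pages, assuming the posteriorgram layout used earlier:

```python
import numpy as np

def cross_entropy_bits(posteriorgram, ref_labels, classes="IMF"):
    """Eq. (11): average cross-entropy, in bits per page, between the
    0/1 reference distribution and the predicted class posteriors."""
    g = np.asarray(posteriorgram, dtype=float)
    idx = np.array([classes.index(c) for c in ref_labels])
    return float(-np.mean(np.log2(g[np.arange(len(idx)), idx] + 1e-300)))
```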

4.2 Assessing bundle segmentation performance end-to-end

Most importantly, to assess the ultimate goal of the proposed approaches, we need to measure how well a whole bundle is segmented into its deeds. To this end, we propose two metrics called Bundle Segmentation Error Rate (BSER) and Content Alignment Error Rate (CAER). BSER aims to measure structural deed errors due to wrong segmentation boundaries, without considering the (textual) content of the deeds. On the other hand, CAER explicitly aims to measure the amount of textual information lost because of deed segmentation errors.

In both cases, the same Dynamic Programming (DP) algorithm is used to align reference and hypothesis deeds. However, the individual deed alignment cost is different for BSER and CAER. For the moment, let us just assume this cost is given by a function \(L:\mathcal {D}\times \mathcal {D}\rightarrow \mathbb {R}^{\ge {0}}\), where \(\mathcal {D}\) is the set of deeds which includes the “empty deed” (a deed with no pages or words), denoted by \(\epsilon\).

Let \(\hat{B}=\hat{D}^{1},\dots ,\hat{D}^{\hat{K}}\) be a sequence of \(\hat{K}\) deeds, obtained as a hypothesis for a whole bundle as explained in Sect. 3, and let \(B=D^1,\dots ,D^K\) be the corresponding reference GT. Using these sequences, the minimum deed editing cost to transform B into \(\hat{B}\) is \(\textrm{E}(K,\hat{K})\), which can be computed by DP according to the following recurrence relation:

$$\begin{aligned} \mathrm{E}(i,j) = \min \big (\, & \mathrm{E}(i,j-1) + \mathrm{L}(\epsilon ,\hat{D}^j), \\ & \mathrm{E}(i-1,j-1) + \mathrm{L}(D^i\!,\hat{D}^j), \\ & \mathrm{E}(i-1,j) + \mathrm{L}(D^i\!,\epsilon )\, \big ) \end{aligned}$$
(12)

The decisions associated with the minimal cost \(\textrm{E}(K,\hat{K})\) are interpreted as deed edit operations; namely, deed insertions, substitutions and deletions, respectively corresponding to the three terms of the \(\min\) function. An insertion indicates that a deed that was not in the reference does appear in the hypothesis. Conversely, a deletion means that a deed of the reference has disappeared in the hypothesis. A substitution, finally, corresponds to matching a pair of reference and hypothesis deeds.

4.2.1 Bundle segmentation error rate

A deed here is considered just as a set of page images. Therefore, a direct way to measure how much two deeds differ is just to compute the difference between the sets of pages of the deeds.

Let \(D,\hat{D}\in \mathcal {D}\) denote the sets of page images of a pair of reference and hypothesis deeds of some bundle. The individual alignment cost for these two deeds is defined as the symmetric set difference \(L(D,\hat{D}){{\,\mathrm{{\mathop {=}\limits ^{\tiny {\text {def}}}}}\,}}\,|{D}\ominus \hat{D}|\), which can also be computed as

$$\begin{aligned} L(D,\hat{D}) = |{D}\cup \hat{D}|-|{D}\cap \hat{D}|\end{aligned}$$
(13)

According to this definition, the cost of a deed insertion, \(L(\epsilon ,\hat{D})=|\hat{D}|\), is the number of page images in \(\hat{D}\). Similarly for a deletion, the cost \(L(D,\epsilon )\) is the number of page images in D. And the cost of a substitution \(L(D,\hat{D})\) is just the number of page images which are in D but not in \(\hat{D}\), plus the number of page images that are in \(\hat{D}\) but not in D.

Finally, for the reference deed sequence of a bundle \(B=D^1\!,\dots ,D^K\) and the corresponding segmentation hypothesis \(\hat{B}=\hat{D}^1\!,\dots ,\hat{D}^{\hat{K}}\), the BSER is defined as:

$$\begin{aligned} \textrm{BSER}(B,\hat{B}) = \frac{1}{T}\,\textrm{E}(K,\hat{K}) \end{aligned}$$
(14)

where \(\textrm{E}(K,\hat{K})\) is computed using Eq. (12) with the cost function given by Eq. (13) and \(T{{\,\mathrm{{\mathop {=}\limits ^{\tiny {\text {def}}}}}\,}}\sum _{i=1}^{K}|{D}^i|\) is the total number of page images in B.
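A sketch of the deed-edit DP of Eq. (12) and the resulting BSER, under the assumption that deeds are given as sets of page identifiers (function names are ours):

```python
def deed_edit_cost(ref_deeds, hyp_deeds, cost):
    """Minimum deed-editing cost E(K, K_hat) of Eq. (12).  `cost(d, d_hat)`
    is the deed-alignment cost L, with None standing for the empty deed."""
    K, Kh = len(ref_deeds), len(hyp_deeds)
    E = [[0.0] * (Kh + 1) for _ in range(K + 1)]
    for i in range(1, K + 1):                                  # deletions only
        E[i][0] = E[i - 1][0] + cost(ref_deeds[i - 1], None)
    for j in range(1, Kh + 1):                                 # insertions only
        E[0][j] = E[0][j - 1] + cost(None, hyp_deeds[j - 1])
    for i in range(1, K + 1):
        for j in range(1, Kh + 1):
            E[i][j] = min(
                E[i][j - 1] + cost(None, hyp_deeds[j - 1]),               # insertion
                E[i - 1][j - 1] + cost(ref_deeds[i - 1], hyp_deeds[j - 1]),  # substitution
                E[i - 1][j] + cost(ref_deeds[i - 1], None))               # deletion
    return E[K][Kh]

def bser(ref_deeds, hyp_deeds):
    """Eq. (14): deed-edit cost with the symmetric-difference cost of
    Eq. (13), normalized by the total number of reference page images."""
    def page_cost(d, d_hat):
        d, d_hat = d or set(), d_hat or set()
        return len(d | d_hat) - len(d & d_hat)
    total_pages = sum(len(d) for d in ref_deeds)
    return deed_edit_cost(ref_deeds, hyp_deeds, page_cost) / total_pages
```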

4.2.2 Content alignment error rate

To define a metric that measures the information loss caused by segmentation errors, some issues must be taken into account. First, the metric should reflect not only the quality of the structural segmentation but also the deeds’ textual content. However, errors affecting nonessential pages, or pages containing little or no informative text, should exert only a minor penalty on the overall bundle segmentation performance.

Taking these issues into account and following [8, 20], we rely on Probabilistic Indexing (PrIx) [24] and Information Gain (IG) to compute a feature vector \(\mathbf {\textbf{D}}\in \mathbb {R}^n\) for a reference deed and similarly \(\mathbf {\hat{D}}\in \mathbb {R}^n\) for a deed hypothesis. The n components of these vectors correspond to the n words with the highest IG, as determined in Refs. [20, 23], and the value of each component is the expected number of occurrences of the corresponding word, estimated from the image PrIx as also discussed in Refs. [20, 23]. In addition, an empty deed \(\epsilon\) is represented by the null vector \(\textbf{0}\). Such a document characterization provides a compact representation of the most relevant textual content of the page images that constitute a deed.

This way, the sequence of deeds obtained by automatic bundle segmentation becomes represented as a sequence of vectors \(\mathbf {\hat{D}}^1\!,\dots ,\mathbf {\hat{D}}^{\hat{K}}\) and, similarly, the corresponding GT reference as \(\textbf{D}^1\!,\dots ,\textbf{D}^K\).

As explained in Ref. [23], the IG calculation depends on the documents’ classes. Since our interest lies in ranking words to select the top n, we have used the same order as in Ref. [23], where 12 deed typologies were adopted as classes. And also in accordance with the results of Prieto et al. [23], we have set the number of high IG words at \(n=16\,384\).

If \(D,\hat{D}\in \mathcal {D}\) is a pair of reference and hypothesis deeds, \(L(D,\hat{D})\) is defined in a similar way to the “Bag of Words Error Rate” (\(\text {bWER}\)) in Ref. [31], but without normalization. It estimates the number of word insertion, substitution and deletion edit operations needed to transform the text in D into the text in \(\hat{D}\), ignoring the order of the words in D and in \(\hat{D}\). Specifically:

$$\begin{aligned} L(D,\hat{D}) ~{{\,\mathrm{{\mathop {=}\limits ^{\tiny {\text {def}}}}}\,}}~ \frac{1}{2} \Big (\Vert \textbf{D}-\mathbf {\hat{D}}\Vert _1 + \big |\Vert \textbf{D}\Vert _1-\Vert \mathbf {\hat{D}}\Vert _1\big |\Big ) \end{aligned}$$
(15)

According to this definition, the cost of a deed insertion, \(L(\epsilon ,\hat{D})=\Vert \mathbf {\hat{D}}\Vert _1\), is the number of (high IG) running words in \(\hat{D}\). Similarly, the cost of a deletion \(L(D,\epsilon )\) is the number of (high IG) running words in D. And, in the case of a substitution, \(L(D,\hat{D})\) is the unnormalized \(\text {bWER}\) calculated with the two aligned deeds.

Finally, for the reference deed sequence of a bundle \(B=D^1\!,\dots ,D^K\) and the corresponding segmentation hypothesis \(\hat{B}=\hat{D}^1\!,\dots ,\hat{D}^{\hat{K}}\), the CAER is defined as:

$$\begin{aligned} \textrm{CAER}(B,\hat{B}) = \frac{1}{T}\,\textrm{E}(K,\hat{K}) \end{aligned}$$
(16)

where \(\textrm{E}(K,\hat{K})\) is computed using Eq. (12) with the cost function given by Eq. (15) and \(T{{\,\mathrm{{\mathop {=}\limits ^{\tiny {\text {def}}}}}\,}}\sum _{j=1}^{K}\Vert \textbf{D}^j\Vert _1\) is the total number of (high IG) running words in B.
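Reusing the deed-edit DP sketched for BSER above, the CAER cost of Eq. (15) and the resulting metric can be sketched as follows, where each deed is represented by its PrIx/IG word-expectation vector and None stands for the empty deed:

```python
import numpy as np

def content_cost(ref_vec, hyp_vec):
    """Eq. (15): half the L1 distance between the deed word-expectation
    vectors plus half the absolute difference of their total word counts."""
    if ref_vec is None and hyp_vec is None:
        return 0.0
    if ref_vec is None:
        ref_vec = np.zeros_like(np.asarray(hyp_vec, dtype=float))
    if hyp_vec is None:
        hyp_vec = np.zeros_like(np.asarray(ref_vec, dtype=float))
    d, d_hat = np.asarray(ref_vec, dtype=float), np.asarray(hyp_vec, dtype=float)
    return 0.5 * (np.abs(d - d_hat).sum() + abs(d.sum() - d_hat.sum()))

def caer(ref_vecs, hyp_vecs):
    """Eq. (16): deed-edit cost with the content cost of Eq. (15),
    normalized by the total number of (high IG) running words in B.
    `deed_edit_cost` is the DP of Eq. (12), as sketched in Sect. 4.2.1."""
    total_words = sum(float(np.asarray(v, dtype=float).sum()) for v in ref_vecs)
    return deed_edit_cost(ref_vecs, hyp_vecs, content_cost) / total_words
```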

5 Data set and experimental setup

The data set statistics are shown in this section, along with empirical details needed to make the experiments reproducible.

5.1 Data set

Among the AHPC series of notarial record manuscripts described in Sect. 2, 50 bundles were included in the collection compiled in the Carabela project [30]. From these bundles, we selected four for the present work: JMBD4946, JMBD4949, JMBD4950 and JMBD4952, which were manually tagged with segmentation GT annotations.

Table 1 shows statistics for this dataset. Each bundle has close to or more than one thousand pages. The number of deeds is also shown, with more than two hundred deeds per bundle, and almost three hundred in JMBD4949. An interesting fact is the large variation in the number of pages per deed, with a standard deviation of more than 14 pages per deed in JMBD4946. In fact, one of these deeds spans up to 200 pages. This large deed size variability poses a challenge for automated segmentation, since methods that rely solely on page count or structure are not sufficiently effective. Additional strategies, such as the methods presented in this work, need to be employed to accurately identify and separate individual deeds within a bundle.

Table 1 Number of page images and deeds for the bundles JMBD4946, JMBD4949, JMBD4950 and JMBD4952

5.2 Empirical settings

In a real use of the proposed methods, the most expensive step is the production of the required training data. Therefore, we try to limit the amount of training deeds in our experiments as much as possible. To this end, we obtain results using only one bundle as a training set and the other three as a test set. That is, we perform four experiments, each using a different bundle for training. This protocol will allow us to study training data requirements in more detail by further limiting the number of deeds used from each training bundle. Note that, for training, it is not feasible to use isolated samples from bundles, since it would then be impossible to employ the entire context of those bundles for decoding them. Thus, training with full bundles is more aligned with a real-world use case.

Note that the metrics discussed in Sect. 4 are defined at bundle level; for instance, in Eq. (16) CAER is defined as the cost of all deed edit operations for a bundle, normalized by the total number of (high IG) running words of the bundle. Since in the proposed protocol each experiment entails testing with three bundles, we use micro-averaging to compute the metrics; that is, the cost is accumulated over the three bundles and finally normalized by the corresponding total over these bundles (page images for BSER, running words for CAER).

The image classifiers explained in Sect. 3.3 are trained for at least 15 epochs and at most 30, with early stopping after 5 epochs without improvement of the validation loss. For all the models, 15% of the training set was randomly selected for validation. The AdamW optimiser [17] was used with a learning rate of 0.001 and a batch size of 4, except for ConvNeXt, where the batch size was reduced to 2 due to memory limitations. A learning rate decay of 0.5 was applied every 10 epochs. All the images were resized to \(1024 \times 1024\).
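A hedged sketch of this optimization setup in PyTorch; `train_one_epoch` and `validate` are hypothetical helpers, and `model` is the page classifier of Sect. 3.3.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

optimizer = AdamW(model.parameters(), lr=1e-3)          # batch size 4 (2 for ConvNeXt)
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # halve the lr every 10 epochs

best_val, since_best, patience = float("inf"), 0, 5
for epoch in range(30):                                 # at most 30 epochs
    train_one_epoch(model, optimizer)                   # hypothetical helper
    val_loss = validate(model)                          # hypothetical helper
    scheduler.step()
    since_best = 0 if val_loss < best_val else since_best + 1
    best_val = min(best_val, val_loss)
    if epoch + 1 >= 15 and since_best >= patience:      # early stopping after epoch 15
        break
```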

To mitigate fluctuations due to the initialization of parameters of the learning algorithms, all the values reported in tables and curves in Sect. 6 are averages of results obtained with 10 random parameter initializations.

6 Results

First, we evaluated the image classifiers discussed in Sect. 3.3 to measure the quality of the image class-posterior distributions they provide. Table 2 reports the cross-entropy for the different models trained with each bundle, along with the overall average.

Table 2 Cross-entropy (bits/page) between the hypothesis page-image class posterior and the reference (0/1) probability distributions. Training with one bundle and testing with the other three bundles

ResNet50 clearly outperforms the other classifiers, followed by the ConvNeXt model. Recall that no decoder was used in these experiments. According to these results, the model with the lowest cross-entropy, ResNet50, was selected to conduct the following tests where the test bundles are actually segmented.

End-to-end bundle segmentation results were then obtained using the posteriorgrams produced by ResNet50 and each of the decoders proposed in Eqs. (7)–(9) of Sect. 3.2. Results in terms of both the BSER and CAER metrics are reported in Table 3.

Table 3 BSER and CAER achieved by the different decoders, training with one bundle and testing with the other three bundles. The page image classifier was ResNet50. Results are percentages. The highest standard deviation over the 10 trials is 4.55% for the Viterbi decoding results (BSER and CAER) and 9.31% for the other decoding results

We first notice that the trends shown by BSER and CAER are very similar, indicating a high correlation between the two metrics, with BSER’s average slightly higher. This is because, while BSER considers all the book pages equally important, giving them the same weight, CAER does not. As explained in Sect. 4.2, CAER penalizes errors on a page with substantial textual content more heavily than on a nearly empty one, since it penalizes wrongly segmented content rather than the wrong segmentation itself.

As expected, Viterbi decoding performs significantly better than the other (unconstrained and greedy) approaches in all four experiments. The last column shows the overall average, where the Viterbi performance is more than three times better than that of the other decoders. This makes it clear that, by considering the global context of the entire book, a far better segmentation can be achieved. In contrast, approaches that only consider the local context of the posterior probabilities yield significantly worse results. This demonstrates that the problem cannot be adequately solved by image classification alone, without taking the global context into account.

The best result is achieved when training on JMBD4946, with \(5.6\%\) BSER and \(4.5\%\) CAER. The worst result is for training on JMBD4952, reaching \(13.4\%\) BSER and \(10.8\%\) CAER. These results align with the average number of pages per deed, showing that training with larger deeds leads to a lower segmentation error on the test bundles. Each of the first three bundles has an average of close to 6 pages per deed, leading to low segmentation errors. In comparison, JMBD4952, with 4.2 pages per deed (the lowest count), exhibits the highest segmentation error.

Finally, learning curves were obtained to study the training needs of the final proposed approach; namely, individual page image class-posteriors computed by a ResNet50 classifier, followed by Viterbi decoding. For each bundle, an increasing number of deeds are considered for training and the other three complete bundles are used for testing, as in the experiments of Table 3. The number of training deeds was chosen in powers of two, preserving the order in which they appear in the bundle, without repetition. As in the previous experiments, each test was repeated 10 times, and average results were reported. Figure 3 shows these results.

Fig. 3 Learning curve when training with each bundle. Page class posteriors (posteriorgrams) were obtained with a ResNet50 classifier and decoding was carried out with the Viterbi algorithm

Segmentation error decreases monotonically with the number of training deeds. Notably, from 64 deeds onwards, we begin to obtain useful segmentation performance, except for JMBD4952 which, as already noted, seems to be less informative for training because of the smaller size of its deeds. When training with any of the other three bundles, 128 deeds appear to be enough, since errors do not decrease significantly when twice as many training samples are used. On the other hand, the trend at the end of the JMBD4952 curve suggests that better results could be achieved if more deeds of this bundle were available for training.

7 Conclusion

We have proposed a system that, for the first time, can segment a large historical manuscript image bundle into the (many) multi-page deeds it encompasses. The proposed approach entails training an individual page image classifier to compute the class posterior of each page image of a full test bundle. This sequence of posteriors is then “decoded” to obtain a maximum probability class-label sequence which fulfills the full-bundle consistency constraints required for a proper deed segmentation.

Different classifiers and decoders have been studied and compared to finally select the best combination. An important conclusion of this study is that adequately using the bundle’s global context not only ensures segmentation consistency but also leads to much better segmentation decisions.

In addition, we have introduced two new end-to-end metrics to evaluate the raw and content-aware segmentation accuracy at the bundle level. While the first metric, BSER, measures the segmentation errors themselves, the second, called CAER, measures the loss of textual information caused by segmentation errors.

In future work, we will study whether an appropriate calibration of the posteriors may improve the segmentation obtained with the Viterbi decoder [10]. We also plan to integrate deed segmentation and deed typology classification, as studied in Refs. [8, 21, 23], taking into account both textual and visual information in a multimodal way [20]. In this way, we also plan to work with deeds that lack the page-level simplifying restrictions exhibited by the corpora of this paper. In this regard, progress has been made with unrestricted deed corpora, utilizing both visual and textual information, and we hope to publish the results soon [19].