Textline alignment on the image domain
International Journal on Document Analysis and Recognition (IF 2.3), Pub Date: 2022-08-29, DOI: 10.1007/s10032-022-00408-5
Boraq Madi , Ahmad Droby , Jihad El-Sana

Editing and publishing a historical manuscript involves a research phase that recovers the original manuscript and reconstructs the transmission of its text from the relations between its surviving copies. Manuscript alignment, which aims to locate the shared and the differing text among a set of copies of the same manuscript, is essential for this phase. In this paper, we present an alignment algorithm for historical handwritten documents that works directly in the image domain, for two reasons: no accurate handwritten text recognition (HTR) system exists for handwritten historical documents, and researchers need to view the original manuscripts in parallel to examine features beyond the transcribed text. Our approach extracts subwords, estimates the similarity among them, and establishes an alignment among them. We extract subwords from textline images and convert each textline into a sequence of subword images. We estimate the similarity between two subwords using a Siamese network model and apply Longest Common Subsequence (LCS) to establish the alignment between two image sequences. We have implemented our algorithm, trained the Siamese model, and evaluated its performance using textline images from historical documents. Our algorithm outperformed the state of the art by large margins. Unlike the state of the art, the framework builds the alignment from scratch without requiring any prior knowledge concerning subword boundaries. In addition, we built a new dataset for textline alignment of historical documents, which includes ten pairs of pages taken from two copies of two Arabic manuscripts and annotated at the subword level.
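
The alignment step described in the abstract can be illustrated with a minimal sketch of LCS over two sequences, where two items "match" when a similarity score reaches a threshold. This is not the authors' implementation: the names `lcs_align` and `toy_similarity`, the threshold value, and the use of strings in place of subword images are all assumptions made for illustration. In the paper, the similarity score comes from a trained Siamese network, which is replaced here by a trivial stand-in.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def lcs_align(
    a: Sequence[T],
    b: Sequence[T],
    similarity: Callable[[T, T], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of an LCS-style alignment between a and b.

    Two items are treated as matching when similarity(a[i], b[j]) >= threshold.
    """
    n, m = len(a), len(b)
    # Precompute the match matrix so the similarity function runs once per pair.
    match = [[similarity(a[i], b[j]) >= threshold for j in range(m)] for i in range(n)]

    # dp[i][j] = length of the longest common subsequence of a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if match[i][j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        if match[i][j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # skip an unmatched item in a
        else:
            j += 1  # skip an unmatched item in b
    return pairs


def toy_similarity(x: str, y: str) -> float:
    """Stand-in for a learned similarity model: 1.0 if identical, else 0.0."""
    return 1.0 if x == y else 0.0


if __name__ == "__main__":
    # Strings stand in for subword images from two copies of the same textline.
    copy_1 = ["ab", "cd", "ef", "gh"]
    copy_2 = ["ab", "xx", "ef", "gh"]
    print(lcs_align(copy_1, copy_2, toy_similarity))
```

Items left out of the returned pairs (index 1 in both toy sequences) correspond to the differing text between the two copies, while the paired indices mark the shared text.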




Updated: 2022-08-30