Text Reuse Detection in Handwritten Documents
Doklady Mathematics (IF 0.6), Pub Date: 2024-03-11, DOI: 10.1134/s106456242370120x
A. V. Grabovoy, M. S. Kaprielova, A. S. Kildyakov, I. O. Potyashin, T. B. Seyil, E. L. Finogeev, Yu. V. Chekhovich

Abstract

Plagiarism detection in student assignments is becoming increasingly relevant. The rapidly growing popularity of online education and the active expansion of online educational platforms for secondary and high school education create demand for an automatic reuse detection system for handwritten assignments. Existing approaches to this problem cannot be used to search for potential sources of reuse in large collections, which significantly limits their applicability. Moreover, real-life data are likely to be low-quality photographs taken with mobile devices. We propose an approach for detecting text reuse in handwritten documents. Each document is an image, and the search is performed over a large collection of potential sources. The proposed method consists of three stages: handwritten text recognition, candidate search, and precise source retrieval. We present experimental results estimating the quality and latency of our system. Recall reaches 83.3% on higher-quality pictures and 77.4% on lower-quality pictures. The average search time is 3.2 s per document on a CPU. The results show that the system is scalable and can be used in production, where fast reuse detection is needed for hundreds of thousands of student assignments against a large collection of potential reuse sources. All experiments were conducted on the public HWR200 dataset.
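
The abstract names the three stages but does not say how they are implemented. The following is a minimal illustrative sketch of the last two stages, not the authors' method. It assumes that stage 1 (handwritten text recognition) has already produced plain text, uses word shingles with an inverted index as a stand-in for candidate search, and uses character-level similarity from Python's standard `difflib` as a stand-in for precise source retrieval. All function names, the shingle size `k`, `top_n`, and `threshold` are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_index(sources, k=3):
    """Map each shingle to the ids of the source documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in sources.items():
        for sh in shingles(text, k):
            index[sh].add(doc_id)
    return index


def candidate_search(query, index, k=3, top_n=5):
    """Stage 2 (assumed): rank sources by the number of shingles shared with the query."""
    hits = defaultdict(int)
    for sh in shingles(query, k):
        for doc_id in index.get(sh, ()):
            hits[doc_id] += 1
    # Most shared shingles first; ties broken by document id for determinism.
    return sorted(hits, key=lambda d: (-hits[d], d))[:top_n]


def precise_retrieval(query, candidates, sources, threshold=0.5):
    """Stage 3 (assumed): keep candidates whose character-level similarity passes a threshold."""
    results = []
    for doc_id in candidates:
        ratio = SequenceMatcher(None, query.lower(), sources[doc_id].lower()).ratio()
        if ratio >= threshold:
            results.append((doc_id, ratio))
    return sorted(results, key=lambda r: -r[1])


if __name__ == "__main__":
    # Toy collection; the query imitates recognized text with one recognition error.
    sources = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "an entirely different sentence about mathematics and proofs",
    }
    query = "the quick brown fox jumps ovr the lazy dog"
    index = build_index(sources)
    candidates = candidate_search(query, index)
    print(precise_retrieval(query, candidates, sources))
```

Splitting the work this way lets the cheap index lookup narrow a large collection to a few candidates, so the more expensive pairwise comparison runs only on those.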




Updated: 2024-03-11