NDOrder: Exploring a novel decoding order for scene text recognition
Expert Systems with Applications (IF 8.5), Pub Date: 2024-03-21, DOI: 10.1016/j.eswa.2024.123771
Dajian Zhong , Hongjian Zhan , Shujing Lyu , Cong Liu , Bing Yin , Palaiahnakote Shivakumara , Umapada Pal , Yue Lu

Text recognition in scene images is still considered a challenging task for the computer vision and pattern recognition community. For text images affected by multiple adverse factors, such as occlusion (due to obstacles) and poor quality (due to blur and low resolution), the performance of state-of-the-art scene text recognition methods degrades. The key reason is that the existing encoder–decoder framework follows a fixed left-to-right decoding order, which lacks sufficient contextual information. In this paper, we present a novel decoding order in which good-quality characters are decoded first, followed by low-quality characters, so that contextual information is preserved even in the aforementioned difficult scenarios. Our method, named NDOrder, extracts visual features with a ViT encoder and then decodes them with a Random Order Generation (ROG) module, which learns to decode under random decoding orders, and a Vision-Content-Position (VCP) module, which exploits the connections among visual information, content, and position. In addition, a new dataset named OLQT (Occluded and Low-Quality Text) is created by manually collecting text images that suffer from occlusion or low quality from several standard text recognition datasets. The dataset is now available at . Experiments on OLQT and public scene text recognition benchmarks show that the proposed method achieves state-of-the-art performance.
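For illustration, the following is a minimal, hypothetical PyTorch sketch of the general idea of training with random decoding orders: draw a random permutation of character positions and build the attention mask a permuted autoregressive decoder would use under that order. The function names and the mask convention are assumptions for illustration only, not details taken from the paper or its released code.

```python
# Hypothetical sketch of random decoding-order generation (not the authors' ROG module).
import torch


def random_decoding_order(seq_len: int) -> torch.Tensor:
    """Return a random permutation of character positions, e.g. [2, 0, 3, 1]."""
    return torch.randperm(seq_len)


def order_to_attention_mask(order: torch.Tensor) -> torch.Tensor:
    """Build a boolean mask where mask[i, j] is True if position i may attend
    to position j, i.e. j is decoded no later than i under the given order."""
    seq_len = order.numel()
    # rank[p] = decoding step at which position p is emitted
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)
    # position i may attend to position j iff j is decoded at or before i's step
    return rank.unsqueeze(1) >= rank.unsqueeze(0)


if __name__ == "__main__":
    order = random_decoding_order(5)
    mask = order_to_attention_mask(order)
    print("decoding order:", order.tolist())
    print("attention mask:\n", mask.int())
```

At inference time, a model trained this way could, in principle, decode high-confidence (good-quality) characters first and condition the remaining low-quality characters on them, which is the decoding behavior the abstract describes.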

Updated: 2024-03-21