Cross-modal knowledge learning with scene text for fine-grained image classification
IET Image Processing ( IF 2.3 ) Pub Date : 2024-02-19 , DOI: 10.1049/ipr2.13039
Li Xiong, Yingchi Mao, Zicheng Wang, Bingbing Nie, Chang Li

Scene text in natural images carries additional semantic information that can aid image classification. Existing methods do not fully consider deep understanding of the text or the visual-text relationship, making it difficult to judge the semantic accuracy of recognized text and its relevance to the image. This paper proposes an image classification method based on Cross-modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross-modal scene text recognition, text semantic enhancement, and visual-text feature alignment. In the first stage, a multi-attention mechanism extracts features layer by layer, and a self-mask-based iterative correction strategy improves scene text recognition accuracy. In the second stage, knowledge features are extracted from external knowledge and fused with text features to enhance text semantic information. In the third stage, CKLST aligns visual and text features through a cross-attention mechanism with a similarity matrix, so that the correlation between images and text can be captured to improve classification accuracy. On the Con-Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than other baselines on fine-grained image classification, improving mAP over the best baseline by 3.54%, 5.37%, 3.28%, and 2.81%, respectively.
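The third stage's similarity-matrix alignment can be illustrated with a minimal sketch. The following is a hypothetical simplification, not the authors' exact architecture: visual and text feature vectors are compared via a scaled dot-product similarity matrix, and the resulting attention weights pool text features for each visual region.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_align(visual, text):
    """Align visual and text features via a similarity matrix.

    visual: (Nv, d) visual region features
    text:   (Nt, d) text token features
    Returns (Nv, d) text-aware visual features plus the (Nv, Nt)
    attention map. Hypothetical sketch of cross-attention alignment.
    """
    d = visual.shape[-1]
    sim = visual @ text.T / np.sqrt(d)   # (Nv, Nt) similarity matrix
    attn = softmax(sim, axis=-1)         # attend over text tokens
    aligned = attn @ text                # (Nv, d) pooled text features
    return aligned, attn

rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 8))     # 4 visual regions, dim 8
text = rng.standard_normal((6, 8))       # 6 text tokens, dim 8
aligned, attn = cross_modal_align(visual, text)
```

In the full model, the similarity matrix would be computed from learned query/key projections rather than raw features, but the attention-weighted pooling step is the same.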

Updated: 2024-02-20