Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images
Transactions in GIS (IF 2.568). Pub Date: 2024-02-24. DOI: 10.1111/tgis.13146
Shangyou Wu 1, Wenhao Yu 1,2, Yifan Zhang 1, Mengqiu Huang 1

Image retrieval, one of the classic tasks in information retrieval, aims to identify images that share similar features with a query image so that users can conveniently find the required information in large image collections. Street view image retrieval in particular has extensive applications in many fields, such as improving navigation and mapping services, formulating urban development plans, and analyzing the historical evolution of buildings. However, the intricate foreground and background details of street view images, coupled with the lack of attribute annotations, make it one of the most challenging problems in practice. Current image retrieval research mainly relies either on visual models that depend entirely on image visual features or on multimodal learning models that require additional data sources (e.g., annotated text). Yet creating annotated datasets is expensive, and street view images, which themselves contain a large amount of scene text, are often unannotated. This paper therefore proposes a deep unsupervised learning algorithm that combines visual and text features extracted from the image data alone to improve the accuracy of street view image retrieval. Specifically, we employ text detection algorithms to locate scene text, use a Pyramidal Histogram of Characters (PHOC) encoding predictor to extract text information from images, deploy deep convolutional neural networks for visual feature extraction, and incorporate a contrastive learning module for image retrieval. Tested on three street view image datasets, our model shows certain advantages over state-of-the-art multimodal models pre-trained on extensive datasets, while using fewer parameters and fewer floating point operations. Code and data are available at https://github.com/nwuSY/svtRetrieval.
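To make the described pipeline concrete, the sketch below shows one way the components named in the abstract (a PHOC encoding of detected scene text, a CNN visual encoder, feature fusion, and a contrastive objective) could be wired together in PyTorch. It is a minimal illustration rather than the authors' implementation (which is available in the linked repository): the 36-character alphabet, the pyramid levels, the ResNet-50 backbone, the linear fusion layer, and the InfoNCE loss are all assumptions made here for readability, and scene-text detection is assumed to have already produced word strings.

```python
# Minimal sketch (not the authors' code): fuse CNN visual features with
# PHOC-style scene-text features and train with a contrastive (InfoNCE) loss.
# The alphabet, pyramid levels, backbone, and fusion layer are illustrative
# assumptions; the published implementation is at github.com/nwuSY/svtRetrieval.
import string
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

ALPHABET = string.digits + string.ascii_lowercase   # 36 symbols (assumed)
PHOC_LEVELS = (1, 2, 3, 4)                           # pyramid levels (assumed)
PHOC_DIM = len(ALPHABET) * sum(PHOC_LEVELS)

def phoc(word: str) -> torch.Tensor:
    """Pyramidal Histogram of Characters for one detected scene-text word."""
    word = word.lower()
    vec = torch.zeros(PHOC_DIM)
    offset = 0
    for level in PHOC_LEVELS:
        for i, ch in enumerate(word):
            if ch not in ALPHABET:
                continue
            # Which of the `level` regions of the word this character falls in.
            region = min(int(i / max(len(word), 1) * level), level - 1)
            vec[offset + region * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
        offset += level * len(ALPHABET)
    return vec

class SceneTextImageEncoder(nn.Module):
    """Joint embedding of visual (ResNet-50) and scene-text (PHOC) features."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        backbone = models.resnet50(weights=None)          # visual branch
        backbone.fc = nn.Identity()                       # expose 2048-d features
        self.visual = backbone
        self.text_proj = nn.Linear(PHOC_DIM, embed_dim)   # text branch
        self.fuse = nn.Linear(2048 + embed_dim, embed_dim)

    def forward(self, images: torch.Tensor, phoc_vecs: torch.Tensor) -> torch.Tensor:
        v = self.visual(images)                           # (B, 2048)
        t = F.relu(self.text_proj(phoc_vecs))             # (B, embed_dim)
        z = self.fuse(torch.cat([v, t], dim=1))
        return F.normalize(z, dim=1)                      # unit-length embeddings

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive loss pairing the i-th embedding of one view with the i-th of the other."""
    logits = z1 @ z2.t() / tau                            # cosine similarities / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    model = SceneTextImageEncoder()
    imgs_a = torch.randn(4, 3, 224, 224)                  # stand-in for augmented view A
    imgs_b = imgs_a + 0.01 * torch.randn_like(imgs_a)     # stand-in for augmented view B
    texts = torch.stack([phoc(w) for w in ["main", "st", "cafe", "bank"]])
    loss = info_nce(model(imgs_a, texts), model(imgs_b, texts))
    print(float(loss))
```

In an unsupervised setup of this kind, the contrastive loss typically pulls together the embeddings of two augmented views of the same street view image and pushes apart those of different images; at retrieval time, the fused embedding of a query image is compared against the database by cosine similarity.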
