Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval,arXiv - CS - Information Retrieval

arXiv - CS - Information Retrieval, Pub Date: 2024-03-20, DOI: arxiv-2403.13317
Haoyu Liu, Yaoxian Song, Xuwu Wang, Zhu Xiangru, Zhixu Li, Wei Song, Tiefeng Li

With the explosive growth of multi-modal information on the Internet, unimodal search can no longer satisfy the requirements of Internet applications. Text-image retrieval research is needed to realize high-quality and efficient retrieval between different modalities. Existing text-image retrieval research is mostly based on general vision-language datasets (e.g., MS-COCO, Flickr30K), in which the query utterances are rigid and unnatural (i.e., verbose and formal). To overcome this shortcoming, we construct a new Compact and Fragmented Query challenge dataset (named Flickr30K-CFQ) to model the text-image retrieval task across multiple query contents and styles, including compact and fine-grained entity-relation corpora. We propose a novel query-enhanced text-image retrieval method using LLM-based prompt engineering. Experiments show that our proposed Flickr30K-CFQ reveals the insufficiency of existing vision-language datasets in realistic text-image tasks. Our LLM-based query-enhanced method, applied to different existing text-image retrieval models, improves query understanding performance on both the public dataset and our challenge set Flickr30K-CFQ, by over 0.9% and 2.4%, respectively. Our project is available anonymously at https://sites.google.com/view/Flickr30K-cfq.

Updated: 2024-03-21
