Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection
arXiv - CS - Multimedia. Pub Date: 2024-04-15. arXiv:2404.09654
Jiaqi Zhu, Shaofeng Cai, Fang Deng, Junran Wu

Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) by pairing images with textual descriptions of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and they prioritize global image-level representations over the local pixel-level image-to-text alignment that is crucial for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy that first leverages a large language model (LLM) to generate informative anomaly prompts. This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner that fuses local pixel-level semantics for precise anomaly localization by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on the challenging MVTec AD and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA over state-of-the-art zero-shot VAD approaches.
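The abstract names two ingredients: contextual scoring of LLM-generated anomaly prompts for per-image adaptation, and patch-level image-to-text alignment for localization. The sketch below is not the paper's implementation; it is a minimal illustration of the generic CLIP-style mechanics these ideas build on, assuming a vision-language encoder that yields a global image embedding, per-patch embeddings, and text embeddings for candidate prompts. All names (score_prompts, patch_anomaly_scores, tau) are hypothetical.

```python
import torch
import torch.nn.functional as F

def score_prompts(image_embed, prompt_embeds):
    """Contextual-scoring sketch: rank candidate anomaly prompts by cosine
    similarity to the global image embedding, so each image keeps only the
    prompts that are semantically relevant to it (per-image adaptation)."""
    image = F.normalize(image_embed, dim=-1)      # (D,)
    prompts = F.normalize(prompt_embeds, dim=-1)  # (K, D)
    return prompts @ image                        # (K,) relevance scores

def patch_anomaly_scores(patch_embeds, normal_embeds, abnormal_embeds, tau=0.07):
    """Local-alignment sketch: compare every patch embedding against pooled
    'normal' and 'abnormal' prompt embeddings; the softmax mass assigned to
    the abnormal side is the per-patch anomaly score."""
    patches = F.normalize(patch_embeds, dim=-1)                  # (P, D)
    normal = F.normalize(normal_embeds, dim=-1).mean(0)          # (D,)
    abnormal = F.normalize(abnormal_embeds, dim=-1).mean(0)      # (D,)
    logits = patches @ torch.stack([normal, abnormal], dim=0).T  # (P, 2)
    return (logits / tau).softmax(dim=-1)[:, 1]                  # (P,)

# Toy usage with random tensors standing in for encoder outputs.
D, P, K = 512, 196, 8
img, patches = torch.randn(D), torch.randn(P, D)
normal_prompts, abnormal_prompts = torch.randn(K, D), torch.randn(K, D)
keep = score_prompts(img, abnormal_prompts).topk(4).indices  # per-image prompt selection
scores = patch_anomaly_scores(patches, normal_prompts, abnormal_prompts[keep])
print(scores.shape)  # torch.Size([196]); reshape to 14x14 for an anomaly heatmap
```

The temperature tau simply sharpens the normal-vs-abnormal softmax; the per-patch scores can be reshaped into a spatial map and compared against ground-truth masks, which is what region-level metrics such as PRO evaluate.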

Updated: 2024-04-16