Attention-based multimodal image matching
Computer Vision and Image Understanding (IF 4.5), Pub Date: 2024-02-07, DOI: 10.1016/j.cviu.2024.103949
Aviad Moreshet, Yosi Keller

We propose a method for matching multimodal image patches using a multiscale Transformer-Encoder that attends to the feature maps of a Siamese CNN. It effectively combines multiscale image embeddings while emphasizing task-specific, appearance-invariant image cues. We also introduce a residual attention architecture whose residual connection allows for end-to-end training. To the best of our knowledge, this is the first successful use of the Transformer-Encoder architecture in multimodal image matching. We motivate the use of task-specific multimodal descriptors by achieving new state-of-the-art accuracy on both multimodal and unimodal benchmarks, and demonstrate the quantitative and qualitative advantages of our approach in multimodal matching over state-of-the-art unimodal image matching methods. Our code is shared here: .
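The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of one plausible reading of it: a shared (Siamese) CNN produces multiscale feature maps, which are projected into tokens, passed through a Transformer encoder, and combined with the input tokens via a residual connection before pooling into a patch descriptor. This is not the authors' released implementation; every layer size, the pooling scheme, and the similarity score here are illustrative assumptions.

# A minimal sketch (not the authors' released code) of the architecture the
# abstract describes: a Siamese CNN whose multiscale feature maps are flattened
# into tokens for a Transformer encoder, with a residual connection adding the
# attention output back onto the CNN tokens. All layer sizes are assumptions.
import torch
import torch.nn as nn

class MultiscaleAttentionMatcher(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        # Shared (Siamese) backbone: both patches pass through the same weights.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Project each intermediate feature map to a common token width.
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (32, 64, dim)])
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def embed(self, patch):
        # Collect multiscale feature maps from the shared backbone.
        feats, x = [], patch
        for layer in self.backbone:
            x = layer(x)
            if isinstance(layer, nn.ReLU):
                feats.append(x)
        # Flatten each scale into tokens of width `dim` and concatenate.
        tokens = torch.cat(
            [p(f).flatten(2).transpose(1, 2) for p, f in zip(self.proj, feats)],
            dim=1)
        # Residual attention: add the encoder output back onto the CNN tokens,
        # then mean-pool into a single L2-normalized patch descriptor.
        attended = tokens + self.encoder(tokens)
        return nn.functional.normalize(attended.mean(dim=1), dim=-1)

    def forward(self, patch_a, patch_b):
        # Descriptor similarity between the two (possibly cross-modal) patches.
        return (self.embed(patch_a) * self.embed(patch_b)).sum(dim=-1)

if __name__ == "__main__":
    model = MultiscaleAttentionMatcher()
    a, b = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
    print(model(a, b).shape)  # torch.Size([2])

Because the backbone weights are shared, patches from both modalities are embedded into a common descriptor space, which is what makes a simple similarity score between the two embeddings meaningful for cross-modal matching.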
