BVA-Transformer: Image-text multimodal classification and dialogue model architecture based on Blip and visual attention mechanism
Displays (IF 4.3) Pub Date: 2024-04-12, DOI: 10.1016/j.displa.2024.102710
Kaiyu Zhang, Fei Wu, Guowei Zhang, Jiawei Liu, Min Li

Multimodal tasks have become a hot research direction in recent years. The emergence of large-scale models has steadily advanced a wide range of multimodal tasks and produced remarkable results. However, when fusing multiple modalities for multimodal tasks, how to better integrate multimodal features remains an open problem. In tasks such as sentiment analysis over diverse social media content, relying solely on features derived from the [CLS] token can leave the model with insufficient information. This paper proposes the BVA-Transformer model architecture for image-text multimodal classification and dialogue, which incorporates the EF-CaTrBERT method for feature fusion and introduces BLIP to transform images into the textual space. This allows images and text to be fused in the same information space, avoiding the information redundancy and conflict that arise with traditional feature-fusion methods. In addition, we propose a Global Features Encoder (GFE) module based on visual attention in the BVA-Transformer, which provides more global and targeted auxiliary features for the [CLS] token. This enables the model to exploit more feature information in classification tasks under this fusion scheme and to dynamically select the information to focus on. We also introduce the Trv structure from EVA-02 into the Decoder of the BVA-Transformer and investigate its impact on model performance. Furthermore, we design a three-stage training procedure to further enhance the model's performance. Experimental results demonstrate that BVA-Transformer achieves high-quality classification while generating dialogue sentences. Compared with existing multimodal classification models on our validation dataset, it exhibits excellent performance.
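As a rough illustration of the image-to-text transformation step described above, the sketch below uses an off-the-shelf BLIP captioning model from Hugging Face transformers to map an image into the textual space, then concatenates the caption with the accompanying text before encoding with BERT. The checkpoint names and the plain-concatenation fusion are illustrative assumptions; the abstract does not specify the internals of EF-CaTrBERT.

```python
# A minimal sketch (not the paper's exact pipeline): caption an image with BLIP,
# then fuse the caption with the post text in the same textual space via BERT.
# Checkpoint names and the simple concatenation fusion are assumptions.
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    BertTokenizer, BertModel,
)

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_image_text(image_path: str, text: str):
    # 1) Transform the image into the textual space via BLIP captioning.
    image = Image.open(image_path).convert("RGB")
    pixel_inputs = blip_processor(images=image, return_tensors="pt")
    caption_ids = blip_model.generate(**pixel_inputs, max_new_tokens=30)
    caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)

    # 2) Fuse caption and text in one token sequence; BERT's [CLS]
    #    then summarizes both modalities in a shared information space.
    inputs = tokenizer(text, caption, return_tensors="pt", truncation=True)
    outputs = bert(**inputs)
    return outputs.last_hidden_state  # (1, seq_len, hidden); [CLS] at index 0
```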
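The abstract likewise does not give the internals of the GFE module, but its core idea, letting the [CLS] representation attend over all token features to gather a global, targeted auxiliary summary, can be sketched as below. The module shape, head count, and the concatenation with [CLS] are hypothetical choices for illustration, not the paper's exact design.

```python
# Hypothetical sketch of an attention-based global feature encoder:
# the [CLS] vector queries the full token sequence, and the attended summary
# is concatenated with [CLS] before classification. Not the paper's exact GFE.
import torch
import torch.nn as nn

class GlobalFeaturesEncoder(nn.Module):
    def __init__(self, hidden: int = 768, heads: int = 8, num_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden), e.g. BERT's last layer
        cls = hidden_states[:, :1, :]                  # (batch, 1, hidden) query
        global_feat, _ = self.attn(cls, hidden_states, hidden_states)
        fused = torch.cat([cls, global_feat], dim=-1)  # (batch, 1, 2*hidden)
        return self.classifier(fused.squeeze(1))       # (batch, num_classes)

# Usage with the BERT output from the previous sketch:
# logits = GlobalFeaturesEncoder()(outputs.last_hidden_state)
```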

Updated: 2024-04-12