A Novel Pretrained General-Purpose Vision Language Model for the Vietnamese Language
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2) Pub Date: 2024-03-30, DOI: 10.1145/3654796
Vu Dinh Anh, Pham Quang Nhat Minh, Giang Son Tran

Lying at the intersection of computer vision and natural language processing, vision language models can process images and text jointly. These models are useful in a variety of tasks: generating text from images (and vice versa), image-text retrieval, and visual navigation. Beyond building a model trained on a single dataset for a single task, researchers also study general-purpose models that leverage many datasets across multiple tasks. Their two primary applications are image captioning and visual question answering. For English, large datasets and foundation models are already abundant; for Vietnamese, however, they remain limited. To broaden the language coverage, this work proposes a pretrained general-purpose image-text model named VisualRoBERTa. A dataset of 600K captioned images (MS COCO 2017 translated from English to Vietnamese) is introduced to pretrain VisualRoBERTa. The model's architecture is built from Convolutional Neural Network and Transformer blocks. Fine-tuning VisualRoBERTa yields promising results on the ViVQA dataset, with 34.49% accuracy, 0.4173 BLEU-4, and 0.4390 ROUGE-L on visual question answering, and the best outcomes on the sViIC dataset, with 0.6685 BLEU-4 and 0.6320 ROUGE-L on image captioning.
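The abstract does not include the paper's code, but the general shape of a CNN + Transformer vision-language model it describes can be sketched in a few lines of PyTorch: a small convolutional backbone turns the image into a grid of visual tokens, which are concatenated with text token embeddings and passed through a shared Transformer encoder with a language-model head. This is an illustrative sketch only; the class name, dimensions, single-encoder layout, and objective below are assumptions for demonstration, not VisualRoBERTa's actual architecture.

```python
import torch
import torch.nn as nn


class TinyVisionLanguageModel(nn.Module):
    """Toy CNN + Transformer vision-language model (illustrative, not the paper's code)."""

    def __init__(self, vocab_size=64000, d_model=256, n_heads=4, n_layers=4, max_len=128):
        super().__init__()
        # CNN backbone: maps an RGB image to a 7x7 grid of d_model-dim features,
        # i.e. 49 visual "tokens" (a stand-in for the paper's CNN encoder).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, kernel_size=3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        # Text side: token + position embeddings, RoBERTa-style.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len + 49, d_model)
        # Shared Transformer encoder over the concatenated visual + text sequence.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Language-model head used for pretraining / caption generation.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, token_ids):
        b = images.size(0)
        v = self.cnn(images).flatten(2).transpose(1, 2)  # (B, 49, d_model)
        t = self.tok_emb(token_ids)                      # (B, L, d_model)
        x = torch.cat([v, t], dim=1)                     # visual tokens first
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0).expand(b, -1)
        h = self.encoder(x + self.pos_emb(pos))
        return self.lm_head(h[:, v.size(1):])            # logits over text positions only


model = TinyVisionLanguageModel()
imgs = torch.randn(2, 3, 224, 224)
toks = torch.randint(0, 64000, (2, 32))
logits = model(imgs, toks)  # -> torch.Size([2, 32, 64000])
print(logits.shape)
```

In an actual pretraining setup, these logits would be scored against masked or shifted caption tokens (for example, a masked-language-model loss in the spirit of RoBERTa), which is roughly the kind of objective a RoBERTa-derived vision-language model would use.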


