ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models



arXiv - CS - Multimedia, Pub Date: 2024-03-29, DOI: arxiv-2403.20194
Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang

This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks, which assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy that mimics human cognitive processes by stacking perception, reasoning, and creativity. Each level focuses on a distinct capability, mirroring the cognitive progression from basic perception to logical reasoning and ultimately to advanced creativity. ConvBench comprises 577 meticulously curated multi-turn conversations covering 215 tasks that reflect real-world demands. Automatic evaluations quantify response performance at each turn and at the overall conversation level. By leveraging the capability hierarchy, ConvBench enables precise attribution of conversation mistakes to specific levels. Experimental results reveal a performance gap between multimodal models, including GPT4-V, and humans in multi-turn conversations. Additionally, weak fine-grained perception in multimodal models contributes to reasoning and creation failures. ConvBench serves as a catalyst for further research aimed at enhancing visual dialogues.
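The abstract describes per-turn scoring, an overall conversation score, and attribution of mistakes to a level of the perception, reasoning, creativity hierarchy. The following is only a minimal sketch of that idea, not the paper's evaluation pipeline: the `TurnScore` type, the 0 to 10 score scale, the `threshold` value, the mean-based conversation score, and the "earliest failing level" rule are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three capability levels named in the abstract, in hierarchical order.
LEVELS = ("perception", "reasoning", "creativity")


@dataclass
class TurnScore:
    level: str    # which capability level this turn probes
    score: float  # hypothetical per-turn score on an assumed 0-10 scale


def conversation_score(turns: List[TurnScore]) -> float:
    """Overall score for one conversation (assumption: plain mean of turn scores)."""
    return sum(t.score for t in turns) / len(turns)


def attribute_failure(turns: List[TurnScore], threshold: float = 5.0) -> Optional[str]:
    """Return the lowest level scoring below `threshold`, or None if all pass.

    Assumption: a failure at a lower level is treated as the likely cause of
    failures at the levels stacked on top of it.
    """
    by_level = {t.level: t.score for t in turns}
    for level in LEVELS:
        if by_level[level] < threshold:
            return level
    return None


if __name__ == "__main__":
    # Made-up scores for one three-turn conversation.
    conv = [
        TurnScore("perception", 3.0),
        TurnScore("reasoning", 4.0),
        TurnScore("creativity", 6.0),
    ]
    print(conversation_score(conv))
    print(attribute_failure(conv))
```

Under these assumptions, a conversation whose perception turn scores below the threshold is attributed to perception even when later turns also fail, which mirrors the abstract's observation that weak fine-grained perception contributes to reasoning and creation failures.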


Updated: 2024-04-01
