ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models
arXiv - CS - Multimedia. Pub Date: 2024-03-29, DOI: arxiv-2403.20194
Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang
This paper presents ConvBench, a novel multi-turn conversation evaluation
benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing
benchmarks that assess individual capabilities in single-turn dialogues,
ConvBench adopts a three-level multimodal capability hierarchy, mimicking human
cognitive processes by stacking up perception, reasoning, and creativity. Each
level focuses on a distinct capability, mirroring the cognitive progression
from basic perception to logical reasoning and ultimately to advanced
creativity. ConvBench comprises 577 meticulously curated multi-turn
conversations encompassing 215 tasks reflective of real-world demands.
Automatic evaluations quantify response performance at each turn and overall
conversation level. Leveraging the capability hierarchy, ConvBench enables
precise attribution of conversation mistakes to specific levels. Experimental
results reveal a performance gap between multimodal models, including GPT-4V,
and humans in multi-turn conversations. Additionally, weak fine-grained
perception in multimodal models contributes to downstream reasoning and
creation failures. ConvBench serves as a catalyst for further research aimed at
enhancing visual dialogues.
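The abstract describes per-turn automatic scoring plus attribution of failures to a capability level (perception, reasoning, creativity). A minimal sketch of that attribution idea is below; all names, the score scale, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of ConvBench-style turn-level scoring with
# hierarchical error attribution. The three turns map to the three
# capability levels in order; scores and threshold are assumed.

LEVELS = ("perception", "reasoning", "creativity")  # turn 1, 2, 3

@dataclass
class Conversation:
    # turn_scores[i] is an automatic judge's score for turn i in [0, 1]
    turn_scores: list

def attribute_failure(conv, threshold=0.5):
    """Return the lowest capability level whose turn falls below the
    threshold, mirroring the observation that weak perception cascades
    into reasoning and creation failures."""
    for level, score in zip(LEVELS, conv.turn_scores):
        if score < threshold:
            return level
    return None  # all turns acceptable

def overall_score(conv):
    # Simple average over turns as an overall conversation score.
    return sum(conv.turn_scores) / len(conv.turn_scores)

conv = Conversation(turn_scores=[0.4, 0.7, 0.8])
print(attribute_failure(conv))          # weakest level triggers first
print(round(overall_score(conv), 2))
```

Attributing a mistake to the first failing level reflects the hierarchy's cascading structure: a turn built on a flawed perception turn is judged at its root cause rather than only at the surface.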
Updated: 2024-04-01