Crowdsourcing Thumbnail Captions: Data Collection and Validation

ACM Transactions on Interactive Intelligent Systems (IF 3.4), Pub Date: 2023-03-28, DOI: https://dl.acm.org/doi/10.1145/3589346
Carlos Aguirre, Shiye Cao, Amama Mahmood, Chien-Ming Huang

Speech interfaces, such as personal assistants and screen readers, read image captions to users—but typically only one caption is available per image, which may not be adequate for all situations (e.g., browsing large quantities of images). Long captions provide a deeper understanding of an image but require more time to listen to, whereas shorter captions may not allow for such thorough comprehension, yet have the advantage of being faster to consume. We explore how to effectively collect both thumbnail captions—succinct image descriptions meant to be consumed quickly—and comprehensive captions—which allow individuals to understand visual content in greater detail; we consider text-based instructions and time-constrained methods to collect descriptions at these two levels of detail and find that a time-constrained method is the most effective for collecting thumbnail captions while preserving caption accuracy. Additionally, we verify that caption authors using this time-constrained method are still able to focus on the most important regions of an image by tracking their eye gaze. We evaluate our collected captions along human-rated axes—correctness, fluency, amount of detail, and mentions of important concepts—and discuss the potential for model-based metrics to perform large-scale automatic evaluations in the future.
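
As an illustration of the human-rated evaluation described above, the following is a minimal sketch of how per-caption ratings on the four axes could be averaged for each collection method. The record layout, the condition labels, the function name `mean_ratings_by_condition`, and the scores are all assumptions made for this example; they are not taken from the paper and do not reflect its data or results.

```python
from collections import defaultdict
from statistics import mean

# The four human-rated axes named in the abstract (key names are hypothetical).
AXES = ("correctness", "fluency", "detail", "important_concepts")


def mean_ratings_by_condition(ratings):
    """Average each axis per collection condition.

    `ratings` is a list of dicts, each holding a "condition" label and one
    numeric score per axis. This layout is an assumption for illustration.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in ratings:
        for axis in AXES:
            grouped[row["condition"]][axis].append(row[axis])
    return {
        condition: {axis: mean(scores) for axis, scores in by_axis.items()}
        for condition, by_axis in grouped.items()
    }


# Placeholder scores for illustration only; NOT results from the paper.
example = [
    {"condition": "time_constrained", "correctness": 4, "fluency": 5,
     "detail": 2, "important_concepts": 4},
    {"condition": "time_constrained", "correctness": 5, "fluency": 4,
     "detail": 2, "important_concepts": 5},
    {"condition": "text_instructions", "correctness": 4, "fluency": 4,
     "detail": 3, "important_concepts": 4},
]
summary = mean_ratings_by_condition(example)
```

Keeping the axes separate, rather than collapsing them into one score, mirrors the trade-off the abstract describes: a thumbnail caption is expected to carry less detail while remaining correct and still mentioning the important concepts.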

Updated: 2023-03-29
