SonicVisionLM: Playing Sound with Vision Language Models
arXiv - CS - Sound. Pub Date: 2024-01-09, DOI: arXiv:2401.04394
Zhifeng Xie, Shengye Yu, Mengtian Li, Qile He, Chaofeng Chen, Yu-Gang Jiang

There has been growing interest in generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing video-sound generation methods attempt to create sound directly from visual representations, which is difficult because visual and audio representations are hard to align. In this paper, we present SonicVisionLM, a novel framework that generates a wide range of sound effects by leveraging vision language models (VLMs). Instead of generating audio directly from video, given a silent video our approach first uses a VLM to identify events within the video and suggest possible sounds that match its content. This shift transforms the challenging task of aligning image and audio into the better-studied sub-problems of aligning image to text and text to audio via popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed temporally controlled audio adapters. Our approach surpasses current state-of-the-art methods for converting video to audio, yielding tighter synchronization with the visuals and improved alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
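To make the two-stage design concrete, here is a minimal Python sketch of the pipeline the abstract describes; the SoundEvent structure and the helper names (describe_events, text_to_audio) are hypothetical placeholders for illustration, not the authors' released API.

```python
# Hypothetical sketch of SonicVisionLM's two-stage pipeline; the helpers
# below are illustrative placeholders, not the authors' actual code.
from dataclasses import dataclass

@dataclass
class SoundEvent:
    description: str  # sound suggested by the VLM, e.g. "glass shattering"
    onset: float      # event start time in the video, in seconds
    duration: float   # event length, in seconds

def describe_events(video_path: str) -> list[SoundEvent]:
    """Stage 1 (image-to-text): a vision language model watches the silent
    video and proposes timestamped text descriptions of plausible sounds."""
    ...

def text_to_audio(event: SoundEvent) -> bytes:
    """Stage 2 (text-to-audio): a diffusion model, steered by a temporally
    controlled audio adapter, renders the described sound as a waveform."""
    ...

def sonify(video_path: str) -> list[tuple[SoundEvent, bytes]]:
    # The hard image-to-audio alignment problem is decomposed into the two
    # better-studied sub-problems above; each generated clip is then placed
    # at the onset the VLM predicted.
    events = describe_events(video_path)
    return [(event, text_to_audio(event)) for event in events]
```

Under this reading, timing information travels with the text descriptions, so synchronization is enforced at generation time rather than recovered afterward.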

Updated: 2024-01-11