SonicVisionLM: Playing Sound with Vision Language Models

Xie, Zhifeng; Yu, Shengye; He, Qile; Li, Mengtian

arXiv:2401.04394 (cs)

[Submitted on 9 Jan 2024 (v1), last revised 3 Apr 2024 (this version, v3)]


Abstract: There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present SonicVisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into the better-studied sub-problems of aligning image to text and text to audio through the popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals and improving alignment between audio and video components. Project page: this https URL
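The two-stage idea in the abstract (a VLM suggests sounds as text, then a text-to-audio model renders them in time with the video) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: `SoundEvent`, `generate_soundtrack`, `fake_vlm`, `fake_text_to_audio`, and the 16 kHz sample rate are all hypothetical names and values introduced here. The paper's time-controlled adapter is replaced by a much simpler stand-in that places each generated clip at its suggested onset on a silent timeline.

```python
from dataclasses import dataclass
from typing import Callable, List

SAMPLE_RATE = 16_000  # assumed for illustration; not taken from the paper


@dataclass
class SoundEvent:
    """A sound suggested for the video (hypothetical structure)."""
    description: str  # text prompt for the sound, e.g. "door slams"
    start_s: float    # onset time in seconds
    end_s: float      # offset time in seconds


def generate_soundtrack(
    video_duration_s: float,
    suggest_events: Callable[[], List[SoundEvent]],
    text_to_audio: Callable[[str, int], List[float]],
) -> List[float]:
    """Render each suggested event from text and add it to a silent track."""
    total = int(round(video_duration_s * SAMPLE_RATE))
    track = [0.0] * total
    for event in suggest_events():
        # Clamp the event window to the bounds of the video.
        start = max(0, int(round(event.start_s * SAMPLE_RATE)))
        end = min(total, int(round(event.end_s * SAMPLE_RATE)))
        if end <= start:
            continue  # skip empty or fully out-of-range events
        clip = text_to_audio(event.description, end - start)
        # Mix the clip into the track, truncating if the model returned extra samples.
        for i, sample in enumerate(clip[: end - start]):
            track[start + i] += sample
    return track


def fake_vlm() -> List[SoundEvent]:
    """Stand-in for the VLM stage: returns one fixed, made-up event."""
    return [SoundEvent("door slams", 0.5, 1.0)]


def fake_text_to_audio(text: str, n_samples: int) -> List[float]:
    """Stand-in for the text-to-audio stage: returns a constant signal."""
    return [0.1] * n_samples


track = generate_soundtrack(2.0, fake_vlm, fake_text_to_audio)
```

Passing the two stages in as callables keeps the timeline logic independent of any particular VLM or diffusion model, so real models could be substituted for the stand-ins without changing `generate_soundtrack`.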

Comments:	CVPR 2024
Subjects:	Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2401.04394 [cs.MM]
	(or arXiv:2401.04394v3 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2401.04394

Submission history

From: Shengye Yu
[v1] Tue, 9 Jan 2024 07:30:10 UTC (21,124 KB)
[v2] Sat, 27 Jan 2024 07:01:10 UTC (21,128 KB)
[v3] Wed, 3 Apr 2024 10:23:06 UTC (7,118 KB)
