Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification
arXiv - CS - Sound Pub Date : 2024-03-26 , DOI: arxiv-2403.17864
Francesca Ronchini, Luca Comanducci, Fabio Antonacci

In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, their effectiveness in the development of audio applications remains uncertain. This paper investigates these aspects, focusing specifically on the task of environmental sound classification. The study analyzes the performance of two different environmental sound classification systems when data generated by text-to-audio models is used for training. Two cases are considered: a) the training dataset is augmented with data produced by two different text-to-audio models; and b) the training dataset consists solely of generated synthetic audio. In both cases, classification performance is evaluated on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas performance drops when training relies on generated audio alone.
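The two training configurations described in the abstract can be sketched as dataset-assembly logic. This is a minimal illustrative sketch, not the authors' actual pipeline; the function and variable names are hypothetical, and samples are reduced to toy (features, label) pairs.

```python
# Hypothetical sketch of the two training configurations described above.
# Each sample is an illustrative (feature_vector, class_label) pair.

def build_training_set(real_train, synthetic, mode):
    """Assemble training data for one of the two experimental cases.

    mode "augmented": real recordings plus text-to-audio generations (case a).
    mode "synthetic_only": text-to-audio generations alone (case b).
    """
    if mode == "augmented":
        return real_train + synthetic
    if mode == "synthetic_only":
        return list(synthetic)
    raise ValueError(f"unknown mode: {mode}")

# Toy data standing in for extracted audio features and class labels.
real_train = [([0.2, 0.1], "rain"), ([0.9, 0.8], "traffic")]
synthetic = [([0.25, 0.12], "rain"), ([0.85, 0.9], "traffic")]

augmented = build_training_set(real_train, synthetic, "augmented")
synthetic_only = build_training_set(real_train, synthetic, "synthetic_only")
```

In both configurations, evaluation would use a held-out set of real recordings, matching the paper's setup of testing on real data.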

Updated: 2024-03-28