Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
arXiv - CS - Sound Pub Date : 2024-03-26 , DOI: arxiv-2403.17508
Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, Yuki Okamoto

This paper explores whether considering alternative domain-specific embeddings to calculate the Fréchet Audio Distance (FAD) metric can help the FAD to correlate better with perceptual ratings of environmental sounds. We used embeddings from VGGish, PANNs, MS-CLAP, L-CLAP, and MERT, which are tailored for either music or environmental sound evaluation. The FAD scores were calculated for sounds from the DCASE 2023 Task 7 dataset. Using perceptual data from the same task, we find that PANNs-WGM-LogMel produces the best correlation between FAD scores and perceptual ratings of both audio quality and perceived fit with a Spearman correlation higher than 0.5. We also find that music-specific embeddings resulted in significantly lower results. Interestingly, VGGish, the embedding used for the original Fréchet calculation, yielded a correlation below 0.1. These results underscore the critical importance of the choice of embedding for the FAD metric design.
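
The comparison the abstract describes comes down to a rank correlation between per-system FAD scores and per-system perceptual ratings. The following is a minimal sketch of that step only, not the authors' code: the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical, and the numbers are invented purely for illustration and are not taken from the paper or from the DCASE 2023 Task 7 data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values (NOT from the paper): one FAD score and one mean
# perceptual rating per hypothetical sound-generation system.
fad_scores = [2.1, 3.4, 5.0, 7.8, 9.3]
quality_ratings = [8.2, 7.9, 6.5, 6.8, 3.1]

rho = spearman(fad_scores, quality_ratings)
print(f"Spearman correlation: {rho:.2f}")
```

With these invented values the correlation is negative, since a lower FAD is meant to indicate audio closer to the reference set while a higher rating indicates better perceived quality. The abstract does not say how the sign is handled when correlations are reported, so any sign convention applied here would be an assumption.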

Updated: 2024-03-28