Polyphonic sound event localization and detection using channel-wise FusionNet
Applied Intelligence ( IF 5.3 ) Pub Date : 2024-04-13 , DOI: 10.1007/s10489-024-05438-6
Spoorthy V., Shashidhar G. Koolagudi

Sound Event Localization and Detection (SELD) is the task of localizing sound events in space and time and classifying them. Multitask models are commonly used to perform SELD. In this work, a deep learning model named channel-wise ‘FusionNet’ is designed to perform the SELD task. A novel fusion layer is introduced into a regular Deep Neural Network (DNN): the input is fed channel-wise, and the outputs of all channels are fused to form a new feature representation. The key contribution of this work is a neural network model that retains the channel-wise information from the multichannel input along with the spatial and temporal information. The proposed network uses separable convolution blocks in its convolution layers, so the complexity of the model is low in terms of both time and space. The input features are Mel-band energies for Sound Event Detection (SED) and intensity vectors for Direction-of-Arrival (DOA) estimation. The fusion layer of the proposed network provides a better representation of features for both the SED and DOA estimation tasks. Experiments are performed on the First-order Ambisonic (FOA) array format recordings of the TAU-NIGENS Spatial Sound Events 2020 dataset. Improved performance in terms of Error Rate (ER), DOA error, and Frame Recall (FR) is observed in comparison with state-of-the-art SELD systems.
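
The link between separable convolutions and low model complexity can be illustrated with a simple weight count. The sketch below is not taken from the paper: it assumes that "separable" means a depthwise separable convolution (one k x k filter per input channel followed by a 1 x 1 pointwise mix), ignores biases, and uses hypothetical layer sizes chosen only for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (bias ignored):
    one k x k depthwise filter per input channel, then a 1 x 1 pointwise mix."""
    return k * k * c_in + c_in * c_out


# Hypothetical layer sizes (assumption, not from the paper).
k, c_in, c_out = 3, 64, 64
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))  # prints: 36864 4672 7.89
```

Under these assumed sizes the separable layer needs roughly an eighth of the weights of the standard layer; the actual savings in the proposed network depend on its real layer configuration, which the abstract does not give.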




Updated: 2024-04-14