A Lightweight Channel and Time Attention Enhanced 1D CNN Model for Environmental Sound Classification
Expert Systems with Applications (IF 8.5) Pub Date: 2024-03-21, DOI: 10.1016/j.eswa.2024.123768
Huaxing Xu, Yunzhi Tian, Haichuan Ren, Xudong Liu

One-dimensional convolutional neural networks (1D CNNs) that take raw waveforms directly as input are less competitive than 2D CNNs at recognizing environmental sound. To overcome this disadvantage, we propose a novel lightweight 1D CNN structure that employs an attention mechanism and yields significant improvements in both accuracy and computational complexity. Concretely, (1) two attention modules are constructed separately along the channel and time dimensions and combined to give an intermediate feature map that focuses on key frequency bands and semantically related time-frame information. (2) Without increasing training overhead, a snapshot ensemble is employed to further improve performance. Results on two benchmark datasets (UrbanSound8k, ESC-10) demonstrate that, by employing the attention mechanism, our model outperforms all previously reported 1D CNN approaches in accuracy with fewer parameters. Moreover, with this performance gain, the proposed model is superior to most existing spectral-based 2D CNN approaches and competitive with SOTA performance, while using orders of magnitude fewer parameters. Overall, the results indicate that our model is compact and has good potential in practical resource-limited applications, such as sound recognition on embedded platforms.
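
The abstract does not give the internals of the two attention modules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a feature map of shape (channels, time frames), a squeeze-and-excitation-style bottleneck for the channel gate, a single linear projection per frame for the time gate, and multiplicative combination of the two gates. The function names and the weights `w1`, `w2`, and `v` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat, w1, w2):
    """Return one gate in (0, 1) per channel.

    feat: list of C channels, each a list of T frame values.
    w1:   hidden x C weight matrix (assumed bottleneck layer).
    w2:   C x hidden weight matrix (assumed expansion layer).
    """
    # Squeeze: average each channel over time.
    squeezed = [sum(ch) / len(ch) for ch in feat]
    # Bottleneck with ReLU, then expand back to C gates.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]


def time_attention(feat, v):
    """Return one gate in (0, 1) per time frame.

    v: length-C projection vector (assumed) applied across channels at each frame.
    """
    n_channels, n_frames = len(feat), len(feat[0])
    return [
        sigmoid(sum(v[c] * feat[c][t] for c in range(n_channels)))
        for t in range(n_frames)
    ]


def apply_attention(feat, ch_gate, t_gate):
    """Combine both gates multiplicatively (an assumption) to reweight the map."""
    return [
        [feat[c][t] * ch_gate[c] * t_gate[t] for t in range(len(t_gate))]
        for c in range(len(ch_gate))
    ]


if __name__ == "__main__":
    random.seed(0)
    C, T, H = 4, 6, 2  # channels, time frames, bottleneck width (all illustrative)
    feat = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(C)]
    w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
    w2 = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(C)]
    v = [random.uniform(-1, 1) for _ in range(C)]

    refined = apply_attention(feat, channel_attention(feat, w1, w2), time_attention(feat, v))
    print(len(refined), len(refined[0]))  # same shape as the input map
```

Because each gate lies strictly between 0 and 1, the refined map keeps the input's shape and only scales values down. In a trained network the weights would be learned so that informative channels and frames receive gates close to 1.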

Updated: 2024-03-21