Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer
arXiv - CS - Sound. Pub Date: 2024-03-26. DOI: arXiv:2403.17327
Jeong-Yoon Kim, Seung-Ho Lee

In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using a vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in the spectrogram, and by transferring positional information between ViTs through knowledge transfer. The proposed method has the following original contributions: i) We use vertically segmented patches of the log-Mel spectrogram to analyze the correlation of frequencies over time. This type of patch allows us to correlate the frequencies most relevant to a particular emotion with the time at which they were uttered. ii) We propose image coordinate encoding, an absolute positional encoding suited to ViT. By normalizing the x and y coordinates of the image to [-1, 1] and concatenating them to the image, we can effectively provide valid absolute positional information to the ViT. iii) Through feature map matching, the locality and positional information of the teacher network are effectively transferred to the student network. The teacher network is a ViT that incorporates the locality of a convolutional stem and absolute positional information through image coordinate encoding, while the student network is a basic ViT that lacks positional encoding. In the feature map matching stage, we train with the mean absolute error (L1 loss) to minimize the difference between the feature maps of the two networks. To validate the proposed method, three speech emotion datasets (SAVEE, EmoDB, and CREMA-D) were converted into log-Mel spectrograms for comparison experiments. The experimental results show that the proposed method significantly outperforms state-of-the-art methods in terms of weighted accuracy while requiring significantly fewer floating point operations (FLOPs). Overall, the proposed method offers a promising solution for SER by providing improved efficiency and performance.
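To make the three ideas concrete, the sketch below shows vertical patch extraction, image coordinate encoding, and the L1 feature-map-matching loss in PyTorch. It is a minimal reconstruction from the abstract, not the authors' code: the tensor shapes, the patch width, and all function names (add_coordinate_channels, vertical_patches, feature_map_matching_loss) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def add_coordinate_channels(x):
    # x: (B, C, H, W) log-Mel spectrogram treated as an image.
    # Concatenate x/y coordinate maps normalized to [-1, 1]
    # (the image coordinate encoding described in the abstract).
    b, _, h, w = x.shape
    ys = torch.linspace(-1.0, 1.0, h, device=x.device)
    xs = torch.linspace(-1.0, 1.0, w, device=x.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_x, grid_y]).expand(b, -1, -1, -1)
    return torch.cat([x, coords], dim=1)  # (B, C + 2, H, W)

def vertical_patches(x, patch_width):
    # Split the spectrogram into full-height, vertically segmented
    # patches, so each token spans all frequencies over one time span.
    b, c, h, w = x.shape
    assert w % patch_width == 0, "width must be divisible by patch width"
    patches = x.unfold(3, patch_width, patch_width)      # (B, C, H, W/pw, pw)
    patches = patches.permute(0, 3, 1, 2, 4).reshape(b, w // patch_width, -1)
    return patches  # (B, num_patches, C * H * patch_width)

def feature_map_matching_loss(student_feats, teacher_feats):
    # Mean absolute error (L1 loss) between intermediate feature maps,
    # used to transfer the teacher's locality and positional cues.
    return F.l1_loss(student_feats, teacher_feats.detach())

if __name__ == "__main__":
    x = torch.randn(2, 1, 128, 256)   # assumed batch of log-Mel spectrograms
    tokens = vertical_patches(add_coordinate_channels(x), patch_width=16)
    print(tokens.shape)               # torch.Size([2, 16, 6144])
```

Because each patch spans the full frequency axis, a token's attention weights directly relate the spectrum at one time span to the spectrum at another, which is the temporal frequency correlation the method exploits.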

Updated: 2024-03-28