Multiscale-multichannel feature extraction and classification through one-dimensional convolutional neural network for Speech emotion recognition

Speech Communication (IF 3.2), Pub Date: 2023-11-22, DOI: 10.1016/j.specom.2023.103010
Minying Liu, Alex Noel Joseph Raj, Vijayarajan Rajangam, Kunwu Ma, Zhemin Zhuang, Shuxin Zhuang

Speech emotion recognition (SER) is a crucial field of research in artificial intelligence and human–computer interaction. Extracting effective speech features for emotion recognition is a continuing research focus in SER. Most research has focused on finding an optimal speech feature to extract hidden local features while ignoring the global relationships of the speech signal. In this paper, we propose a method that utilizes a multiscale-multichannel feature extraction structure with global and local information to obtain comprehensive speech features. Our approach employs a one-dimensional convolutional neural network (1D CNN) for feature learning and emotion recognition, capturing both spectral and spatial characteristics of speech for superior learning capabilities with improved SER results. We conducted extensive experiments on publicly available emotion recognition datasets, employing three distinct data augmentation (DA) techniques to enhance model generalization. Our model utilized Mel-frequency cepstral coefficients and zero-crossing rate features from speech samples for training and outperformed state-of-the-art techniques in terms of accuracy. Additionally, we conducted experiments to validate the effectiveness and reliability of our proposed method.
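
As a rough illustration of two ideas named in the abstract, the sketch below computes a frame-wise zero-crossing rate and applies 1D convolutions at several kernel widths in parallel. It is a minimal, pure-Python sketch under stated assumptions and does not reproduce the authors' architecture: the frame length, hop length, kernel values, max-pooling step, and all function names (`zero_crossing_rate`, `frame_signal`, `conv1d_valid`, `multiscale_features`) are hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (zero counts as positive)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_signal(signal: Sequence[float], frame_length: int, hop_length: int) -> List[Sequence[float]]:
    """Split a signal into overlapping frames, dropping any incomplete tail."""
    return [
        signal[start:start + frame_length]
        for start in range(0, len(signal) - frame_length + 1, hop_length)
    ]


def conv1d_valid(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """'Valid' 1D cross-correlation (no padding), the operation a 1D CNN layer applies."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k))
        for i in range(len(x) - k + 1)
    ]


def multiscale_features(x: Sequence[float], kernels: Sequence[Sequence[float]]) -> List[float]:
    """Assumed design: one branch per kernel width, each reduced by global max pooling."""
    return [max(conv1d_valid(x, kernel)) for kernel in kernels]


if __name__ == "__main__":
    # Toy signal: alternating signs first, then constant.
    signal = [1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
    zcr_per_frame = [zero_crossing_rate(f) for f in frame_signal(signal, 4, 2)]
    print(zcr_per_frame)
    # Two branches with kernel widths 2 and 3 (hypothetical, untrained weights).
    print(multiscale_features(zcr_per_frame, [[1.0, 1.0], [1.0, 1.0, 1.0]]))
```

In a real pipeline the kernels would be learned and the input would typically stack MFCC and zero-crossing-rate values per frame; the narrow kernel responds to short local patterns while the wider one summarizes a longer context, which is the general intuition behind combining local and global information.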

Updated: 2023-11-26
