Pre-trained models for detection and severity level classification of dysarthria from speech
Speech Communication (IF 3.2), Pub Date: 2024-02-14, DOI: 10.1016/j.specom.2024.103047
Farhad Javanmardi , Sudarsana Reddy Kadiri , Paavo Alku

Automatic detection and severity level classification of dysarthria from speech enable non-invasive and effective diagnosis that supports clinical decisions about patients' medication and therapy. In this work, three pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) are studied as feature extractors for building automatic detection and severity level classification systems for dysarthric speech. The experiments were conducted using two publicly available databases (UA-Speech and TORGO). One machine learning-based model (support vector machine, SVM) and one deep learning-based model (convolutional neural network, CNN) were used as classifiers. To compare the performance of the wav2vec2-BASE, wav2vec2-LARGE, and HuBERT features, three popular acoustic feature sets were considered as baselines: mel-frequency cepstral coefficients (MFCCs), openSMILE, and the extended Geneva minimalistic acoustic parameter set (eGeMAPS). Experimental results revealed that the features derived from the pre-trained models outperformed the three baseline feature sets. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features. In particular, compared to the best-performing baseline feature set in the detection problem (openSMILE), the HuBERT features showed absolute accuracy improvements ranging from 1.33% (the SVM classifier, the TORGO database) to 2.86% (the SVM classifier, the UA-Speech database). In the severity level classification problem, the HuBERT features showed absolute accuracy improvements ranging from 6.54% (the SVM classifier, the TORGO database) to 10.46% (the SVM classifier, the UA-Speech database) compared to the best-performing baseline feature set (eGeMAPS).
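The pipeline described above (pre-trained model as a frozen feature extractor, followed by a classical classifier) can be illustrated with a minimal sketch. This is not the authors' exact setup; it assumes the publicly available facebook/hubert-base-ls960 checkpoint, mean-pooling of the hidden states into one utterance-level vector, and placeholder file names and labels standing in for a labelled corpus such as UA-Speech or TORGO.

```python
# Minimal sketch: HuBERT features + SVM for dysarthria detection (assumptions noted above).
import numpy as np
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed checkpoint; the paper also studies wav2vec2-BASE and wav2vec2-LARGE.
extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def utterance_embedding(wav_path: str) -> np.ndarray:
    """Load one utterance, resample to 16 kHz, and mean-pool HuBERT hidden states."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # shape: (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()        # one vector per utterance

# Placeholder paths and labels (0 = control, 1 = dysarthric); not real data.
wav_files = ["speaker1_utt1.wav", "speaker2_utt1.wav"]
labels = [0, 1]

X = np.stack([utterance_embedding(f) for f in wav_files])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
```

For the severity level classification task the same sketch applies with multi-class labels (e.g., severity grades instead of a binary control/dysarthric label); replacing the SVM with a CNN trained on the frame-level features corresponds to the second classifier studied in the paper.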

Updated: 2024-02-14