Pre-trained models for detection and severity level classification of dysarthria from speech
Speech Communication (IF 3.2), Pub Date: 2024-02-14, DOI: 10.1016/j.specom.2024.103047
Farhad Javanmardi , Sudarsana Reddy Kadiri , Paavo Alku

Automatic detection and severity level classification of dysarthria from speech enable non-invasive and effective diagnosis that supports clinical decisions about patients' medication and therapy. In this work, three pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) are studied as feature extractors for building automatic detection and severity level classification systems for dysarthric speech. The experiments were conducted using two publicly available databases (UA-Speech and TORGO). One machine learning-based model (support vector machine, SVM) and one deep learning-based model (convolutional neural network, CNN) were used as classifiers. To compare the performance of the wav2vec2-BASE, wav2vec2-LARGE, and HuBERT features, three popular acoustic feature sets were considered as baselines: mel-frequency cepstral coefficients (MFCCs), openSMILE, and the extended Geneva minimalistic acoustic parameter set (eGeMAPS). Experimental results revealed that the features derived from the pre-trained models outperformed the three baseline feature sets. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features. In particular, compared to the best-performing baseline feature set in the detection problem (openSMILE), the HuBERT features showed absolute accuracy improvements ranging from 1.33% (the SVM classifier, the TORGO database) to 2.86% (the SVM classifier, the UA-Speech database). In the severity level classification problem, the HuBERT features showed absolute accuracy improvements ranging from 6.54% (the SVM classifier, the TORGO database) to 10.46% (the SVM classifier, the UA-Speech database) compared to the best-performing baseline feature set (eGeMAPS).
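The pipeline described above (pre-trained model as a frozen feature extractor, followed by a classical classifier) can be illustrated with a minimal sketch. This is not the authors' exact setup; it assumes the publicly available facebook/hubert-base-ls960 checkpoint, mean-pooling of the hidden states into one utterance-level vector, and placeholder file names and labels standing in for a labelled corpus such as UA-Speech or TORGO.

```python
# Minimal sketch: HuBERT features + SVM for dysarthria detection (assumptions noted above).
import numpy as np
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed checkpoint; the paper also studies wav2vec2-BASE and wav2vec2-LARGE.
extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def utterance_embedding(wav_path: str) -> np.ndarray:
    """Load one utterance, resample to 16 kHz, and mean-pool HuBERT hidden states."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # shape: (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()        # one vector per utterance

# Placeholder paths and labels (0 = control, 1 = dysarthric); not real data.
wav_files = ["speaker1_utt1.wav", "speaker2_utt1.wav"]
labels = [0, 1]

X = np.stack([utterance_embedding(f) for f in wav_files])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
```

For the severity level classification task the same sketch applies with multi-class labels (e.g., severity grades instead of a binary control/dysarthric label); replacing the SVM with a CNN trained on the frame-level features corresponds to the second classifier studied in the paper.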

Updated: 2024-02-14