A deep learning approach to dysarthric utterance classification with BiLSTM-GRU, speech cue filtering, and log mel spectrograms
The Journal of Supercomputing (IF 3.3) Pub Date: 2024-03-20, DOI: 10.1007/s11227-024-06015-x
Sunakshi Mehra, Virender Ranga, Ritu Agarwal

Abstract

Assessing the intelligibility of dysarthric speech, which is characterized by intricate speaking rhythms, presents formidable challenges. Current techniques for testing speech intelligibility are burdensome and subjective, and they struggle in particular with dysarthric spoken utterances. To tackle these hurdles, our method conducts an ablation analysis across speakers with speech impairment. We adopt a unified approach that combines auditory and visual elements to improve the classification of dysarthric spoken utterances, employing two distinct transformer-based extraction stages. First, we apply SepFormer to refine the speech signal, prioritizing signal clarity. The enhanced audio samples are then converted into log mel spectrograms and fed into a Swin transformer, exploiting its strength in visual classification after pre-training on roughly 14 million annotated images from ImageNet. The pre-trained scores from the Swin transformer serve as input to a deep bidirectional long short-term memory with gated recurrent unit (deep BiLSTM-GRU) model, which classifies the spoken utterances. The proposed deep BiLSTM-GRU model yields impressive results on the EasyCall speech corpus, encompassing sets of 10 to 20 spoken utterances delivered by both healthy individuals and speakers with dysarthria. Notably, on the 20-utterance task our approach reaches an accuracy of 98.56% for male speakers, 95.11% for female speakers, and 97.64% for male and female speakers combined. Across diverse scenarios, the approach consistently achieves high accuracy, surpassing other contemporary methods without requiring data augmentation.
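
The abstract outlines a pipeline of SepFormer enhancement, log mel spectrogram conversion, Swin transformer feature scoring, and a BiLSTM-GRU classifier. The sketch below illustrates two of those stages under stated assumptions: log mel extraction via torchaudio and a deep BiLSTM-GRU head over Swin feature vectors. The sample rate, mel parameters, feature dimension, and layer sizes are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of two stages of the described pipeline (assumed parameters).
import torch
import torch.nn as nn
import torchaudio

# Convert (optionally SepFormer-enhanced) 16 kHz audio into a log mel spectrogram.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=128
)
to_db = torchaudio.transforms.AmplitudeToDB()

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, samples) -> log mel spectrogram (1, n_mels, frames)."""
    return to_db(mel(waveform))

# Deep BiLSTM-GRU head that classifies a sequence of Swin feature/score vectors.
class BiLSTMGRUClassifier(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, num_classes=20):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, swin_feats):            # (batch, seq_len, feat_dim)
        x, _ = self.bilstm(swin_feats)        # (batch, seq_len, 2 * hidden)
        x, _ = self.gru(x)                    # (batch, seq_len, hidden)
        return self.fc(x[:, -1])              # logits over utterance classes

# Example: a 2-second clip and a dummy sequence of Swin features.
wav = torch.randn(1, 32000)
spec = log_mel(wav)                           # would be passed to the Swin transformer
logits = BiLSTMGRUClassifier()(torch.randn(1, 8, 1024))
```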




Updated: 2024-03-20