Time domain speech enhancement with CNN and time-attention transformer
Digital Signal Processing (IF 2.9), Pub Date: 2024-02-02, DOI: 10.1016/j.dsp.2024.104408
Nasir Saleem, Teddy Surya Gunawan, Sami Dhahbi, Sami Bourouis

Speech enhancement in the time domain improves the quality and intelligibility of noisy speech by processing the waveform directly, without explicit feature extraction or domain transformation. Deep learning is a powerful approach for time domain speech enhancement, offering significant improvements over traditional techniques. Formulating a resource-efficient deep neural model in the time domain that preserves the contextual information and fine-grained features of the input speech remains a key challenge. To address this challenge, this study proposes a speech enhancement model using 1D time-domain dilated residual blocks in a convolutional encoder-decoder framework. Further, this study integrates a time-attention transformer (TAT) bottleneck between the encoder and decoder. The TAT model extends the transformer architecture with a time-attention mechanism, which enables the model to selectively attend to different segments of the speech signal over time. This allows the model to effectively capture long-term dependencies in the speech and learn to recognize important features. The experimental results indicate that the proposed speech enhancement model outperforms recent deep neural networks (DNNs) and substantially improves the intelligibility and quality of noisy speech. On the WSJ0 SI-84 database, the proposed SE improves STOI by 21.51% and PESQ by 1.14 over noisy speech.
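The following is a minimal sketch, not the authors' implementation, of the architecture the abstract describes: a 1D convolutional encoder-decoder built from dilated residual blocks, with a transformer bottleneck that attends over the time axis of the latent sequence. Layer widths, kernel sizes, dilation rates, and the number of blocks are illustrative assumptions; the paper's actual hyperparameters may differ.

```python
# Illustrative sketch of a time-domain CNN encoder-decoder with a
# time-attention transformer bottleneck (hyperparameters are assumptions).
import torch
import torch.nn as nn


class DilatedResidualBlock1D(nn.Module):
    """1D dilated residual block operating on the encoded waveform features."""

    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keep time length fixed
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=padding, dilation=dilation),
            nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size,
                      padding=padding, dilation=dilation),
        )
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(x + self.conv(x))  # residual connection


class TimeAttentionBottleneck(nn.Module):
    """Transformer encoder whose self-attention runs along the time frames."""

    def __init__(self, channels: int, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=n_heads,
            dim_feedforward=4 * channels, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                # x: (batch, channels, time)
        x = x.transpose(1, 2)            # -> (batch, time, channels)
        x = self.encoder(x)              # attend over time frames
        return x.transpose(1, 2)         # -> (batch, channels, time)


class TimeDomainSE(nn.Module):
    """Waveform-in / waveform-out enhancement model (illustrative only)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=8, padding=4),
            nn.PReLU(),
            *[DilatedResidualBlock1D(channels, d) for d in (1, 2, 4, 8)],
        )
        self.bottleneck = TimeAttentionBottleneck(channels)
        self.decoder = nn.Sequential(
            *[DilatedResidualBlock1D(channels, d) for d in (8, 4, 2, 1)],
            nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8, padding=4),
        )

    def forward(self, noisy):            # noisy: (batch, 1, samples)
        z = self.encoder(noisy)          # downsampled latent sequence
        z = self.bottleneck(z)           # time attention over latent frames
        return self.decoder(z)           # enhanced waveform


if __name__ == "__main__":
    model = TimeDomainSE()
    noisy = torch.randn(2, 1, 16000)     # two 1-second clips at 16 kHz
    enhanced = model(noisy)
    print(enhanced.shape)                # torch.Size([2, 1, 16000])
```

In this sketch, the strided Conv1d/ConvTranspose1d pair handles downsampling and waveform reconstruction, the dilated residual stacks widen the receptive field cheaply, and the bottleneck's self-attention over latent time frames stands in for the time-attention mechanism described in the abstract.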
