Multi-layer encoder–decoder time-domain single channel speech separation, Pattern Recognition Letters

Multi-layer encoder–decoder time-domain single channel speech separation
Pattern Recognition Letters (IF 5.1), Pub Date: 2024-03-27, DOI: 10.1016/j.patrec.2024.03.020
Debang Liu, Tianqi Zhang, Mads Græsbøll Christensen, Chen Yi, Ying Wei

With the emergence of more advanced separation networks, time-domain speech separation methods have made significant progress. These methods typically use a temporal encoder–decoder structure to encode speech feature sequences and thereby accomplish the separation task. However, the traditional encoder–decoder structure has a limitation: separation performance decreases sharply when the encoded sequence is short, and although performance improves when the encoded sequence is sufficiently long, the longer sequence increases computational complexity and training cost. This paper therefore compresses and reconstructs the speech feature sequence through a multi-layer convolution structure and proposes a multi-layer encoder–decoder time-domain speech separation model (MLED). In this model, our encoder–decoder structure can compress the speech sequence to a short length without degrading separation performance. Combined with our multi-scale temporal attention (MSTA) separation network, MLED achieves efficient and precise separation of short encoded sequences. Our experiments show that, compared with previous advanced time-domain separation methods, MLED achieves competitive separation performance with a smaller model size, lower computational complexity, and lower training cost.
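The trade-off between encoded sequence length and cost can be illustrated with a little arithmetic. The sketch below is not the MLED architecture: the kernel sizes, strides, layer count, and sampling rate are all hypothetical values chosen for illustration, and only the standard output-length formula for an unpadded 1-D convolution is assumed. It shows how stacking strided convolution layers shortens the sequence that a separation network would have to process.

```python
def conv1d_output_length(length, kernel_size, stride):
    """Number of frames an unpadded 1-D convolution produces from `length` inputs."""
    if length < kernel_size:
        return 0
    return (length - kernel_size) // stride + 1


def encoded_length(num_samples, layers):
    """Pass the length through each (kernel_size, stride) layer in turn."""
    length = num_samples
    for kernel_size, stride in layers:
        length = conv1d_output_length(length, kernel_size, stride)
    return length


# All values below are hypothetical; none are taken from the paper.
num_samples = 32000                              # 4 s of audio at an assumed 8 kHz
single_layer = [(2, 1)]                          # one short-window encoder layer
multi_layer = [(2, 1), (4, 2), (4, 2), (4, 2)]   # extra strided layers compress further

single_len = encoded_length(num_samples, single_layer)
multi_len = encoded_length(num_samples, multi_layer)

print("single-layer encoded length:", single_len)
print("multi-layer encoded length:", multi_len)
```

Under these assumed settings the stacked encoder hands the separator a sequence roughly eight times shorter, which is the kind of reduction the abstract associates with lower computational complexity and training cost.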

Updated: 2024-03-27
