


Coarse-to-fine speech separation method in the time-frequency domain
Speech Communication (IF 3.2), Pub Date: 2023-11-04, DOI: 10.1016/j.specom.2023.103003
Xue Yang, Changchun Bao, Xianhong Chen

Although time-domain speech separation methods have exhibited outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in reverberant scenarios. Compared with time-domain methods, speech separation methods in the time-frequency (T-F) domain mainly concern structured T-F representations and have recently shown great potential. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain, which involves two steps: 1) a rough separation conducted in the coarse phase and 2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, resulting in some level of distortion in the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide to extract the residual T-F representation for the corresponding speaker, which helps to reduce the distortions introduced in the coarse phase. In addition, the specially designed networks used for the coarse and refining phases are jointly trained for superior performance. Furthermore, by using the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model demonstrates competitive performance on clean datasets with a small number of parameters. Finally, the proposed method is more robust and achieves state-of-the-art results on more realistic datasets.
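The two-phase data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' model. The abstract gives no network details, so every name here (`coarse_phase`, `refining_phase`, `toy_residual`, `joint_loss`, `coarse_weight`) is hypothetical. The sketch assumes fixed masks in place of the coarse network and a hand-written sharpening heuristic in place of the refining network, and it omits the RAPB block entirely. It also assumes one possible reading of "residual", namely a correction that is added back to each coarse estimate.

```python
from typing import Callable, List

# Toy stand-in for a T-F representation: a flat list of complex STFT bins.
Spec = List[complex]


def coarse_phase(mixture: Spec, masks: List[List[float]]) -> List[Spec]:
    """Coarse phase stand-in: apply one real-valued mask per speaker."""
    return [[m * x for m, x in zip(mask, mixture)] for mask in masks]


def refining_phase(
    mixture: Spec,
    coarse: List[Spec],
    residual_fn: Callable[[Spec, Spec], Spec],
) -> List[Spec]:
    """Refining phase stand-in: each coarse estimate guides a residual
    prediction, which is added back to that same estimate."""
    refined = []
    for estimate in coarse:
        residual = residual_fn(mixture, estimate)
        refined.append([e + r for e, r in zip(estimate, residual)])
    return refined


def toy_residual(mixture: Spec, estimate: Spec) -> Spec:
    """Hypothetical heuristic, NOT the paper's network: sharpen the mask
    implied by the coarse estimate, then return the resulting correction."""
    residual = []
    for x, e in zip(mixture, estimate):
        ratio = min(abs(e) / abs(x), 1.0) if abs(x) > 0 else 0.0
        sharpened = ratio**2 / (ratio**2 + (1.0 - ratio) ** 2)
        residual.append(sharpened * x - e)
    return residual


def mse(estimate: Spec, reference: Spec) -> float:
    """Mean squared error over complex bins."""
    return sum(abs(e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


def joint_loss(coarse, refined, references, coarse_weight: float = 0.5) -> float:
    """Stand-in for joint training: both phases feed a single objective.
    The weighting is an assumption, not taken from the paper."""
    coarse_term = sum(mse(c, r) for c, r in zip(coarse, references))
    refined_term = sum(mse(f, r) for f, r in zip(refined, references))
    return coarse_weight * coarse_term + refined_term


# Two toy speakers whose spectra sum to the mixture.
references = [[1 + 0j, 0j, 2 + 0j], [0j, 3 + 0j, 1 + 0j]]
mixture = [a + b for a, b in zip(*references)]
masks = [[0.8, 0.2, 0.5], [0.2, 0.8, 0.5]]  # deliberately imperfect

coarse = coarse_phase(mixture, masks)
refined = refining_phase(mixture, coarse, toy_residual)
print("joint loss:", joint_loss(coarse, refined, references))
```

In a trained system both stand-ins would be neural networks, and the gradient of the single joint objective would flow through the refining stage back into the coarse stage. That coupling is the aspect of joint training the sketch is meant to show.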


Last updated: 2023-11-04
