Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks
arXiv - CS - Sound. Pub Date: 2024-03-26, DOI: arXiv:2403.17378
Yang Ai, Zhen-Hua Ling

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses, defined between the predicted wrapped phase spectra and the natural ones, by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions with a knowledge distillation training strategy. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural-network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows an outstanding efficiency advantage while ensuring the quality of the synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra alone via neural networks.
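The two central ideas in the abstract can be sketched in a few lines. The phase calculation formula takes the outputs of the two parallel linear convolutional layers as pseudo real and imaginary components and maps them through a quadrant-aware arctangent, which by construction restricts predictions to the principal value interval (-π, π]. The anti-wrapping function used to activate the phase errors must be even, 2π-periodic and monotonic on [0, π]; the minimal sketch below uses one simple function with those three properties. Function names and the element-wise list formulation are illustrative assumptions, not the paper's implementation (which operates on full spectrograms in a neural framework).

```python
import math

def phase_from_components(pseudo_real, pseudo_imag):
    """Phase calculation formula of the parallel estimation architecture
    (element-wise sketch): atan2 of the pseudo imaginary and real parts
    keeps every predicted phase in the principal interval (-pi, pi]."""
    return [math.atan2(i, r) for r, i in zip(pseudo_real, pseudo_imag)]

def anti_wrapping(x):
    """One simple anti-wrapping activation (an illustrative choice):
    wrap x to the principal interval, then take the absolute value.
    Parity:       anti_wrapping(-x) == anti_wrapping(x)
    Periodicity:  anti_wrapping(x + 2*pi) == anti_wrapping(x)
    Monotonicity: non-decreasing on [0, pi]."""
    return abs(x - 2 * math.pi * round(x / (2 * math.pi)))
```

With this activation, a raw phase error of 2π + 0.1 rad (which naive L1 loss would penalize heavily) is mapped to 0.1 rad, avoiding the error expansion caused by wrapping.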

Updated: 2024-03-28