Emotion Selectable End-to-End Text-based Speech Editing
Artificial Intelligence (IF 14.4), Pub Date: 2024-01-23, DOI: 10.1016/j.artint.2024.104076
Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen, Chu Yuan Zhang

Text-based speech editing lets users edit speech intuitively by cutting, copying, and pasting text. Previous work introduced CampNet, a context-aware mask prediction network that significantly improved the quality of edited speech. This paper proposes a new task: adding emotional effects to the edited speech during text-based speech editing, to enhance its expressiveness and controllability. To this end, we introduce Emo-CampNet, which lets users select emotional attributes for the generated speech and can edit the speech of unseen speakers. First, the proposed end-to-end model controls the emotion of the generated speech by introducing additional emotion attributes on top of the context-aware mask prediction network. Second, to prevent emotional interference from the original speech, a neutral content generator is proposed to remove the emotional components; it is optimized within a generative adversarial framework. Third, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set. Experimental results show that Emo-CampNet effectively controls the emotion of the generated speech and can edit the speech of unseen speakers. Ablation experiments further validate the effectiveness of emotional selectivity and of the data augmentation methods.
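As a rough illustration of the idea of masking the edited region and conditioning on a user-selected emotion, the sketch below prepares a toy model input: acoustic frames inside the edit region are replaced by a mask value, and an emotion vector is appended to every frame. This is not the Emo-CampNet architecture described in the paper. The emotion label set, the one-hot encoding, the concatenation scheme, and all function names are assumptions made purely for illustration.

```python
from typing import List, Sequence

# Hypothetical emotion label set; the abstract does not list the labels used.
EMOTIONS = ("neutral", "happy", "sad", "angry")


def mask_region(frames: Sequence[Sequence[float]], start: int, end: int,
                mask_value: float = 0.0) -> List[List[float]]:
    """Return a copy of `frames` with frames[start:end] replaced by mask frames."""
    if not 0 <= start <= end <= len(frames):
        raise ValueError("edit region out of range")
    dim = len(frames[0]) if frames else 0
    masked = [list(frame) for frame in frames]  # copy, leave the input untouched
    for i in range(start, end):
        masked[i] = [mask_value] * dim
    return masked


def one_hot_emotion(name: str) -> List[float]:
    """Encode the selected emotion as a one-hot vector (illustrative choice)."""
    if name not in EMOTIONS:
        raise ValueError(f"unknown emotion: {name}")
    return [1.0 if label == name else 0.0 for label in EMOTIONS]


def build_model_input(frames: Sequence[Sequence[float]], start: int, end: int,
                      emotion: str) -> List[List[float]]:
    """Mask the edit region and append the emotion vector to every frame."""
    emotion_vector = one_hot_emotion(emotion)
    return [frame + emotion_vector for frame in mask_region(frames, start, end)]


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 frames, 2 features each
    for row in build_model_input(toy_frames, 1, 2, "happy"):
        print(row)
```

In a real system, a trained network would predict the masked frames from the surrounding context, the edited text, and the emotion attribute; the sketch only shows how such an input could be assembled.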

Updated: 2024-01-24
