Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling, arXiv - CS - Sound

Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling
arXiv - CS - Sound, Pub Date: 2023-12-19, DOI: arxiv-2312.11947
Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li

Conversational Speech Synthesis (CSS) aims to express an utterance with appropriate prosody and emotional inflection within a conversational setting. Although the importance of the CSS task is widely recognised, prior studies have not thoroughly investigated emotional expressiveness, owing to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn emotion cues from it; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the target utterance. To address data scarcity, we create emotional labels in terms of category and intensity, and annotate this additional emotional information on an existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. These evaluations also underscore the importance of comprehensive emotional annotations. Code and audio samples can be found at: https://github.com/walker-hyf/ECSS.
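To make the idea of a heterogeneous graph over multi-source dialogue history more concrete, the following is a minimal, dependency-free sketch, not the ECSS implementation. The abstract does not specify the node or edge types, so the five node types below (text, audio, speaker, emotion, intensity), the edge layout, and the names `HeteroGraph` and `build_dialogue_graph` are all assumptions made purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical node types; the abstract only says the history is "multi-source".
NODE_TYPES = ("text", "audio", "speaker", "emotion", "intensity")


@dataclass
class HeteroGraph:
    """A typed graph: every node has a type, edges are grouped by type pair."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, src, dst):
        # Group each edge under its (source type, destination type) relation.
        relation = (self.nodes[src], self.nodes[dst])
        self.edges[relation].append((src, dst))


def build_dialogue_graph(history):
    """Build a graph from a list of turns (assumed layout, for illustration).

    Each turn is a dict with the keys 'speaker', 'emotion' and 'intensity'.
    Text, audio, emotion and intensity nodes are created per turn; speaker
    nodes are shared across all turns by the same speaker.
    """
    graph = HeteroGraph()
    previous_text = None
    for index, turn in enumerate(history):
        text_id = f"text:{index}"
        graph.add_node(text_id, "text")

        per_turn = {
            "audio": f"audio:{index}",
            "emotion": f"emotion:{index}",
            "intensity": f"intensity:{index}",
            "speaker": f"speaker:{turn['speaker']}",  # shared across turns
        }
        for node_type, node_id in per_turn.items():
            graph.add_node(node_id, node_type)
            graph.add_edge(text_id, node_id)

        # Link consecutive utterances so that context can flow along the dialogue.
        if previous_text is not None:
            graph.add_edge(previous_text, text_id)
        previous_text = text_id
    return graph


history = [
    {"speaker": "A", "emotion": "happy", "intensity": "strong"},
    {"speaker": "B", "emotion": "neutral", "intensity": "weak"},
    {"speaker": "A", "emotion": "sad", "intensity": "medium"},
]
graph = build_dialogue_graph(history)
```

In a real system each node would carry a feature vector (for example a text or audio embedding), and a graph neural network would pass messages along the typed edges; this sketch only shows the bookkeeping that makes the graph heterogeneous.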

Updated: 2023-12-21
