Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking
arXiv - CS - Sound · Pub Date: 2023-12-04 · DOI: arxiv-2312.01842
Jihyun Lee, Yejin Jeon, Wonjun Lee, Yunsu Kim, Gary Geunbae Lee

Dialogue state tracking (DST) plays a crucial role in extracting information in task-oriented dialogue systems. However, preceding research is limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating the use of synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them on our synthetic audio dataset, and test them on actual human speech data. To facilitate evaluation tailored to audio modalities, we introduce a novel metric, PhonemeF1, to capture pronunciation similarity. Experimental results show that models trained solely on synthetic datasets generalize to human voice data. By eliminating the dependency on human speech data collection, these insights pave the way for significant practical advancements in audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.
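The abstract does not spell out how PhonemeF1 is computed. A plausible reading is an F1 score over the phoneme-level overlap between a predicted slot value and the gold value, so that near-homophones produced by ASR (e.g., "there" vs. "their") are not penalized as harshly as exact string matching would penalize them. The sketch below implements that assumed definition; the use of the g2p_en package for grapheme-to-phoneme conversion is our choice for illustration, not something stated in the paper.

```python
# Minimal sketch of a PhonemeF1-style metric, ASSUMING it is an F1 over
# the multiset overlap of phonemes between predicted and gold strings.
# The paper's exact formulation may differ. Requires: pip install g2p-en
from collections import Counter

from g2p_en import G2p

g2p = G2p()


def phoneme_f1(pred: str, gold: str) -> float:
    """F1 over the multiset phoneme overlap between two strings."""
    # g2p returns a phoneme list with ' ' tokens between words; drop those.
    pred_ph = Counter(p for p in g2p(pred) if p.strip())
    gold_ph = Counter(p for p in g2p(gold) if p.strip())
    overlap = sum((pred_ph & gold_ph).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_ph.values())
    recall = overlap / sum(gold_ph.values())
    return 2 * precision * recall / (precision + recall)


# An ASR-style spelling error still scores highly, since the
# pronunciations match even though the surface strings differ.
print(phoneme_f1("their house", "there house"))  # 1.0
print(phoneme_f1("cheap hotel", "chip hotel"))   # high, but < 1.0
```

Under this assumed definition, a cascading model whose ASR front end garbles a slot value phonetically close to the truth is scored more fairly than under exact-match joint goal accuracy, which is presumably the motivation for an audio-tailored metric.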

Updated: 2023-12-05