Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
arXiv - CS - Sound. Pub Date: 2023-12-06, DOI: arXiv-2312.03632
Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal-error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
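The fusion strategy described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation; all dimensions, names, and the zero-initialized LoRA delta are assumptions for illustration. The idea shown: audio-encoder features are projected into the LLM embedding space and prepended, together with trainable prefix embeddings, to the token embeddings of the ASR 1-best hypothesis, while the LLM backbone weight stays frozen and only a low-rank (LoRA) delta is trained on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
D_AUDIO, D_LLM, N_PREFIX, RANK = 64, 128, 4, 8

# Frozen LLM weight (stand-in for one backbone matrix) -- never updated.
W_frozen = rng.standard_normal((D_LLM, D_LLM)) * 0.02

# Trainable parameters: prefix-tuning embeddings and a LoRA decomposition.
prefix_emb = rng.standard_normal((N_PREFIX, D_LLM)) * 0.02
lora_A = rng.standard_normal((D_LLM, RANK)) * 0.02
lora_B = np.zeros((RANK, D_LLM))  # zero init: adapted weight starts equal to W_frozen

# Trainable projection from audio-encoder space into the LLM embedding space.
W_audio_proj = rng.standard_normal((D_AUDIO, D_LLM)) * 0.02

def forward(audio_feats, token_embs):
    """audio_feats: (T_a, D_AUDIO) acoustic representations from the audio encoder.
    token_embs: (T_t, D_LLM) embeddings of the ASR 1-best hypothesis tokens.
    Returns a pooled representation a classifier head would score for
    device-directedness."""
    acoustic_prefix = audio_feats @ W_audio_proj
    # Prefix tuning + multimodal fusion: [learned prefix; audio; text tokens].
    x = np.concatenate([prefix_emb, acoustic_prefix, token_embs], axis=0)
    # LoRA: frozen weight plus trainable low-rank update.
    W = W_frozen + lora_A @ lora_B
    h = x @ W
    return h.mean(axis=0)

audio = rng.standard_normal((10, D_AUDIO))
tokens = rng.standard_normal((5, D_LLM))
pooled = forward(audio, tokens)
print(pooled.shape)
```

Only `prefix_emb`, `lora_A`, `lora_B`, and `W_audio_proj` would receive gradients in training, which is what keeps the approach feasible with a single frozen on-device LLM and small training sets.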

Updated: 2023-12-07