Bigger is not Always Better: The Effect of Context Size on Speech Pre-Training
arXiv - CS - Sound. Pub Date: 2023-12-03, DOI: arxiv-2312.01515
Sean Robertson, Ewan Dunbar

It has been generally assumed in the automatic speech recognition (ASR) literature that it is better for models to have access to wider context windows. Yet, many of the potential reasons this might be true in the supervised setting do not necessarily transfer over to the case of unsupervised learning. We investigate how much context is necessary to achieve high-quality pre-trained acoustic models using self-supervised learning. We principally investigate contrastive predictive coding (CPC), which we adapt to be able to precisely control the amount of context visible to the model during training and inference. We find that phone discriminability in the resulting model representations peaks at around 40 ms of preceding context, and that having too much context (beyond around 320 ms) substantially degrades the quality of the representations. Surprisingly, we find that this pattern also transfers to supervised ASR when the pre-trained representations are used as frozen input features. Our results point to potential changes in the design of current upstream architectures to better facilitate a variety of downstream tasks.
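The abstract describes adapting CPC so that the amount of past context visible to the model can be bounded exactly. The sketch below is only an illustration of that idea, not the authors' implementation: it collapses CPC's encoder and autoregressive context network into a single causal convolution stack whose receptive field fixes how many past samples each frame can see, paired with a standard InfoNCE objective. All names (LimitedContextEncoder, cpc_infonce_loss) and hyperparameters are assumptions.

```python
# Illustrative sketch of a context-limited CPC-style setup (assumed names and shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LimitedContextEncoder(nn.Module):
    """Causal 1-D conv stack whose receptive field caps the visible past context."""

    def __init__(self, dim: int = 256, kernel_size: int = 4, layers: int = 2):
        super().__init__()
        convs, in_dim = [], 1
        for _ in range(layers):
            convs.append(nn.Conv1d(in_dim, dim, kernel_size))
            in_dim = dim
        self.convs = nn.ModuleList(convs)
        self.kernel_size = kernel_size

    @property
    def receptive_field(self) -> int:
        # Samples each output frame can see (stride 1, causal left-padding);
        # divide by the sample rate to express the context budget in ms.
        return 1 + len(self.convs) * (self.kernel_size - 1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> representations: (batch, frames, dim)
        x = wav.unsqueeze(1)
        for conv in self.convs:
            # Left-pad so each output frame depends only on preceding samples.
            x = F.relu(conv(F.pad(x, (self.kernel_size - 1, 0))))
        return x.transpose(1, 2)


def cpc_infonce_loss(context: torch.Tensor, targets: torch.Tensor, k: int = 1) -> torch.Tensor:
    """InfoNCE: predict the frame k steps ahead against the other frames as negatives."""
    c = context[:, :-k, :]                       # context at time t
    z = targets[:, k:, :]                        # true frame at time t + k
    logits = torch.einsum("btd,bsd->bts", c, z)  # score every candidate future frame
    labels = torch.arange(c.size(1), device=c.device).expand(c.size(0), -1)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```

In a sweep like the one the abstract reports, the kernel size and number of layers would be the knobs that move the visible past context between settings such as 40 ms and 320 ms; the loss itself is unchanged across context sizes.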

Updated: 2023-12-06