Digits micro-model for accurate and secure transactions
arXiv - CS - Sound, Pub Date: 2024-02-02, DOI: arxiv-2402.01931
Chirag Chhablani, Nikhita Sharma, Jordan Hosier, Vijay K. Gurbani

Automatic Speech Recognition (ASR) systems are used in the financial domain to enhance the caller experience by enabling natural language understanding and facilitating efficient and intuitive interactions. The increasing use of ASR systems requires that they exhibit very low error rates. The predominant ASR models used to collect numeric data are large, general-purpose models, either commercial (Google Speech-to-Text (STT) or Amazon Transcribe) or open source (OpenAI's Whisper). Such ASR models are trained on hundreds of thousands of hours of audio data and require considerable resources to run. Despite recent progress in large speech recognition models, we highlight the potential of smaller, specialized "micro" models. Such lightweight models can be trained to perform well on number-recognition tasks, competing with general models like Whisper or Google STT while using less than 80 minutes of training time and occupying at least an order of magnitude less memory. Also, unlike larger speech recognition models, micro-models are trained on carefully selected and curated datasets, which makes them highly accurate, agile, and easy to retrain while using few compute resources. We present our work on creating micro-models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. Our work contributes to domain-specific ASR models, improving digit recognition accuracy and data privacy. As an added advantage, their low resource consumption allows them to be hosted on-premises, keeping private data local instead of uploading it to an external cloud. Our results indicate that our micro-model makes fewer errors than best-of-breed commercial or open-source ASRs in recognizing digits (a 1.8% error rate for our best micro-model versus 5.8% for Whisper) and has a low memory footprint (0.66 GB of VRAM for our model versus 11 GB for Whisper).
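The two headline comparisons in the abstract can be checked with simple arithmetic. The sketch below is not from the paper; it only takes the four figures quoted above (error rates of 1.8% and 5.8%, VRAM of 0.66 GB and 11 GB) and derives the relative error reduction and the memory ratio. The helper names `relative_reduction` and `ratio` are illustrative assumptions.

```python
# Figures quoted in the abstract; nothing else here comes from the paper.
micro_error, whisper_error = 1.8, 5.8    # digit error rate, in percent
micro_vram, whisper_vram = 0.66, 11.0    # VRAM footprint, in GB


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline` (hypothetical helper)."""
    return (baseline - improved) / baseline


def ratio(baseline: float, improved: float) -> float:
    """How many times larger `baseline` is than `improved` (hypothetical helper)."""
    return baseline / improved


error_cut = relative_reduction(whisper_error, micro_error)
memory_ratio = ratio(whisper_vram, micro_vram)

print(f"relative error reduction vs Whisper: {error_cut:.1%}")
print(f"Whisper VRAM / micro-model VRAM: {memory_ratio:.1f}x")
```

Under these figures the micro-model's error rate is roughly 69% lower than Whisper's in relative terms, and Whisper needs about 16.7 times as much VRAM, which is consistent with the "at least an order of magnitude" claim.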

Updated: 2024-02-06