Transformer Attractors for Robust and Efficient End-to-End Neural Diarization
arXiv - CS - Sound. Pub Date: 2023-12-11, DOI: arxiv-2312.06253
Lahiru Samarakoon, Samuel J. Broughton, Marc Härkönen, Ivan Fung

End-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) is a method that performs diarization in a single neural network. EDA handles diarization for a flexible number of speakers by using an LSTM-based encoder-decoder that generates a set of speaker-wise attractors in an autoregressive manner. In this paper, we propose to replace EDA with a transformer-based attractor calculation (TA) module. TA is composed of a Combiner block and a Transformer decoder. The main function of the Combiner block is to generate conversation-dependent (CD) embeddings by incorporating learned conversational information into a global set of embeddings. These CD embeddings then serve as the input to the Transformer decoder. Results on public datasets show that EEND-TA achieves a 2.68% absolute DER improvement over EEND-EDA, and EEND-TA inference is 1.28 times faster than that of EEND-EDA.
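To make the described architecture concrete, below is a minimal PyTorch sketch of how such a TA module could be wired together. It is an illustration based only on this abstract: the class name TransformerAttractor, the Combiner realized as a linear fusion layer, the mean-pooled conversational summary, and the single-pass (non-autoregressive) decoder are all assumptions, not the paper's exact design.

# Hypothetical sketch of a transformer-based attractor (TA) module as
# outlined in the abstract. Names and the Combiner design are illustrative
# assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class TransformerAttractor(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 2, max_speakers: int = 4):
        super().__init__()
        # Global set of learned embeddings, shared across conversations.
        self.global_embeddings = nn.Parameter(
            torch.randn(max_speakers, d_model))
        # Combiner (assumed form): fuses conversation-level information
        # into the global embeddings to form conversation-dependent (CD)
        # embeddings.
        self.combiner = nn.Linear(2 * d_model, d_model)
        # Transformer decoder that turns CD embeddings into attractors.
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, frame_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (batch, time, d_model) from the EEND encoder.
        b = frame_emb.size(0)
        # Conversational summary via mean pooling over time (an assumption;
        # the abstract only says "learned conversational information").
        conv = frame_emb.mean(dim=1, keepdim=True)            # (b, 1, d)
        glob = self.global_embeddings.unsqueeze(0).expand(b, -1, -1)
        cd = self.combiner(
            torch.cat([glob, conv.expand_as(glob)], dim=-1))  # (b, S, d)
        # CD embeddings query the frame sequence via cross-attention,
        # producing all speaker-wise attractors in one decoder pass.
        return self.decoder(tgt=cd, memory=frame_emb)         # (b, S, d)

# Example: attractors = TransformerAttractor()(torch.randn(1, 500, 256))
# yields a (1, 4, 256) tensor of speaker-wise attractors.

In EEND-style systems, such attractors are typically compared against each frame embedding (e.g., a sigmoid over dot products) to obtain per-speaker activity posteriors. Producing all attractors in a single decoder pass, rather than autoregressively as in the LSTM-based EDA, would be consistent with the 1.28x inference speedup reported above.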

Updated: 2023-12-15