-
Deep functional multiple index models with an application to SER arXiv.cs.SD Pub Date : 2024-03-26 Matthieu Saumard, Abir El Haj, Thibault Napoleon
Speech Emotion Recognition (SER) plays a crucial role in advancing human-computer interaction and speech processing capabilities. We introduce a novel deep-learning architecture designed specifically for the functional data model known as the multiple-index functional model. Our key innovation lies in integrating adaptive basis layers and an automated data transformation search within the deep learning
-
Detection of Deepfake Environmental Audio arXiv.cs.SD Pub Date : 2024-03-26 Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller
With the ever-rising quality of deep generative models, it is increasingly important to be able to discern whether the audio data at hand have been recorded or synthesized. Although the detection of fake speech signals has been studied extensively, this is not the case for the detection of fake environmental audio. We propose a simple and efficient pipeline for detecting fake environmental sounds based
-
Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependent arXiv.cs.SD Pub Date : 2024-03-26 Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M. Heller, Keisuke Imoto, Yuki Okamoto
This paper explores whether considering alternative domain-specific embeddings to calculate the Fréchet Audio Distance (FAD) metric can help the FAD to correlate better with perceptual ratings of environmental sounds. We used embeddings from VGGish, PANNs, MS-CLAP, L-CLAP, and MERT, which are tailored for either music or environmental sound evaluation. The FAD scores were calculated for sounds from
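For context, the FAD itself is embedding-agnostic: a Gaussian is fitted to each set of embeddings, and the Fréchet distance between the two Gaussians is reported. A minimal sketch (array names are illustrative):

```python
# Minimal sketch of the Frechet Audio Distance between two embedding sets,
# assuming `real_emb` and `gen_emb` are (N, D) arrays from any embedding
# model (VGGish, PANNs, CLAP, ...).
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    # Frechet distance between the two Gaussians fitted to the embeddings.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```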
-
Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks arXiv.cs.SD Pub Date : 2024-03-26 Yang Ai, Zhen-Hua Ling
This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers
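As a rough illustration of the parallel-estimation idea, two convolutional branches can output pseudo real and imaginary components whose two-argument arctangent yields a wrapped phase, with losses computed modulo 2π so that phase wrapping is not penalized. Layer sizes below are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ParallelPhaseEstimator(nn.Module):
    """Two parallel conv layers -> pseudo real/imag parts -> wrapped phase."""
    def __init__(self, channels: int = 512, n_freq: int = 513):
        super().__init__()
        self.real_conv = nn.Conv1d(channels, n_freq, kernel_size=1)
        self.imag_conv = nn.Conv1d(channels, n_freq, kernel_size=1)

    def forward(self, h):                   # h: (B, channels, T) hidden features
        r, i = self.real_conv(h), self.imag_conv(h)
        return torch.atan2(i, r)            # wrapped phase in (-pi, pi]

def anti_wrapping_loss(pred_phase, true_phase):
    # Penalize phase error modulo 2*pi, so that -pi and pi count as "close".
    diff = pred_phase - true_phase
    wrapped = diff - 2 * torch.pi * torch.round(diff / (2 * torch.pi))
    return wrapped.abs().mean()
```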
-
Theoretical Analysis of Quality of Conventional Beamforming for Phased Microphone Arrays arXiv.cs.SD Pub Date : 2024-03-24 Dheepak Khatri, Kenneth Granlund
A theoretical study is performed to analyze the directional response of different types of microphone array designs. 1-D (linear) and 2-D (planar) microphone array types are considered, and the delay and sum beamforming and conventional beamforming techniques are employed to localize the sound source. A non-dimensional parameter, G, is characterized to simplify and standardize the rejection performance
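For reference, time-domain delay-and-sum beamforming compensates each microphone's propagation delay for a hypothesized direction and averages the aligned channels. A simplified sketch with integer-sample delays (fractional delays and sign conventions are glossed over):

```python
import numpy as np

def delay_and_sum(signals, mic_positions, look_dir, fs, c=343.0):
    """Steer an array toward unit vector `look_dir` by delaying each channel
    and summing. `signals`: (M, T) array; `mic_positions`: (M, 3) in meters."""
    delays = mic_positions @ look_dir / c        # seconds, per microphone
    delays -= delays.min()                       # keep all delays non-negative
    shifts = np.round(delays * fs).astype(int)   # integer-sample approximation
    T = signals.shape[1]
    out = np.zeros(T)
    for sig, s in zip(signals, shifts):
        out[: T - s] += sig[s:]                  # advance each channel to align
    return out / len(signals)
```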
-
Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer arXiv.cs.SD Pub Date : 2024-03-26 Jeong-Yoon Kim, Seung-Ho Lee
In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using a vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in the spectrogram, and by transferring positional information between ViTs through knowledge transfer. The proposed method is original in the following ways: i) We use vertically segmented patches of the log-Mel spectrogram
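A sketch of the vertical-patch idea: unlike standard square ViT patches, each token spans the full frequency axis of the log-Mel spectrogram for a narrow time slice, so attention operates over full-band time steps. The patch width here is an arbitrary illustrative value:

```python
import torch

def vertical_patches(log_mel: torch.Tensor, patch_width: int = 4):
    """Split a (B, n_mels, T) log-Mel spectrogram into vertical patches that
    span the whole frequency axis, so each token keeps full-band information
    for its time slice."""
    B, F, T = log_mel.shape
    T_trim = (T // patch_width) * patch_width    # drop the ragged tail frames
    x = log_mel[:, :, :T_trim]
    x = x.reshape(B, F, T_trim // patch_width, patch_width)
    # tokens: (B, num_patches, F * patch_width), ready for a linear embedding
    return x.permute(0, 2, 1, 3).reshape(B, x.shape[2], -1)
```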
-
Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification arXiv.cs.SD Pub Date : 2024-03-26 Francesca Ronchini, Luca Comanducci, Fabio Antonacci
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, specifically focusing on the task of classification of environmental sounds. This study
-
Speaker Distance Estimation in Enclosures from Single-Channel Audio arXiv.cs.SD Pub Date : 2024-03-26 Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen
Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies predominantly center on employing a classification approach, where distances are discretized into distinct categories, enabling smoother model training and achieving higher accuracy but imposing restrictions on the precision of the
-
Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge arXiv.cs.SD Pub Date : 2024-03-26 Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim
The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization
-
Infrastructure-less Localization from Indoor Environmental Sounds Based on Spectral Decomposition and Spatial Likelihood Model arXiv.cs.SD Pub Date : 2024-03-26 Satoki Ogiso, Yoshiaki Bando, Takeshi Kurata, Takashi Okuma
Tracking humans and/or assets with attached sensor units helps in understanding their activities. Most common indoor localization methods for human tracking require expensive infrastructure, deployment, and maintenance. To overcome this problem, environmental sounds have been used for infrastructure-free localization. While they achieve room-level classification, they suffer from two problems:
-
Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy arXiv.cs.SD Pub Date : 2024-03-24 Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng
Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT is a useful pre-trained model for lip-reading that has not yet been adopted for AV-TSE. In this paper, we explore how to integrate
-
Exploring Green AI for Audio Deepfake Detection arXiv.cs.SD Pub Date : 2024-03-21 Subhajit Saha, Md Sahidullah, Swagatam Das
State-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance. Nonetheless, this advantage comes with a significant carbon footprint, mainly due to the use of high-performance computing with accelerators and long training times. Studies show that the average deep NLP model produces around 626k lbs of CO2, which is equivalent
-
Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization arXiv.cs.SD Pub Date : 2024-03-21 Nikhil Raghav, Md Sahidullah
Clustering speaker embeddings is crucial in speaker diarization, but it has not received as much focus as other components. Moreover, the robustness of speaker diarization across various datasets has not been explored when the development and evaluation data come from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization
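The clustering stage under study can be sketched as spectral clustering on a cosine-similarity affinity matrix built from segment-level speaker embeddings; refinement steps such as affinity thresholding, which full diarization pipelines rely on, are omitted here:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_speakers(embeddings: np.ndarray, n_speakers: int):
    """Cluster (N, D) segment-level speaker embeddings with spectral
    clustering on a cosine-similarity affinity matrix."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(X @ X.T, 0.0, 1.0)   # keep affinities non-negative
    labels = SpectralClustering(
        n_clusters=n_speakers, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return labels
```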
-
ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning arXiv.cs.SD Pub Date : 2024-03-04 Kuan-Hsun Ho, Jeih-weih Hung, Berlin Chen
Speech separation has recently made significant progress thanks to the fine-grained modeling used in time-domain methods. However, several studies have shown that adopting the Short-Time Fourier Transform (STFT) for feature extraction can be beneficial under harsher conditions, such as noise or reverberation. Therefore, we propose a magnitude-conditioned time-domain framework, ConSep, to inherit
-
Beyond Voice Assistants: Exploring Advantages and Risks of an In-Car Social Robot in Real Driving Scenarios arXiv.cs.SD Pub Date : 2024-02-19 Yuanchao Li, Lachlan Urquhart, Nihan Karatas, Shun Shao, Hiroshi Ishiguro, Xun Shen
In-car Voice Assistants (VAs) play an increasingly critical role in automotive user interface design. However, existing VAs primarily perform simple 'query-answer' tasks, limiting their ability to sustain drivers' long-term attention. In this study, we investigate the effectiveness of an in-car Robot Assistant (RA) that offers functionalities beyond voice interaction. We aim to answer the question:
-
Exploring and Applying Audio-Based Sentiment Analysis in Music arXiv.cs.SD Pub Date : 2024-02-22 Etash Jhanji
Sentiment analysis is a continuously explored area of text processing that deals with the computational analysis of opinions, sentiments, and the subjectivity of text. However, this idea is not limited to text or speech; it can be applied to other modalities as well. Indeed, humans do not express themselves as deeply in text as they do in music. The ability of a computational model to interpret
-
APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding arXiv.cs.SD Pub Date : 2024-02-16 Yang Ai, Xiao-Hang Jiang, Ye-Xin Lu, Hui-Peng Du, Zhen-Hua Ling
This paper introduces APCodec, a novel neural audio codec targeting high waveform sampling rates and low bitrates, which seamlessly integrates the strengths of parametric codecs and waveform codecs. APCodec revolutionizes audio encoding and decoding by concurrently handling the amplitude and phase spectra as parametric characteristics of the audio, as parametric codecs do. It is composed
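A minimal sketch of the two parallel input streams such a codec encodes, assuming illustrative STFT settings rather than the paper's configuration:

```python
import torch

def amplitude_phase_split(wav: torch.Tensor, n_fft: int = 1024, hop: int = 256):
    """Split a (B, T) waveform into the two parallel streams an
    amplitude+phase codec would encode: log-amplitude and wrapped phase."""
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    log_amp = torch.log(spec.abs().clamp_min(1e-5))   # (B, F, frames)
    phase = torch.angle(spec)                         # wrapped to (-pi, pi]
    return log_amp, phase
```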
-
Evaluating and Improving Continual Learning in Spoken Language Understanding arXiv.cs.SD Pub Date : 2024-02-16 Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj
Continual learning has emerged as an increasingly important challenge across various tasks, including Spoken Language Understanding (SLU). In SLU, the objective is to handle the emergence of new concepts and evolving environments effectively. The evaluation of continual learning algorithms typically involves assessing the model's stability, plasticity, and generalizability as fundamental aspects of
-
Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls arXiv.cs.SD Pub Date : 2024-02-14 Liwei Lin, Gus Xia, Yixiao Zhang, Junyan Jiang
Controllable music generation plays a vital role in human-AI music co-creation. While Large Language Models (LLMs) have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To bridge this gap, we introduce a novel Parameter-Efficient Fine-Tuning (PEFT) method. This approach enables autoregressive language models to seamlessly
-
ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds arXiv.cs.SD Pub Date : 2024-02-05 Masato Hagiwara, Marius Miron, Jen-Yu Liu
Traditionally, bioacoustics has relied on spectrograms and continuous, per-frame audio representations for the analysis of animal sounds, also serving as input to machine learning models. Meanwhile, the International Phonetic Alphabet (IPA) system has provided an interpretable, language-independent method for transcribing human speech sounds. In this paper, we introduce ISPA (Inter-Species Phonetic
-
Exploring Federated Self-Supervised Learning for General Purpose Audio Understanding arXiv.cs.SD Pub Date : 2024-02-05 Yasar Abbas Ur Rehman, Kin Wai Lau, Yuyang Xie, Lan Ma, Jiajun Shen
The integration of Federated Learning (FL) and Self-supervised Learning (SSL) offers a unique and synergetic combination for exploiting audio data for general-purpose audio understanding without compromising user data privacy. However, few efforts have been made to investigate SSL models in the FL regime for general-purpose audio understanding, especially when the training data is generated by
-
Focal Modulation Networks for Interpretable Sound Classification arXiv.cs.SD Pub Date : 2024-02-05 Luca Della Libera, Cem Subakan, Mirco Ravanelli
The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem
-
How phonemes contribute to deep speaker models? arXiv.cs.SD Pub Date : 2024-02-05 Pengqi Li, Tianhao Wang, Lantian Li, Askar Hamdulla, Dong Wang
Which phonemes convey more speaker traits is a long-standing question, and various perception experiments have been conducted with human subjects. For speaker recognition, studies were conducted with conventional statistical models, and the conclusions drawn are broadly consistent with the perception results. However, which phonemes are more important to modern deep neural models is still unexplored
-
Adversarial Data Augmentation for Robust Speaker Verification arXiv.cs.SD Pub Date : 2024-02-05 Zhenyu Zhou, Junhui Chen, Namin Wang, Lantian Li, Dong Wang
Data augmentation (DA) has gained widespread popularity in deep speaker models due to its ease of implementation and significant effectiveness. It enriches training data by simulating real-life acoustic variations, enabling deep neural networks to learn speaker-related representations while disregarding irrelevant acoustic variations, thereby improving robustness and generalization. However, a potential
-
Natural language guidance of high-fidelity text-to-speech with synthetic annotations arXiv.cs.SD Pub Date : 2024-02-02 Dan Lyth, Simon King
Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and
-
Identification of Cognitive Decline from Spoken Language through Feature Selection and the Bag of Acoustic Words Model arXiv.cs.SD Pub Date : 2024-02-02 Marko Niemelä, Mikaela von Bonsdorff, Sami Äyrämö, Tommi Kärkkäinen
Memory disorders are a central factor in the decline of functioning and daily activities in elderly individuals. The confirmation of the illness, initiation of medication to slow its progression, and the commencement of occupational therapy aimed at maintaining and rehabilitating cognitive abilities require a medical diagnosis. The early identification of symptoms of memory disorders, especially the
-
Digits micro-model for accurate and secure transactions arXiv.cs.SD Pub Date : 2024-02-02 Chirag Chhablani, Nikhita Sharma, Jordan Hosier, Vijay K. Gurbani
Automatic Speech Recognition (ASR) systems are used in the financial domain to enhance the caller experience by enabling natural language understanding and facilitating efficient and intuitive interactions. The increasing use of ASR systems requires that such systems exhibit very low error rates. The predominant ASR models used to collect numeric data are large, general-purpose commercial models -- Google Speech-to-text
-
Harnessing Smartwatch Microphone Sensors for Cough Detection and Classification arXiv.cs.SD Pub Date : 2024-01-31 Pranay Jaiswal, Haroon R. Lone
This study investigates the potential of using smartwatches with built-in microphone sensors for monitoring coughs and detecting various cough types. We conducted a study involving 32 participants and collected 9 hours of audio data in a controlled manner. Afterward, we processed this data using a structured approach, resulting in 223 positive cough samples. We further improved the dataset through
-
Proactive Detection of Voice Cloning with Localized Watermarking arXiv.cs.SD Pub Date : 2024-01-30 Robin San Roman, Pierre Fernandez, Alexandre Défossez, Teddy Furon, Tuan Tran, Hady Elsahar
In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark
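A toy generator/detector pair illustrating the localized-watermarking setup (not the AudioSeal architecture): the generator adds a small perturbation, and the detector emits sample-level logits that a per-sample binary cross-entropy can supervise for localization:

```python
import torch
import torch.nn as nn

class WatermarkPair(nn.Module):
    """Illustrative generator/detector pair for localized watermarking: the
    generator adds a small additive perturbation; the detector predicts,
    per sample, whether the audio is watermarked."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Conv1d(1, hidden, 7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, 1, 7, padding=3), nn.Tanh(),
        )
        self.detector = nn.Sequential(
            nn.Conv1d(1, hidden, 7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, 1, 7, padding=3),    # per-sample logit
        )

    def forward(self, wav):                        # wav: (B, 1, T)
        marked = wav + 1e-2 * self.generator(wav)  # keep the mark low-level
        logits = self.detector(marked)             # (B, 1, T) sample logits
        return marked, logits

# Localization loss: per-sample BCE against a 0/1 mask of watermarked spans,
# e.g. nn.functional.binary_cross_entropy_with_logits(logits, mask).
```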
-
MunTTS: A Text-to-Speech System for Mundari arXiv.cs.SD Pub Date : 2024-01-28 Varun Gumma, Rishav Hada, Aditya Yadavalli, Pamir Gogoi, Ishani Mondal, Vivek Seshadri, Kalika Bali
We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin our study by gathering a substantial dataset of Mundari text and speech and train end-to-end
-
Improving Design of Input Condition Invariant Speech Enhancement arXiv.cs.SD Pub Date : 2024-01-25 Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian
Building a single universal speech enhancement (SE) system that can handle arbitrary input is an in-demand but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio durations, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was
-
Music Genre Classification: A Comparative Analysis of CNN and XGBoost Approaches with Mel-frequency cepstral coefficients and Mel Spectrograms arXiv.cs.SD Pub Date : 2024-01-09 Yigang Meng
In recent years, various well-designed algorithms have empowered music platforms to provide content based on one's preferences. Music genres are defined through various aspects, including acoustic features and cultural considerations. Music genre classification works well with content-based filtering, which recommends content based on music similarity to users. Given a considerable dataset, one premise
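As a concrete example of the feature side of such a comparison, MFCCs are commonly summarized into a fixed-length vector for tree-based models like XGBoost, while the Mel spectrogram feeds the CNN. A sketch with illustrative parameter values:

```python
import numpy as np
import librosa

def mfcc_features(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Summarize a track as the per-coefficient mean and std of its MFCCs,
    a common tabular representation for classifiers such as XGBoost."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```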
-
SonicVisionLM: Playing Sound with Vision Language Models arXiv.cs.SD Pub Date : 2024-01-09 Zhifeng Xie, Shengye Yu, Mengtian Li, Qile He, Chaofeng Chen, Yu-Gang Jiang
There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper
-
CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition arXiv.cs.SD Pub Date : 2024-01-04 Junfeng Hou, Peiyao Wang, Jincheng Zhang, Meng Yang, Minwei Feng, Jingcheng Yin
Deploying end-to-end speech recognition models with limited computing resources remains challenging, despite their impressive performance. Given the gradual increase in model size and the wide range of model applications, selectively executing model components for different inputs to improve the inference efficiency is of great interest. In this paper, we propose a dynamic layer-skipping method that
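The triggering rule can be sketched simply: an intermediate CTC head emits frame posteriors, and frames whose blank probability exceeds a threshold are marked as skippable by the upper layers. The threshold below is illustrative; routing the skipped frames efficiently is where the paper's contribution lies:

```python
import torch

def blank_skip_mask(intermediate_logits: torch.Tensor,
                    blank_id: int = 0, threshold: float = 0.95):
    """Per-frame skip decision from an intermediate CTC head.
    intermediate_logits: (B, T, V). Returns a (B, T) boolean mask where
    True means the frame is confidently blank and upper layers may skip it."""
    post = intermediate_logits.softmax(dim=-1)
    return post[..., blank_id] > threshold
```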
-
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling arXiv.cs.SD Pub Date : 2023-12-19 Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognising the significance of CSS task, the prior studies have not thoroughly investigated the emotional expressiveness problems due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling
-
Extending Whisper with prompt tuning to target-speaker ASR arXiv.cs.SD Pub Date : 2023-12-13 Hao Ma, Zhiyuan Peng, Mingjie Shao, Jing Li, Ju Liu
Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from multi-talker overlapped utterances. Most existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model, leading to significant training costs and making them inapplicable to large foundation models. This work leverages prompt
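Prompt tuning in this spirit can be sketched as learning a small set of soft prompt vectors prepended to the frozen model's input features, so that only the prompt parameters are updated. Dimensions below are placeholders:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learn a few soft prompt vectors prepended to the (frozen) encoder's
    input features; only the prompts receive gradients."""
    def __init__(self, n_prompts: int = 16, d_model: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, feats):                 # feats: (B, T, d_model)
        p = self.prompts.unsqueeze(0).expand(feats.size(0), -1, -1)
        return torch.cat([p, feats], dim=1)   # fed to the frozen backbone
```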
-
Transformer Attractors for Robust and Efficient End-to-End Neural Diarization arXiv.cs.SD Pub Date : 2023-12-11 Lahiru Samarakoon, Samuel J. Broughton, Marc Härkönen, Ivan Fung
End-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) is a method to perform diarization in a single neural network. EDA handles the diarization of a flexible number of speakers by using an LSTM-based encoder-decoder that generates a set of speaker-wise attractors in an autoregressive manner. In this paper, we propose to replace EDA with a transformer-based attractor calculation
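A non-autoregressive attractor module in this spirit can be sketched as a set of learned queries cross-attending to frame embeddings via a transformer decoder, replacing the LSTM's sequential generation; sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class TransformerAttractors(nn.Module):
    """Learned queries cross-attend to frame embeddings; each query yields
    one attractor plus a speaker-existence probability."""
    def __init__(self, d_model: int = 256, max_speakers: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_speakers, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.exist = nn.Linear(d_model, 1)

    def forward(self, frames):                   # frames: (B, T, d_model)
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        attractors = self.decoder(q, frames)     # (B, max_speakers, d_model)
        # frame-speaker activity from frame/attractor dot products
        activity = torch.sigmoid(frames @ attractors.transpose(1, 2))
        return activity, torch.sigmoid(self.exist(attractors)).squeeze(-1)
```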
-
Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism arXiv.cs.SD Pub Date : 2023-12-11 Georgios Milis, Panagiotis P. Filntisis, Anastasios Roussos, Petros Maragos
Recent advances in deep learning for sequential data have given rise to fast and powerful models that produce realistic videos of talking humans. The state of the art in talking face generation focuses mainly on lip-syncing, being conditioned on audio clips. However, having the ability to synthesize talking humans from text transcriptions rather than audio is particularly beneficial for many applications
-
A Practical Survey on Emerging Threats from AI-driven Voice Attacks: How Vulnerable are Commercial Voice Control Systems? arXiv.cs.SD Pub Date : 2023-12-10 Yuanda Wang, Qiben Yan, Nikolay Ivanov, Xun Chen
The emergence of Artificial Intelligence (AI)-driven audio attacks has revealed new security vulnerabilities in voice control systems. While researchers have introduced a multitude of attack strategies targeting voice control systems (VCS), the continual advancements of VCS have diminished the impact of many such attacks. Recognizing this dynamic landscape, our study endeavors to comprehensively assess
-
ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning arXiv.cs.SD Pub Date : 2023-12-11 Xincheng Yu, Dongyue Guo, Jianwei Zhang, Yi Lin
Radio speech echo is a specific phenomenon in the air traffic control (ATC) domain, which degrades speech quality and further impacts automatic speech recognition (ASR) accuracy. In this work, a recognition-oriented speech enhancement (ROSE) framework is proposed to improve speech intelligibility and also advance ASR accuracy, which serves as a plug-and-play tool in ATC scenarios and does not require
-
DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors arXiv.cs.SD Pub Date : 2023-12-07 Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget
Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based
-
Towards small and accurate convolutional neural networks for acoustic biodiversity monitoring arXiv.cs.SD Pub Date : 2023-12-06 Serge Zaugg, Mike van der Schaar, Florence Erbs, Antonio Sanchez, Joan V. Castell, Emiliano Ramallo, Michel André
Automated classification of animal sounds is a prerequisite for large-scale monitoring of biodiversity. Convolutional Neural Networks (CNNs) are among the most promising algorithms but they are slow, often achieve poor classification in the field and typically require large training data sets. Our objective was to design CNNs that are fast at inference time and achieve good classification performance
-
Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models arXiv.cs.SD Pub Date : 2023-12-06 Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this
-
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis arXiv.cs.SD Pub Date : 2023-12-06 Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, Jun Zhu
In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because of the pre-defined data-to-noise diffusion process, their prior distribution is restricted to a noisy representation, which provides little information of the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to substitute the noisy Gaussian
-
Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data arXiv.cs.SD Pub Date : 2023-12-06 Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
Perceptual metrics are traditionally used to evaluate the quality of natural signals, such as images and audio. They are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals. This motivates their use as loss functions for training generative models such that models will learn to capture the structure held in the metric. We take this idea
-
Detecting Voice Cloning Attacks via Timbre Watermarking arXiv.cs.SD Pub Date : 2023-12-06 Chang Liu, Jie Zhang, Tianwei Zhang, Xi Yang, Weiming Zhang, Nenghai Yu
Nowadays, it is common to release audio content to the public. However, with the rise of voice cloning technology, attackers can easily impersonate a specific person by utilizing their publicly released audio without any permission. Therefore, it becomes important to detect any potential misuse of the released audio content and protect its timbre from being impersonated. To this end
-
Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification arXiv.cs.SD Pub Date : 2023-12-06 Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li
Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. The ResNet models treat the time and frequency dimensions equally. They follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech
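The asymmetry argument suggests stride configurations that treat the frequency and time axes differently, e.g. downsampling frequency while preserving temporal resolution, instead of the image-style stride (2, 2). A minimal illustration of such a stem (channel counts and kernel sizes are placeholders, not the paper's recipe):

```python
import torch.nn as nn

# Asymmetric-stride stem for (B, 1, freq, time) spectrogram input:
# stride (2, 1) halves the frequency axis but keeps every time frame.
stem = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=(2, 1), padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)
```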
-
Subnetwork-to-go: Elastic Neural Network with Dynamic Training and Customizable Inference arXiv.cs.SD Pub Date : 2023-12-06 Kai Li, Yi Luo
Deploying neural networks to different devices or platforms is in general challenging, especially when the model size is large or the model complexity is high. Although methods exist for model pruning or distillation, it is typically necessary to perform a full round of model training or fine-tuning in order to obtain a smaller model that satisfies the size or complexity constraints. Motivated
-
Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation arXiv.cs.SD Pub Date : 2023-12-06 Wonjun Lee, Gary Geunbae Lee, Yunsu Kim
This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. Our approach optimizes these two stages to improve speech recognition across languages. We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics, thus improving recognition accuracy. Additionally
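The vocabulary-merging step can be illustrated as a simple mapping from rare phonemes to articulatorily similar ones, applied before training; the pairs below are hypothetical examples, not the paper's mapping:

```python
# Hypothetical phoneme merges based on shared articulatory features:
# collapse sparse phonemes onto a coarser shared inventory.
MERGE_MAP = {
    "ʈ": "t",   # retroflex -> alveolar plosive (illustrative pairs only)
    "ɖ": "d",
    "ɳ": "n",
}

def merge_phonemes(seq):
    """Map each phoneme in a sequence to its merged class, if any."""
    return [MERGE_MAP.get(p, p) for p in seq]
```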
-
Leveraging Laryngograph Data for Robust Voicing Detection in Speech arXiv.cs.SD Pub Date : 2023-12-05 Yixuan Zhang, Heming Wang, DeLiang Wang
Accurately detecting voiced intervals in speech signals is a critical step in pitch tracking and has numerous applications. While conventional signal processing methods and deep learning algorithms have been proposed for this task, their need to fine-tune threshold parameters for different datasets and limited generalization restrict their utility in real-world applications. To address these challenges
-
Distributed Speech Dereverberation Using Weighted Prediction Error arXiv.cs.SD Pub Date : 2023-12-05 Ziye Yang, Mengfei Zhang, Jie Chen
Speech dereverberation aims to alleviate the negative impact of late reverberant reflections. The weighted prediction error (WPE) method is a well-established technique known for its superior performance in dereverberation. However, in scenarios where microphone nodes are dispersed, the centralized approach of the WPE method requires aggregating all observations for inverse filtering, resulting in
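For background, the centralized WPE iteration for a single channel and frequency bin alternates between estimating the target's time-varying variance and solving a weighted linear-prediction problem on delayed observations; the distributed variant reorganizes exactly this computation across microphone nodes. A didactic sketch:

```python
import numpy as np

def wpe_single_channel(Y, taps=10, delay=3, iters=3, eps=1e-8):
    """Iterative single-channel WPE for one STFT frequency bin.
    Y: complex array of shape (T,). Returns the dereverberated sequence."""
    T = Y.shape[0]
    X = Y.copy()
    # Delayed, stacked observation matrix Y_tilde: (T, taps)
    Yt = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        Yt[shift:, k] = Y[: T - shift]
    for _ in range(iters):
        lam = np.maximum(np.abs(X) ** 2, eps)           # time-varying variance
        Yw = Yt / lam[:, None]                          # variance-weighted taps
        R = Yw.conj().T @ Yt                            # (taps, taps) correlation
        r = Yw.conj().T @ Y                             # (taps,) cross-correlation
        g = np.linalg.solve(R + eps * np.eye(taps), r)  # prediction filter
        X = Y - Yt @ g                                  # subtract late reverb
    return X
```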
-
Synthetic Data Generation Techniques for Developing AI-based Speech Assessments for Parkinson's Disease (A Comparative Study) arXiv.cs.SD Pub Date : 2023-12-04 Mahboobeh Parsapoor
Changes in speech and language are among the first signs of Parkinson's disease (PD). Thus, clinicians have for years tried to identify individuals with PD from their voices. Thanks to advancements in artificial intelligence (AI), doctors can now leverage AI-based speech assessments to spot PD. Such AI systems can be developed using machine learning classifiers that have been trained using individuals'
-
Integrating Plug-and-Play Data Priors with Weighted Prediction Error for Speech Dereverberation arXiv.cs.SD Pub Date : 2023-12-05 Ziye Yang, Wenxing Yang, Kai Xie, Jie Chen
Speech dereverberation aims to alleviate the detrimental effects of late-reverberant components. While the weighted prediction error (WPE) method has shown superior performance in dereverberation, there is still room for further improvement in terms of performance and robustness in complex and noisy environments. Recent research has highlighted the effectiveness of integrating physics-based and data-driven
-
Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler arXiv.cs.SD Pub Date : 2023-12-05 Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May
Diffusion models are a new class of generative models that have recently been applied successfully to speech enhancement. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on the
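A Heun-based sampler in the EDM style applies an Euler predictor followed by a trapezoidal corrector at each noise level; a generic sketch, assuming `score_net(x, sigma)` returns the probability-flow derivative dx/dσ (not the paper's exact parameterization):

```python
import torch

def heun_sampler(score_net, x, sigmas):
    """Heun's (2nd-order) sampler over a decreasing noise schedule `sigmas`."""
    for i in range(len(sigmas) - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        d = score_net(x, s)                  # slope at the current point
        x_euler = x + (s_next - s) * d       # Euler predictor
        if s_next > 0:                       # Heun corrector (trapezoid rule)
            d_next = score_net(x_euler, s_next)
            x = x + (s_next - s) * 0.5 * (d + d_next)
        else:
            x = x_euler                      # final step: plain Euler
    return x
```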
-
Auralization based on multi-perspective ambisonic room impulse responses arXiv.cs.SD Pub Date : 2023-12-05 Kaspar Müller, Franz Zotter
Most often, virtual acoustic rendering employs real-time updated room acoustic simulations to accomplish auralization for a variable listener perspective. As an alternative, we propose and test a technique to interpolate room impulse responses, specifically Ambisonic room impulse responses (ARIRs) available at a grid of spatially distributed receiver perspectives, measured or simulated in a desired
-
Acoustic Signal Analysis with Deep Neural Network for Detecting Fault Diagnosis in Industrial Machines arXiv.cs.SD Pub Date : 2023-12-02 Mustafa Yurdakul, Sakir Tasdemir
Detecting machine malfunctions at an early stage is crucial for reducing interruptions in operational processes within industrial settings. Recently, the deep learning approach has started to be preferred for the detection of failures in machines. Deep learning provides an effective solution in fault detection processes thanks to automatic feature extraction. In this study, a deep learning-based system
-
Bigger is not Always Better: The Effect of Context Size on Speech Pre-Training arXiv.cs.SD Pub Date : 2023-12-03 Sean Robertson, Ewan Dunbar
It has been generally assumed in the automatic speech recognition (ASR) literature that it is better for models to have access to wider context windows. Yet, many of the potential reasons this might be true in the supervised setting do not necessarily transfer over to the case of unsupervised learning. We investigate how much context is necessary to achieve high-quality pre-trained acoustic models
-
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling arXiv.cs.SD Pub Date : 2023-12-02 Shentong Mo, Pedro Morgado
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challenges
-
Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking arXiv.cs.SD Pub Date : 2023-12-04 Jihyun Lee, Yejin Jeon, Wonjun Lee, Yunsu Kim, Gary Geunbae Lee
Dialogue state tracking plays a crucial role in extracting information in task-oriented dialogue systems. However, preceding research is limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them with our synthetic audio
-
A text-dependent speaker verification application framework based on Chinese numerical string corpus arXiv.cs.SD Pub Date : 2023-12-04 Litong Zheng, Feng Hong, Weijie Xu
Research indicates that text-dependent speaker verification (TD-SV) often outperforms text-independent verification (TI-SV) in short-speech scenarios. However, collecting large-scale fixed-text speech data is challenging, and as speech length increases, factors like sentence rhythm and pauses affect TD-SV's sensitivity to the text sequence. Based on these factors, we propose the hypothesis that strategies