-
The Effect of Musical Expertise on Whistled Vowel Identification Speech Commun. (IF 3.2) Pub Date : 2024-03-17 Anaïs Tran Ngoc, Julien Meyer, Fanny Meunier
In this paper, we looked at the impact of musical experience on whistled vowel categorization by native French speakers. Whistled speech, a natural yet modified speech type, augments speech amplitude while transposing the signal to a range of fairly high frequencies, i.e., 1 to 4 kHz. The whistled vowels are simple pitches of different heights depending on the vowel position, and generally represent
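The vowel-as-pitch idea can be illustrated with a toy synthesis in which each vowel maps to a pure tone in the 1 to 4 kHz whistle register. This is a minimal sketch: the frequency values below are invented for illustration and are not the study's measured pitch heights.

```python
import math

# Illustrative vowel -> whistle pitch map (Hz). Actual per-vowel pitch
# heights are whistler- and context-dependent; these values are made up
# purely to demonstrate the 1-4 kHz register described above.
WHISTLE_HZ = {"i": 3500.0, "e": 2800.0, "a": 1800.0, "o": 1200.0}

def whistled_vowel(vowel, duration_s=0.5, rate=16000):
    """Synthesize a toy whistled vowel as a pure sine tone."""
    f = WHISTLE_HZ[vowel]
    n = int(duration_s * rate)
    return [math.sin(2.0 * math.pi * f * t / rate) for t in range(n)]

def zero_crossings(samples):
    """Count sign changes; roughly 2 * f * duration for a sine of frequency f."""
    return sum(1 for a, b in zip(samples, samples[1:])
               if (a >= 0) != (b >= 0))
```

Counting zero crossings gives a quick sanity check that the synthesized tone sits at the intended pitch height.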
-
Symmetric and asymmetric Gaussian weighted linear prediction for voice inverse filtering Speech Commun. (IF 3.2) Pub Date : 2024-03-06 I.A. Zalazar, G.A. Alzamendi, G. Schlotthauer
Weighted linear prediction (WLP) has demonstrated its significance in voice inverse filtering, contributing to enhanced methods for estimating both the vocal tract filter and the glottal source. WLP provides a mechanism to mitigate, within the linear prediction model, the effect of voice samples that degrade the vocal tract filter estimation, particularly those around glottal closure instants (GCIs)
-
Speech-driven head motion generation from waveforms Speech Commun. (IF 3.2) Pub Date : 2024-03-01 JinHong Lu, Hiroshi Shimodaira
In the literature, head motion generation for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs, as input, plus additional features such as energy and F0. In this paper, we study the direct use of the speech waveform to generate head motion. We claim that creating a task-specific feature from waveform to generate head motion leads to
-
A distortionless convolution beamformer design method based on the weighted minimum mean square error for joint dereverberation and denoising Speech Commun. (IF 3.2) Pub Date : 2024-02-24 Jing Zhou, Changchun Bao, Maoshen Jia, Wenmeng Xiong
This paper designs a weighted minimum mean square error (WMMSE) based distortionless convolution beamformer (DCBF) for joint dereverberation and denoising. By effectively using WMMSE under a distortionless constraint, a DCBF is deduced, where the outputs of the weighted prediction error (WPE) filter and the WPE-based minimum variance distortionless response (MVDR) beamformer are combined to initialize
-
Language fusion via adapters for low-resource speech recognition Speech Commun. (IF 3.2) Pub Date : 2024-02-23 Qing Hu, Yan Zhang, Xianlei Zhang, Zongyu Han, Xiuxia Liang
Data scarcity makes low-resource speech recognition systems suffer from severe overfitting. Although fine-tuning addresses this issue to some extent, it leads to parameter-inefficient training. In this paper, a novel language knowledge fusion method, named LanFusion, is proposed. It is built on the recent popular adapter-tuning technique, thus maintaining better parameter efficiency compared with conventional
-
PLDE: A lightweight pooling layer for spoken language recognition Speech Commun. (IF 3.2) Pub Date : 2024-02-23 Zimu Li, Yanyan Xu, Dengfeng Ke, Kaile Su
-
Analysis of forced aligner performance on L2 English speech Speech Commun. (IF 3.2) Pub Date : 2024-02-16 Samantha Williams, Paul Foulkes, Vincent Hughes
There is growing interest in how speech technologies perform on L2 speech. Largely omitted from this discussion are tools used in the early data processing steps, such as forced aligners, that can introduce errors and biases. This study adds to the conversation and tests how well a model pre-trained for the alignment of L1 American English speech performs on L2 English speech. We test and discuss the
-
Pre-trained models for detection and severity level classification of dysarthria from speech Speech Commun. (IF 3.2) Pub Date : 2024-02-14 Farhad Javanmardi, Sudarsana Reddy Kadiri, Paavo Alku
Automatic detection and severity level classification of dysarthria from speech enables non-invasive and effective diagnosis that helps clinical decisions about medication and therapy of patients. In this work, three pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) are studied to extract features to build automatic detection and severity level classification systems for dysarthric speech
-
On intrusive speech quality measures and a global SNR based metric Speech Commun. (IF 3.2) Pub Date : 2024-02-14 Chao Pan, Jingdong Chen, Jacob Benesty
Measuring the quality of noisy speech signals has been an increasingly important problem in the field of speech processing as more and more speech-communication and human-machine-interface systems are deployed in practical applications. In this paper, we study four widely used classical performance measures: signal-to-distortion ratio (SDR), short-time objective intelligibility (STOI), signal-to-noise
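The classical global SNR that anchors such comparisons reduces to a few lines of arithmetic. The sketch below implements the textbook definition (clean-signal energy over residual-noise energy), not the metric proposed in the paper.

```python
import math

def global_snr_db(clean, noisy):
    """Classical global SNR in dB: energy of the clean signal over the
    energy of the residual noise, where the noise is taken sample-wise
    as (noisy - clean). Textbook definition, not the paper's metric."""
    noise = [y - x for x, y in zip(clean, noisy)]
    e_sig = sum(x * x for x in clean)
    e_noise = sum(n * n for n in noise)
    if e_noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(e_sig / e_noise)
```

For example, adding a constant offset one tenth of the signal amplitude yields a global SNR of 20 dB.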
-
Deletion and insertion tampering detection for speech authentication based on fluctuating super vector of electrical network frequency Speech Commun. (IF 3.2) Pub Date : 2024-02-12 Chunyan Zeng, Shuai Kong, Zhifeng Wang, Shixiong Feng, Nan Zhao, Juan Wang
-
Some properties of mental speech preparation as revealed by self-monitoring Speech Commun. (IF 3.2) Pub Date : 2024-02-09 Hugo Quené, Sieb G. Nooteboom
The main goal of this paper is to improve our insight into the mental preparation of speech, based on speakers' self-monitoring behavior. To this end, we re-analyze the aggregated responses from earlier published experiments eliciting speech sound errors. The re-analyses confirm or show that (1) “early” and “late” detections of elicited speech sound errors can be distinguished, with a time delay in the
-
Validation of an ECAPA-TDNN system for Forensic Automatic Speaker Recognition under case work conditions Speech Commun. (IF 3.2) Pub Date : 2024-02-09 Francesco Sigona, Mirko Grimaldi
In this work, we tested different variants of a Forensic Automatic Speaker Recognition (FASR) system based on Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN). To this end, conditions reflecting those of a real forensic voice comparison case were taken into consideration, in accordance with the evaluation campaign settings. Using this recent neural
-
Automatic classification of neurological voice disorders using wavelet scattering features Speech Commun. (IF 3.2) Pub Date : 2024-01-27 Madhu Keerthana Yagnavajjula, Kiran Reddy Mittapalle, Paavo Alku, Sreenivasa Rao K., Pabitra Mitra
Neurological voice disorders are caused by problems in the nervous system as it interacts with the larynx. In this paper, we propose to use wavelet scattering transform (WST)-based features in automatic classification of neurological voice disorders. As a part of WST, a speech signal is processed in stages with each stage consisting of three operations – convolution, modulus and averaging – to generate
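The per-stage pipeline named above (convolution, modulus, averaging) can be sketched in miniature. A real scattering transform uses wavelet filter banks; the toy FIR kernel and moving-average window here are stand-ins chosen only to make the three operations concrete.

```python
def convolve(signal, kernel):
    """'Same'-length 1-D convolution with zero padding at the edges."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out

def moving_average(signal, radius):
    """Local averaging over a (2*radius + 1)-sample window."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius):i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def scattering_stage(signal, kernel, radius=2):
    """One WST-style stage: convolution -> modulus -> local averaging.
    In a real scattering transform the kernel is a wavelet and the
    averaging is a low-pass filter; both are simplified here."""
    rectified = [abs(v) for v in convolve(signal, kernel)]
    return moving_average(rectified, radius)
```

Stacking such stages, each fed by the rectified outputs of the previous one, is what yields the multi-layer scattering representation.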
-
AVID: A speech database for machine learning studies on vocal intensity Speech Commun. (IF 3.2) Pub Date : 2024-01-23 Paavo Alku, Manila Kodali, Laura Laaksonen, Sudarsana Reddy Kadiri
Vocal intensity, which is quantified typically with the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded together with speech. However, most of the popular databases that are used in areas such as speech and speaker recognition have been recorded without calibration
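The calibration-tone procedure reduces to simple level arithmetic: the SPL of a speech segment is the reference SPL plus the dB difference between the speech RMS and the calibration-tone RMS. A minimal sketch, assuming both signals were captured through the same recording chain:

```python
import math

def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def spl_from_calibration(speech, cal_tone, cal_spl_db=94.0):
    """Estimate speech SPL from a calibration tone of known SPL.
    The 94 dB default matches the standard 1 Pa calibrator level
    mentioned in the abstract above."""
    return cal_spl_db + 20.0 * math.log10(rms(speech) / rms(cal_tone))
```

A speech segment whose RMS is one tenth of the calibration tone's therefore sits 20 dB below the reference, at 74 dB SPL.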
-
Monophthong vocal tract shapes are sufficient for articulatory synthesis of German primary diphthongs Speech Commun. (IF 3.2) Pub Date : 2024-01-24 Simon Stone, Peter Birkholz
German primary diphthongs are conventionally transcribed using the same symbols used for some monophthong vowels. However, if the corresponding vocal tract shapes are used for articulatory synthesis, the results often sound unnatural. Furthermore, there is no clear consensus in the literature on whether diphthongs have monophthong constituents and, if so, which ones. This study therefore analyzed a set of audio
-
The impact of non-native English speakers’ phonological and prosodic features on automatic speech recognition accuracy Speech Commun. (IF 3.2) Pub Date : 2024-01-13 Ingy Farouk Emara, Nabil Hamdy Shaker
The present study examines the impact of Arab speakers’ phonological and prosodic features on the accuracy of automatic speech recognition (ASR) of non-native English speech. The authors first investigated the perceptions of 30 Egyptian ESL teachers and 70 Egyptian university students towards the L1 (Arabic)-based errors affecting intelligibility and then carried out a data analysis of the ASR of the
-
Deep temporal clustering features for speech emotion recognition Speech Commun. (IF 3.2) Pub Date : 2024-01-02 Wei-Cheng Lin, Carlos Busso
-
LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild Speech Commun. (IF 3.2) Pub Date : 2023-12-24 Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan
Researchers have shown a growing interest in Audio-driven Talking Head Generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio. We used the U-Net architecture based on residual CBAM
-
Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network Speech Commun. (IF 3.2) Pub Date : 2023-12-14 Nan Li, Longbiao Wang, Meng Ge, Masashi Unoki, Sheng Li, Jianwu Dang
Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability to discern speech in complex and noisy environments
-
Performance of single-channel speech enhancement algorithms on Mandarin listeners with different immersion conditions in New Zealand English Speech Commun. (IF 3.2) Pub Date : 2023-12-14 Yunqi C. Zhang, Yusuke Hioka, C.T. Justine Hui, Catherine I. Watson
Speech enhancement (SE) is a widely used technology to improve the quality and intelligibility of noisy speech. So far, SE algorithms have been designed and evaluated on native listeners only, not on non-native listeners, who are known to be more disadvantaged when listening in noisy environments. This paper investigates the performance of five widely used single-channel SE algorithms on early-immersed
-
Back to grammar: Using grammatical error correction to automatically assess L2 speaking proficiency Speech Commun. (IF 3.2) Pub Date : 2023-12-12 Stefano Bannò, Marco Matassoni
-
Speakers’ vocal expression of sexual orientation depends on experimenter gender Speech Commun. (IF 3.2) Pub Date : 2023-12-04 Sven Kachel, Adrian P. Simpson, Melanie C. Steffens
Since the early days of (phonetic) convergence research, one of the main questions has been which individuals are more likely to adapt their speech to others. Differences between women and men in particular have been researched intensively. Using a differential approach as well, we complement the existing literature by focusing on another gender-related characteristic, namely sexual orientation. The
-
Choosing only the best voice imitators: Top-K many-to-many voice conversion with StarGAN Speech Commun. (IF 3.2) Pub Date : 2023-11-30 Claudio Fernandez-Martín, Adrian Colomer, Claudio Panariello, Valery Naranjo
Voice conversion systems have become increasingly important as the use of voice technology grows. Deep learning techniques, specifically generative adversarial networks (GANs), have enabled significant progress in the creation of synthetic media, including the field of speech synthesis. One of the most recent examples, StarGAN-VC, uses a single pair of generator and discriminator to convert voices
-
An introduction to pluricentric languages in speech science and technology Speech Commun. (IF 3.2) Pub Date : 2023-11-21 Barbara Schuppler, Martine Adda-Decker, Catia Cucchiarini, Rudolf Muhr
Pluricentric languages are languages that are spoken in at least two countries where they have an official function and thus develop national varieties with specific linguistic and pragmatic features. Presently 43 languages have been identified as belonging to this category, for instance, English, Spanish, German, Bengali, Hindi and Urdu. This article forms an introduction to the special issue “Pluricentric
-
Multiscale-multichannel feature extraction and classification through one-dimensional convolutional neural network for speech emotion recognition Speech Commun. (IF 3.2) Pub Date : 2023-11-22 Minying Liu, Alex Noel Joseph Raj, Vijayarajan Rajangam, Kunwu Ma, Zhemin Zhuang, Shuxin Zhuang
Speech emotion recognition (SER) is a crucial field of research in artificial intelligence and human–computer interaction. Extracting effective speech features for emotion recognition is a continuing research focus in SER. Most research has focused on finding an optimal speech feature to extract hidden local features while ignoring the global relationships of the speech signal. In this paper, we propose
-
Selective transfer subspace learning for small-footprint end-to-end cross-domain keyword spotting Speech Commun. (IF 3.2) Pub Date : 2023-11-22 Fei Ma, Chengliang Wang, Xusheng Li, Zhuo Zeng
In small-footprint end-to-end keyword spotting, it is often expensive and time-consuming to acquire sufficient labels in various speech scenarios. To overcome this problem, transfer learning leverages the rich knowledge of the auxiliary domain to annotate the unlabeled target data. However, most existing transfer learning methods typically learn a domain-invariant feature representation while ignoring
-
Detecting Wilson's disease from unstructured connected speech: An embedding-based approach augmented by attention and bi-directional dependency Speech Commun. (IF 3.2) Pub Date : 2023-11-17 Zhenglin Zhang, Li-Zhuang Yang, Xun Wang, Hongzhi Wang, Stephen T.C. Wong, Hai Li
Wilson's disease (WD) is a neurodegenerative genetic disorder in which dysarthria is the initial neurological symptom. Automated WD diagnosis from speech is thus a promising and clinically valuable approach. The present study investigates the feasibility of WD detection from unstructured connected speech (UCS) using the embedding-based approach augmented by the attention mechanism and bi-directional
-
Pronunciation error detection model based on feature fusion Speech Commun. (IF 3.2) Pub Date : 2023-11-14 Cuicui Zhu, Aishan Wumaier, Dongping Wei, Zhixing Fan, Jianlei Yang, Heng Yu, Zaokere Kadeer, Liejun Wang
Mispronunciation detection and diagnosis (MDD) is a specific speech recognition task that aims to recognize the phoneme sequence produced by a user, compare it with the standard phoneme sequence, and identify the type and location of any mispronunciations. However, the lack of large amounts of phoneme-level annotated data limits the performance improvement of the model. In this paper, we propose a
-
Adapted Weighted Linear Prediction with Attenuated Main Excitation for formant frequency estimation in high-pitched singing Speech Commun. (IF 3.2) Pub Date : 2023-11-09 Eduardo Barrientos, Edson Cataldo
This paper aims to show how to improve the accuracy of formant frequency estimation in the singing voice of a lyric soprano. Conventional methods of formant frequency estimation may not accurately capture the formant frequencies of the singing voice, particularly in the highest pitch range of a lyric soprano, where the lowest formants are biased by the pitch harmonics. To address this issue, the study
-
JNV corpus: A corpus of Japanese nonverbal vocalizations with diverse phrases and emotions Speech Commun. (IF 3.2) Pub Date : 2023-11-04 Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari
We present the JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora either lack phrase diversity or focus on a small number of emotions, which makes it difficult to analyze the characteristics of Japanese NVs and support downstream tasks like emotion recognition. We first propose a corpus-design
-
Multimodal Arabic emotion recognition using deep learning Speech Commun. (IF 3.2) Pub Date : 2023-11-04 Noora Al Roken, Gerassimos Barlas
Emotion Recognition has been an active area for decades due to the complexity of the problem and its significance in human–computer interaction. Various methods have been employed to tackle this problem, leveraging different inputs such as speech, 2D and 3D images, audio signals, and text, all of which can convey emotional information. Recently, researchers have started combining multiple modalities
-
Coarse-to-fine speech separation method in the time-frequency domain Speech Commun. (IF 3.2) Pub Date : 2023-11-04 Xue Yang, Changchun Bao, Xianhong Chen
Although time-domain speech separation methods have exhibited outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in reverberant scenarios. Compared to the time-domain methods, speech separation methods in the time-frequency (T-F) domain mainly concern structured T-F representations and have shown great potential recently. In this paper, we propose
-
Disordered speech recognition considering low resources and abnormal articulation Speech Commun. (IF 3.2) Pub Date : 2023-10-29 Yuqin Lin, Longbiao Wang, Jianwu Dang, Sheng Li, Chenchen Ding
The success of automatic speech recognition (ASR) benefits a great number of healthy people, but not people with disorders. Speakers with speech disorders may truly need support from technology, yet they actually gain little. The difficulties of disordered ASR arise from the limited availability of data and the abnormal nature of the speech, e.g., unclear, unstable, and incorrect pronunciations. To realize the
-
Dual-model self-regularization and fusion for domain adaptation of robust speaker verification Speech Commun. (IF 3.2) Pub Date : 2023-10-27 Yibo Duan, Yanhua Long, Jiaen Liang
Learning robust representations of speaker identity is a key challenge in speaker verification, as it results in good generalization for many real-world speaker verification scenarios with domain or intra-speaker variations. In this study, we aim to improve the well-established ECAPA-TDNN framework to enhance its domain robustness for low-resource cross-domain speaker verification tasks. Specifically
-
The Role of Auditory and Visual Cues in the Perception of Mandarin Emotional Speech in Male Drug Addicts Speech Commun. (IF 3.2) Pub Date : 2023-10-12 Puyang Geng, Ningxue Fan, Rong Ling, Hong Guo, Qimeng Lu, Xingwen Chen
Evidence from previous neurological studies has revealed that drugs can cause severe damage to the human brain structure, leading to significant cognitive disorders in emotion processing, such as psychotic-like symptoms (e.g., speech illusion: reporting positive/negative responses when hearing white noise) and negative reinforcement. Due to these emotion processing disorders, drug addicts may experience
-
Classification of functional dysphonia using the tunable Q wavelet transform Speech Commun. (IF 3.2) Pub Date : 2023-10-06 Kiran Reddy Mittapalle, Madhu Keerthana Yagnavajjula, Paavo Alku
Functional dysphonia (FD) refers to an abnormality in voice quality in the absence of an identifiable lesion. In this paper, we propose an approach based on the tunable Q wavelet transform (TQWT) to automatically classify two types of FD (hyperfunctional dysphonia and hypofunctional dysphonia) from a healthy voice using the acoustic voice signal. Using TQWT, voice signals were decomposed into sub-bands
-
Graph attention-based deep embedded clustering for speaker diarization Speech Commun. (IF 3.2) Pub Date : 2023-10-05 Yi Wei, Haiyan Guo, Zirui Ge, Zhen Yang
Deep speaker embedding extraction models have recently served as the cornerstone for modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (namely, speaker features) do not effectively leverage their intrinsic relationships, and moreover, are not tailored specifically for the clustering task. In this paper, inspired by deep embedded clustering (DEC)
-
Subband fusion of complex spectrogram for fake speech detection Speech Commun. (IF 3.2) Pub Date : 2023-09-29 Cunhang Fan, Jun Xue, Shunbo Dong, Mingming Ding, Jiangyan Yi, Jinpeng Li, Zhao Lv
Phase information has been shown to be useful in fake speech detection. However, the most common reason why phase-based features are not widely used is phase wrapping, which makes the original phase hard to model directly. How to utilize phase information effectively therefore remains a challenge. To address this issue, this paper proposes a novel subband fusion of the complex spectrogram method for
-
Post-processing automatic transcriptions with machine learning for verbal fluency scoring Speech Commun. (IF 3.2) Pub Date : 2023-09-27 Justin Bushnell, Frederick Unverzagt, Virginia G. Wadley, Richard Kennedy, John Del Gaizo, David Glenn Clark
Objective: To compare verbal fluency scores derived from manual transcriptions to those obtained using automatic speech recognition enhanced with machine learning classifiers. Methods: Using Amazon Web Services, we automatically transcribed verbal fluency recordings from 1400 individuals who performed both animal and letter F verbal fluency tasks. We manually adjusted timings and contents of the automatic
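Once transcripts exist, the conventional manual scoring rule (count unique valid responses, excluding perseverations and intrusions) is straightforward to automate. This is a simplified sketch: the category word list is illustrative, and the paper's classifier-enhanced scoring goes well beyond it.

```python
def fluency_score(transcript, valid_words):
    """Score a verbal-fluency transcript: count unique valid responses.
    Repetitions (perseverations) and out-of-category words (intrusions)
    are excluded, following the conventional manual scoring rule."""
    seen = set()
    score = 0
    for token in transcript.lower().split():
        if token in valid_words and token not in seen:
            seen.add(token)
            score += 1
    return score
```

A repeated "cat" and the intrusion "banana" below are both ignored, so only three responses count.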
-
Chirplet transform based time frequency analysis of speech signal for automated speech emotion recognition Speech Commun. (IF 3.2) Pub Date : 2023-09-23 Siba Prasad Mishra, Pankaj Warule, Suman Deb
Nowadays, the recognition of emotion using the speech signal has gained popularity because of its vast number of applications in different fields like medicine, online marketing, online search engines, the education system, criminal investigations, traffic collisions, etc. Many researchers have adopted different methodologies to improve emotion classification accuracy using speech signals. In our study
-
CAST: Context-association architecture with simulated long-utterance training for mandarin speech recognition Speech Commun. (IF 3.2) Pub Date : 2023-09-22 Yue Ming, Boyang Lyu, Zerui Li
End-to-end (E2E) models are widely used because they significantly improve the performance of automatic speech recognition (ASR). However, owing to the limitations of existing hardware computing devices, previous studies mainly focus on short utterances. Typically, utterances used for ASR training do not last much longer than 15 s, and therefore the models often fail to generalize to longer utterances
-
Comparing Levenshtein distance and dynamic time warping in predicting listeners’ judgments of accent distance Speech Commun. (IF 3.2) Pub Date : 2023-09-21 Holly C. Lind-Combs, Tessa Bent, Rachael F. Holt, Cynthia G. Clopper, Emma Brown
Listeners attend to variation in segmental and prosodic cues when judging accent strength. The relative contributions of these cues to perceptions of accentedness in English remain open for investigation, although objective accent distance measures (such as Levenshtein distance) appear to be reliable tools for predicting perceptual distance. Levenshtein distance, however, only accounts for phonemic
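Both measures compared in the study have compact dynamic-programming forms. The sketch below gives generic implementations of Levenshtein distance over phone sequences and DTW over feature sequences; the study's specific feature choices and normalizations are not reproduced here.

```python
def levenshtein(a, b):
    """Edit distance between two phone sequences (lists or strings):
    minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, 1):
        cur = [i]
        for j, bj in enumerate(b, 1):
            cost = 0 if ai == bj else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def dtw(x, y, dist=lambda u, v: abs(u - v)):
    """Dynamic time warping cost between two feature sequences, allowing
    one element to align with several in the other sequence."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for xi in x:
        cur = [inf]
        for j, yj in enumerate(y, 1):
            cur.append(dist(xi, yj) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]
```

The contrast the abstract draws is visible in the signatures: Levenshtein compares discrete phone labels, while DTW accepts any per-frame distance and so can warp continuous acoustic features.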
-
Determining spectral stability in vowels: A comparison and assessment of different metrics Speech Commun. (IF 3.2) Pub Date : 2023-09-15 Jérémy Genette, Jose Manuel Rivera Espejo, Steven Gillis, Jo Verhoeven
-
Toward enriched decoding of mandarin spontaneous speech Speech Commun. (IF 3.2) Pub Date : 2023-09-14 Yu-Chih Deng, Yuan-Fu Liao, Yih-Ru Wang, Sin-Horng Chen
A deep neural network (DNN)-based automatic speech recognition (ASR) method for enriched decoding of Mandarin spontaneous speech is proposed. It enhances a baseline approach built on factored time delay neural networks (TDNN-f) with RNNLM rescoring, first building a baseline system composed of a TDNN-f acoustic model (AM), a trigram language model (LM), and a recurrent
-
Acoustic properties of non-native clear speech: Korean speakers of English Speech Commun. (IF 3.2) Pub Date : 2023-09-10 Ye-Jee Jung, Olga Dmitrieva
The present study examined the acoustic properties of clear speech produced by non-native speakers of English (L1 Korean), in comparison to native clear speech. L1 Korean speakers of English (N=30) and native speakers of English (N=20) read an English word-list in casual and clear speaking styles. Analysis included clear speech correlates thought to be universal across languages (vowel space expansion
-
Model predictive PESQ-ANFIS/FUZZY C-MEANS for image-based speech signal evaluation Speech Commun. (IF 3.2) Pub Date : 2023-09-09 Eder Pereira Neves, Marco Aparecido Queiroz Duarte, Jozue Vieira Filho, Caio Cesar Enside de Abreu, Bruno Rodrigues de Oliveira
This paper presents a new method to evaluate the quality of speech signals through images generated from a psychoacoustic model, estimating PESQ (ITU-T P.862) values with a first-order Fuzzy Sugeno approach implemented in the Adaptive Neuro-Fuzzy Inference System (ANFIS). The factors feeding the network were obtained from the perceptual model coefficients using an image-processing technique. All simulations
-
Speech emotion recognition approaches: A systematic review Speech Commun. (IF 3.2) Pub Date : 2023-09-07 Ahlam Hashem, Muhammad Arif, Manal Alghamdi
The speech emotion recognition (SER) field has been active since it became a crucial feature in advanced Human–Computer Interaction (HCI), and it is used in a wide range of real-life applications. In recent years, numerous SER systems have been covered by researchers, including the availability of appropriate emotional databases, the selection of robust features, and the application of suitable classifiers using Machine Learning
-
DNN controlled adaptive front-end for replay attack detection systems Speech Commun. (IF 3.2) Pub Date : 2023-08-20 Buddhi Wickramasinghe, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Julien Epps, Haizhou Li, Ting Dang
Developing robust countermeasures to protect automatic speaker verification systems against replay spoofing attacks is a well-recognized challenge. Current approaches to spoofing detection are generally based on a fixed front-end, typically a time-invariant filter bank, followed by a machine learning back-end. In this paper, we propose a novel approach whereby the front-end comprises an adaptive filter
-
Fractional feature-based speech enhancement with deep neural network Speech Commun. (IF 3.2) Pub Date : 2023-08-19 Liyun Xu, Tong Zhang
Speech enhancement (SE) has become a promising application of deep learning. Commonly, the deep neural network (DNN) in the SE task is trained to learn a mapping from the noisy features to the clean ones. However, the features are usually extracted in the time or frequency domain. In this paper, improved features in the fractional domain are presented, based on the flexible character of fractional
-
Correction of whitespace and word segmentation in noisy Pashto text using CRF Speech Commun. (IF 3.2) Pub Date : 2023-08-14 Ijazul Haq, Weidong Qiu, Jie Guo, Peng Tang
Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. This inconsistency makes the Pashto word segmentation unique
-
The dependence of accommodation processes on conversational experience Speech Commun. (IF 3.2) Pub Date : 2023-07-26 L. Ann Burchfield, Mark Antoniou, Anne Cutler
Conversational partners accommodate to one another's speech, a process that greatly facilitates perception. This process occurs in both first (L1) and second languages (L2); however, recent research has revealed that adaptation can be language-specific, with listeners sometimes applying it in one language but not in another. Here, we investigate whether a supply of novel talkers impacts whether the
-
Multimodal attention for lip synthesis using conditional generative adversarial networks Speech Commun. (IF 3.2) Pub Date : 2023-07-22 Andrea Vidal, Carlos Busso
-
Development of a hybrid word recognition system and dataset for the Azerbaijani Sign Language dactyl alphabet Speech Commun. (IF 3.2) Pub Date : 2023-07-21 Jamaladdin Hasanov, Nigar Alishzade, Aykhan Nazimzade, Samir Dadashzade, Toghrul Tahirov
The paper introduces a real-time fingerspelling-to-text translation system for the Azerbaijani Sign Language (AzSL), targeted at the clarification of words with no available or ambiguous signs. The system consists of both statistical and probabilistic models, used in the sign recognition and sequence generation phases. Linguistic, technical, and human–computer interaction-related challenges, which
-
Real-time intelligibility affects the realization of French word-final schwa Speech Commun. (IF 3.2) Pub Date : 2023-07-17
Speech variation has been hypothesized to reflect both speaker-internal influences of lexical access on production and adaptive modifications to make words more intelligible to the listener. The current study considers categorical and gradient variation in the production of word-final schwa in French as explained by lexical access processes, phonological, and/or listener-oriented influences on speech
-
Investigating prosodic entrainment from global conversations to local turns and tones in Mandarin conversations Speech Commun. (IF 3.2) Pub Date : 2023-07-17 Zhihua Xia, Julia Hirschberg, Rivka Levitan
Previous research on acoustic entrainment has paid less attention to tones than to other prosodic features. This study establishes a hierarchical framework with three layers (conversations, turns, and tone units), investigates prosodic entrainment in Mandarin spontaneous dialogues at each level, and compares the three. Our research has found that (1) global and local entrainment exist independently, and local
-
Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech Speech Commun. (IF 3.2) Pub Date : 2023-07-14 Li Chai, Hang Chen, Jun Du, Qing-Feng Liu, Chin-Hui Lee
We propose a space-and-speaker-aware (SSA) approach to acoustic modeling (AM), denoted as SSA-AM, to improve the performance of automatic speech recognition (ASR) in distant multi-array conversational scenarios. In contrast to conventional AM, which only uses spectral features from a target speaker as inputs, the inputs to SSA-AM consist of speech features from both the target and interfering speakers
-
Addressing the semi-open set dialect recognition problem under resource-efficient considerations Speech Commun. (IF 3.2) Pub Date : 2023-07-01 Spandan Dey, Goutam Saha
This work presents a resource-efficient solution for the spoken dialect recognition task under semi-open set evaluation scenarios, where a closed set model is exposed to unknown class inputs. We have primarily explored the task 2 of the OLR 2020 challenge for our experiments. In this task, three Chinese dialects Hokkien, Sichuanese, and Shanghainese, are to be recognized. For evaluation, along with
-
Fusion-based speech emotion classification using two-stage feature selection Speech Commun. (IF 3.2) Pub Date : 2023-06-27 Jie Xie, Mingying Zhu, Kai Hu
Speech emotion recognition plays an important role in human–computer interaction, which uses speech signals to determine the emotional state. Previous studies have proposed various features and feature selection methods. However, few studies have investigated the two-stage feature selection method for speech emotion classification. In this study, we propose a novel speech emotion classification algorithm
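A generic two-stage scheme can be sketched as a filter stage followed by a wrapper stage. The variance ranking and greedy forward search below are stand-in criteria, not the paper's exact method, and `score_fn` is a hypothetical user-supplied evaluator (e.g. cross-validated classifier accuracy).

```python
def variance(col):
    """Population variance of one feature column."""
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)

def two_stage_select(features, k_filter, k_final, score_fn):
    """Two-stage feature selection sketch.
    Stage 1 (filter): keep the k_filter highest-variance columns.
    Stage 2 (wrapper): greedy forward selection guided by score_fn(subset).
    Both criteria are generic stand-ins for the paper's method.
    `features` maps feature name -> list of values."""
    stage1 = sorted(features, key=lambda n: variance(features[n]),
                    reverse=True)[:k_filter]
    selected = []
    while len(selected) < k_final:
        best = max((n for n in stage1 if n not in selected),
                   key=lambda n: score_fn(selected + [n]))
        selected.append(best)
    return selected
```

The filter stage cheaply discards weak candidates so that the expensive wrapper evaluations run only over the surviving subset.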