Compensation Approaches for Far-field Speaker Identification

Compensating Noise and Reverberation in Far-field Multichannel Speaker Verification

Le Centre pour la Communication Scientifique Directe - HAL - Université de Nantes, 2022

Speaker verification (SV) suffers from unsatisfactory performance in far-field scenarios due to environmental noise and the adverse impact of room reverberation. This work presents a benchmark of multichannel speech enhancement for far-field speaker verification. One approach is purely deep neural network (DNN)-based, and the other combines a DNN with signal processing. We integrated a DNN architecture with signal processing techniques to carry out various experiments, and compared our approach to existing state-of-the-art approaches. We also examine the importance of enrollment in pre-processing, which has been largely overlooked in previous studies. Experimental evaluation shows that pre-processing can improve SV performance as long as the enrollment files are processed similarly to the test data and the test and enrollment utterances fall within similar SNR ranges. Considerable improvement is obtained on the generated data and on all noise conditions of the VOiCES dataset.
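
The paper's central practical point is that the same enhancement front end must be applied to enrollment and test audio. As a concrete illustration, here is a minimal sketch in which a simple delay-and-sum beamformer stands in for the multichannel enhancement stage; the function and variable names are illustrative, and the paper's actual DNN-plus-signal-processing pipeline is not reproduced here.

```python
import numpy as np

def delay_and_sum(channels, ref=0):
    """Align each channel to a reference via cross-correlation, then
    average. A crude stand-in for a multichannel enhancement front end,
    not the paper's DNN + signal-processing pipeline."""
    ref_sig = channels[ref]
    out = np.zeros(len(ref_sig))
    for ch in channels:
        # Peak of the full cross-correlation gives the relative delay.
        lag = np.argmax(np.correlate(ch, ref_sig, mode="full")) - (len(ref_sig) - 1)
        out += np.roll(ch, -lag)  # circular shift is acceptable for a sketch
    return out / len(channels)

# The key experimental finding: enroll and test must see the same front end.
# enhanced_enroll = delay_and_sum(enroll_channels)
# enhanced_test   = delay_and_sum(test_channels)
```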

Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014

Room reverberation poses various deleterious effects on the performance of automatic speech systems. Speaker identification (SID) performance, in particular, degrades rapidly as reverberation time increases. Reverberation causes two forms of spectro-temporal distortion in speech signals: (i) self-masking, which is due to early reflections, and (ii) overlap-masking, which is due to late reverberation. The overlap-masking effect of reverberation has been shown to have the greater adverse impact on the performance of speech systems. Motivated by this fact, this study proposes a blind spectral weighting (BSW) technique for suppressing the reverberation overlap-masking effect in SID systems. The technique is blind in the sense that prior knowledge of neither the anechoic signal nor the room impulse response is required. The performance of the proposed technique is evaluated on speaker verification tasks under simulated and actual reverberant mismatched conditions. Evaluations are conducted in the context of conventional GMM-UBM as well as state-of-the-art i-vector-based systems. The GMM-UBM experiments are performed using speech material from a new data corpus well suited to speaker verification experiments under actual reverberant mismatched conditions, entitled MultiRoom8. The i-vector experiments are carried out with microphone (interview and phone-call) data from the NIST SRE 2010 extended evaluation set, digitally convolved with three different measured room impulse responses from the Aachen impulse response (AIR) database. Experimental results show that incorporating the proposed blind technique into the standard MFCC feature extraction framework yields significant improvement in SID performance under reverberation mismatch.
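
The BSW weighting rule itself is not reproduced here, but the overlap-masking idea it targets can be sketched with a simpler, equally blind scheme: treat a delayed, scaled copy of the signal's own short-time power as a late-reverberation estimate and attenuate the affected time-frequency bins. All parameter values below are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs, alpha=0.4, delay_frames=5):
    """Blindly attenuate spectro-temporal regions dominated by late
    reverberation (overlap-masking). The late-reverb power is modeled
    as a delayed, scaled copy of the signal's own power envelope; the
    actual BSW weighting differs in detail."""
    _, _, X = stft(x, fs, nperseg=512)
    P = np.abs(X) ** 2
    late = np.zeros_like(P)
    late[:, delay_frames:] = alpha * P[:, :-delay_frames]  # delayed power ~ late reflections
    gain = np.maximum(1.0 - late / (P + 1e-12), 0.1)       # spectral floor avoids musical noise
    _, y = istft(X * gain, fs, nperseg=512)
    return y
```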

Far-Field Speaker Recognition

IEEE Transactions on Audio, Speech, and Language Processing, 2007

In this article, we study robust speaker recognition in far-field microphone situations. Two approaches are investigated to improve the robustness of speaker recognition in such scenarios. The first approach applies traditional techniques based on acoustic features. We introduce reverberation compensation as well as feature warping and obtain significant improvements, even under mismatched training-testing conditions. In addition, we performed multiple-channel combination experiments to make use of information from multiple distant microphones. Overall, we achieved up to 87.1% relative improvement on our Distant Microphone database and found that the gains hold across different data conditions and microphone settings. The second approach makes use of higher-level linguistic features. To capture speaker idiosyncrasies, we apply n-gram models trained on multilingual phone strings and show that higher-level features are more robust under mismatched conditions. Furthermore, we compared the performance of multilingual and multi-engine systems, and examined the impact of the number of involved languages on recognition results. Our findings confirm the usefulness of language variety and indicate the language-independent nature of this approach, which suggests that speaker recognition using multilingual phone strings could be successfully applied to any given language.
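
Feature warping, mentioned above, is well documented elsewhere (Pelecanos and Sridharan, 2001): within a sliding window, each cepstral coefficient is replaced by the standard-normal quantile of its rank, so the short-term feature distribution becomes Gaussian regardless of channel and reverberation effects. A minimal sketch, with an assumed window of 301 frames (roughly 3 s at a 10 ms hop):

```python
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=301):
    """Warp each feature dimension to a standard-normal distribution
    over a sliding window. feats: (frames, dims) array of e.g. MFCCs."""
    n_frames, n_dims = feats.shape
    half = win // 2
    warped = np.empty_like(feats, dtype=float)
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        block = feats[lo:hi]
        for d in range(n_dims):
            # Rank of the center frame within the window (the 0.5 offset
            # keeps the quantile strictly inside (0, 1)).
            rank = np.sum(block[:, d] < feats[t, d]) + 0.5
            warped[t, d] = norm.ppf(rank / block.shape[0])
    return warped
```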

Improving the performance of far-field speaker verification using multi-condition training: the case of GMM-UBM and i-vector systems

Interspeech 2014, 2014

While considerable work has been done to characterize the detrimental effects of channel variability on automatic speaker verification (ASV) performance, little attention has been paid to the effects of room reverberation. This paper investigates the effects of room acoustics on the performance of two far-field ASV systems: GMM-UBM (Gaussian mixture model-universal background model) and i-vector. We show that ASV performance is severely affected by reverberation, particularly for i-vector based systems. Three multi-condition training methods are then investigated to mitigate these detrimental effects. The first uses matched train/test speaker models based on estimated reverberation time (RT) values. The second utilizes two-condition training, where clean and reverberant models are used. Lastly, a four-condition training setup is proposed, where models for clean, mild, moderate, and severe reverberation levels are used. Experimental results show that the first and third multi-condition training methods provide significant gains in performance relative to the baseline, with the latter being more suitable for practical resource-constrained far-field applications.
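
A minimal sketch of how four-condition training data could be generated, assuming synthetic exponentially decaying room impulse responses and illustrative RT values of 0.3, 0.6, and 1.0 s; the paper's exact RT bands and room responses are not specified here.

```python
import numpy as np
from scipy.signal import fftconvolve

def synth_rir(rt60, fs, length=0.5):
    """Crude synthetic room impulse response: white noise under an
    exponential decay that reaches -60 dB at t = rt60."""
    t = np.arange(int(length * fs)) / fs
    return np.random.randn(t.size) * 10.0 ** (-3.0 * t / rt60)

def four_condition_copies(x, fs, rts=(0.3, 0.6, 1.0)):
    """Clean + mild/moderate/severe reverberant copies of an utterance,
    for pooling into multi-condition model training."""
    return [x] + [fftconvolve(x, synth_rir(rt, fs))[: x.size] for rt in rts]
```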

Distant-talking speaker identification by generalized spectral subtraction-based dereverberation and its efficient computation

EURASIP Journal on Audio, Speech, and Music Processing, 2014

A dereverberation method based on generalized spectral subtraction (GSS) using multi-channel least mean-squares (MCLMS) has previously been proposed. The results of speech recognition experiments showed that this method achieved a significant improvement over conventional methods. In this paper, we apply this method to distant-talking (far-field) speaker recognition. However, for far-field speech, the GSS-based dereverberation method using clean speech models degrades speaker recognition performance. This may be because GSS-based dereverberation introduces some distortion between clean speech and dereverberant speech. In this paper, we address this problem by training speaker models on dereverberant speech obtained by suppressing reverberation from arbitrary artificial reverberant speech. Furthermore, we propose an efficient computational method for combining the likelihoods of dereverberant speech obtained with multiple compensation parameter sets. This addresses the problem of determining optimal compensation parameters for GSS. We report the results of a speaker recognition experiment performed on large-scale far-field speech with reverberant environments different from the training environments. The proposed GSS-based dereverberation method achieves a recognition rate of 92.2%, which compares well with conventional cepstral mean normalization with delay-and-sum beamforming using a clean speech model (49.0%) and a reverberant speech model (88.4%). We also compare the proposed method with another dereverberation technique, multi-step linear prediction-based spectral subtraction (MSLP-GSS). The proposed method achieves a better recognition rate than the 90.6% of MSLP-GSS. The use of multiple compensation parameters further improves speaker recognition performance, giving our approach a recognition rate of 93.6%. We also implement this method in a real environment, using the optimal compensation parameters estimated from an artificial environment, and obtain a recognition rate of 87.8%, compared with 72.5% for delay-and-sum beamforming using a reverberant speech model.
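
To make the GSS step concrete, here is a single-channel sketch of generalized spectral subtraction with a delayed-power estimate of late reverberation. The real method estimates the reverberant component with MCLMS across channels, and the (alpha, gamma) compensation parameters below are illustrative; the paper combines several such parameter sets at the likelihood level rather than fixing one.

```python
import numpy as np
from scipy.signal import stft, istft

def gss_dereverb(x, fs, alpha=0.9, gamma=1.0, delay=4, floor=0.05):
    """Generalized spectral subtraction in the |X|**(2*gamma) domain:
    subtract a scaled, delayed estimate of late reverberation and
    resynthesize with the original phase."""
    _, _, X = stft(x, fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    P = mag ** (2 * gamma)
    R = np.zeros_like(P)
    R[:, delay:] = P[:, :-delay]  # single-channel proxy for late-reverb power
    S = np.maximum(P - alpha * R, (floor * mag) ** (2 * gamma))
    _, y = istft(S ** (1 / (2 * gamma)) * np.exp(1j * phase), fs, nperseg=512)
    return y
```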

Investigating the use of modulation spectral features within an i-vector framework for far-field automatic speaker verification

2014 International Telecommunications Symposium (ITS), 2014

It is known that channel variability compromises automatic speaker recognition accuracy. However, little attention has been given so far to the detrimental effects encountered in reverberant environments. In this paper, we focus on the issue of automatic speaker verification (ASV) under several levels of room reverberation. Alternative auditory-inspired features are explored. Specifically, we investigate whether so-called modulation spectral features (MSFs) can outperform the well-known mel-frequency cepstral coefficients (MFCCs). Experiments were conducted with an ASV system based on the state-of-the-art i-vector framework. The main contribution of this paper is to verify whether MSFs combined with i-vectors can match the performance reported in the literature for speech recognition and speaker identification systems in reverberant environments.
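
The general recipe behind modulation spectral features is a second spectral analysis across time on band-energy trajectories. A toy sketch, assuming a plain spectrogram whose lowest bins stand in for an acoustic (e.g. mel or gammatone) filterbank, and with illustrative window sizes; the paper's exact MSF extraction is not reproduced.

```python
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs, n_bands=23, mod_win=32):
    """Toy modulation spectral features: band-energy trajectories from a
    spectrogram, followed by a second FFT across time in each band.
    Assumes the input is long enough for at least one mod_win segment."""
    _, _, X = stft(x, fs, nperseg=400, noverlap=240)  # ~25 ms frames, 10 ms hop at 16 kHz
    env = np.abs(X[:n_bands]) + 1e-12                 # low bins as a filterbank stand-in
    feats = []
    for i in range(0, env.shape[1] - mod_win, mod_win // 2):
        seg = np.log(env[:, i:i + mod_win])
        feats.append(np.abs(np.fft.rfft(seg, axis=1)))  # per-band modulation spectrum
    return np.stack(feats)  # (n_segments, n_bands, n_modulation_bins)
```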

Speaker identification with distant microphone speech

Acoustics Speech and …, 2010

The field of speaker identification has recently seen significant advancement, but improvements have tended to be benchmarked on near-field speech, ignoring the more realistic setting of far-field-instrumented speakers. In this work, we present several findings on far-field speech from the MIXER5 Corpus, in the areas of feature extraction, speaker modeling, and multichannel score combination. First, we observe that minimum-variance distortionless response (MVDR) features outperform mel-frequency cepstral coefficient (MFCC) features, and that fundamental frequency variation (FFV) features offer complementary information to both MFCC and MVDR features. Second, we present evidence that factor analysis significantly improves system performance compared to the more traditional GMM/UBM strategy. Third, we find that frame-based score competition significantly improves performance under mismatched conditions when multiple channels are available.
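
One plausible minimal reading of frame-based score competition, for readers unfamiliar with the term: score each frame against the target model on every channel, keep the best-scoring channel per frame, and average. The paper's exact variant may differ; the shapes and names below are assumptions.

```python
import numpy as np

def frame_score_competition(channel_frame_scores):
    """channel_frame_scores: (n_channels, n_frames) per-frame target
    log-likelihoods, one row per microphone. Each frame 'competes':
    the best-scoring channel wins, and the winners are averaged."""
    scores = np.asarray(channel_frame_scores)
    return np.max(scores, axis=0).mean()

# e.g. combined = frame_score_competition(np.random.randn(3, 200))
```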

Robust Speaker Recognition in Noisy Conditions

IEEE Transactions on Audio, Speech, and Language Processing, 2007

This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise but knowledge about the noise characteristics is not available. This research is motivated in part by the potential application of speaker recognition technologies on handheld devices or the Internet. While the technologies promise an additional biometric layer of security to protect the user, the practical implementation of such systems faces many challenges. One of these is environmental noise. Due to the mobile nature of such systems, the noise sources can be highly time-varying and potentially unknown. This raises the requirement for noise robustness in the absence of information about the noise. This paper describes a method, named universal compensation (UC), that combines multi-condition training and the missing-feature method to model noise with unknown temporal-spectral characteristics. Multi-condition training is conducted using simulated noisy data with limited noise varieties, providing a "coarse" compensation for the noise, and the missing-feature method refines the compensation by ignoring noise variations outside the given training conditions, thereby reducing the training-testing mismatch. This paper focuses on several issues relating to the implementation of the UC model for real-world applications: the generation of multi-condition training data to model real-world noisy speech, the combination of different training data to optimize recognition performance, and the reduction of the model's complexity. Two databases were used to test the UC algorithm. The first is a redevelopment of the TIMIT database, re-recorded in the presence of various noises and used to test the model for speaker identification with a focus on noise variety. The second is a handheld-device database collected in realistic noisy conditions, used to further validate the model for speaker verification on real-world data. The new model was compared to baseline systems and showed improved identification and verification performance.
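
A minimal sketch of the two halves of universal compensation as described above: simulated multi-condition data at a few SNRs (the "coarse" compensation), and a missing-feature reliability mask that flags time-frequency bins falling outside the compensated range. The noise-floor estimate and thresholds are assumed inputs, not the paper's specific choices.

```python
import numpy as np

def add_noise_at_snr(x, noise, snr_db):
    """Mix noise into x at a target SNR, producing simulated
    multi-condition training data."""
    noise = np.resize(noise, x.size)
    scale = np.sqrt((x ** 2).mean() / ((noise ** 2).mean() * 10 ** (snr_db / 10)))
    return x + scale * noise

def reliability_mask(power_spec, noise_floor, snr_thresh_db=0.0):
    """Missing-feature mask: True where the estimated local SNR of a
    time-frequency bin is high enough for the scorer to trust it.
    noise_floor is an assumed per-bin noise power estimate."""
    local_snr = 10 * np.log10(power_spec / (noise_floor + 1e-12) + 1e-12)
    return local_snr > snr_thresh_db
```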

Speaker Identification in Reverberant Environments

2020

The goal of this project was to explore Computational Auditory Scene Analysis (CASA), specifically blind source separation in reverberant environments. Speaker identification, vowel classification, and speech generation were also explored.