A robust speaker recognition system combining factor analysis techniques

Acoustic Factor Analysis for Robust Speaker Verification

Factor analysis based channel mismatch compensation methods for speaker recognition rest on the assumption that speaker/utterance dependent Gaussian Mixture Model (GMM) mean super-vectors can be constrained to reside in a lower dimensional subspace. This approach does not consider the fact that conventional acoustic feature vectors also reside in a lower dimensional manifold of the feature space when feature covariance matrices contain close to zero eigenvalues. In this study, based on observations of the covariance structure of acoustic features, we propose a factor analysis modeling scheme in the acoustic feature space instead of the super-vector space and derive a mixture dependent feature transformation. We demonstrate how this single linear transformation performs feature dimensionality reduction, de-correlation, normalization and enhancement all at once. The proposed transformation is shown to be closely related to signal subspace based speech enhancement schemes. In contrast to traditional front-end mixture dependent feature transformations, where feature alignment is performed using the highest scoring mixture, the proposed transformation is integrated within the speaker recognition system using a probabilistic feature alignment technique, which removes the need to regenerate the features or retrain the Universal Background Model (UBM). Incorporating the proposed method into a state-of-the-art i-vector and Gaussian Probabilistic Linear Discriminant Analysis (PLDA) framework, we perform evaluations on the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2010 core telephone and microphone tasks. The experimental results demonstrate the superiority of the proposed scheme compared to both full-covariance and diagonal-covariance UBM based systems. Simple equal-weight fusion of the baseline and proposed systems also yields significant performance gains.
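The claim that one linear map can reduce dimension, de-correlate and normalize features can be illustrated with a standard probabilistic PCA (PPCA) projection. The sketch below is a single-Gaussian illustration under stated assumptions, not the mixture-dependent transformation derived in the paper: the function name `ppca_transform`, the latent dimension `q` and the synthetic data are all hypothetical.

```python
import numpy as np

def ppca_transform(features, q):
    """Return (A, mean) such that z = A @ (x - mean) is the PPCA posterior mean.

    Assumes q is smaller than the feature dimension, so that at least one
    eigenvalue is left over to estimate the isotropic noise variance.
    """
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[q:].mean()                   # noise variance: mean discarded eigenvalue
    # Maximum-likelihood loading matrix W = U_q (Lambda_q - sigma2 I)^(1/2)
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    M = W.T @ W + sigma2 * np.eye(q)
    A = np.linalg.solve(M, W.T)                   # A = M^{-1} W^T
    return A, mean

# Synthetic correlated "features" (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
A, mu = ppca_transform(x, q=3)
z = (x - mu) @ A.T                                # reduced, de-correlated features
cov_z = np.cov(z, rowvar=False)
```

Under these assumptions the covariance of `z` is diagonal with entries (λ − σ²)/λ, which is the same form as a Wiener-type gain. This is one way to see why such a transform might be related to subspace-based enhancement.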

A straightforward and efficient implementation of the factor analysis model for speaker verification

2007

For a few years, the problem of session variability in text-independent automatic speaker verification has been actively tackled. A new paradigm based on a factor analysis model has been successfully applied to this task. While very efficient, its implementation is demanding. In this paper, the algorithms involved in the eigenchannel MAP model are written down for a straightforward implementation, without referring to previous work or complex mathematics. In addition, a different compensation scheme is proposed where the standard GMM likelihood can be used without any modification to obtain good performance (even without the need for score normalization). The use of the compensated supervectors within an SVM classifier through a distance-based kernel is also investigated. Experimental results show an overall 50% relative gain over the standard GMM-UBM system on the NIST SRE 2005 and 2006 protocols (at both the DCFmin and EER). Index Terms: Speaker Verification, Session ...

Joint Factor Analysis Versus Eigenchannels in Speaker Recognition

IEEE Transactions on Audio, Speech and Language Processing, 2000

We compare two approaches to the problem of session variability in GMM-based speaker verification, eigenchannels and joint factor analysis, on the NIST 2005 speaker recognition evaluation data. We show how the two approaches can be implemented using essentially the same software at all stages except for the enrollment of target speakers. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. We found that factor analysis was far more effective than eigenchannel modeling. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation.
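As a rough illustration of zt-norm, the sketch below applies z-norm (per-model normalization using impostor utterances) followed by t-norm (per-test normalization using cohort models). It assumes one common formulation of zt-norm and is not taken from the paper; the function name `zt_norm`, the argument layout and the random scores are hypothetical.

```python
import numpy as np

def zt_norm(raw, target_vs_z, cohort_vs_test, cohort_vs_z):
    """Apply z-norm followed by t-norm to a score matrix.

    raw:            (n_models, n_tests) target-model scores on test utterances
    target_vs_z:    (n_models, n_z)     target models scored on z-norm impostor utterances
    cohort_vs_test: (n_cohort, n_tests) t-norm cohort models scored on test utterances
    cohort_vs_z:    (n_cohort, n_z)     t-norm cohort models scored on z-norm utterances
    """
    def z_norm(scores, impostor_scores):
        # Normalize each model's scores by its own impostor score statistics.
        mu = impostor_scores.mean(axis=1, keepdims=True)
        sd = impostor_scores.std(axis=1, keepdims=True)
        return (scores - mu) / sd

    z_raw = z_norm(raw, target_vs_z)
    z_cohort = z_norm(cohort_vs_test, cohort_vs_z)
    # t-norm: normalize each test utterance by the z-normed cohort scores on it.
    mu_t = z_cohort.mean(axis=0, keepdims=True)
    sd_t = z_cohort.std(axis=0, keepdims=True)
    return (z_raw - mu_t) / sd_t

# Hypothetical random scores, for illustration only.
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 4))
target_vs_z = rng.normal(size=(3, 50))
cohort_vs_test = rng.normal(size=(20, 4))
cohort_vs_z = rng.normal(size=(20, 50))
normed = zt_norm(raw, target_vs_z, cohort_vs_test, cohort_vs_z)
```

In this formulation, any per-model shift or scaling of the raw scores cancels in the z-norm step, so different target models can share a single decision threshold.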

Improvements in Factor Analysis Based Speaker Verification

2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006

We present the results of speaker verification experiments conducted on the NIST 2005 evaluation data using a factor analysis of speaker and session variability in 6 telephone speech corpora distributed by the Linguistic Data Consortium. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation.

Support vector machines and Joint Factor Analysis for speaker verification

2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009

This article presents several techniques for combining support vector machines (SVMs) with the Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are applied to different sources of information produced by the JFA model: the Gaussian Mixture Model supervectors and the speaker and common factors. We found that applying SVMs to the JFA factors gave the best results, especially when within-class covariance normalization is applied to compensate for the channel effect. The results of the new combination are comparable to those of other classical JFA scoring techniques.
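Within-class covariance normalization (WCCN) can be sketched as follows, assuming its usual definition: estimate the average within-speaker covariance W of the factor vectors, then project with a matrix B satisfying B Bᵀ = W⁻¹. This is an illustration, not the authors' implementation; the helper names, the 5-dimensional "factors" and the speaker labels are hypothetical.

```python
import numpy as np

def within_class_cov(vectors, labels):
    """Average of the per-class (per-speaker) covariance matrices."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    dim = vectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        x = vectors[labels == c]
        centered = x - x.mean(axis=0)
        W += centered.T @ centered / len(x)
    return W / len(classes)

def wccn_matrix(vectors, labels):
    """Return B with B @ B.T == inv(W); apply it to row vectors as X @ B."""
    W = within_class_cov(vectors, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

# Hypothetical data: 4 speakers, 20 sessions each, 5-dimensional factors.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 20)
speaker_means = rng.normal(scale=3.0, size=(4, 5))
session_noise = rng.normal(size=(80, 5)) @ rng.normal(size=(5, 5))
factors = speaker_means[labels] + session_noise
B = wccn_matrix(factors, labels)
projected = factors @ B
```

After projection, the within-speaker covariance is the identity matrix, so session (channel) directions no longer dominate a linear or cosine kernel computed on the projected factors.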

The Geometry of the Channel Space in GMM-Based Speaker Recognition

2006 IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006

We describe an extension of the joint factor analysis model of speaker and channel variability in which channel supervectors are modeled by mixtures of low-rank Gaussians rather than by a unimodal Gaussian. This version of the joint factor analysis model includes data-driven feature mapping and the standard joint factor analysis model as limiting cases, and it enables us to explore a range of possibilities between these two extremes. Our experimental results indicate that unimodal models of relatively high rank perform better than mixture models of lower rank, and they confirm the appropriateness of the unimodal assumption in the standard joint factor analysis model.

The IIR NIST SRE 2008 and 2010 summed channel speaker recognition systems

2010

This paper reports the IIR speaker recognition system for the summed channel evaluation tasks in the NIST SRE 2008 and 2010. The system includes three main modules: voice activity detection, speaker diarization and speaker recognition. The front-end process employs a voice activity detection algorithm for effective speech frame selection. The speaker diarization system that was developed for the 2007 and 2009 NIST RT Evaluations is adopted for summed channel speech segmentation. A hybrid purifying and clustering algorithm is developed to segregate the summed channel speech by speaker. The GMM-SVM speaker recognition system is adopted to evaluate the performance with both MFCC and LPCC features. The system achieves an overall EER of 3.46% in the 1conv-summed task and 1.87% in the 8conv-summed task when only the all-English trials are considered.

A Joint Factor Analysis Approach to Progressive Model Adaptation in Text-Independent Speaker Verification

IEEE Transactions on Audio, Speech and Language Processing, 2000

This paper addresses the issue of speaker variability and session variability in text-independent Gaussian mixture model (GMM)-based speaker verification. A speaker model adaptation procedure is proposed which is based on a joint factor analysis approach to speaker verification. It is shown in this paper that this approach facilitates the implementation of a progressive unsupervised adaptation strategy which is able to produce an improved model of speaker identity while minimizing the influence of channel variability. The paper also deals with the interaction between this model adaptation approach and score normalization strategies which act to reduce the variation in likelihood ratio scores. This issue is particularly important in establishing decision thresholds in practical speaker verification systems since the variability of likelihood ratio scores can increase as a result of progressive model adaptation. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 Speaker Recognition Evaluation Plan, which is based on conversation sides derived from telephone speech utterances. It was found that when target speaker models were trained from a single conversation, an equal error rate (EER) of 4.5% was obtained under the NIST unsupervised speaker adaptation scenario.
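For readers unfamiliar with the metric, the equal error rate is the operating point at which the miss rate equals the false-alarm rate. The sketch below estimates it with a simple threshold sweep over the observed scores. It is a minimal illustration: the function name and the toy scores are hypothetical, and evaluation toolkits typically interpolate the error curves rather than pick the closest threshold.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    # A trial is accepted when its score is >= threshold.
    miss = np.array([(target < t).mean() for t in thresholds])           # rejected targets
    false_alarm = np.array([(impostor >= t).mean() for t in thresholds])  # accepted impostors
    i = np.argmin(np.abs(miss - false_alarm))     # threshold where the two rates are closest
    return (miss[i] + false_alarm[i]) / 2.0

# Toy scores (hypothetical): one overlapping case and one perfectly separated case.
eer_overlap = equal_error_rate([1.0, 3.0], [0.0, 2.0])
eer_separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```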

Analysis of Feature Extraction and Channel Compensation in a GMM Speaker Recognition System

IEEE Transactions on Audio, Speech and Language Processing, 2000

In this paper, several feature extraction and channel compensation techniques found in state-of-the-art speaker verification systems are analyzed and discussed. For the NIST SRE 2006 submission, Cepstral Mean Subtraction, Feature Warping, RASTA filtering, HLDA, Feature Mapping and Eigenchannel Adaptation were incrementally added to minimize the system's error rate. The paper deals with Eigenchannel Adaptation in more detail, and includes its theoretical background and implementation issues. The key part of the paper, however, is the post-evaluation analysis, which undermines a common myth that "the more boxes in the scheme, the better the system". All results are presented on NIST SRE 2005 and 2006 data.
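Of the techniques listed, Cepstral Mean Subtraction is the simplest to illustrate. A stationary convolutive channel appears as an approximately constant additive offset in the cepstral domain, so subtracting the per-utterance mean removes it. The sketch below is a minimal illustration with hypothetical names and synthetic 13-dimensional "cepstra"; it is not the system described in the paper.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: (n_frames, n_coefficients) array for a single utterance.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Synthetic example (hypothetical data): the same frames with and without
# a constant channel offset give identical features after CMS.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))
channel_offset = rng.normal(size=13)
compensated = cepstral_mean_subtraction(clean + channel_offset)
reference = cepstral_mean_subtraction(clean)
```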