A Study of Interspeaker Variability in Speaker Verification (original) (raw)

Joint Factor Analysis Versus Eigenchannels in Speaker Recognition

IEEE Transactions on Audio, Speech and Language Processing, 2000

We compare two approaches to the problem of session variability in GMM-based speaker verification, eigenchannels and joint factor analysis, on the NIST 2005 speaker recognition evaluation data. We show how the two approaches can be implemented using essentially the same software at all stages except for the enrollment of target speakers. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. We found that factor analysis was far more effective than eigenchannel modeling. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation.

Speaker and Session Variability in GMM-Based Speaker Verification

IEEE Transactions on Audio, Speech and Language Processing, 2000

We present a corpus-based approach to speaker verification in which maximum likelihood II criteria are used to train a large scale generative model of speaker and session variability which we call joint factor analysis. Enrolling a target speaker consists in calculating the posterior distribution of the hidden variables in the factor analysis model and verification tests are conducted using a new type of likelihood II ratio statistic. Using the NIST 1999 and 2000 speaker recognition evaluation data sets, we show that the effectiveness of this approach depends on the availability of a training corpus which is well matched with the evaluation set used for testing. Experiments on the NIST 1999 evaluation set using a mismatched corpus to train factor analysis models did not result in any improvement over standard methods but we found that, even with this type of mismatch, feature warping performs extremely well in conjunction with the factor analysis model and this enabled us to obtain very good results (equal error rates of about 6.2%).

A Joint Factor Analysis Approach to Progressive Model Adaptation in Text-Independent Speaker Verification

IEEE Transactions on Audio, Speech and Language Processing, 2000

This paper addresses the issue of speaker variability and session variability in text-independent Gaussian mixture model (GMM)-based speaker verification. A speaker model adaptation procedure is proposed which is based on a joint factor analysis approach to speaker verification. It is shown in this paper that this approach facilitates the implementation of a progressive unsupervised adaptation strategy which is able to produce an improved model of speaker identity while minimizing the influence of channel variability. The paper also deals with the interaction between this model adaptation approach and score normalization strategies which act to reduce the variation in likelihood ratio scores. This issue is particularly important in establishing decision thresholds in practical speaker verification systems since the variability of likelihood ratio scores can increase as a result of progressive model adaptation. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 Speaker Recognition Evaluation Plan, which is based on conversation sides derived from telephone speech utterances. It was found that when target speaker models were trained from a single conversation, an equal error rate (EER) of 4.5% was obtained under the NIST unsupervised speaker adaptation scenario.

Acoustic Factor Analysis for Robust Speaker Verification

Factor analysis based channel mismatch compensation methods for speaker recognition are based on the assumption that speaker/utterance dependent Gaussian Mixture Model (GMM) mean super-vectors can be constrained to reside in a lower dimensional subspace. This approach does not consider the fact that conventional acoustic feature vectors also reside in a lower dimensional manifold of the feature space, when feature covariance matrices contain close to zero eigenvalues. In this study, based on observations of the covariance structure of acoustic features, we propose a factor analysis modeling scheme in the acoustic feature space instead of the super-vector space and derive a mixture dependent feature transformation. We demonstrate how this single linear transformation performs feature dimensionality reduction, de-correlation, normalization and enhancement, at once. The proposed transformation is shown to be closely related to signal subspace based speech enhancement schemes. In contrast to traditional front-end mixture dependent feature transformations, where feature alignment is performed using the highest scoring mixture, the proposed transformation is integrated within the speaker recognition system using a probabilistic feature alignment technique, which nullifies the need for regenerating the features/retraining the Universal Background Model (UBM). Incorporating the proposed method with a state-of-the-art i-vector and Gaussian Probabilistic Linear Discriminant Analysis (PLDA) framework, we perform evaluations on National Institute of Science and Technology (NIST) Speaker Recognition Evaluation (SRE) 2010 core telephone and microphone tasks. The experimental results demonstrate the superiority of the proposed scheme compared to both full-covariance and diagonal covariance UBM based systems. Simple equal-weight fusion of baseline and proposed systems also yield significant performance gains.

A straightforward and efficient implementation of the factor analysis model for speaker verification

2007

For a few years, the problem of session variability in text- independent automatic speaker verification is being tackled ac- tively. A new paradigm based on a factor analysis model have successfully been applied for this task. While very efficient, its implementation is demanding. In this paper, the algorithms in- volved in the eigenchannel MAP model are written down for a straightforward implementation, without referring to previous work or complex mathematics. In addition, a different com- pensation scheme is proposed where the standard GMM likeli- hood can be used without any modification to obtain good per- formance (even without the need of score normalization). The use of the compensated supervectors within a SVM classifier through a distance based kernel is also investigated. Experi- ments results shows an overall 50% relative gain over the stan- dard GMM-UBM system on NIST SRE 2005 and 2006 proto- cols (both at the DCFmin and EER). Index Terms: Speaker Verification, Session ...

Within-session variability modelling for factor analysis speaker verification

2009

Abstract This work presents an extended Joint Factor Analysis model including explicit modelling of unwanted within-session variability. The goals of the proposed extended JFA model are to improve verification performance with short utterances by compensating for the effects of limited or imbalanced phonetic coverage, and to produce a flexible JFA model that is effective over a wide range of utterance lengths without adjusting model parameters such as retraining session subspaces. Experimental results on the 2006 NIST SRE corpus demonstrate the flexibility of the proposed model by providing competitive results over a wide range of utterance lengths without retraining and also yielding modest improvements in a number of conditions over current state-of-the-art.

A robust speaker recognition system combining factor analysis techniques

2014 21th Iranian Conference on Biomedical Engineering (ICBME), 2014

in this paper we implement state of the art factor analysis based methods and fused their scores to gain a channel robust speaker recognition system. These two methods are joint factor analysis (JFA) and i-Vector which define low-dimensional speaker and channel dependent spaces. For score fusion we propose a simple weight computation without training step. We experiment our method on two conditions; 1) in channel matched training and test channel (telephone in training phase/telephone in test phase) task and 2) the channel mismatched condition (telephone training phase/microphone, GSM and VOIP in test phase) task. Our strategies outperform a state-of-the-art GMM-UBM based system. We obtained more than 4% absolute EER improvement for both channel dependent and channel independent condition compared to the standard GMM-UBM based method. Simulation also results that the combined i-Vector and JFA based system give better performance than all implemented method.

Improvements in Factor Analysis Based Speaker Verification

2006 IEEE International Conference on Acoustics Speed and Signal Processing Proceedings, 2006

We present the results of speaker verification experiments conducted on the NIST 2005 evaluation data using a factor analysis of speaker and session variability in 6 telephone speech corpora distributed by the Linguistic Data Consortium. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation.

Support vector machines and Joint Factor Analysis for speaker verification

2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009

This article presents several techniques to combine between Support vector machines (SVM) and Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are applied to different sources of information produced by the JFA. These informations are the Gaussian Mixture Model supervectors and speakers and Common factors. We found that using SVM in JFA factors gave the best results especially when within class covariance normalization method is applied in order to compensate for the channel effect. The new combination results are comparable to other classical JFA scoring techniques.