Joint Factor Analysis Versus Eigenchannels in Speaker Recognition

Speaker and Session Variability in GMM-Based Speaker Verification

IEEE Transactions on Audio, Speech and Language Processing, 2000

We present a corpus-based approach to speaker verification in which maximum likelihood II criteria are used to train a large scale generative model of speaker and session variability which we call joint factor analysis. Enrolling a target speaker consists in calculating the posterior distribution of the hidden variables in the factor analysis model and verification tests are conducted using a new type of likelihood II ratio statistic. Using the NIST 1999 and 2000 speaker recognition evaluation data sets, we show that the effectiveness of this approach depends on the availability of a training corpus which is well matched with the evaluation set used for testing. Experiments on the NIST 1999 evaluation set using a mismatched corpus to train factor analysis models did not result in any improvement over standard methods but we found that, even with this type of mismatch, feature warping performs extremely well in conjunction with the factor analysis model and this enabled us to obtain very good results (equal error rates of about 6.2%).
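For orientation, the joint factor analysis model is commonly written in the later literature as a decomposition of the speaker- and session-dependent GMM mean supervector. The abstract above does not define any symbols, so the notation below is an assumption taken from the usual presentation rather than from this text:

```latex
% M : speaker- and session-dependent GMM mean supervector
% m : speaker- and session-independent (UBM) supervector
% V : low-rank eigenvoice matrix,    y : speaker factors
% U : low-rank eigenchannel matrix,  x : channel (session) factors
% D : diagonal residual matrix,      z : speaker-specific residual
M = m + V y + U x + D z,
\qquad y,\; x,\; z \sim \mathcal{N}(0, I)
```

Under this reading, enrolling a speaker amounts to computing the posterior distribution of the hidden variables y and z (and x for each enrollment session), and the hyperparameters m, V, U, D are what the training corpus is used to estimate.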

A straightforward and efficient implementation of the factor analysis model for speaker verification

2007

For several years, the problem of session variability in text-independent automatic speaker verification has been actively tackled. A new paradigm based on a factor analysis model has been successfully applied to this task. While very efficient, its implementation is demanding. In this paper, the algorithms involved in the eigenchannel MAP model are written down for a straightforward implementation, without referring to previous work or complex mathematics. In addition, a different compensation scheme is proposed where the standard GMM likelihood can be used without any modification to obtain good performance (even without the need for score normalization). The use of the compensated supervectors within an SVM classifier through a distance-based kernel is also investigated. Experimental results show an overall 50% relative gain over the standard GMM-UBM system on the NIST SRE 2005 and 2006 protocols (both at the DCFmin and EER). Index Terms: Speaker Verification, Session ...

Improvements in Factor Analysis Based Speaker Verification

2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006

We present the results of speaker verification experiments conducted on the NIST 2005 evaluation data using a factor analysis of speaker and session variability in 6 telephone speech corpora distributed by the Linguistic Data Consortium. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation.

The Geometry of the Channel Space in GMM-Based Speaker Recognition

2006 IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006

We describe an extension of the joint factor analysis model of speaker and channel variability in which channel supervectors are modeled by mixtures of low-rank Gaussians rather than by a unimodal Gaussian. This version of the joint factor analysis model includes data-driven feature mapping and the standard joint factor analysis models as limiting cases and it enables us to explore a range of possibilities between these two extremes. Our experimental results indicate that unimodal models of relatively high rank perform better than mixture models of lower rank and they confirm the appropriateness of the unimodal assumption in the standard joint factor analysis model.

A Joint Factor Analysis Approach to Progressive Model Adaptation in Text-Independent Speaker Verification

IEEE Transactions on Audio, Speech and Language Processing, 2000

This paper addresses the issue of speaker variability and session variability in text-independent Gaussian mixture model (GMM)-based speaker verification. A speaker model adaptation procedure is proposed which is based on a joint factor analysis approach to speaker verification. It is shown in this paper that this approach facilitates the implementation of a progressive unsupervised adaptation strategy which is able to produce an improved model of speaker identity while minimizing the influence of channel variability. The paper also deals with the interaction between this model adaptation approach and score normalization strategies which act to reduce the variation in likelihood ratio scores. This issue is particularly important in establishing decision thresholds in practical speaker verification systems since the variability of likelihood ratio scores can increase as a result of progressive model adaptation. These adaptation methods have been evaluated under the adaptation paradigm defined under the NIST 2005 Speaker Recognition Evaluation Plan, which is based on conversation sides derived from telephone speech utterances. It was found that when target speaker models were trained from a single conversation, an equal error rate (EER) of 4.5% was obtained under the NIST unsupervised speaker adaptation scenario.
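Several of the results quoted on this page are equal error rates. As a reminder of what that figure measures, here is a minimal sketch of an EER estimate from two lists of scores. It is a simple threshold sweep written for illustration, not the NIST scoring tool, and the function name and the convention "accept when score >= threshold" are assumptions.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold (assumed convention).
    Returns the average of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False-reject rate: genuine trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False-accept rate: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.7, 0.3]
    impostors = [0.6, 0.4, 0.2, 0.1]
    print(equal_error_rate(targets, impostors))
```

With few trials the two error rates may never coincide exactly, which is why the sketch averages them at the closest point; production scoring tools typically interpolate instead.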

A robust speaker recognition system combining factor analysis techniques

2014 21th Iranian Conference on Biomedical Engineering (ICBME), 2014

In this paper we implement state-of-the-art factor analysis based methods and fuse their scores to obtain a channel-robust speaker recognition system. The two methods are joint factor analysis (JFA) and i-Vector, which define low-dimensional speaker- and channel-dependent spaces. For score fusion we propose a simple weight computation that requires no training step. We evaluate our method under two conditions: 1) a channel-matched task (telephone in the training phase, telephone in the test phase) and 2) a channel-mismatched task (telephone in the training phase; microphone, GSM and VOIP in the test phase). Our strategies outperform a state-of-the-art GMM-UBM based system. We obtained more than 4% absolute EER improvement in both the channel-dependent and channel-independent conditions compared to the standard GMM-UBM based method. Simulation results also show that the combined i-Vector and JFA based system performs better than all the other implemented methods.

Acoustic Factor Analysis for Robust Speaker Verification

Factor analysis based channel mismatch compensation methods for speaker recognition are based on the assumption that speaker/utterance dependent Gaussian Mixture Model (GMM) mean super-vectors can be constrained to reside in a lower dimensional subspace. This approach does not consider the fact that conventional acoustic feature vectors also reside in a lower dimensional manifold of the feature space, when feature covariance matrices contain close to zero eigenvalues. In this study, based on observations of the covariance structure of acoustic features, we propose a factor analysis modeling scheme in the acoustic feature space instead of the super-vector space and derive a mixture dependent feature transformation. We demonstrate how this single linear transformation performs feature dimensionality reduction, de-correlation, normalization and enhancement, at once. The proposed transformation is shown to be closely related to signal subspace based speech enhancement schemes. In contrast to traditional front-end mixture dependent feature transformations, where feature alignment is performed using the highest scoring mixture, the proposed transformation is integrated within the speaker recognition system using a probabilistic feature alignment technique, which nullifies the need for regenerating the features/retraining the Universal Background Model (UBM). Incorporating the proposed method with a state-of-the-art i-vector and Gaussian Probabilistic Linear Discriminant Analysis (PLDA) framework, we perform evaluations on National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2010 core telephone and microphone tasks. The experimental results demonstrate the superiority of the proposed scheme compared to both full-covariance and diagonal covariance UBM based systems. Simple equal-weight fusion of baseline and proposed systems also yields significant performance gains.

Factored covariance modeling for text-independent speaker verification

International Conference on Acoustics, Speech, and Signal Processing, 2011

Gaussian mixture models (GMMs) are commonly used to model the spectral distribution of speech signals for text-independent speaker verification. Mean vectors of the GMM, used in conjunction with a support vector machine (SVM), have been shown to be effective in characterizing speaker information. In addition to the mean vectors, covariance matrices capture the correlation between spectral features, which also represents some salient information about speaker identity. This paper investigates the use of local correlation between different dimensions of the acoustic vector by using factor analysis and a linear Gaussian model. A log-Euclidean inner product kernel is used to measure the similarity between two speech utterances in the form of covariance matrices. Experiments carried out on NIST 2006 speaker verification tasks show promising results.
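The abstract does not spell out the kernel. In its standard form, a log-Euclidean inner product between two symmetric positive-definite matrices is the Frobenius inner product of their matrix logarithms. The expression below is that textbook form, given as an assumption about what is meant; the paper's exact kernel (for example, any per-mixture weighting or summation) may differ.

```latex
% \Sigma_a, \Sigma_b : covariance matrices (symmetric positive definite)
%                      representing two utterances
% \log(\cdot)        : matrix logarithm, computed from the eigendecomposition
K(\Sigma_a, \Sigma_b) = \operatorname{tr}\!\bigl[\log(\Sigma_a)\,\log(\Sigma_b)\bigr],
\qquad
\log(\Sigma) = Q\,\operatorname{diag}(\log\lambda_1,\dots,\log\lambda_d)\,Q^{\top}
\quad\text{for}\quad
\Sigma = Q\,\operatorname{diag}(\lambda_1,\dots,\lambda_d)\,Q^{\top}
```

Mapping covariances through the matrix logarithm places them in a vector space, so an ordinary inner product there yields a valid kernel for an SVM.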

A Study of Interspeaker Variability in Speaker Verification

IEEE Transactions on Audio, Speech, and Language Processing, 2000

We propose a new approach to the problem of estimating the hyperparameters which define the inter-speaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10-15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task.

Speaker Recognition on Single- and Multispeaker Data

Digital Signal Processing, 2000

We discuss Dragon Systems' approach to the NIST Speaker Recognition tasks. For the one-speaker task, we employ a combination of methods: a basic GMM system and two LVCSR-based systems, one using standard mixture models and the other using nonparametric techniques. We discuss some explorations of the recently introduced two-speaker tasks based on the GMM system alone. "Cheating" tests using NIST-supplied keys lead us to some improvements in channel normalization, and illuminate the roles that speaker segmentation and segment selection play in these tasks.