Improving Short Utterance based I-vector Speaker Recognition using Source and Utterance-Duration Normalization Techniques

Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques

Speech Communication, 2014

This paper proposes techniques to improve the performance of i-vector based speaker verification systems when only short utterances are available. Short-length utterance i-vectors vary with speaker and session variations, as well as with the phonetic content of the utterance. Well-established methods such as linear discriminant analysis (LDA), source-normalized LDA (SN-LDA) and within-class covariance normalisation (WCCN) exist for compensating the session variation, but we have identified the variability introduced by phonetic content, due to utterance variation, as an additional source of degradation when short-duration utterances are used. To compensate for utterance variations in short i-vector speaker verification systems using cosine similarity scoring (CSS), we introduce a short utterance variance normalization (SUVN) technique and a short utterance variance (SUV) modelling approach at the i-vector feature level. A combination of SUVN with LDA and SN-LDA is proposed to compensate for the session and utterance variations and is shown to provide an improvement in performance over the traditional approach of using LDA and/or SN-LDA followed by WCCN. An alternative approach is also introduced that uses probabilistic linear discriminant analysis (PLDA) to directly model the SUV. The combination of SUVN, LDA and SN-LDA followed by SUV PLDA modelling provides an improvement over the baseline PLDA approach. We also show that, for this combination of techniques, the utterance variation information needs to be artificially added to full-length i-vectors for PLDA modelling.
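To make the cosine similarity scoring (CSS) back-end mentioned above concrete, here is a minimal NumPy sketch. The optional projection matrix `P` stands in for whatever compensation transform (LDA, SN-LDA, SUVN) has been trained; all names are illustrative rather than taken from the paper.

```python
import numpy as np

def cosine_similarity_score(w_target, w_test, P=None):
    """Cosine similarity scoring (CSS) between two i-vectors.

    P is an optional compensation projection (e.g. an LDA or SN-LDA
    matrix of shape (dim, reduced_dim)); when given, both i-vectors
    are projected before length normalisation and comparison."""
    if P is not None:
        w_target, w_test = P.T @ w_target, P.T @ w_test
    w_target = w_target / np.linalg.norm(w_target)
    w_test = w_test / np.linalg.norm(w_test)
    return float(w_target @ w_test)
```

Verification then reduces to thresholding this score: values near 1 indicate i-vectors pointing in the same direction of the (compensated) space.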

I-Vector Based Speaker Recognition on Short Utterances

2011

Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real-world applications often have access to only limited-duration speech data. This paper explores how the recent technologies centred around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems, including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 sec of data, taken from the NIST Speaker Recognition Evaluations, is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.

Duration mismatch compensation for i-vector based speaker recognition systems

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

We analyze the effect of duration variability on the phoneme distribution of speech utterances and on the i-vector space. It is known that systems trained on full-duration speech segments perform significantly worse when short-duration utterances are encountered during testing. We model the duration variability as an additive noise model and propose three different strategies to compensate for this mismatch: using multi-duration training in a Probabilistic Linear Discriminant Analysis (PLDA) model, score-domain compensation using Quality Measure Function (QMF) calibration, and multi-duration PLDA training with synthesized short-duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) target speakers and protocol, systematically varying the test utterance duration. The results obtained using the proposed techniques show significant improvement in system performance with respect to the SRE'12 primary cost function, especially when using the QMF calibration in short-duration test conditions.
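The third strategy, synthesizing short-duration i-vectors for multi-duration PLDA training, can be sketched under the additive-noise view of duration variability described above. This is an illustrative sketch, not the authors' implementation; the diagonal noise covariance is assumed to have been estimated elsewhere (e.g. from paired long/short i-vectors), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_short_ivectors(full_ivectors, noise_cov_diag, n_copies=1):
    """Synthesize pseudo short-duration i-vectors by perturbing
    full-length i-vectors with additive Gaussian noise, following the
    additive-noise model of duration variability.

    full_ivectors: (n, dim) array of full-length i-vectors.
    noise_cov_diag: assumed diagonal noise covariance, shape (dim,).
    n_copies: how many noisy copies to generate per i-vector."""
    n, d = full_ivectors.shape
    out = []
    for _ in range(n_copies):
        noise = rng.normal(0.0, np.sqrt(noise_cov_diag), size=(n, d))
        out.append(full_ivectors + noise)
    return np.vstack(out)
```

The synthesized vectors would then be pooled with genuine short-utterance i-vectors when training the multi-duration PLDA model.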

Alleviating the small sample-size problem in i-vector based speaker verification

2012 8th International Symposium on Chinese Spoken Language Processing, 2012

This paper investigates the small sample-size problem in i-vector based speaker verification systems. The idea of i-vectors is to represent the characteristics of speakers in the factors of a factor analyzer. Because the factor loading matrix defines the possible speaker and channel variability of i-vectors, it is important to suppress the unwanted channel variability. Linear discriminant analysis (LDA), within-class covariance normalization (WCCN), and probabilistic LDA are commonly used for this purpose. These methods, however, require training data comprising many speakers, each providing sufficient recording sessions, for good performance. Performance will suffer when the number of speakers and/or the number of sessions per speaker is too small. This paper compares four approaches to addressing this small sample-size problem: (1) preprocessing the i-vectors by PCA before applying LDA (PCA+LDA), (2) replacing the matrix inverse in LDA by the pseudo-inverse, (3) applying multi-way LDA by exploiting the microphone and speaker labels of the training data, and (4) increasing the matrix rank in LDA by generating more i-vectors using utterance partitioning. Results based on the NIST 2010 SRE suggest that utterance partitioning performs the best, followed by multi-way LDA and PCA+LDA.
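Approach (1), PCA preprocessing before LDA, amounts to projecting i-vectors onto the leading principal components so that the subsequent within-class scatter matrix becomes invertible. A minimal sketch (function name and interface are illustrative, not from the paper):

```python
import numpy as np

def pca_projection(X, n_components):
    """Return the top-n_components principal directions of X.

    Projecting i-vectors onto these directions before LDA reduces the
    dimensionality so the within-class scatter matrix is no longer
    rank-deficient when speakers/sessions are scarce."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)           # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order]                      # (dim, n_components)
```

LDA would then be trained on `X @ P` rather than on the raw i-vectors.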

Weighted LDA Techniques for I-vector based Speaker Verification

IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012

This paper introduces the Weighted Linear Discriminant Analysis (WLDA) technique, based upon the weighted pairwise Fisher criterion, for the purposes of improving i-vector speaker verification in the presence of high intersession variability. By taking advantage of the speaker discriminative information that is available in the distances between pairs of speakers clustered in the development i-vector space, the WLDA technique is shown to provide an improvement in speaker verification performance over traditional Linear Discriminant Analysis (LDA) approaches. A similar approach is also taken to extend the recently developed Source Normalised LDA (SNLDA) into Weighted SNLDA (WSNLDA) which, similarly, shows an improvement in speaker verification performance in both matched and mismatched enrolment/verification conditions. Based upon the results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset, we believe that both WLDA and WSNLDA are viable as replacement techniques to improve the performance of LDA and SNLDA-based i-vector speaker verification.
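The weighted pairwise Fisher criterion replaces LDA's between-class scatter with a weighted sum over speaker pairs, so that closely spaced (hard-to-separate) speaker pairs dominate the projection. A sketch under an assumed 1/d² weighting of the pairwise Euclidean distance, which is one common choice; the paper's exact weighting function may differ.

```python
import numpy as np

def weighted_between_class_scatter(means, priors, weight_fn=None):
    """Weighted pairwise between-class scatter for WLDA-style training.

    means: (n_speakers, dim) class means in the development i-vector space.
    priors: (n_speakers,) class priors.
    weight_fn: decreasing function of pairwise distance; closer speaker
    pairs (harder to separate) receive larger weights."""
    if weight_fn is None:
        weight_fn = lambda d: 1.0 / (d * d)    # assumed weighting
    dim = means.shape[1]
    Sb = np.zeros((dim, dim))
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            diff = means[i] - means[j]
            d = np.linalg.norm(diff)
            Sb += priors[i] * priors[j] * weight_fn(d) * np.outer(diff, diff)
    return Sb
```

The resulting scatter matrix is substituted for the ordinary between-class scatter when solving the LDA eigenproblem.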

I-vectors in the context of phonetically-constrained short utterances for speaker verification

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

Short speech duration remains a critical factor of performance degradation when deploying a speaker verification system. To overcome this difficulty, a large number of commercial applications impose the use of fixed pass-phrases. In this context, we show that the performance of the popular i-vector approach can be greatly improved by taking advantage of the phonetic information that they convey. Moreover, as i-vectors require a conditioning process to reach high accuracy, we show that further improvements are possible by taking advantage of this phonetic information within the normalisation process. We compare two methods, Within Class Covariance Normalization (WCCN) and Eigen Factor Radial (EFR), both relying on parameters estimated on the same development data. Our study suggests that WCCN is more robust to data mismatch but less efficient than EFR when the development data has a better match with the test data.
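The two conditioning methods compared here can be sketched as follows: WCCN derives a projection from the inverse of the average within-class covariance, while EFR iteratively whitens with the total covariance and length-normalises. This is an illustrative NumPy sketch, not the authors' code.

```python
import numpy as np

def wccn_matrix(ivectors, labels):
    """WCCN projection: Cholesky factor B with B @ B.T = inv(W),
    where W is the average within-class covariance."""
    dim = ivectors.shape[1]
    classes = np.unique(labels)
    W = np.zeros((dim, dim))
    for c in classes:
        W += np.cov(ivectors[labels == c].T, bias=True)
    W /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(W))

def efr_normalize(ivectors, n_iter=2):
    """Eigen Factor Radial (EFR): repeatedly whiten with the total
    covariance, then project onto the unit sphere."""
    X = ivectors.astype(float)
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        cov = np.cov((X - mu).T, bias=True)
        vals, vecs = np.linalg.eigh(cov)       # inverse square root of cov
        inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
        X = (X - mu) @ inv_sqrt.T
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X
```

Both transforms are estimated on the same development data, which is exactly why their relative robustness to development/test mismatch matters in the study above.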

Improved speaker recognition when using i-vectors from multiple speech sources

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011

The concept of speaker recognition using i-vectors was recently introduced, offering state-of-the-art performance. An i-vector is a compact representation of a speaker's utterance after projection into a low-dimensional, total variability subspace trained using factor analysis. A secondary process involving linear discriminant analysis (LDA) is then used to improve the discrimination of i-vectors from different speakers. The newness of this technology raises the question of how best to train the total variability subspace and LDA matrix when using speech collected from distinctly different sources. This paper presents a comparative study of a number of subspace training techniques and a novel source-normalised-and-weighted LDA algorithm for the purpose of improving i-vector-based speaker recognition under mismatched evaluation conditions. Results from the NIST 2010 speaker recognition evaluation (SRE) suggest that accounting for source conditions in the LDA matrix, as opposed to the total variability subspace training regime, provides improved robustness to mismatched evaluation conditions.

Probabilistic approach using joint long and short session i-vectors modeling to deal with short utterances for speaker recognition

Speaker recognition with short utterances is highly challenging. The use of i-vectors in speaker recognition (SR) systems became a standard in recent years, and many algorithms have been developed to deal with the short-utterance problem. We present in this paper a new technique based on jointly modeling the i-vectors corresponding to short utterances and those of long utterances. The joint distribution is estimated using a large number of i-vector pairs (coming from short and long utterances) corresponding to the same session. The obtained distribution is then integrated in an MMSE estimator in the test phase to compute an "improved" version of short-utterance i-vectors. We show that this technique can be used to deal with duration mismatch and that it achieves up to 40% relative improvement in EER when used on NIST data. We also apply this technique on the recently published SITW database and show that it yields a 25% relative improvement in EER compared to regular PLDA scoring.
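Under a joint Gaussian assumption over paired (short, long) i-vectors, the MMSE estimate of the "improved" short-utterance i-vector is simply the conditional mean E[long | short]. The sketch below uses illustrative names and a plain sample-covariance fit; the paper's estimation details may differ.

```python
import numpy as np

def fit_joint_gaussian(short_iv, long_iv):
    """Fit a joint Gaussian over paired (short, long) i-vectors from the
    same sessions; returns the pieces needed by the MMSE estimator."""
    Z = np.hstack([short_iv, long_iv])
    mu = Z.mean(axis=0)
    C = np.cov(Z.T, bias=True)
    d = short_iv.shape[1]
    return mu[:d], mu[d:], C[:d, :d], C[d:, :d]   # mu_s, mu_l, C_ss, C_ls

def mmse_improve(short_vec, mu_s, mu_l, C_ss, C_ls):
    """MMSE 'improved' i-vector: the conditional mean
    E[long | short] = mu_l + C_ls C_ss^{-1} (short - mu_s)."""
    return mu_l + C_ls @ np.linalg.solve(C_ss, short_vec - mu_s)
```

At test time, each short-utterance i-vector is mapped through `mmse_improve` before PLDA scoring, reducing the duration mismatch against full-length enrolment data.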

Boosting the Performance of I-Vector Based Speaker Verification via Utterance Partitioning

IEEE Transactions on Audio, Speech, and Language Processing, 2013

The success of the recent i-vector approach to speaker verification relies on the capability of i-vectors to capture speaker characteristics and on the subsequent channel compensation methods to suppress channel variability. Typically, given an utterance, an i-vector is determined from the utterance regardless of its length. This paper investigates how the utterance length affects the discriminative power of i-vectors and demonstrates that the discriminative power of i-vectors reaches a plateau quickly as the utterance length increases. This observation suggests that it is possible to make the best use of a long conversation by partitioning it into a number of sub-utterances so that more i-vectors can be produced for each conversation. To increase the number of sub-utterances without sacrificing the representation power of the corresponding i-vectors, repeated applications of frame-index randomization and utterance partitioning are applied. Results on the NIST 2010 speaker recognition evaluation (SRE) suggest that (1) using more i-vectors per conversation can help to find more robust linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) transformation matrices, especially when the number of conversations per training speaker is limited; and (2) increasing the number of i-vectors per target speaker helps the i-vector based support vector machines (SVMs) to find better decision boundaries, thus making SVM scoring outperform cosine distance scoring by 19% and 9% in terms of minimum normalized DCF and EER.
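Utterance partitioning with frame-index randomization can be sketched as follows: shuffle the frame order, split into equal-length sub-utterances, and repeat for several rounds, so each conversation yields many segments (each of which would then be fed to the i-vector extractor). Names and parameters below are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

def partition_with_randomization(frames, n_partitions, n_rounds=2):
    """Utterance partitioning with acoustic-vector (frame-index)
    randomization: because i-vector extraction ignores frame order,
    shuffling before partitioning lets every sub-utterance sample the
    whole conversation's phonetic content."""
    subs = [frames]                      # keep the full-length utterance too
    for _ in range(n_rounds):
        idx = rng.permutation(len(frames))
        for chunk in np.array_split(idx, n_partitions):
            subs.append(frames[chunk])
    return subs

# 100 frames of 20-dim features, 4 partitions, 2 rounds -> 1 + 8 segments
frames = rng.normal(size=(100, 20))
segments = partition_with_randomization(frames, n_partitions=4, n_rounds=2)
```

The enlarged set of per-speaker i-vectors is what raises the matrix rank in LDA/WCCN estimation and gives the SVM more training points per target.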

Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011

The recently developed i-vector framework for speaker recognition has set a new performance standard in the research field. An i-vector is a compact representation of a speaker utterance extracted from a low-dimensional total variability subspace. Prior to classification using a cosine kernel, i-vectors are projected into an LDA space in order to reduce inter-session variability and enhance speaker discrimination. The accurate estimation of this LDA space from a training dataset is crucial to classification performance. A typical training dataset, however, does not consist of utterances acquired from all sources of interest (i.e., telephone, microphone and interview speech sources) for each speaker. This has the effect of introducing source-related variation in the between-speaker covariance matrix and results in an incomplete representation of the within-speaker scatter matrix used for LDA.
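The within- and between-speaker scatter matrices discussed here are the two ingredients of the standard LDA projection that the source-normalised variant modifies. A generic sketch of that baseline computation (illustrative names; this is ordinary LDA, not the paper's source-normalised version):

```python
import numpy as np

def lda_projection(ivectors, labels, n_dims):
    """Standard LDA for i-vectors: find directions maximising
    between-speaker scatter S_b against within-speaker scatter S_w
    via the generalised eigenproblem inv(S_w) S_b."""
    mu = ivectors.mean(axis=0)
    dim = ivectors.shape[1]
    Sb = np.zeros((dim, dim))
    Sw = np.zeros((dim, dim))
    for spk in np.unique(labels):
        X = ivectors[labels == spk]
        mu_s = X.mean(axis=0)
        Sb += len(X) * np.outer(mu_s - mu, mu_s - mu)
        Sw += (X - mu_s).T @ (X - mu_s)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1][:n_dims]
    return vecs[:, order].real           # (dim, n_dims) projection
```

When each speaker is recorded from only some sources, `Sb` absorbs source-related variation and `Sw` under-represents it; source normalisation adjusts exactly these two estimates.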