Nonlinear discriminant feature extraction for robust text-independent speaker recognition
Related papers
PCA/LDA approach for text-independent speaker recognition
Independent Component Analyses, Compressive Sampling, Wavelets, Neural Net, Biosystems, and Nanoengineering X, 2012
Various algorithms for text-independent speaker recognition have been developed over the decades, aiming to improve both accuracy and efficiency. This paper presents a novel PCA/LDA-based approach that is faster than traditional statistical model-based methods and achieves competitive results. First, the performance of PCA alone and of LDA alone is measured; then a mixed model, taking advantage of both methods, is introduced. A subset of the TIMIT corpus, composed of 200 male speakers, is used for enrollment, validation and testing. The best results achieve 100%, 96% and 95% classification rates at population sizes of 50, 100 and 200 speakers, using 39-dimensional MFCC features with delta and double-delta coefficients. These results are based on 12 seconds of text-independent speech for training and 4 seconds for testing. They are comparable to conventional MFCC-GMM methods, but require significantly less time to train and operate.
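As a rough illustration of such a pipeline, the sketch below chains PCA and LDA over MFCC-like frames with scikit-learn; the data, speaker count and component counts are invented stand-ins, not the paper's setup.

```python
# Minimal sketch of a PCA-then-LDA projection for speaker features,
# assuming 39-dimensional MFCC+delta+delta-delta frames and integer
# speaker labels; all shapes here are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_speakers, frames_per_spk, dim = 50, 200, 39
X = rng.standard_normal((n_speakers * frames_per_spk, dim))
y = np.repeat(np.arange(n_speakers), frames_per_spk)

# PCA decorrelates and compresses; LDA then seeks the speaker-
# discriminative directions (at most n_speakers - 1 of them).
pca = PCA(n_components=30).fit(X)
X_pca = pca.transform(X)
lda = LinearDiscriminantAnalysis(n_components=20).fit(X_pca, y)

Z = lda.transform(X_pca)   # final speaker-discriminative features
print(Z.shape)             # (10000, 20)
```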
Speaker-Aware Linear Discriminant Analysis in Speaker Verification
Interspeech 2020
Linear discriminant analysis (LDA) is an effective and widely used discriminative technique for speaker verification. However, it utilizes only information on the global structure to perform classification. Some variants of LDA, such as local pairwise LDA (LPLDA), have been proposed to preserve more information on the local structure in the linear projection matrix. However, since the local structure may vary considerably across regions, summing up related components to construct a single projection matrix may not be sufficient. In this paper, we present a speaker-aware strategy that preserves distinct information on local structure in a set of linear discriminant projection matrices and allocates them to different local regions for dimension reduction and classification. Experiments on NIST SRE2010 and NIST SRE2016 show that the speaker-aware strategy can boost the performance of both LDA and LPLDA backends in i-vector and x-vector systems.
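The allocation strategy can be caricatured as: partition the vector space into regions, fit one discriminant projection per region, and project each vector with its own region's matrix. The sketch below, using KMeans plus a per-cluster LDA, is a simplified reading of that idea with invented sizes, not the authors' exact algorithm.

```python
# Simplified illustration of region-specific discriminant projections:
# cluster the vectors, fit one LDA per cluster, and project each vector
# with the matrix of its own region. Not the paper's exact method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.standard_normal((3000, 100))      # e.g. i-vectors or x-vectors
y = rng.integers(0, 20, size=3000)        # speaker labels

km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)
local_lda = {}
for k in range(km.n_clusters):
    idx = km.labels_ == k
    # each local LDA is trained only on the vectors of its region
    local_lda[k] = LinearDiscriminantAnalysis(n_components=10).fit(X[idx], y[idx])

def project(v):
    k = int(km.predict(v.reshape(1, -1))[0])   # choose the local region
    return local_lda[k].transform(v.reshape(1, -1))

print(project(X[0]).shape)                     # (1, 10)
```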
Application of LDA to speaker recognition
2000
The speaker recognition task falls under the general problem of pattern classification. Viewed as a pattern classification problem, its ultimate objective is the design of a system that classifies feature vectors into different classes by partitioning the feature space into an optimal speaker-discriminative space. Linear Discriminant Analysis (LDA) is a feature extraction method that provides a linear transformation of n-dimensional feature vectors (or samples) into an m-dimensional space (m < n), so that samples belonging to the same class are close together while samples from different classes are far apart. In this paper we discuss the application of LDA to our Gaussian Mixture Model (GMM) based speaker identification task. Applying LDA improved the identification performance.
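For context, the GMM identification stage that LDA feeds into can be sketched as one mixture model per enrolled speaker, with the decision going to the model of highest average frame log-likelihood; the minimal scikit-learn example below uses invented shapes and data.

```python
# Hedged sketch of GMM-based speaker identification on (LDA-projected)
# features: one GMM per enrolled speaker, decide by the highest average
# log-likelihood over a test utterance's frames. Shapes are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
train = {s: rng.standard_normal((500, 24)) + s for s in range(4)}  # frames per speaker
models = {s: GaussianMixture(n_components=8, covariance_type='diag',
                             random_state=0).fit(f)
          for s, f in train.items()}

test_frames = rng.standard_normal((300, 24)) + 2   # utterance from speaker 2
scores = {s: m.score(test_frames) for s, m in models.items()}  # mean log-likelihood
print(max(scores, key=scores.get))                 # identified speaker
```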
MLP Internal Representation as Discriminative Features for Improved Speaker Recognition
2005
Feature projection by non-linear discriminant analysis (NLDA) can substantially increase classification performance. In automatic speech recognition (ASR), the projection provided by the pre-squashed outputs of a one-hidden-layer multi-layer perceptron (MLP) trained to recognise speech sub-units (phonemes) has previously been shown to significantly increase ASR performance. An analogous approach cannot be applied directly to speaker recognition because there is no recognised set of "speaker sub-units" to provide a finite set of MLP target classes, and for many applications it is not practical to train an MLP with one output for each target speaker. In this paper we show that the output from the second hidden layer (compression layer) of an MLP with three hidden layers, trained to identify a subset of 100 speakers selected at random from a set of 300 training speakers in TIMIT, can provide a 77% relative error reduction for common Gaussian mixture model (GMM) based speaker identification.
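A minimal sketch of the bottleneck idea, assuming a PyTorch model with invented layer sizes and an abbreviated training loop: train a three-hidden-layer MLP on speaker targets, then reuse the second (compression) layer's pre-squashed outputs as features.

```python
# Sketch of bottleneck-feature extraction: a three-hidden-layer MLP is
# trained to classify a speaker subset, then the second (compression)
# hidden layer's pre-squashed outputs are reused as features.
# Layer sizes and training details are assumptions, not the paper's.
import torch
import torch.nn as nn

n_speakers, in_dim, bottleneck = 100, 39, 30

net = nn.Sequential(
    nn.Linear(in_dim, 256), nn.Sigmoid(),      # hidden layer 1
    nn.Linear(256, bottleneck), nn.Sigmoid(),  # hidden layer 2 (compression)
    nn.Linear(bottleneck, 256), nn.Sigmoid(),  # hidden layer 3
    nn.Linear(256, n_speakers),                # speaker targets
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(512, in_dim)                   # a batch of MFCC-like frames
y = torch.randint(0, n_speakers, (512,))
for _ in range(10):                            # abbreviated training loop
    opt.zero_grad()
    loss_fn(net(x), y).backward()
    opt.step()

features = net[:3](x)   # pre-squashed output of the compression layer
print(features.shape)   # (512, 30)
```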
A PLDA Approach for Language and Text Independent Speaker Recognition
Odyssey 2016, 2016
There are many factors affecting the variability of an i-vector extracted from a speech segment, such as the acoustic content, segment duration, handset type and background noise. The language being spoken is one source of variation that has received limited attention due to the lack of available multilingual resources. Consequently, discrimination performance is much lower under multilingual trial conditions. Standard session-compensation techniques such as Within-Class Covariance Normalization (WCCN), Linear Discriminant Analysis (LDA) and Probabilistic LDA (PLDA) cannot robustly compensate for the language source of variation, as the amount of data available to represent such variability is limited. The source-normalization technique, developed to compensate for speech-source variation, offered superior performance in cross-language trials by providing better estimates of the within-speaker scatter matrix in the WCCN and LDA techniques. However, neither language normalization nor the state-of-the-art PLDA algorithm is capable of modeling language variability on a dataset with insufficient multilingual utterances per speaker, resulting in poor performance in cross-language trial conditions. This study extends our initial development of a language-independent PLDA training algorithm aimed at reducing the effect of language as a source of variability on speaker recognition performance. We provide a thorough analysis of how the proposed approach can use multilingual training data from bilingual speakers to robustly compensate for the effect of language. Evaluated on the multilingual trial condition, the proposed solution demonstrated relative improvements over the baseline system of over 10% in EER and 13% in minimum DCF on the NIST 2008 speaker recognition evaluation, as well as 12.4% in EER and 23% in minimum DCF on the PRISM evaluation set, while also providing improvements in other trial conditions.
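Of the compensation techniques named, WCCN is the simplest to show concretely: estimate the within-speaker covariance W and whiten vectors with a matrix B satisfying B Bᵀ = W⁻¹. The numpy sketch below uses a pooled within-speaker covariance estimate (one common variant) and invented data.

```python
# Minimal numpy sketch of WCCN, one of the compensation techniques the
# abstract names: estimate the within-speaker covariance W (pooled
# variant here), then whiten with B from the Cholesky factor of W⁻¹.
import numpy as np

def wccn(ivectors, labels):
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for spk in np.unique(labels):
        X = ivectors[labels == spk]
        W += np.cov(X, rowvar=False) * (len(X) - 1)  # within-speaker scatter
    W /= len(ivectors)                               # pooled covariance estimate
    B = np.linalg.cholesky(np.linalg.inv(W))
    return ivectors @ B                              # compensated vectors

rng = np.random.default_rng(3)
iv = rng.standard_normal((1000, 50))
lab = rng.integers(0, 40, size=1000)
print(wccn(iv, lab).shape)                           # (1000, 50)
```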
Discriminatively trained Probabilistic Linear Discriminant Analysis for speaker verification
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
Recently, i-vector extraction and Probabilistic Linear Discriminant Analysis (PLDA) have been shown to provide state-of-the-art speaker verification performance. In this paper, the speaker verification score for a pair of i-vectors representing a trial is computed with a functional form derived from the successful PLDA generative model. In our case, however, the parameters of this function are estimated with a discriminative training criterion. We propose an objective function that directly addresses the speaker verification task: discrimination between same-speaker and different-speaker trials. Compared with a baseline using a generatively trained PLDA model, discriminative training provides up to 40% relative improvement on the NIST SRE 2010 evaluation task.
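The PLDA-derived score for a trial is quadratic in the two i-vectors, of the form s(x1, x2) = x1ᵀΛx2 + x2ᵀΛx1 + x1ᵀΓx1 + x2ᵀΓx2 + cᵀ(x1 + x2) + k, and its parameters can be learned with a logistic loss over trial labels. The PyTorch sketch below illustrates those mechanics on synthetic trials; it is an assumption-laden toy, not the paper's training recipe.

```python
# Toy sketch: keep the PLDA score's quadratic functional form but learn
# its parameters (Λ, Γ, c, k) discriminatively from same/different-
# speaker trial labels with a logistic loss. Data here is synthetic.
import torch

d = 50
Lam = torch.zeros(d, d, requires_grad=True)   # Λ: cross-vector term
Gam = torch.zeros(d, d, requires_grad=True)   # Γ: within-vector term
c = torch.zeros(d, requires_grad=True)
k = torch.zeros(1, requires_grad=True)

def score(x1, x2):
    return ((x1 @ Lam * x2).sum(1) + (x2 @ Lam * x1).sum(1)
            + (x1 @ Gam * x1).sum(1) + (x2 @ Gam * x2).sum(1)
            + (x1 + x2) @ c + k)

x1, x2 = torch.randn(256, d), torch.randn(256, d)   # i-vector pairs (trials)
same = torch.randint(0, 2, (256,)).float()          # 1 = same-speaker trial
opt = torch.optim.Adam([Lam, Gam, c, k], lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(20):                                 # abbreviated training loop
    opt.zero_grad()
    loss_fn(score(x1, x2), same).backward()
    opt.step()
```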
MLP trained to separate problem speakers provides improved features for speaker identification
2005
In automatic speech recognition (ASR), the non-linear data projection provided by a one-hidden-layer multi-layer perceptron (MLP), trained to recognise phonemes, has previously been shown to provide feature enhancement that can substantially increase ASR performance, especially in noise. Previous attempts to apply an analogous approach to speaker identification have not succeeded in improving performance, except by combining MLP-processed features with other features. We present test results for the TIMIT database which show that the advantage of MLP preprocessing for open-set speaker identification increases with the number of speakers used to train the MLP, and that improved identification is obtained as this number increases beyond sixty. We also present a method for selecting the speakers used for MLP training which further improves identification performance.
A comparative study of linear and nonlinear dimensionality reduction for speaker identification
Digital Signal Processing, 2007 15th …, 2007
In this paper we apply linear and nonlinear dimensionality reduction methods to speech produced by a number of different speakers, in an effort to yield low-dimensional features capable of discriminating between speakers. The classical linear dimensionality reduction method, principal component analysis (PCA), and the nonlinear manifold learning method, Isomap, are investigated. The resulting features are evaluated in GMM-based speaker identification experiments and compared to conventional cepstral features. Isomap is shown to give the highest accuracy for very low dimensions, outperforming MFCCs and PCA-transformed features, and to be useful for visualisation of speaker clusters. For higher dimensions, speaker identification results indicate that features resulting from PCA offer improvements over conventional MFCCs.
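Both methods are available off the shelf; the sketch below runs PCA and Isomap side by side on MFCC-like frames, with an invented neighbourhood size and target dimension.

```python
# Quick sketch of comparing linear (PCA) and nonlinear (Isomap)
# dimensionality reduction on the same frames, as in the study;
# the 3-D target and neighbourhood size are illustrative choices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 39))   # MFCC-like frames (placeholder data)

Z_pca = PCA(n_components=3).fit_transform(X)
Z_iso = Isomap(n_neighbors=10, n_components=3).fit_transform(X)
print(Z_pca.shape, Z_iso.shape)       # both (1000, 3)
```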
On deep speaker embeddings for text-independent speaker recognition
Odyssey 2018 The Speaker and Language Recognition Workshop, 2018
We investigate deep neural network performance on the text-independent speaker recognition task. We demonstrate that using an angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, allows training a more generalized discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame-level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements to previous DNN-based extractor systems to increase speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding backend. The results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the robustness of the proposed systems under close-to-real-life conditions.
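Scoring in such an embedding space reduces to a length-normalized dot product; a minimal sketch, with an arbitrary decision threshold and random placeholder embeddings:

```python
# Minimal sketch of cosine-similarity verification scoring in an
# embedding space: length-normalise both embeddings and threshold
# their dot product. The 0.5 threshold is an arbitrary example.
import numpy as np

def cosine_score(e1, e2):
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    return float(e1 @ e2)

rng = np.random.default_rng(5)
enrol, test = rng.standard_normal(256), rng.standard_normal(256)
accept = cosine_score(enrol, test) > 0.5   # same-speaker decision
print(accept)
```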
Discriminative Feature Projection for Noise Robust Speaker Identification
In automatic speech recognition (ASR), the technique of discriminative feature projection (DFP) by non-linear discriminant analysis (NLDA), especially in combination with multi-condition training (MCT) and multistream combination (MSC), has been shown to provide very substantial increases in recognition performance in noise. In this article we investigate the benefit of DFP for speaker identification. Previous applications of DFP to speaker recognition have not clearly separated the performance advantages due to DFP and MCT, or demonstrated clearly improved performance except when DFP, MCT and MSC were used in combination. Here we investigate the application of DFP and MCT to noise-robust speaker identification, both alone and in combination. We also compare these with two-stage processing in which noise condition detection (NCD) permits the subsequent use of noise-matched models.
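The MCT ingredient is straightforward to illustrate: mix clean training audio with noise at several signal-to-noise ratios so the models see a range of conditions. The sketch below is generic data preparation under assumed SNR levels, not the article's exact setup.

```python
# Sketch of the multi-condition training (MCT) data preparation step:
# mix each clean training waveform with noise at several SNRs so the
# models see matched and mismatched conditions. SNR levels are examples.
import numpy as np

def add_noise(clean, noise, snr_db):
    # scale the noise so the mixture reaches the requested SNR
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(6)
clean = rng.standard_normal(16000)   # 1 s at 16 kHz (placeholder signal)
noise = rng.standard_normal(16000)
multi_condition = [add_noise(clean, noise, snr) for snr in (0, 5, 10, 20)]
```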