SRI's 2004 NIST Speaker Recognition Evaluation System
Related papers
The SRI NIST 2010 speaker recognition evaluation system
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
The SRI speaker recognition system for the 2010 NIST speaker recognition evaluation (SRE) incorporates multiple subsystems with a variety of features and modeling techniques. We describe our strategy for this year's evaluation, from the use of speech recognition and speech segmentation, through the individual subsystems, to the final combination. Our results show that under most conditions the cepstral systems tend to perform best, but that the other, non-cepstral systems offer the most complementarity. Combining several subsystems with adequate side information gives a 35% improvement on the standard telephone condition. We also show that a constrained cepstral system based on nasal syllables tends to be more robust to vocal effort variability.
The SRI NIST 2008 speaker recognition evaluation system
2009
The SRI speaker recognition system for the 2008 NIST speaker recognition evaluation (SRE) incorporates a variety of models and features, both cepstral and stylistic. We highlight the improvements made to specific subsystems and analyze the performance of various subsystem combinations in different data conditions. We show the importance of language and nativeness conditioning, as well as the role of ASR for speaker verification.
2006
Recent work in speaker recognition has demonstrated the advantage of modeling stylistic features in addition to traditional cepstral features, but to date there has been little study of the relative contributions of these different feature types to a state-of-the-art system. In this paper we provide such an analysis, based on SRI's submission to the NIST 2005 Speaker Recognition Evaluation. The system consists of 7 subsystems (3 cepstral, 4 stylistic). By running independent N-way subsystem combinations for increasing values of N, we find that (1) a monotonic pattern in the choice of the best N systems allows for the inference of subsystem importance; (2) the ordering of subsystems alternates between cepstral and stylistic; (3) syllable-based prosodic features are the strongest stylistic features, and (4) overall subsystem ordering depends crucially on the amount of training data (1 versus 8 conversation sides). Improvements over the baseline cepstral system, when all systems are combined, range from 47% to 67%, with larger improvements for the 8-side condition. These results provide direct evidence of the complementary contributions of cepstral and stylistic features to speaker discrimination.
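The N-way combination analysis described above can be sketched in a few lines. This is a toy illustration with synthetic subsystem scores and equal-weight fusion scored by accuracy; the actual study uses seven real subsystems evaluated with NIST metrics on SRE trials, so every number and name here is an assumption for demonstration only.

```python
# Toy sketch of best-N subsystem combination (synthetic scores; a real
# analysis would use DCF or EER on actual evaluation trials).
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 1000
labels = rng.integers(0, 2, size=n).astype(bool)  # True = target trial

# Seven synthetic subsystems with decreasing discrimination strength,
# loosely mirroring the 7-subsystem (3 cepstral, 4 stylistic) setup.
strengths = [2.0, 1.8, 1.5, 1.0, 0.8, 0.6, 0.4]
scores = np.stack([np.where(labels, s, -s) + rng.normal(0, 2, n)
                   for s in strengths])

def accuracy(subset):
    """Accuracy of equal-weight score fusion over the chosen subsystems."""
    fused = scores[list(subset)].mean(axis=0)
    return float(np.mean((fused >= 0) == labels))

# For increasing N, exhaustively find the best N-way combination; the
# ordering in which subsystems enter hints at their relative importance.
best_by_n = {}
for N in range(1, 4):
    best = max(itertools.combinations(range(7), N), key=accuracy)
    best_by_n[N] = (best, accuracy(best))
    print(N, best, round(best_by_n[N][1], 3))
```

With only 7 subsystems, exhaustive search over subsets is cheap (at most 35 combinations for N = 3), which is why the paper can report best-N results directly rather than relying on greedy selection.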
Performance Evaluation of Feature Extraction and Modeling Methods for Speaker Recognition
Annals of Reviews & Research, 2018
In this study, the performance of prominent feature extraction and modeling methods for speaker recognition is evaluated on a purpose-built database whose distinguishing feature is that the subjects are siblings or relatives. After basic background on speaker recognition systems, the salient properties of each method are briefly reviewed. Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC) are used for feature extraction, while Gaussian Mixture Models (GMM) and i-vectors are used for modeling. The parameters of these methods are varied to obtain the best results: the number of features for LPCC and MFCC, and the number of mixture components for the GMM. The aim of this study is to determine which parameters of the most commonly used methods contribute to their success and, at the same time, to identify the best combination of feature extraction and modeling methods for speakers with similar voices. The study also serves as a resource and guide for researchers in speaker recognition.
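The GMM modeling step described above can be sketched as follows. This is a minimal illustration using synthetic feature frames in place of real MFCCs (a real front end such as librosa or Kaldi would extract them from audio); the speaker names, dimensions, and mixture count are assumptions chosen for the demo, with the mixture count being exactly the kind of tunable parameter the study varies.

```python
# Minimal sketch of GMM-based speaker identification with one GMM per
# speaker, scored by average log-likelihood (synthetic "MFCC" frames).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 13-dimensional "MFCC" frames for two enrolled speakers.
speaker_a = rng.normal(loc=0.0, scale=1.0, size=(500, 13))
speaker_b = rng.normal(loc=2.0, scale=1.0, size=(500, 13))

# Train one GMM per speaker; n_components is a tunable parameter.
models = {}
for name, frames in [("A", speaker_a), ("B", speaker_b)]:
    gmm = GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0)
    gmm.fit(frames)
    models[name] = gmm

# Identify a test utterance by the highest average log-likelihood.
test = rng.normal(loc=2.0, scale=1.0, size=(200, 13))
scores = {name: gmm.score(test) for name, gmm in models.items()}
best = max(scores, key=scores.get)
print(best)  # the test frames match speaker B's distribution
```

Production systems typically add a universal background model (UBM) and MAP adaptation on top of this basic per-speaker GMM scheme, but the likelihood-comparison core is the same.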
Toward 2003 NIST Speaker Recognition Evaluation: The WCL-1 System
2003
A detailed description is presented of our text-independent speaker verification (SV) system, referred to as WCL-1, a participant in the one-speaker detection task of the 2003 NIST Speaker Recognition Evaluation (SRE). It is an improved version of our baseline system, which successfully participated in the 2002 NIST SRE. In addition to the short-term spectrum represented by Mel-frequency cepstral coefficients (MFCCs), the improved WCL-1 system also exploits prosodic information to account for the speaking style of the users. The logarithm of the frame energy replaces the first MFCC coefficient, which was found to be strongly influenced by the transmission channel and handset characteristics. Furthermore, the logarithm of the fundamental frequency f0 is appended to form the final feature vector. Instead of the traditional ln(f0), we propose ln(f0 − f0_min), which we found to be much more effective due to its extended dynamic range, which better corresponds to the relative importance of the fundamental frequency. The constant f0_min is set to 90% of the minimal fundamental frequency the pitch estimator can detect. Comparative results between the improved WCL-1 system and the baseline version, obtained on the one-speaker detection task over the 2001 NIST SRE database, are reported.
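The ln(f0 − f0_min) feature above is simple to sketch. The pitch-tracker floor of 60 Hz below is an illustrative assumption (the paper only specifies that f0_min is 90% of whatever minimum the estimator can detect):

```python
# Sketch of the extended-dynamic-range log-F0 feature ln(f0 - f0_min)
# versus plain ln(f0). The 60 Hz tracker floor is illustrative only.
import math

f0_floor = 60.0          # assumed pitch-tracker minimum, Hz
f0_min = 0.9 * f0_floor  # constant from the paper: 90% of the floor

def logf0_feature(f0_hz):
    """Extended-dynamic-range log-F0 feature: ln(f0 - f0_min)."""
    return math.log(f0_hz - f0_min)

# Near the bottom of the pitch range, the proposed feature spreads
# values out far more than plain ln(f0) does.
for f0 in (70.0, 80.0, 200.0):
    print(f0, round(math.log(f0), 3), round(logf0_feature(f0), 3))
```

For f0 going from 70 Hz to 80 Hz, plain ln(f0) changes by ln(80/70) ≈ 0.13, while the proposed feature changes by ln(26/16) ≈ 0.49, which is the extended dynamic range the authors exploit.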
The NIST Speaker Recognition Evaluations: 1996-2001
2001
We discuss the history and purposes of the NIST evaluations of speaker recognition performance. We cover the sites that have participated, the performance measures used, and the formats used to report results. We consider the extent to which there has been measurable progress over the years. In particular, we examine apparent performance improvements seen in the 2001 evaluation. Information for prospective participants is included.
Speaker recognition using prosodic and lexical features
2003
Conventional speaker recognition systems identify speakers by using spectral information from very short slices of speech. Such systems perform well (especially in quiet conditions), but fail to capture idiosyncratic longer-term patterns in a speaker's habitual speaking style, including duration and pausing patterns, intonation contours, and the use of particular phrases. We investigate the contribution of modeling such prosodic and lexical patterns, on performance in the NIST 2003 Speaker Recognition Evaluation extended data task. We report results for: (1) systems based on individual feature types alone; (2) systems in combination with a state-of-the-art frame-based baseline system; (3) an all-system combination. Our results show that certain longer-term stylistic features provide powerful complementary information to both frame-level cepstral features and to each other. Stylistic features thus significantly improve speaker recognition performance over conventional systems, and offer promise for a variety of intelligence and security applications.
The IIR Submission to CSLP 2006 Speaker Recognition Evaluation
Chinese Spoken Language Processing, 2006
This paper describes the design and implementation of a practical automatic speaker recognition system for the CSLP speaker recognition evaluation (SRE). The speaker recognition system is built upon four subsystems using speaker information from acoustic spectral features. In addition to the conventional spectral features, a novel temporal discrete cosine transform (TDCT) feature is introduced in order to capture long-term speech dynamics. The speaker information is modeled using two complementary speaker modeling techniques, namely, the Gaussian mixture model (GMM) and the support vector machine (SVM). The resulting subsystems are then integrated at the score level through a multilayer perceptron (MLP) neural network. Evaluation results confirm that the feature selection, classifier design, and fusion strategy are successful, giving rise to an effective speaker recognition system.
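The score-level MLP fusion described above can be sketched as follows. The subsystem scores here are synthetic (a real system would feed in the actual GMM and SVM subsystem outputs), and the network size and score distributions are assumptions for illustration:

```python
# Sketch of MLP score-level fusion over four subsystem scores
# (synthetic scores standing in for real GMM/SVM subsystem outputs).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n = 400

# Four subsystem scores per trial; target trials score higher on
# average than impostor trials.
target_scores = rng.normal(1.0, 1.0, size=(n, 4))
impostor_scores = rng.normal(-1.0, 1.0, size=(n, 4))
X = np.vstack([target_scores, impostor_scores])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Small MLP as the score fuser, trained on labeled development trials.
fuser = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                      random_state=0)
fuser.fit(X, y)

# Fused posterior for a trial where all four subsystems agree it is
# a target.
p_target = fuser.predict_proba([[1.2, 0.9, 1.1, 1.0]])[0, 1]
print(round(p_target, 3))
```

The attraction of a trained fuser over fixed-weight score averaging is that it can learn unequal subsystem reliabilities and mild nonlinear interactions between scores, at the cost of needing a labeled development set.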
NIST Speaker Recognition Evaluation Chronicles - Part 2
2006 IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006
NIST has coordinated annual evaluations of text-independent speaker recognition since 1996. During the course of this series of evaluations there have been notable milestones related to the development of the evaluation paradigm and the performance achievements of state-of-the-art systems. We document here the variants of the speaker detection task that have been included in the evaluations and the history of the best performance results for this task. Finally, we discuss the data collection and protocols for the 2004 evaluation and beyond.
A note on performance metrics for Speaker Recognition using multiple conditions in an evaluation
In this paper we put forward arguments for pooling different evaluation conditions when calculating speaker recognition system performance measures. We propose a condition-based weighting of trials and derive expressions for the basic speaker recognition performance measures Cdet and Cllr, as well as the DET curve, from which the EER and minimum Cdet can be computed. We show that trial-based weighting is essential for computing the minimum Cllr in a pooled-condition evaluation. Examples of pooling conditions are shown on SRE-2008 data, including speaker sex, microphone type, and speaking style.
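A condition-weighted detection cost in the spirit of the trial weighting discussed above can be sketched as follows. The cost parameters (Cmiss = 10, Cfa = 1, Ptarget = 0.01) are the familiar NIST-style values, and the trial scores and weights are made up for illustration; the paper's actual derivations cover Cllr and the DET curve as well.

```python
# Minimal sketch of a condition-weighted detection cost (Cdet):
#   Cdet = Cmiss * Pmiss * Ptarget + Cfa * Pfa * (1 - Ptarget)
# where Pmiss and Pfa are weighted error rates over the trials.
import numpy as np

def weighted_cdet(scores, labels, weights, threshold,
                  c_miss=10.0, c_fa=1.0, p_target=0.01):
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)      # True = target trial
    weights = np.asarray(weights, float)   # per-trial condition weights
    decisions = scores >= threshold
    tgt, non = labels, ~labels
    p_miss = np.sum(weights[tgt] * ~decisions[tgt]) / np.sum(weights[tgt])
    p_fa = np.sum(weights[non] * decisions[non]) / np.sum(weights[non])
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

scores = [2.1, 1.8, -0.5, 0.3, -1.2, 0.9]
labels = [True, True, False, False, False, True]
weights = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0]  # e.g. down-weight one condition
cdet = weighted_cdet(scores, labels, weights, threshold=0.0)
print(cdet)  # 0.2475: no misses, one weighted false alarm
```

Setting all weights to 1.0 recovers the usual pooled trial counting; unequal weights rebalance conditions so that an over-represented condition does not dominate the pooled cost, which is the paper's central point.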