The SRI NIST 2008 speaker recognition evaluation system
Related papers
The SRI NIST 2010 speaker recognition evaluation system
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
The SRI speaker recognition system for the 2010 NIST speaker recognition evaluation (SRE) incorporates multiple subsystems with a variety of features and modeling techniques. We describe our strategy for this year's evaluation, from the use of speech recognition and speech segmentation to the individual system descriptions as well as the final combination. Our results show that under most conditions, the cepstral systems tend to perform the best, but that other, non-cepstral systems have the most complementarity. The combination of several subsystems with the use of adequate side information gives a 35% improvement on the standard telephone condition. We also show that a constrained cepstral system based on nasal syllables tends to be more robust to vocal effort variabilities.
SRI's 2004 NIST Speaker Recognition Evaluation System
2005
This paper describes our recent efforts in exploring longer-range features and their statistical modeling techniques for speaker recognition. In particular, we describe a system that uses discriminant features from cepstral coefficients, and systems that use discriminant models from word n-grams and syllable-based NERF n-grams. These systems together with a cepstral baseline system are evaluated on the 2004 NIST speaker recognition evaluation dataset. The effect of the development set is measured using two different datasets, one from Switchboard databases and another from the FISHER database. Results show that the difference between the development and evaluation sets affects the performance of the systems only when more training data is available. Results also show that systems using longer-range features combined with the baseline result in about a 31% improvement with 1-side training over the baseline system and about a 61% improvement with 8-side training over the baseline system.
2006
Recent work in speaker recognition has demonstrated the advantage of modeling stylistic features in addition to traditional cepstral features, but to date there has been little study of the relative contributions of these different feature types to a state-of-the-art system. In this paper we provide such an analysis, based on SRI's submission to the NIST 2005 Speaker Recognition Evaluation. The system consists of 7 subsystems (3 cepstral, 4 stylistic). By running independent N-way subsystem combinations for increasing values of N, we find that (1) a monotonic pattern in the choice of the best N systems allows for the inference of subsystem importance; (2) the ordering of subsystems alternates between cepstral and stylistic; (3) syllable-based prosodic features are the strongest stylistic features, and (4) overall subsystem ordering depends crucially on the amount of training data (1 versus 8 conversation sides). Improvements over the baseline cepstral system, when all systems are combined, range from 47% to 67%, with larger improvements for the 8-side condition. These results provide direct evidence of the complementary contributions of cepstral and stylistic features to speaker discrimination.
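The N-way combination analysis described in this abstract can be illustrated with a minimal sketch. It assumes a caller-supplied function that returns the error of a given subsystem combination; the names `best_subsets`, `is_monotonic`, `importance_order`, and the toy error function are hypothetical and are not taken from the paper, and the toy weights are made up purely to exercise the code.

```python
from itertools import combinations


def best_subsets(subsystems, error_of):
    """For each N, return the N-subsystem combination with the lowest error."""
    return {
        n: min(combinations(subsystems, n), key=error_of)
        for n in range(1, len(subsystems) + 1)
    }


def is_monotonic(best):
    """True if every best N-subset contains the best (N-1)-subset."""
    return all(set(best[n - 1]) <= set(best[n]) for n in range(2, len(best) + 1))


def importance_order(best):
    """Order subsystems by the N at which they first enter the best subset.

    Only meaningful when is_monotonic(best) is True.
    """
    order = []
    for n in sorted(best):
        order.extend(s for s in best[n] if s not in order)
    return order


# Toy stand-in for a real evaluation: invented weights, NOT results from the paper.
toy_weights = {"cepstral": 3.0, "prosodic": 2.0, "word_ngram": 1.0}


def toy_error(combo):
    # Hypothetical error model: more (and stronger) subsystems give lower error.
    return 1.0 / (1.0 + sum(toy_weights[s] for s in combo))


toy_best = best_subsets(list(toy_weights), toy_error)
toy_order = importance_order(toy_best) if is_monotonic(toy_best) else None
```

In practice `error_of` would run the actual score combination and measure an error metric on held-out trials; the exhaustive search over subsets is only feasible for a small number of subsystems, such as the 7 mentioned above.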
The NIST Speaker Recognition Evaluations: 1996-2001
2001
We discuss the history and purposes of the NIST evaluations of speaker recognition performance. We cover the sites that have participated, the performance measures used, and the formats used to report results. We consider the extent to which there has been measurable progress over the years. In particular, we examine apparent performance improvements seen in the 2001 evaluation. Information for prospective participants is included.
NIST Speaker Recognition Evaluation Chronicles - Part 2
2006 IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006
NIST has coordinated annual evaluations of text-independent speaker recognition since 1996. During the course of this series of evaluations there have been notable milestones related to the development of the evaluation paradigm and the performance achievements of state-of-the-art systems. We document here the variants of the speaker detection task that have been included in the evaluations and the history of the best performance results for this task. Finally, we discuss the data collection and protocols for the 2004 evaluation and beyond.
Loquendo - Politecnico di Torino's 2010 NIST speaker recognition evaluation system
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
This paper describes the improvements introduced in the Loquendo-Politecnico di Torino (LPT) speaker recognition system submitted to the NIST SRE08 evaluation campaign. This system, which was among the best participants in this evaluation, combines the results of three core acoustic systems, two based on Gaussian Mixture Models (GMMs), and one on Phonetic GMMs. We discuss the results of the experiments performed for the 10sec-10sec condition and for the core condition, including the challenging tasks involving a target speaker and an interviewer. The error rate reduction of our SRE08 system compared to the SRE06 system ranges from 25% for the telephone-interview condition to 57% for the interview-interview condition. On the test with telephone and microphone conversations, the improvements range from 9% to 32%.
Toward 2003 NIST Speaker Recognition Evaluation: The WCL-1 System
2003
A detailed description is presented of our text-independent speaker verification (SV) system, referred to as WCL-1, a participant in the one-speaker detection task of the 2003 NIST Speaker Recognition Evaluation (SRE). It is an improved version of our baseline system, which participated successfully in the 2002 NIST SRE. In addition to the short-term spectrum represented by the Mel-frequency scaled cepstral coefficients (MFCCs), the improved WCL-1 system also exploits prosodic information to account for the speaking style of the users. A logarithm of the energy, computed for the corresponding speech frame, replaces the first MFCC coefficient, which was found to be strongly influenced by the transmission channel and the handset characteristics. Furthermore, a logarithm of the fundamental frequency f0 is added to the other parameters to form the final feature vector. Instead of the traditional ln(f0), we propose ln(f0 - f0_min), which we found to be much more effective, due to its extended dynamic range that better corresponds to the relative importance of the fundamental frequency. The constant f0_min is derived as 90% of the minimal fundamental frequency the pitch estimator can detect. Comparative results between the improved WCL-1 system and the baseline version, obtained in the one-speaker detection task over the 2001 NIST SRE database, are reported.
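The feature construction described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the WCL-1 implementation: the function names, the use of plain Python lists, and the restriction to voiced frames with a positive frame energy are all assumptions made here.

```python
import math


def pitch_feature(f0_hz, detector_min_f0_hz):
    """Return ln(f0 - f0_min), with f0_min set to 90% of the lowest
    fundamental frequency the pitch estimator can detect."""
    f0_min = 0.9 * detector_min_f0_hz
    if f0_hz < detector_min_f0_hz:
        # Assumption: only voiced frames with a detectable f0 are passed in.
        raise ValueError("f0 is below the detector's minimum")
    return math.log(f0_hz - f0_min)


def build_feature_vector(mfcc, frame_energy, f0_hz, detector_min_f0_hz):
    """Replace the first MFCC with the log frame energy and append the
    shifted log-pitch term."""
    log_energy = math.log(frame_energy)  # assumes frame_energy > 0
    return [log_energy] + list(mfcc[1:]) + [pitch_feature(f0_hz, detector_min_f0_hz)]


# Toy frame with made-up values, only to show the shape of the result.
example_vector = build_feature_vector(
    mfcc=[9.0, 1.0, 2.0], frame_energy=math.e, f0_hz=100.0, detector_min_f0_hz=50.0
)

# Dynamic range over a hypothetical 50-400 Hz pitch span:
plain_range = math.log(400.0) - math.log(50.0)
shifted_range = pitch_feature(400.0, 50.0) - pitch_feature(50.0, 50.0)
```

Because f0_min sits just below the lowest detectable pitch, the shifted logarithm stretches the low end of the pitch scale, so `shifted_range` comes out larger than `plain_range` for the same span; this is the extended dynamic range the abstract refers to.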
The IIR Submission to CSLP 2006 Speaker Recognition Evaluation
Chinese Spoken Language Processing, 2006
This paper describes the design and implementation of a practical automatic speaker recognition system for the CSLP speaker recognition evaluation (SRE). The speaker recognition system is built upon four subsystems using speaker information from acoustic spectral features. In addition to the conventional spectral features, a novel temporal discrete cosine transform (TDCT) feature is introduced in order to capture long-term speech dynamics. The speaker information is modeled using two complementary speaker modeling techniques, namely, Gaussian mixture model (GMM) and support vector machine (SVM). The resulting subsystems are then integrated at the score level through a multilayer perceptron (MLP) neural network. Evaluation results confirm that the feature selection, classifier design, and fusion strategy are successful, giving rise to an effective speaker recognition system.
NIST speaker recognition evaluation chronicles
2004
NIST has coordinated annual evaluations of text-independent speaker recognition since 1996. During the course of this series of evaluations there have been notable milestones related to the development of the evaluation paradigm and the performance achievements of state-of-the-art systems. We document here the variants of the speaker detection task that have been included in the evaluations and the history of the best performance results for this task. Finally, we discuss the data collection and protocols for the 2004 evaluation and beyond.
STBU System for the NIST 2006 Speaker Recognition Evaluation
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
This paper describes the STBU 2006 speaker recognition system, which performed well in the NIST 2006 speaker recognition evaluation. STBU is a consortium of 4 partners: Spescom DataVoice (South Africa), TNO (Netherlands), BUT (Czech Republic) and University of Stellenbosch (South Africa). The primary system is a combination of three main kinds of systems: (1) GMM, with short-time MFCC or PLP features, (2) GMM-SVM, using GMM mean supervectors as input, and (3) MLLR-SVM, using MLLR speaker adaptation coefficients derived from an English LVCSR system. In this paper, we describe these sub-systems and present results for each system alone and in combination on the NIST Speaker Recognition Evaluation (SRE) 2006 development and evaluation data sets.