The Problem of Voice Template Aging in Speaker Recognition Systems

Effect of long-term ageing on i-vector speaker verification

Assessing the impact of ageing on biometric systems is an important challenge. In this paper, an i-vector speaker verification framework is used to evaluate the impact of long-term ageing on state-of-the-art speaker verification. Using the Trinity College Dublin Speaker Ageing (TCDSA) database, it is observed that the performance of the i-vector system, in terms of both discrimination and calibration, degrades progressively as the absolute age difference between training and testing samples increases. In the case of male speakers, the equal error rate (EER) increases from 4.61% at an age difference of 0-1 years to 32.74% at an age difference of 51-60 years. The performance of a Gaussian Mixture Model-Universal Background Model (GMM-UBM) system is presented for comparison. It is shown that while the i-vector system outperforms the GMM-UBM system, the performance of both degrades at a similar rate as absolute age difference increases. It is concluded that long-term ageing variability is distinct from everyday intersession variability, and therefore must be dealt with via dedicated compensation strategies.
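For readers unfamiliar with the metric quoted above, the EER is the operating point at which the miss rate and the false-alarm rate are equal. The following is a minimal sketch of how it can be approximated from verification scores. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not taken from the paper, and the threshold sweep is a simple approximation rather than the paper's evaluation procedure.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from two score lists (higher score = more target-like).

    Hypothetical helper for illustration: sweeps every observed score as a
    threshold and returns the error rate where miss and false-alarm rates
    are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer


# Toy data (made up): one target score falls below one non-target score.
print(equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5]))  # 0.25
```

With real trial lists, the same computation would be repeated per age-difference bin to obtain a curve of EER against years elapsed.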

Exploring Session Variability and Template Aging in Speaker Verification for Fixed Phrase Short Utterances

Interspeech 2016, 2016

This work highlights the impact of session variability and template aging on speaker verification (SV) using fixed phrase short utterances from the RedDots database. These have been collected over a period of one year and contain a large number of sessions per speaker. Session variation has been found to have a direct influence on SV performance, and its significance is even greater for fixed phrase short utterances, as a very small amount of speech data is involved in speaker modeling as well as testing. Similarly, for a practical deployable SV system that sees large session variation over a period of time, the template aging of the speakers may affect SV performance. This work attempts to address some issues of session variability and template aging that arise in data with large session variability and that, if taken into account, can be used to improve the performance of an SV system.

The NIST speaker recognition evaluation – Overview, methodology, systems, results, perspective

Speech Communication, 2000

This paper, based on three presentations made in 1998 at the RLA2C Workshop in Avignon, discusses the evaluation of speaker recognition systems from several perspectives. A general discussion of the speaker recognition task and the challenges and issues involved in its evaluation is offered. The NIST evaluations in this area and specifically the 1998 evaluation, its objectives, protocols and test data, are described. The algorithms used by the systems that were developed for this evaluation are summarized, compared and contrasted. Overall performance results of this evaluation are presented by means of detection error trade-off (DET) curves. These show the performance trade-off of missed detections and false alarms for each system and the effects on performance of training condition, test segment duration, the speakers' sex and the match or mismatch of training and test handsets. Several factors that were found to have an impact on performance, including pitch frequency, handset type and noise, are discussed and DET curves showing their effects are presented. The paper concludes with some perspective on the history of this technology and where it may be going.

Score-Aging Calibration for Speaker Verification

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016

The gradual changes that occur in the human voice due to aging create challenges for speaker verification. This study presents an approach to calibrating the output scores of a speaker verification system using the time interval between comparison samples as additional information. Several functions are proposed for the incorporation of this time information, which is viewed as aging information, in a conventional linear score calibration transformation. Experiments are presented on data with short-term aging intervals ranging between 2 months and 3 years, and long-term aging intervals of up to 30 years. The aging calibration proposal is shown to offset the decreased discrimination and calibration performance for both short- and long-term intervals, and to extrapolate well to unseen aging intervals. Relative reductions in Cllr (cost of log-likelihood ratio) of 1-4% and 10-43% are obtained at short- and long-term intervals, respectively. Assuming that a system has knowledge of the time interval between samples under comparison, this approach represents a straightforward means of compensating for the detrimental impact of aging on speaker verification performance.
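The idea of adding a time-interval term to a linear calibration can be sketched as follows. This is only an illustration under stated assumptions: the additive form `w0 + w1 * score + w2 * dt_years`, the function names, the weights, and the toy scores are all hypothetical, and the abstract does not say which functions of the time interval the authors actually propose. The cost of log-likelihood ratio is computed with its standard definition for natural-log LLRs.

```python
import math


def calibrate(score, dt_years, w0, w1, w2):
    """Linear score calibration plus an assumed additive aging term.

    With w2 = 0 this reduces to conventional linear calibration.
    """
    return w0 + w1 * score + w2 * dt_years


def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood ratio in bits, for natural-log LLRs."""
    c_tar = sum(math.log2(1 + math.exp(-l)) for l in target_llrs) / len(target_llrs)
    c_non = sum(math.log2(1 + math.exp(l)) for l in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non)


# Toy trials (made up): (raw score, years between samples).
targets = [(3.0, 0), (0.0, 30)]       # the aged target trial scores lower
nontargets = [(-3.0, 0), (-3.0, 30)]

plain = cllr([calibrate(s, dt, 0.0, 1.0, 0.0) for s, dt in targets],
             [calibrate(s, dt, 0.0, 1.0, 0.0) for s, dt in nontargets])
aged = cllr([calibrate(s, dt, 0.0, 1.0, 0.05) for s, dt in targets],
            [calibrate(s, dt, 0.0, 1.0, 0.05) for s, dt in nontargets])
print(plain, aged)
```

In practice the weights would be trained, for example by logistic regression on held-out trials, rather than set by hand as in this toy case.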

Effects of time lapse on Speaker Recognition results

2009 16th International Conference on Digital Signal Processing, 2009

The effect of time lapse has not been studied well in most biometrics. Here, this effect is studied for Speaker Recognition, namely, Speaker Identification and Speaker Verification. The RecoMadeEasy™ speaker recognition engine has been used to obtain baseline results for 22 speakers who have been involved in a long-term study. The speakers have given data in three seatings with 1 to 2 months delay between consecutive collections. The speakers were real proficiency test candidates who were asked to speak in response to prompts. At each seating, several recordings were made in response to different prompts. The error rates are discussed, going from one seating to the next, for Identification and Verification. Large degradations are seen across different seatings. Two different adaptation techniques have been studied for reducing this discrepancy with very promising results.

On the Performance Degradation of Speaker Recognition System due to Variation in Speech Characteristics Caused by Physiological Changes

International Journal of Computing and Digital Systems, 2017

Speaker recognition is the process of identifying a person using their speech characteristics (voice biometrics). Speech characteristics of an individual can vary due to physiological changes, which may be caused by health changes, physical activity, or emotional changes. Such changes in speech characteristics are likely to affect the accuracy of speaker recognition systems. In this paper, the performance degradation of a speaker recognition system is quantified empirically when the characteristics of an individual's speech change due to physiological changes caused by 'physical activity'. The speaker recognition system used in this work is based on Mel-Frequency Cepstral Coefficients (MFCCs) and Vector Quantization (VQ). When the speech sample of a user is obtained soon after high-intensity physical activity, the changes in the individual's speech characteristics affect the accuracy of speaker recognition systems. It is necessary to understand how speaker recognition systems are affected by changes in speech characteristics in order to improve their immunity to such changes. From speech recorded after physical activity, it is found that the duration of the 'voiced component', which carries prominent discriminative characteristics of speech, is shortened, and that this affects the accuracy of the speaker recognition system.
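As a rough illustration of how a VQ-based recognizer of this kind scores an utterance, the sketch below assigns a test utterance to the speaker whose codebook yields the lowest average distortion. It assumes the feature vectors (for example MFCC frames) and the per-speaker codebooks already exist; the names `avg_distortion` and `identify`, the two-dimensional vectors, and the speaker labels are hypothetical and not taken from the paper.

```python
def avg_distortion(features, codebook):
    """Mean squared Euclidean distance from each feature vector to its nearest codeword."""
    total = 0.0
    for x in features:
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook)
    return total / len(features)


def identify(features, codebooks):
    """Return the speaker label whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(features, codebooks[spk]))


# Toy codebooks (made up); real ones would be trained, e.g. with k-means, on MFCC frames.
codebooks = {
    "speaker_a": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_b": [[5.0, 5.0], [6.0, 6.0]],
}
test_frames = [[0.1, 0.0], [0.9, 1.1]]
print(identify(test_frames, codebooks))
```

Under this scoring scheme, a shift in the test features, such as the shortened voiced segments reported after physical activity, raises the distortion against the speaker's own codebook and can change the decision.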

Speaker adaptation in the NIST Speaker Recognition Evaluation 2004

New in the 2004 edition of the NIST Speaker Recognition Evaluation (SRE) was the condition where unsupervised adaptation of speaker models is allowed. Despite the promising results on development test material, hardly any beneficial results were obtained in the Evaluation itself. An analysis is made of why this was the case, and it appears that a minimum level of performance is essential for adaptation to improve on the performance obtained without adaptation. Further, the system should be well calibrated. For the conditions with 8 conversation sides, we have been able to find improvement from unsupervised adaptation on the NIST 2004 evaluation, both for a UBM/GMM adaptation methodology and for a novel SVM adaptation methodology. The minimum DCF for a fused system drops from 0.259 for the unadapted condition to 0.231 for the adapted condition.

SRI's 2004 NIST Speaker Recognition Evaluation System

2005

This paper describes our recent efforts in exploring longer-range features and their statistical modeling techniques for speaker recognition. In particular, we describe a system that uses discriminant features from cepstral coefficients, and systems that use discriminant models from word n-grams and syllable-based NERF n-grams. These systems together with a cepstral baseline system are evaluated on the 2004 NIST speaker recognition evaluation dataset. The effect of the development set is measured using two different datasets, one from Switchboard databases and another from the FISHER database. Results show that the difference between the development and evaluation sets affects the performance of the systems only when more training data is available. Results also show that systems using longer-range features combined with the baseline result in about a 31% improvement with 1-side training over the baseline system and about a 61% improvement with 8-side training over the baseline system.

The SRI NIST 2010 speaker recognition evaluation system

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011

The SRI speaker recognition system for the 2010 NIST speaker recognition evaluation (SRE) incorporates multiple subsystems with a variety of features and modeling techniques. We describe our strategy for this year's evaluation, from the use of speech recognition and speech segmentation to the individual system descriptions as well as the final combination. Our results show that under most conditions, the cepstral systems tend to perform the best, but that other, non-cepstral systems have the most complementarity. The combination of several subsystems with the use of adequate side information gives a 35% improvement on the standard telephone condition. We also show that a constrained cepstral system based on nasal syllables tends to be more robust to vocal effort variabilities.

The NIST Speaker Recognition Evaluations: 1996-2001

2001

We discuss the history and purposes of the NIST evaluations of speaker recognition performance. We cover the sites that have participated, the performance measures used, and the formats used to report results. We consider the extent to which there has been measurable progress over the years. In particular, we examine apparent performance improvements seen in the 2001 evaluation. Information for prospective participants is included.