On Line Vocal Tract Length Estimation for Speaker Normalization in Speech Recognition

A novel feature transformation for vocal tract length normalization in automatic speech recognition

IEEE Transactions on Speech and Audio Processing, 1998

This paper proposes a method to transform acoustic models that have been trained with a certain group of speakers for use on speech from a different group in hidden Markov model based (HMM-based) automatic speech recognition. Features are transformed on the basis of assumptions regarding the difference in vocal tract length between the groups of speakers. First, the vocal tract length (VTL) of these groups is estimated from the average third formant F3. Second, the linear acoustic theory of speech production is applied to warp the spectral characteristics of the existing models so as to match the incoming speech. The mapping is composed of a sequence of nonlinear submappings. By locally linearizing it and comparing results at the output, a linear approximation of the exact mapping was obtained, which is accurate as long as the warping is reasonably small. The feature vector, which is computed from a speech frame, consists of the mel-frequency cepstral coefficients (MFCC) along with delta and delta² cepstra as well as delta and delta² energy. The method has been tested on the TI digits database, which contains adult and children's speech and consists of isolated digits and digit strings of different lengths. The word error rate when training on adults and testing on children with transformed adult models is decreased by more than a factor of two compared to the nontransformed case.
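The F3-based length estimate can be illustrated with a minimal sketch. It assumes the textbook uniform lossless tube (closed at the glottis, open at the lips), whose n-th resonance is F_n = (2n - 1)c / (4L); the abstract does not state which tube model or speed of sound the authors use, so the function names, the constant and the direction of the ratio below are assumptions for illustration only.

```python
# Sketch only: uniform-tube approximation, not necessarily the paper's model.
SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound in warm, moist air


def vtl_from_f3(f3_hz, c=SPEED_OF_SOUND_CM_S):
    """Vocal tract length in cm from the third formant.

    For a uniform tube closed at one end, F_n = (2n - 1) * c / (4 * L),
    so for n = 3: L = 5 * c / (4 * F3).
    """
    return 5.0 * c / (4.0 * f3_hz)


def warp_factor(f3_train_hz, f3_test_hz):
    """Frequency scale factor that maps training-group formants onto
    test-group formants (equal to L_train / L_test under this model)."""
    return f3_test_hz / f3_train_hz
```

With the assumed constant, an average F3 of 2500 Hz corresponds to a 17.5 cm tract and 3125 Hz to 14 cm, giving a scale factor of 1.25 between the two groups. These formant values are made up for the example and are not taken from the paper.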

Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition

Generally speaking, the speaker dependence of a speech recognition system stems from speaker-dependent speech features. The variation of vocal tract length and/or shape is one of the major sources of inter-speaker variation. In this paper, we address several methods of vocal tract length normalization (VTLN) for large vocabulary continuous speech recognition: (1) we explore bilinear warping VTLN in the frequency domain; (2) we propose a speaker-specific Bark/Mel scale VTLN in the Bark/Mel domain; (3) we investigate adaptation of the normalization factor. Our experimental results show that the speaker-specific Bark/Mel scale VTLN is better than piecewise/bilinear warping VTLN in the frequency domain. It can reduce the word error rate by up to 12% on our Spanish and English spontaneous speech scheduling task database. For adaptation of the normalization factor, our experimental results show that promising results can be obtained by using no more than three utterances from a new speaker to estimate his/her normalization factor, and that the unsupervised adaptation mode works as well as the supervised one. Therefore, the computational complexity of VTLN can be avoided by learning the normalization factor from very few utterances of a new speaker.
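As an illustration of bilinear warping in the frequency domain, the sketch below uses the standard first-order all-pass form, in which a normalized frequency ω in [0, π] is mapped to ω + 2·atan(α·sin ω / (1 − α·cos ω)) with |α| < 1. The abstract does not give the exact parameterization used in the paper, so this formula and the name `bilinear_warp` are assumptions.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalized frequency omega (radians, 0..pi) with the
    first-order all-pass (bilinear) map. Requires |alpha| < 1.

    alpha = 0 is the identity; the endpoints 0 and pi stay fixed.
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy |alpha| < 1")
    # The denominator 1 - alpha*cos(omega) is positive for |alpha| < 1,
    # so atan2 returns the same value as atan of the ratio.
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))
```

A positive α moves interior frequencies upward and a negative α moves them downward, while the band edges are preserved, which is why a single parameter per speaker is enough for this family of warps.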

Region-Based Vocal Tract Length Normalization for ASR

Ninth Annual Conference of the …, 2008

In this paper, we propose a Region-based multi-parametric Vocal Tract Length Normalization (R-VTLN) algorithm for the problem of automatic speech recognition (ASR). The proposed algorithm extends the well-established mono-parametric utterance-based VTLN algorithm of Lee and Rose [1] by dividing the speech frames of a test utterance into regions and by independently warping the features corresponding to each region using a maximum likelihood criterion. We propose two algorithms for classifying frames into regions: (i) an unsupervised clustering algorithm based on spectral distance, and (ii) an unsupervised algorithm assigning frames to regions based on phonetic-class labels obtained from the first recognition pass. We also investigate the ability of various mono-parametric and multi-parametric warping functions to reduce the spectral distance between two speakers, as a function of phone. R-VTLN is shown to significantly outperform mono-parametric VTLN in terms of word accuracy on the AURORA4 database.
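The difference between one warping factor per utterance and one per region can be sketched as a maximum likelihood grid search run once per region. Everything here is hypothetical scaffolding: the grid of 13 factors from 0.88 to 1.12, the helper names, and in particular `toy_score`, which stands in for the HMM log-likelihood of warped features and treats each frame as a single number.

```python
from collections import defaultdict

# Assumed search grid: 0.88, 0.90, ..., 1.12.
WARP_GRID = [round(0.88 + 0.02 * i, 2) for i in range(13)]


def best_warp(frames, score_fn, grid=WARP_GRID):
    """Mono-parametric case: the grid value with the highest score."""
    return max(grid, key=lambda alpha: score_fn(frames, alpha))


def region_warps(frames, region_labels, score_fn, grid=WARP_GRID):
    """Multi-parametric case: an independent grid search per region."""
    by_region = defaultdict(list)
    for frame, label in zip(frames, region_labels):
        by_region[label].append(frame)
    return {label: best_warp(group, score_fn, grid)
            for label, group in by_region.items()}


def toy_score(frames, alpha, target=1000.0):
    """Stand-in for a model log-likelihood: each frame is one number and
    the 'model' prefers warped frames close to target."""
    return -sum((alpha * x - target) ** 2 for x in frames)
```

With toy frames `[1000 / 1.04] * 3 + [1000 / 0.96] * 3` and labels `[0, 0, 0, 1, 1, 1]`, the per-region search picks 1.04 for region 0 and 0.96 for region 1, whereas a single utterance-level search over all six frames settles on 1.0. In a real system the frames would be feature vectors and the score would come from the recognizer.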

An Approach to Vocal Tract Length Normalization by Robust Formants

International Conference on Circuits, 2010

The spectral pattern of the same phoneme can be quite different across individual speakers due to physical and linguistic differences. Unless an appropriate computational technique is applied to the frequency axis, this inter-speaker variation reduces modeling efficiency and results in poor recognition performance. In this paper, a formant-driven framework is proposed that modifies a formant pattern model in order to compute the normalization factor of a given speaker. Experiments on the GRID corpus clearly show the effectiveness of this method.

On Reducing Harmonic and Sampling Distortion in Vocal Tract Length Normalization

IEEE Transactions on Audio, Speech, and Language Processing, 2013

This paper proposes a novel feature-space VTLN (vocal tract length normalization) method that models frequency warping as a linear interpolation of contiguous Mel filter-bank energies. The presented technique aims to reduce the distortion in the Mel filter-bank energy estimation due to the harmonic composition of voiced speech intervals and DFT (discrete Fourier transform) sampling when the central frequency of band-pass filters is shifted. This paper also proposes an analytical maximum likelihood (ML) method to estimate the optimal warping factor in the cepstral space. The presented interpolated filter-bank energy-based VTLN leads to relative reductions in WER (word error rate) as high as 11.2% and 7.6% when compared with the baseline system and standard VTLN, respectively, in a medium-vocabulary continuous speech recognition task. Also, the proposed VTLN scheme can provide significant reductions in WER when compared with state-of-the-art VTLN methods based on linear transforms in the cepstral feature-space. The warping factor estimated with the proposed VTLN approach shows more dependence on the speaker and more independence of the acoustic-phonetic content than the warping factor resulting from standard and state-of-the-art VTLN methods. Finally, the analytical ML-based optimization scheme presented here achieves almost the same reductions in WER as the ML grid search version of the technique with a computational load 20 times lower.
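The idea of expressing a warp as linear interpolation between contiguous filter-bank energies can be illustrated generically. The sketch below resamples a vector of filter-bank energies at fractional channel positions m + shift; how the paper maps a warping factor to such positions is not given in the abstract, so the constant `shift`, the clamping at the edges and the function name are all assumptions.

```python
def interpolate_channels(energies, shift):
    """Resample filter-bank energies at fractional positions m + shift,
    interpolating linearly between the two neighbouring channels.

    Positions outside the bank are clamped to the first or last channel.
    """
    n = len(energies)
    out = []
    for m in range(n):
        pos = min(max(m + shift, 0.0), float(n - 1))  # clamp to valid range
        lo = int(pos)               # floor, since pos >= 0
        hi = min(lo + 1, n - 1)     # right neighbour, clamped
        frac = pos - lo
        out.append((1.0 - frac) * energies[lo] + frac * energies[hi])
    return out
```

Because the warped energies are built from already computed channel energies, the filters themselves are never re-centered, which is the property the abstract associates with lower harmonic and DFT sampling distortion.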

Speaker Normalization for Automatic Speech Recognition - An On-Line Approach

We propose a method to transform the on-line speech signal so as to comply with the specifications of an HMM-based automatic speech recognizer. The spectrum of the input signal undergoes a vocal tract length (VTL) normalization based on differences in the average third formant F3. The high-frequency gap that is generated after scaling is estimated by means of an extrapolation scheme. Mel-frequency cepstral coefficients (MFCC) are used along with delta and delta² cepstra as well as delta and delta² energy. The method has been tested on the TI digits database, which contains adult and children's speech, providing substantial gains with respect to non-normalized speech.

Evaluation of the Vocal Tract Length Normalization Based Classifiers for Speaker Verification

International Journal of Recent Contributions from Engineering, Science & IT (iJES), 2016

This paper proposes and evaluates classifiers based on Vocal Tract Length Normalization (VTLN) in a text-dependent speaker verification (SV) task with short testing utterances. This type of task is important in commercial applications and is not easily addressed with methods designed for long utterances such as JFA and i-Vectors. In contrast, VTLN is a speaker compensation scheme that can lead to significant improvements in speech recognition accuracy with just a few seconds of speech. A novel scheme to generate new classifiers is employed by incorporating the observation vector sequence compensated with VTLN. The modified sequence of feature vectors and the corresponding warping factors are used to generate classifiers whose scores are combined by a Support Vector Machine (SVM) based SV system. The proposed scheme can provide an average reduction in EER of 14% when compared with the baseline system based on the likelihood of observation vectors.

Bias Adaptation for Vocal Tract Length Normalization

Vocal tract length normalisation (VTLN) is a well-known rapid adaptation technique. Formulated as a linear transformation in the cepstral domain, VTLN yields scaling and translation factors. The warping factor represents the spectral scaling parameter, while the translation factor, represented by a bias term, captures more speaker characteristics, especially in a rapid adaptation framework, without the risk of over-fitting. This paper presents a complete and comprehensible derivation of the bias transformation for VTLN and implements it in a unified framework for statistical parametric speech synthesis and recognition. The recognition experiments show that the bias term improves rapid adaptation performance and gives additional gains over the cepstral mean normalisation factor. The synthesis results show that the VTLN bias term has little effect in combination with model adaptation techniques that already incorporate a bias transformation.
