A speaker-independent continuous speech recognition system using continuous mixture Gaussian density HMM of phoneme-sized units

Speech Recognition Using Monophone and Triphone Based Continuous Density Hidden Markov Models

Abstract. Speech recognition is the process of transcribing speech to text. Phoneme-based modeling is used, wherein each phoneme is represented by a continuous density hidden Markov model (HMM). Mel frequency cepstral coefficients (MFCC) are extracted from the speech signal, and delta and double-delta features representing the temporal rate of change of the features are added, which considerably improves recognition accuracy. Each phoneme is represented by a three-state HMM, with each state modeled by a continuous density Gaussian model. Because a single-mixture Gaussian model does not represent the distribution of feature vectors well, mixture splitting is performed successively in stages up to eight Gaussian mixture components. The multi-Gaussian monophone models so generated do not capture all the variations of a phone with respect to its context, so context-dependent triphone models are built and their states are tied using decision-tree-based clustering. It is observed that recognition accuracy increases as the number of mixture components is increased, and that tied-state triphone-based HMMs work well for large vocabularies. The TIMIT Acoustic-Phonetic Continuous Speech Corpus is used for implementation. Recognition accuracy is also tested on our own recorded speech.
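The delta and double-delta features mentioned in this abstract are commonly computed with a regression over neighbouring frames, d_t = Σ n (c_{t+n} − c_{t−n}) / (2 Σ n²). The following is a minimal sketch of that widely used formula, not the authors' implementation: the function names, the window of 2 frames, the edge handling by frame replication and the toy input are all assumptions made for illustration.

```python
def delta(frames, window=2):
    """Regression-based delta coefficients for a list of feature vectors.

    `window` (assumed value: 2) is the number of frames used on each side.
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                # Clamp indices so the first and last frames are replicated.
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


def add_dynamic_features(static):
    """Append delta and double-delta coefficients to each static frame."""
    d1 = delta(static)   # first-order rate of change
    d2 = delta(d1)       # rate of change of the deltas
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy stand-in for cepstra: one coefficient rising by 1.0 per frame.
static = [[float(t)] for t in range(6)]
features = add_dynamic_features(static)
```

Each output frame is the static vector followed by its deltas and double-deltas, so the feature dimension triples. For the linear ramp above, interior frames have a delta of 1.0, as expected for a slope of one unit per frame.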

Specifics of Hidden Markov Model Modifications for Large Vocabulary Continuous Speech Recognition

2003

Abstract. Specifics of hidden Markov model-based speech recognition are investigated. The influence of modeling simple and context-dependent phones, of using simple Gaussian and two- and three-component Gaussian mixture probability density functions for modeling the feature distribution, and of incorporating a language model is discussed. Word recognition rates and model complexity criteria are used to evaluate the suitability of these modifications for practical applications. The development of a large vocabulary continuous speech recognition system using the HTK toolkit and the WSJCAM0 English speech corpus is described. Results of experimental investigations are presented. Key words: large vocabulary continuous speech recognition, hidden Markov model, Viterbi

A General Method for Combining Acoustic Features in an Automatic Speech Recognition System

2006

A general method for the use of different types of features in Automatic Speech Recognition (ASR) systems is presented. A Gaussian mixture model (GMM) is obtained in a reference acoustic space. A specific feature combination or selection is associated with each Gaussian of the mixture and used for computing symbol posterior probabilities. Symbols can refer to phonemes, phonemes in context or states of a Hidden Markov Model (HMM). Experimental results are presented for applications to phoneme and word rescoring after verification. Two corpora were used, one with small vocabularies in Italian and Spanish and one with a very large vocabulary in French.

IMPROVED HYBRID MODEL OF HMM/GMM FOR SPEECH RECOGNITION

2008

In this paper, we propose a speech recognition engine using a hybrid model of Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM). The two models are trained independently, and their respective likelihood values are considered jointly and passed to a decision logic that outputs a net likelihood. This hybrid model is compared with the HMM alone. Training and testing were done using a database of 20 Hindi words spoken by 80 different speakers. The recognition rate achieved by the plain HMM is 83.5%, which increases to 85% with the hybrid HMM/GMM approach.

A Hidden Markov Model-Based Speech Recognition System Using Baum-Welch, Forward-Backward and Viterbi Algorithms

Speech is among the most complex components of human intelligence, which makes speech signal processing very important. The variability of speech is very high, and this makes speech recognition difficult. Other factors such as dialect, speech duration, context dependency, speaking rate, speaker differences, environment and locality all add to the difficulty of speech processing. The absence of distinct boundaries between tones or words causes additional problems. Speech has speaker-dependent characteristics, so no one can reproduce a phrase in exactly the same way as another speaker. Nevertheless, a speech recognition system should be able to model and recognize the same words and phrases consistently. Digital signal processors (DSPs) are often used in speech signal processing systems to handle these complexities. This paper presents a Hidden Markov Model (HMM) based speech signal modulation through the application of the Baum-Welch, Forward-Backward and Viterbi algorithms. The system was implemented on a 16-bit floating point DSP (TMS320C6701) from Texas Instruments, and the vocabulary was trained using the Microsoft Hidden Markov Model Toolkit (HTK). The proposed system achieved about 79% correct word recognition, which represents approximately 11,804 words correctly recognized out of a total of 14,960. This result indicates that the proposed speaker-independent system has a very good evaluation score, and thus can be used to aid dictation for speech-impaired persons and in real-time applications with a 10 ms data exchange rate.
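Of the three algorithms named in this abstract, Viterbi decoding is the one used at recognition time to find the most likely state sequence. The sketch below illustrates it on a toy discrete HMM in the log domain. It is only an illustration under stated assumptions: the two states, three observation symbols and all probabilities are invented values, and the code is unrelated to the DSP or HTK implementation the abstract describes.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM."""
    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximises the path score.
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical toy model (all numbers are illustrative assumptions).
states = ["s1", "s2"]
log_start = to_log({"s1": 0.6, "s2": 0.4})
log_trans = to_log({"s1": {"s1": 0.7, "s2": 0.3},
                    "s2": {"s1": 0.4, "s2": 0.6}})
log_emit = to_log({"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
                   "s2": {"a": 0.1, "b": 0.3, "c": 0.6}})

log_prob, path = viterbi(["a", "b", "c"], states, log_start, log_trans, log_emit)
```

Working in log-probabilities replaces products with sums, which avoids numerical underflow on long utterances. For the toy model, the best path is s1, s1, s2 with probability 0.6 × 0.5 × 0.7 × 0.4 × 0.3 × 0.6 = 0.01512.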

Continuous Density Hidden Markov Model for Hindi Speech Recognition

State-of-the-art automatic speech recognition systems use Mel frequency cepstral coefficients for feature extraction along with Gaussian mixture models for acoustic modeling, but there is no standard value for the number of mixture components. The current choice of the number of mixture components is arbitrary, with little justification. Also, the standard settings for European languages cannot be used for Hindi speech recognition because of the mismatch in database size between the languages. Parameter estimation with too many or too few components may estimate the mixture model inappropriately. Therefore, the number of mixtures is important for the initial estimation in the expectation maximization process. In this research work, we estimate the number of Gaussian mixture components for a Hindi database based on the size of the vocabulary. Mel frequency cepstral features and perceptual linear predictive features, along with their extended variations with delta-delta-delta features, have been used to evaluate this number based on the optimal recognition score of the system. A comparative analysis of recognition performance for both feature extraction methods on a medium-size Hindi database is also presented in this paper. HLDA has been used as a feature reduction technique, and its impact on the recognition score is also highlighted.
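For context, the quantity whose component count is being tuned is the per-frame state likelihood, log Σ_m w_m N(x; μ_m, Σ_m). The sketch below shows how the number of components M enters that sum. It is a generic illustration that assumes diagonal covariances, a common choice the abstract does not confirm, and the function names and example parameters are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under an M-component diagonal-covariance GMM."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp: subtract the largest term before exponentiating.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component model over 2-dimensional features.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
frame_score = gmm_log_likelihood([0.2, -0.4], weights, means, variances)
```

Every added component contributes a weight, a mean vector and a variance vector, so the parameter count grows linearly with M. That growth is why a database that is too small for the chosen M can lead to poorly estimated mixtures.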

The Application of Hidden Markov Models in Speech Recognition

Foundations and Trends® in Signal Processing, 2007

Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.

Hidden Markov models (HMMs) isolated word recognizer with the optimization of acoustical analysis and modeling techniques

International Journal of Physical Sciences, 2011

Most state-of-the-art automatic speech recognition (ASR) systems are based on continuous Hidden Markov Models (HMMs) as the acoustic modeling technique. It has been shown that the performance of HMM speech recognizers may be degraded by a bad choice of acoustic feature parameters in the acoustic front-end module. For these reasons, we propose in this paper a dedicated isolated word recognition system based on HMMs, carefully optimized at the acoustic analysis and HMM acoustic modeling levels. The design was tested and evaluated on the Hidden Markov Model Toolkit (HTK) platform. System performance was evaluated using the TIMIT database. A comparative study was carried out using two types of speech analysis: the cepstral method, referred to as Mel frequency cepstral coefficients (MFCC), and perceptual linear predictive (PLP) coding were used in different tests to evaluate and reinforce our design. The effect of the frame shift duration in the acoustic analysis, as well as the addition of dynamic coefficients to the acoustic parameters (MFCC and PLP), was carefully tested in order to achieve high accuracy for our optimized isolated word recognition (IWR) system. Finally, various experiments related to the HMM topology were carried out to obtain better recognition accuracy. In particular, the effect of HMM modeling parameters such as the number of states and the number of Gaussian mixtures on the recognition accuracy of the IWR system was analyzed in order to find the optimal HMM topology.

Large Vocabulary in Continuous Speech Recognition Using HMM and Normal Fit

This paper addresses the problem of large vocabulary, speaker-independent continuous speech recognition using phonemes, the Hidden Markov Model (HMM) and the Normal fit method. We first detect the voiced part of the speech signal by computing a dynamic threshold in each frame. Real cepstrum coefficients are extracted as features from the voiced frames. The Baum-Welch algorithm is applied to train on those features. The normal fit technique is then applied, and the output values are labelled with the corresponding phoneme or syllable. The model is tested on five languages, namely English, Kannada, Hindi, Tamil and Telugu. The automatic segmentation of speech signals has an average accuracy rate of 95.42% and a miss rate of about 4.58%. For the large vocabulary, the average Word Recognition Rate (WRR) is 85.16% and the average Word Error Rate (WER) is 14.84%. All computations are done using MATLAB.
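The reported WRR and WER sum to exactly 100%, which suggests WRR is taken as 100 − WER here. A common way to compute WER is the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below shows that standard definition. Whether the authors scored their system this way is not stated, so treat the function name and the example sentences as assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


# Illustrative sentences: one substitution and one deletion in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
wrr = 100.0 - wer
```

Because insertions are counted, WER computed this way can exceed 100% on very poor hypotheses, in which case 100 − WER is negative. The abstract's figures are far from that regime.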

Mixture of Support Vector Machines for HMM based Speech Recognition

18th International Conference on Pattern Recognition (ICPR'06), 2006

Models (HMMs), which represent the temporal dynamics of speech very efficiently, and Gaussian mixture models, which classify speech into single speech units (phonemes) non-optimally. In this paper we use parallel mixtures of Support Vector Machines (SVMs) for classification by integrating this method into an HMM-based speech recognition system. SVMs are very appealing due to their association with statistical learning theory and have already shown good results in pattern recognition and in continuous speech recognition. They suffer, however, from a training effort that scales at least quadratically with the number of training vectors. The SVM mixtures need only nearly linear training time, making it easier to deal with large amounts of speech data. In our hybrid system we use the SVM mixtures as acoustic models in an HMM-based decoder. We train and test the hybrid system on the DARPA Resource Management (RM1) corpus, showing better performance than an HMM-based decoder using Gaussian mixtures.