Connectionist probability estimators in HMM speech recognition
Related papers
International Joint Conference on Neural Networks, 1994
In this paper we present a training method and a network architecture for estimating context-dependent observation probabilities in the framework of a hybrid hidden Markov model (HMM) / multilayer perceptron (MLP) speaker-independent continuous speech recognition system. The context-dependent modeling approach we present here computes the HMM context-dependent observation probabilities using a Bayesian factorization in terms of context-conditioned posterior phone probabilities…
Computer Speech & Language, 1994
In this paper we present a training method and a network architecture for estimating context-dependent observation probabilities in the framework of a hybrid hidden Markov model (HMM) / multilayer perceptron (MLP) speaker-independent continuous speech recognition system. The context-dependent modeling approach we present here computes the HMM context-dependent observation probabilities using a Bayesian factorization in terms of context-conditioned posterior phone probabilities, which are computed with a set of MLPs, one for every relevant context. The proposed network architecture shares the input-to-hidden layer among the set of context-dependent MLPs in order to reduce the number of independent parameters. Multiple states for phone models, with different context dependence for each state, are used to model the different context effects at the beginning and end of phonetic segments. A new training procedure that "smooths" networks with different degrees of context dependence is proposed to obtain a robust estimate of the context-dependent probabilities. We have used this new architecture to model generalized biphone phonetic contexts. Tests with the speaker-independent DARPA Resource Management database have shown average reductions in word error rates of 28% using a word-pair grammar, compared to our earlier context-independent HMM/MLP hybrid.
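As a rough illustration of how such context-conditioned MLP posteriors can be turned into HMM observation scores, the sketch below applies the usual scaled-likelihood conversion, dividing each posterior p(q | x, c) by the class prior p(q | c). The array names and shapes are illustrative assumptions, and the paper's full Bayesian factorization and smoothing procedure are not reproduced here.

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, eps=1e-10):
    """Convert context-conditioned MLP posteriors p(q | x, c) into scaled
    likelihoods proportional to p(x | q, c), i.e. p(q | x, c) / p(q | c),
    which can serve as HMM observation scores during Viterbi decoding.

    posteriors: (T, C, Q) array, one softmax output per frame and context
    priors:     (C, Q) array of class priors p(q | c) estimated from the
                training alignments
    """
    return posteriors / np.maximum(priors[None, :, :], eps)

# Toy usage: 5 frames, 3 generalized biphone contexts, 4 phone classes.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(4), size=(5, 3))      # fake MLP outputs
priors = rng.dirichlet(np.ones(4), size=3)         # fake priors p(q | c)
scores = np.log(scaled_likelihoods(post, priors))  # log-domain scores
print(scores.shape)                                # (5, 3, 4)
```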
A Hybrid Stochastic Connectionist Approach to Automatic Speech Recognition
Citeseer
This report focuses on a hybrid approach, combining stochastic and connectionist methods, for continuous speech recognition. Hidden Markov Models (HMMs) are a popular stochastic approach used for continuous speech, well suited to cope with the high variability found in natural utterances. On the other hand, artificial neural networks (NNs) have shown high classification power for short speech utterances. Therefore, we have built a hybrid system with the advantages of both Hidden Markov Models and Neural Networks. The basic idea is as follows: build a codebook from the Time-Delay Neural Network (TDNN) output units and train HMMs using the Fuzzy-VQ algorithm.
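A minimal sketch of the codebook idea above, with the assumptions made explicit: the TDNN is replaced by a stand-in function returning per-frame output activations, the codebook is built with ordinary k-means, and the fuzzy memberships are approximated by a softmax over negative distances rather than the paper's Fuzzy-VQ algorithm. The resulting membership vectors would then feed a discrete or semi-continuous HMM.

```python
import numpy as np
from sklearn.cluster import KMeans

def tdnn_outputs(frames):
    """Stand-in for the TDNN forward pass: one activation vector per frame
    (here just a fixed random projection with a tanh nonlinearity)."""
    rng = np.random.default_rng(1)
    W = rng.standard_normal((frames.shape[1], 8))
    return np.tanh(frames @ W)

def fuzzy_memberships(acts, codebook, beta=2.0):
    """Soft codeword memberships via a softmax over negative distances
    (a stand-in for the Fuzzy-VQ membership computation)."""
    d = np.linalg.norm(acts[:, None, :] - codebook[None, :, :], axis=-1)
    logits = -beta * d
    logits -= logits.max(axis=1, keepdims=True)
    m = np.exp(logits)
    return m / m.sum(axis=1, keepdims=True)

frames = np.random.default_rng(2).standard_normal((200, 13))   # fake features
acts = tdnn_outputs(frames)
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(acts).cluster_centers_
memberships = fuzzy_memberships(acts, codebook)   # (200, 16), rows sum to 1
```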
Hidden Markov models and neural networks for speech recognition
1998
The Hidden Markov Model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and to evaluate this model on a number of standard speech recognition tasks. This has resulted in a hybrid called a Hidden Neural Network (HNN), in which the HMM emission and transition probabilities are replaced by the outputs of state-specific neural networks. The HNN framework is characterized by:

Discriminative training: HMMs are commonly trained by the Maximum Likelihood (ML) criterion to model within-class data distributions. As opposed to this, the HNN is trained by the Conditional Maximum Likelihood (CML) criterion to discriminate between different classes. CML training is in this work implemented by a gradient descent algorithm in which the neural networks are updated by backpropagation of errors calculated by a modified version of the forward-backward algorithm for HMMs.

Global normalization: A valid probabilistic interpretation of the HNN is ensured by normalizing the model globally at the sequence level during CML training. This differs from the local normalization of probabilities enforced at the state level in standard HMMs.

Flexibility: The global normalization makes the HNN architecture very flexible. Any combination of neural-network-estimated parameters and standard HMM parameters can be used. Furthermore, the global normalization of the HNN gives a large freedom in selecting the architecture and output functions of the neural networks.

Postscript files of this thesis and all the above listed papers can be downloaded from the WWW server at the Section for Digital Signal Processing. The papers relevant to the work described in this thesis are furthermore included in appendices C-F of the thesis.

Acknowledgments: At this point I would like to thank Steffen Duus Hansen and Anders Krogh for their supervision of my Ph.D. project. Especially I wish to express my gratitude to Anders Krogh for the guidance, encouragement and friendship that he managed to extend to me during our almost five years of collaboration. Even during his stay at the Sanger Centre in Cambridge he managed to guide me through the project by always responding to my emails and telephone calls and by inviting me to visit him. Anders' scientific integrity, great intuition, ambition and pleasant company have earned him my respect. Without his encouragement and optimistic faith in this work it might never have come to an end. The staff and Ph.D. students at the Section for Digital Signal Processing are thanked for creating a very pleasant research environment and for the many joyful moments at the office and during conference trips.
Thanks also to Mogens Dyrdahl and everybody else involved in maintaining the excellent computing facilities which were crucial for carrying out my research. The Center for Biological Sequence Analysis is also acknowledged for providing CPU time which made some of the computationally intensive evaluations possible. Similarly, Peter Toft is thanked for teaching me to master the force of Linux. I sincerely wish to express my gratitude to Steve Renals for inviting me to work at the Department of Computer Science, University of Sheffield from February to July 1997. It was a very pleasant and rewarding stay. The Ph.D. students and staff at the Department of Computer Science are acknowledged for their great hospitality and for creating a pleasant research atmosphere. I'm especially grateful to Gethin Williams for the many discussions on hybrid speech recognizers and for proofreading large parts of this thesis. I'm indebted to Gethin for his many valuable comments and suggestions to improve this manuscript. Morten With Pedersen and Kirsten Pedersen are also acknowledged for their comments and suggestions on this manuscript. Morten is furthermore thanked for the many fruitful discussions we've had and for his pleasant company at the office during the years. The speech group, and in particular Christophe Ris, at the Circuit Theory and Signal Processing Lab (TCTS), Faculté Polytechnique de Mons is acknowledged for providing data necessary to carry out the experiments presented in chapter 9 of this thesis. The Technical University of Denmark is acknowledged for allowing me the opportunity of doing this work. Otto Mønsteds foundation and Valdemar Selmer Trane og Hustru Elisa Trane's foundation are acknowledged for financial support of travel activities. Last but not least I thank my family and friends for their support, love and care during the Ph.D. study. A special heartfelt thanks goes to my wife and little daughter who helped me maintain my sanity during the study, as I felt myself drowning in ambitions. Without their support this work would not have been possible.
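To make the "global normalization" idea in the HNN entry above concrete, here is a small NumPy sketch (not code from the thesis): network outputs are treated as unnormalized per-frame, per-state log scores, the forward recursion is run once over all state sequences (the free phase) and once constrained to a frame-level state alignment (the clamped phase), and the CML objective is the difference of the two log sums. The variable names, the frame-level labels, and the uniform transition matrix are assumptions for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(log_scores, log_trans):
    """Log-sum over all state paths of unnormalized scores.
    log_scores: (T, S) per-frame network scores, log_trans: (S, S)."""
    alpha = log_scores[0].copy()
    for t in range(1, len(log_scores)):
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_scores[t]
    return logsumexp(alpha)

def cml_log_prob(log_scores, log_trans, labels):
    """log P(label sequence | observations) under global normalization:
    clamped forward pass (states masked to the labels) minus free pass."""
    clamped = np.full_like(log_scores, -np.inf)
    t = np.arange(len(labels))
    clamped[t, labels] = log_scores[t, labels]
    return log_forward(clamped, log_trans) - log_forward(log_scores, log_trans)

# Toy usage: 6 frames, 3 states, a frame-level state alignment as "labels".
rng = np.random.default_rng(0)
log_scores = rng.standard_normal((6, 3))          # unnormalized net outputs
log_trans = np.log(np.full((3, 3), 1.0 / 3.0))    # uniform transitions
labels = np.array([0, 0, 1, 1, 2, 2])
print(cml_log_prob(log_scores, log_trans, labels))  # always <= 0
```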
Combining Neural Networks And Hidden Markov Models For Continuous Speech Recognition
1992
We present a speaker-independent, continuous-speech recognition system based on a hybrid multilayer perceptron (MLP)/hidden Markov model (HMM). The system combines the advantages of both approaches by using MLPs to estimate the state-dependent observation probabilities of an HMM. New MLP architectures and training procedures are presented that allow the modeling of multiple distributions for phonetic classes and context-dependent phonetic classes. Comparisons with a pure HMM system...
The Application of Hidden Markov Models in Speech Recognition
Foundations and Trends® in Signal Processing, 2007
Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.
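Since the entry above concerns why HMMs dominate LVCSR, a compact reminder of the core computation may help: the Viterbi recursion below finds the most likely state sequence given per-frame log observation scores and log transition probabilities. This is a generic textbook sketch, not code from the cited survey.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely HMM state sequence.
    log_obs: (T, S) log p(x_t | state), log_trans: (S, S), log_init: (S,)."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans          # (previous, next) scores
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_obs[t]
    states = np.zeros(T, dtype=int)
    states[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):                 # trace back the best path
        states[t] = back[t + 1, states[t + 1]]
    return states, delta.max()

# Toy usage: 3 states, 5 frames of made-up log observation scores.
rng = np.random.default_rng(0)
path, score = viterbi(rng.standard_normal((5, 3)),
                      np.log(np.full((3, 3), 1 / 3)),
                      np.log(np.full(3, 1 / 3)))
print(path, score)
```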
Hybrid Neural Network/hidden Markov Model Continuous-Speech Recognition
1992
In this paper we present a hybrid multilayer perceptron (MLP)/hidden Markov model (HMM) speaker-independent continuous-speech recognition system, in which the advantages of both approaches are combined by using MLPs to estimate the state-dependent observation probabilities of an HMM. New MLP architectures and training procedures are presented which allow the modeling of multiple distributions for phonetic classes and context-dependent phonetic classes. Comparisons with a pure HMM system...
Neural Networks for Signal Processing Proceedings of the 1991 IEEE Workshop
A speech recognizer is developed using a layered neural network to implement speech-frame prediction and using a Markov chain to modulate the network's weight parameters. We postulate that speech recognition accuracy is closely linked to the capability of the predictive model in representing long-term temporal correlations in data. Analytical expressions are obtained for the correlation functions for various types of predictive models (linear, nonlinear, and jointly linear and nonlinear) in order to determine the faithfulness of the models to the actual speech data. The analytical results, computer simulations, and speech recognition experiments suggest that when nonlinear and linear prediction are jointly performed within the same layer of the neural network, the model is better able to capture long-term data correlations and consequently improve speech recognition performance.
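The abstract above hinges on jointly performing linear and nonlinear prediction on the same context frames. The sketch below shows one plausible form of such a predictor: the next frame is predicted as the sum of a linear autoregressive term and a small tanh-network term over the same context, and the negative squared prediction error can serve as a per-state score for the modulating Markov chain. The shapes, the tanh choice, and the single-hidden-layer form are assumptions for illustration, not the exact model of the paper.

```python
import numpy as np

class JointPredictor:
    """Predict frame x_t from the previous p frames using a linear term plus
    a nonlinear (one-hidden-layer tanh) term on the same context; one such
    predictor would be attached to each state of the Markov chain."""

    def __init__(self, dim, p=2, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.p = p
        self.A = rng.standard_normal((p * dim, dim)) * 0.1    # linear weights
        self.W1 = rng.standard_normal((p * dim, hidden)) * 0.1
        self.W2 = rng.standard_normal((hidden, dim)) * 0.1

    def predict(self, context):
        """context: (p, dim) previous frames, flattened for both terms."""
        z = context.reshape(-1)
        return z @ self.A + np.tanh(z @ self.W1) @ self.W2

    def log_score(self, context, target, var=1.0):
        """Negative squared prediction error as a (Gaussian) state log-score."""
        err = target - self.predict(context)
        return -0.5 * np.sum(err * err) / var

# Toy usage on made-up 13-dimensional feature frames.
frames = np.random.default_rng(1).standard_normal((10, 13))
pred = JointPredictor(dim=13, p=2)
print(pred.log_score(frames[3:5], frames[5]))
```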