Combining Neural Networks And Hidden Markov Models For Continuous Speech Recognition
Hybrid Neural Network/hidden Markov Model Continuous-Speech Recognition
1992
In this paper we present a hybrid multilayer perceptron (MLP)/hidden Markov model (HMM) speaker-independent continuous-speech recognition system, in which the advantages of both approaches are combined by using MLPs to estimate the state-dependent observation probabilities of an HMM. New MLP architectures and training procedures are presented which allow the modeling of multiple distributions for phonetic classes and context-dependent phonetic classes. Comparisons with a pure HMM system...
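The core hybrid idea in this abstract, using MLP outputs as HMM observation probabilities, is usually realized as the scaled-likelihood trick. The sketch below is a minimal illustration of that trick, not the paper's actual code; the function name and the toy numbers are invented for the example.

```python
import numpy as np

# Illustrative only: an MLP frame classifier outputs posteriors P(q | x);
# dividing by the class priors P(q) gives values proportional to the
# likelihoods p(x | q), which can stand in for the HMM emission densities.
def scaled_likelihoods(posteriors, priors, floor=1e-10):
    """posteriors: (T, Q) MLP outputs per frame; priors: (Q,) state priors."""
    return posteriors / np.maximum(priors, floor)

# toy example: 3 frames, 2 phone states (numbers invented)
post = np.array([[0.9, 0.1],
                 [0.6, 0.4],
                 [0.2, 0.8]])
priors = np.array([0.7, 0.3])      # as if estimated from training alignments
sl = scaled_likelihoods(post, priors)
print(sl.shape)   # (3, 2): one scaled likelihood per frame and state
```

A Viterbi or forward pass over these scaled likelihoods then proceeds exactly as in a conventional HMM decoder.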
Computer Speech & Language, 1994
In this paper we present a training method and a network architecture for estimating context-dependent observation probabilities in the framework of a hybrid hidden Markov model (HMM)/multilayer perceptron (MLP) speaker-independent continuous speech recognition system. The context-dependent modeling approach we present here computes the HMM context-dependent observation probabilities using a Bayesian factorization in terms of context-conditioned posterior phone probabilities which are computed with a set of MLPs, one for every relevant context. The proposed network architecture shares the input-to-hidden layer among the set of context-dependent MLPs in order to reduce the number of independent parameters. Multiple states for phone models with different context dependence for each state are used to model the different context effects at the beginning and end of phonetic segments. A new training procedure that "smooths" networks with different degrees of context dependence is proposed to obtain a robust estimate of the context-dependent probabilities. We have used this new architecture to model generalized biphone phonetic contexts. Tests with the speaker-independent DARPA Resource Management database have shown average reductions in word error rates of 28% using a word-pair grammar, compared to our earlier context-independent HMM/MLP hybrid.
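The parameter-sharing scheme this abstract describes, one input-to-hidden layer shared by all context-dependent MLPs with a separate output layer per context, can be sketched as below. This is an illustration only, assuming tanh hidden units and softmax outputs; the dimensions and random weights are invented, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Invented dimensions for illustration: 10-dim input, 16 hidden units,
# 5 phone classes, 3 context classes.
D, H, Q, N_CTX = 10, 16, 5, 3
W_shared = rng.normal(size=(D, H))                       # shared input-to-hidden layer
W_ctx = [rng.normal(size=(H, Q)) for _ in range(N_CTX)]  # one output layer per context

def context_posteriors(x, c):
    """P(q | x, context=c): shared hidden layer, context-specific output layer."""
    h = np.tanh(x @ W_shared)
    return softmax(h @ W_ctx[c])

x = rng.normal(size=D)
p = context_posteriors(x, 1)
print(p.sum())   # posteriors over the phone classes sum to 1
```

The shared layer means only the small context-specific output layers multiply with the number of contexts, which is what keeps the parameter count manageable.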
Hidden Markov models and neural networks for speech recognition
1998
Hidden Markov models (HMMs) are one of the most successful modeling approaches for acoustic events in speech recognition, and more recently they have proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial neural networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and to evaluate this model on a number of standard speech recognition tasks. This has resulted in a hybrid called a hidden neural network (HNN), in which the HMM emission and transition probabilities are replaced by the outputs of state-specific neural networks. The HNN framework is characterized by: Discriminative training: HMMs are commonly trained by the maximum likelihood (ML) criterion to model within-class data distributions. As opposed to this, the HNN is trained by the conditional maximum likelihood (CML) criterion to discriminate between different classes. CML training is in this work implemented by a gradient descent algorithm in which the neural networks are updated by backpropagation of errors calculated by a modified version of the forward-backward algorithm for HMMs. Global normalization: A valid probabilistic interpretation of the HNN is ensured by normalizing the model globally at the sequence level during CML training.
This is different from the local normalization of probabilities enforced at the state level in standard HMMs. Flexibility: The global normalization makes the HNN architecture very flexible. Any combination of neural-network-estimated parameters and standard HMM parameters can be used. Furthermore, the global normalization of the HNN gives a large freedom in selecting the architecture and output functions of the neural networks. Postscript files of this thesis and all the above listed papers can be downloaded from the WWW-server at the Section for Digital Signal Processing. The papers relevant to the work described in this thesis are furthermore included in appendix C-F of this thesis. Acknowledgments: At this point I would like to thank Steffen Duus Hansen and Anders Krogh for their supervision of my Ph.D. project. Especially I wish to express my gratitude to Anders Krogh for the guidance, encouragement and friendship that he managed to extend to me during our almost five years of collaboration. Even during his stay at the Sanger Centre in Cambridge he managed to guide me through the project by always responding to my emails and telephone calls and by inviting me to visit him. Anders' scientific integrity, great intuition, ambition and pleasant company have earned him my respect. Without his encouragement and optimistic faith in this work it might never have come to an end. The staff and Ph.D. students at the Section for Digital Signal Processing are thanked for creating a very pleasant research environment and for the many joyful moments at the office and during conference trips. Thanks also to Mogens Dyrdahl and everybody else involved in maintaining the excellent computing facilities which were crucial for carrying out my research. Center for Biological Sequence Analysis is also acknowledged for providing CPU time which made some of the computationally intensive evaluations possible. Similarly, Peter Toft is thanked for teaching me to master the force of Linux.
I sincerely wish to express my gratitude to Steve Renals for inviting me to work at the Department of Computer Science, University of Sheffield from February to July 1997. It was a very pleasant and rewarding stay. The Ph.D. students and staff at the Department of Computer Science are acknowledged for their great hospitality and for creating a pleasant research atmosphere. I'm especially grateful to Gethin Williams for the many discussions on hybrid speech recognizers and for proofreading large parts of this thesis. I'm indebted to Gethin for his many valuable comments and suggestions to improve this manuscript. Morten With Pedersen and Kirsten Pedersen are also acknowledged for their comments and suggestions on this manuscript. Morten is furthermore thanked for the many fruitful discussions we've had and for his pleasant company at the office during the years. The speech group, and in particular Christophe Ris, at the Circuit Theory and Signal Processing Lab (TCTS), Faculté Polytechnique de Mons is acknowledged for providing data necessary to carry out the experiments presented in chapter 9 of this thesis. The Technical University of Denmark is acknowledged for allowing me the opportunity of doing this work. Otto Mønsteds foundation and Valdemar Selmer Trane og Hustru Elisa Trane's foundation are acknowledged for financial support for travel activities. Last but not least I thank my family and friends for their support, love and care during the Ph.D. study. A special heartfelt thanks goes to my wife and little daughter who helped me maintain my sanity during the study, as I felt myself drowning in ambitions. Without their support this work would not have been possible.
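The global normalization that distinguishes the HNN from a standard HMM can be sketched as follows. This is an illustrative toy, not the thesis code, and all the numbers are invented: per-frame state scores and transition weights are left unnormalized, and a label sequence's conditional probability is the forward score of the label-consistent paths divided by the forward score over all paths.

```python
import numpy as np

def forward_score(scores, trans):
    """Sum of unnormalized path scores.
    scores: (T, Q) per-frame state scores; trans: (Q, Q) transition weights."""
    alpha = scores[0].copy()
    for t in range(1, len(scores)):
        alpha = (alpha @ trans) * scores[t]
    return alpha.sum()

# Toy 2-state model over 3 frames; nothing here needs to sum to 1 locally.
scores = np.array([[2.0, 0.5],
                   [1.0, 3.0],
                   [0.7, 1.2]])
trans = np.array([[1.0, 0.8],
                  [0.2, 1.5]])

# Conditional probability of a label sequence: restrict the lattice to paths
# consistent with the labels, then normalize once over all paths.
labels = np.array([0, 1, 1])
clamped = np.zeros_like(scores)
clamped[np.arange(3), labels] = scores[np.arange(3), labels]
p = forward_score(clamped, trans) / forward_score(scores, trans)
print(p)   # a value in (0, 1): globally normalized conditional probability
```

Maximizing `p` over a training set is the CML criterion the abstract describes; the gradient with respect to the scores comes from a forward-backward pass over both lattices.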
Continuous speech recognition using hidden Markov models
IEEE Assp Magazine, 1990
Stochastic signal processing techniques have profoundly changed our perspective on speech processing. We have witnessed a progression from heuristic algorithms to detailed statistical approaches based on iterative analysis techniques. Markov modeling provides a mathematically rigorous approach to developing robust statistical signal models. Since the introduction of Markov models to speech processing in the middle 1970s, continuous speech recognition technology has come of age. Dramatic advances have been made in characterizing the temporal and spectral evolution of the speech signal. At the same time, our appreciation of the need to explain complex acoustic manifestations by integration of application constraints into low-level signal processing has grown. In this paper, we review the use of Markov models in continuous speech recognition. Markov models are presented as a generalization of their predecessor technology, dynamic programming. A unified view is offered in which both linguistic decoding and acoustic matching are integrated into a single optimal network search framework.
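The single optimal network search this abstract refers to rests on dynamic programming. A minimal Viterbi decoder in log space might look like the sketch below; the two-state model and its numbers are invented purely for the example.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely state path. log_emit: (T, Q) log emission scores."""
    T, Q = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, Q), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans   # cand[i, j]: best score ending in j via i
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # trace the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy 2-state left-to-right model: start in state 0, late frames favour state 1
log_emit = np.log(np.array([[0.9, 0.1],
                            [0.5, 0.5],
                            [0.1, 0.9]]))
log_trans = np.log(np.array([[0.7, 0.3],
                             [1e-9, 1.0]]))   # no transitions back to state 0
log_init = np.log(np.array([1.0, 1e-9]))
print(viterbi(log_emit, log_trans, log_init))   # -> [0, 1, 1]
```

In a full recognizer the same recursion runs over a network that composes phone models with the lexicon and grammar, which is exactly the unified search the paper describes.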
Speech recognition using hybrid hidden Markov model and NN classifier
International Journal of Speech Technology, 1998
This paper discusses the use of an integrated HMM/NN classifier for speech recognition. The proposed classifier combines the time-normalization property of the HMM classifier with the superior discriminative ability of the neural net (NN) classifier. Speech signals display strongly time-varying characteristics. Although the neural net has been successful in many classification problems, its success in the field of speech recognition has been secondary to that of the HMM. The main reason is the lack of time-normalization characteristics in most neural net structures (the time-delay neural net is one notable exception, but its structure is very complex). In the proposed integrated hybrid HMM/NN classifier, a left-to-right HMM module is first used to segment the observation sequence of every exemplar into a fixed number of states. Subsequently, all the frames belonging to the same state are replaced by one average frame. Thus, every exemplar, irrespective of its time-scale variation, is transformed into a fixed number of frames, i.e., a static pattern. The multilayer perceptron (MLP) neural net is then used as the classifier for these time-normalized exemplars. Some experimental results using telephone speech databases are presented to demonstrate the potential of this hybrid integrated classifier.
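The state-averaging step described above is simple to sketch. In the toy below the frame-to-state alignment is supplied by hand for illustration (in the paper it comes from the left-to-right HMM segmentation), and the feature values are invented.

```python
import numpy as np

def time_normalize(frames, state_of_frame, n_states):
    """Replace all frames aligned to the same HMM state by their average,
    yielding a fixed-size pattern regardless of the utterance length."""
    frames = np.asarray(frames, dtype=float)
    out = np.zeros((n_states, frames.shape[1]))
    for s in range(n_states):
        out[s] = frames[state_of_frame == s].mean(axis=0)
    return out.ravel()   # flattened static pattern for the MLP classifier

# toy: 5 frames of 2-dim features, aligned to 3 left-to-right states
frames = np.array([[1., 1.], [3., 3.], [2., 0.], [4., 4.], [6., 6.]])
align = np.array([0, 0, 1, 2, 2])
pattern = time_normalize(frames, align, 3)
print(pattern)   # -> [2. 2. 2. 0. 5. 5.]
```

Every utterance thus maps to the same input dimensionality (here 3 states × 2 features), which is what lets an ordinary MLP classify it.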
Multiple-State Context-Dependent Phonetic Modeling with MLP
2000
Earlier hybrid multilayer perceptron (MLP)/hidden Markov model (HMM) continuous speech recognition systems have not modeled context-dependent phonetic effects, sequences of distributions for phonetic models, or gender-based speech consistencies. In this paper we present a new MLP architecture and training procedure for modeling context-dependent phonetic classes with a sequence of distributions. A new training procedure that...
The main goal of this research is to find possible ways to build hybrid systems, based on neural networks (NNs) and hidden Markov models (HMMs), for the task of automatic speech recognition. The investigation covers different types of neural network and hidden Markov models, and their combination into hybrid models. The neural networks used were basically MLP and radial basis function models. The hidden Markov models were basically different combinations of states and mixtures of the continuous-density type of the Bakis model. A reduced set of ten words spoken in Brazilian Portuguese was carefully chosen to provide some pronunciation and phonetic confusion. The results obtained so far are very positive, pointing toward a high potential for such hybrid models.
The Application of Hidden Markov Models in Speech Recognition
Foundations and TrendsĀ® in Signal Processing, 2007
Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.
Hybrid HMM/ANN Systems for Speaker Independent Continuous Speech Recognition in French
In this paper we report a series of tests carried out on our hybrid HMM/ANN systems, which aim at combining neural network theory and hidden Markov models (HMMs) for speech recognition on a continuous-speech French database: BREF-80. As this database is not manually labelled, we describe a new method based on the temporal alignment of the speech signal on a high-quality synthetic speech pattern to generate a first segmentation in order to bootstrap the training procedure. A phone recognition experiment with our baseline system achieved a phone accuracy of about 63%, which is, to our knowledge, the best result reported in the literature. Preliminary experiments on continuous speech recognition have set a baseline performance for our hybrid HMM/ANN system on BREF using 1K, 3K and 13K word lexicons. All the experiments were carried out with the STRUT (Speech Training and Recognition Unified Toolkit) software [10]. I. Introduction Significant advances have been made in recent years in...
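The bootstrap segmentation described above rests on temporally aligning the signal against a synthetic reference. A minimal dynamic-time-warping alignment, illustrative only and far simpler than what the authors' system would use, could look like this; the one-dimensional "features" are invented for the example.

```python
import numpy as np

def dtw_align(ref, sig):
    """Minimal DTW: returns a monotone path of (ref_frame, sig_frame) pairs."""
    n, m = len(ref), len(sig)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = float(np.sum((ref[i - 1] - sig[j - 1]) ** 2))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    i, j, path = n, m, []
    while i > 0 and j > 0:                      # backtrack along cheapest moves
        path.append((i - 1, j - 1))
        k = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# identical sequences align along the diagonal
ref = np.array([[0.0], [1.0], [2.0]])
alignment = dtw_align(ref, ref)
print(alignment)   # -> [(0, 0), (1, 1), (2, 2)]
```

Since the phone boundaries of the synthetic pattern are known by construction, carrying them through such an alignment yields the first segmentation of the real signal that bootstraps training.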