Combination of acoustic models in continuous speech recognition hybrid systems

A General Method for Combining Acoustic Features in an Automatic Speech Recognition System

2006

A general method for using different types of features in Automatic Speech Recognition (ASR) systems is presented. A Gaussian mixture model (GMM) is obtained in a reference acoustic space. A specific feature combination or selection is associated with each Gaussian of the mixture and used for computing symbol posterior probabilities. Symbols can refer to phonemes, phonemes in context, or states of a Hidden Markov Model (HMM). Experimental results are presented for applications to phoneme and word rescoring after verification. Two corpora were used: one with small vocabularies in Italian and Spanish, and one with a very large vocabulary in French.
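
To make the per-Gaussian feature-combination idea concrete, here is a minimal Python sketch: a GMM is fitted in a reference feature space, each Gaussian is paired with its own feature combination and classifier, and symbol posteriors are responsibility-weighted averages over the Gaussians. The toy data, the particular combinations, and the logistic-regression "experts" are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(2000, 13))       # reference-space features (e.g. MFCCs)
X_alt = rng.normal(size=(2000, 13))       # a second feature stream
y = rng.integers(0, 5, size=2000)         # symbol labels (phonemes / HMM states)

# one (illustrative) feature combination per Gaussian of the mixture
combos = [lambda a, b: a,
          lambda a, b: b,
          lambda a, b: np.hstack([a, b]),
          lambda a, b: a + b]
gmm = GaussianMixture(n_components=len(combos), random_state=0).fit(X_ref)
experts = [LogisticRegression(max_iter=1000).fit(c(X_ref, X_alt), y) for c in combos]

def symbol_posteriors(a, b):
    """P(symbol | x) = sum_g P(g | x_ref) * P(symbol | combo_g(x))."""
    resp = gmm.predict_proba(a)            # Gaussian responsibilities, (N, n_components)
    return sum(resp[:, [g]] * experts[g].predict_proba(combos[g](a, b))
               for g in range(len(combos)))

probs = symbol_posteriors(X_ref[:3], X_alt[:3])   # (3, 5) symbol posteriors
```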

Comparison and combination of features in a hybrid HMM/MLP and a HMM/GMM speech recognition system

IEEE Transactions on Speech and Audio Processing, 2005

Recently, the advantages of the spectral parameters obtained by frequency filtering (FF) of the logarithmic filter-bank energies (logFBEs) have been reported. These parameters, which are frequency derivatives of the logFBEs, lie in the frequency domain and have shown good recognition performance with respect to the conventional MFCCs for HMM systems. In this paper, the FF features are first compared with the MFCCs and the Rasta-PLP features using both a hybrid HMM/MLP and a conventional HMM/GMM recognition system, for both clean and noisy speech. Taking advantage of the ability of the hybrid system to deal with correlated features, the inclusion of both the frequency second-derivatives and the raw logFBEs as additional features is proposed and tested. Moreover, the robustness of these features in noisy conditions is enhanced by combining the FF technique with the Rasta temporal filtering approach. Finally, a study of the FF features in the framework of multi-stream processing is presented. The best recognition results for both clean and noisy speech are obtained from the multi-stream combination of the J-Rasta-PLP features and the FF features.
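
As a rough illustration of frequency filtering, the sketch below applies the commonly cited first-order filter H(z) = z - z^-1 along the frequency axis of a logFBE matrix; the band count, the edge handling, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def frequency_filter(logfbe):
    """Apply H(z) = z - z^-1 along the frequency axis.

    logfbe: (n_frames, n_bands) log filter-bank energies.
    Returns FF features of the same shape (edges handled by replication).
    """
    padded = np.pad(logfbe, ((0, 0), (1, 1)), mode="edge")
    # output[k] = logfbe[k+1] - logfbe[k-1], a slope across adjacent bands
    return padded[:, 2:] - padded[:, :-2]

logfbe = np.log(np.random.rand(100, 20) + 1e-6)   # 100 frames, 20 bands
ff = frequency_filter(logfbe)                     # frequency first-derivatives
second_deriv = frequency_filter(ff)               # frequency second-derivatives
```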

Combining Neural Networks And Hidden Markov Models For Continuous Speech Recognition

1992

We present a speaker-independent, continuous-speech recognition system based on a hybrid multilayer perceptron (MLP)/hidden Markov model (HMM). The system combines the advantages of both approaches by using MLPs to estimate the state-dependent observation probabilities of an HMM. New MLP architectures and training procedures are presented that allow the modeling of multiple distributions for phonetic classes and context-dependent phonetic classes. Comparisons with a pure HMM system...
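
In hybrid systems of this kind, the MLP's state posteriors are typically converted into scaled likelihoods for decoding by dividing by the state priors (Bayes' rule); the sketch below shows that conversion, with shapes and the log-domain handling chosen for illustration.

```python
import numpy as np

def scaled_log_likelihoods(mlp_log_posteriors, log_state_priors):
    """log p(x|q) up to a constant: log P(q|x) - log P(q), per Bayes' rule."""
    return mlp_log_posteriors - log_state_priors[None, :]

# toy usage: 3 frames, 4 HMM states
log_post = np.log(np.full((3, 4), 0.25))                  # MLP outputs P(q|x)
log_priors = np.log(np.array([0.4, 0.3, 0.2, 0.1]))       # state priors from alignments
emissions = scaled_log_likelihoods(log_post, log_priors)  # fed to the HMM decoder
```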

Hybrid Neural Network/hidden Markov Model Continuous-Speech Recognition

1992

In this paper we present a hybrid multilayer perceptron (MLP)/hidden Markov model (HMM) speaker-independent continuous-speech recognition system, in which the advantages of both approaches are combined by using MLPs to estimate the state-dependent observation probabilities of an HMM. New MLP architectures and training procedures are presented which allow the modeling of multiple distributions for phonetic classes and context-dependent phonetic classes. Comparisons with a pure HMM system...

Mixture of Support Vector Machines for HMM based Speech Recognition

18th International Conference on Pattern Recognition (ICPR'06), 2006

State-of-the-art speech recognizers combine Hidden Markov Models (HMMs), which represent the temporal dynamics of speech very efficiently, and Gaussian mixture models, which perform the classification of speech into single speech units (phonemes) sub-optimally. In this paper we use parallel mixtures of Support Vector Machines (SVMs) for classification by integrating this method into an HMM-based speech recognition system. SVMs are very appealing due to their association with statistical learning theory and have already shown good results in pattern recognition and in continuous speech recognition. They suffer, however, from the training effort, which scales at least quadratically with the number of training vectors. The SVM mixtures need only nearly linear training time, making it easier to deal with the large amount of speech data. In our hybrid system we use the SVM mixtures as acoustic models in an HMM-based decoder. We train and test the hybrid system on the DARPA Resource Management (RM1) corpus, showing better performance than an HMM-based decoder using Gaussian mixtures.
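
A sketch of one standard divide-and-conquer recipe for SVM mixtures: cluster the acoustic vectors, train one SVM per cluster (near-linear total cost, since each SVM only sees its subset), and gate test frames to their cluster's expert. The cluster count, gating choice, and toy data are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 39))              # e.g. MFCC + delta frames
y = rng.integers(0, 10, size=3000)           # phoneme-state labels

gate = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X)
experts = [SVC().fit(X[gate.labels_ == k], y[gate.labels_ == k])
           for k in range(4)]                # each SVM trains on its own subset

def predict(X_test):
    clusters = gate.predict(X_test)          # route each frame to its expert
    return np.array([experts[c].predict(x.reshape(1, -1))[0]
                     for c, x in zip(clusters, X_test)])

labels = predict(X[:5])
```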

Investigations on features for log-linear acoustic models in continuous speech recognition

2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009

Hidden Markov Models with Gaussian Mixture Models as emission probabilities (GHMMs) are the underlying structure of all state-of-the-art speech recognition systems. Using Gaussian mixture distributions follows the generative approach, where the class-conditional probability is modeled, although for classification only the posterior probability is needed. Although very successful in related fields such as Natural Language Processing (NLP), direct modeling of posterior probabilities with log-linear models has rarely been used in speech recognition and has not been applied successfully to continuous speech recognition. In this paper we report competitive results for a speech recognizer with a log-linear acoustic model on the Wall Street Journal corpus, a Large Vocabulary Continuous Speech Recognition (LVCSR) task. We trained this model from scratch, i.e. without relying on an existing GHMM system. Previously, the use of data-dependent sparse features for log-linear models has been proposed. We compare them with polynomial features and show that the combination of polynomial and data-dependent sparse features leads to better results.
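
A minimal sketch of a log-linear acoustic model in softmax form, P(s | x) = exp(lambda_s . f(x)) / Z(x), here with second-order polynomial features f(x); the feature choice and dimensions are illustrative, not the paper's exact setup.

```python
import numpy as np

def polynomial_features(x):
    """f(x) = [1, x, upper triangle of x x^T] for one frame."""
    quad = np.outer(x, x)[np.triu_indices(x.size)]
    return np.concatenate(([1.0], x, quad))

def log_linear_posteriors(x, lambdas):
    """Softmax over states; lambdas: (n_states, n_features) weights."""
    scores = lambdas @ polynomial_features(x)
    scores -= scores.max()                      # numerical stability
    p = np.exp(scores)
    return p / p.sum()

x = np.random.randn(13)                         # one MFCC frame
n_feat = 1 + 13 + 13 * 14 // 2                  # bias + linear + quadratic terms
lambdas = np.random.randn(6, n_feat) * 0.01     # 6 states, untrained toy weights
posteriors = log_linear_posteriors(x, lambdas)
```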

The efficient incorporation of MLP features into automatic speech recognition systems

Computer Speech & Language, 2011

In recent years, the use of Multi-Layer Perceptron (MLP) derived acoustic features has become increasingly popular in automatic speech recognition systems. These features are typically used in combination with standard short-term spectral-based features, and have been found to yield consistent performance improvements. However, there are a number of design decisions and issues associated with the use of MLP features for state-of-the-art speech recognition systems. Two modifications to the standard training/adaptation procedures are described in this work. First, the paper examines how MLP features, and the associated acoustic models, can be trained efficiently on large training corpora using discriminative training techniques. An approach that combines multiple individual MLPs is proposed, and this reduces the time needed to train MLPs on large amounts of data. In addition, to further speed up discriminative training, a lattice re-use method is proposed. The paper also examines how systems with MLP features can be adapted to particular speakers or acoustic environments. In contrast to previous work (where standard HMM adaptation schemes are used), linear input network adaptation is investigated. System performance is investigated within a multi-pass adaptation/combination framework. This allows the performance gains of individual techniques to be evaluated at various stages, as well as their impact in combination with other sub-systems. All the approaches considered in this paper are evaluated on an Arabic large vocabulary speech recognition task which includes both Broadcast News and Broadcast Conversation test data.
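
A hedged sketch of linear input network (LIN) adaptation: a trainable linear layer, initialised near identity, is placed in front of a frozen speaker-independent MLP and estimated on the adaptation data alone. Layer sizes and the toy training loop are assumptions for illustration.

```python
import torch
import torch.nn as nn

feat_dim, n_states = 39, 120
si_mlp = nn.Sequential(nn.Linear(feat_dim, 500), nn.Sigmoid(),
                       nn.Linear(500, n_states))         # speaker-independent MLP
for p in si_mlp.parameters():
    p.requires_grad = False                              # frozen during adaptation

lin = nn.Linear(feat_dim, feat_dim)                      # the LIN
with torch.no_grad():
    lin.weight.copy_(torch.eye(feat_dim))                # start near identity
    lin.bias.zero_()

opt = torch.optim.SGD(lin.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(256, feat_dim)                           # adaptation frames
t = torch.randint(0, n_states, (256,))                   # state targets

for _ in range(10):                                      # a few adaptation epochs
    opt.zero_grad()
    loss = loss_fn(si_mlp(lin(x)), t)                    # only the LIN is updated
    loss.backward()
    opt.step()
```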

Acoustic Feature Combination for Robust Speech Recognition

Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005

In this paper, we consider the use of multiple acoustic features of the speech signal for robust speech recognition. We investigate the combination of various auditory-based features (Mel Frequency Cepstrum Coefficients, Perceptual Linear Prediction, etc.) and an articulatory-based feature (voicedness). Features are combined by a Linear Discriminant Analysis (LDA) based technique and by a log-linear model combination based technique. We describe the two feature combination techniques and compare the experimental results. Experiments performed on the large-vocabulary task VerbMobil II (German conversational speech) show that the accuracy of automatic speech recognition systems can be improved by the combination of different acoustic features.
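
A sketch of the LDA-based combination under common assumptions: the auditory and articulatory feature streams are concatenated frame by frame and an LDA transform projects the joint vector to a decorrelated, lower-dimensional space. Stream dimensions, class count, and data are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(5000, 16))         # auditory stream 1
plp = rng.normal(size=(5000, 12))          # auditory stream 2
voicedness = rng.normal(size=(5000, 1))    # articulatory stream
y = rng.integers(0, 40, size=5000)         # state/phoneme labels

X = np.hstack([mfcc, plp, voicedness])     # frame-wise concatenation
lda = LinearDiscriminantAnalysis(n_components=24).fit(X, y)
combined = lda.transform(X)                # (5000, 24) combined features
```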

Study of algorithms to combine multiple automatic speech recognition (ASR) system outputs

Automatic Speech Recognition (ASR) systems recognize word sequences by employing algorithms such as Hidden Markov Models. Given the same speech to recognize, different ASRs may output very similar results but with errors such as insertions, substitutions or deletions of words. Since different ASRs may be based on different algorithms, it is likely that error segments across ASRs are uncorrelated. It may therefore be possible to improve recognition accuracy by combining the hypotheses of multiple ASRs. System combination is a technique that combines the outputs of two or more ASRs to estimate the most likely hypothesis among conflicting word pairs or differing hypotheses for the same part of an utterance. In this thesis, a conventional voting scheme called Recognizer Output Voting Error Reduction (ROVER) is studied. A weighted voting scheme based on Bayesian theory known as Bayesian Combination (BAYCOM) is implemented. BAYCOM is derived fr...
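
A toy sketch of ROVER-style voting, assuming the hypotheses are already aligned word by word (the real ROVER builds a word transition network via dynamic-programming alignment); the confidence-free majority vote below is the simplest scoring variant.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """aligned_hyps: equal-length word lists; '' marks a deletion slot."""
    combined = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]   # majority vote per slot
        if word:                                    # skip slots where deletion wins
            combined.append(word)
    return combined

hyps = [["the", "cat", "sat", ""],
        ["the", "bat", "sat", "down"],
        ["the", "cat", "", "down"]]
print(rover_vote(hyps))                             # ['the', 'cat', 'sat', 'down']
```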

Context-dependent connectionist probability estimation in a hybrid hidden Markov model-neural net speech recognition system

Computer Speech & Language, 1994

In this paper we present a training method and a network architecture for estimating context-dependent observation probabilities in the framework of a hybrid hidden Markov model (HMM) / multilayer perceptron (MLP) speaker-independent continuous speech recognition system. The context-dependent modeling approach we present here computes the HMM context-dependent observation probabilities using a Bayesian factorization in terms of context-conditioned posterior phone probabilities, which are computed with a set of MLPs, one for every relevant context. The proposed network architecture shares the input-to-hidden layer among the set of context-dependent MLPs in order to reduce the number of independent parameters. Multiple states for phone models, with different context dependence for each state, are used to model the different context effects at the beginning and end of phonetic segments. A new training procedure that "smooths" networks with different degrees of context dependence is proposed to obtain a robust estimate of the context-dependent probabilities. We have used this new architecture to model generalized biphone phonetic contexts. Tests with the speaker-independent DARPA Resource Management database have shown average reductions in word error rates of 28% using a word-pair grammar, compared to our earlier context-independent HMM/MLP hybrid.
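
A minimal sketch of the Bayesian factorization the abstract describes, P(q, c | x) = P(c | q, x) * P(q | x), with one context-conditioned distribution per phone; stand-in probability tables replace the trained MLPs, and all numbers are illustrative.

```python
import numpy as np

n_phones, n_contexts = 4, 3
p_q_given_x = np.array([0.7, 0.1, 0.1, 0.1])       # context-independent MLP output P(q|x)

# one context-conditioned distribution P(c | q, x) per phone
p_c_given_qx = np.full((n_phones, n_contexts), 1.0 / n_contexts)
p_c_given_qx[0] = [0.6, 0.3, 0.1]                  # e.g. phone 0's context MLP output

# joint context-dependent posterior via the factorization
p_qc_given_x = p_q_given_x[:, None] * p_c_given_qx  # (n_phones, n_contexts)
assert np.isclose(p_qc_given_x.sum(), 1.0)          # still a valid distribution
```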