A simplified Subspace Gaussian Mixture to compact acoustic models for speech recognition

The subspace Gaussian mixture model - A structured model for speech recognition

Computer Speech & Language, 2011

We describe a new approach to speech recognition, in which all Hidden Markov Model (HMM) states share the same Gaussian Mixture Model (GMM) structure, with the same number of Gaussians in each state. The model is defined by a vector associated with each state, with a dimension of, say, 50, together with a global mapping from this vector space to the space of GMM parameters. This model appears to give better results than a conventional model, and the extra structure offers many new opportunities for modeling innovations while maintaining compatibility with most standard techniques.

Subspace Gaussian Mixture Models for speech recognition

2010

This technical report contains the details of an acoustic modeling approach based on subspace adaptation of a shared Gaussian Mixture Model. This refers to adaptation to a particular speech state; it is not a speaker adaptation technique, although we do later introduce a speaker adaptation technique that is tied to this particular framework. Our model is a large shared GMM whose parameters vary in a subspace of relatively low dimension (e.g. 50); each state is thus described by a low-dimensional vector which controls the GMM's means and mixture weights in a manner determined by globally shared parameters. In addition, we generalize to having each speech state be a mixture of substates, each with a different vector. Only the mathematical details are provided here; experimental results are being published separately.
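The mapping the abstract describes can be sketched in a few lines: a low-dimensional state vector is projected by globally shared matrices into GMM means, and through a log-linear (softmax) map into mixture weights. The sizes and random parameter values below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

I, S, D = 4, 50, 39   # shared Gaussians, subspace dim, feature dim (illustrative)

# Globally shared parameters (hypothetical random values for illustration):
M = rng.standard_normal((I, D, S))   # mean-projection matrix per shared Gaussian
w = rng.standard_normal((I, S))      # weight-projection vectors

def state_gmm(v):
    """Derive one state's GMM means and mixture weights from its vector v."""
    means = np.einsum('ids,s->id', M, v)   # mean of Gaussian i: M_i @ v
    logits = w @ v                         # log-linear weight model
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()               # softmax over the I shared Gaussians
    return means, weights

v_state = rng.standard_normal(S)           # the per-state vector of dimension ~50
means, weights = state_gmm(v_state)
```

Note how few state-specific parameters are needed (here 50 numbers per state) compared with storing full per-state means and weights.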

Acoustic modeling for under-resourced languages based on vectorial HMM-states representation using Subspace Gaussian Mixture Models

2012 IEEE Spoken Language Technology Workshop (SLT), 2012

This paper explores a novel method for building context-dependent models in automatic speech recognition (ASR), in the context of under-resourced languages. We present a simple way to realize a state-tying approach, based on a new vectorial representation of the HMM states. This representation is a low-dimensional parameter vector obtained via the Subspace Gaussian Mixture Model (SGMM) paradigm. The proposed method requires neither phonetic knowledge nor a large amount of data, which are the major obstacles for acoustic modeling of under-resourced languages. This paper shows how this representation can be obtained and used for tying states. Our experiments on Vietnamese show that this approach achieves a consistent gain over the classical approach based on decision trees. Furthermore, the method appears to be portable to other languages, as shown in a preliminary study conducted on Berber.
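Since each HMM state is summarized by a low-dimensional SGMM vector, states can be tied by simply clustering those vectors instead of growing a phonetic decision tree. A minimal k-means sketch (the data and cluster count are illustrative, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def tie_states(vectors, n_tied, iters=20):
    """Cluster low-dimensional state vectors with k-means; states falling in
    the same cluster share one tied acoustic model (no decision tree needed)."""
    centroids = vectors[rng.choice(len(vectors), n_tied, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_tied):
            if np.any(assign == k):
                centroids[k] = vectors[assign == k].mean(axis=0)
    return assign

# 200 hypothetical context-dependent states in a 50-dim SGMM vector space
states = rng.standard_normal((200, 50))
tying = tie_states(states, n_tied=40)   # tying[j] = tied-model id for state j
```

The appeal for under-resourced languages is that this requires no phonetic question set, only the state vectors already estimated by SGMM training.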

Multilingual acoustic modeling for speech recognition based on subspace Gaussian Mixture Models

2010

Although research has previously been done on multilingual speech recognition, it has been found very difficult to improve over separately trained systems. The usual approach has been to use some kind of "universal phone set" that covers multiple languages. We report experiments on a different approach to multilingual speech recognition, in which the phone sets are entirely distinct but the parameters of the model that are not tied to specific states are shared across languages. We use a model called a "Subspace Gaussian Mixture Model", where states' distributions are Gaussian Mixture Models with a common structure, constrained to lie in a subspace of the total parameter space. The parameters that define this subspace can be shared across languages. We obtain substantial WER improvements with this approach, especially with very small amounts of in-language training data.

Subspace Gaussian mixture model with state-dependent subspace dimensions

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014

In recent years, under the hidden Markov modeling (HMM) framework, the use of subspace Gaussian mixture models (SGMMs) has demonstrated better recognition performance than traditional Gaussian mixture models (GMMs) in automatic speech recognition. In the state-of-the-art SGMM formulation, a fixed subspace dimension is assigned to every phone state. While a constant subspace dimension is easier to implement, it may lead to overfitting or underfitting of some state models, as the data is usually distributed unevenly among the states. In a later extension of SGMM, states are split into sub-states with an appropriate objective function, easing the problem by increasing the state-specific parameters of underfitted states. In this paper, we propose another solution: we allow each sub-state to have a different subspace dimension depending on its number of training frames, so that the state-specific parameters can be robustly estimated. Experimental evaluation on the Switchboard recognition task shows that our proposed method improves on the existing SGMM training procedure.
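The core idea, a sub-state's subspace dimension growing with its amount of training data, can be sketched with a simple clamped rule. The thresholds below are invented for illustration; the paper's actual dimension-selection criterion differs.

```python
def subspace_dim(n_frames, d_min=10, d_max=50, frames_per_dim=200):
    """Pick a sub-state's subspace dimension from its training-frame count.

    Illustrative rule: one dimension per `frames_per_dim` frames, clamped to
    [d_min, d_max] so sparse sub-states get a small, robustly estimable
    vector and data-rich sub-states get the full subspace.
    """
    return max(d_min, min(d_max, n_frames // frames_per_dim))
```

A sub-state seen in only 100 frames would get the 10-dimensional floor, while one with 100,000 frames would use the full 50 dimensions.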

Comparing computation in Gaussian mixture and neural network based large-vocabulary speech recognition

In this paper we look at real-time computing issues in large-vocabulary speech recognition. We use the French broadcast audio transcription task from ETAPE 2011 for this evaluation. We compare word error rate (WER) versus overall computing time for hidden Markov models with Gaussian mixtures (GMM-HMM) and with deep neural networks (DNN-HMM). We show that for similar computational cost during recognition, the DNN-HMM combination is superior to the GMM-HMM. For a real-time computing scenario, the error rate on the ETAPE dev set is 23.5% for the DNN-HMM versus 27.9% for the GMM-HMM: a significant difference in accuracy for comparable computation. Rescoring lattices (generated by the DNN-HMM acoustic model) with a quadgram language model (LM), and then with a neural net LM, reduces the WER to 22.0% while still providing real-time computing.

Discriminative Estimation of Subspace Constrained Gaussian Mixture Models for Speech Recognition

IEEE Transactions on Audio, Speech and Language Processing, 2000

In this paper we study discriminative training of acoustic models for speech recognition under two criteria: maximum mutual information (MMI) and a novel "error weighted" training technique. We present a proof that the standard MMI training technique is valid for a very general class of acoustic models with any kind of parameter tying. We report experimental results for subspace constrained Gaussian mixture models (SCGMMs), where the exponential model weights of all Gaussians are required to belong to a common "tied" subspace, as well as for Subspace Precision and Mean (SPAM) models, which impose separate subspace constraints on the precision matrices (i.e. inverse covariance matrices) and means. It has been shown previously that SCGMMs and SPAM models generalize, and yield significant error rate improvements over, previously considered model classes such as diagonal models, models with semi-tied covariances, and EMLLT (extended maximum likelihood linear transformation) models. We show here that MMI and error weighted training each individually result in over 20% relative reduction in word error rate on a digit task compared with maximum likelihood (ML) training. We also show that a gain of as much as 28% relative can be achieved by combining these two discriminative estimation techniques.
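For readers unfamiliar with the MMI criterion the abstract refers to, a toy per-utterance form can be written directly: the log-likelihood of the reference transcription minus the log-sum over competing hypotheses. This is a simplified sketch; real systems sum over a lattice and weight paths by the language model.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mmi_utterance(ref_loglik, hyp_logliks):
    """Toy MMI objective for one utterance: numerator (reference) minus the
    denominator summed over all competing hypotheses. Maximizing this pushes
    probability mass toward the correct transcription."""
    return ref_loglik - logsumexp(hyp_logliks)
```

When the reference dominates the hypothesis set, the objective approaches zero from below; strong competitors make it more negative, which is what training then reduces.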

Gaussian mixture selection using context-independent HMM

2001

We present a method to efficiently select Gaussian mixtures for fast acoustic likelihood computation. It makes use of context-independent models for selection and back-off of the corresponding triphone models. Specifically, for the k-best phone models found by a preliminary evaluation, triphone models of higher resolution are applied, and the others are assigned likelihoods from the monophone models. This selection scheme assigns more reliable back-off likelihoods to the unselected states than conventional Gaussian selection based on a VQ codebook. It can also incorporate efficient Gaussian pruning at the preliminary evaluation, which offsets the increased size of the pre-selection model. Experimental results show that the proposed method achieves performance comparable to standard Gaussian selection, and performs much better under aggressive pruning conditions. Together with phonetic tied-mixture (PTM) modeling, the acoustic matching cost is reduced to almost 14% with little loss of accuracy.
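The two-pass scheme the abstract describes can be sketched as: score the cheap monophone models first, fully score triphones only for the k best phones, and back off every other triphone to its monophone likelihood. The function and model names here are illustrative placeholders.

```python
def select_and_score(frame, mono_score, tri_score, tri_of_phone, k=8):
    """Two-pass likelihood computation with monophone back-off.

    mono_score(p, frame)  -> monophone log-likelihood (cheap, preliminary pass)
    tri_score(t, frame)   -> triphone log-likelihood (expensive, second pass)
    tri_of_phone          -> mapping phone -> list of its triphone models
    """
    phones = list(tri_of_phone)
    mono = {p: mono_score(p, frame) for p in phones}
    kbest = set(sorted(phones, key=lambda p: mono[p], reverse=True)[:k])
    scores = {}
    for p in phones:
        for t in tri_of_phone[p]:
            # Full evaluation for promising phones, monophone back-off otherwise
            scores[t] = tri_score(t, frame) if p in kbest else mono[p]
    return scores

# Toy usage with hypothetical scoring functions:
mono = lambda p, frame: {'a': -1.0, 'b': -5.0}[p]
tri = lambda t, frame: -0.5
scores = select_and_score(None, mono, tri, {'a': ['a-x+y'], 'b': ['b-x+y']}, k=1)
```

The back-off value is more trustworthy than an arbitrary floor because it is a genuine (if coarse) likelihood of the same phone.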

Special Section on Statistical Modeling for Speech Processing

IEICE Transactions on Information and Systems, 2006

In recent years, the number of studies investigating new directions in speech modeling that go beyond the conventional HMM has increased considerably. One promising approach is to use Bayesian Networks (BN) as speech models. Full recognition systems based on Dynamic BNs, as well as acoustic models using BNs, have been proposed lately. Our group at ATR has been developing a hybrid HMM/BN model, which is an HMM where the state probability distribution is modeled by a BN instead of the commonly used mixtures of Gaussian functions. In this paper, we describe how to use hybrid HMM/BN acoustic models, emphasizing some design and implementation issues. The most essential part of HMM/BN model building is the choice of the state BN topology. As it is chosen manually, several factors should be considered in this process, including, but not limited to, the type of data, the task, and the available additional information. When context-dependent models are used, the state-level structure can be obtained by traditional methods. HMM/BN parameter learning is based on the Viterbi training paradigm and consists of two alternating steps: BN training and HMM transition updates. For recognition, in some cases, BN inference is computationally equivalent to a mixture of Gaussians, which allows the HMM/BN model to be used in existing decoders without any modification. We present two examples of HMM/BN model applications in speech recognition systems. Evaluations under various conditions and for different tasks showed that the HMM/BN model gives consistently better performance than the conventional HMM.

Gaussian mixture optimization for HMM based on efficient cross-validation

Interspeech 2007, 2007

A Gaussian mixture optimization method is explored, using cross-validation likelihood as an objective function instead of the conventional training-set likelihood. The optimization reduces the number of mixture components by selecting and merging a pair of Gaussians step by step, based on the objective function, so as to remove redundant components and improve the generality of the model. Cross-validation likelihood is more appropriate for avoiding over-fitting than the conventional likelihood, and can be efficiently computed using sufficient statistics. It results in better Gaussian pair selection and provides a termination criterion that does not rely on empirical thresholds. Large-vocabulary speech recognition experiments on oral presentations show that the cross-validation method gives a smaller word error rate, with an automatically determined model size, than a baseline training procedure that does not perform the optimization.
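The elementary operation in this procedure, merging a pair of Gaussians into one by moment matching, can be sketched for diagonal-covariance components as below. The pair-selection and stopping criteria in the paper use cross-validation likelihood computed from sufficient statistics; this sketch shows only the merge itself.

```python
import numpy as np

def merge(g1, g2):
    """Merge two diagonal Gaussians (weight, mean, var) by moment matching,
    preserving the total weight and the first two moments of the pair."""
    w1, m1, v1 = g1
    w2, m2, v2 = g2
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    # Second moment of the mixture of the pair, minus the new mean squared
    v = (w1 * (v1 + m1**2) + w2 * (v2 + m2**2)) / w - m**2
    return w, m, v

# Merging two unit-variance Gaussians at 0 and 2 yields mean 1, variance 2:
g_merged = merge((0.5, np.array([0.0]), np.array([1.0])),
                 (0.5, np.array([2.0]), np.array([1.0])))
```

At each step, the pair whose merge costs the least cross-validation likelihood is merged; merging stops when every candidate merge would decrease that likelihood.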