Continuous speech recognition using support vector machines

Hidden Markov models (HMMs) with Gaussian mixture observation densities are the dominant approach in speech recognition. These systems typically rely on a representational model, trained with expectation maximization and decoded under a maximum likelihood criterion. Though powerful, this paradigm is prone to overfitting and does not directly incorporate discriminative information. We propose a new paradigm centered on the principle of structural risk minimization, using a discriminative framework for speech recognition based on support vector machines (SVMs). SVMs are a family of discriminative classifiers that provide significant advantages over other discriminatively trained classifiers. Chief among these advantages is the ability to simultaneously optimize the representational and discriminative ability of the acoustic classifier, a necessity for acoustic units such as phonemes, which overlap heavily in the feature space. As a proof of concept, we present an SVM-based large vocabulary speech recognition system. This system achieves a state-of-the-art word error rate of 10.6% on a continuous alphadigit task. Through the introduction of this system, we provide insight into the many issues one faces when moving from an HMM framework to an SVM framework. These include the application of temporal constraints to the static support vector classifier, the generation of a posterior probability from the binary support vector classifier, and balancing the need for a robust training set against pragmatic efficiency concerns. We conclude with a discussion of open research issues that are crucial to the successful application of SVMs in speech recognition.
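One of the issues noted above, obtaining a posterior probability from a binary SVM, is commonly addressed by fitting a Platt-style sigmoid to the classifier's signed distance from the separating hyperplane. The sketch below illustrates that general idea on synthetic data; the scikit-learn calls, parameter values, and toy features are assumptions for illustration only, not the configuration of the system described in the paper.

```python
# Illustrative sketch: mapping binary SVM margins to posterior-like
# scores with a sigmoid (Platt scaling). Names and parameters are
# hypothetical, not taken from the paper's recognizer.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Toy two-class data standing in for segment-level acoustic features
# of two overlapping phonetic classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 10)),
               rng.normal(0.7, 1.0, (200, 10))])
y = np.array([0] * 200 + [1] * 200)

# Train a binary SVM; its raw output is a signed distance to the
# separating hyperplane, not a probability.
svm = SVC(kernel="rbf", C=10.0, gamma="scale")
svm.fit(X, y)
margins = svm.decision_function(X).reshape(-1, 1)

# Fit a sigmoid to map margins to posterior estimates. In practice the
# sigmoid should be fit on held-out data to avoid biased estimates.
calibrator = LogisticRegression()
calibrator.fit(margins, y)
posteriors = calibrator.predict_proba(margins)[:, 1]
print(posteriors[:5])
```

The resulting posterior estimates can then be combined with temporal constraints (e.g., a hypothesis rescoring pass) in place of the likelihoods an HMM system would normally supply; the specifics of that integration are discussed in the body of the paper.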