Flexible multi-stream framework for speech recognition using multi-tape finite-state transducers

Robust bi-modal speech recognition based on state synchronous modeling and stream weight optimization

IEEE International Conference on Acoustics Speech and Signal Processing, 2002

Demand has grown recently for Automatic Speech Recognition (ASR) systems that operate robustly in acoustically noisy environments. This paper proposes a method to effectively integrate audio and visual information in audio-visual (bi-modal) ASR systems. Such integration necessitates modeling the synchronization of the audio and visual information. To address the time lag and correlation problems in individual features between speech and lip movements, we introduce a type of integrated HMM modeling of audio-visual information based on a family of HMM composition. The proposed model can represent state synchronicity not only within a phoneme but also between phonemes. Furthermore, we propose a rapid stream weight optimization based on the GPD algorithm for noisy bi-modal speech recognition. Evaluation experiments show that the proposed method improves the recognition accuracy for noisy speech. At SNR = 0 dB, our proposed method attained 16% higher performance than product HMMs without the synchronicity re-estimation.
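As a rough illustration of what a stream weight does in a bi-modal recognizer, the sketch below combines per-state audio and visual log-likelihoods with a single weight. It assumes the common convention that the two weights sum to one; the function names and the toy scores are hypothetical, and the GPD-based weight optimization described in the abstract is not reproduced here.

```python
def combined_log_likelihood(audio_ll, visual_ll, audio_weight):
    """Weighted sum of stream log-likelihoods for one state and one frame.

    Assumption: the visual weight is 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_ll + (1.0 - audio_weight) * visual_ll


def best_state(audio_lls, visual_lls, audio_weight):
    """Index of the state with the highest combined score for one frame."""
    scores = [
        combined_log_likelihood(a, v, audio_weight)
        for a, v in zip(audio_lls, visual_lls)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores for two states: the audio stream favors state 0,
# the visual stream favors state 1.
audio_lls = [-2.0, -6.0]
visual_lls = [-8.0, -1.0]
```

Lowering `audio_weight`, as one might when the acoustic channel is noisy, shifts the decision toward the state preferred by the visual stream.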

Multi-stream parameterization for structural speech recognition

2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008

Recently, a novel structural representation of speech was proposed [1, 2], in which the inevitable acoustic variations caused by nonlinguistic factors are effectively removed from speech. This structural representation captures only microphone- and speaker-invariant speech contrasts or dynamics and does not directly use absolute or static acoustic properties such as spectra. In our previous study, the new representation was applied to recognizing a sequence of isolated vowels [3]. The structural models trained with a single speaker outperformed the conventional HMMs trained with more than four thousand speakers, even in the case of noisy speech. We also applied the new models to recognizing utterances of connected vowels [4]. In the current paper, a multiple stream structuralization method is proposed to improve the performance of the structural recognition framework. With only 8 training speakers, the proposed method shows performance comparable to that of the conventional 4,130-speaker triphone-based HMMs.

Hierarchical Multi-stream Posterior Based Speech Recognition System

Lecture Notes in Computer Science, 2006

In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on "state gamma posterior" definition (typically used in standard HMMs training) extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvement, compared to the state-of-the-art Tandem systems.
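For readers unfamiliar with the "state gamma posterior", the following is a minimal sketch of the standard single-stream forward-backward computation it refers to. The multi-stream extension from the paper is not reproduced; the function name and the input layout (`emis[t][j]` as the likelihood of frame `t` in state `j`) are assumptions made for illustration, and no scaling is applied, so it is only suitable for short sequences.

```python
def state_gammas(init, trans, emis):
    """Return gamma[t][j] = P(state at time t is j | all observations).

    init[j]     : initial probability of state j
    trans[i][j] : transition probability from state i to state j
    emis[t][j]  : likelihood of the observation at time t in state j
    """
    T, N = len(emis), len(init)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]

    # Forward pass.
    for j in range(N):
        alpha[0][j] = init[j] * emis[0][j]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = emis[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(N)
            )

    # Backward pass.
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                trans[i][j] * emis[t + 1][j] * beta[t + 1][j] for j in range(N)
            )

    # Normalize alpha * beta per frame to obtain posteriors.
    gammas = []
    for t in range(T):
        unnorm = [alpha[t][j] * beta[t][j] for j in range(N)]
        z = sum(unnorm)
        gammas.append([u / z for u in unnorm])
    return gammas
```

Each row of the result is a distribution over states that depends on the whole utterance, which is what allows such posteriors to be used as context-aware features for a later modeling layer.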

Landmark-based approach to speech recognition: an alternative to HMMs

Conference of the International Speech Communication Association, 2007

In this paper, we compare a Probabilistic Landmark-Based speech recognition System (LBS) which uses Knowledge-based Acoustic Parameters (APs) as the front end with an HMM-based recognition system that uses the Mel-Frequency Cepstral Coefficients as its front end. The advantages of LBS based on APs are (1) the APs are normalized for extra-linguistic information, (2) acoustic analysis at different landmarks may be performed with different resolutions and with different APs, (3) LBS outputs multiple acoustic landmark sequences that signal perceptually significant regions in the speech signal, (4) it may be easier to port this system to another language since the phonetic features captured by the APs are universal, and (5) LBS can be used as a tool for uncovering and subsequently understanding variability. LBS also has a probabilistic framework that can be combined with pronunciation and language models in order to make it more scalable to large vocabulary recognition tasks.

Asynchronous stream modeling for large vocabulary audio-visual speech recognition

International Conference on Acoustics, Speech, and Signal Processing, 2001

Addresses the problem of audio-visual information fusion to provide highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams and propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be trained jointly based on maximum likelihood

Overcoming asynchrony in Audio-Visual Speech Recognition

2010 IEEE International Workshop on Multimedia Signal Processing, 2010

In this paper we propose two alternatives to overcome the natural asynchrony of modalities in Audio-Visual Speech Recognition. We first investigate the use of asynchronous statistical models based on Dynamic Bayesian Networks with different levels of asynchrony. We show that audio-visual models should consider asynchrony within word boundaries and not at phoneme level. The second approach applies additional processing to the features before they are used for recognition. The proposed technique aligns the temporal evolution of the audio and video streams in terms of a speech recognition system and enables the use of simpler statistical models for classification. In both cases we report experiments with the CUAVE database, showing the improvements obtained with the proposed asynchronous model and feature processing technique compared to traditional systems.

Weighted finite-state transducers in speech recognition

Computer Speech & Language, 2002

We survey the use of weighted finite-state transducers (WFSTs) in speech recognition. We show that WFSTs provide a common and natural representation for HMM models, context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs. Furthermore, general transducer operations combine these representations flexibly and efficiently. Weighted determinization and minimization algorithms optimize their time and space requirements, and a weight pushing algorithm distributes the weights along the paths of a weighted transducer optimally for speech recognition.
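To make the idea of combining representations by transducer operations concrete, here is a toy sketch of composition for epsilon-free weighted transducers over the tropical semiring, where weights along a path are added. The dictionary layout and the function name are assumptions chosen for this example and do not correspond to any particular toolkit's API; epsilon handling, determinization, minimization and weight pushing are omitted.

```python
from collections import deque


def compose(a, b):
    """Compose two epsilon-free weighted transducers (tropical semiring).

    Each transducer is a dict with:
      'start'  : start state
      'finals' : {state: final_weight}
      'arcs'   : {state: [(in_label, out_label, weight, next_state), ...]}
    An arc of `a` pairs with an arc of `b` when a's output label equals
    b's input label; the weights of the paired arcs are added.
    """
    start = (a["start"], b["start"])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        qa, qb = queue.popleft()
        if qa in a["finals"] and qb in b["finals"]:
            finals[(qa, qb)] = a["finals"][qa] + b["finals"][qb]
        for in_label, mid, w1, next_a in a["arcs"].get(qa, []):
            for mid2, out_label, w2, next_b in b["arcs"].get(qb, []):
                if mid != mid2:
                    continue
                nxt = (next_a, next_b)
                arcs.setdefault((qa, qb), []).append(
                    (in_label, out_label, w1 + w2, nxt)
                )
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# Toy machines: `first` maps "a" to "x", `second` maps "x" to "Y".
first = {"start": 0, "finals": {1: 0.0}, "arcs": {0: [("a", "x", 0.5, 1)]}}
second = {"start": 0, "finals": {1: 0.25}, "arcs": {0: [("x", "Y", 1.0, 1)]}}
result = compose(first, second)
```

The composed machine maps "a" directly to "Y" with the two arc weights summed, which mirrors at small scale how separate knowledge sources can be chained into a single search network.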

A multi-stream ASR framework for BLSTM modeling of conversational speech

International Conference on Acoustics, Speech, and Signal Processing, 2011

We propose a novel multi-stream framework for continuous conversational speech recognition which employs bidirectional Long Short-Term Memory (BLSTM) networks for phoneme prediction. The BLSTM architecture allows recurrent neural nets to model long-range context, which led to improved ASR performance when combined with conventional triphone modeling in a Tandem system. In this paper, we extend the principle of joint BLSTM and triphone modeling to a multi-stream system which uses MFCC features and BLSTM predictions as observations originating from two independent data streams. Using the COSINE database, we show that this technique prevails over a recently proposed single-stream Tandem system as well as over a conventional HMM recognizer.

LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP

Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing / sponsored by the Institute of Electrical and Electronics Engineers Signal Processing Society. ICASSP (Conference), 2005

Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciatio...

DBN based multi-stream models for audio-visual speech recognition

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing

In this paper, we propose a model based on Dynamic Bayesian Networks (DBNs) to integrate information from multiple audio and visual streams. We also compare the DBN based system (implemented using the Graphical Model Toolkit (GMTK)) with a classical HMM (implemented in the Hidden Markov Model Toolkit (HTK)) for both the single and two stream integration problems. We also propose a new model (mixed integration) to integrate information from three or more streams derived from different modalities and compare the new model's performance with that of a synchronous integration scheme. A new technique to estimate stream confidence measures for the integration of three or more streams is also developed and implemented. Results from our implementation using the Clemson University Audio Visual Experiments (CUAVE) database indicate an absolute improvement in word accuracy in the -4 to 10 dB average case when making use of two audio and one video streams for the mixed integration models over the synchronous models.