A comparative study on system combination schemes for LVCSR

Cross-lingual and multi-stream posterior features for low resource LVCSR systems

Interspeech 2010, 2010

We investigate approaches for building large vocabulary continuous speech recognition (LVCSR) systems for new languages or new domains using limited amounts of transcribed training data. In these low-resource conditions, the performance of conventional LVCSR systems degrades significantly. We propose to train low-resource LVCSR systems with additional sources of information, such as annotated data from other languages (German and Spanish) and various acoustic feature streams (short-term and modulation features). We train multilayer perceptrons (MLPs) on these sources of information and use Tandem features derived from the MLPs for low-resource LVCSR. In our experiments, the proposed system, trained using only one hour of English conversational telephone speech (CTS), provides a relative improvement of 11% over the baseline system.
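The Tandem recipe mentioned in the abstract can be made concrete with a minimal NumPy sketch: MLP phone posteriors are log-compressed, decorrelated (here with a plain PCA, one common choice), and appended to the cepstral features. The dimensionalities, the PCA step, and all names below are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def tandem_features(posteriors, cepstra, n_components=25, eps=1e-10):
    """Turn MLP phone posteriors into Tandem features:
    log-compress, decorrelate with PCA, then append to the cepstra."""
    logp = np.log(posteriors + eps)            # Gaussianize the posteriors
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # PCA basis of the log-posteriors
    order = np.argsort(eigvals)[::-1][:n_components]
    projected = centered @ eigvecs[:, order]
    return np.hstack([cepstra, projected])     # (frames, cepstral_dim + n_components)

# toy example: 100 frames, 40 phone posteriors, 13 cepstral coefficients
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=100)
ceps = rng.standard_normal((100, 13))
feats = tandem_features(post, ceps)
print(feats.shape)  # (100, 38)
```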

A phonetic feature based lattice rescoring approach to LVCSR

2009

Large Vocabulary Continuous Speech Recognition (LVCSR) systems decode the input speech using diverse information sources: acoustic, lexical, and linguistic. Although most unreliable hypotheses are pruned during the recognition process, current state-of-the-art systems often make errors that are "unreasonable" to human listeners. Several studies have shown that a proper integration of acoustic-phonetic information can help reduce such errors. We have previously shown that high-accuracy phone recognition can be achieved if a bank of speech attribute detectors is used to compute a confidence score describing the attribute activation levels that the current frame exhibits. In those experiments, the phone recognition system did not rely on a language model to enforce word-sequence constraints, and the vocabulary was small. In this work, we extend our approach to LVCSR by introducing a second recognition step during which additional information, not directly used during conventional log-likelihood based decoding, is introduced. Experimental results show promising performance.
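The second-pass idea can be sketched as a simple N-best rescoring: interpolate the first-pass decoder score with a phonetic-attribute confidence score and re-rank. The interpolation weight, the score names, and the hypotheses below are hypothetical; the paper's actual formulation works on lattices.

```python
def rescore_nbest(nbest, attr_scores, lam=0.3):
    """Second-pass rescoring sketch: combine the decoder's log score
    with a per-hypothesis phonetic-attribute confidence and re-rank.
    `lam` is an assumed, tunable interpolation weight."""
    rescored = [(hyp, score + lam * attr_scores[hyp]) for hyp, score in nbest]
    return max(rescored, key=lambda x: x[1])

# toy example: the attribute detectors penalize the acoustically
# tempting but phonetically implausible hypothesis
nbest = [("recognize speech", -12.0), ("wreck a nice beach", -11.5)]
attr = {"recognize speech": -1.0, "wreck a nice beach": -4.0}
print(rescore_nbest(nbest, attr)[0])  # recognize speech
```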

Recent advances in LVCSR: A benchmark comparison of performances

International Journal of Electrical and Computer Engineering (IJECE), 2017

Large Vocabulary Continuous Speech Recognition (LVCSR), which is characterized by high variability of the speech signal, is the most challenging task in automatic speech recognition (ASR). Believing that the evaluation of ASR systems on relevant and common speech corpora is one of the key factors that helps accelerate research, we present, in this paper, a benchmark comparison of the performances of current state-of-the-art LVCSR systems over different speech recognition tasks. Furthermore, we objectively identify the best performing technologies and the best accuracy achieved so far in each task. The benchmarks have shown that Deep Neural Networks and Convolutional Neural Networks have proven their efficiency on several LVCSR tasks by outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They have also shown that, despite satisfying performances in some LVCSR tasks, the problem of large-vocabulary speech recognition is far from solved in others, where more research effort is still needed.

1. INTRODUCTION

Speech is a natural and fundamental communication vehicle and can be considered one of the most appropriate media for human-machine interaction. The aim of Automatic Speech Recognition (ASR) systems is to convert a speech signal into a sequence of words, either for text-based communication or for device control. ASR is typically used when the keyboard becomes inconvenient: for example, when our hands are busy or mobility is limited, when we are on the phone, in the dark, or moving around. ASR finds application in many different areas: dictation, meeting and lecture transcription, speech translation, voice search, phone-based services, and others. Such systems are, in general, extremely dependent on the data used to train the models, the configuration of the front-ends, and so on. Hence a large part of system development usually involves investigating appropriate configurations for a new domain, new training data, and a new language.

There are several speech recognition tasks, and the difference between them rests mainly on: (i) the speech type (isolated or continuous speech), (ii) the speaker mode (speaker dependent or independent), (iii) the vocabulary size (small, medium or large) and (iv) the speaking style (read or spontaneous speech). Even though ASR has matured to the point of commercial applications, Speaker Independent Large Vocabulary Continuous Speech Recognition tasks (commonly designated LVCSR) pose a particular challenge to ASR technology developers. Among the major problems that arise when LVCSR systems are developed: first, speaker-independent systems require a large amount of training data in order to cover speaker variability. Second, continuous speech recognition is very complex because of the difficulty of locating word boundaries and the high degree of pronunciation variation due to dialects, coarticulation and noise, unlike isolated word

Combination of acoustic models in continuous speech recognition hybrid systems

2000

The combination of multiple sources of information has been an attractive approach in different areas. That is the case in speech recognition, where several combination methods have been presented. Our hybrid MLP/HMM systems use acoustic models based on different sets of features and different MLP classifier structures. In this work we developed a method for combining the phoneme probabilities generated by acoustic models trained on distinct feature extraction processes. Two different algorithms were implemented for combining the acoustic model probabilities: the first operates in the probability domain and the second in the log-probability domain. We combined two and three alternative baseline systems, obtaining relative word error rate improvements larger than 20% on a large-vocabulary, speaker-independent continuous speech recognition task.
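The two combination domains described above correspond to a weighted arithmetic mean of the per-frame phoneme posteriors versus a weighted geometric mean (a sum in the log domain). A minimal NumPy sketch, with stream shapes and uniform default weights as illustrative assumptions:

```python
import numpy as np

def combine_linear(posterior_streams, weights=None):
    """Probability-domain combination: weighted arithmetic mean
    of per-frame phoneme posteriors from several acoustic models."""
    p = np.stack(posterior_streams)                # (streams, frames, phones)
    w = np.ones(len(posterior_streams)) if weights is None else np.asarray(weights)
    w = w / w.sum()
    return np.tensordot(w, p, axes=1)              # (frames, phones)

def combine_log(posterior_streams, weights=None, eps=1e-10):
    """Log-probability-domain combination: weighted geometric mean,
    renormalized so each frame's posteriors sum to one."""
    p = np.stack(posterior_streams)
    w = np.ones(len(posterior_streams)) if weights is None else np.asarray(weights)
    w = w / w.sum()
    combined = np.exp(np.tensordot(w, np.log(p + eps), axes=1))
    return combined / combined.sum(axis=-1, keepdims=True)

# toy example: one frame, two phones, two streams that disagree
streams = [np.array([[0.7, 0.3]]), np.array([[0.4, 0.6]])]
print(combine_linear(streams))  # [[0.55 0.45]]
```

The geometric mean tends to be more conservative: a phone must score reasonably well in every stream to survive, whereas the arithmetic mean lets one confident stream dominate.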

Improved Acoustic Feature Combination for LVCSR by Neural Networks

This paper investigates the combination of different acoustic features. Several methods for combining such features, such as concatenation or LDA, are well known. Even though LDA improves the system, feature combination by LDA has been shown to be suboptimal. We introduce a new method based on neural networks. The posterior estimates derived from the NN lead to a significant improvement, achieving a 6% relative reduction in word error rate (WER).

Multilingual MLP features for low-resource LVCSR systems

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012


On the Use of MLP Features for Broadcast News Transcription

Lecture Notes in Computer Science, 2008

Multi-Layer Perceptron (MLP) features have recently been attracting growing interest for automatic speech recognition due to their complementarity with cepstral features. In this paper the use of MLP features is evaluated in a large vocabulary continuous speech recognition task, exploring different types of MLP features and their combination. Cepstral features and three types of Bottle-Neck MLP features were first evaluated without and with unsupervised model adaptation, using models with the same number of parameters. When used with MLLR adaptation on an Arabic broadcast news transcription task, Bottle-Neck MLP features perform as well as or even slightly better than a standard 39-dimensional PLP based front-end. This paper also explores different combination schemes (feature concatenation, cross adaptation, and hypothesis combination). Extending the feature vector by combining various feature sets led to a 9% relative word error rate reduction over the PLP baseline. Significant gains are also reported with both ROVER hypothesis combination and cross-model adaptation. Feature concatenation appears to be the most efficient combination method, providing the best gain with the lowest decoding cost.
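Of the combination schemes above, ROVER hypothesis combination is the least obvious: it aligns the word sequences from several systems and takes a vote at each position. Real ROVER builds a word transition network by dynamic-programming alignment; the sketch below assumes the hypotheses are already aligned position-by-position, which is a simplification.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Word-level majority vote over hypotheses that are already
    aligned position-by-position ('' marks a null/deletion arc)."""
    out = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                       # drop slots where 'no word' wins
            out.append(word)
    return out

# toy example: three systems, pre-aligned word-by-word
hyps = [["the", "cat", "sat", ""],
        ["the", "cap", "sat", "down"],
        ["the", "cat", "",    "down"]]
print(rover_vote(hyps))  # ['the', 'cat', 'sat', 'down']
```

Full ROVER also weights votes by per-word confidence scores rather than counting each system equally.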

Voicing feature integration in SRI's decipher LVCSR system

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004

We augment the Mel cepstral (MFCC) feature representation with voicing features from an independent front end. The voicing feature front end parameters are optimized for recognition accuracy. The voicing features computed are the normalized autocorrelation peak and a newly proposed entropy of the high-order cepstrum. We explored several alternatives to integrate the voicing features into SRI's DECIPHER system. Promising early results were obtained in a simple system concatenating the voicing features with MFCC features and optimizing the voicing feature window duration. Best results overall came from a more complex system combining a multiframe voicing feature window with the MFCC plus third differential features using linear discriminant analysis and optimizing the number of voicing feature frames. The best integration approach from the single-pass system experiments was implemented in a multi-pass system for large vocabulary testing on the Switchboard database. An average WER reduction of 2% relative was obtained on the NIST Hub-5 dev2001 and eval2002 databases.
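The normalized autocorrelation peak used as a voicing feature can be sketched in a few lines: search the autocorrelation of a frame over plausible pitch lags and normalize by the zero-lag energy, so voiced (periodic) frames score near 1 and unvoiced frames near 0. The sampling rate, pitch range, and frame length below are illustrative assumptions, not SRI's settings.

```python
import numpy as np

def normalized_autocorr_peak(frame, fs=8000, fmin=60.0, fmax=400.0):
    """Peak of the frame's autocorrelation over the pitch lag range,
    normalized by the zero-lag energy (Cauchy-Schwarz bounds it by 1)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    if ac[0] <= 0:
        return 0.0
    lo, hi = int(fs / fmax), int(fs / fmin)    # lag range for 60-400 Hz pitch
    return float(ac[lo:hi].max() / ac[0])

# toy example: a 120 Hz tone (voiced-like) vs white noise (unvoiced-like)
fs = 8000
t = np.arange(int(0.03 * fs)) / fs             # one 30 ms frame
voiced = np.sin(2 * np.pi * 120 * t)
rng = np.random.default_rng(1)
unvoiced = rng.standard_normal(len(t))
print(normalized_autocorr_peak(voiced, fs) > normalized_autocorr_peak(unvoiced, fs))
```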

Transcription of broadcast news-some recent improvements to IBM's LVCSR system

1998

This paper describes extensions and improvements to IBM's large vocabulary continuous speech recognition (LVCSR) system for the transcription of broadcast news. The recognizer uses 35 hours of training data beyond that used in the 1996 Hub4 evaluation [?]. It includes a number of new features: an optimal feature space for acoustic modeling (in training and/or testing), filler-word modeling, Bayesian Information Criterion (BIC) based segment clustering, an improved implementation of iterative MLLR, and 4-gram language models. Results on the 1996 DARPA Hub4 evaluation data set are presented.

Combination and generation of parallel feature streams for improved speech recognition

2005

The combination of information from parallel features that provide complementary information about the speech signal generally improves speech recognition accuracy. There are two issues associated with parallel feature combination: the specific method of combining the parallel features, and the nature of the parallel features themselves. These two issues jointly determine the performance of an information combination