Voicing feature integration in SRI's DECIPHER LVCSR system
Related papers
Recent advances in LVCSR: A benchmark comparison of performances
International Journal of Electrical and Computer Engineering (IJECE), 2017
Large Vocabulary Continuous Speech Recognition (LVCSR), which is characterized by high variability of the speech signal, is the most challenging task in automatic speech recognition (ASR). Believing that the evaluation of ASR systems on relevant and common speech corpora is one of the key factors that help accelerate research, we present, in this paper, a benchmark comparison of the performances of current state-of-the-art LVCSR systems over different speech recognition tasks. Furthermore, we objectively put into evidence the best-performing technologies and the best accuracy achieved so far in each task. The benchmarks have shown that Deep Neural Networks and Convolutional Neural Networks have proven their efficiency on several LVCSR tasks by outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They have also shown that, despite satisfying performances in some LVCSR tasks, the problem of large-vocabulary speech recognition is far from being solved in others, where more research effort is still needed.

1. INTRODUCTION

Speech is a natural and fundamental communication vehicle and one of the most appropriate media for human-machine interaction. The aim of Automatic Speech Recognition (ASR) systems is to convert a speech signal into a sequence of words, either for text-based communication or for device control. ASR is typically used when the keyboard becomes inconvenient, for example when our hands are busy or have limited mobility, when we are using the phone, when we are in the dark, or when we are moving around. ASR finds application in many different areas: dictation, meeting and lecture transcription, speech translation, voice search, phone-based services, and others. These systems are, in general, extremely dependent on the data used for training the models, the configuration of the front-ends, etc. Hence, a large part of system development usually involves investigating appropriate configurations for a new domain, new training data, and a new language. There are several speech recognition tasks, and the difference between them rests mainly on: (i) the speech type (isolated or continuous speech), (ii) the speaker mode (speaker-dependent or speaker-independent), (iii) the vocabulary size (small, medium, or large), and (iv) the speaking style (read or spontaneous speech). Even though ASR has matured to the point of commercial application, speaker-independent Large Vocabulary Continuous Speech Recognition tasks (commonly designated as LVCSR) pose a particular challenge to ASR technology developers. Three major problems arise when LVCSR systems are developed. First, speaker-independent systems require a large amount of training data in order to cover speaker variability. Second, continuous speech recognition is very complex because of the difficulty of locating word boundaries and the high degree of pronunciation variation due to dialects, coarticulation, and noise, unlike isolated word recognition.
Improved Acoustic Feature Combination for LVCSR by Neural Networks
This paper investigates the combination of different acoustic features. Several methods of combining such features, such as concatenation or LDA, are well known. Even though LDA improves the system, feature combination by LDA has been shown to be suboptimal. We introduce a new method based on neural networks. The posterior estimates derived from the NN lead to a significant improvement and achieve a 6% relative improvement in word error rate (WER).
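As a rough illustration of the baseline combination methods this abstract mentions, the sketch below concatenates two feature streams frame by frame and reduces them with LDA. The array names, dimensions, and label set are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of feature combination by concatenation + LDA,
# assuming two pre-computed feature streams aligned frame by frame.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical inputs: (n_frames, dim) MFCC and PLP matrices plus
# per-frame class labels used to supervise the LDA projection.
mfcc = np.random.randn(1000, 39)
plp = np.random.randn(1000, 39)
labels = np.random.randint(0, 40, size=1000)  # e.g., 40 phone classes

combined = np.hstack([mfcc, plp])  # simple concatenation

# LDA projects the concatenated frames to at most (n_classes - 1) dims.
lda = LinearDiscriminantAnalysis(n_components=32)
reduced = lda.fit_transform(combined, labels)
print(reduced.shape)  # (1000, 32)
```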
Improving recognition accuracy on CVSD speech under mismatched conditions
2003
Emerging technology in mobile communications is seeing increasingly high acceptance as a preferred choice for last-mile communication. A wide range of techniques exists to achieve the signal compression needed for the smaller bandwidths available on mobile communication channels, but speech recognition methods have seen success mostly in controlled speech environments. However, the design of speech recognition systems for mobile communications is crucial in order to provide voice-enabled command and control and to support applications like Mobile Voice Commerce. Continuously Variable Slope Delta (CVSD) modulation, a technique for low-bitrate coding of speech, has been in use particularly in military wireless environments for over 30 years, and has now also been adopted by Bluetooth. CVSD is particularly suitable for Internet and mobile environments due to its robustness against transmission errors, its simplicity of implementation, and the absence of a need for synchronization. In this paper, we study some characteristics of CVSD speech in the context of robust recognition of compressed speech, and present two methods of improving recognition accuracy in Automatic Speech Recognition (ASR) systems. First, we study the characteristics of the features extracted for ASR and how they relate to the corresponding features computed from Pulse Coded Modulation (PCM) speech, and apply this relation to correct the CVSD features and improve recognition accuracy. Second, we show that ASR performed directly on the bit-stream gives good recognition accuracy and, when combined with our approach, gives even better accuracy.
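For readers unfamiliar with CVSD, here is a minimal textbook-style encoder/decoder sketch. The step-size constants and the 3-bit run-length rule are illustrative defaults, not the exact parameters of the Bluetooth or military variants discussed above.

```python
import numpy as np

def cvsd_encode(x, step_min=0.01, step_max=1.0, inc=1.2, decay=0.98):
    """One-bit CVSD encoding: emit 1 if the input is above the running
    estimate, and grow the step size when the last 3 bits agree."""
    bits, est, step, history = [], 0.0, step_min, []
    for sample in x:
        bit = 1 if sample > est else 0
        bits.append(bit)
        history = (history + [bit])[-3:]
        # Syllabic companding: a run of identical bits means the estimate
        # is lagging the signal, so the step size is increased.
        if len(history) == 3 and len(set(history)) == 1:
            step = min(step * inc, step_max)
        else:
            step = max(step * decay, step_min)
        est += step if bit else -step
    return bits

def cvsd_decode(bits, step_min=0.01, step_max=1.0, inc=1.2, decay=0.98):
    """Decoder mirrors the encoder's step adaptation exactly."""
    out, est, step, history = [], 0.0, step_min, []
    for bit in bits:
        history = (history + [bit])[-3:]
        if len(history) == 3 and len(set(history)) == 1:
            step = min(step * inc, step_max)
        else:
            step = max(step * decay, step_min)
        est += step if bit else -step
        out.append(est)
    return np.array(out)

# Round trip on a test tone sampled at 16 kHz.
t = np.arange(1600) / 16000.0
signal = 0.5 * np.sin(2 * np.pi * 300 * t)
recon = cvsd_decode(cvsd_encode(signal))
```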
A phonetic feature based lattice rescoring approach to LVCSR
2009
Large Vocabulary Continuous Speech Recognition (LVCSR) systems decode the input speech using diverse information sources, such as acoustic, lexical, and linguistic knowledge. Although most unreliable hypotheses are pruned during the recognition process, current state-of-the-art systems often make errors that are "unreasonable" for human listeners. Several studies have shown that a proper integration of acoustic-phonetic information can be beneficial in reducing such errors. We have previously shown that high-accuracy phone recognition can be achieved if a bank of speech attribute detectors is used to compute a confidence score describing the attribute activation levels that the current frame exhibits. In those experiments, the phone recognition system did not rely on a language model to impose word-sequence constraints, and the vocabulary was small. In this work, we extend our approach to LVCSR by introducing a second recognition step during which additional information, not directly used during conventional log-likelihood based decoding, is introduced. Experimental results show promising performance.
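The two-pass idea can be illustrated with a simple n-best rescoring function: hypotheses from the first pass are re-ranked after adding a score term from the attribute detectors. The weights and field names below are assumptions for illustration, not the paper's actual formulation.

```python
# Minimal sketch of second-pass rescoring: first-pass hypotheses are
# re-ranked after adding an attribute-detector confidence term that the
# conventional log-likelihood decoding did not use. All weights are
# illustrative assumptions.
def rescore_nbest(hypotheses, lm_weight=10.0, attr_weight=2.0):
    """Each hypothesis carries acoustic, language-model, and
    phonetic-attribute log scores; return them re-ranked."""
    def total(h):
        return (h["acoustic_score"]
                + lm_weight * h["lm_score"]
                + attr_weight * h["attribute_score"])
    return sorted(hypotheses, key=total, reverse=True)

nbest = [
    {"words": "recognize speech", "acoustic_score": -120.3,
     "lm_score": -8.1, "attribute_score": -2.0},
    {"words": "wreck a nice beach", "acoustic_score": -119.8,
     "lm_score": -9.4, "attribute_score": -6.5},
]
best = rescore_nbest(nbest)[0]["words"]
```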
Implementation and analysis of speech recognition front-ends
1999
We have developed a comprehensive front-end module integrating several signal modeling algorithms common to state-of-the-art speech recognition systems. The algorithms presented in this work include mel-frequency cepstra, perceptual linear prediction, filter bank amplitudes, and delta features. The framework for the front-end system was carefully designed to ensure simple integration into speech processing software. The modular design of the software, along with an intuitive GUI, provides a powerful tutorial environment by allowing a wide selection of algorithms. The software is written in a tutorial fashion, with a direct correlation between algorithmic lines of code and equations in the technical paper.
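Of the algorithms listed, delta features follow a standard regression formula, d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2); a minimal numpy version (the window size N=2 is an assumed default) is sketched below.

```python
import numpy as np

def delta_features(cepstra, N=2):
    """Standard regression-based delta computation over a (T, D)
    cepstral matrix, with edge frames padded by repetition."""
    T = cepstra.shape[0]
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(cepstra)
    for t in range(T):
        acc = np.zeros(cepstra.shape[1])
        for n in range(1, N + 1):
            # padded[t + N] is the original frame t.
            acc += n * (padded[t + N + n] - padded[t + N - n])
        deltas[t] = acc / denom
    return deltas
```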
Formant weighted cepstral feature for LSP-based speech recognition
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
In this paper, we propose a formant-weighted cepstral feature for LSP-based speech recognition systems. The proposed weighting scheme is based on the well-known property of LSPs that the speech spectrum has a peak when adjacent LSFs come close together. By applying this scheme to the pseudo-cepstrum (PCEP) conversion process [1], we obtain a formant-weighted, or peak-enhanced, cepstral feature. Results of speech recognition experiments using QCELP coder output show that the proposed feature set outperforms conventional features such as LSP or PCEP. Moreover, its performance also exceeds that of the unquantized LPC cepstrum.
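The exact weighting function is defined in the paper itself; purely to illustrate the LSP property it builds on, the hypothetical sketch below assigns each LSF a weight that grows as its nearest neighbor gets closer. This is not the paper's formula.

```python
import numpy as np

def lsf_peak_weights(lsf, floor=1e-3):
    """Illustrative only: weight each LSF by the inverse distance to its
    nearest neighbor, so closely spaced LSFs (spectral peaks, i.e.
    formant regions) receive larger weights."""
    lsf = np.sort(lsf)
    gaps = np.diff(lsf)
    # Distance from each LSF to its nearest neighboring LSF.
    nearest = np.minimum(np.concatenate([[gaps[0]], gaps]),
                         np.concatenate([gaps, [gaps[-1]]]))
    return 1.0 / np.maximum(nearest, floor)
```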
Training set issues in SRI's DECIPHER speech recognition system
Proceedings of the workshop on Speech and Natural Language - HLT '90, 1990
SRI has developed the DECIPHER system, a hidden Markov model (HMM) based continuous speech recognition system typically used in a speaker-independent manner. We first review the DECIPHER system, then show that DECIPHER's speaker-independent performance improved by 20% when the standard 3990-sentence speaker-independent training set was augmented with the 7200 speaker-dependent training sentences from the Resource Management corpus. We show a further improvement of over 20% when a version of corrective training was implemented. Finally, we show improvement from using parallel male- and female-trained models in DECIPHER. The word-error rate with all three improvements combined was 3.7% on DARPA's February 1989 speaker-independent test set using the standard perplexity-60 word-pair grammar.

System Description

Front-End Analysis

DECIPHER uses an FFT-based Mel-cepstra front end. Twenty-five FFT-Mel filters spanning 100 to 6400 Hz are used to derive 12 Mel-cepstra coefficients every 10-ms frame. Four features are derived every frame from this cepstral sequence.
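Using the parameters quoted above (25 Mel filters spanning 100 to 6400 Hz, 12 cepstral coefficients, 10-ms frames), a roughly comparable modern front end can be sketched with librosa. This is an approximation for illustration, not SRI's original implementation; the window length and file name are assumptions.

```python
# Approximate re-creation of the quoted front-end settings with librosa:
# 25 Mel filters over 100-6400 Hz, 12 cepstral coefficients, 10 ms hop.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=12,            # 12 Mel-cepstra coefficients
    n_mels=25,            # 25 Mel filters
    fmin=100, fmax=6400,  # 100-6400 Hz analysis band
    n_fft=400,            # 25 ms window at 16 kHz (assumed)
    hop_length=160,       # 10 ms frame advance
)
deltas = librosa.feature.delta(mfcc)  # one plausible per-frame derived feature
```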
Perceptual MVDR-based cepstral coefficients (PMCCs) for high accuracy speech recognition
2003
This paper describes a robust feature extraction technique for continuous speech recognition. Central to the technique is the Minimum Variance Distortionless Response (MVDR) method of spectrum estimation. We incorporate perceptual information directly into the spectrum estimation, which provides improved robustness and computational efficiency compared with the previously proposed MVDR-MFCC technique [10]. On an in-car speech recognition task, this method, which we refer to as PMCC, is 15% more accurate in WER and requires approximately four times less computation than the MVDR-MFCC technique. On the same task, PMCC yields a 20% relative improvement over MFCC and an 11% relative improvement over PLP front-ends. Similar improvements are observed on the Aurora 2 database.
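The core MVDR (Capon) spectrum estimate can be computed directly from the frame's autocorrelation matrix, as sketched below; the perceptual warping that distinguishes PMCC from plain MVDR is omitted, and the model order shown is an assumed default.

```python
import numpy as np
from scipy.linalg import toeplitz

def mvdr_spectrum(frame, order=12, n_freqs=257):
    """MVDR (Capon) spectrum S(w) = 1 / (e(w)^H R^{-1} e(w)) built from a
    Toeplitz autocorrelation matrix of the given model order. Perceptual
    warping, as used in PMCC, is omitted here for brevity."""
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)]) / len(frame)
    R = toeplitz(r)
    R_inv = np.linalg.inv(R + 1e-8 * np.eye(order + 1))  # regularized
    omegas = np.linspace(0, np.pi, n_freqs)
    spectrum = np.empty(n_freqs)
    for i, w in enumerate(omegas):
        e = np.exp(1j * w * np.arange(order + 1))  # steering vector
        spectrum[i] = 1.0 / np.real(e.conj() @ R_inv @ e)
    return spectrum
```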
Interspeech 2014, 2014
A new language recognition technique based on applying the philosophy of Shifted Delta Coefficients (SDC) to phone log-likelihood ratio (PLLR) features is described. The new methodology allows the incorporation of long-span phonetic information at a frame-by-frame level while dealing with the temporal length of each phone unit. The proposed features are used to train an i-vector based system and tested on the Albayzin LRE 2012 dataset. The results show a relative improvement of 33.3% in Cavg in comparison with different state-of-the-art acoustic i-vector based systems. The integration of parallel phone ASR systems, where each one is used to generate multiple PLLR coefficients that are stacked together and then projected into a reduced dimension, is also presented. Finally, the paper shows how the incorporation of state information from the phone ASR provides additional improvements, and how fusion with the other acoustic and phonotactic systems yields an important improvement of 25.8% over the system presented during the competition.
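The SDC-over-PLLR idea can be summarized in a few lines: PLLRs are the log-odds of frame-level phone posteriors, and SDC stacks k shifted delta blocks computed with spread d at offsets of P frames. The parameter defaults below are illustrative, not the configuration used in the paper.

```python
import numpy as np

def pllr(posteriors, eps=1e-10):
    """Frame-level phone log-likelihood ratios from phone posteriors:
    PLLR_i = log(p_i / (1 - p_i))."""
    p = np.clip(posteriors, eps, 1 - eps)
    return np.log(p / (1 - p))

def sdc(features, d=1, P=3, k=7):
    """Shifted Delta Coefficients over a (T, N) feature matrix: stack k
    delta blocks, each with spread d, at offsets 0, P, ..., (k-1)*P."""
    T, N = features.shape
    padded = np.pad(features, ((d, d + (k - 1) * P), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        shift = i * P
        # Delta at offset `shift`: c[t + shift + d] - c[t + shift - d],
        # where padded[t + d] corresponds to the original frame t.
        blocks.append(padded[shift + 2 * d : shift + 2 * d + T]
                      - padded[shift : shift + T])
    return np.hstack(blocks)  # (T, N * k)

# Hypothetical usage: posteriors from a phone ASR, shape (T, n_phones).
posteriors = np.random.dirichlet(np.ones(40), size=500)
sdc_pllr = sdc(pllr(posteriors))
```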
Language Recognition on Albayzin 2010 LRE using PLLR features
Abstract: The so-called Phone Log-Likelihood Ratios (PLLR) have been introduced as alternative features to MFCC-SDC for iVector-based Language Recognition (LR) systems. In this paper, after a brief description of these features, new evidence of their usefulness for LR tasks is provided through a new set of experiments on the Albayzin 2010 LRE database, which contains wide-band multi-speaker speech in six different languages: Basque, Catalan, Galician, Spanish, Portuguese, and English. iVector systems trained on PLLRs obtain significant relative improvements over phonotactic systems and iVector systems trained on MFCC-SDC features, under both clean and noisy speech conditions. Fusions of the PLLR systems with the phonotactic and/or MFCC-SDC-based systems provide additional performance gains, revealing that PLLR features contribute complementary information in both cases.