Recent advances in LVCSR: A benchmark comparison of performances
Large Vocabulary Continuous Speech Recognition (LVCSR), which is characterized by high variability of the speech, is the most challenging task in automatic speech recognition (ASR). Believing that evaluating ASR systems on relevant, common speech corpora is one of the key factors that help accelerate research, we present in this paper a benchmark comparison of the performance of current state-of-the-art LVCSR systems over different speech recognition tasks. Furthermore, we objectively identify the best-performing technologies and the best accuracy achieved so far in each task. The benchmarks show that Deep Neural Networks and Convolutional Neural Networks have proven their effectiveness on several LVCSR tasks by outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They also show that, despite satisfactory performance on some LVCSR tasks, large-vocabulary speech recognition is far from solved on others, where more research effort is still needed.

1. INTRODUCTION

Speech is a natural and fundamental means of communication and can be considered one of the most appropriate media for human-machine interaction. The aim of Automatic Speech Recognition (ASR) systems is to convert a speech signal into a sequence of words, either for text-based communication or for device control. ASR is typically used when a keyboard is inconvenient: for example, when our hands are busy or have limited mobility, when we are using the phone, when we are in the dark, or when we are moving around. ASR finds application in many different areas: dictation, meeting and lecture transcription, speech translation, voice search, phone-based services and others. These systems are, in general, extremely dependent on the data used for training the models, the configuration of the front-ends, and so on.
Hence, a large part of system development usually involves investigating appropriate configurations for a new domain, new training data, or a new language. There are several speech recognition tasks, and the differences between them rest mainly on: (i) the speech type (isolated or continuous speech), (ii) the speaker mode (speaker dependent or independent), (iii) the vocabulary size (small, medium or large) and (iv) the speaking style (read or spontaneous speech). Even though ASR has matured to the point of commercial applications, Speaker-Independent Large Vocabulary Continuous Speech Recognition tasks (commonly referred to as LVCSR) pose a particular challenge to ASR technology developers. Three of the major problems that arise when LVCSR systems are being developed are the following. First, speaker-independent systems require a large amount of training data in order to cover speaker variability. Second, continuous speech recognition is very complex because of the difficulty of locating word boundaries and the high degree of pronunciation variation due to dialects, coarticulation and noise, unlike isolated word