Large Vocabulary Continuous Speech Recognition Research Papers
Amharic is the official language of Ethiopia. It belongs to the Semitic language family and is characterized by a quite homogeneous phonology distinguishing 234 distinct Consonant-Vowel (CV) syllables. Since no Amharic speech corpus of any kind existed, we developed a read-speech corpus using a phonetically rich and balanced text database. To prepare the text database, we used the archive of the EthioZena website, which consists of selected articles from well-known newspapers and magazines published in Amharic. The archive was cleaned semi-automatically. Like other standard speech corpora, such as WSJCAM0, the Amharic speech corpus contains a training set, a speaker adaptation set, and test sets (development and evaluation sets, each with 5,000- and 20,000-word vocabularies). The speech was recorded in Ethiopia in an office environment and segmented semi-automatically. The corpus is now used for experiments with syllable- and phone-based LVCSR for Amharic.
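The strict CV syllable structure described above lends itself to rule-based segmentation. The following is a toy sketch, not from the paper: it splits an idealized romanized CV transcription into syllables, using placeholder ASCII consonant and vowel sets rather than the actual Amharic phoneme inventory.

```python
import re

def cv_syllabify(word, consonants="bcdfghjklmnpqrstvwxyz", vowels="aeiou"):
    """Split a romanized word into consonant-vowel syllables.

    A toy sketch: assumes every syllable is exactly one consonant
    followed by one vowel, as in an idealized CV transcription.
    Returns None if the word does not fit the CV pattern.
    """
    pattern = re.compile(f"[{consonants}][{vowels}]")
    syllables = pattern.findall(word)
    return syllables if "".join(syllables) == word else None

print(cv_syllabify("gena"))   # two CV syllables
print(cv_syllabify("selam"))  # ends in a bare consonant: not strict CV
```

A real syllabifier would of course work on the Ethiopic script or a proper phonemic transcription and handle non-CV syllable types.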
With the distribution of speech technology products all over the world, portability to new target languages becomes a practical concern. Our research therefore focuses on the question of how to port LVCSR systems in a fast and efficient way. More specifically, we want to estimate acoustic models for a new target language using speech data from varied source languages, but only limited data from the target language itself. For this purpose we introduce different methods for multilingual acoustic model combination and a polyphone decision tree specialization procedure. Recognition results using language-dependent, language-independent, and language-adaptive acoustic models are presented and discussed in the framework of our GlobalPhone project, which investigates LVCSR systems in 15 languages.
Aiming at increased system simplicity and flexibility, an audio-evoked system was developed by integrating a simplified headphone with user-friendly software design. This paper describes a Hindi Speech Actuated Computer Interface for Web search (HSACIWS), which accepts spoken queries in Hindi and displays the search results on the screen. The system recognizes spoken queries with large vocabulary continuous speech recognition (LVCSR), retrieves relevant documents by text retrieval, and presents the results on the Web by integrating the Web and voice systems. The LVCSR component showed sufficient performance on speech, using acoustic and language models derived from a query corpus with target contents.
Today, most systems use large vocabulary continuous speech recognition tools to produce word transcripts; the transcripts are indexed, and query terms are retrieved from the index. However, query terms that are not part of the recognizer's vocabulary cannot be retrieved, which hurts the recall of the search.
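The OOV problem is easy to see with a toy inverted index over one-best transcripts. Document names and words below are made up for illustration.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each recognized word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, words in transcripts.items():
        for w in words:
            index[w].add(doc_id)
    return index

# Hypothetical one-best transcripts; suppose "kyoto" was outside the
# recognizer's vocabulary and was misrecognized as "koto", so the true
# word never reaches the index.
transcripts = {
    "doc1": ["visit", "koto", "temple"],
    "doc2": ["temple", "garden"],
}
index = build_index(transcripts)
print(index.get("temple"))  # in-vocabulary term: hits in both documents
print(index.get("kyoto"))   # OOV at recognition time: no hits at all
```

No amount of retrieval-side tuning recovers "kyoto" here, which is why subword or phonetic indexing is often used alongside word transcripts.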
The universal background model (UBM) is an effective framework widely used in speaker recognition, but so far it has received little attention in the speech recognition field. In this work, we make a first attempt to apply the UBM to acoustic modeling in ASR. We propose a tree-based parameter estimation technique for UBMs and describe a set of smoothing and pruning methods to facilitate learning. The proposed UBM approach is benchmarked on a state-of-the-art large-vocabulary continuous speech recognition platform on a broadcast transcription task. Preliminary experiments reported in this paper already show very promising results.
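For readers unfamiliar with the UBM framework from speaker recognition: a UBM is a GMM trained on pooled data, and class-specific models are derived from it by MAP adaptation, classically of the means only, controlled by a relevance factor. The sketch below shows that recipe in one dimension with unit variances; it illustrates the speaker-recognition baseline idea, not the paper's tree-based estimation.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, data, relevance=16.0):
    """MAP-adapt the means of a 1-D GMM (the UBM) toward new data.

    Posteriors use unit-variance components for simplicity; each mean is
    interpolated toward the data it is responsible for, with the
    relevance factor damping components that saw little data.
    """
    # responsibilities gamma[t, m] of component m for frame t
    diff = data[:, None] - ubm_means[None, :]
    log_lik = -0.5 * diff**2 + np.log(ubm_weights)[None, :]
    gamma = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)

    n = gamma.sum(axis=0)                        # soft counts per component
    first = (gamma * data[:, None]).sum(axis=0)  # first-order statistics
    alpha = n / (n + relevance)                  # adaptation coefficient
    return alpha * (first / np.maximum(n, 1e-8)) + (1 - alpha) * ubm_means

ubm_means = np.array([-2.0, 2.0])
ubm_weights = np.array([0.5, 0.5])
data = np.array([2.5, 3.0, 2.8])  # adaptation frames near component 2
adapted = map_adapt_means(ubm_means, ubm_weights, data)
print(adapted)  # second mean shifts toward the data, first barely moves
```

With only three frames and relevance 16, the adapted mean moves part of the way toward the data mean; more data drives alpha toward 1.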
We analyze subword-based language models (LMs) in large-vocabulary continuous speech recognition across four "morphologically rich" languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. By estimating n-gram LMs over sequences of morphs instead of words, better vocabulary coverage and reduced data sparsity are obtained. Standard word LMs suffer from high out-of-vocabulary (OOV) rates, whereas the morph LMs can recognize previously unseen word forms by concatenating morphs. We show that the morph LMs generally outperform the word LMs and that they perform fairly well on OOVs without compromising the accuracy obtained for in-vocabulary words.
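The coverage advantage of morphs can be illustrated with a toy greedy longest-match segmenter. Real systems learn the morph inventory statistically from data; the inventory and words below are made-up Finnish-like examples.

```python
def segment(word, morphs):
    """Greedy longest-match segmentation of a word into known morphs.

    Toy stand-in for a data-driven morph segmenter: an unseen word form
    is still covered if it decomposes into morphs seen in training.
    """
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in morphs:
                out.append(word[i:j])
                i = j
                break
        else:
            return None  # unsegmentable: truly out of vocabulary
    return out

# Hypothetical morph inventory learned from training text
morphs = {"talo", "ssa", "auto", "lla"}
print(segment("talossa", morphs))  # word form seen in training
print(segment("autossa", morphs))  # unseen word form, still covered
```

An n-gram LM estimated over such morph sequences can then assign probability to "autossa" even if that full word form never occurred in the training text.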
Today, speech interfaces are widely employed in mobile devices, so recognition speed and resource consumption are becoming new metrics of Automatic Speech Recognition (ASR) performance. For ASR systems using continuous Hidden Markov Models (HMMs), computing state likelihoods is one of the most time-consuming parts. In this paper, we propose novel multi-level Gaussian selection techniques to reduce the cost of state likelihood computation. These methods are based on original and efficient codebooks. The proposed algorithms are evaluated within the framework of a large vocabulary continuous speech recognition task.
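Gaussian selection in general works by shortlisting mixture components through a coarse codebook, so that only the shortlisted Gaussians get full likelihood evaluation while the rest receive a cheap back-off score. The sketch below shows the single-level idea with a hand-built codebook; the paper's multi-level codebooks are more elaborate.

```python
import numpy as np

def select_gaussians(frame, codebook, assignment, top_k=1):
    """Return indices of Gaussians worth evaluating for this frame.

    A VQ codebook partitions the Gaussian means; only Gaussians assigned
    to the top_k codewords closest to the frame are shortlisted.
    """
    d = np.linalg.norm(codebook - frame, axis=1)
    best = np.argsort(d)[:top_k]
    return [g for g, cw in enumerate(assignment) if cw in best]

# Four Gaussian means in two well-separated clusters (illustrative)
means = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
codebook = np.array([[0.1, 0.05], [5.05, 4.95]])  # one codeword per cluster
assignment = [0, 0, 1, 1]                         # Gaussian -> codeword
frame = np.array([0.05, 0.0])
shortlist = select_gaussians(frame, codebook, assignment)
print(shortlist)  # only the Gaussians near the frame are evaluated
```

Here half the Gaussians are skipped at the cost of one distance computation per codeword; with thousands of mixture components the savings dominate.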
This paper describes our recent effort in developing the GlobalPhone database for multilingual large vocabulary continuous speech recognition. In particular, we present the current status of the GlobalPhone corpus, containing high-quality speech data for nine languages: Arabic, Chinese, Croatian, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. We also discuss the JANUS-3 toolkit and how it can be applied on our way towards multilinguality using the GlobalPhone database.
This paper describes the SoVideo broadcast news retrieval system for Mandarin Chinese. The system is based on technologies such as large-vocabulary continuous speech recognition for Mandarin Chinese, automatic story segmentation, and information retrieval. Currently, the database consists of 177 hours of broadcast news, which yields 3264 stories by automatic story segmentation. We discuss the development of the retrieval system, and the evaluation of each component and the retrieval system.
This paper describes and discusses the 'STBU' speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium of four partners: Spescom DataVoice (South Africa), TNO (The Netherlands), BUT (Czech Republic), and the University of Stellenbosch (South Africa). The STBU system was a combination of three main kinds of subsystems: (1) GMM, with short-time MFCC or PLP features; (2) GMM-SVM, using GMM mean supervectors as input to an SVM; and (3) MLLR-SVM, using MLLR speaker adaptation coefficients derived from an English LVCSR system. All subsystems made use of supervector subspace channel compensation methods, either eigenchannel adaptation or nuisance attribute projection. We document the design and performance of all subsystems, as well as their fusion and calibration via logistic regression. Finally, we also present a cross-site fusion that was done with several additional systems from other NIST SRE-2006 participants.
This paper presents a new discriminative approach for training the Gaussian mixture models (GMMs) of a hidden Markov model (HMM) based acoustic model in a large vocabulary continuous speech recognition (LVCSR) system. The approach embeds a rival penalized competitive learning (RPCL) mechanism at the level of hidden Markov states. For every input, the correct identity state, called the winner, and...
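The RPCL mechanism itself, in its classic form, moves the winning unit toward the input while pushing the runner-up (the rival) away with a smaller de-learning rate. The sketch below applies it to plain prototype vectors rather than HMM states, purely as an illustration of the update rule.

```python
import numpy as np

def rpcl_step(protos, x, lr=0.1, delr=0.02):
    """One rival penalized competitive learning update.

    The closest prototype (winner) moves toward x; the second closest
    (rival) is pushed away with a smaller de-learning rate, which
    discourages rivals from crowding the winner's region.
    """
    d = np.linalg.norm(protos - x, axis=1)
    winner, rival = np.argsort(d)[:2]
    protos[winner] += lr * (x - protos[winner])
    protos[rival] -= delr * (x - protos[rival])
    return winner, rival

protos = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
x = np.array([0.9, 1.1])
winner, rival = rpcl_step(protos, x)
print(winner, rival)  # winner moves closer to x, rival is repelled
```

In the paper's setting the competing units would be hidden Markov states and the updates would act on their GMM parameters; the winner/rival dynamic is the same.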
The recognition of speech in meetings poses a number of challenges to current Automatic Speech Recognition (ASR) techniques. Meetings typically take place in rooms with non-ideal acoustic conditions and significant background noise, and may contain large sections of overlapping speech. In such circumstances, headset microphones have to date provided the best recognition performance; however, participants are often reluctant to wear them. Microphone arrays provide an alternative to close-talking microphones, offering speech enhancement through directional discrimination. Unfortunately, development of array front-end systems for state-of-the-art large vocabulary continuous speech recognition suffers from a lack of the necessary resources, as most available speech corpora consist only of single-channel recordings. This paper describes the collection of an audio-visual corpus of read speech from a number of instrumented meeting rooms. The corpus, based on the WSJCAM0 database, is...
This paper presents work done at Cambridge University for the TREC-9 Spoken Document Retrieval (SDR) track. The CU-HTK transcriptions from TREC-8, with a Word Error Rate (WER) of 20.5%, were used in conjunction with stopping, Porter stemming, Okapi-style weighting, and query expansion using a contemporaneous corpus of newswire. A windowing/recombination strategy was applied for the case where story boundaries were unknown (SU), obtaining a final result of 38.8% and 43.0% Average Precision for the TREC-9 short and terse queries respectively. The corresponding results for the story-boundaries-known runs (SK) were 46.4% and 49.2%. Document expansion was used in the SK runs and shown to also be beneficial for SU under certain circumstances. Non-lexical information was generated which, although not used within the evaluation, should prove useful to enrich the transcriptions in real-world applications. Finally, cross-recogniser experiments again showed there is little performance deg...
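The Okapi-style weighting mentioned above assigns each query term a weight combining inverse document frequency with length-normalized term frequency. Below is a minimal sketch of one common BM25 formulation; the exact variant and parameter values used in the CU-HTK system may differ, and the +1 inside the logarithm is a smoothing choice that keeps weights positive.

```python
import math

def okapi_bm25(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document.

    tf: term frequency in the document; df: number of documents
    containing the term; doc_len/avg_len: length normalization;
    n_docs: collection size; k1, b: the usual tuning constants.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm

# A rare term outweighs a common one at equal term frequency
rare = okapi_bm25(tf=3, df=5, doc_len=80, avg_len=100, n_docs=1000)
common = okapi_bm25(tf=3, df=800, doc_len=80, avg_len=100, n_docs=1000)
print(rare, common)
```

A document score is then the sum of these weights over the query terms, which is what stopping and stemming feed into in the pipeline described above.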
This paper describes the research underway for the ESPRIT WERNICKE project. The project brings together a number of different groups from Europe and the US and focuses on extending the state-of-the-art for hybrid hidden Markov model/connectionist approaches to large vocabulary, continuous speech recognition. This paper describes the specific goals of the research and presents the work performed to date. Results are reported for the resource management talker-independent recognition task. The paper concludes ...
The authors present a large vocabulary, continuous speech recognition system based on linked predictive neural networks (LPNNs). The system uses neural networks as predictors of speech frames, yielding distortion measures which can be used by the one-stage DTW algorithm to perform continuous speech recognition. The system currently achieves 95%, 58%, and 39% word accuracy on tasks with perplexity 7, 111,
Automatic language identification is an important problem in building multilingual speech recognition and understanding systems. Building a language identification module for four languages, we studied the influence of applying different levels of knowledge sources to a large vocabulary continuous speech recognition (LVCSR) approach, i.e. phonetic, phonotactic, lexical, and syntactic-semantic knowledge. The resulting language identification (LID) module can identify spontaneous speech input and can be used as a front-end for our multilingual speech-to-speech translation system JANUS-II. A comparison of five LID systems showed that incorporating lexical and linguistic knowledge reduces the language identification error for the 2-language tests by up to 50%. Based on these results, we built a LID module for German, English, Spanish, and Japanese which yields an 84% identification rate on the Spontaneous Scheduling Task (SST).
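The phonotactic knowledge level can be sketched as per-language phone n-gram models that score a decoded phone string; the language whose model assigns the highest probability wins. The phone strings and inventories below are made up for illustration and do not reflect the actual JANUS-II models.

```python
import math
from collections import Counter

def train_bigrams(phone_strings):
    """Count phone bigrams from training phone sequences."""
    counts = Counter()
    for s in phone_strings:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    return counts

def score(counts, s, vocab_size=30):
    """Add-one-smoothed log-probability of a phone string under one model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[(a, b)] + 1) / (total + vocab_size**2))
        for a, b in zip(s, s[1:])
    )

# Toy per-language training data (hypothetical phone sequences)
models = {
    "de": train_bigrams(["SH T AY N".split(), "SH P R EH".split()]),
    "en": train_bigrams(["TH IH NG K".split(), "DH AH".split()]),
}
test = "SH T".split()
best = max(models, key=lambda lang: score(models[lang], test))
print(best)
```

The paper's point is that adding lexical and syntactic-semantic knowledge on top of this phonotactic layer cuts the identification error substantially.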
The availability of real-time continuous speech recognition on mobile and embedded devices has opened up a wide range of research opportunities in human-computer interactive applications. Unfortunately, most of the work in this area to date has ...
In this paper we discuss two techniques for reducing the size of the acoustic model while maintaining or improving the accuracy of the recognition engine. The first technique, demiphone modeling, reduces the redundancy existing in a context-dependent, state-clustered Hidden Markov Model (HMM). Three-state demiphones, optimally designed from the triphone decision tree, are introduced to drastically reduce the phone space of the acoustic model and to improve system accuracy. The second redundancy elimination technique is a more classical approach based on parameter tying. Similar vectors of variances in each HMM cluster are tied together to reduce the number of parameters. The closeness between the variance vectors is measured using a Vector Quantizer (VQ) to preserve the information provided by the variance parameters. The paper also reports speech recognition improvements using assign...
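The variance-tying idea can be sketched as vector quantization of the variance vectors: cluster them, then let each original vector point at its codeword, shrinking the parameter count. Plain k-means stands in here for whatever VQ design the paper actually uses; the data is illustrative.

```python
import numpy as np

def tie_variances(variances, n_codewords=2, iters=10, seed=0):
    """Tie similar variance vectors to a small VQ codebook (plain k-means).

    Returns the codebook and, for each original vector, the index of the
    codeword that replaces it: n_codewords vectors are stored instead of
    len(variances).
    """
    rng = np.random.default_rng(seed)
    codebook = variances[rng.choice(len(variances), n_codewords, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(variances[:, None] - codebook[None, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(assign == k):
                codebook[k] = variances[assign == k].mean(axis=0)
    return codebook, assign

variances = np.array([[1.0, 1.1], [1.1, 1.0], [4.0, 3.9], [3.9, 4.1]])
codebook, assign = tie_variances(variances)
print(assign)  # the two small and the two large vectors share codewords
```

After tying, each Gaussian stores a codeword index instead of a full variance vector, which is where the model-size reduction comes from.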
A major component in the development of any speech recognition system is the decoder. As task complexities and, consequently, system complexities have continued to increase, the decoding problem has become an increasingly significant component of the overall speech recognition system development effort, with efficient decoder design significantly improving the trade-off between decoding time and search errors. In this paper we present the "Juicer" (from transducer) large vocabulary continuous speech recognition (LVCSR) decoder, based on weighted finite-state transducers (WFSTs). We begin with a discussion of the need for open-source, state-of-the-art decoding software in LVCSR research and how this led to the development of Juicer, followed by a brief overview of decoding techniques and major issues in decoder design. We present Juicer and its major features, emphasising its potential not only as a critical component in the development of LVCSR systems, but also as an important research tool in itself, being based around the flexible WFST paradigm. We also provide results of benchmarking tests carried out to date, demonstrating that in many respects Juicer, while still in its early development, is already achieving state-of-the-art performance. These benchmarking tests not only demonstrate the utility of Juicer in its present state but are also being used to guide future development; hence, we conclude with a brief discussion of some of the extensions currently under way or being considered for Juicer.
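At the core of a WFST decoder is a best-path search over a composed transducer under the tropical semiring (weights are negative log probabilities, accumulated by addition, with the minimum-weight path winning). The sketch below runs that search over a tiny hand-built lexicon-like transducer; labels and weights are illustrative and unrelated to Juicer's actual networks.

```python
import heapq

def shortest_path(arcs, start, final):
    """Tropical-semiring shortest path through a small WFST.

    arcs: {state: [(in_label, out_label, weight, next_state), ...]}.
    Returns (total_weight, output_labels) of the best path, or None.
    """
    heap = [(0.0, start, [])]
    seen = set()
    while heap:
        w, state, outs = heapq.heappop(heap)
        if state in seen:
            continue
        seen.add(state)
        if state == final:
            return w, outs
        for _in, out, aw, nxt in arcs.get(state, []):
            if nxt not in seen:
                heapq.heappush(heap, (w + aw, nxt, outs + [out]))
    return None

# Toy transducer: phone-like inputs, word outputs, weights = -log prob
arcs = {
    0: [("k", "", 0.1, 1), ("g", "", 0.9, 2)],
    1: [("at", "cat", 0.2, 3)],
    2: [("at", "gat", 0.1, 3)],
}
best = shortest_path(arcs, start=0, final=3)
print(best)  # the cheaper path through state 1 wins
```

A real decoder adds acoustic scores on the fly, epsilon handling, and beam pruning, but the semiring and search structure are the same.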
A study was conducted to evaluate user performance and satisfaction in the completion of a set of text creation tasks using three commercially available continuous speech recognition systems. The study also compared user performance on similar tasks using keyboard input. One part of the study (Initial Use) involved 24 users who enrolled, received training, carried out practice tasks, and then completed a set of transcription and composition tasks in a single session. In a parallel effort (Extended Use), four researchers used speech recognition to carry out real work tasks over 10 sessions with each of the three speech recognition software products. This paper presents results from the Initial Use phase of the study along with some preliminary results from the Extended Use phase. We present details of the kinds of usability and system design problems likely in current systems and several common patterns of error correction that we found. Keywords: speech recognition, input techniques, speech user interfaces, analysis methods.
Large vocabulary continuous speech recognition (LVCSR) systems traditionally represent words in terms of smaller subword units. Both during training and during recognition, they require a mapping table, called the dictionary, which maps words into sequences of these subword units. The performance of the LVCSR system depends critically on the definition of the subword units and the accuracy of the dictionary. In current LVCSR systems, both these components are manually designed. While manually designed ...
The development of speech processing technologies requires audio and text corpora. Although these resources have been researched for years for several languages, not enough research has been done for Brazilian Portuguese. This article describes the progress of an initiative for corpus creation and validation for Brazilian Portuguese, using Hidden Markov Model (HMM) based acoustic models and statistical language models for large vocabulary continuous speech recognition.
In this article we present a service for deaf people in the area of human-machine interaction. This service will allow deaf people to communicate properly with other people. The deaf person makes a video call using a 3G mobile device; what the other person says is recognized and translated into sign language by a...
The creation of pronunciation lexicons for speech recognition is widely acknowledged to be an important but labor-intensive aspect of system development. Lexicons are often manually created and make use of knowledge and expertise that is difficult to codify. In this paper we describe our American English lexicon, developed primarily for the ARPA WSJ/NAB tasks. The lexicon is phonemically represented and contains alternate pronunciations for about 10% of the words. Tools have been developed to add new lexical items, as well as to help ensure consistency of the pronunciations. Our experience in large vocabulary continuous speech recognition is that systematic lexical design can improve system performance. Some comparative results with commonly available lexicons are given.
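A small example of the kind of consistency tooling described: checking that every pronunciation uses only symbols from the declared phoneme set. The words, pronunciations, and phoneme symbols below are illustrative and not taken from the actual lexicon.

```python
def check_lexicon(lexicon, phoneme_set):
    """Flag lexicon entries that use symbols outside the phoneme set.

    lexicon: {word: [pronunciation, ...]} with space-separated phoneme
    symbols; returns (word, pronunciation, unknown_symbols) triples.
    """
    bad = []
    for word, prons in lexicon.items():
        for pron in prons:
            unknown = [p for p in pron.split() if p not in phoneme_set]
            if unknown:
                bad.append((word, pron, unknown))
    return bad

phoneme_set = {"t", "ah", "m", "ey", "aa", "ow"}
lexicon = {
    "tomato": ["t ah m ey t ow", "t ah m aa t ow"],  # alternate pronunciations
    "typo":   ["t ah m zz t ow"],                    # "zz" is not a phoneme
}
problems = check_lexicon(lexicon, phoneme_set)
print(problems)  # only the entry with the bad symbol is flagged
```

Real lexicon tooling also checks things like duplicate pronunciations and systematic alternations, but symbol validation of this kind is the first line of defense.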