Word error rate Research Papers

The estimation of initial language models for new applications of spoken dialogue systems without large task-specific training corpora is becoming an increasingly important issue. This paper investigates two different approaches in which the task-specific knowledge contained in the language understanding grammar is exploited in order to generate n-gram language models for the speech recognizer: the first uses class-based language models for which the word classes are automatically derived from the grammar. In the second approach, language models are estimated on artificial corpora which have been created from the understanding grammar. The application of fill-up techniques allows the combination of the strengths of both approaches and leads to a language model which shows optimal performance regardless of the amount of training data available. Perplexities and word error rates are reported for two different domains.
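
A concrete illustration of the second approach, as a minimal Python sketch: a toy understanding grammar is exhaustively expanded into an artificial corpus, and a smoothed bigram model is estimated from it. The grammar, the add-one smoothing, and all identifiers are hypothetical stand-ins, not the paper's actual system.

```python
import itertools
from collections import defaultdict

# Hypothetical toy understanding grammar: each non-terminal maps to a list
# of alternative right-hand sides; lowercase symbols are terminals.
GRAMMAR = {
    "S":    [["i", "want", "FOOD"], ["show", "me", "FOOD"]],
    "FOOD": [["a", "pizza"], ["the", "menu"]],
}

def expand(symbol):
    """Recursively expand a symbol into all word sequences it derives."""
    if symbol not in GRAMMAR:  # terminal
        return [[symbol]]
    sentences = []
    for rhs in GRAMMAR[symbol]:
        # Cartesian product of the expansions of the RHS symbols.
        for combo in itertools.product(*(expand(s) for s in rhs)):
            sentences.append([w for part in combo for w in part])
    return sentences

# Artificial corpus: every sentence the grammar generates.
corpus = expand("S")

# Bigram and unigram counts with sentence boundaries.
bigrams, unigrams = defaultdict(int), defaultdict(int)
for sent in corpus:
    words = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(words, words[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def bigram_prob(w1, w2, vocab_size):
    # Add-one smoothing stands in for the paper's fill-up scheme.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

vocab = {w for sent in corpus for w in sent} | {"</s>"}
print(bigram_prob("i", "want", len(vocab)))
```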

Despite 35 years of R&D on the problem of Optical Character Recognition (OCR), the technology is not yet mature enough for the Arabic font-written script compared with Latin-based ones. There is still ample room for enhancement: lowering the Word Error Rate (WER), remaining robust in the face of moderate noise, and working on an omni-font, open-vocabulary basis. Among the best attempts made in this regard so far are the HMM-based ones. Elaborating on this promising Automatic Speech Recognition (ASR)-inspired approach, our team has significantly refined the basic processes and modules deployed in such architectures (e.g. line and word decomposition, feature extraction, model parameter selection, language modelling, etc.) to develop what is hoped to be a truly reliable (i.e. low-WER, omni-font, open-vocabulary, noise-robust, and responsive) Arabic OCR suitable for real-life IT applications. This paper extensively reviews the HMM-based approach for building Arabic font-written OCRs in general, and our work in particular. It also reports the experimental results obtained so far, showing that our system outperforms its rivals reported in the published literature.

Individual optical character recognition (OCR) engines vary in the types of errors they commit in recognizing text, particularly poor quality text. By aligning the output of multiple OCR engines and taking advantage of the differences between them, the error rate based on the aligned lattice of recognized words is significantly lower than the individual OCR word error rates. This lattice error rate constitutes a lower bound among aligned alternatives from the OCR output. Results from a collection of poor quality mid-twentieth century typewritten documents demonstrate an average reduction of 55.0% in the error rate of the lattice of alternatives and a realized word error rate (WER) reduction of 35.8% in a dictionary-based selection process. As an important precursor, an innovative admissible heuristic for the A* algorithm is developed, which results in a significant reduction in state space exploration to identify all optimal alignments of the OCR text output, a necessary step toward the construction of the word hypothesis lattice. On average 0.0079% of the state space is explored to identify all optimal alignments of the documents.
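
A hedged sketch of the alignment idea: the snippet below word-aligns two hypothetical OCR outputs with plain Levenshtein dynamic programming (not the paper's A* multi-engine alignment) and collapses each aligned column into a set of alternatives, which is why the lattice error rate lower-bounds the individual WERs.

```python
def align(a, b):
    """Word-level Levenshtein alignment of two token lists.
    Returns columns of (token_a, token_b), with None marking a gap."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]  # DP table of edit costs
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    # Trace back to recover the aligned columns.
    cols, i, j = [], n, m
    while i or j:
        if i and j and d[i][j] == d[i-1][j-1] + (a[i-1] != b[j-1]):
            cols.append((a[i-1], b[j-1])); i, j = i - 1, j - 1
        elif i and d[i][j] == d[i-1][j] + 1:
            cols.append((a[i-1], None)); i -= 1
        else:
            cols.append((None, b[j-1])); j -= 1
    return cols[::-1]

hyp1 = "the quick brown fax".split()
hyp2 = "the quick brovvn fox".split()
# Each column becomes a set of alternatives; a reference word "hits" the
# lattice if ANY engine proposed it, hence the lower bound on the error rate.
lattice = [set(t for t in col if t) for col in align(hyp1, hyp2)]
print(lattice)
```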

As the use of Internet broadcasting (webcasting) increases, more webcasts will be archived and accessed numerous times retrospectively. One challenge in skimming and browsing through such archives is the lack of textual transcripts of the archived media's audio channels. Ideally, transcripts would be obtainable through Automatic Speech Recognition (ASR). However, current ASR systems can only deliver, in realistic conditions, Word Error Rates (WERs) of around 45%. This is unsatisfactory, as our recent study [1] revealed that transcripts are useful and usable in webcast archives only for WERs equal to or less than 25%. We therefore propose an extension to the ePresence webcast system that engages users to collaborate, in a wiki manner, on editing the imperfect transcripts obtained through ASR.

Many groups have investigated the relationship of word error rate and perplexity of language models. This issue is of central interest because perplexity optimization can be done independently of a recognizer, and in most cases it is possible to find simple perplexity optimization procedures. Moreover, many tasks in language model training, such as the optimization of word classes, may use perplexity as the target function, resulting in explicit optimization formulas that are not available if error rates are used as the target. This paper first presents some theoretical arguments for a close relationship between perplexity and word error rate. Thereafter the notion of uncertainty of a measurement is introduced and is then used to test the hypothesis that word error rate and perplexity are correlated by a power law. There is no evidence to reject this hypothesis.
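
The power-law hypothesis WER ≈ a·PP^b is linear in log-log space, so it can be checked with an ordinary least-squares fit. The sketch below uses illustrative numbers, not the paper's data.

```python
import numpy as np

# Hypothetical (perplexity, WER%) pairs from a series of language models;
# the values are illustrative, not the paper's measurements.
pp  = np.array([ 80.0, 120.0, 180.0, 260.0, 400.0])
wer = np.array([18.1,  20.9,  24.0,  27.2,  31.8])

# A power law WER = a * PP^b is linear in log-log space:
# log WER = log a + b * log PP, so ordinary least squares suffices.
b, log_a = np.polyfit(np.log(pp), np.log(wer), 1)
print(f"WER ~ {np.exp(log_a):.2f} * PP^{b:.3f}")
```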

We would also like to thank the members of the Program Committee for completing their reviews promptly, and for providing useful feedback for deciding on the program and preparing the final versions of the papers. Thanks also to Marie Candito, Bonnie Webber and Miles Osborne for assistance with logistics and to Brian Roark for his guidance and support. Finally, thanks to the authors of the papers, for submitting such interesting and diverse work, and to the presenters of demos and commercial exhibitions.

Speech is the most natural form of human communication, and speech processing has been one of the most inspiring expanses of signal processing. Speech recognition is the process of automatically recognizing the spoken words of a person based on information in the speech signal. An Automatic Speech Recognition (ASR) system takes a human speech utterance as input and returns a string of words as output. This paper presents a brief survey of Automatic Speech Recognition and discusses the major subjects and improvements made in the past 60 years of research, providing a technological outlook and an appreciation of the fundamental progress that has been accomplished in this important area of speech communication. The definition of various types of speech classes, feature extraction techniques, speech classifiers, and performance evaluation are issues that require attention in designing a speech recognition system. The objective of this review paper is to summarize some of the well-known methods use...

In this paper, we investigate pilot-symbol-aided parameter estimation for orthogonal frequency division multiplexing (OFDM) systems. We first derive a minimum mean-square error (MMSE) pilot-symbol-aided parameter estimator. Then, we discuss a robust implementation of the pilot-symbol-aided estimator that is insensitive to channel statistics. From the simulation results, the required signal-to-noise ratios (SNRs) for a 10% word error rate (WER) are 6.8 dB and 7.3 dB for the typical urban (TU) channels with 40 Hz and 200 Hz Doppler frequencies, respectively, and they are 8 dB and 8.3 dB for the hilly-terrain (HT) channels with 40 Hz and 200 Hz Doppler frequencies, respectively. Compared with the decision-directed parameter estimator, the pilot-symbol-aided estimator is highly robust to Doppler frequency for dispersive fading channels with noise impairment, even though it has some performance degradation for systems with lower Doppler frequencies.

The majority of state-of-the-art speech recognition systems make use of system combination. The combination approaches adopted have traditionally been tuned to minimise Word Error Rates (WERs). In recent years there has been growing interest in taking the output from speech recognition systems in one language and translating it into another. This paper investigates the use of cross-site combination approaches in terms of both WER and impact on translation performance. In addition, the stages involved in modifying the output from a Speech-to-Text (STT) system to be suitable for translation are described. Two source languages, Mandarin and Arabic, are recognised and then translated into English using a phrase-based statistical machine translation system. Performance figures for individual systems and for cross-site combination using cross-adaptation and ROVER are given. Results show that the best STT combination scheme in terms of WER is not necessarily the most appropriate when translating speech.
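
For intuition, ROVER-style combination reduces, after hypothesis alignment, to word-level voting. The Python sketch below skips ROVER's alignment stage and votes over hand-aligned columns from three hypothetical systems ("@" marks an alignment gap here).

```python
from collections import Counter

# Pre-aligned hypotheses from three hypothetical systems; the alignment
# step itself (the hard part of ROVER) is omitted in this sketch.
aligned = [
    ["the",   "the",   "the"],
    ["quick", "thick", "quick"],
    ["fox",   "fox",   "@"],
]

# Majority vote per slot; slots won by the gap symbol are dropped.
output = [w for col in aligned
          if (w := Counter(col).most_common(1)[0][0]) != "@"]
print(output)  # ['the', 'quick', 'fox']
```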

An MLP classifier outputs a posterior probability for each class. With noisy data, classification becomes less certain, and the entropy of the posterior distribution tends to increase, providing a measure of classification confidence. However, at high noise levels, ...

Building an automatic speech recognition (ASR) system from scratch requires a large amount of annotated speech data, which is difficult to collect in many languages. However, there are cases where the low-resource language shares a common acoustic space with a high-resource language having enough annotated data to build an ASR. In such cases, we show that the domain-independent acoustic models learned from the high-resource language through unsupervised domain adaptation (UDA) schemes can enhance the performance of the ASR in the low-resource language. We use the specific example of Hindi in the source domain and Sanskrit in the target domain. We explore two architectures: i) domain adversarial training using gradient reversal layer (GRL) and ii) domain separation networks (DSN). The GRL and DSN architectures give absolute improvements of 6.71% and 7.32%, respectively, in word error rate over the baseline deep neural network model when trained on just 5.5 hours of data in the target domain. We also show that choosing a proper language (Telugu) in the source domain can bring further improvement. The results suggest that UDA schemes can be helpful in the development of ASR systems for low-resource languages, mitigating the hassle of collecting large amounts of annotated speech data.
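
A minimal sketch of the gradient reversal layer (GRL) at the heart of domain-adversarial training, assuming PyTorch: the layer is the identity in the forward pass and negates (and scales) gradients in the backward pass, so shared features are trained to fool the domain classifier. Layer sizes, heads, and names are hypothetical, not the paper's architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in
    the backward pass, so the feature extractor learns domain-invariant
    representations (domain-adversarial training)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # no gradient w.r.t. lambda

class DomainAdversarialModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_senones=2000, n_domains=2):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.senone_head = nn.Linear(hidden, n_senones)  # ASR task head
        self.domain_head = nn.Linear(hidden, n_domains)  # source vs. target

    def forward(self, x, lam=1.0):
        h = self.features(x)
        return self.senone_head(h), self.domain_head(GradReverse.apply(h, lam))

model = DomainAdversarialModel()
senone_logits, domain_logits = model(torch.randn(8, 40))
print(senone_logits.shape, domain_logits.shape)
```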

In addition to ordinary words and names, real text contains non-standard “words” (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary “letter-to-sound” rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text) with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We devel...

Automatic segmentation of these audio streams according to speaker identities, environmental and channel conditions has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining [7], [8]. In this paper, ...

We propose grapheme-based sub-word units for spoken term detection (STD). Compared to phones, graphemes have a number of potential advantages. For out-of-vocabulary search terms, phone-based approaches must generate a pronunciation using letter-to-sound rules. Using graphemes obviates this potentially error-prone hard decision, shifting pronunciation modelling into the statistical models describing the observation space. In addition, long-span grapheme language models can be trained directly from large text corpora. We present experiments on Spanish and English data, comparing phone and grapheme-based STD. For Spanish, where phone and grapheme-based systems give similar transcription word error rates (WERs), grapheme-based STD significantly outperforms a phone-based approach. The converse is found for English, where the phone-based system outperforms a grapheme approach. However, we present additional analysis which suggests that phone-based STD performance levels may be achieved by a grapheme-based approach despite lower transcription accuracy, and that the two approaches may usefully be combined. We propose a number of directions for future development of these ideas, and suggest that if grapheme-based STD can match phone-based performance, the inherent flexibility in dealing with out-of-vocabulary terms makes this a desirable approach.

Incorporating the concept of the syllable into speech recognition may improve recognition accuracy through the integration of information over syllable-length time spans. Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit. Nonetheless, the explicit use of such long-timespan units is comparatively unusual in automatic speech recognition systems for English. The work described in this thesis explored the utility of information collected over syllable-related timescales. The first approach involved integrating syllable segmentation information into the speech recognition process. The addition of acoustically-based syllable onset estimates [184] resulted in a 10% relative reduction in word-error rate. The second approach began with developing four speech recognition systems based on long-time-span features and units, including modulation spectrogram features [80]. Error analysis suggested the strategy of combining, which led to the implementation of methods that merged the outputs of syllable-based recognition systems with the phone-oriented baseline system at the frame level, the syllable level and the whole-utterance level. These combined systems exhibited relative improvements of 20-40% compared to the baseline system for clean and reverberant speech test cases.

This paper presents our work towards developing a new speech corpus for Modern Standard Arabic (MSA), which can be used for implementing and evaluating Arabic speaker-independent, large vocabulary, automatic, and continuous speech recognition systems. The speech corpus was recorded by 40 (20 male and 20 female) Arabic native speakers from 11 countries representing three major regions (Levant, Gulf, and Africa). Three development phases were conducted based on the size of training data, Gaussian mixture distributions, ...

In this paper we present techniques for building multi-domain and multi-lingual recognizers within a finite-state transducer (FST) framework. The flexibility of the FST approach is also demonstrated on the task of incorporating networks modeling different types of non-speech events into an existing word lattice network. The ability to create robust multi-domain and/or multi-lingual recognizers for spontaneous speech will enable a conversational system to switch seamlessly and automatically among different domains and/or languages. Preliminary results using a bi-domain recognizer exhibit only small recognition accuracy degradation in comparison to domain-dependent recognition. Similarly promising results were observed using a bilingual recognizer which performs simultaneous language identification and recognition. When using the FST techniques to add non-speech models to the recognizer, experiments show a 10% reduction in word error rate across all utterances and a 30% reduction on u...

In this paper we present a quantitative investigation into the impact of text normalization on lexica and language models for speech recognition in French. The text normalization process defines what is considered to be a word by the recognition system. Depending on this definition we can measure different lexical coverages and language model perplexities, both of which are closely related to the speech recognition accuracies obtained on read newspaper texts. Different text normalizations of up to 185M words of newspaper texts are presented along with corresponding lexical coverage and perplexity measures. Some normalizations were found to be necessary to achieve good lexical coverage, while others were more or less equivalent in this regard. The choice of normalization to create language models for use in the recognition experiments with read newspaper texts was based on these findings. Our best system configuration obtained an 11.2% word error rate in the AUPELF 'French-speaking' speech recognizer evaluation test held in February 1997.

In this paper, we present a novel approach for morphological decomposition in large vocabulary Arabic speech recognition. It achieved a low out-of-vocabulary (OOV) rate as well as high recognition accuracy in a state-of-the-art Arabic broadcast news transcription system. In this approach, compound words are decomposed into stems and affixes in both the language training and the acoustic training data. The decomposed words in the recognition output are re-joined before scoring. Four algorithms are evaluated and compared in this work. The best system achieved a 1.9% absolute reduction (9.8% relative) in word error rate (WER) when compared to the 64K-word baseline. The recognition performance of this system is also comparable to a 300K-word recognition system trained on the normal words. Moreover, the decomposed system is much faster and needs less memory than the systems with vocabularies larger than 64K.
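
A hedged sketch of the decompose/re-join cycle: the hypothetical prefix and suffix lists below (in Buckwalter-style transliteration, not the paper's four algorithms) split words into marked affix and stem tokens for training, and the "+" markers let the recognizer output be glued back together before scoring.

```python
# Hypothetical affix inventory; the paper's actual affix lists and
# decomposition algorithms are not reproduced here.
PREFIXES = ["wa", "al", "bi"]
SUFFIXES = ["ha", "hm", "at"]

def decompose(word, min_stem=3):
    """Split a compound word into prefix+ / stem / +suffix tokens.
    The '+' markers record where re-joining must happen."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            return [p + "+"] + decompose(word[len(p):], min_stem)
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return decompose(word[:-len(s)], min_stem) + ["+" + s]
    return [word]

def rejoin(tokens):
    """Undo the decomposition on the recognizer output before scoring."""
    out = []
    for t in tokens:
        if t.startswith("+") and out:
            out[-1] += t[1:]
        elif out and out[-1].endswith("+"):
            out[-1] = out[-1][:-1] + t
        else:
            out.append(t)
    return out

print(decompose("wakitabha"))           # ['wa+', 'kitab', '+ha']
print(rejoin(['wa+', 'kitab', '+ha']))  # ['wakitabha']
```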

This paper describes first results of our DARPA-sponsored efforts toward recognizing and browsing foreign-language, more specifically Serbo-Croatian, broadcast news. For Serbo-Croatian, as for many languages other than the most common well-studied ones, the problems of broadcast-quality recognition are complicated by 1) the lack of available acoustic and language data, and 2) the excessive vocabulary growth in heavily inflected languages that leads to unacceptable OOV rates. We present a Serbo-Croatian large-vocabulary system that achieves a 74% recognition rate, despite limited training data. Our system achieves this rate by a multipass strategy that dynamically adapts the recognition dictionary to the speech segment to be recognized by generating morphological variations (Hypothesis Driven Lexical Adaptation). We outline the bootstrapping and training process of the Janus Recognition Toolkit (JanusRTk) based broadcast news recognition engine: data collection, segmentation and labeling of the data according to different acoustic conditions, dictionary design, language modeling and training. The Hypothesis Driven Lexical Adaptation (HDLA) approach has been tested on both Serbo-Croatian and German news data and has achieved considerable recognition improvements. OOV rates were reduced by 35-45%; on the Serbo-Croatian broadcast news data from 8.7% to 4.8%, thereby also decreasing word error rate from 29.5% to 26%.

Debates in the European Parliament are simultaneously translated into the official languages of the Union. These interpretations are broadcast live via satellite on separate audio channels. After several months, the parliamentary proceedings are published as final text editions (FTE). FTEs are formatted for an easy readability and can differ significantly from the original speeches and the live broadcast interpretations. We examine the impact on German word error rate (WER) when introducing supervision based on German FTEs and supervision based on German automatic translations extracted from the English and Spanish audio. We show that FTE based supervision and additional interpretation based supervision provide significant reductions in WER. We successfully apply FTE supervised acoustic model (AM) training using 143h of recordings. Combining the new AM with the mentioned supervision techniques, we achieve a significant WER reduction of 13.3% relative.

Several real-world applications of humanoids in general will require continuous service over a long time period. A humanoid robot operating in different environments over a long period of time means that A) there will be a lot of variation in the speech it has to ground semantically, and B) it has to know when a conversation is of interest in order to respond. Detailed natural speech understanding is hard in real scenarios with arbitrary domains. To prepare the ground for in-domain dialogs in real, day-to-day open-domain scenarios, we focus on an intermediate attention level based on conversation concept listening and learning. With the aid of explicit semantic analysis, new concepts from open-domain conversational speech are learned, together with how to react to them according to human needs. This can entail how the robot performs actions such as positioning and privacy filtering. The corresponding attention model is investigated in terms of concept error rate and word error rate using speech recordings of household conversations.

This paper describes a machine translation system that offers many deaf and hearing-impaired people the chance to access published information in Arabic by translating text into their first language, Arabic Sign Language (ArSL). The system was created under the close guidance of a team that included three deaf native signers and one ArSL interpreter. We discuss problems inherent in the design and development of such translation systems and review previous ArSL machine translation systems, which all too often demonstrate a lack of collaboration between engineers and the deaf community. We describe and explain in detail both the adapted translation approach chosen for the proposed system and the ArSL corpus that we collected for this purpose. The corpus has 203 signed sentences (with 710 distinct signs) with content restricted to the domain of instructional language as typically used in deaf education. Evaluation shows that the system produces translated sign sentences outputs ...

In this paper we investigate the integration of a confusion network into an on-line handwritten sentence recognition system. The word posterior probabilities from the confusion network are used as confidence scores to detect potential errors in the output sentence from the Maximum A Posteriori decoding on a word graph. Dedicated classifiers (here, SVMs) are then trained to correct these errors and combine the word posterior probabilities with other sources of knowledge. A rejection phase is also introduced in the detection process. Experiments on handwritten sentences show a 28.5% relative reduction of the word error rate.
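
For intuition, the detection step can be as simple as thresholding the word posteriors in the confusion network. The values and threshold below are illustrative; the paper feeds these scores, together with other knowledge sources, into SVM classifiers rather than applying a bare threshold.

```python
# A confusion network is a sequence of "slots", each holding a list of
# (word, posterior) alternatives; these values are illustrative only.
confusion_net = [
    [("the", 0.98), ("a", 0.02)],
    [("quick", 0.55), ("thick", 0.45)],
    [("fox", 0.99)],
]

THRESHOLD = 0.90  # hypothetical rejection threshold, tuned on held-out data

for i, slot in enumerate(confusion_net):
    best_word, posterior = max(slot, key=lambda wp: wp[1])
    if posterior < THRESHOLD:
        # Low-confidence word: hand it to a dedicated classifier together
        # with other knowledge sources, or reject it outright.
        print(f"slot {i}: '{best_word}' flagged (p={posterior:.2f})")
```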

The paper investigates the integration of Heteroscedastic Linear Discriminant Analysis (HLDA) into adaptively trained speech recognizers. Two different approaches are compared: the first is a variant of CMLLR-SAT, the second is based on our previously introduced method, Constrained Maximum-Likelihood Speaker Normalization (CMLSN). For the latter, both the HLDA projection and the speaker-specific transformations for normalization are estimated w.r.t. a set of simple target models. It is investigated whether additional robustness can be achieved by estimating HLDA on normalized data. Experimental results are provided for a broadcast news task and a collection of parliamentary speeches. We show that the proposed methods lead to relative reductions in word error rate (WER) of 8% over an adapted baseline system that already includes an HLDA transform. The best performance for both tasks is achieved by the algorithm that is based on CMLSN. When compared to the combination of HLDA and CMLLR-SAT, this method leads to a considerable reduction in computational effort and to a significantly lower WER.

This paper presents a set of experiments used to develop a statistical system for translating speech to sign language for deaf people. This system is composed of an Automatic Speech Recognition (ASR) system, followed by a statistical translation module and an animated agent that represents the different signs. Two different approaches have been used to perform the translations: a phrase-based system and a finite state transducer. For the evaluation, the following figures of merit have been considered: WER (Word Error Rate), BLEU and NIST. The paper presents translation results for reference sentences and for sentences from the Automatic Speech Recognizer. Three different configurations have also been evaluated for the Speech Recognizer. The best results were obtained with the finite state transducer, with a word error rate of 28.21% for the reference text, and 29.27% using the ASR output.

Sequence recognition performance is often summarised first in terms of the number of hits (H), substitutions (S), deletions (D) and insertions (I), and then as a single statistic by the "word error rate" WER = 100(S+D+I)/(H+S+D). While in common use, WER has two disadvantages as a performance measure. One is that it has no upper bound, so it doesn't tell you how good a system is, only that one is better than another. The other is that it is not D/I symmetric, although deletions and insertions are equally disadvantageous. At low error rates these limitations can be ignored. However, for the high error rates which can occur during tests for speech recognition in noise, the WER measure starts to misbehave, giving far more weight to insertions than to deletions and regularly "exceeding 100%". Here we derive an alternative summary statistic for sequence recognition accuracy: WIP = H^2/((H+S+D)(H+S+I)). The WIP (word information preserved) measure results from an approximation to the proportion of the information about the true sequence which is preserved in the recognised sequence. It has comparable simplicity to WER but neither of its disadvantages.
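
Both statistics are straightforward to compute from the same four counts. The sketch below reproduces the two formulas and demonstrates the D/I asymmetry of WER against the symmetry of WIP; the counts are illustrative.

```python
def wer(h, s, d, i):
    """Word error rate: unbounded above and D/I asymmetric."""
    return 100.0 * (s + d + i) / (h + s + d)

def wip(h, s, d, i):
    """Word information preserved: bounded in [0, 1] and D/I symmetric."""
    return h * h / ((h + s + d) * (h + s + i))

# A high-error-rate case: 10 hits, 5 substitutions, 2 deletions, 9 insertions.
print(wer(10, 5, 2, 9))  # 94.1  -- insertions inflate WER without bound
print(wer(10, 5, 9, 2))  # 66.7  -- swapping D and I changes the score
print(wip(10, 5, 2, 9))  # 0.245 -- WIP is the same either way
print(wip(10, 5, 9, 2))  # 0.245
```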

Speech comprises a variety of acoustical phenomena occurring at differing rates. Fixed-rate ASR systems in effect assume a constant temporal rate of information flow by incorporating uniform statistics in proportion to a sound's duration. The usual tradeoff window length of 25-30 milliseconds represents a time-frequency resolution compromise, which aims to allow reasonable speed for following changes in the spectral trajectories

Face-to-face meetings usually encompass several modalities including speech, gesture, handwriting, and person identification. Recognition and integration of each of these modalities is important to create an accurate record of a meeting. However, each of these modalities presents recognition difficulties. Speech recognition must be speaker and domain independent, have low word error rates, and be close to real time to be useful. Gesture and handwriting recognition must be writer independent and support a wide variety of writing styles. Person identification has difficulty with segmentation in a crowded room. Furthermore, in order to produce the record automatically, we have to solve the assignment problem (who is saying what), which involves people identification and speech recognition. We follow a multimodal approach for people identification to increase the robustness (with the modules: color appearance id, face id and speaker id). This paper will examine a meeting room system und...

In this paper we report on new developments in the automatic meeting transcription task. Unlike other types of speech (such as those found in Broadcast News and Switchboard), meetings are unique in their richer dynamics of human-to-human interaction. An intuitive "thumbnail" plot is proposed to visualize such turn-taking behavior. We will also show how recognition of short turns can be improved by building a language model tailored specifically for short turns. Out-Of-Vocabulary (OOV) words become a more salient problem in the meeting transcription task, as they are mostly topic words and proper names, the lack of which not only causes a Word Error Rate (WER) increase, but also limits further use of recognition hypotheses. We describe a prototype system which uses the Web as a source for vocabulary expansion, and present preliminary OOV retrieval results.

In this study, we propose an algorithm for Arabic isolated digit recognition. The algorithm is based on extracting acoustical features from the speech signal and using them as input to multi-layer perceptron neural networks. Each word in the vocabulary of digits (0 to 9) is associated with a network. The networks are implemented as predictors for the speech samples over a certain duration of time. The back-propagation algorithm is used to train the networks. A hidden Markov model (HMM) is implemented to extract temporal features (states) for the speech signal. The input vector to the networks consists of twelve mel-frequency cepstral coefficients, the log of the energy, and five elements representing the state. Our results show that we are able to reduce the word error rate compared with an HMM word recognition system.
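
The 18-dimensional input vector described above is easy to assemble. The sketch below assumes a one-hot encoding for the "five elements representing the state", which the abstract does not spell out; all names are hypothetical.

```python
import numpy as np

def make_input(mfcc_12, log_energy, state_index, n_states=5):
    """Assemble the network input described in the abstract:
    12 MFCCs + log energy + a 5-element state encoding (assumed one-hot)."""
    state = np.zeros(n_states)
    state[state_index] = 1.0
    return np.concatenate([mfcc_12, [log_energy], state])

x = make_input(np.zeros(12), -3.2, state_index=2)
print(x.shape)  # (18,)
```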

A method to automatically annotate video items with semantic metadata is presented. The method has been developed in the context of the Papyrus project to annotate documentary-like broadcast videos with a set of relevant keywords using automatic speech recognition (ASR) ...

In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real time, as measured on a Pentium 4 Xeon 3.4 GHz processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites.

This article describes a method that combines grapheme and phonetic hypotheses at the sentence level, using a finite-state automaton representation and a language model, to rewrite sentences typed on a keyboard by people with dysorthographia (a spelling disorder). The peculiarity of dysorthographic writing that prevents spell checkers from being effective at this task is word segmentation that is sometimes incorrect. Rewriting differs from correction in that the rewritten sentences are intended not for the user but for an automatic system, such as a search engine. The evaluation is therefore conducted on filtered and lemmatized versions of the sentences. The average word error rate drops from 51% to 20% with our method, and is 0% on 43% of the tested sentences.

In this paper, we present our research on dialog-dependent language modeling. In accordance with a speech (or sentence) production model in a discourse, we split language modeling into two components, namely dialog-dependent concept modeling and syntactic modeling. ...

In this paper, a task of human-machine interaction based on speech is presented. The specific task consists of the use and control of a set of home appliances through a turn-based dialogue system. This work focuses on the first part of the dialogue system, the Automatic Speech Recognition (ASR) system. Two lines of work are taken into account to improve the performance of the ASR system. On one hand, the acoustic modeling required for the ASR is improved via Speaker Adaptation techniques. On the other hand, the Language Modeling in the system is improved by the use of class-based Language Models. The results show the good performance of both techniques in improving the ASR results, as the Word Error Rate (WER) drops from 5.81% to 0.99% using a close-talk microphone and from 14.53% to 1.52% using a lapel microphone. An important reduction is also achieved in terms of the Category Error Rate (CER), which measures the ability of the ASR system to extract the semantic information of the uttered sentence, dropping from 6.13% and 15.32% to 1.29% and 1.32% for the two microphones used in the experiments.

In this paper, a novel speaker normalization method is presented and compared to a well known vocal tract length normalization method. With this method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between the acoustic data of the speaker and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the acoustic observations of the speaker into the normalized acoustic space. Recognition experiments made use of two corpora, the first one consisting of adults' speech, the second one consisting of children's speech. Performing training and recognition with normalized data resulted in a consistent reduction of the word error rate with...

In this paper we present a number of improvements that were recently made to the template based speech recognition system developed at ESAT. Combining these improvements resulted in a decrease in word error rate from 9.6% to 8.2% on the Nov92, 20k trigram, Wall Street Journal task. The improvements are along different lines. Apart from the time warping already applied within the DTW, it was found beneficial to apply additional length compensation on the template score. The single best score was replaced by a weighted k-NN average, while maintaining natural successor information as an ensemble cost. The local geometry of the acoustic space is now taken into account by assigning a diagonal covariance matrix to each input frame. Context sensitivity of short templates is increased by taking cross boundary scores into account for sorting the N best templates. Furthermore, boundaries on the template segmentations may be relaxed. Finally, context dependent word templates are now being used for short words. Several other variants that were not retained in the final system are discussed as well.
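
For orientation, here is a minimal DTW scorer with a per-frame diagonal weighting (echoing the per-frame covariance idea above) and a simple length compensation. This is an illustrative sketch under those assumptions, not the ESAT system.

```python
import numpy as np

def dtw(template, query, inv_var):
    """Dynamic time warping between two feature sequences. inv_var holds a
    per-query-frame inverse diagonal covariance, so the local distance is
    weighted by the local geometry of the acoustic space."""
    n, m = len(template), len(query)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = template[i - 1] - query[j - 1]
            cost = np.sum(diff * diff * inv_var[j - 1])  # Mahalanobis-like
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Crude length compensation: normalize by the number of query frames.
    return D[n, m] / m

rng = np.random.default_rng(0)
template = rng.normal(size=(12, 13))  # e.g. 12 frames of 13 MFCCs
query    = rng.normal(size=(10, 13))
inv_var  = np.ones_like(query)        # unit variances for the sketch
print(dtw(template, query, inv_var))
```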

The paper deals with the development of acoustic models of foreign words for a German speech recognizer. The recognition quality of foreign words is crucial for the overall performance of a system in application fields like spoken dialogue systems, when foreign words occur as proper names. One of the main problems in the modeling of foreign words is the limitation of training data, which must contain samples of the non-native pronunciation of the foreign sounds. In order to obtain robust acoustic models, which are still precise enough, we compare several methods to map or to merge the models of phonemes which are pronounced in a similar way by German speakers. We utilize an entropy-based distance measure between sets of phoneme models. The best approach yields a 16.5% reduction in word error rate when compared to a baseline system.
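
An entropy-based distance can be realized, for example, as a symmetrized KL divergence between diagonal-covariance Gaussian phoneme models. The sketch below is such a stand-in, not necessarily the paper's exact measure, and the parameter values are illustrative.

```python
import numpy as np

def gauss_kl(mu0, var0, mu1, var1):
    """KL divergence KL(N(mu0,var0) || N(mu1,var1)) for diagonal Gaussians;
    an entropy-based stand-in for the paper's model distance."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def sym_distance(a, b):
    """Symmetrized divergence used to decide which phoneme models are
    similar enough to be mapped onto or merged with each other."""
    return gauss_kl(*a, *b) + gauss_kl(*b, *a)

# Two hypothetical single-Gaussian phoneme models (mean, variance).
a = (np.array([0.0, 1.0]), np.array([1.0, 0.5]))
b = (np.array([0.2, 0.8]), np.array([1.2, 0.4]))
print(sym_distance(a, b))
```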

In this paper, pronunciation variability between native and non-native speakers is investigated, and a novel acoustic model adaptation method is proposed based on pronunciation variability analysis in order to improve the performance of a speech recognition system by ...