Use of simulated data for robust telephone speech recognition
Noise-robust speech recognition of conversational telephone speech
Interspeech 2006, 2006
Over the past several years, speech recognition over telephone and IP networks has been a primary focus of investigation, and IP telephony has recently come into widespread use. This paper describes the performance of a speech recognizer on noisy speech transmitted over an H.323 IP telephony network, where the minimum mean-square error log-spectral amplitude (MMSE-LSA) method [1,2] is used to reduce the mismatch between training and deployment conditions in order to achieve robust speech recognition. In the H.323 network environment, the sources of distortion to the speech are packet loss and additive noise. In this work, we first evaluate the impact of packet loss on speech recognition performance, and then explore the effects of uncorrelated additive noise. To examine how additive acoustic noise affects recognition performance, seven types of noise source are selected for the experiments. The experimental results indicate that MMSE-LSA enhancement noticeably increases robustness for some types of additive noise under certain packet loss rates over the H.323 telephone network.
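As background, the MMSE-LSA method referenced in the abstract applies a per-frequency-bin spectral gain derived by Ephraim and Malah. A minimal sketch of that gain function is below; the series cutoff and the SNR inputs are illustrative assumptions, and a full enhancer would add noise-power tracking and decision-directed a priori SNR estimation around it.

```python
import math

EULER_GAMMA = 0.5772156649015329

def _exp1(v):
    """Exponential integral E1(v) for v > 0 via its power series.
    For v > 10, E1(v) < 5e-6 and is treated as zero (an approximation)."""
    if v > 10.0:
        return 0.0
    total, term = 0.0, 1.0
    for k in range(1, 60):
        term *= -v / k
        total -= term / k
    return -EULER_GAMMA - math.log(v) + total

def lsa_gain(xi, gamma):
    """Ephraim-Malah MMSE log-spectral amplitude gain for one bin.
    xi: a priori SNR, gamma: a posteriori SNR."""
    v = xi * gamma / (1.0 + xi)
    return (xi / (1.0 + xi)) * math.exp(0.5 * _exp1(v))
```

The gain approaches the Wiener gain xi/(1+xi) at high SNR and suppresses low-SNR bins more aggressively.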
Conversational telephone speech recognition
2003
This paper describes the development of a speech recognition system for the processing of telephone conversations, starting with a state-of-the-art broadcast news transcription system. We identify major changes and improvements in acoustic and language modeling, as well as decoding, which are required to achieve state-of-the-art performance on conversational speech. Major changes on the acoustic side include the use of speaker normalization (VTLN), the need to cope with channel variability, and the need for efficient speaker adaptation and better pronunciation modeling. On the linguistic side the primary challenge is to cope with the limited amount of language model training data. To address this issue we make use of a data selection technique, and a smoothing technique based on a neural network language model. At the decoding level, lattice rescoring and minimum word error decoding are applied. On the development data, the improvements yield an overall word error rate of 24.9%, whereas the original BN transcription system had a word error rate of about 50% on the same data.
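Speaker normalization via VTLN, mentioned in the abstract above, is often realized as a piecewise-linear warping of the frequency axis before filterbank analysis. The sketch below uses a common formulation; the default Nyquist and knee frequencies are illustrative assumptions, not values from the paper.

```python
def vtln_warp(f, alpha, f_nyq=8000.0, f_knee=7000.0):
    """Piecewise-linear VTLN frequency warping (one common formulation):
    scale frequencies by alpha below the knee, then interpolate linearly
    so that the Nyquist frequency maps back onto itself."""
    if f <= f_knee:
        return alpha * f
    slope = (f_nyq - alpha * f_knee) / (f_nyq - f_knee)
    return alpha * f_knee + slope * (f - f_knee)
```

In practice alpha is chosen per speaker (typically from roughly 0.88 to 1.12) by maximizing the likelihood of the warped features under the acoustic model.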
ROBUST RECOGNITION OF SMALL VOCABULARY TELEPHONE - QUALITY SPEECH
2003
Considerable progress has been made in the field of automatic speech recognition in recent years, especially for high-quality (full bandwidth and noise-free) speech. However, good recognition accuracy is difficult to achieve when the incoming speech is passed through a telephone channel. At the same time, the task of speech recognition over telephone lines is growing in importance, as the number of applications of spoken language processing involving telephone speech increases every day. The paper presents our recent work on developing a robust speaker-independent isolated-word recognition system based on a hybrid approach (classical techniques combined with an artificial neural network). A number of experiments are described and compared in order to evaluate the analysis and recognition techniques best suited to a telephone-speech recognition task. In particular, we address the use of RASTA processing (i.e., filtering the temporal trajectories of speech parameters) for increasing recognition accuracy. We also propose a method based on adaptive filter theory for producing simulated telephone data starting from clean speech databases.
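RASTA processing, as referenced above, filters each parameter trajectory (e.g. one log-spectral band over frames) with a fixed band-pass IIR filter, suppressing the slowly varying channel component and very fast artifacts. A sketch of the classic transfer function H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1), applied causally with zero initial state:

```python
def rasta_filter(traj):
    """Apply the classic RASTA band-pass filter to one parameter
    trajectory, returning the filtered trajectory."""
    b = [0.2, 0.1, 0.0, -0.1, -0.2]   # 0.1 * (2 + z^-1 - z^-3 - 2 z^-4)
    x = [0.0] * 4 + list(traj)        # zero initial FIR state
    y_prev, out = 0.0, []
    for n in range(4, len(x)):
        # pole at z = 0.98 gives the slow low-frequency roll-off
        y = 0.98 * y_prev + sum(b[k] * x[n - k] for k in range(5))
        out.append(y)
        y_prev = y
    return out
```

Because the numerator coefficients sum to zero, a constant (purely convolutional channel) offset is driven to zero, which is exactly the channel-robustness property exploited for telephone speech.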
IEEE Transactions on Speech and Audio Processing, 1994
We describe an approach for the estimation of acoustic phonetic models that will be used in a hidden Markov model (HMM) recognizer operating over the telephone. We explore two complementary techniques for developing telephone acoustic models. The first technique presents two new channel compensation algorithms. Experimental results on the Wall Street Journal corpus show no significant improvement over sentence-based cepstral-mean removal. The second technique uses an existing "high-quality" speech corpus to train acoustic models that are appropriate for the Switchboard Credit Card task over long-distance telephone lines. Experimental results show that cross-database acoustic training yields performance similar to that of conventional task-dependent acoustic training.
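The sentence-based cepstral-mean removal used as the baseline above subtracts the per-utterance mean from every cepstral frame, which cancels a stationary convolutional channel (it appears as an additive constant in the cepstral domain). A minimal sketch:

```python
def cepstral_mean_removal(cepstra):
    """Sentence-based cepstral mean removal: subtract the per-utterance
    mean vector from every cepstral frame (a simple channel compensation).
    cepstra: list of frames, each a list of cepstral coefficients."""
    n, dim = len(cepstra), len(cepstra[0])
    mean = [sum(frame[d] for frame in cepstra) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]
```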
Bell Labs connected digit databases for telephone speech recognition
International Conference on Natural Language Processing and Knowledge Engineering, Proceedings, 2003
This paper describes the Bell Labs Connected Digits databases (BLCD), which were collected over landline telephone networks. The BLCD databases were designed to provide a standard benchmark for evaluating the performance of different connected digit recognition systems. They are also a vehicle for research and diagnosis of specific problems in automatic connected digit recognition. We first describe the content and organization of the BLCD databases, and then present an automatic database verification procedure utilizing automatic speech recognition (ASR). For reference, we present speech recognition performance on a set of the databases using the Bell Labs ASR system. For the databases with good recording conditions, the word error rates can be less than 1%. In order to promote speech science and technology for real-world applications, we make this database available to the speech community.
Investigations on Offline Artificial Bandwidth Extension of Telephone Speech Databases
Automatic speech recognition (ASR) systems must be trained on large speech databases. For telephony tasks, speech databases almost exclusively exist with a narrow acoustic bandwidth. In the near future, more and more wideband (WB) speech codecs – such as the adaptive multi-rate wideband (AMR-WB) codec – will come into use, leading to a demand for WB telephony speech databases. These can be used to train WB acoustic models, such as hidden Markov models (HMMs). ASR systems may benefit from WB acoustic models in that more demanding tasks with larger vocabularies or continuous speech input could be performed. Recording WB telephony speech databases, however, would entail a high cost in time and money. Furthermore, there are only a few WB-capable mobile terminals on the market yet, and so far the appropriate network infrastructure is mostly available only for testing purposes. This paper presents an offline artificial bandwidth extension (ABWE) algorithm to perform acoustic expansion of ...
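The paper's offline ABWE algorithm is elided above; as orientation only, the crudest classical baseline for bandwidth extension is spectral folding: inserting a zero between samples doubles the sampling rate and mirrors the 0-4 kHz spectrum into the new 4-8 kHz band. A real ABWE system (like the one the paper presents) would instead estimate and shape the missing band. This sketch is not the paper's method:

```python
def spectral_fold(narrowband):
    """Zero-insertion upsampling by 2: the output has twice the sampling
    rate, and the narrowband spectrum reappears mirrored in the upper
    band. A shaping/high-band filter would normally follow this step."""
    wideband = []
    for s in narrowband:
        wideband.extend([s, 0.0])
    return wideband
```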
A feature-space transformation for telephone based speech recognition
1995
An experimental study describing the effects of carbon and electret telephone transducers on automatic speech recognition (ASR) performance is presented. It is shown that telephone based ASR performance on a connected digit task actually improves when speech is spoken through the carbon transducer. This surprising result is explained by a study of the differences in acoustic characteristics between carbon and electret telephone handsets. An initial attempt is made to devise a simple procedure for obtaining a parametric transformation which emulates the properties of the carbon transducer. The parameters of this transformation are trained automatically from speech spoken simultaneously through carbon and electret telephone handsets. When telephone speech data is transformed according to this procedure, a significant improvement in ASR performance is obtained. These results are interpreted and future research directions are discussed.
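One simple way to train a transformation from paired recordings, as described above, is a least-squares FIR fit: find the filter that best maps the electret signal onto the simultaneously recorded carbon signal. This is only a linear sketch (the carbon transducer is also nonlinear, which the paper's transformation addresses and this ignores), and the function name and tap count are illustrative assumptions:

```python
import numpy as np

def fit_fir_transform(x, y, taps=3):
    """Least-squares FIR filter h minimizing ||conv(x, h) - y|| over the
    valid region, for paired signals x (e.g. electret) and y (carbon)."""
    rows = [[x[n - k] for k in range(taps)] for n in range(taps - 1, len(x))]
    h, *_ = np.linalg.lstsq(np.array(rows), np.array(y[taps - 1:]), rcond=None)
    return h
```

On synthetic paired data generated by a known filter, the fit recovers that filter exactly, which is a useful sanity check before applying it to real handset recordings.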
Telephony Speech Recognition System: Challenges
IJCA Proceedings on National Conference on Communication Technologies Its Impact on Next Generation Computing 2012, 2012
This paper describes the challenges in designing a telephony Automatic Speech Recognition (ASR) system. Telephonic speech data are collected automatically from all geographical regions of West Bengal to cover the major dialectal variations of spoken Bangla. All incoming calls are handled by an Asterisk server, i.e., a computer telephony interface (CTI). The system asks some queries, and users' spoken responses are stored and transcribed manually for ASR system training. In a real-time scenario, the telephonic speech contains channel drops, silence or no-speech events, truncated speech signals, noisy signals, etc., along with the desired speech events. This paper describes these kinds of challenges for a telephony ASR system. It also briefly describes techniques that handle such unwanted signals to a certain extent and provide a nearly clean speech signal for the ASR system.
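A minimal screen for the silence and no-speech events mentioned above is an energy-based voice activity detector: flag frames whose log energy falls far below the loudest frame. The threshold here is an illustrative assumption, and real telephony front ends add adaptive thresholds and hangover smoothing:

```python
import math

def energy_vad(frames, threshold_db=-35.0):
    """Mark each frame True (speech-like) or False (silence/no-speech)
    by mean-square energy relative to the loudest frame, in dB."""
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    peak = max(energies) or 1.0   # guard against an all-zero recording
    out = []
    for e in energies:
        db = 10.0 * math.log10(e / peak) if e > 0 else -300.0
        out.append(db > threshold_db)
    return out
```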
The 2004 BBN/LIMSI 20xRT English conversational telephone speech recognition system
Interspeech 2005
In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real-time as measured on Pentium 4 Xeon 3.4 GHz Processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites.
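For reference, the WER figures above are computed as word-level edit distance divided by reference length; and a 22.8% relative improvement down to 13.5% implies the 2003 system scored roughly 13.5 / (1 - 0.228) ≈ 17.5% WER on the same set. A standard WER computation looks like:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    via word-level Levenshtein distance between ref and hyp strings."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i               # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j               # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```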
Building speech databases for cellular networks
The number of telephone applications that use automatic speech recognition is increasing fast. At the same time the use of mobile telephones is rising at high speed. This causes a need for databases with speech recorded over the cellular network. When creating a mobile speech database a number of problems show up that are not an issue when creating a speech database of fixed network recordings. These problems have to do with different recording environments, different networks and handsets, speaker recruitment and distribution, and the transcription. In this paper, the problems are explained, a couple of possible solutions are given and our experiences with these solutions in our contributions to the creation of mobile speech databases are presented. Besides, ELRA's position in the distribution of mobile speech databases is outlined.
Recognition of conversational telephone speech using the JANUS speech engine
1997
Recognition of conversational speech is one of the most challenging speech recognition tasks to date. While recognition error rates of 10% or lower can now be reached on speech dictation tasks over vocabularies in excess of 60,000 words, recognition of conversational speech has persistently resisted most attempts at improvement by way of the proven techniques to date. Difficulties arise from shorter words, telephone channel degradation, and highly disfluent and coarticulated speech. In this paper, we describe the application, adaptation, and performance evaluation of our JANUS speech recognition engine on the Switchboard conversational speech recognition task. Through a number of algorithmic improvements, we have been able to reduce error rates from more than 50% word error to 38%, measured on the official 1996 NIST evaluation test set. Improvements include vocal tract length normalization, polyphonic modeling, label boosting, speaker adaptation with and without confidence measures, and speaking-mode-dependent pronunciation modeling.
TELEPHONY APPLICATIONS WITH SPEECH RECOGNITION
In this paper, we present and describe several computer telephony applications using speech recognition. These applications were developed under a research project carried out in collaboration with Portugal Telecom. Two possibilities have been explored in developing speech recognition applications. In the first, speech recognition was implemented purely in software. In the second, we used hardware equipped with DSPs. Both support word spotting and barge-in. In addition to the telephone applications, the generic tools built to develop those applications are also presented.
Automatic transcription of conversational telephone speech
IEEE Transactions on Speech and Audio Processing, 2000
This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling is presented. These include the use of conversation-side-based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion-network-based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class-based language models.
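The confusion-network decoding mentioned above collapses a word lattice into a chain of "bins", each holding posterior probabilities for competing words (plus an epsilon/skip entry); consensus decoding then picks the highest-posterior word per bin, which approximately minimizes expected word error rather than sentence error. A minimal sketch over an already-built network (the epsilon symbol here is an assumption):

```python
def consensus_decode(confusion_network, eps="-"):
    """Pick the highest-posterior word in each confusion-network bin,
    dropping epsilon (word-absent) choices.
    confusion_network: list of dicts mapping word -> posterior."""
    hypothesis = []
    for bin_posteriors in confusion_network:
        best = max(bin_posteriors, key=bin_posteriors.get)
        if best != eps:
            hypothesis.append(best)
    return hypothesis
```

The per-bin winning posteriors double as word confidence scores, which is how confusion networks support the confidence estimation the abstract mentions.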
STC-TIMIT: Generation of a single-channel telephone corpus
Proc. of …, 2008
This paper describes a new speech corpus, STC-TIMIT, and discusses its design, development, and distribution through the LDC. The STC-TIMIT corpus is derived from the widely used TIMIT corpus by sending it through a single real telephone channel. TIMIT is phonetically balanced, covers the dialectal diversity of the continental USA, and has been extensively used as a benchmark for speech recognition algorithms, especially in early stages of development. The experimental usability of TIMIT has been extended over time through the creation of derived corpora, produced by passing the original data through different channels. One such example is the well-known NTIMIT corpus, where the original files in TIMIT are re-recorded after being sent through different telephone calls, resulting in a corpus that characterizes telephone channels in a wide sense. In STC-TIMIT, we followed a similar procedure, but the whole corpus was transmitted in a single telephone call, with the goal of obtaining data from a real and yet highly stable telephone channel across the whole corpus. Files in STC-TIMIT are aligned to those of TIMIT with a theoretical precision of 0.125 ms, making TIMIT labels valid for the new corpus. The experimental section presents several results on speech recognition accuracy.
Telephone data collection using the World Wide Web
Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96, 1996
Over the past year our group has begun development of telephone-based speech understanding capability for our GALAXY conversational system. An important part of this process has been the collection of telephone speech which was used for training and evaluation. In the first phase of data collection our goal was to collect read speech from a wide variety of talkers, telephone handsets, and noise/channel conditions. In the second phase of data collection our additional goal was to collect spontaneous telephone speech from subjects actually using the system. In order to maximize variation in telephone conditions, as well as ease of use for subjects, the data collection software was designed to telephone subjects at their specified phone numbers around North America. Subjects initiate the data collection session by submitting an electronic form accessible by a WWW browser. For read speech collection, a set of prompts is automatically generated for the subject. This paper describes the design of the data collection system we are using for these purposes. To date we have collected over 9,000 utterances from over 270 subjects.
Investigating Data Selection for Minimum Phone Error Training of Acoustic Models
2007
This paper considers minimum phone error (MPE) based discriminative training of acoustic models for Mandarin broadcast news recognition. A novel data selection approach based on the normalized frame-level entropy of Gaussian posterior probabilities obtained from the word lattice of the training utterance was explored. It has the merit of making the training algorithm focus much more on the training statistics of those frame samples that center nearly around the decision boundary for better discrimination. Moreover, we presented a new phone accuracy function based on the frame-level accuracy of hypothesized phone arcs instead of using the raw phone accuracy function of MPE training. The underlying characteristics of the presented approaches were extensively investigated and their performance was verified by comparison with the original MPE training approach. Experiments conducted on the broadcast news collected in Taiwan showed that the integration of the frame-level data selection and accuracy calculation could achieve slight but consistent improvements over the baseline system.
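The normalized frame-level entropy described above can be computed directly from a frame's posterior distribution over Gaussians (or lattice arcs): frames near the decision boundary have nearly uniform posteriors and score close to 1. A sketch, where the selection threshold is an illustrative assumption rather than a value from the paper:

```python
import math

def normalized_frame_entropy(posteriors):
    """Entropy of one frame's posterior distribution, normalized by
    log(N) so the result lies in [0, 1]."""
    n = len(posteriors)
    h = -sum(p * math.log(p) for p in posteriors if p > 0.0)
    return h / math.log(n) if n > 1 else 0.0

def select_frames(frame_posteriors, threshold=0.5):
    """Indices of frames whose normalized entropy exceeds the threshold,
    i.e. frames near the decision boundary that MPE training should
    weight more heavily."""
    return [i for i, p in enumerate(frame_posteriors)
            if normalized_frame_entropy(p) > threshold]
```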
2001
Research in the speech recognition (speech-to-text conversion) area has been underway for a couple of decades, and a great deal of progress has been made in reducing the word error rate (WER). In this paper, we attempt to summarize the state of the art in speech recognition algorithms. The algorithms we describe span the areas of lexicon design, feature extraction, classifier design, combination of hypotheses, and speaker adaptation of acoustic models. We benchmark the algorithms on two main sources of speech, the first being Voicemail (conversational telephone speech from a single speaker) and the second being Switchboard (conversational telephone speech between two speakers). We also present the results of some cross-domain experiments which highlight the "brittleness" of speech recognition systems today and illustrate the need to focus research effort on improving cross-domain performance.