Use of simulated data for robust telephone speech recognition

Noise-robust speech recognition of conversational telephone speech

Interspeech 2006

Over the past several years, a primary focus of speech recognition research has been speech transmitted over telephone or IP networks, and IP telephony in particular has come into widespread use. This paper describes the performance of a speech recognizer on noisy speech transmitted over an H.323 IP telephony network, where the minimum mean-square error log-spectral amplitude (MMSE-LSA) method [1,2] is used to reduce the mismatch between training and deployment conditions in order to achieve robust speech recognition. In the H.323 network environment, the main sources of distortion are packet loss and additive noise. We first evaluate the impact of packet loss on recognition performance, and then explore the effects of uncorrelated additive noise, selecting seven types of noise sources for our experiments. The experimental results indicate that the MMSE-LSA enhancement method clearly increases robustness for some types of additive noise under certain packet loss rates over the H.323 telephone network.
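For reference, the MMSE-LSA spectral gain from Ephraim and Malah's estimator, which the abstract refers to, can be sketched as follows. This is a minimal illustration of the gain rule only, not the paper's full enhancement pipeline, and the function name is our own:

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1

def mmse_lsa_gain(xi, gamma):
    """MMSE log-spectral amplitude gain, per frequency bin.

    xi:    a priori SNR estimate
    gamma: a posteriori SNR estimate
    The enhanced spectral magnitude is this gain times the noisy magnitude.
    """
    v = xi * gamma / (1.0 + xi)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))
```

In a full enhancer, xi is typically tracked with a decision-directed rule and the gain is applied frame by frame to the noisy STFT magnitudes before resynthesis.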

Conversational telephone speech recognition

2003

This paper describes the development of a speech recognition system for the processing of telephone conversations, starting with a state-of-the-art broadcast news transcription system. We identify major changes and improvements in acoustic and language modeling, as well as decoding, which are required to achieve state-of-the-art performance on conversational speech. Major changes on the acoustic side include the use of speaker normalization (VTLN), the need to cope with channel variability, and the need for efficient speaker adaptation and better pronunciation modeling. On the linguistic side the primary challenge is to cope with the limited amount of language model training data. To address this issue we make use of a data selection technique and a smoothing technique based on a neural network language model. At the decoding level, lattice rescoring and minimum word error decoding are applied. On the development data, these improvements yield an overall word error rate of 24.9%, whereas the original BN transcription system had a word error rate of about 50% on the same data.

ROBUST RECOGNITION OF SMALL VOCABULARY TELEPHONE-QUALITY SPEECH

2003

Considerable progress has been made in the field of automatic speech recognition in recent years, especially for high-quality (full-bandwidth and noise-free) speech. However, good recognition accuracy is difficult to achieve when the incoming speech is passed through a telephone channel. At the same time, the task of speech recognition over telephone lines is growing in importance, as the number of spoken language processing applications involving telephone speech increases every day. The paper presents our recent work on developing a robust speaker-independent isolated-word recognition system based on a hybrid approach (classical / artificial neural network). A number of experiments are described and compared in order to evaluate the analysis and recognition techniques best suited to a telephone-speech recognition task. In particular, we address the use of RASTA processing (i.e., filtering the temporal trajectories of speech parameters) for increasing recognition accuracy. We also propose a method based on adaptive filter theory for producing simulated telephone data starting from clean speech databases.
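As a rough illustration of the RASTA processing mentioned above (not necessarily the paper's exact configuration), the classic RASTA band-pass filter of Hermansky and Morgan can be applied along time to each cepstral or log-spectral trajectory:

```python
import numpy as np
from scipy.signal import lfilter

# Classic RASTA filter: band-pass IIR with a zero at DC, so stationary
# (channel-induced) offsets in each trajectory are suppressed.
RASTA_B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # numerator taps
RASTA_A = np.array([1.0, -0.98])                        # denominator (pole at 0.98)

def rasta_filter(features):
    """features: array of shape (num_frames, num_coeffs).

    Filters each coefficient trajectory independently along the time axis.
    """
    return lfilter(RASTA_B, RASTA_A, features, axis=0)
```

Because the filter has a zero at DC, a constant offset added to a trajectory by a fixed telephone channel decays out of the filtered features.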

Training issues and channel equalization techniques for the construction of telephone acoustic models using a high-quality speech corpus

IEEE Transactions on Speech and Audio Processing, 1994

We describe an approach for the estimation of acoustic phonetic models to be used in a hidden Markov model (HMM) recognizer operating over the telephone. We explore two complementary techniques for developing telephone acoustic models. The first presents two new channel compensation algorithms; experimental results on the Wall Street Journal corpus show no significant improvement over sentence-based cepstral-mean removal. The second uses an existing "high-quality" speech corpus to train acoustic models that are appropriate for the Switchboard Credit Card task over long-distance telephone lines. Experimental results show that cross-database acoustic training yields performance similar to that of conventional task-dependent acoustic training.
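Sentence-based cepstral-mean removal, the baseline compensation mentioned above, amounts to subtracting the per-utterance mean from each cepstral trajectory, since a stationary convolutional channel appears as an additive constant in the cepstral domain. A minimal sketch (illustrative naming, not the paper's code):

```python
import numpy as np

def cepstral_mean_removal(cepstra):
    """cepstra: array of shape (num_frames, num_coeffs) for one utterance.

    Subtracting the utterance mean cancels any constant cepstral offset,
    i.e. a fixed linear channel applied to the whole sentence.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

A useful property: the output is invariant to any constant shift of the input, which is exactly why a fixed channel is removed.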

Bell Labs connected digit databases for telephone speech recognition

International Conference on Natural Language Processing and Knowledge Engineering, 2003

This paper describes the Bell Labs Connected Digits databases (BLCD), which were collected over landline telephone networks. The BLCD databases were designed to provide a standard benchmark for evaluating the performance of different connected digit recognition systems. They are also a vehicle for research and diagnosis of specific problems in automatic connected digit recognition. We first describe the content and organization of the BLCD databases, and then present an automatic database verification procedure utilizing automatic speech recognition (ASR). For reference, we present speech recognition performance on a set of the databases using the Bell Labs ASR system. For the databases with good recording conditions, the word-error rates can be less than 1%. In order to promote speech science and technology for real-world applications, we make this database available to the speech community.

Investigations on Offline Artificial Bandwidth Extension of Telephone Speech Databases

Automatic speech recognition (ASR) systems have to be trained on large speech databases. For telephony tasks, speech databases almost exclusively exist with a narrow acoustic bandwidth. In the near future, more and more wideband (WB) speech codecs – such as the adaptive multi-rate wideband (AMR-WB) codec – will come into use, leading to a demand for WB telephony speech databases. These can be used to train WB acoustic models, such as hidden Markov models (HMMs). ASR systems may benefit from WB acoustic models in that more demanding tasks with larger vocabularies or continuous speech input could be performed. Recording WB telephony speech databases, however, would entail great effort in time and money. Furthermore, there are still only a few WB-capable mobile terminals on the market, and appropriate network infrastructure is so far mostly available only for testing purposes. This paper presents an offline artificial bandwidth extension (ABWE) algorithm to perform acoustic expansion of ...

A feature-space transformation for telephone based speech recognition

1995

An experimental study describing the effects of carbon and electret telephone transducers on automatic speech recognition (ASR) performance is presented. It is shown that telephone-based ASR performance on a connected digit task actually improves when speech is spoken through the carbon transducer. This surprising result is explained by a study of the differences in acoustic characteristics between carbon and electret telephone handsets. An initial attempt is made to devise a simple procedure for obtaining a parametric transformation which emulates the properties of the carbon transducer. The parameters of this transformation are trained automatically from speech spoken simultaneously through carbon and electret telephone handsets. When telephone speech data is transformed according to this procedure, a significant improvement in ASR performance is obtained. These results are interpreted and future research directions are discussed.
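One plausible way to learn such a transformation from simultaneous carbon/electret recordings is adaptive FIR system identification, e.g. with the LMS algorithm. The sketch below is our own illustration of that general idea under a linear-filter assumption, not the paper's actual (possibly nonlinear) procedure:

```python
import numpy as np

def lms_identify(x, d, num_taps=8, mu=0.01):
    """Learn FIR taps w so that the filtered x approximates d.

    x: reference signal (e.g. the electret-handset recording)
    d: desired signal   (e.g. the simultaneous carbon-handset recording)
    mu: LMS step size (must be small enough for stability)
    """
    w = np.zeros(num_taps)
    for n in range(num_taps - 1, len(x)):
        u = x[n - num_taps + 1:n + 1][::-1]  # u[0] = x[n], most recent first
        e = d[n] - w @ u                     # instantaneous error
        w += mu * e * u                      # stochastic gradient step
    return w
```

Once trained, applying the learned taps with np.convolve(x, w) would map electret-like speech toward carbon-like speech, giving simulated training data.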

Telephony Speech Recognition System: Challenges

IJCA Proceedings on National Conference on Communication Technologies & Its Impact on Next Generation Computing, 2012

This paper describes the challenges in designing a telephony Automatic Speech Recognition (ASR) system. Telephonic speech data are collected automatically from all geographical regions of West Bengal to cover the major dialectal variations of spoken Bangla. All incoming calls are handled by an Asterisk server, i.e., a computer telephony interface (CTI). The system asks a set of queries, and the users' spoken responses are stored and transcribed manually for ASR system training. In a real-time scenario, the telephonic speech contains channel drops, silence or no-speech events, truncated speech signals, noisy signals, etc., along with the desired speech event. This paper describes these kinds of challenges for a telephony ASR system, and also outlines some techniques that can handle such unwanted signals in telephonic speech to a certain extent and provide nearly the desired speech signal to the ASR system.

The 2004 BBN/LIMSI 20xRT English conversational telephone speech recognition system

Interspeech 2005

In this paper we describe the English Conversational Telephone Speech (CTS) recognition system jointly developed by BBN and LIMSI under the DARPA EARS program for the 2004 evaluation conducted by NIST. The 2004 BBN/LIMSI system achieved a word error rate (WER) of 13.5% at 18.3xRT (real time as measured on a 3.4 GHz Pentium 4 Xeon processor) on the EARS progress test set. This translates into a 22.8% relative improvement in WER over the 2003 BBN/LIMSI EARS evaluation system, which was run without any time constraints. In addition to reporting on the system architecture and the evaluation results, we also highlight the significant improvements made at both sites.

Building speech databases for cellular networks

The number of telephone applications that use automatic speech recognition is increasing fast. At the same time, the use of mobile telephones is rising rapidly. This creates a need for databases of speech recorded over the cellular network. When creating a mobile speech database, a number of problems show up that are not an issue when creating a speech database of fixed-network recordings. These problems have to do with different recording environments, different networks and handsets, speaker recruitment and distribution, and the transcription. In this paper, the problems are explained, several possible solutions are given, and our experiences with these solutions in our contributions to the creation of mobile speech databases are presented. Finally, ELRA's position in the distribution of mobile speech databases is outlined.