Rapid on-line environment compensation for server-based speech recognition in noisy mobile environments

Speech Recognition Over Mobile Networks

Advances in Pattern Recognition, 2008

This chapter addresses issues associated with automatic speech recognition (ASR) over mobile networks, and introduces several techniques for improving speech recognition performance. One of these issues is the performance degradation of ASR over mobile networks that results from distortions produced by speech coding algorithms employed in mobile communication systems, transmission errors occurring over mobile telephone channels, and ambient background noise that can be particularly severe in mobile domains. In particular, speech coding algorithms have difficulty in modeling speech in ambient noise environments. To overcome this problem, noise reduction techniques can be integrated into speech coding algorithms to improve reconstructed speech quality under ambient noise conditions, or speech coding parameters can be made more robust with respect to ambient noise. As an alternative to mitigating the effects of speech coding distortions in the received speech signal, a bitstream-based framework has been proposed. In this framework, the direct transformation of speech coding parameters to speech recognition parameters is performed as a means of improving ASR performance. Furthermore, it is suggested that the receiver-side enhancement of speech coding parameters can be performed using either an adaptation algorithm or model compensation. Finally, techniques for reducing the effects of channel errors are also discussed in this chapter. These techniques include frame erasure concealment for ASR, soft-decoding, and missing feature theory-based ASR decoding.
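As a concrete illustration of the bitstream-based idea, LPC parameters decoded from the codec bitstream can be mapped directly to cepstral recognition features via the standard LPC-to-cepstrum recursion. The sketch below is a minimal Python illustration under that assumption; the coefficient convention, orders, and example values are illustrative and not taken from the chapter.

```python
# Sketch: deriving cepstral features directly from codec LPC parameters,
# one way to realise the "bitstream-based" transformation described above.
# Assumes LPCs a_k follow the convention H(z) = 1 / (1 - sum_k a_k z^-k);
# the gain/energy term c0 is omitted for brevity.
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LPC-to-cepstrum recursion for a minimum-phase all-pole model."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Example: a 10th-order LPC vector decoded from the bitstream of one frame.
lpc = np.array([1.3, -0.8, 0.4, -0.2, 0.1, -0.05, 0.02, -0.01, 0.005, -0.002])
print(lpc_to_cepstrum(lpc, 12))
```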

Speech recognition in mobile environments

2000

The growth of cellular telephony, combined with recent advances in speech recognition technology, creates sizeable opportunities for mobile speech recognition applications. Classic robustness techniques previously proposed for speech recognition yield only limited improvement against the degradation introduced by the idiosyncrasies of mobile networks. These sources of degradation include distortion introduced by the speech codec as well as artifacts arising from channel errors and discontinuous transmission.

Robust speech recognition in client-server scenarios

This paper addresses issues that are specific to the implementation of automatic speech recognition (ASR) applications and services in client-server scenarios. It is assumed in all of these scenarios that functionality in a human-machine dialog system is distributed between mobile client devices and network-based multi-user media and application servers. It is argued that, while there has already been a great deal of research addressing issues relating to the communications channels associated with these scenarios, there are many additional problems that have received relatively little attention. These include how environmental and speaker robustness algorithms are implemented in mobile domains and how multiple ASR channels can be implemented more efficiently in multi-user deployments. Preliminary results are summarized showing the effect of user-specific unsupervised adaptation and normalization algorithms on ASR performance in mobile domains. Results are also presented demonstrating the efficiencies obtainable from using intelligent algorithms for assigning ASR decoders to computation servers in multi-user deployments.
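For context, a typical unsupervised normalization step of the kind mentioned above is cepstral mean and variance normalization applied per user or per session on the server. The following is a minimal sketch of that generic technique, not the paper's specific adaptation algorithm; all names and dimensions are illustrative.

```python
# Minimal sketch of per-user feature normalization (cepstral mean and
# variance normalization), one of the unsupervised normalization schemes a
# server-side recognizer might apply; not the paper's exact algorithm.
import numpy as np

def cmvn(features, eps=1e-8):
    """Normalize a (frames x dims) feature matrix to zero mean, unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Usage: normalize the MFCCs accumulated for one user/session before decoding.
utterance = np.random.randn(200, 13)   # 200 frames of 13-dim features (toy data)
normalised = cmvn(utterance)
```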

A noise-robust front-end for distributed speech recognition in mobile communications

International Journal of Speech Technology, 2007

This paper investigates new front-end processing that aims at improving the performance of speech recognition in noisy mobile environments. The approach combines features based on conventional Mel-frequency cepstral coefficients (MFCCs), line spectral frequencies (LSFs), and formant-like (FL) features to constitute robust multivariate feature vectors. The resulting front-end is an alternative to the DSR-XAFE (XAFE: eXtended Audio Front-End) available in GSM mobile communications. Our results showed that, for highly noisy speech, combining these spectral cues leads to a significant improvement in recognition accuracy on the Aurora 2 task.
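A minimal sketch of the feature-combination idea, assuming the three streams have already been extracted frame-synchronously: the per-frame MFCC, LSF, and FL vectors are simply concatenated into one multivariate vector. The dimensions and placeholder data below are assumptions, not the paper's configuration.

```python
# Illustrative sketch of building the combined feature vector by stacking
# per-frame MFCC, LSF, and formant-like (FL) streams side by side.
import numpy as np

def combine_streams(mfcc, lsf, fl):
    """Stack three (frames x dim) streams into one multivariate feature matrix."""
    assert mfcc.shape[0] == lsf.shape[0] == fl.shape[0]
    return np.hstack([mfcc, lsf, fl])

frames = 100
mfcc = np.random.randn(frames, 13)   # e.g. 13 MFCCs per frame (toy data)
lsf  = np.random.randn(frames, 10)   # e.g. 10 line spectral frequencies
fl   = np.random.randn(frames, 4)    # e.g. 4 formant-like features
combined = combine_streams(mfcc, lsf, fl)   # (100, 27) vectors fed to the recognizer
```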

Channel noise robustness for low-bitrate remote speech recognition

2002

In remote (or distributed) speech recognition, the recognition features are quantized at the client and transmitted to the server via wireless or packet-based communication for recognition. In this paper, we investigate the robustness of remote speech recognition applications against channel noise. The techniques presented include: 1) optimal soft-decision channel decoding allowing for error detection, 2) weighted Viterbi recognition (WVR) with weighting coefficients based on the channel decoding reliability, 3) frame erasure concealment, and 4) WVR with weighting coefficients based on the quality of the erasure concealment operation. The techniques are implemented at the receiver (server), which limits the complexity added to the client and significantly extends the range of channel conditions over which remote recognition can be sustained. As a case study, we illustrate that remote recognition based on perceptual linear prediction (PLP) coefficients can provide good recognition accuracy at less than 500 bps over a wide range of channel conditions.
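Weighted Viterbi recognition can be illustrated with a small sketch: each frame's emission log-likelihood is scaled by a reliability weight in [0, 1] derived from channel decoding confidence, so unreliable frames contribute less to the best path. The HMM shapes and weight values below are illustrative assumptions, not the paper's models.

```python
# Sketch of weighted Viterbi recognition (WVR): per-frame reliability weights
# scale the emission log-likelihoods before the usual dynamic-programming step.
import numpy as np

def weighted_viterbi(log_trans, log_emit, weights, log_init):
    """log_trans: (S,S), log_emit: (T,S), weights: (T,) in [0,1], log_init: (S,)."""
    T, S = log_emit.shape
    delta = log_init + weights[0] * log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # score[prev, next]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + weights[t] * log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```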

Innovative speech processing for mobile terminals: an annotated bibliography

Signal Processing, 2000

This paper gives an overview of recent bibliographic references dealing with speech processing in mobile terminals. Its purpose is to point out state-of-the-art issues in the area; thus a fairly large list of references taken from many conference proceedings and journals is given and commented upon. General considerations about speech processing in mobile communications are first introduced; then we deal with audio processing for speech enhancement in mobile terminals and with low bit-rate speech coding. Speech recognition is addressed, with some emphasis on mobile applications. A short overview of implementation aspects of speech processing algorithms in mobile terminals is also given. Finally, open issues and problems are listed.

An Efficient Front-End for Distributed Speech Recognition over Mobile

International Journal of Computer and Communication Engineering, 2012

To improve the robustness of distributed speech recognition front-ends in mobile communication, we introduce in this paper a new set of feature vectors estimated in three steps. First, Mel line spectral frequency (MLSF) coefficients are combined with conventional MFCCs after being extracted from acoustic frames denoised with a Wiener filter. Second, we optimize the stream weights of multi-stream HMMs using a discriminative approach. Finally, these features are transformed and reduced in dimensionality in a multi-stream scheme using the Karhunen-Loeve Transform (KLT). Recognition experiments on the Aurora 2 connected-digits database reveal that the proposed front-end leads to a significant improvement in speech recognition accuracy for highly noisy GSM speech.
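The KLT step can be sketched as a standard eigendecomposition of the feature covariance followed by projection onto the leading eigenvectors. The code below is a generic illustration under assumed dimensions; it does not reproduce the paper's multi-stream weighting.

```python
# Sketch of the KLT (PCA) reduction: project concatenated multi-stream
# features onto the leading eigenvectors of their covariance matrix.
import numpy as np

def klt_reduce(features, n_keep):
    """features: (frames x dims). Returns (frames x n_keep) projections."""
    centred = features - features.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # largest variance first
    basis = eigvecs[:, order[:n_keep]]
    return centred @ basis

combined = np.random.randn(500, 23)              # e.g. MFCC + MLSF streams (toy data)
reduced = klt_reduce(combined, n_keep=13)        # compact vectors for the HMMs
```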

Low-bitrate distributed speech recognition for packet-based and wireless communication

IEEE Transactions on Speech and Audio Processing, 2002

We present a framework for developing source coding, channel coding and decoding, and erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition, as opposed to speech coding, is more sensitive to channel errors than to channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining soft-decision decoding with error detection at the receiver. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique to modify the recognition engine itself to take into account the time-varying reliability of the decoded features after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves the recognition accuracy. Together, source coding, channel coding, and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels at bit rates of 1.2 kbps or less.
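Frame erasure concealment at the decoder can be illustrated with a simple scheme that linearly interpolates erased feature frames between the nearest reliable frames and repeats at the edges. This is a generic sketch of the idea, not the paper's exact concealment algorithm.

```python
# Sketch of simple frame erasure concealment: erased feature frames are
# linearly interpolated between the nearest reliable frames; leading or
# trailing erasures are filled by repetition.
import numpy as np

def conceal_erasures(features, erased):
    """features: (T x D) feature matrix; erased: boolean mask of length T."""
    out = features.copy()
    good = np.flatnonzero(~erased)
    for t in np.flatnonzero(erased):
        prev_good = good[good < t]
        next_good = good[good > t]
        if prev_good.size and next_good.size:
            a, b = prev_good[-1], next_good[0]
            w = (t - a) / (b - a)
            out[t] = (1 - w) * features[a] + w * features[b]
        elif prev_good.size:                      # trailing erasures: repeat last good
            out[t] = features[prev_good[-1]]
        elif next_good.size:                      # leading erasures: repeat first good
            out[t] = features[next_good[0]]
    return out
```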

Trellis encoded vector quantization for robust speech recognition

A joint data (features) and channel (bias) estimation framework for robust speech recognition is described. A trellis encoded vector quantizer is used as a pre-processor to estimate the channel bias using blind maximum likelihood sequence estimation. A sequential constraint on the feature vector sequence is explored and used in two ways: a) the selection of the quantized signal constellation, and b) the decoding process in joint data and channel estimation. A two-state trellis encoded vector quantizer is designed for signal bias removal applications. Compared with the conventional memoryless VQ-based approach to signal bias removal, preliminary experimental results indicate that incorporating the sequential constraint in joint data and channel estimation for robust speech recognition is advantageous.
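A toy sketch of a two-state trellis encoded vector quantizer: each trellis state owns its own sub-codebook, and the quantized sequence is the minimum cumulative-distortion state/codeword path found with a Viterbi search, so the sequential constraint is enforced through the state transitions. The codebooks, switch penalty, and fully connected transition structure below are invented for illustration and are not the paper's design.

```python
# Toy two-state trellis encoded VQ: Viterbi search over state sequences,
# where each state has its own sub-codebook and switching states is penalised.
import numpy as np

def trellis_vq(frames, codebooks, switch_penalty=0.1):
    """frames: (T x D); codebooks: list of (K x D) arrays, one per trellis state."""
    S, T = len(codebooks), len(frames)
    dist = np.zeros((T, S))
    idx = np.zeros((T, S), dtype=int)
    for t in range(T):
        for s, cb in enumerate(codebooks):
            d = np.sum((cb - frames[t]) ** 2, axis=1)   # distortion per codeword
            idx[t, s] = d.argmin()
            dist[t, s] = d.min()
    trans = switch_penalty * (1 - np.eye(S))            # cost for changing state
    cost = dist[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = cost[:, None] + trans + dist[t][None, :]
        back[t] = scores.argmin(axis=0)
        cost = scores.min(axis=0)
    states = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):                       # trace back best state path
        states.append(int(back[t, states[-1]]))
    states.reverse()
    return [(s, int(idx[t, s])) for t, s in enumerate(states)]

frames = np.random.randn(50, 8)                          # toy feature sequence
codebooks = [np.random.randn(16, 8), np.random.randn(16, 8)]
quantized_path = trellis_vq(frames, codebooks)           # (state, codeword) per frame
```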

A MFCC-Based CELP Speech Coder for Server-Based Speech Recognition in Network Environments

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007

Existing standard speech coders can provide speech communication of high quality, but they degrade the performance of speech recognition systems that operate on the reconstructed speech. The main cause of the degradation is that the spectral envelope parameters in speech coding are optimized for speech quality rather than for speech recognition performance. For example, mel-frequency cepstral coefficients (MFCCs) are generally known to provide better speech recognition performance than linear prediction coefficients (LPCs), a typical parameter set in speech coding. In this paper, we propose a speech coder using MFCCs instead of LPCs to improve the performance of a server-based speech recognition system in network environments. The main challenge in using MFCCs is developing an efficient MFCC quantization scheme at a low bit rate. First, we explore the interframe correlation of MFCCs, which leads to predictive quantization of the MFCCs. Second, a safety-net scheme is proposed to make the MFCC-based speech coder robust to channel errors. As a result, we propose an 8.7 kbps MFCC-based CELP coder. A PESQ test shows that the proposed speech coder has speech quality comparable to that of 8 kbps G.729, while the speech recognition performance obtained with the proposed coder is expected to be better than that obtained with G.729.
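Predictive quantization with a safety net can be sketched as follows: each MFCC frame is coded either as a quantized prediction residual (exploiting interframe correlation) or directly with a memoryless quantizer (the safety net, which limits error propagation after channel errors), whichever reconstructs the frame better, with one bit signalling the mode. The uniform scalar quantizer, prediction coefficient, and step size below are assumptions, not the coder's actual design.

```python
# Sketch of predictive MFCC quantization with a safety-net mode.
import numpy as np

def quantize(x, step=0.25):
    """Uniform scalar quantizer used as a stand-in for the real codebooks."""
    return step * np.round(x / step)

def encode_mfcc(frames, rho=0.8, step=0.25):
    """frames: (T x D) MFCC matrix. Returns reconstructed frames and mode bits."""
    prev = np.zeros(frames.shape[1])
    recon, modes = [], []
    for x in frames:
        pred_recon = rho * prev + quantize(x - rho * prev, step)   # predictive mode
        direct_recon = quantize(x, step)                            # safety-net mode
        if np.sum((x - pred_recon) ** 2) <= np.sum((x - direct_recon) ** 2):
            prev, mode = pred_recon, 0
        else:
            prev, mode = direct_recon, 1
        recon.append(prev)
        modes.append(mode)                                          # 1 bit per frame
    return np.vstack(recon), modes
```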