Distributed Speech Recognition Research Papers
A speech enhancement technique is indispensable to achieve acceptable speech quality in VoIP systems. This paper proposes a Wiener filter optimized to the estimated SNR of noisy speech for speech enhancement. The proposed noise reduction method is applied as preprocessing before speech coding. Its performance is evaluated using PESQ in various noisy conditions, with the G.711, G.723.1, and G.729A VoIP speech codecs. The PESQ results show that the proposed noise reduction scheme outperforms the noise suppression in the IS-127 EVRC and the noise reduction in the ETSI standard for the advanced distributed speech recognition front-end.
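As a rough illustration of an SNR-driven Wiener filter, the sketch below computes the classic per-bin gain G = xi / (1 + xi) from an estimated SNR xi and applies it to a noisy power spectrum. The function names, the gain floor, and the power-subtraction SNR estimate are assumptions made for this example; the abstract does not say how the paper's filter is optimized.

```python
def wiener_gain(snr_linear, floor=0.1):
    """Classic Wiener gain xi / (1 + xi), limited below by a gain floor.

    The floor value is an illustrative assumption, not taken from the paper.
    """
    snr_linear = max(snr_linear, 0.0)
    return max(snr_linear / (1.0 + snr_linear), floor)


def enhance_frame(noisy_power, noise_power, floor=0.1):
    """Attenuate each bin of a noisy power spectrum.

    The per-bin SNR is estimated as max(noisy / noise - 1, 0), a simple
    power-subtraction estimate (an assumption). The gain acts on the
    amplitude, so the power spectrum is scaled by gain squared.
    """
    enhanced = []
    for y, n in zip(noisy_power, noise_power):
        n = max(n, 1e-12)  # guard against division by zero
        snr = max(y / n - 1.0, 0.0)
        g = wiener_gain(snr, floor)
        enhanced.append(g * g * y)
    return enhanced
```

Bins dominated by noise are attenuated down to the floor, while bins with high estimated SNR pass almost unchanged.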
In this paper, we propose a low bit-rate speech codec based on vector quantization (VQ) of the mel-frequency cepstral coefficients (MFCCs). We begin by showing that if a high-resolution mel-frequency cepstrum (MFC) is computed, good-quality speech reconstruction is possible from the MFCCs despite the lack of phase information. By evaluating the contribution toward speech quality that individual MFCCs make and applying appropriate quantization, our results show that the MFCC-based codec exceeds the state-of-the-art MELPe codec across the entire range of 600-2400 bps, when evaluated with the perceptual evaluation of speech quality (PESQ) (ITU-T recommendation P.862). The main advantage of the proposed codec is in distributed speech recognition (DSR), since the MFCCs can be applied directly, thus eliminating additional decode and feature extraction stages; furthermore, the proposed codec better preserves the fidelity of the MFCCs and yields better word accuracy rates than the CELP and MELPe codecs.
Currently, there are technology barriers inhibiting speech processing systems working under extremely noisy conditions. The emerging applications of speech technology, especially in the fields of wireless communications, digital hearing aids or speech recognition, are examples of such systems and often require a noise reduction technique operating in combination with a precise voice activity detector (VAD). This paper presents a new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems. The algorithm measures the long-term spectral divergence (LTSD) between speech and noise and formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, thus yielding a highly discriminating decision rule and minimizing the average number of decision errors. The decision threshold is adapted to the measured noise energy, while a controlled hang-over is activated only when the observed signal-to-noise ratio is low. An analysis of the speech/non-speech LTSD distributions shows that using long-term information about speech signals is beneficial for VAD. The proposed algorithm is compared to the most commonly used VADs in the field, in terms of speech/non-speech discrimination and in terms of recognition performance when the VAD is used for an automatic speech recognition system. Experimental results demonstrate a sustained advantage over standard VADs such as G.729 and adaptive multi-rate (AMR), which were used as a reference, and over the VADs of the advanced front-end for distributed speech recognition.
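The LTSD decision rule described above can be sketched as follows. This is a minimal illustration, assuming the long-term spectral envelope is the per-bin maximum of the amplitude spectra over a window of neighbouring frames, and that the divergence is the average envelope-to-noise power ratio expressed in dB. The function names, the window order, and the fixed threshold are assumptions; the noise-adaptive threshold and the hang-over mentioned in the abstract are omitted.

```python
import math


def long_term_envelope(frames, index, order):
    """Per-bin maximum of the amplitude spectra in frames index-order..index+order."""
    lo = max(0, index - order)
    hi = min(len(frames) - 1, index + order)
    n_bins = len(frames[index])
    return [max(frames[j][k] for j in range(lo, hi + 1)) for k in range(n_bins)]


def ltsd_db(frames, index, noise_spectrum, order=6):
    """Long-term spectral divergence (dB) between the envelope and the noise."""
    envelope = long_term_envelope(frames, index, order)
    ratio = sum((e * e) / (n * n) for e, n in zip(envelope, noise_spectrum))
    return 10.0 * math.log10(ratio / len(envelope))


def is_speech(frames, index, noise_spectrum, threshold_db, order=6):
    """Fixed-threshold decision (the paper adapts the threshold to noise energy)."""
    return ltsd_db(frames, index, noise_spectrum, order) > threshold_db
```

Here `frames` is a list of amplitude spectra (one list of bin magnitudes per frame) and `noise_spectrum` is an average noise amplitude spectrum with strictly positive entries, estimated elsewhere.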
The aim of this work is to develop methods that enable acoustic speech features to be predicted from mel-frequency cepstral coefficient (MFCC) vectors as may be encountered in distributed speech recognition architectures. The work begins with a detailed analysis of the multiple correlation between acoustic speech features and MFCC vectors. This confirms the existence of correlation, which is found to be higher when measured within specific phonemes rather than globally across all speech sounds. The correlation analysis leads to the development of a statistical method of predicting acoustic speech features from MFCC vectors that utilizes a network of hidden Markov models (HMMs) to localize prediction to specific phonemes. Within each HMM, the joint density of acoustic features and MFCC vectors is modeled and used to make a maximum a posteriori prediction. Experimental results are presented across a range of conditions, such as with speaker-dependent, gender-dependent, and gender-independent constraints, and these show that acoustic speech features can be predicted from MFCC vectors with good accuracy. A comparison is also made against an alternative scheme that substitutes the higher-order MFCCs with acoustic features for transmission. This delivers accurate acoustic features but at the expense of a significant reduction in speech recognition accuracy.
In anticipation of upcoming mobile telephony services with higher speech quality, a wideband (50 Hz to 7 kHz) mobile telephony derivative of TIMIT has been recorded called WTIMIT. It opens up various scientific investigations; e.g., on speech quality and intelligibility, as well as on wideband upgrades of network-side interactive voice response (IVR) systems with retrained or bandwidth-extended acoustic models for automatic speech recognition (ASR). Wideband telephony could enable network-side speech recognition applications such as remote dictation or spelling without the need of distributed speech recognition techniques. The WTIMIT corpus was transmitted via two prepared Nokia 6220 mobile phones over T-Mobile's 3G wideband mobile network in The Hague, The Netherlands, employing the Adaptive Multirate Wideband (AMR-WB) speech codec. The paper presents observations of transmission effects and phoneme recognition experiments. It turns out that in the case of wideband telephony, server-side ASR should not be carried out by simply decimating received signals to 8 kHz and applying existent narrowband acoustic models. Nor do we recommend just simulating the AMR-WB codec for training of wideband acoustic models. Instead, real-world wideband telephony channel data (such as WTIMIT) provides the best training material for wideband IVR systems.
In this paper, we address problems in the standard noise reduction method designed by ETSI (European Telecommunications Standards Institute) for distributed speech recognition. In the ETSI-based procedure, the noise spectrum is estimated from noise frames, which are detected by a voice activity detector (VAD). The VAD calculates the frame energy of the input signal, and if the frame energy is smaller than a threshold, the corresponding frame is considered a noise frame. In a highly corrupted signal, this leads the VAD to falsely detect noise frames as speech frames. In addition, in the second stage of the ETSI-based procedure, the gain factorization coefficient is set to 0.8 for noise frames, which results in less noise reduction for highly noisy signals. In our proposed improvement, pitch information is used along with frame energy to detect speech and noise frames, and the gain factor is increased for better noise reduction in noise frames. Experimental results on the Aurora-2J database show a significant improvement in noise reduction with the proposed method over the original ETSI-based noise reduction.
This paper shows an improved statistical test for voice activity detection in noise adverse environments. The method is based on a revised contextual likelihood ratio test (LRT) defined over a multiple observation window. The motivations for revising the original multiple observation LRT (MO-LRT) are found in its artificially added hangover mechanism, which exhibits incorrect behavior under different signal-to-noise ratio (SNR) conditions. The new approach defines a maximum a posteriori (MAP) statistical test in which all the global hypotheses on the multiple observation window containing up to one speech-to-nonspeech or nonspeech-to-speech transition are considered. Thus, the implicit hangover mechanism artificially added by the original method was not found in the revised method, so its design can be further improved. With these and other innovations, the proposed method showed higher speech/nonspeech discrimination accuracy over a wide range of SNR conditions when compared to the original MO-LRT voice activity detector (VAD). Experiments conducted on the AURORA databases and tasks showed that the revised method yields significant improvements in speech recognition performance over standardized VADs such as ITU-T G.729 and ETSI AMR for discontinuous voice transmission and the ETSI AFE for distributed speech recognition (DSR), as well as over recently reported methods.
We present a framework for developing source coding, channel coding and decoding as well as erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition, as opposed to speech coding, is more sensitive to channel errors than to channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining soft decision decoding with error detection at the receiver. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique to modify the recognition engine itself to take into account the time-varying reliability of the decoded features after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves the recognition accuracy. Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bit rates of 1.2 kbps or less.
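One common way to realize the weighted Viterbi idea is to scale each frame's observation log-likelihood by a reliability factor between 0 and 1, so that unreliable frames contribute less to the path score. The sketch below follows that formulation under stated assumptions: the function name, the argument layout, and the way the reliability is obtained are hypothetical and not taken from the paper.

```python
def weighted_viterbi(log_init, log_trans, log_obs, reliability):
    """Viterbi decoding with per-frame reliability weights.

    log_init[s]     : log prior of state s
    log_trans[p][s] : log transition probability from state p to state s
    log_obs[t][s]   : log likelihood of frame t under state s
    reliability[t]  : weight in [0, 1]; 0 ignores the frame, 1 is plain Viterbi
    Returns the best state path and its weighted log score.
    """
    n_states = len(log_init)
    score = [log_init[s] + reliability[0] * log_obs[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(
                score[best] + log_trans[best][s] + reliability[t] * log_obs[t][s]
            )
            pointers.append(best)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

With all weights equal to 1 this reduces to standard Viterbi decoding; setting a frame's weight to 0 lets the transition model alone bridge that frame.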
In this paper, we propose a low bit-rate speech codec based on a hybrid scalar/vector quantization of the mel-frequency cepstral coefficients (MFCCs). We begin by showing that if a high-resolution mel-frequency cepstrum (MFC) is computed, good-quality speech reconstruction is possible from the MFCCs despite the lack of explicit phase information. By evaluating the contribution toward speech quality that individual MFCCs make and applying appropriate quantization, our results show that the perceptual evaluation of speech quality (PESQ) score of the MFCC-based codec matches that of the state-of-the-art MELPe codec at 600 bps and exceeds that of the CELP codec at 2000-4000 bps coding rates. The main advantage of the proposed codec is in distributed speech recognition (DSR), since speech features based on MFCCs can be obtained directly from codewords, thus eliminating additional decode and feature extraction stages.
The paper describes a new architecture for accessing hyperlinked speech-accessible knowledge sources that are distributed over the Internet. The architecture, LRRP SpeechWeb, uses Local thin-client application-specific speech Recognition and Remote natural-language query Processing. Users navigate an LRRP SpeechWeb using voice-activated hyperlink commands, and query the knowledge sources through spoken natural-language using a speech browser executing on a local device (Frost, R.A. and Chitte, S., Proc. PACLING '99, Conf. of Pacific Association for Computational Linguistics, p.82-90, 1999). It differs from the use of speech interfaces to conventional Web HTML pages, from conventional telephone access to remote speech applications (as used in many call centers), and from the use of a network of hyperlinked VXML pages. The architecture is ideally suited for use when cell-phones become available with built-in speech-to-text and text-to-speech capabilities.
This paper describes the main components of MiPad (Multimodal Interactive PAD) and especially its distributed speech processing aspects. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution for data entry in PDAs or smart phones, often done by pecking with tiny styluses or typing on minuscule keyboards. Our user study indicates that the throughput of MiPad is significantly superior to that of the existing pen-based PDA interface. Acoustic modeling and noise robustness in distributed speech recognition are key components in MiPad's design and implementation. In a typical scenario, the user speaks to the device at a distance so that he or she can see the screen. The built-in microphone thus picks up a lot of background noise, which requires MiPad be noise robust. For complex tasks, such as dictating e-mails, resource limitations demand the use of a client-server (peer-to-peer) architecture, where the PDA performs primitive feature extraction, feature quantization, and error protection, while the transmitted features to the server are subject to further speech feature enhancement, speech decoding and understanding before a dialog is carried out and actions rendered. Noise robustness can be achieved at the client, at the server or both. Various speech processing aspects of this type of distributed computation as related to MiPad's potential deployment are presented in this paper. Recent user interface study results are also described. Finally, we point out future research directions as related to several key MiPad functionalities.
This paper presents a mixed recovery scheme for robust distributed speech recognition (DSR) implemented over a packet channel which suffers packet losses. The scheme combines media-specific forward error correction (FEC) and error concealment (EC). Media-specific FEC is applied at the client side, where FEC bits representing strongly quantized versions of the speech vectors are introduced. At the server side, the information provided by those FEC bits is used by the EC algorithm to improve the recognition performance. We investigate the adaptation of two different EC techniques, namely minimum mean square error (MMSE) estimation, which operates at the decoding stage, and weighted Viterbi recognition (WVR), where EC is applied at the recognition stage, in order to be used along with FEC. The experimental results show that a significant increase in recognition accuracy can be obtained with very little bandwidth increase, which may be null in practice, and a limited increase in latency, which in any case is not so critical for an application such as DSR.
Developing a speech-based application for mobile devices requires work upfront, since mobile devices and speech recognition systems vary dramatically in their capabilities. While mobile devices can concisely be classified by their processing power, memory, operating system and wireless network speed, it is a bit trickier for speech recognition engines. This paper presents a comprehensive approach that comprises a profound classification of speech recognition systems for mobile applications and a framework for mobile and distributed speech recognition. The framework, called Gulliver, speeds up the development process with multi-modal components that can be easily used in a GUI designer and with abstraction layers that support the integration of various speech recognition engines depending on the user's needs. The framework itself provides the base for a model-driven development approach.
This letter shows an innovative voice activity detector (VAD) based on the Kullback-Leibler (KL) divergence measure. The algorithm is evaluated in the context of the recently approved ETSI standard for distributed speech recognition (DSR). The VAD uses long-term information of the noisy speech signal in order to define a more robust decision rule yielding high accuracy. The Mel-scaled filter bank log-energies (FBE) are modeled by means of Gaussian distributions, and a symmetric KL divergence is used for the estimation of the distance between speech and noise distributions. The decision rule is formulated in terms of the average subband KL divergence that is compared to a noise-adaptable threshold. An exhaustive analysis using the AURORA databases is conducted in order to assess the performance of the proposed method and to compare it to existing standard VAD methods.
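The distance measure described above can be illustrated with the closed-form KL divergence between two univariate Gaussians. The sketch assumes each subband's log-energy is summarized by a mean and a variance, and that the symmetric divergence is the sum of the two directed divergences; the letter's exact definition, the function names, and the fixed threshold used here are assumptions.

```python
import math


def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, var_p), q = N(mu_q, var_q)."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetric divergence, taken here as KL(p || q) + KL(q || p)."""
    return kl_gaussian(mu_p, var_p, mu_q, var_q) + kl_gaussian(mu_q, var_q, mu_p, var_p)


def average_subband_kl(observed_stats, noise_stats):
    """Mean symmetric KL over subbands; each entry is a (mean, variance) pair."""
    distances = [
        symmetric_kl(mu_o, var_o, mu_n, var_n)
        for (mu_o, var_o), (mu_n, var_n) in zip(observed_stats, noise_stats)
    ]
    return sum(distances) / len(distances)


def is_speech(observed_stats, noise_stats, threshold):
    """Decision rule with a fixed threshold (the letter adapts it to the noise)."""
    return average_subband_kl(observed_stats, noise_stats) > threshold
```

When the observed statistics match the noise model the divergence is zero, and it grows as the subband means or variances move away from the noise estimate.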
Phoneme alignment is the task of proper positioning of a sequence of phonemes in relation to a corresponding continuous speech signal. This problem is also referred to as phoneme segmentation. An accurate and fast alignment procedure is a necessary tool for developing speech recognition and text-to-speech systems.
This work describes the optimization of a signal processing frontend for a distributed speech recognition system with the goal of reducing power consumption. Two categories of source code optimizations were used, architectural and algorithmic. Architectural optimizations reduce the power consumption for a particular system, in this case, the HP Labs Smartbadge IV prototype portable system. Algorithmic optimizations are more general and involve changes in the algorithmic implementation of the source code to run faster and consume less power. A cycle accurate energy simulation shows a reduction in power usage by 83.5% with these optimizations. The optimized source code runs 34 times faster than the original code, therefore it can run at lower processor clock speeds and voltages for further reductions in power consumption. This technique, known as dynamic voltage scaling, was implemented on the Smartbadge IV hardware for an overall reduction in power usage of 89.2%.
Distributed Speech Recognition (DSR) systems rely on efficient transmission of speech information from distributed clients to a centralized server. Wireless or network communication channels within DSR systems are typically noisy and bursty. Thus, DSR systems must utilize efficient Error Recovery (ER) schemes during transmission of speech information. Some ER strategies, referred to as forward error control (FEC), aim to create redundancy in the source coded bitstream to overcome the effect of channel errors, while others are designed to create spread or delay in the feature stream in order to overcome the effect of bursty channel errors. Furthermore, ER strategies may be designed as a combination of the previously described techniques. This chapter presents an array of error recovery techniques for remote speech recognition applications.
This paper studies the effect of Bluetooth wireless channels on distributed speech recognition. An approach for implementing speech recognition over Bluetooth is described. We simulate a Bluetooth environment and then incorporate its performance, in the form of packet loss ratio, into the speech recognition system. We show how intelligent framing of speech feature vectors, extracted by a fixed-point arithmetic front-end, together with an interpolation technique for lost vectors, can lead to a 50.48% relative improvement in recognition accuracy. This is achieved at a distance of 10 meters, around the maximum operating distance between a Bluetooth transmitter and a Bluetooth receiver.
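Interpolating lost feature vectors can be sketched as below. This is a minimal example assuming linear interpolation between the nearest received neighbours, with lost frames at the edges filled by repeating the closest received vector. The abstract does not state which interpolation the authors use, and the function name and the use of `None` to mark lost frames are assumptions.

```python
def interpolate_lost(frames):
    """Fill lost feature vectors (marked as None) by linear interpolation.

    frames: list of feature vectors (lists of floats) or None for a lost frame.
    Returns a new list in which every frame is a feature vector.
    """
    received = [i for i, f in enumerate(frames) if f is not None]
    if not received:
        raise ValueError("at least one frame must have been received")
    restored = list(frames)
    for i, frame in enumerate(frames):
        if frame is not None:
            continue
        prev_idx = max((r for r in received if r < i), default=None)
        next_idx = min((r for r in received if r > i), default=None)
        if prev_idx is None:
            restored[i] = list(frames[next_idx])  # leading loss: repeat next
        elif next_idx is None:
            restored[i] = list(frames[prev_idx])  # trailing loss: repeat previous
        else:
            w = (i - prev_idx) / (next_idx - prev_idx)
            restored[i] = [
                (1.0 - w) * a + w * b
                for a, b in zip(frames[prev_idx], frames[next_idx])
            ]
    return restored
```

For example, a single lost frame between `[0.0, 2.0]` and `[2.0, 4.0]` is restored as their midpoint.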
This paper investigates a new front-end processing that aims at improving the performance of speech recognition in noisy mobile environments. This approach combines features based on conventional Mel-cepstral Coefficients (MFCCs), Line Spectral Frequencies (LSFs) and formant-like (FL) features to constitute robust multivariate feature vectors. The resulting front-end constitutes an alternative to the DSR-XAFE (XAFE: eXtended Audio Front-End) available in GSM mobile communications. Our results showed that for highly noisy speech, using the paradigm that combines these spectral cues leads to a significant improvement in recognition accuracy on the Aurora 2 task.
This paper shows a revised statistical test for voice activity detection in noise adverse environments. The method is based on a revised contextual likelihood ratio test (LRT) defined over a multiple observation window. The new approach evaluates not only the two hypotheses in which all the observations are speech or non-speech, but all the possible hypotheses defined over the individual observations. The implicit hangover mechanism artificially added by the original method was not found in the revised method, so its design can be further improved. With these and other innovations, the proposed method showed high speech/non-speech discrimination over a wide range of SNR conditions. The experimental framework showed that the revised method yields significant improvements over standardized VADs for discontinuous voice transmission and distributed speech recognition, as well as over recently reported methods.
A robust and effective voice activity detection (VAD) algorithm is proposed for improving speech recognition performance in noisy environments. The approach relies on well-known statistical tests based on the determination of the speech/non-speech bispectra by means of third-order auto-cumulants. This algorithm differs from many others in the way the decision rule is formulated: the statistical tests are built on a multiple observation (MO) window consisting of averaged bispectrum coefficients of the speech signal. Clear improvements in speech/non-speech discrimination accuracy demonstrate the effectiveness of the proposed VAD. It is shown that application of a statistical detection test leads to a better separation of the speech and noise distributions, thus allowing a more effective discrimination and a tradeoff between complexity and performance. The experimental analysis carried out on the AURORA 3 databases provides an extensive performance evaluation together with an exhaustive comparison to the standard VADs, such as ITU G.729, GSM AMR, and ETSI AFE, for distributed speech recognition (DSR) and other recently reported VADs.
In this paper we present the first application of Independent Component Analysis (ICA) to Voice Activity Detection (VAD). The accuracy of a multiple observation-likelihood ratio test (MO-LRT) VAD is improved by transforming the set of observations to a new set of independent components. Clear improvements in speech/non-speech discrimination accuracy for low false alarm rate demonstrate the effectiveness of the proposed VAD. It is shown that the use of this new set leads to a better separation of the speech and noise distributions, thus allowing a more effective discrimination and a tradeoff between complexity and performance. The algorithm is optimum in those scenarios where the loss of speech frames could be unacceptable, causing a system failure. The experimental analysis carried out on the AURORA 3 databases and tasks provides an extensive performance evaluation together with an exhaustive comparison to the standard VADs such as ITU G.729, GSM AMR and ETSI AFE for distributed speech recognition (DSR), and other recently reported VADs.
Distributed speech recognition (DSR) is an interesting technology for mobile recognition tasks in which the recognizer is split into two parts connected by a transmission channel. We compare the performance of standard and hybrid modeling approaches in this environment. The evaluation is done on clean and noisy speech samples taken from the TI digits and the AURORA database. Our results show that the hybrid modeling techniques can outperform standard continuous systems on this task.
Distributed Speech Recognition (DSR) systems rely on efficient transmission of speech information from distributed clients to a centralized server. Wireless or network communication channels within DSR systems are typically noisy and bursty. Thus, DSR systems must utilize efficient Error Recovery (ER) schemes during transmission of speech information. Some ER strategies, referred to as forward error control (FEC), aim to create redundancy in the source coded bitstream to overcome the effect of channel errors, while others are designed to create spread or delay in the feature stream in order to overcome the effect of bursty channel errors. Furthermore, ER strategies may be designed as a combination of the previously described techniques. This chapter presents an array of error recovery techniques for remote speech recognition applications.
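The idea of creating spread in the feature stream to counter bursty errors can be illustrated with a block interleaver. The following is a minimal sketch, not the scheme from the chapter: the function names `interleave` and `deinterleave` and the parameter `depth` are hypothetical, and frames are represented by plain list items, with `None` standing for a lost frame.

```python
def interleave(frames, depth):
    """Block interleaver (illustrative): fill rows of length `depth`,
    then transmit the block column by column."""
    if depth <= 0 or len(frames) % depth != 0:
        raise ValueError("len(frames) must be a positive multiple of depth")
    rows = len(frames) // depth
    return [frames[r * depth + c] for c in range(depth) for r in range(rows)]


def deinterleave(frames, depth):
    """Inverse of interleave(): put every received item back in its original slot."""
    if depth <= 0 or len(frames) % depth != 0:
        raise ValueError("len(frames) must be a positive multiple of depth")
    rows = len(frames) // depth
    out = [None] * len(frames)
    i = 0
    for c in range(depth):
        for r in range(rows):
            out[r * depth + c] = frames[i]
            i += 1
    return out
```

With 12 frames and `depth=4`, a burst that wipes out three consecutive transmitted items is turned into three isolated single-frame losses after de-interleaving, which are easier to conceal by interpolation. The price is the extra delay of buffering one full block on each side.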
In this paper we apply a model-based compensation method to cancel the effect of the additive noise in Automatic Speech Recognition systems. The method is formulated in a statistical framework in order to perform the optimal compensation of the noise effect given the ...
An effective speech event detector is presented in this work for improving the performance of speech processing systems working in noisy environments. The proposed method is based on a trained support vector machine (SVM) that defines an optimized non-linear decision rule involving the subband SNRs of the input speech. The classification rule is analyzed in the input space, together with the ability of the SVM model to learn how the signal is masked by the background noise. The algorithm also incorporates a noise reduction block working in tandem with the voice activity detector (VAD) that has been shown to be very effective in high noise environments. The experimental analysis carried out on the Spanish SpeechDat-Car database shows clear improvements over standard VADs including ITU G.729, ETSI AMR and ETSI AFE for distributed speech recognition (DSR), and other recently reported VADs.
In this paper, two approaches are proposed for robust distributed speech recognition (DSR): (1) sub-frame interleaving for burst packet loss concealment and (2) reference model weighting (RMW) for fast noise environment compensation. The proposed methods were evaluated on the Aurora2 database and compared with the European Telecommunications Standards Institute (ETSI) DSR standard ES 202 212. Experimental results show that the average recognition rate of nine simulated burst packet-loss ...
This paper presents a computational complexity estimate of a distributed speech recognition front-end, compliant with ETSI Standard ES 202 212 and implemented at system level in SystemC. This estimate identifies which blocks of the terminal front-end are the most computationally expensive, and may therefore be useful in a hardware implementation for deciding which blocks to realize with low-power ad hoc hardware.
The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, so the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors.
An effective voice activity detection (VAD) algorithm is proposed for improving speech recognition performance in noisy environments. The approach is based on the determination of the speech/nonspeech divergence by means of specialized order statistics filters (OSFs) working on the subband log-energies. This algorithm differs from many others in the way the decision rule is formulated. Instead of making the decision based on the current frame alone, it uses OSFs on the subband log-energies, which significantly reduces the error probability when discriminating speech from nonspeech in a noisy signal. Clear improvements in speech/nonspeech discrimination accuracy demonstrate the effectiveness of the proposed VAD. It is shown that an increase of the OSF order leads to a better separation of the speech and noise distributions, thus allowing a more effective discrimination and a tradeoff between complexity and performance. The algorithm also incorporates a noise reduction block working in tandem with the VAD, which is shown to further improve its accuracy. A previous noise reduction block also improves the accuracy in detecting speech and nonspeech. The experimental analysis carried out on the AURORA databases and tasks provides an extensive performance evaluation together with an exhaustive comparison to the standard VADs such as ITU G.729, GSM AMR, and ETSI AFE for distributed speech recognition (DSR), and other recently reported VADs.
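A decision rule built on order statistics over neighbouring frames, rather than on the current frame alone, can be sketched as follows. This is only an illustration under stated assumptions, not the published algorithm: the median is used as the order statistic, the divergence is taken as the mean difference between the filtered subband log-energies and a given noise estimate, and the names `osf_vad`, `noise_floor`, `order` and `threshold_db` are hypothetical.

```python
from statistics import median


def osf_vad(log_energies, noise_floor, order=2, threshold_db=3.0):
    """Illustrative VAD decision using a median filter over time.

    log_energies: list of frames, each a list of subband log-energies (dB).
    noise_floor:  per-subband noise log-energy estimate (dB), assumed known.
    order:        number of neighbouring frames used on each side.
    Returns one boolean per frame (True = speech).
    """
    n_frames = len(log_energies)
    n_bands = len(noise_floor)
    decisions = []
    for t in range(n_frames):
        lo = max(0, t - order)
        hi = min(n_frames, t + order + 1)
        # Order statistic (here: the median) of each subband over the window.
        filtered = [
            median(log_energies[k][b] for k in range(lo, hi))
            for b in range(n_bands)
        ]
        # Mean subband distance between the filtered energies and the noise estimate.
        divergence = sum(f - n for f, n in zip(filtered, noise_floor)) / n_bands
        decisions.append(divergence > threshold_db)
    return decisions
```

Because the median ignores isolated outliers, a single high-energy noise frame does not trigger a speech decision, whereas a sustained rise in energy does. A larger `order` widens the window, which mirrors the complexity/performance tradeoff mentioned in the abstract.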
The purpose of this work is to demonstrate that distributed speech recognition front-ends can be deployed in environments which provide very little power and CPU resources, with possibly no degradation of speech recognition quality when compared to standard floating-point implementations. The ETSI distributed speech recognition front-end standard is implemented on an ultra low-power miniature DSP system. The efficient implementation of the ETSI algorithm components, i.e. feature extraction, feature compression and multi-framing, is accomplished through the use of three processing units running concurrently. In addition to a DSP core, an input/output processor creates frames of input speech signals, and a weighted overlap-add (WOLA) filterbank unit performs windowing, FFT and vector multiplications. System evaluation using the TI digits database shows that the performance of the ultra low-power DSP system is equivalent to the reference implementation provided by ETSI.
This paper describes the main components of MiPad (Multimodal Interactive PAD) and especially its distributed speech processing aspects. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution for data entry in PDAs or smart phones, often done by pecking with tiny styluses or typing on minuscule keyboards. Our user study indicates that the throughput of MiPad is significantly superior to that of the existing pen-based PDA interface.
Communication devices which perform distributed speech recognition (DSR) tasks currently transmit standardized coded parameters of speech signals. Recognition features are extracted from signals reconstructed using these on a remote server. Since reconstruction losses degrade recognition performance, proposals are being considered to standardize DSR-codecs which derive recognition features, to be transmitted and used directly for recognition. However, such a codec must be embedded on the transmitting ...
A new application of bispectrum analysis is presented in this paper. A set of bispectrum estimators is proposed for a robust and effective voice activity detection (VAD) algorithm that improves speech recognition performance in noisy environments. The approach is based on filtering the input channel to avoid high-energy noisy components and then determining the speech/non-speech bispectra by means of third-order auto-cumulants. This algorithm differs from many others in the way the decision rule is formulated (detection tests) and in the domain used. Clear improvements in speech/non-speech discrimination accuracy demonstrate the effectiveness of the proposed VAD. It is shown that application of a statistical detection test leads to a better separation of the speech and noise distributions, thus allowing a more effective discrimination and a tradeoff between complexity and performance. The algorithm also incorporates a preceding noise reduction block that improves the accuracy in detecting speech and non-speech. The experimental analysis carried out on the AURORA databases and tasks provides an extensive performance evaluation together with an exhaustive comparison to the standard VADs such as ITU G.729, GSM AMR and ETSI AFE for distributed speech recognition (DSR), and other recently reported VADs.
This paper presents a fuzzy logic speech/non-speech discrimination method for improving the performance of speech processing systems working in noisy environments. The fuzzy system is based on a Sugeno inference engine with membership functions defined as a combination of two Gaussian functions. The rule base consists of ten fuzzy if-then statements defined in terms of the denoised subband signal-to-noise ratios (SNRs) and the zero crossing rates (ZCRs). Its operation is optimized by means of a hybrid training algorithm combining the least-squares method and the backpropagation gradient descent method for training membership function parameters. The experiments conducted on the Spanish SpeechDat-Car database show that the proposed method yields clear improvements over a set of standardized VADs for discontinuous transmission (DTX) and distributed speech recognition (DSR) and also over recently published VAD methods.
This paper addresses the problem of information and service accessibility in mobile devices with limited resources. A solution is developed and tested through a prototype that applies state-of-the-art Distributed Speech Recognition (DSR) and knowledge-based Information ...
In this paper, the performance of the pitch detection algorithm in the ETSI ES-202-212 XAFE standard is evaluated on a Mandarin digit string recognition task. Experimental results showed that the performance of the pitch detection algorithm degraded seriously when the SNR of the speech signal was lower than 10 dB. As a result, the recognizer using pitch information performs worse than the original recognizer without pitch information in low SNR environments. A modification of the pitch detection algorithm is therefore proposed to improve the performance of pitch detection in low SNR environments. The recognition performance can be improved for most SNR levels by integrating the recognizers with and without pitch information. Overall recognition rates of 82.1% and 86.8% were achieved for clean and multi-condition training cases.
A robust and effective voice activity detection (VAD) algorithm is proposed for improving speech recognition performance in noisy environments. The approach is based on filtering the input channel to avoid high-energy noisy components and then determining the speech/non-speech bispectra by means of third-order auto-cumulants. This algorithm differs from many others in the way the decision rule is formulated (detection tests) and in the domain used. Clear improvements in speech/non-speech discrimination accuracy demonstrate the effectiveness of the proposed VAD. It is shown that application of a statistical detection test leads to a better separation of the speech and noise distributions, thus allowing a more effective discrimination and a tradeoff between complexity and performance. The algorithm also incorporates a preceding noise reduction block that improves the accuracy in detecting speech and non-speech. The experimental analysis carried out on the AURORA databases and tasks provides an extensive performance evaluation together with an exhaustive comparison to the standard VADs such as ITU G.729, GSM AMR and ETSI AFE for distributed speech recognition (DSR), and other recently reported VADs.
This paper presents an improved statistical test for voice activity detection in noise-adverse environments. The method is based on a revised contextual likelihood ratio test (LRT) defined over a multiple observation window. The motivations for revising the original multiple observation LRT (MO-LRT) are found in its artificially added hangover mechanism, which exhibits an incorrect behavior under different signal-to-noise ratio (SNR) conditions. The new approach defines a maximum a posteriori (MAP) statistical test in which all the global hypotheses on the multiple observation window containing up to one speech-to-nonspeech or nonspeech-to-speech transition are considered. Thus, the implicit hangover mechanism artificially added by the original method was not found in the revised method, so its design can be further improved. With these and other innovations, the proposed method showed a higher speech/nonspeech discrimination accuracy over a wide range of SNR conditions when compared to the original MO-LRT voice activity detector (VAD). Experiments conducted on the AURORA databases and tasks showed that the revised method yields significant improvements in speech recognition performance over standardized VADs such as ITU-T G.729 and ETSI AMR for discontinuous voice transmission and the ETSI AFE for distributed speech recognition (DSR), as well as over recently reported methods.
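For orientation, the multiple observation idea behind the original MO-LRT can be sketched as a sum of per-frame log-likelihood ratios over a window centred on the frame under test. The sketch below is an assumption-laden illustration and does not implement the revised MAP test described in the abstract: it assumes a Gaussian model of the spectral coefficients, takes the a posteriori SNRs (`gammas`) and a priori SNRs (`xis`) as given inputs, and uses hypothetical names (`frame_llr`, `mo_lrt`, `m`, `eta`).

```python
import math


def frame_llr(gamma, xi):
    """Mean log-likelihood ratio of one frame over its frequency bins,
    assuming Gaussian-distributed spectral coefficients (an assumption here).
    gamma: a posteriori SNR per bin, xi: a priori SNR per bin."""
    terms = [g * x / (1.0 + x) - math.log(1.0 + x) for g, x in zip(gamma, xi)]
    return sum(terms) / len(terms)


def mo_lrt(gammas, xis, t, m, eta=0.0):
    """Multiple-observation decision for frame t: sum the per-frame
    log-likelihood ratios over frames t-m .. t+m and compare with eta.
    Returns (is_speech, score)."""
    lo = max(0, t - m)
    hi = min(len(gammas), t + m + 1)
    score = sum(frame_llr(gammas[k], xis[k]) for k in range(lo, hi))
    return score > eta, score
```

Summing over `2 * m + 1` frames sharpens the separation between the two hypotheses, but it also makes frames near a speech offset look like speech. That is the implicit hangover behavior the abstract refers to.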