Xiaodong Cui - Academia.edu
Papers by Xiaodong Cui
We present a back-end solution developed at Texas Instruments for noise robust speech recognition. The solution consists of three techniques: 1) a joint additive and convolutive noise compensation (JAC) which adapts speech acoustic models, 2) an enhanced channel estimation procedure which extends JAC performance towards lower SNR ranges, and 3) an N-pass decoding algorithm. The performance of the proposed back-end is evaluated on the Aurora-2 database. With 20% fewer model parameters and without the need for second-order derivatives of the recognition features, the proposed solution achieves 91.86% accuracy, outperforming the ETSI Advanced Front-End standard (88.19%) by more than 30% relative word error rate reduction.
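The abstract doesn't spell out the JAC update itself, but the familiar log-domain model composition that underlies such joint additive-and-convolutive schemes can be sketched as follows. This is a hypothetical illustration, assuming full-length cepstra and an orthonormal DCT, not the paper's exact procedure:

```python
import numpy as np
from scipy.fftpack import dct, idct

def jac_compensate_mean(mu_cep, h_cep, n_cep):
    """Compensate a clean cepstral mean mu for channel h and additive
    noise n via the standard log-domain composition:
        mu_hat = C log( exp(C^-1 (mu + h)) + exp(C^-1 n) )
    All inputs are full-length cepstral vectors (an assumption; real
    front ends truncate the cepstra, which makes C^-1 approximate).
    """
    to_logmel = lambda c: idct(c, norm='ortho')   # cepstrum -> log-Mel
    to_cep = lambda m: dct(m, norm='ortho')       # log-Mel  -> cepstrum
    speech = np.exp(to_logmel(mu_cep + h_cep))    # channel-shifted speech
    noise = np.exp(to_logmel(n_cep))              # additive noise
    return to_cep(np.log(speech + noise))
```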
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 1, 2008
... able to learn the statistical relationship between clean and noisy speech signals directly from the data for denoising, requiring ... We discussed the mathematical connections between the proposed MMSE mapping and other piece-wise linear algorithms known in noise robust ...
The performance of speech recognition systems trained in quiet degrades significantly under noisy conditions. To address this problem, a Weighted Viterbi Recognition (WVR) algorithm that is a function of the SNR of each speech frame is proposed. The acoustic models trained on clean data and the acoustic front-end features are kept unchanged in this approach. Instead, a confidence/robustness factor is assigned to the output observation probability of each speech frame according to its SNR estimate during the Viterbi decoding stage. Comparative experiments are conducted with WVR using different front-end features: MFCC, LPCC, and PLP. Results show consistent improvements with all three feature vectors. For a reasonable amount of adaptation data, WVR outperforms environment adaptation using MLLR.
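A minimal sketch of the frame-weighting idea: during Viterbi decoding, each frame's emission log-likelihood is scaled by a confidence factor derived from its SNR estimate. The linear SNR-to-confidence mapping and its 0-20 dB range are assumptions for illustration; the abstract does not give the exact weighting function.

```python
import numpy as np

def snr_confidence(snr_db, lo=0.0, hi=20.0):
    """Map a frame-level SNR estimate (dB) to a confidence weight in [0, 1]."""
    return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

def weighted_viterbi(log_emis, log_trans, log_init, snrs_db):
    """Viterbi decoding with per-frame weighting of emission scores.

    log_emis : (T, N) log observation probabilities log b_j(o_t)
    log_trans: (N, N) log transition probabilities
    log_init : (N,)   log initial state probabilities
    snrs_db  : (T,)   per-frame SNR estimates
    """
    T, N = log_emis.shape
    delta = log_init + snr_confidence(snrs_db[0]) * log_emis[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        gamma_t = snr_confidence(snrs_db[t])   # b_j(o_t)^gamma in log domain
        scores = delta[:, None] + log_trans    # (N, N) predecessor scores
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + gamma_t * log_emis[t]
    # Backtrack the best state sequence.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```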
In this paper, an MLLR-like adaptation approach is proposed whereby the transformation of the means is performed deterministically based on linearization of VTLN. Biases and adaptation of the variances are estimated statistically by the EM algorithm. In the discrete frequency domain, we show that under certain approximations, frequency warping with Mel-filterbank-based MFCCs equals a linear ...
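To make the "warping equals a linear transformation" claim concrete, here is a hypothetical sketch that builds such a cepstral-domain matrix for a one-parameter warp by conjugating a log-Mel interpolation matrix with the DCT. The paper's own derivation works on the Mel filter bank under its stated approximations and may differ in detail:

```python
import numpy as np
from scipy.fftpack import dct

def vtln_cepstral_matrix(n_bins, alpha):
    """Linear cepstral transform implied by warping the log-Mel axis:
    T = C . W . C^-1, so that c_warped = T @ c.

    n_bins : number of Mel channels (cepstra kept full-length here)
    alpha  : warp factor; output bin i reads from position i * alpha
    """
    # W: linear interpolation of log-Mel bins at warped positions.
    src = np.clip(np.arange(n_bins) * alpha, 0, n_bins - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = src - lo
    W = np.zeros((n_bins, n_bins))
    W[np.arange(n_bins), lo] += 1.0 - frac
    W[np.arange(n_bins), hi] += frac
    # Conjugate by the orthonormal DCT so the warp acts on cepstra.
    C = dct(np.eye(n_bins), norm='ortho', axis=0)
    return C @ W @ C.T   # C is orthogonal, so C^-1 = C.T
```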
2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004
Acoustic models trained with clean speech signals suffer in the presence of background noise. In some situations, only a limited amount of noisy data from the new environment is available with which to adapt the clean models. A feature compensation approach employing polynomial regression of the signal-to-noise ratio (SNR) is proposed in this paper. While clean acoustic models remain unchanged, a bias which is a polynomial function of utterance SNR is estimated and removed from the noisy features. Depending on the amount of noisy data available, the algorithm can be carried out flexibly at different levels of granularity. Based on the Euclidean distance, the similarity between the residual distribution and the clean models is estimated and used as the confidence factor in a back-end Weighted Viterbi Decoding (WVD) algorithm. With limited amounts of noisy data, the feature compensation algorithm outperforms Maximum Likelihood Linear Regression (MLLR) on the Aurora2 database. Weighted Viterbi decoding further improves recognition accuracy.
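A minimal sketch of the SNR-polynomial bias compensation, assuming the polynomial is fit by ordinary least squares to utterance-level bias statistics (the abstract doesn't state the estimation procedure, and `mean_biases` here presumes paired clean/noisy data from which biases can be measured):

```python
import numpy as np

def fit_snr_bias_polynomial(utt_snrs, mean_biases, order=2):
    """Least-squares fit of the feature bias as a polynomial in SNR.

    utt_snrs    : (U,)   utterance-level SNR estimates (dB)
    mean_biases : (U, D) mean noisy-minus-clean feature offset per utterance
    Returns (order+1, D) coefficients, highest power first.
    """
    return np.polyfit(np.asarray(utt_snrs), np.asarray(mean_biases), deg=order)

def compensate(features, utt_snr, coeffs):
    """Remove the predicted SNR-dependent bias from one utterance's features."""
    powers = utt_snr ** np.arange(len(coeffs) - 1, -1, -1)   # [snr^k ... 1]
    bias = powers @ coeffs                                   # (D,) bias
    return features - bias
```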
2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008
This paper presents our recent development of the real-time speech recognition component in the IBM English/Iraqi Arabic speech-to-speech translation system for the DARPA Transtac project. We describe the details of the acoustic and language modeling that lead to high recognition accuracy and noise robustness, and give the performance of the system on the evaluation sets of spontaneous conversational speech. We also introduce the streaming decoding structure and several speedup techniques that achieve the best recognition accuracy at about 0.3×RT recognition speed.
2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006
Variance variation with respect to a continuous environment-dependent variable is investigated in this paper in a variable parameter Gaussian mixture HMM (VP-GMHMM) for noisy speech recognition. The variation is modeled by a scaling polynomial applied to the ...
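The abstract is cut off, but the stated idea of a scaling polynomial applied to the variances could look roughly like the following hypothetical sketch (the paper's exact parameterization is not given here):

```python
import numpy as np

def scaled_variance(base_var, env_value, poly_coeffs):
    """Scale a Gaussian variance vector by a polynomial of a continuous
    environment variable (e.g., utterance SNR), floored to stay positive.

    base_var    : (D,) baseline variance vector
    env_value   : scalar environment variable
    poly_coeffs : (K,) polynomial coefficients, highest power first
    """
    scale = np.polyval(poly_coeffs, env_value)
    return base_var * max(float(scale), 1e-6)
```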
2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006
While vocal tract resonances (VTRs, or formants that are defined as such resonances) are known to play a critical role in human speech perception and in computer speech processing, there has been a lack of standard databases needed for the quantitative evaluation of automatic VTR extraction techniques. We report in this paper on our recent effort to create a publicly available database of the first three VTR frequency trajectories. The database contains a representative subset of the TIMIT corpus with respect to speaker, gender, dialect and phonetic context, with a total of 538 sentences. A Matlab-based labeling tool is developed, with high-resolution wideband spectrograms displayed to assist in visual identification of VTR frequency values, which are then recorded via mouse clicks and local spline interpolation. Special attention is paid to VTR values during consonant-to-vowel (CV) and vowel-to-consonant (VC) transitions, and to speech segments with vocal tract anti-resonances. Using this database, we quantitatively assess two common automatic VTR tracking techniques in terms of their average tracking errors analyzed within each of the six major broad phonetic classes as well as during CV and VC transitions. The potential use of the VTR database for research in several areas of speech processing is discussed.
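The labeling tool itself is Matlab-based; the sketch below shows the interpolation step in Python, turning clicked (time, frequency) anchors into a frame-rate trajectory. `CubicSpline` stands in for whatever local spline the tool uses:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_vtr_track(click_times, click_freqs, frame_times):
    """Expand manually clicked anchor points into a frame-rate VTR
    frequency trajectory. click_times must be strictly increasing."""
    t = np.asarray(frame_times, dtype=float)
    spline = CubicSpline(click_times, click_freqs)
    inside = (t >= click_times[0]) & (t <= click_times[-1])
    track = np.full(t.shape, np.nan)   # NaN outside the labeled span
    track[inside] = spline(t[inside])
    return track
```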
2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013
Keyword search, in the context of low resource languages, has emerged as a key area of research. The dominant approach in keyword search is to use Automatic Speech Recognition (ASR) as a front end to produce a representation of audio that can be indexed. The biggest drawback of this approach lies in its inability to deal with out-of-vocabulary words and query terms that are not in the ASR system output. In this paper we present an empirical study evaluating various approaches based on using confusion models as query expansion techniques to address this problem. We present results across four languages using a range of confusion models which lead to significant improvements in keyword search performance as measured by the Maximum Term Weighted Value (MTWV) metric.
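One simple instance of confusion-model query expansion, limited to single-phone substitutions for brevity; the paper evaluates a range of richer confusion models:

```python
def expand_query(phones, confusions, min_prob=0.1):
    """Generate single-substitution variants of a query's phone sequence.

    phones     : list of phones for the query term
    confusions : {ref_phone: {hyp_phone: prob}} confusion model
    Returns (variant, score) pairs, the original query first.
    """
    variants = [(tuple(phones), 1.0)]
    for i, p in enumerate(phones):
        for q, prob in confusions.get(p, {}).items():
            if q != p and prob >= min_prob:
                v = list(phones)
                v[i] = q
                variants.append((tuple(v), prob))
    return sorted(variants, key=lambda kv: -kv[1])
```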
2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010
We present a comparative study on combination schemes for large vocabulary continuous speech recognition by incorporating long-span class posterior probability features into conventional short-time cepstral features. System combination can improve the overall speech recognition performance when multiple systems exhibit different error patterns and multiple knowledge sources encode complementary information. A variety of combination approaches are investigated in this paper: a feature-concatenation single-stream system, a model-combination multi-stream system, lattice rescoring, and ROVER. These techniques work at different levels of an LVCSR system and have different computational costs. We compared their performance and analyzed their advantages and disadvantages on large vocabulary English broadcast news transcription tasks. Experimental results showed that model combination with independent trees consistently outperforms ROVER, feature concatenation, and lattice rescoring. In addition, the phoneme posterior probability features do provide complementary information to short-time cepstral features.
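Two of the four schemes reduce to very small operations, sketched below; lattice rescoring and ROVER need lattice and alignment machinery omitted here. The equal stream weights are placeholders, not values from the paper:

```python
import numpy as np

def concat_features(cepstra, posteriors):
    """Feature concatenation: one single-stream system over the stacked
    (T, D1) cepstral and (T, D2) posterior feature matrices."""
    return np.hstack([cepstra, posteriors])

def multistream_score(loglike_cep, loglike_post, w=0.5):
    """Model combination: log-linear interpolation of the two streams'
    acoustic log-likelihoods with stream weight w."""
    return w * loglike_cep + (1.0 - w) * loglike_post
```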
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
HMM-based acoustic models built from bootstrap are generally very large, especially when full covariance matrices are used for the Gaussians. Therefore, clustering is needed to compact the acoustic model to a reasonable size for practical applications. This paper discusses and investigates multiple distance measures and algorithms for the clustering. The distance measures include Entropy, KL, Bhattacharyya, Chernoff, and their weighted versions. For clustering algorithms, besides conventional greedy bottom-up clustering, algorithms such as N-Best distance Refinement (NBR), K-step Look-Ahead (KLA), and Breadth-First Search (BFS) best path are proposed. A two-pass optimization approach is also proposed to improve the model structure. Experiments in the Bootstrap and Restructuring (BSRS) framework on Pashto show that the discussed clustering approach can lead to better quality of the restructured model. They also show that the final acoustic model diagonalized from the full covariance yields a good improvement over the BSRS model built directly with diagonal covariance, and a significant improvement over the conventional diagonal model.
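As an example of the distance measures involved, here is the Bhattacharyya distance between two full-covariance Gaussians; the weighted variants in the paper add occupancy-based terms not shown in this sketch:

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two full-covariance Gaussians:
    D = 1/8 d^T S^-1 d + 1/2 ln(|S| / sqrt(|S1||S2|)), S = (S1+S2)/2."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    maha = 0.125 * diff @ np.linalg.solve(cov, diff)
    # slogdet avoids overflow/underflow of determinants in high dimensions.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return maha + 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
```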
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
Automatic speech recognition is a core component of many applications, including keyword search. In this paper we describe experiments on acoustic modeling, language modeling, and decoding for keyword search on a Cantonese conversational telephony corpus collected as part of the IARPA Babel program. We show that acoustic modeling techniques such as the bootstrapped-and-restructured model and the deep neural network acoustic model significantly outperform a state-of-the-art baseline GMM/HMM model, in terms of both recognition performance and keyword search performance, with improvements of up to 11% relative character error rate reduction and 31% relative maximum term weighted value improvement. We show that while an interpolated Model M and neural network LM improve recognition performance, they do not improve keyword search results; however, the advanced LM does reduce the size of the keyword search index. Finally, we show that a simple form of automatically adapted keyword search performs 16% better than a preindexed search system, indicating that out-of-vocabulary search is still a challenge.
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. In this paper, we investigate the problem of extending data fusion methodologies from Information Retrieval for Spoken Term Detection on low-resource languages in the framework of the IARPA Babel program. We describe a number of alternative methods that improve keyword search performance. We apply these methods to Cantonese, a language that presents some new issues in terms of reduced resources and shorter query lengths. First, we show a score normalization methodology that improves keyword search performance by 20% on average. Second, we show that properly combining the outputs of diverse ASR systems performs 14% better than the best normalized ASR system.
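The abstract doesn't name the normalization; keyword-wise sum-to-one normalization is one widely used scheme in spoken term detection and serves as an illustrative sketch here (the `gamma` exponent and hit representation are assumptions):

```python
from collections import defaultdict

def sum_to_one_normalize(hits, gamma=1.0):
    """Keyword-wise sum-to-one normalization of keyword search scores.

    hits : list of (keyword, location, score) detections
    Each score is divided by the total (gamma-exponentiated) score of
    all detections of the same keyword, making scores comparable across
    keywords of very different posterior mass.
    """
    totals = defaultdict(float)
    for kw, _, score in hits:
        totals[kw] += score ** gamma
    return [(kw, loc, (score ** gamma) / totals[kw])
            for kw, loc, score in hits]
```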
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
We present a system for keyword search on Cantonese conversational telephony audio, collected for the IARPA Babel program, that achieves good performance by combining postings lists produced by diverse speech recognition systems from three different research groups. We describe the keyword search task, the data on which the work was done, four different speech recognition systems, and our approach to system combination for keyword search. We show that the combination of four systems outperforms the best single system by 7%, achieving an actual term-weighted value of 0.517.
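One simple way to merge postings lists is weighted score summation (CombSUM-style), sketched below; the actual combination likely matches detections by overlapping time spans rather than the exact-location key assumed here:

```python
from collections import defaultdict

def combine_postings(system_hits, weights=None):
    """Merge keyword-search postings lists from several systems.

    system_hits : list of per-system hit lists of (keyword, location, score)
    weights     : optional per-system weights (uniform if omitted)
    Co-located detections of the same keyword have their weighted
    scores summed.
    """
    weights = weights or [1.0] * len(system_hits)
    combined = defaultdict(float)
    for w, hits in zip(weights, system_hits):
        for kw, loc, score in hits:
            combined[(kw, loc)] += w * score
    return [(kw, loc, s) for (kw, loc), s in combined.items()]
```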
IEEE International Conference on Acoustics Speech and Signal Processing, 2002
This paper proposes an efficient algorithm for the automatic selection of sentences given a desired phoneme distribution. The algorithm is based on the Kullback-Leibler measure under the criterion of minimum cross-entropy. One application of this algorithm is the design of adaptation text for automatic speech recognition with a particular phoneme distribution. The algorithm is efficient and flexible, especially in the case of limited text size. Experimental results verify the advantage of this approach.
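A hypothetical greedy rendering of the minimum-cross-entropy selection (the abstract doesn't specify the paper's exact search procedure):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions given as nonnegative vectors."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def greedy_select(sentence_counts, target_dist, k):
    """Greedily pick k sentences whose cumulative phoneme distribution is
    closest in KL divergence to the target distribution.

    sentence_counts : (S, P) phoneme count vector per candidate sentence
    target_dist     : (P,)   desired phoneme distribution
    """
    chosen = []
    counts = np.zeros_like(target_dist, dtype=float)
    remaining = set(range(len(sentence_counts)))
    for _ in range(k):
        best = min(remaining, key=lambda i:
                   kl_divergence(target_dist, counts + sentence_counts[i]))
        chosen.append(best)
        counts += sentence_counts[best]
        remaining.remove(best)
    return chosen
```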
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
In this paper we investigate stereo-based stochastic mapping (SSM) with context for the noise robustness of automatic speech recognition, especially under unseen conditions. Probabilistic PCA (PPCA) is used in the SSM framework to reduce the high dimensionality of the noisy speech features with context and to derive an eigen representation in the noisy feature space for the prediction of clean features. To reduce the computational cost in training, an approximation by single-pass re-training is considered for the estimation of the joint GMM. We also show that the SSM estimate under the minimum mean square error (MMSE) criterion, in a space where a low-dimensional representation of clean speech and uncorrelated additive noise can be assumed, is related to subspace speech enhancement. Experiments on large vocabulary continuous speech recognition tasks show gains from the proposed approach under seen, unseen, and real noise conditions.
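The MMSE estimate from a joint (clean, noisy) GMM has a closed form; the sketch below shows it without the PPCA dimensionality reduction described above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def ssm_mmse_estimate(y, weights, mu_x, mu_y, cov_xy, cov_yy):
    """MMSE clean-feature prediction from a joint (clean x, noisy y) GMM:
    x_hat = sum_k p(k|y) [mu_x_k + C_xy_k C_yy_k^{-1} (y - mu_y_k)].

    y      : (Dy,)       noisy feature vector (with context) for one frame
    weights: (K,)        mixture weights
    mu_x   : (K, Dx)     clean-part means;   mu_y  : (K, Dy) noisy-part means
    cov_xy : (K, Dx, Dy) cross-covariances;  cov_yy: (K, Dy, Dy)
    """
    K = len(weights)
    log_post = np.array([np.log(weights[k]) +
                         multivariate_normal.logpdf(y, mu_y[k], cov_yy[k])
                         for k in range(K)])
    post = np.exp(log_post - log_post.max())   # stable softmax over components
    post /= post.sum()
    x_hat = np.zeros(mu_x.shape[1])
    for k in range(K):
        x_hat += post[k] * (mu_x[k] +
                            cov_xy[k] @ np.linalg.solve(cov_yy[k], y - mu_y[k]))
    return x_hat
```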