Kaisheng Yao | Microsoft Research

Papers by Kaisheng Yao

7th International Conference on Spoken Language Processing (ICSLP 2002)

In this paper, we present evaluation results for a noise-adaptive speech recognition system that combines several techniques for robust speech recognition. The evaluation used the AURORA 3 database, which contains noisy digit utterances collected in real car environments through close-talking and hands-free microphones. The techniques in the system include segmentation, maximum likelihood linear regression (MLLR), and non-stationary environment compensation by noise-adaptive speech recognition. In the experiments, the system showed competitive performance improvements over the provided baseline results in all evaluations. Overall, the system achieved a 28% relative performance improvement.

Interspeech 2013, 2013

Recurrent Neural Network Language Models (RNN-LMs) have recently shown exceptional performance across a variety of applications. In this paper, we modify the architecture to perform Language Understanding, and advance the state-of-the-art for the widely used ATIS dataset. The core of our approach is to take words as input as in a standard RNN-LM, and then to predict slot labels rather than words on the output side. We present several variations that differ in the amount of word context that is used on the input side, and in the use of non-lexical features. Remarkably, our simplest model produces state-of-the-art results, and we advance state-of-the-art through the use of bag-of-words, word embedding, named-entity, syntactic, and word-class features. Analysis indicates that the superior performance is attributable to the task-specific word representations learned by the RNN.
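
As a rough illustration of the idea described in this abstract (words on the input side, slot labels rather than words on the output side), the sketch below runs the forward pass of a simple Elman-style recurrent tagger. The vocabulary, the slot label names, the layer sizes, and the random untrained weights are all hypothetical placeholders rather than the paper's configuration, and training is not shown.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ElmanSlotTagger:
    """Elman-style RNN: h_t = sigmoid(E[w_t] + R h_{t-1}), y_t = softmax(O h_t)."""

    def __init__(self, vocab_size, hidden_size, num_labels, seed=0):
        rng = random.Random(seed)
        # A one-hot word times the input matrix is just a row lookup.
        self.embed = rand_matrix(vocab_size, hidden_size, rng)
        self.recur = rand_matrix(hidden_size, hidden_size, rng)
        self.out = rand_matrix(num_labels, hidden_size, rng)
        self.hidden_size = hidden_size

    def forward(self, word_ids):
        """Return one slot-label distribution per input word."""
        hidden = [0.0] * self.hidden_size
        posteriors = []
        for word in word_ids:
            recurrent = matvec(self.recur, hidden)
            hidden = [
                1.0 / (1.0 + math.exp(-(e + r)))
                for e, r in zip(self.embed[word], recurrent)
            ]
            posteriors.append(softmax(matvec(self.out, hidden)))
        return posteriors


# Hypothetical toy vocabulary and ATIS-style label names.
vocab = {"flights": 0, "from": 1, "boston": 2, "to": 3, "seattle": 4}
labels = ["O", "B-fromloc", "B-toloc"]

tagger = ElmanSlotTagger(len(vocab), hidden_size=8, num_labels=len(labels))
sentence = ["flights", "from", "boston", "to", "seattle"]
posteriors = tagger.forward([vocab[w] for w in sentence])
predicted = [labels[max(range(len(p)), key=p.__getitem__)] for p in posteriors]
```

Because the weights are random, the predicted labels are meaningless here; the point is only the shape of the computation, with one label distribution emitted per input word.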

Interspeech 2015, 2015

In this work we present intermediate-layer deep neural network (DNN) adaptation techniques upon which we build offline as well as iterative speaker adaptation for online applications. We motivate our online work for task completion in Microsoft's personal voice assistant, where we present different adaptation styles in a speech session, e.g., (a) adapt the speaker-independent (SI) model on the current utterance, (b) recursively adapt an incremental speaker-dependent (SD) model in the session for just the previous utterance, (c) adapt the SI model for all past utterances in the session. We considered a number of adaptation techniques and demonstrated that the intermediate-layer approach of inserting and adapting a linear layer on top of an intermediate singular-value-decomposition layer provides the best results for offline adaptation, where we obtained 22.6% and 12% relative reduction in word error rate (WER) for supervised and unsupervised adaptation, respectively, on 100 utterances. An alternative intermediate-layer recursive adaptation in a 5-utterance session provided a 6% relative reduction in WER for online applications.

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

This paper presents a novel interactive method for recognizing handwritten words, using the inertial sensor data available on smart watches. The goal is to allow the user to write with a finger, and use the smart watch sensor signals to infer what the user has written. Past work has exploited the similarity of handwriting recognition to speech recognition in order to deploy HMM-based methods. In contrast to speech recognition, however, in our scenario, the user can see the individual letters that are recognized on a sequential basis, and provide feedback or corrections after each letter. In this paper, we exploit this key difference to improve the input mechanism over a classical source-channel model. For a small increase in the amount of time required to input a word, we improve recognition accuracy from 59.6% to 91.4% with an implicit feedback mechanism, and to 100% with an explicit feedback mechanism.

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

A new type of deep neural network (DNN) is presented in this paper. Traditional DNNs use multinomial logistic regression (softmax activation) at the top layer for classification. The new DNN instead uses a support vector machine (SVM) at the top layer. Two training algorithms are proposed, at the frame level and the sequence level, to learn the parameters of the SVM and DNN under the maximum-margin criterion. In frame-level training, the new model is shown to be related to the multiclass SVM with DNN features; in sequence-level training, it is related to the structured SVM with DNN features and HMM state transition features. Its decoding process is similar to that of the DNN-HMM hybrid system but with frame-level posterior probabilities replaced by scores from the SVM. We term the new model the deep neural support vector machine (DNSVM). We have verified its effectiveness on the TIMIT task for continuous speech recognition.
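
To make the contrast with a softmax top layer concrete, the sketch below scores one frame with a linear top layer and evaluates a multiclass hinge loss on those scores. It uses the standard Crammer-Singer form of the multiclass SVM loss as an assumption; the abstract does not give the paper's exact objective, and the feature vector, weights, and margin value are made-up illustrations.

```python
def linear_scores(weights, features):
    """One raw score per class; these replace softmax posteriors."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def multiclass_hinge(scores, target, margin=1.0):
    """Crammer-Singer hinge: penalize the best wrong class if it comes
    within `margin` of the correct class score."""
    best_rival = max(s for j, s in enumerate(scores) if j != target)
    return max(0.0, margin + best_rival - scores[target])


# Hypothetical top-hidden-layer features for one frame and a 3-class top layer.
features = [0.5, -1.0, 2.0]
weights = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.25],
]

scores = linear_scores(weights, features)      # [1.5, -1.0, 0.0]
loss_if_class0 = multiclass_hinge(scores, 0)   # 0.0: margin already satisfied
loss_if_class2 = multiclass_hinge(scores, 2)   # 2.5: class 0 outscores class 2
```

The loss is zero once the correct class leads every rival by the margin, so training effort concentrates on frames near the decision boundary rather than on pushing posteriors toward 1.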

2014 IEEE Spoken Language Technology Workshop (SLT), 2014

Neural network based approaches have recently produced record-setting performances in natural language understanding tasks such as word labeling. In the word labeling task, a tagger is used to assign a label to each word in an input sequence. Specifically, simple recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been shown to significantly outperform the previous state-of-the-art, conditional random fields (CRFs). This paper investigates using long short-term memory (LSTM) neural networks, which contain input, output and forgetting gates and are more advanced than simple RNNs, for the word labeling task. To explicitly model output-label dependence, we propose a regression model on top of the LSTM un-normalized scores. We also propose to apply deep LSTM to the task. We investigated the relative importance of each gate in the LSTM by setting other gates to a constant and only learning particular gates. Experiments on the ATIS dataset validated the effectiveness of the proposed models.

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015

Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU). In this study, we propose to use recurrent neural networks (RNNs) for this task, and present several novel architectures designed to efficiently model past and future temporal dependencies. Specifically, we implemented and compared several important RNN architectures, including Elman, Jordan and hybrid variants. To facilitate reproducibility, we implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark. In addition, we compared the approaches on two custom SLU data sets from the entertainment and movies domains. Our results show that the RNN-based models outperform the conditional random field (CRF) baseline by 2% in absolute error reduction on the ATIS benchmark. We improve the state-of-the-art by 0.5% in the entertainment domain, and 6.7% for the movies domain. Index Terms: spoken language understanding, word embedding, recurrent neural network, slot filling. I. INTRODUCTION. The term "spoken language understanding" (SLU) refers to the targeted understanding of human speech directed at machines [1]. The goal of such "targeted" understanding is to convert the recognition of user input into a task-specific semantic representation of the user's intention, at each turn. The dialog manager then interprets and decides on the most appropriate system action, exploiting semantic context, user-specific meta-information, such as geo-location and personal preferences, and other contextual information.

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvement of these techniques and future research directions are discussed.

2012 IEEE Spoken Language Technology Workshop (SLT), 2012

In this paper, we evaluate the effectiveness of adaptation methods for context-dependent deep-neural-network hidden Markov models (CD-DNN-HMMs) for automatic speech recognition. We investigate the affine transformation and several of its variants for adapting the top hidden layer. We compare the affine transformations against direct adaptation of the softmax layer weights. The feature-space discriminative linear regression (fDLR) method with the affine transformations on the input layer is also evaluated. On a large vocabulary speech recognition task, a stochastic gradient ascent implementation of the fDLR and the top hidden layer adaptation is shown to reduce word error rates (WERs) by 17% and 14%, respectively, compared to the baseline DNN performance. With a batch update implementation, the softmax layer adaptation technique reduces WERs by 10%. We observe that using bias shift performs as well as doing scaling plus bias shift.

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

We propose a novel regularized adaptation technique for context-dependent deep neural network hidden Markov models (CD-DNN-HMMs). The CD-DNN-HMM has a large output layer and many large hidden layers, each with thousands of neurons. The huge number of parameters in the CD-DNN-HMM makes adaptation a challenging task, especially when the adaptation set is small. The technique developed in this paper adapts the model conservatively by forcing the senone distribution estimated from the adapted model to be close to that from the unadapted model. This constraint is realized by adding Kullback-Leibler divergence (KLD) regularization to the adaptation criterion. We show that applying this regularization is equivalent to changing the target distribution in the conventional backpropagation algorithm. Experiments on Xbox voice search, short message dictation, and Switchboard and lecture speech transcription tasks demonstrate that the proposed adaptation technique can provide 2%-30% relative error reduction against the already very strong speaker-independent CD-DNN-HMM systems using different adaptation sets under both supervised and unsupervised adaptation setups.
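
The statement that KLD regularization amounts to changing the backpropagation target can be sketched in a few lines. The form assumed here, which the abstract itself does not spell out, is a linear interpolation between the adaptation-data alignment target and the unadapted (speaker-independent) model's posterior, controlled by a weight called `rho` in this sketch; the three-senone example values are made up.

```python
import math


def kld_regularized_targets(hard_targets, si_posteriors, rho):
    """Interpolate alignment targets with the unadapted model's posteriors.

    rho = 0 trusts the adaptation data entirely; rho = 1 keeps the
    unadapted model's output distribution as the target.
    """
    return [(1.0 - rho) * t + rho * p for t, p in zip(hard_targets, si_posteriors)]


def cross_entropy(targets, predicted):
    """Standard cross-entropy; backpropagation is otherwise unchanged."""
    return -sum(t * math.log(p) for t, p in zip(targets, predicted) if t > 0.0)


# Hypothetical frame with three senones.
hard_targets = [0.0, 1.0, 0.0]    # one-hot label from the alignment
si_posteriors = [0.2, 0.7, 0.1]   # output of the unadapted model
adapted_output = [0.1, 0.8, 0.1]  # current output of the model being adapted

soft_targets = kld_regularized_targets(hard_targets, si_posteriors, rho=0.5)
loss = cross_entropy(soft_targets, adapted_output)
```

With `rho=0.5` the target for the labeled senone drops from 1.0 to 0.85 and the remaining mass follows the unadapted model, which is what keeps a small adaptation set from pulling the output distribution too far.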

Neurocomputing, 2014

We describe a novel maximum likelihood nonlinear feature bias compensation method for Gaussian mixture model-hidden Markov model (GMM-HMM) adaptation. Our approach exploits a single-hidden-layer neural network (SHLNN) that, similar to the extreme learning machine (ELM), uses randomly generated lower-layer weights and linear output units. Different from the conventional ELM, however, our approach optimizes the SHLNN parameters by maximizing the likelihood of observing the features given the speaker-independent GMM-HMM. We derive a novel and efficient learning algorithm for optimizing this criterion. We show, on a large vocabulary speech recognition task, that the proposed approach can cut the word error rate (WER) by 13% over the feature maximum likelihood linear regression (fMLLR) method with bias compensation, and can cut the WER by more than 5% over the fMLLR method with both bias and rotation transformations if applied on top of the fMLLR. Overall, it can reduce the WER by more than 27% over the speaker-independent system with 0.2 real-time adaptation time.

We present a sequential Monte Carlo method applied to additive noise compensation for robust speech recognition in time-varying noise. The method generates a set of samples according to the prior distribution given by clean speech models and a noise prior evolved from the previous estimate. An explicit model representing noise effects on speech features is used, so that an extended Kalman filter is constructed for each sample, generating the updated continuous state estimate as the estimate of the noise parameter, and the prediction likelihood for weighting each sample. Minimum mean square error (MMSE) inference of the time-varying noise parameter is carried out over these samples by fusing the sample estimates according to their weights. A residual resampling selection step and a Metropolis-Hastings smoothing step are used to improve calculation efficiency. Experiments were conducted on speech recognition in simulated non-stationary noises, where noise power changed artificially, and in highly non-stationary machine-gun noise. In all the experiments carried out, we observed that the method can significantly improve recognition performance over that achieved by noise compensation with a stationary noise assumption.
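
Of the steps named in this abstract, the residual resampling selection step is the easiest to show in isolation. The sketch below is a generic textbook version, not the authors' implementation: each sample is first copied a deterministic number of times according to the integer part of its expected count, and only the leftover slots are drawn at random. The weights are made up, and the per-sample extended Kalman filter and the Metropolis-Hastings smoothing step are not shown.

```python
import math
import random


def residual_resample(weights, rng):
    """Return indices of the samples that survive resampling."""
    n = len(weights)
    total = sum(weights)
    expected = [n * w / total for w in weights]  # expected copies per sample

    # Deterministic part: keep floor(expected) copies of each sample.
    counts = [math.floor(e) for e in expected]
    indices = [i for i, c in enumerate(counts) for _ in range(c)]

    # Random part: fill the remaining slots in proportion to the residuals.
    remaining = n - len(indices)
    if remaining > 0:
        residuals = [e - c for e, c in zip(expected, counts)]
        indices.extend(rng.choices(range(n), weights=residuals, k=remaining))
    return indices


# Hypothetical normalized weights for four samples.
weights = [0.5, 0.25, 0.125, 0.125]
survivors = residual_resample(weights, random.Random(0))
```

For these weights, sample 0 is always kept twice and sample 1 once; only the fourth slot is random, going to sample 2 or 3. Handling most copies deterministically gives the selection step lower variance than drawing every slot at random.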
