Hugo Van hamme | Katholieke Universiteit Leuven
Papers by Hugo Van hamme
Interspeech 2014
In this paper, we investigate unsupervised acoustic model training approaches for dysarthric-speech recognition. These models are: first, frame-based Gaussian posteriorgrams obtained from Vector Quantization (VQ); second, so-called Acoustic Unit Descriptors (AUDs), which are hidden Markov models of phone-like units trained in an unsupervised fashion; and third, posteriorgrams computed on the AUDs. Experiments were carried out on a database collected from a home automation task and containing nine speakers, of which seven are considered to utter dysarthric speech. All unsupervised modeling approaches delivered significantly better recognition rates than a speaker-independent phoneme recognition baseline, showing the suitability of unsupervised acoustic model training for dysarthric speech. While the AUD models led to the most compact representation of an utterance for the subsequent semantic inference stage, posteriorgram-based representations resulted in higher recognition rates, with the Gaussian posteriorgram achieving the highest slot filling F-score of 97.02%.
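To make the posteriorgram idea concrete, here is a minimal sketch, assuming scikit-learn's GaussianMixture as a stand-in for the paper's VQ codebook; the feature dimensions, codebook size and data are illustrative, not taken from the paper.

```python
# Minimal sketch: frame-based Gaussian posteriorgrams from an
# unsupervised codebook (GaussianMixture as a stand-in for VQ).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 39))   # e.g. MFCC frames (T x D)

# Train the codebook without any transcriptions (unsupervised).
codebook = GaussianMixture(n_components=64, covariance_type='diag',
                           random_state=0).fit(features)

# Posteriorgram: per-frame posterior probability of each Gaussian.
posteriorgram = codebook.predict_proba(features)   # shape (T, 64)
assert np.allclose(posteriorgram.sum(axis=1), 1.0)
```

Each row sums to one, giving a normalized frame representation that a downstream semantic inference stage can consume.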
Interspeech 2004, 2004
Missing data theory has been applied to the problem of speech recognition in adverse environments. The resulting systems require acoustic models that are expressed in the spectral rather than in the cepstral domain, which leads to loss of accuracy. Cepstral Missing Data Techniques (CMDT) surmount this disadvantage, but require significantly more computation. In this paper, we study alternatives to the cepstral representation that lead to more efficient MDT systems. The proposed solution, PROSPECT features (Projected Spectra), can be interpreted as a novel speech representation, or as an approximation of the inverse covariance (precision) matrix of the Gaussian distributions modeling the log-spectra.
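As a hedged illustration of that precision-matrix reading (our notation and decomposition, not necessarily the paper's exact construction): modeling truncated cepstra c = Dx with a diagonal Gaussian induces a full, structured precision matrix in the log-spectral domain, and a cheaper structured approximation of it recovers efficient spectral-domain evaluation.

```latex
% Diagonal cepstral model viewed from the log-spectral domain:
\log p(x) \;=\; \mathrm{const} \;-\; \tfrac{1}{2}\,(Dx - \mu_c)^{\top} \Sigma_c^{-1} (Dx - \mu_c),
\qquad P \;=\; D^{\top} \Sigma_c^{-1} D .
% P is full; a structured approximation such as
P \;\approx\; \Lambda \;+\; V^{\top} \Gamma\, V
\qquad (\Lambda,\ \Gamma\ \text{diagonal})
% keeps spectral-domain likelihood evaluation cheap.
```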
Interspeech 2006, 2006
Missing Data Techniques have already shown their effectiveness in dealing with additive noise in automatic speech recognition systems. For real-life deployments, a compensation for linear filtering distortions is also required. Channel compensation in speech recognition typically involves estimating an additive shift in the log-spectral or cepstral domain. This paper explores a maximum likelihood technique to estimate this model offset while some data are missing. Recognition experiments on the Aurora2 recognition task demonstrate the effectiveness of this technique. In particular, we show that our method is more accurate than previously published methods and can handle narrow-band data.
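One plausible closed form consistent with this description (a sketch under our assumptions of diagonal Gaussians and a fixed frame-to-Gaussian alignment; the paper's estimator may differ): restricting the sums to the set R of reliable entries, the per-dimension ML offset is an inverse-variance weighted mean of the reliable residuals,

```latex
\hat{h}_d \;=\;
\frac{\displaystyle\sum_{t\,:\,(t,d)\in R} \frac{x_{t,d}-\mu_{t,d}}{\sigma_{t,d}^{2}}}
     {\displaystyle\sum_{t\,:\,(t,d)\in R} \frac{1}{\sigma_{t,d}^{2}}}
```

where μ_{t,d} and σ²_{t,d} come from the Gaussian aligned to frame t; alignment and offset estimation can then be iterated EM-style.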
Interspeech 2009, 2009
The application of Missing Data Theory (MDT) has been shown to improve the robustness of automatic speech recognition (ASR) systems. A crucial part of an MDT-based recognizer is the computation of the reliability masks from noisy data. To estimate accurate masks in environments with unknown, non-stationary noise statistics, we need to rely on a strong model for the speech. In this paper, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed using a high-resolution and reassigned time-frequency representation. This representation facilitates an accurate detection of the patches that are active in unseen noisy speech. After further denoising of the patch activations, speech and noise can be reconstructed, from which missing feature masks are estimated. Recognition experiments on the Aurora2 database demonstrate the effectiveness of this technique.
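The decomposition at the heart of this approach is KL-divergence NMF. A minimal sketch with the standard multiplicative updates follows; matrix sizes and data are illustrative.

```python
# Minimal sketch of KL-divergence NMF with multiplicative updates,
# the decomposition underlying the patch discovery described above.
import numpy as np

def nmf_kl(V, rank, iters=200, eps=1e-9, seed=0):
    """Factorize non-negative V (F x T) as W (F x rank) @ H (rank x T)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((40, 300)))  # |spectrogram|
W, H = nmf_kl(V, rank=30)   # W: time-frequency patches, H: activations
```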
Interspeech 2015, 2015
We present a novel automatic speech recognition (ASR) scheme which uses the recently proposed noise robust exemplar matching framework for speech enhancement in the front-end. The proposed system employs a GMM-HMM back-end to recognize the enhanced speech signals, unlike the prior work focusing on template matching only. Speech enhancement is achieved using multiple dictionaries, each containing speech exemplars representing a single speech unit and several noise exemplars of the same length. These combined dictionaries are used to approximate the noisy segments, and the speech component is obtained as a linear combination of the speech exemplars in the combined dictionary yielding the minimum total reconstruction error. The performance of the proposed system is evaluated on the small vocabulary track of the 2nd CHiME Challenge and the AURORA-2 database, and the results show the effectiveness of the proposed approach in improving the noise robustness of a conventional ASR system.
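A minimal sketch of the matching step, assuming non-negative activations found by multiplicative KL updates (the solver, dictionary sizes and data are illustrative stand-ins): the noisy segment is approximated over one combined speech-plus-noise dictionary; in the full system this is repeated per speech-unit dictionary and the lowest total reconstruction error decides the recognized unit.

```python
# Minimal sketch: approximate a noisy segment with one combined
# speech + noise dictionary and keep the speech part.
import numpy as np

def activations_kl(A, y, iters=200, eps=1e-9, seed=0):
    """Non-negative x minimizing the generalized KLD between y and A @ x."""
    rng = np.random.default_rng(seed)
    x = rng.random(A.shape[1]) + eps
    for _ in range(iters):
        x *= (A.T @ (y / (A @ x + eps))) / (A.sum(axis=0) + eps)
    return x

rng = np.random.default_rng(2)
n_speech, n_noise, dim = 15, 10, 200
speech_ex = rng.random((dim, n_speech))   # exemplars of one speech unit
noise_ex = rng.random((dim, n_noise))     # noise exemplars, same length
A = np.hstack([speech_ex, noise_ex])      # one combined dictionary
y = rng.random(dim) + 1e-6                # reshaped noisy segment

x = activations_kl(A, y)
speech_est = speech_ex @ x[:n_speech]     # speech part of the reconstruction
recon = A @ x
residual = np.sum(y * np.log(y / (recon + 1e-9)) - y + recon)  # total error
```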
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016
Exemplar-based acoustic modeling is based on labeled training segments that are compared with the unseen test utterances with respect to a dissimilarity measure. Using a larger number of accurately labeled exemplars provides better generalization and thus improved recognition performance, but comes with increased computation and memory requirements. We have recently developed a noise robust exemplar matching-based automatic speech recognition system which uses a large number of undercomplete dictionaries containing speech exemplars of the same length and label to recognize noisy speech. In this work, we investigate several speech exemplar selection techniques proposed for undercomplete speech dictionaries to find a trade-off between the recognition accuracy and the acoustic model size in terms of the number of speech exemplars used for recognition. The exemplar selection criterion has to be chosen carefully, as the amount of redundancy in these dictionaries is very limited compared to overcomplete dictionaries containing plenty of exemplars. The recognition accuracies obtained on the small vocabulary track of the 2nd CHiME Challenge and the AURORA-2 database using the complete and pruned dictionaries are compared to investigate the performance of each selection criterion.
2015 23rd European Signal Processing Conference (EUSIPCO), 2015
Exemplar-based feature enhancement successfully exploits a wide temporal signal context. We extend this technique with hybrid input spaces that are chosen for a more effective separation of speech from background noise. This work investigates the use of two different hybrid input spaces which are formed by incorporating the full-resolution and modulation envelope spectral representations with the Mel features. A coupled output dictionary containing Mel exemplars, which are jointly extracted with the hybrid space exemplars, is used to reconstruct the enhanced Mel features for the ASR back-end. When compared to the system which uses Mel features only as input exemplars, these hybrid input spaces are found to yield improved word error rates on the AURORA-2 database, especially with unseen noise cases.
Speech Communication, 2016
The noise robust exemplar matching (N-REM) framework performs automatic speech recognition using exemplars, which are labeled spectrographic representations of speech segments extracted from training data. By incorporating a sparse representations formulation, this technique remedies the inherent noise modeling problem of conventional exemplar matching-based automatic speech recognition systems. In this framework, noisy speech segments are approximated as a sparse linear combination of exemplars of multiple lengths, each associated with a single speech unit such as words, half-words or phones. On account of the reconstruction error-based back-end, the recognition accuracy depends highly on the congruence of the speech features and the divergence metric used to compare the speech segments with exemplars. In this work, we replace the conventional Kullback-Leibler divergence (KLD) with a generalized divergence family called the Alpha-Beta (AB) divergence, with two parameters, α and β, in conjunction with mel-scaled magnitude spectral features. The proposed recognizer traverses the (α,β) plane depending on the amount of contamination to provide better separation of speech and noise sources. Moreover, we apply our recently proposed active noise exemplar selection (ANES) technique in a more realistic scenario where the target utterances are degraded by genuine room noise. Recognition experiments on the small vocabulary track of the 2nd CHiME Challenge and the AURORA-2 database show that the novel recognizer with the AB divergence and ANES outperforms the baseline system using the generalized KLD with tuned sparsity, especially at lower SNR levels.
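For reference, the Alpha-Beta divergence family in the form given by Cichocki, Cruces and Amari (limits are taken where α, β or α + β is zero):

```latex
D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) \;=\;
-\frac{1}{\alpha\beta}\sum_{i}\Big( p_i^{\alpha} q_i^{\beta}
\;-\; \frac{\alpha}{\alpha+\beta}\, p_i^{\alpha+\beta}
\;-\; \frac{\beta}{\alpha+\beta}\, q_i^{\alpha+\beta} \Big),
\qquad \alpha,\ \beta,\ \alpha+\beta \neq 0 .
```

Setting (α, β) = (1, 1) recovers the squared Euclidean distance, (1, 0) the generalized KLD, (1, −1) the Itakura-Saito divergence (as a limit) and (½, ½) a scaled squared Hellinger distance, matching the special cases mentioned in these abstracts.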
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015
Exemplar-based speech enhancement systems work by decomposing the noisy speech as a weighted sum of speech and noise exemplars stored in a dictionary, and use the resulting speech and noise estimates to obtain a time-varying filter in the full-resolution frequency domain to enhance the noisy speech. To obtain the decomposition, exemplars sampled in lower dimensional spaces are preferred over the full-resolution frequency domain for their reduced computational complexity and their ability to better generalize to unseen cases. But the resulting filter may be sub-optimal, as the mapping of the obtained speech and noise estimates to the full-resolution frequency domain yields a low-rank approximation. This paper proposes an efficient way to directly compute the full-resolution frequency estimates of speech and noise using coupled dictionaries: an input dictionary containing atoms from the desired exemplar space to obtain the decomposition, and a coupled output dictionary containing exemplars from the full-resolution frequency domain. We also introduce modulation spectrogram features for the exemplar-based tasks using this approach. The proposed system was evaluated for various choices of input exemplars and yielded improved speech enhancement performance on the AURORA-2 and AURORA-4 databases. We further show that the proposed approach also results in improved word error rates (WERs) for speech recognition tasks using HMM-GMM and deep neural network (DNN) based systems.
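A minimal sketch of the coupled-dictionary pipeline (dictionary contents, dimensions and the NNLS solver are illustrative stand-ins for the paper's sparse decomposition): activations are found in the cheap input space, then mapped through the coupled full-resolution dictionary to build the filter.

```python
# Minimal sketch: decompose in a low-dimensional input space, then
# reconstruct speech/noise in the full-resolution domain through a
# coupled output dictionary and apply a Wiener-like gain.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
F_mel, F_full, N = 26, 257, 40          # input dim, output dim, # exemplars
S_in  = rng.random((F_mel,  N))         # speech exemplars, input (Mel) space
N_in  = rng.random((F_mel,  N))         # noise exemplars, input space
S_out = rng.random((F_full, N))         # coupled speech exemplars, STFT space
N_out = rng.random((F_full, N))         # coupled noise exemplars, STFT space

y_mel  = rng.random(F_mel)              # noisy frame, input space
y_stft = rng.random(F_full)             # same frame, full-resolution space

# 1) decompose in the cheap input space
x, _ = nnls(np.hstack([S_in, N_in]), y_mel)

# 2) map activations through the coupled output dictionary
speech_full = S_out @ x[:N]
noise_full  = N_out @ x[N:]

# 3) Wiener-like time-varying filter in the full-resolution domain
gain = speech_full / (speech_full + noise_full + 1e-9)
enhanced = gain * y_stft
```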
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
Deep neural network (DNN) based acoustic modelling has been successfully used for a variety of automatic speech recognition (ASR) tasks, thanks to its ability to learn higher-level information using multiple hidden layers. This paper investigates the recently proposed exemplar-based speech enhancement technique using coupled dictionaries as a pre-processing stage for DNN-based systems. In this setting, the noisy speech is decomposed as a weighted sum of atoms in an input dictionary containing exemplars sampled from a domain of choice, and the resulting weights are applied to a coupled output dictionary containing exemplars sampled in the short-time Fourier transform (STFT) domain to directly obtain the speech and noise estimates for speech enhancement. In this work, settings using input dictionaries of exemplars sampled from the STFT, Mel-integrated magnitude STFT and modulation envelope spectra are evaluated. Experiments performed on the AURORA-4 database reveal that these pre-processing stages can improve the performance of DNN-HMM-based ASR systems with both clean and multi-condition training.
In missing feature theory (MFT), noise robustness of speech recognizers is obtained by modifying the likelihood computed by the acoustic model to express that some features extracted from the signal are unreliable or missing. In one implementation of MFT, the acoustic model and bounds on the unreliable features are used to infer an estimate of the missing data. This paper addresses an observed bias of the likelihood evaluated at the estimate. Theoretical and experimental evidence is provided that an upper bound on the accuracy is improved by applying a computationally simple correction for the number of free variables in the likelihood maximization.
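To make the free-variable intuition concrete, here is one computationally simple correction consistent with this description, our own hedged sketch rather than the paper's exact formula: for a diagonal Gaussian, an unreliable dimension that is maximized (rather than observed) contributes 0 instead of the expected 1/2 to the negative quadratic term, so subtracting 1/2 per free variable deflates the bias:

```latex
\tilde{\mathcal{L}} \;=\; \max_{x_U \in B}\ \log \mathcal{N}\!\big(x_R, x_U \,;\, \mu, \Sigma\big) \;-\; \tfrac{1}{2}\,\lvert U \rvert ,
```

where R and U index the reliable and unreliable features and B collects the bounds on the unreliable ones.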
2014 IEEE Spoken Language Technology Workshop (SLT), 2014
We propose a novel exemplar-based feature enhancement method for automatic speech recognition which uses coupled dictionaries: an input dictionary containing atoms sampled in the modulation (envelope) spectrogram domain and an output dictionary with atoms in the Mel or full-resolution frequency domain. The input modulation representation is chosen for its separation properties of speech and noise and for its relation to human auditory processing. The output representation is one which can be processed by the ASR back-end. The proposed method was investigated on the AURORA-2 and AURORA-4 databases, and improved word error rates (WERs) were obtained when compared to the system which uses Mel features in the input exemplars. The paper also proposes a hybrid system which combines the baseline and the proposed algorithm on the AURORA-2 database, which yielded improvements over both algorithms.
The speech reception threshold (SRT) is the noise level at which the speech recognition rate of a test person is 50%. SRT measurement is relevant for patient screening, psychoacoustic research and algorithm development in hearing aids and cochlear implants. In this paper, we report on our efforts to automate SRT measurement using an automatic speech recognizer. During a test, sentences are presented to the test subject at different SNR levels. The person under test repeats the sentence and the keywords it contains are scored by an audiologist. If all keywords are repeated correctly, the sentence is evaluated as correct. The SNR level of each sentence is adjusted based on the previous sentence's evaluation. Aiming for an objective and repeatable measurement, the audiologist's assessment is replaced by an automatic speech recognizer's evaluation. For this purpose, we investigate different finite state transducer structures to model the expected sentences, as well as the impact of several speaker adaptation schemes on the keyword detection accuracy. A baseline recognizer using general acoustic models achieves a keyword detection rate of 88.8%. Speaker-adapted acoustic models improve the performance, yielding a keyword detection accuracy of up to 90.7%. Finally, the impact of recognition errors on the estimated SRT value is simulated, showing a minimal impact on the SRT measurement process. Based on this analysis, we conclude that the proposed automatic evaluation scheme is a viable tool for speech reception threshold measurements.
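A minimal simulation of the adaptive procedure described above (the step size, list length and logistic listener model are illustrative assumptions): SNR decreases after a correct sentence and increases after an error, so the track hovers around the 50% point.

```python
# Minimal sketch of an adaptive 1-up/1-down SRT track with a
# simulated logistic listener. All parameters are illustrative.
import numpy as np

def run_srt_test(n_sentences=20, start_snr=0.0, step=2.0,
                 true_srt=-7.0, spread=1.5, seed=0):
    """Simulate the sentence-by-sentence track; return the SRT estimate."""
    rng = np.random.default_rng(seed)
    snr, levels = start_snr, []
    for _ in range(n_sentences):
        levels.append(snr)
        # logistic listener: P(all keywords repeated correctly) at this SNR
        p_correct = 1.0 / (1.0 + np.exp(-(snr - true_srt) / spread))
        correct = rng.random() < p_correct
        snr += -step if correct else step   # converge on the 50% point
    return float(np.mean(levels[4:]))       # drop the approach phase

print(f"estimated SRT: {run_srt_test():.1f} dB SNR")
```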
In this work we describe research aimed at developing an assistive vocal interface for users with a speech impairment. In contrast to existing approaches, the vocal interface is self-learning, which means it is maximally adapted to the end-user and can be used with any language, dialect, vocabulary and grammar. The paper describes the overall learning framework and the vocabulary acquisition technique, and proposes a novel grammar induction technique based on weakly supervised hidden Markov model learning. We evaluate early implementations of these vocabulary and grammar learning components on two datasets: recorded sessions of a vocally guided card game by non-impaired speakers and speech-impaired users engaging in a home automation task.
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
In this paper, we investigate the performance of a noise-robust sparse representations (SR)-based recognizer using the Alpha-Beta (AB) divergence to compare the noisy speech segments and exemplars. The baseline recognizer, which approximates noisy speech segments as a linear combination of speech and noise exemplars of variable length, uses the generalized Kullback-Leibler divergence to quantify the approximation quality. With a reconstruction error-based back-end, the recognition performance depends highly on the congruence of the divergence measure and the speech features used. Having two tuning parameters, namely α and β, the AB-divergence provides improved robustness against background noise and outliers. These parameters can be adjusted for better performance depending on the distribution of speech and noise exemplars in the high-dimensional feature space. Moreover, various well-known distance/divergence measures such as the Euclidean distance, generalized Kullback-Leibler divergence, Itakura-Saito divergence and Hellinger distance are special cases of the AB-divergence for different (α, β) values. The goal of this work is to investigate the optimal divergence for mel-scaled magnitude spectral features by performing recognition experiments at several SNR levels using different (α, β) pairs. The results demonstrate the effectiveness of the AB-divergence compared to the generalized Kullback-Leibler divergence, especially at the lower SNR levels.
Theory and Applications of Natural Language Processing, 2012
Robust Speech Recognition of Uncertain or Missing Data, 2011
Speech Communication, 2009
We present a self-learning algorithm using a bottom-up approach to automatically discover, acquire and recognize the words of a language. First, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed for static and dynamic speech features using a spectral representation of both short and long acoustic events. By describing speech in terms of the discovered time-frequency patches, patch activations are obtained which express to what extent each patch is present across time. We then show that speaker-independent patterns appear to recur in these patch activations and how they can be discovered by applying a second NMF-based algorithm on the co-occurrence counts of activation events. By providing information about the word identity to the learning algorithm, the retrieved patterns can be associated with meaningful objects of the language. In the case of a small vocabulary task, the system is able to learn patterns corresponding to words and subsequently detects the presence of these words in speech utterances. Without the prior expert knowledge about speech that conventional automatic speech recognition requires, we illustrate that the learning algorithm achieves a promising accuracy and noise robustness.
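A minimal sketch of the second, word-discovery stage, assuming scikit-learn's KL-divergence NMF as the solver; the co-occurrence counts here are synthetic stand-ins for activation-event statistics.

```python
# Minimal sketch: factorize utterance-level co-occurrence counts of
# patch-activation events to uncover recurring (word-like) patterns.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
n_utt, n_events, n_words = 200, 100, 9   # utterances, event pairs, words

# synthetic counts: each utterance mixes a few latent word patterns
word_patterns = rng.random((n_words, n_events))
usage = rng.random((n_utt, n_words)) * (rng.random((n_utt, n_words)) < 0.3)
counts = rng.poisson(5 * usage @ word_patterns).astype(float)

model = NMF(n_components=n_words, init='nndsvda',
            beta_loss='kullback-leibler', solver='mu',
            max_iter=500, random_state=0)
utt_by_word = model.fit_transform(counts)   # which patterns occur where
patterns = model.components_                # recurring word-like patterns
```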
IEEE Journal of Selected Topics in Signal Processing, 2010
An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing), and to replace (impute) the missing ones by clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low SNRs these techniques fail, because too many time frames may contain few, if any, reliable features. In this paper we introduce a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing. The method, dubbed sparse imputation, can impute missing features using larger time windows such as entire words. Using an overcomplete dictionary of clean speech exemplars, the method finds the sparsest combination of exemplars that jointly approximate the reliable features of a noisy utterance. That linear combination of clean speech exemplars is used to replace the missing features. Recognition experiments on noisy isolated digits show that sparse imputation outperforms conventional imputation techniques at SNR = -5 dB when using an ideal 'oracle' mask. With error-prone estimated masks, sparse imputation performs slightly worse than the best conventional technique.
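A minimal sketch of the sparse-imputation idea, with a Lasso as an illustrative stand-in for the paper's compressive-sensing solver; the dictionary, data and mask are synthetic.

```python
# Minimal sketch: fit a sparse combination of clean exemplars to the
# reliable entries only, then fill the missing entries from it.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
D = rng.random((400, 120))            # clean-speech exemplar dictionary
y = D @ (rng.random(120) * (rng.random(120) < 0.05))   # "clean" target
mask = rng.random(400) < 0.4          # True = reliable feature

# sparse code fitted on the reliable rows only
code = Lasso(alpha=0.01, positive=True, fit_intercept=False, max_iter=5000)
code.fit(D[mask], y[mask])

y_imputed = y.copy()
y_imputed[~mask] = D[~mask] @ code.coef_   # replace the missing features
```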