An Iterative Longest Matching Segment Approach to Speech Enhancement with Additive Noise and Channel Distortion

Speech enhancement from additive noise and channel distortion — a corpus-based approach

Interspeech 2014, 2014

This paper presents a new approach to single-channel speech enhancement involving both noise and channel distortion (i.e., convolutional noise). The approach is based on finding longest matching segments (LMS) from a corpus of clean, wideband speech. The approach adds three novel developments to our previous LMS research. First, we address the problem of channel distortion as well as additive noise. Second, we present an improved method for modeling noise. Third, we present an iterative algorithm for improved speech estimates. In experiments using speech recognition as a test with the Aurora 4 database, the use of our enhancement approach as a preprocessor for feature extraction significantly improved the performance of a baseline recognition system. In another comparison against conventional enhancement algorithms, both the PESQ and the segmental SNR ratings of the LMS algorithm were superior to the other methods for noisy speech enhancement.

A Statistical Analysis on the Impact of Speech Enhancement Techniques on the Feature Vectors of Noisy Speech Signals for Speech Recognition

Noise is one of the major challenges in the development of robust automatic speech recognition (ASR) systems. There are several speech enhancement techniques available to reduce the effect of noise on speech signals. In this paper, a statistical analysis is presented on the impact of speech enhancement techniques on the feature vectors of noisy speech signals by estimating Bhattacharyya distances (BD) from the feature vectors of approximately noise-free training speech signals to the feature vectors of noisy testing speech signals. Here Sub-band Spectral Subtraction (SSS) and Frame Selection (FS) have been used as speech enhancement techniques at the signal level, and Cepstral Mean Normalization (CMN) has been used as a feature normalization technique at the feature level. In this research work, a combination of Mel-Frequency Cepstral Coefficients (MFCC), log energies, and first and second time derivatives of the MFCCs and log energies has been used as speech feature vectors. Speech rec...
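As a rough illustration of the distance measure named in this abstract, the sketch below computes the Bhattacharyya distance between two sets of feature vectors. It assumes each set is modeled as a Gaussian with diagonal covariance; the abstract does not state which distribution model the authors used, and the function names are hypothetical.

```python
import math

def mean_and_var(frames):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dims)]
    return means, variances

def bhattacharyya_diag_gauss(mean_a, var_a, mean_b, var_b):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Assumption: features are Gaussian with independent dimensions,
    and every variance is strictly positive.
    """
    dist = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        v = 0.5 * (va + vb)                              # averaged variance
        dist += 0.125 * (ma - mb) ** 2 / v               # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(va * vb))   # variance-mismatch term
    return dist
```

A smaller distance between clean training features and enhanced test features would suggest that the enhancement step moved the noisy features closer to the clean distribution.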

Wide matching — An approach to improving noise robustness for speech enhancement

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016

It is shown that under certain conditions it is possible to obtain a good speech estimate from noise without requiring noise estimation. We study an implementation of the theory, namely wide matching, for speech enhancement. The new approach performs sentence-wide joint speech segment estimation subject to maximum recognizability to gain noise robustness. Experiments have been conducted to evaluate the new approach with variable noises and SNRs from -5 dB to noise-free. It is shown that the new approach, without any estimation of the noise, significantly outperformed conventional methods in the low SNR conditions while retaining comparable performance in the high SNR conditions. It is further suggested that the wide matching and deep learning approaches can be combined towards a highly robust and accurate speech estimator.

Recognition of noisy speech

Proceedings of the workshop on Speech and Natural Language - HLT '90, 1990

A model-based spectral estimation algorithm is derived that improves the robustness of speech recognition systems to additive noise. The algorithm is tailored for filter-bank-based systems, where the estimation should seek to minimize the distortion as measured by the recognizer's distance metric. This estimation criterion is approximated by minimizing the Euclidean distance between spectral log-energy vectors, which is equivalent to minimizing the nonweighted, nontruncated cepstral distance. Correlations between frequency channels are incorporated in the estimation by modeling the spectral distribution of speech as a mixture of components, each representing a different speech class, and assuming that spectral energies at different frequency channels are uncorrelated within each class. The algorithm was tested with SRI's continuous-speech, speaker-independent, hidden Markov model recognition system using the large-vocabulary NIST "Resource Management Task." When trained on a clean-speech database and tested with additive white Gaussian noise, the new algorithm has an error rate half of that with MMSE estimation of log spectral energies at individual frequency channels, and it achieves a level similar to that with the ideal condition of training and testing at constant SNR. The algorithm is also very efficient with additive environmental noise, recorded with a desktop microphone.
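The stated equivalence between Euclidean distance on log-energy vectors and the nonweighted, nontruncated cepstral distance can be checked numerically. The sketch below assumes the cepstrum is an orthonormal DCT-II of the log energies, which is an illustrative choice (real front ends often use a different scaling); the example vectors are made up.

```python
import math

def dct_orthonormal(x):
    """Orthonormal DCT-II, keeping all coefficients (no truncation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def euclidean(a, b):
    """Plain (unweighted) Euclidean distance."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical log-energy vectors from a 4-channel filter bank.
log_energy_a = [1.2, 0.7, -0.3, 2.1]
log_energy_b = [0.9, 1.1, 0.2, 1.5]

d_spectral = euclidean(log_energy_a, log_energy_b)
d_cepstral = euclidean(dct_orthonormal(log_energy_a), dct_orthonormal(log_energy_b))
```

Because an orthonormal transform preserves vector lengths, `d_spectral` and `d_cepstral` agree up to rounding. The equality breaks as soon as coefficients are dropped or weighted, which is why the abstract specifies "nonweighted, nontruncated".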

AV Speech Enhancement Challenge using a Real Noisy Corpus

arXiv (Cornell University), 2019

This paper presents a first-of-its-kind audiovisual (AV) speech enhancement challenge in real noisy settings. A detailed description of the AV challenge, a novel real noisy AV corpus (ASPIRE), benchmark speech enhancement task, and baseline performance results are outlined. The latter are based on training a deep neural architecture on a synthetic mixture of Grid corpus and CHiME3 noises (consisting of bus, pedestrian, cafe, and street noises) and testing on the ASPIRE corpus. Subjective evaluations of five different speech enhancement algorithms (including SEAGN, spectrum subtraction (SS), log-minimum mean-square error (LMMSE), audio-only CochleaNet, and AV CochleaNet) are presented as baseline results. The aim of the multi-modal challenge is to provide a timely opportunity for comprehensive evaluation of novel AV speech enhancement algorithms, using our new benchmark, real noisy AV corpus and specified performance metrics. This will promote AV speech processing research globally, stimulate new groundbreaking multi-modal approaches, and attract interest from companies, academics and researchers working in AV speech technologies and applications. We encourage participants (through a challenge website sign-up) from both the speech and hearing research communities, to benefit from their complementary approaches to AV speech-in-noise processing.

Constrained iterative speech enhancement with application to speech recognition

1991

In this paper, an improved form of iterative speech enhancement for single channel inputs is formulated. The basis of the procedure is sequential maximum a posteriori estimation of the speech waveform and its all-pole parameters as originally formulated by Lim and Oppenheim, followed by imposition of constraints upon the sequence of speech spectra. The new approaches impose intraframe and interframe constraints on the input speech signal to ensure more speechlike formant trajectories, reduce frame-to-frame pole jitter, and effectively introduce a relaxation parameter to the iterative scheme. Recently discovered properties of the line spectral pair representation of speech allow for an efficient and direct procedure for application of many of the constraint requirements. Substantial improvement over the unconstrained method has been observed in a variety of domains. First, informal listener quality evaluation tests and objective speech quality measures demonstrate the technique's effectiveness for additive white Gaussian noise. A consistent terminating point for the iterative technique is also shown. Second, the algorithms have been generalized and successfully tested for noise which is nonwhite and slowly varying in characteristics. The current systems result in substantially improved speech quality and LPC parameter estimation in this context with only a minor increase in computational requirements. Third, the algorithms were evaluated with respect to improving automatic recognition of speech in the presence of additive noise, and shown to outperform other enhancement methods in this application.

An Analysis of Noise-aware Features in Combination with the Size and Diversity of Training Data for DNN-based Speech Enhancement

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this work, the generalization of speech enhancement algorithms based on deep neural networks (DNNs) for training datasets that differ in size and diversity is analyzed. For this, we compare noise aware training (NAT) features and signal-to-noise ratio (SNR) based noise aware training (SNR-NAT) features. NAT appends an estimate of the noise power spectral density (PSD) to a noisy periodogram input feature, whereas SNR-NAT uses the noise PSD for normalization. We show that the Hu noise corpus (limited size) and the CHiME 3 noise corpus (limited diversity) may result in DNNs which do not generalize well to unseen noises. We construct a large and diverse dataset from freely available data and show that it helps DNNs to generalize. However, we also show that with SNR-NAT features, the trained models are more robust even if a small or less diverse training set is employed. Using t-distributed stochastic neighbor embedding (t-SNE), we demonstrate that using SNR-NAT both the features and the resulting internal representation of the DNN are less dependent on the background noise which facilitates the generalization to unseen noise types.
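The difference between the two feature types described in this abstract can be sketched as follows. The log compression and the small `EPS` floor are assumptions added for numerical convenience, and the function names are hypothetical; the abstract only states that NAT appends the noise PSD estimate while SNR-NAT normalizes by it.

```python
import math

EPS = 1e-12  # assumed floor to avoid log(0); not taken from the paper

def nat_features(noisy_periodogram, noise_psd):
    """NAT: append the noise PSD estimate to the noisy periodogram (log domain)."""
    return ([math.log(p + EPS) for p in noisy_periodogram]
            + [math.log(n + EPS) for n in noise_psd])

def snr_nat_features(noisy_periodogram, noise_psd):
    """SNR-NAT: normalize the noisy periodogram by the noise PSD (log domain)."""
    return [math.log((p + EPS) / (n + EPS))
            for p, n in zip(noisy_periodogram, noise_psd)]

# Made-up single-frame values for three frequency bins.
periodogram = [4.0, 1.0, 0.25]
noise_psd = [1.0, 0.5, 0.25]

nat = nat_features(periodogram, noise_psd)          # length 6
snr_nat = snr_nat_features(periodogram, noise_psd)  # length 3
```

One property visible in this sketch is that scaling the signal and the noise estimate by the same gain leaves the SNR-NAT features essentially unchanged, while the NAT features shift. That is consistent with the abstract's claim that SNR-NAT features depend less on the background noise.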

Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

EURASIP Journal on Advances in Signal Processing, 2016

This paper investigates deep neural networks (DNN) based on nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint helps to improve cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves the ASR performance significantly for clean-condition trained acoustic models.

Minimum variance distortionless response spectral estimation and subtraction for robust speech recognition

Spectral analysis is a fundamental part of many speech processing algorithms, including compression, coding, voice conversion, and feature extraction for automatic recognition. These applications present a variety of requirements: spectral resolution, variance of the estimated spectra, and capacity to model the frequency response function of the vocal tract during voiced speech. To satisfy these requirements, a broad variety of solutions has been proposed in the literature, all of which can be classified as either parametric methods, those using a small number of parameters estimated from the data [e.g., linear prediction (LP)], or nonparametric methods, those based on periodograms (e.g., the power spectrum).
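To make the nonparametric category concrete, here is a minimal periodogram sketch using a direct DFT. It assumes a single unwindowed frame with a made-up test tone; practical front ends usually apply a window and some form of averaging, which is omitted here, and this is not the minimum variance distortionless response method named in the title.

```python
import cmath
import math

def periodogram(x):
    """Power spectrum estimate |X[k]|^2 / N for bins 0 .. N/2 of a real frame."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        # Direct DFT of bin k; O(N^2) overall, which is fine for a short frame.
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

# Hypothetical test frame: a sinusoid with exactly 4 cycles in 32 samples.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
power = periodogram(frame)
peak_bin = max(range(len(power)), key=power.__getitem__)
```

The estimate has one value per frequency bin, so its parameter count grows with the frame length. A parametric method such as LP instead summarizes the spectrum with a small, fixed number of coefficients.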