An Iterative Longest Matching Segment Approach to Speech Enhancement with Additive Noise and Channel Distortion

AV Speech Enhancement Challenge using a Real Noisy Corpus

arXiv (Cornell University), 2019

This paper presents a first-of-its-kind audiovisual (AV) speech enhancement challenge in real noisy settings. A detailed description of the AV challenge, a novel real noisy AV corpus (ASPIRE), a benchmark speech enhancement task, and baseline performance results are outlined. The latter are based on training a deep neural architecture on a synthetic mixture of the Grid corpus and CHiME-3 noises (consisting of bus, pedestrian, cafe, and street noises) and testing on the ASPIRE corpus. Subjective evaluations of five different speech enhancement algorithms (including SEAGN, spectral subtraction (SS), log-minimum mean-square error (LMMSE), audio-only CochleaNet, and AV CochleaNet) are presented as baseline results. The aim of the multi-modal challenge is to provide a timely opportunity for comprehensive evaluation of novel AV speech enhancement algorithms using our new benchmark, real noisy AV corpus, and specified performance metrics. This will promote AV speech processing research globally, stimulate new groundbreaking multi-modal approaches, and attract interest from companies, academics, and researchers working in AV speech technologies and applications. We encourage participants (through a challenge website sign-up) from both the speech and hearing research communities to benefit from their complementary approaches to AV speech-in-noise processing.

Constrained iterative speech enhancement with application to speech recognition

1991

In this paper, an improved form of iterative speech enhancement for single-channel inputs is formulated. The basis of the procedure is sequential maximum a posteriori estimation of the speech waveform and its all-pole parameters as originally formulated by Lim and Oppenheim, followed by imposition of constraints upon the sequence of speech spectra. The new approaches impose intraframe and interframe constraints on the input speech signal to ensure more speechlike formant trajectories, reduce frame-to-frame pole jitter, and effectively introduce a relaxation parameter to the iterative scheme. Recently discovered properties of the line spectral pair representation of speech allow for an efficient and direct procedure for application of many of the constraint requirements. Substantial improvement over the unconstrained method has been observed in a variety of domains. First, informal listener quality evaluation tests and objective speech quality measures demonstrate the technique's effectiveness for additive white Gaussian noise. A consistent terminating point for the iterative technique is also shown. Second, the algorithms have been generalized and successfully tested for noise which is nonwhite and slowly varying in characteristics. The current systems result in substantially improved speech quality and LPC parameter estimation in this context with only a minor increase in computational requirements. Third, the algorithms were evaluated with respect to improving automatic recognition of speech in the presence of additive noise, and shown to outperform other enhancement methods in this application.
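
To make the underlying Lim-Oppenheim iteration concrete, here is a minimal Python sketch of the *unconstrained* baseline the paper builds on: each pass re-fits all-pole speech parameters to the current estimate and re-applies the resulting Wiener filter. The paper's contribution, the intraframe/interframe (LSP-based) constraints applied between these steps, is deliberately omitted, and the leading-noise-only variance estimate is an assumption made for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def iterative_wiener(noisy, order=12, n_iter=4, noise_var=None):
    """Unconstrained Lim-Oppenheim style iterative enhancement (sketch)."""
    if noise_var is None:
        noise_var = np.var(noisy[:200])  # assumes a leading noise-only stretch
    N = len(noisy)
    s_hat = noisy.copy()
    for _ in range(n_iter):
        # 1) Yule-Walker all-pole fit of the current speech estimate
        r = np.correlate(s_hat, s_hat, 'full')[N - 1:N + order] / N
        a = solve_toeplitz(r[:order], r[1:order + 1])   # AR coefficients
        g2 = r[0] - a @ r[1:order + 1]                  # prediction error power
        # 2) all-pole speech PSD and the corresponding Wiener gain
        w = np.fft.rfftfreq(N) * 2 * np.pi
        A = 1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
        p_s = g2 / np.abs(A) ** 2
        gain = p_s / (p_s + noise_var)
        # 3) filter the noisy signal with the updated gain
        s_hat = np.fft.irfft(gain * np.fft.rfft(noisy), n=N)
    return s_hat
```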

An Analysis of Noise-aware Features in Combination with the Size and Diversity of Training Data for DNN-based Speech Enhancement

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this work, the generalization of speech enhancement algorithms based on deep neural networks (DNNs) across training datasets that differ in size and diversity is analyzed. For this, we compare noise-aware training (NAT) features and signal-to-noise ratio (SNR) based noise-aware training (SNR-NAT) features. NAT appends an estimate of the noise power spectral density (PSD) to a noisy periodogram input feature, whereas SNR-NAT uses the noise PSD for normalization. We show that the Hu noise corpus (limited size) and the CHiME-3 noise corpus (limited diversity) may result in DNNs which do not generalize well to unseen noises. We construct a large and diverse dataset from freely available data and show that it helps DNNs to generalize. However, we also show that with SNR-NAT features the trained models are more robust even if a small or less diverse training set is employed. Using t-distributed stochastic neighbor embedding (t-SNE), we demonstrate that with SNR-NAT both the features and the resulting internal representation of the DNN are less dependent on the background noise, which facilitates generalization to unseen noise types.
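
The distinction between the two feature types is small but consequential, and a short sketch makes it explicit: NAT concatenates the noise PSD estimate with the noisy periodogram, while SNR-NAT divides by it, yielding a feature that is largely invariant to the absolute noise level. The log compression and epsilon flooring below are conventional assumptions, not details taken from the paper.

```python
import numpy as np

def nat_features(noisy_periodogram, noise_psd, eps=1e-10):
    """NAT (sketch): append the noise PSD estimate to the noisy
    periodogram, here in the log domain."""
    return np.concatenate([np.log(noisy_periodogram + eps),
                           np.log(noise_psd + eps)], axis=-1)

def snr_nat_features(noisy_periodogram, noise_psd, eps=1e-10):
    """SNR-NAT (sketch): normalize by the noise PSD instead of
    appending it, giving an a-posteriori-SNR-like feature."""
    return np.log(noisy_periodogram / (noise_psd + eps) + eps)
```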

Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

EURASIP Journal on Advances in Signal Processing, 2016

This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraints imposed by dynamic features (i.e., the time derivatives of the speech coefficients) are used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraints help to improve cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics, while only moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves the ASR performance significantly for clean-condition trained acoustic models.
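
The first of the two dynamic-feature constraints, least squares smoothing of the predicted trajectory, can be sketched in a few lines: given DNN-predicted static coefficients and their deltas for one feature dimension over T frames, find the trajectory c minimizing ||c - static||^2 + w * ||Dc - delta||^2, where D is a delta (difference) operator. The simple centered-difference delta window and the weight w are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np
from scipy.sparse import identity, diags
from scipy.sparse.linalg import spsolve

def ls_smooth(static_pred, delta_pred, w=1.0):
    """Least-squares trajectory smoothing from predicted static and
    delta coefficients (sketch).  Solves (I + w D^T D) c = static + w D^T delta."""
    T = len(static_pred)
    D = diags([-0.5, 0.5], [-1, 1], shape=(T, T))  # centered-difference delta operator
    I = identity(T)
    A = I + w * (D.T @ D)
    b = static_pred + w * (D.T @ delta_pred)
    return spsolve(A.tocsc(), b)
```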

Statistical Methods for the Enhancement of Noisy Speech

Signals and Communication Technology, 2005

With the advent and wide dissemination of mobile communications, speech processing systems must be made robust with respect to environmental noise. In fact, the performance of speech coders or speech recognition systems is degraded when the input signal contains a significant level of noise. As a result, speech quality, speech intelligibility, or recognition rate requirements cannot be met. Improvements are obtained when the speech processing system is combined with a speech enhancement preprocessor. In this paper we will outline algorithms for noise reduction which are based on statistics and optimal estimation techniques. The focus will be on estimation procedures for the spectral coefficients of the clean speech signal and on the estimation of the power spectral density of the background noise.
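
As one representative member of the estimator family this chapter surveys, the sketch below applies a Wiener gain driven by the decision-directed a priori SNR of Ephraim and Malah, given a noise PSD estimate. It is a minimal illustration, not the chapter's full treatment; the smoothing factor and gain floor are conventional defaults rather than values taken from the text.

```python
import numpy as np

def wiener_enhance(noisy_stft, noise_psd, alpha=0.98, gmin=0.1):
    """Decision-directed Wiener enhancement (sketch).
    noisy_stft: complex (freq, time); noise_psd: (freq,)."""
    F, T = noisy_stft.shape
    enhanced = np.empty_like(noisy_stft)
    prev_mag2 = np.zeros(F)
    for t in range(T):
        gamma = np.abs(noisy_stft[:, t]) ** 2 / noise_psd     # a posteriori SNR
        xi = alpha * prev_mag2 / noise_psd \
             + (1 - alpha) * np.maximum(gamma - 1, 0)         # a priori SNR
        gain = np.maximum(xi / (1 + xi), gmin)                # Wiener gain, floored
        enhanced[:, t] = gain * noisy_stft[:, t]
        prev_mag2 = np.abs(enhanced[:, t]) ** 2
    return enhanced
```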

Minimum variance distortionless response spectral estimation and subtraction for robust speech recognition

Spectral analysis is a fundamental part of many speech processing algorithms, including compression, coding, voice conversion, and feature extraction for automatic recognition. These applications present a variety of requirements: spectral resolution, variance of the estimated spectra, and capacity to model the frequency response function of the vocal tract during voiced speech. To satisfy these requirements, a broad variety of solutions has been proposed in the literature, all of which can be classified as either parametric methods, those using a small number of parameters estimated from the data [e.g., linear prediction (LP)], or nonparametric methods, those based on periodograms (e.g., the power spectrum).
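
For reference, the standard MVDR (Capon) spectrum named in the title can be computed per frame as P(w) = (M+1) / (e(w)^H R^{-1} e(w)), with R the Toeplitz autocorrelation matrix of order M+1 and e(w) a complex steering vector. A minimal sketch follows; the filter order and FFT size are illustrative choices, and the paper's specific variant and subtraction scheme are not reproduced here.

```python
import numpy as np
from scipy.linalg import toeplitz

def mvdr_spectrum(frame, order=20, n_fft=512):
    """MVDR (Capon) spectral estimate of one frame (sketch)."""
    N = len(frame)
    r = np.correlate(frame, frame, 'full')[N - 1:N + order] / N
    R_inv = np.linalg.inv(toeplitz(r))                    # (order+1) x (order+1)
    w = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
    E = np.exp(-1j * np.outer(np.arange(order + 1), w))   # steering vectors
    denom = np.real(np.einsum('kf,kl,lf->f', E.conj(), R_inv, E))
    return (order + 1) / denom
```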

A Brief Survey of Speech Enhancement

We present a brief overview of the speech enhancement problem for wide-band noise sources that are not correlated with the speech signal. Our main focus is on the spectral subtraction approach and some of its derivatives in the forms of linear and non-linear minimum mean square error estimators. For the linear case, we review the signal subspace approach, and for the non-linear case, we review spectral magnitude and phase estimators. Online estimation of the second-order statistics of speech signals using parametric and non-parametric models is also addressed.
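
The basic spectral subtraction approach at the heart of this survey is simple enough to sketch directly: subtract an (over-)estimate of the noise magnitude from each noisy magnitude, floor the residual, and reuse the noisy phase. The oversubtraction factor and spectral floor below are the usual tuning knobs, with values chosen only for illustration.

```python
import numpy as np

def spectral_subtraction(noisy_stft, noise_psd, over=1.0, floor=0.02):
    """Magnitude spectral subtraction (sketch).
    noisy_stft: complex (freq, time); noise_psd: (freq,)."""
    mag = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    noise_mag = np.sqrt(noise_psd)[:, None]
    clean_mag = np.maximum(mag - over * noise_mag, floor * mag)  # floored residual
    return clean_mag * np.exp(1j * phase)                        # reuse noisy phase
```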

Exemplar-based speech enhancement for deep neural network based automatic speech recognition

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

Deep neural network (DNN) based acoustic modelling has been successfully used for a variety of automatic speech recognition (ASR) tasks, thanks to its ability to learn higher-level information using multiple hidden layers. This paper investigates the recently proposed exemplar-based speech enhancement technique using coupled dictionaries as a pre-processing stage for DNN-based systems. In this setting, the noisy speech is decomposed as a weighted sum of atoms in an input dictionary containing exemplars sampled from a domain of choice, and the resulting weights are applied to a coupled output dictionary containing exemplars sampled in the short-time Fourier transform (STFT) domain to directly obtain the speech and noise estimates for speech enhancement. In this work, settings using input dictionaries of exemplars sampled from the STFT, Mel-integrated magnitude STFT, and modulation envelope spectra are evaluated. Experiments performed on the AURORA-4 database revealed that these pre-processing stages can improve the performance of DNN-HMM-based ASR systems with both clean and multi-condition training.
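
A rough sketch of the coupled-dictionary decomposition may help: the noisy feature vector is approximated as a non-negative combination of input-dictionary atoms (speech and noise exemplars), and the same weights are then applied to coupled output dictionaries in the STFT magnitude domain to form speech and noise estimates. The multiplicative KL updates below stand in for the paper's actual sparse solver, and the assumption that speech atoms precede noise atoms in the stacked dictionary is purely illustrative.

```python
import numpy as np

def coupled_exemplar_enhance(y_feat, D_in, D_out_s, D_out_n, n_iter=100, eps=1e-9):
    """Coupled-dictionary exemplar decomposition (sketch).
    D_in: input-domain atoms, speech columns first, then noise;
    D_out_s / D_out_n: coupled STFT-magnitude dictionaries."""
    K = D_in.shape[1]
    x = np.full(K, 1.0 / K)                       # non-negative weights
    for _ in range(n_iter):                       # multiplicative KL updates
        ratio = y_feat / (D_in @ x + eps)
        x *= (D_in.T @ ratio) / (D_in.sum(axis=0) + eps)
    ks = D_out_s.shape[1]                         # assumes speech atoms come first
    s_hat = D_out_s @ x[:ks]                      # speech estimate
    n_hat = D_out_n @ x[ks:]                      # noise estimate
    mask = s_hat / (s_hat + n_hat + eps)          # Wiener-like mask
    return s_hat, n_hat, mask
```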

The 2nd ‘CHiME’ speech separation and recognition challenge: Approaches on single-channel source separation and model-driven speech enhancement

In this paper, we address the small vocabulary track (track 1) of the CHiME 2 challenge, dedicated to recognizing utterances of a target speaker with small head movements. The utterances are recorded in reverberant room acoustics and corrupted by highly non-stationary noise sources. Such an adverse noise scenario poses a challenge to state-of-the-art automatic speech recognition systems. We developed two individual front ends for the output of the delay-and-sum beamformer: (i) a model-driven single-channel speech enhancement stage which combines knowledge of the speaker identity, modeled by a trained vector quantizer, with a minimum statistics based noise tracker, and (ii) a single-channel source separation stage which employs models of the target speaker as well as the background noise as codebooks. Our perceived signal quality and separation results, averaged over the CHiME 2 development set, demonstrate the effectiveness of both strategies in recovering the target speech signal. Also, our best keyword recognition accuracy results show a 20% improvement over the provided baseline results on the development and test sets.
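
The minimum statistics noise tracker used in the first front end can be approximated as follows: smooth the noisy periodogram over time and take the minimum within a sliding window as the (bias-compensated) noise PSD estimate. Martin's full method uses an optimal time-varying smoothing factor and a principled bias correction; the fixed smoothing factor, window length, and bias factor here are simplifications for illustration.

```python
import numpy as np

def min_stats_noise_psd(noisy_psd_frames, alpha=0.85, win=100, bias=1.5):
    """Minimum-statistics style noise PSD tracking (sketch).
    noisy_psd_frames: (freq, time) noisy periodogram."""
    F, T = noisy_psd_frames.shape
    smoothed = np.empty((F, T))
    smoothed[:, 0] = noisy_psd_frames[:, 0]
    for t in range(1, T):                         # recursive temporal smoothing
        smoothed[:, t] = alpha * smoothed[:, t - 1] \
                         + (1 - alpha) * noisy_psd_frames[:, t]
    noise_psd = np.empty((F, T))
    for t in range(T):                            # sliding-window minimum + bias
        lo = max(0, t - win + 1)
        noise_psd[:, t] = bias * smoothed[:, lo:t + 1].min(axis=1)
    return noise_psd
```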