Multichannel Speech Enhancement Based on Speech Spectral Magnitude Estimation Using Generalized Gamma Prior Distribution
Related papers
IEICE Transactions on Information and Systems, 2008
We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from the actual noisy speech in a frame-by-frame manner. The use of a more general prior distribution with online adaptive estimation is shown to be effective for speech spectral estimation in noisy environments. Furthermore, multichannel information in the form of cross-channel statistics is shown to be useful for better adapting the prior distribution parameters to the actual observation, resulting in better performance of the speech enhancement algorithm. We tested the proposed algorithm on an in-car speech database and obtained significant improvements in speech recognition performance, particularly under non-stationary noise conditions such as music, air conditioner, and open window.
Gamma Modeling of Speech Power and Its On-Line Estimation for Statistical Speech Enhancement
IEICE Transactions on Information and Systems, 2006
This study shows the effectiveness of using a gamma distribution in the speech power domain as a more general prior distribution for model-based speech enhancement approaches. This model is a superset of the conventional Gaussian model of the complex spectrum and provides more accurate prior modeling when the optimal parameters are estimated. We develop a method to adapt the modeled distribution parameters from the actual noisy speech in a frame-by-frame manner. Next, we derive and investigate the minimum mean square error (MMSE) and maximum a posteriori (MAP) estimators in different domains (speech spectral magnitude, generalized power, and its logarithm) using the proposed gamma modeling. Finally, a comparative evaluation of the MAP and MMSE filters is conducted. While the MMSE estimators tend to become more complicated under more general prior distributions, the MAP estimators are given in closed form and are therefore suitable for implementation. The adaptive estimation of the modeled distribution parameters provides more accurate prior modeling; this is the principal merit of the proposed method and the reason for its better performance. Based on the experiments, MAP estimation is recommended for its high efficiency and low complexity. Among the MAP-based systems, estimation in the log-magnitude domain is shown to be the best for speech recognition, whereas estimation in the power domain is superior for noise reduction.
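The "superset" claim in the abstract above can be illustrated numerically: if the complex spectrum is circular Gaussian, its power is exponentially distributed, which is a gamma distribution with shape 1. The following is a minimal sketch under that assumption. The function names and the method-of-moments fit are hypothetical illustrations and are not the frame-by-frame adaptation method of the paper, which the abstract does not specify.

```python
import math
import numpy as np

def gamma_power_pdf(power, shape, mean_power):
    """Gamma density over speech power (power >= 0), given shape and mean."""
    scale = mean_power / shape
    norm = math.gamma(shape) * scale ** shape
    return power ** (shape - 1.0) * np.exp(-power / scale) / norm

def fit_gamma_moments(power):
    """Illustrative method-of-moments fit: shape = mean^2 / var, scale = var / mean."""
    mean = np.mean(power)
    var = np.var(power)
    return mean ** 2 / var, var / mean

# Assumption: circular complex Gaussian coefficients with E|S|^2 = sigma2.
rng = np.random.default_rng(0)
sigma2 = 2.0
n = 200_000
spectrum = np.sqrt(sigma2 / 2.0) * (
    rng.standard_normal(n) + 1j * rng.standard_normal(n)
)
power = np.abs(spectrum) ** 2

# For Gaussian spectra the fitted shape should be close to 1
# and the fitted scale close to sigma2.
shape_hat, scale_hat = fit_gamma_moments(power)

# With shape = 1 the gamma density equals the exponential density.
grid = np.linspace(0.01, 5.0, 50)
exponential_pdf = np.exp(-grid / sigma2) / sigma2
gamma_at_shape_one = gamma_power_pdf(grid, 1.0, sigma2)
```

A fitted shape below 1 would indicate a power distribution that is more peaked at zero and heavier-tailed than the Gaussian model allows, which is the extra flexibility the gamma prior offers.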
Log-spectral amplitude estimation with Generalized Gamma distributions for speech enhancement
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
This paper presents a family of log-spectral amplitude (LSA) estimators for speech enhancement. Generalized Gamma distributed (GGD) priors are assumed for speech short-time spectral amplitudes (STSAs), providing mathematical flexibility in capturing the statistical behavior of speech. Although solutions are not obtainable in closed form, the estimators are expressed as limits and can be efficiently approximated. When applied to the Noizeus database [1], the proposed estimators are shown to provide improvements in segmental signal-to-noise ratio (SSNR) and COSH distance [2], relative to the LSA estimator proposed by Ephraim and Malah.
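To make the flexibility of the GGD prior concrete, here is a minimal sketch of one common parameterization of the generalized gamma density over amplitudes, p(a) = γ β^ν / Γ(ν) · a^(γν−1) · exp(−β a^γ). This parameterization and the names `gamma_exp`, `nu`, and `beta` are assumptions for illustration; the paper's own notation may differ. Setting γ = 2 and ν = 1 recovers the Rayleigh amplitude density that follows from a Gaussian spectral model.

```python
import math
import numpy as np

def generalized_gamma_pdf(a, gamma_exp, nu, beta):
    """Generalized gamma density over amplitude a > 0 (assumed parameterization)."""
    coeff = gamma_exp * beta ** nu / math.gamma(nu)
    return coeff * a ** (gamma_exp * nu - 1.0) * np.exp(-beta * a ** gamma_exp)

def rayleigh_pdf(a, sigma2):
    """Rayleigh amplitude density implied by a complex Gaussian with E|S|^2 = sigma2."""
    return 2.0 * a / sigma2 * np.exp(-a ** 2 / sigma2)

a = np.linspace(1e-4, 10.0, 100_000)
da = a[1] - a[0]
sigma2 = 1.5

# gamma_exp = 2, nu = 1, beta = 1 / sigma2 reduces to the Rayleigh density.
ggd_gaussian_case = generalized_gamma_pdf(a, gamma_exp=2.0, nu=1.0, beta=1.0 / sigma2)

# Another parameter setting; a simple Riemann sum checks the normalization.
area_other = np.sum(generalized_gamma_pdf(a, gamma_exp=1.0, nu=2.0, beta=2.0)) * da
```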
On the Importance of Super-Gaussian Speech Priors for Machine-Learning Based Speech Enhancement
IEEE/ACM Transactions on Audio, Speech, and Language Processing
For enhancing noisy signals, machine-learning based single-channel speech enhancement schemes exploit prior knowledge about typical speech spectral structures. To ensure a good generalization and to meet requirements in terms of computational complexity and memory consumption, certain methods restrict themselves to learning speech spectral envelopes. We refer to these approaches as machine-learning spectral envelope (MLSE)-based approaches. In this paper we show by means of theoretical and experimental analyses that for MLSE-based approaches, super-Gaussian priors allow for a reduction of noise between speech spectral harmonics which is not achievable using Gaussian estimators such as the Wiener filter. For the evaluation, we use a deep neural network (DNN)-based phoneme classifier and a low-rank nonnegative matrix factorization (NMF) framework as examples of MLSE-based approaches. A listening experiment and instrumental measures confirm that while super-Gaussian priors yield only moderate improvements for classic enhancement schemes, for MLSE-based approaches super-Gaussian priors clearly make an important difference and significantly outperform Gaussian priors.
On the Importance of Super-Gaussian Speech Priors for Pre-Trained Speech Enhancement
arXiv (Cornell University), 2017
For enhancing noisy signals, machine-learning based single-channel speech enhancement schemes exploit prior knowledge about typical speech spectral structures. To ensure a good generalization and to meet requirements in terms of computational complexity and memory consumption, certain methods restrict themselves to learning speech spectral envelopes. We refer to these approaches as machine-learning spectral envelope (MLSE)-based approaches. In this paper we show by means of theoretical and experimental analyses that for MLSE-based approaches, super-Gaussian priors allow for a reduction of noise between speech spectral harmonics which is not achievable using Gaussian estimators such as the Wiener filter. For the evaluation, we use a deep neural network (DNN)-based phoneme classifier and a low-rank nonnegative matrix factorization (NMF) framework as examples of MLSE-based approaches. A listening experiment and instrumental measures confirm that while super-Gaussian priors yield only moderate improvements for classic enhancement schemes, for MLSE-based approaches super-Gaussian priors clearly make an important difference and significantly outperform Gaussian priors.
Generalized maximum a posteriori spectral amplitude estimation for speech enhancement
Speech Communication, 2015
Spectral restoration methods for speech enhancement aim to remove noise components from noisy speech signals by applying a gain function in the spectral domain. The design of the gain function is one of the most important factors in obtaining enhanced speech of good quality. In most studies, the gain function is designed by optimizing a criterion based on assumptions about the noise and speech distributions, such as the minimum mean square error (MMSE), maximum likelihood (ML), and maximum a posteriori (MAP) criteria. The MAP criterion has the advantage of yielding a more reliable gain function by incorporating a suitable prior density. However, as several studies have shown, it has a drawback: although a MAP-based estimator effectively reduces noise components when the signal-to-noise ratio (SNR) is low, it introduces large speech distortion when the SNR is high. To solve this problem, we have proposed a generalized maximum a posteriori spectral amplitude (GMAPA) algorithm for designing the gain function. The GMAPA algorithm dynamically sets the weight of the prior density of the speech spectra according to the SNR of the test speech signals in order to calculate the optimal gain function. When the SNR is high, GMAPA adopts a small weight to prevent overcompensation that may result in speech distortion. When the SNR is low, GMAPA uses a large weight to avoid disturbance of the restoration caused by measurement noise. Our previous study showed that the weight of the prior density plays a crucial role in GMAPA performance, and there the weight was determined from the SNR at the utterance level. In this paper, we propose computing the weight with consideration of time-frequency correlations, which results in a more accurate estimate of the gain function. Experiments were carried out to evaluate the proposed algorithm with both objective and subjective tests.
The results of the objective tests indicate that GMAPA is promising compared with several well-known algorithms at both high and low SNRs. The results of the subjective listening tests indicate that GMAPA provides significantly higher sound quality than the other speech enhancement algorithms.
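The notion of a spectral-domain gain function can be sketched with the classic Wiener gain, G = ξ / (1 + ξ), where ξ is the a priori SNR of a frequency bin. This is a generic illustration only and is not the GMAPA gain, whose formula the abstract does not give; the function names and the example values are hypothetical.

```python
import numpy as np

def wiener_gain(prior_snr):
    """Wiener gain per frequency bin from the a priori SNR (linear scale, not dB)."""
    prior_snr = np.asarray(prior_snr, dtype=float)
    return prior_snr / (1.0 + prior_snr)

def apply_gain(noisy_spectrum, gain):
    """Scale each noisy spectral coefficient by a real gain; the noisy phase is kept."""
    return gain * noisy_spectrum

# Hypothetical noisy spectral coefficients and a priori SNRs for three bins.
noisy = np.array([1.0 + 1.0j, 0.5 - 0.2j, -2.0 + 0.0j])
xi = np.array([0.0, 1.0, 99.0])

gains = wiener_gain(xi)
enhanced = apply_gain(noisy, gains)
```

Bins with low a priori SNR are strongly attenuated, while bins with high SNR pass almost unchanged. An SNR-dependent prior weight, as described for GMAPA, changes how aggressively such a gain suppresses noise in each regime.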
Several investigations have shown that speech enhancement approaches can be improved by speech presence uncertainty (SPU) estimation. Although there has been a strong focus on the use of correct statistical models for spectral weighting rules over the last few decades, there are only a few publications on SPU estimation based on a speech prior consistent with the spectral weighting rule. This contribution presents a new consistent solution for MMSE speech amplitude (SA) estimation under SPU, based on the generalized gamma distribution, which represents a variety of speech priors. Employing the gamma speech model, which is a special case of the generalized gamma distribution, the new approach is shown to outperform both the SPU-based MMSE-SA estimator relying on a Gaussian speech prior and the gamma MMSE-SA estimator without SPU.
Semi-Supervised Multichannel Speech Enhancement With a Deep Speech Prior
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019
This paper describes a semi-supervised multichannel speech enhancement method that uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace a low-rank speech model with a deep generative speech model, i.e., formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method was significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF with many spatial parameters can be solved by incorporating the precise speech model.
Speech Enhancement using Beta-order MMSE Spectral Amplitude Estimator with Laplacian Prior
This report addresses the problem of speech enhancement employing Minimum Mean-Square Error (MMSE) estimation of the β-order Short-Time Spectral Amplitude (STSA). We present an analytical solution for the β-order MMSE estimator, where the Discrete Fourier Transform (DFT) coefficients of clean speech are modeled by Laplacian distributions. Using some approximations for the joint probability density function and the Bessel function, we also present a closed-form version of the estimator (called β-order LapMMSE). The performance of the proposed estimator is compared to that of state-of-the-art spectral amplitude estimators that assume Gaussian priors for the clean DFT coefficients. Comparative results demonstrate the superiority of the proposed estimator in terms of speech enhancement and noise reduction measures.