Richard Hendriks - Academia.edu
Papers by Richard Hendriks
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
Originally, ideal binary mask (idbm) techniques have been used as a tool for studying aspects of the auditory system. More recently, idbm techniques have been adapted to the practical problem of retrieving a target speech signal from a noisy observation. In this practical setting, the binary mask techniques show similarities with existing DFT based speech enhancement techniques. In this context, we derive single-channel, binary mask estimators which minimize the spectral magnitude mean-square error. We show in simulation experiments with natural speech and noise signals that the proposed estimators perform significantly better than existing binary mask estimators. However, even the best of the proposed estimators is clearly outperformed by non-binary estimators, both in terms of speech quality and intelligibility.
Signal Processing, 2015
ABSTRACT
We study the distribution of time-domain speech samples as well as the distribution of Discrete Fourier Transform (DFT) coefficients obtained from speech segments. We consider four possible pdf model types, namely Gaussian, Laplacian, Gamma, and a Generalized Gaussian density (GGD). Our time-domain results suggest that for segment lengths of 20-200 ms, the Laplacian density is the better choice, while for shorter segments the Gaussian model is more appropriate. For segments of 20 ms, the Gaussian model is advantageous for broad speech classes of fricatives, nasals and glides, but stop sounds are much better represented with the Laplacian model, making the latter model better on average. Finally, our study supports the often made assumption that DFT coefficients collected within short time intervals can be considered Gaussian distributed, across all types of speech sounds.
In this paper, we analyze the minimum mean square error (MMSE) based spectral noise power estimator [1] and present an improvement. We will show that the MMSE based spectral noise power estimate is only updated when the a posteriori signal-to-noise ratio (SNR) is lower than one. This threshold on the a posteriori SNR can be interpreted as a voice activity detector (VAD).
IEEE Transactions on Audio Speech and Language Processing
Recently, binary mask techniques have been proposed as a tool for retrieving a target speech signal from a noisy observation. A binary gain function is applied to time-frequency tiles of the noisy observation in order to suppress noise dominated and retain target dominated time-frequency regions. When implemented using discrete Fourier transform (DFT) techniques, the binary mask techniques can be seen as a special case of the broader class of DFT-based speech enhancement algorithms, for which the applied gain function is not constrained to be binary. In this context, we develop and compare binary mask techniques to state-of-the-art continuous gain techniques. We derive spectral magnitude minimum mean-square error binary gain estimators; the binary gain estimators turn out to be simple functions of the continuous gain estimators. We show that the optimal binary estimators are closely related to a range of existing, heuristically developed, binary gain estimators. The derived binary gain estimators perform better than existing binary gain estimators in simulation experiments with speech signals contaminated by several different noise sources as measured by speech quality and intelligibility measures. However, even the best binary mask method is significantly outperformed by state-of-the-art continuous gain estimators. The instrumental intelligibility results are confirmed in an intelligibility listening test.
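To make the distinction between a binary and a continuous gain concrete, here is a minimal Python sketch. It is not the estimator derived in the paper: it uses the textbook ideal binary mask (which assumes oracle access to the speech and noise power per time-frequency tile) and the Wiener gain as one example of a continuous gain. The function names, the threshold parameter `lc_db`, and the toy power values are assumptions made for illustration only.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Binary gain: 1.0 where the local SNR exceeds lc_db (in dB), else 0.0.

    Assumes oracle knowledge of speech and noise power per tile.
    """
    snr_db = 10.0 * np.log10(speech_power / noise_power)
    return (snr_db > lc_db).astype(float)

def wiener_gain(speech_power, noise_power):
    """Continuous gain in [0, 1): the Wiener gain, used here as an example."""
    return speech_power / (speech_power + noise_power)

# Toy 2x2 grid of time-frequency tiles (hypothetical values).
speech_power = np.array([[4.0, 0.25], [1.0, 9.0]])
noise_power = np.ones((2, 2))

mask = ideal_binary_mask(speech_power, noise_power)
gain = wiener_gain(speech_power, noise_power)
```

With these toy values the mask keeps only the two tiles whose speech power is above the noise power, while the Wiener gain attenuates every tile by an amount that depends on its SNR.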
Synthesis Lectures on Speech and Audio Processing, 2013
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
To improve speech communication in noisy and reverberant environments, there is increased interest in developing algorithms that make efficient use of acoustic wireless sensor networks (WSNs). The processors and sensors forming these WSNs can be owned by multiple users. Sending private data across such a WSN can lead to severe privacy and security issues and may limit its acceptance. Using the advantages of WSNs while guaranteeing people's privacy therefore requires sharing processors and data in a privacy preserving manner. In this paper we draw attention to the problem of privacy and security for distributed speech enhancement and propose the new paradigm of privacy preserving distributed beamforming. Using cryptographic techniques, particularly homomorphic encryption, we demonstrate how distributed beamforming techniques can be computed in a privacy preserving manner in the encrypted domain.
2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011
In this paper, we analyze the minimum mean square error (MMSE) based spectral noise power estimator [1] and present an improvement. We will show that the MMSE based spectral noise power estimate is only updated when the a posteriori signal-to-noise ratio (SNR) is lower than one. This threshold on the a posteriori SNR can be interpreted as a voice activity detector (VAD).
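The threshold behaviour described above can be illustrated with a short Python sketch. It rests on assumptions the abstract does not spell out: complex Gaussian models for the speech and noise DFT coefficients, and a maximum-likelihood estimate of the a priori SNR, max(γ - 1, 0), where γ is the a posteriori SNR. Under those assumptions the conditional mean of the noise periodogram collapses to the previous noise estimate whenever γ is at least one. The function name and the test values are hypothetical.

```python
import numpy as np

def mmse_noise_power(noisy_periodogram, noise_psd_prev):
    """MMSE estimate of the noise periodogram per frequency bin.

    Assumes Gaussian speech/noise models and a maximum-likelihood
    a priori SNR estimate (both are assumptions of this sketch).
    """
    gamma = noisy_periodogram / noise_psd_prev   # a posteriori SNR
    xi_ml = np.maximum(gamma - 1.0, 0.0)         # ML a priori SNR estimate
    return ((1.0 / (1.0 + xi_ml)) ** 2 * noisy_periodogram
            + xi_ml / (1.0 + xi_ml) * noise_psd_prev)

# Three bins with a posteriori SNR of 0.5, 1.0 and 4.0 (hypothetical values).
noisy_periodogram = np.array([0.5, 1.0, 4.0])
noise_psd_prev = np.ones(3)
estimate = mmse_noise_power(noisy_periodogram, noise_psd_prev)
```

Only the first bin, where the a posteriori SNR is below one, yields an estimate that differs from the previous noise PSD; the other bins return the previous value unchanged, which is why the threshold acts like a VAD decision.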
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
To harvest the potential of multi-channel noise reduction methods, it is crucial to have an accurate estimate of the noise correlation matrix. Existing algorithms either assume speech absence and exploit a voice activity detector (VAD), or make use of additional assumptions like a diffuse noise field. Therefore, these algorithms are limited with respect to their tracking speed and the type of noise fields for which they can estimate the correlation matrix.
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
ABSTRACT
Short-time Fourier transform (STFT) methods are often used to overcome the degradation of speech signals affected by noise. STFT-gain functions are usually expressed as a function of the a priori SNR, say ξ, and good techniques to estimate ξ are of vital importance for the quality of enhanced speech. Often, ξ is estimated using the so-called decision directed approach (DD). However, the DD approach builds on a number of approximations, where certain expected values of signal related quantities are approximated by instantaneous estimates. In this paper we present a method to improve these approximations by combining the DD approach with an adaptive time segmentation. Objective and subjective experiments show that the proposed method leads to significant improvements compared to the conventional DD approach. Furthermore, simulation experiments confirm a decreased amount of non-stationary residual noise.
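For readers unfamiliar with the conventional DD approach, the following Python sketch shows its usual recursion for a single frequency bin. It does not include the adaptive time segmentation proposed in the paper. The smoothing constant `alpha`, the floor `xi_min`, the use of a Wiener gain to form the previous clean-speech estimate, and the assumption of a known, constant noise PSD are all illustrative choices, not details taken from the abstract.

```python
import numpy as np

def decision_directed(noisy_power, noise_psd, alpha=0.98, xi_min=1e-3):
    """Conventional decision-directed a priori SNR estimate for one bin.

    noisy_power: noisy periodogram values over successive frames.
    noise_psd:   noise PSD for this bin, assumed known and constant here.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    xi = np.empty_like(noisy_power)
    prev_clean_power = 0.0  # clean-speech power estimate of the previous frame
    for k, y2 in enumerate(noisy_power):
        gamma = y2 / noise_psd  # a posteriori SNR
        # Weighted sum of the previous clean estimate and the instantaneous SNR.
        xi[k] = max(alpha * prev_clean_power / noise_psd
                    + (1.0 - alpha) * max(gamma - 1.0, 0.0), xi_min)
        gain = xi[k] / (1.0 + xi[k])  # Wiener gain (an assumption of this sketch)
        prev_clean_power = gain ** 2 * y2
    return xi

# Two noise-only frames followed by a speech onset (hypothetical values).
xi = decision_directed([1.0, 1.0, 10.0, 10.0], noise_psd=1.0)
```

In this toy run the estimate stays at the floor during the noise-only frames and then rises only gradually after the onset, which illustrates the lag introduced by the heavy smoothing.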
Subspace based speech enhancement relies on the decomposition of the vector space spanned by the covariance matrix of noisy speech into a noise subspace and a signal subspace, where the noise subspace is nulled and the signal subspace is modified by applying a gain function. This gain function is determined by the eigenvalues of the noise and noisy speech covariance matrix that are typically estimated from the noisy data using a fixed segmentation. A fixed segmentation often leads to covariance matrix estimates with an unnecessary high variance or a bias, because segments are shorter or longer, respectively, than the region where the noisy data is stationary. To overcome this problem we present an adaptive time-segmentation algorithm combined with subspace based speech enhancement. As a result, smearing of speech sounds and musical noise in the enhanced speech signal are reduced. Experiments show improvements in terms of segmental SNR of 0.6 dB and symmetrical Itakura-Saito d...
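The decomposition described above can be sketched in a few lines of Python. This is a generic illustration, not the adaptive algorithm from the paper: it assumes white noise with a known variance, estimates the covariance matrix from all frames at once (a fixed segmentation), and uses a Wiener-like gain on the eigenvalues with a hypothetical trade-off parameter `mu`. The function name and the synthetic data are also assumptions.

```python
import numpy as np

def subspace_enhance(noisy_frames, noise_var, mu=1.0):
    """Generic subspace enhancement for white noise of known variance.

    noisy_frames: array of shape (num_frames, dim), one frame per row.
    Returns the enhanced frames and the per-eigenvalue gains.
    """
    cov = noisy_frames.T @ noisy_frames / noisy_frames.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    speech_eig = np.maximum(eigvals - noise_var, 0.0)  # zero => noise subspace
    gains = speech_eig / (speech_eig + mu * noise_var)
    estimator = eigvecs @ np.diag(gains) @ eigvecs.T   # symmetric matrix
    return noisy_frames @ estimator, gains

# Synthetic data: a strong rank-one "speech" component plus white noise.
rng = np.random.default_rng(0)
direction = np.array([0.5, 0.5, 0.5, 0.5])
clean = 10.0 * rng.standard_normal((2000, 1)) * direction
noisy = clean + rng.standard_normal((2000, 4))
enhanced, gains = subspace_enhance(noisy, noise_var=1.0)
```

Directions whose eigenvalue does not exceed the noise variance receive a gain of zero, which corresponds to nulling the noise subspace; the remaining directions are attenuated according to their estimated speech eigenvalue.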
Subspace based noise suppression schemes typically rely on eigenvalue estimates of covariance matrices of successive noisy signal frames. We propose in this paper a scheme for improving these estimates, and, consequently, the performance of the noise suppressor. More specifically, the presented scheme aims at combining past and current eigenvalue estimates into approximately stationary time series in order to obtain a smoothed eigenvalue estimator with a reduced variance. The scheme is general in the sense that it is applicable to essentially any subspace-based noise suppression scheme. In simulation experiments with speech signals degraded by additive white Gaussian noise, the proposed scheme shows improvements over the traditional non-smoothed approach for a range of objective quality measures. Further, in a subjective preference test, the proposed method was preferred in more than 90% of the cases.
We consider DFT based techniques for single-channel speech enhancement. Specifically, we derive minimum mean-square error estimators of clean speech DFT coefficients based on generalized gamma prior probability density functions. Our estimators contain as special cases the well-known Wiener estimator and the more recently derived estimators based on Laplacian and two-sided gamma priors. Simulation experiments with speech signals degraded by various additive noise sources verify that the estimator based on the two-sided gamma prior is close to optimal amongst all the estimators considered in this paper.
Most DFT domain based speech enhancement methods are dependent on an estimate of the noise power spectral density (PSD). For non-stationary noise sources it is desirable to estimate the noise PSD also in spectral regions where speech is present. In this paper a new method for noise tracking is presented, based on eigenvalue decompositions of correlation matrices that are constructed from time series of noisy DFT coefficients. The presented method can estimate the noise PSD at time-frequency points where both speech and noise are present. In comparison to state-of-the-art noise tracking algorithms the proposed algorithm reduces the estimation error between the estimated and the true noise PSD and improves segmental SNR by several dB when combined with an enhancement system. Index Terms: Speech enhancement, noise tracking, DFT domain subspace decompositions.
This is an implementation of alg. 3 described in the book DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement-A Survey of the State of the Art, by Richard C. Hendriks, Timo Gerkmann and Jesper Jensen; Morgan and Claypool Publishers, 2013.
Toolbox for log-spectral magnitude MMSE estimators under super-Gaussian densities. The toolbox can be downloaded from here: log_spec_super_gaussV1.rar. The Matlab files enclosed in this toolbox can be used to tabulate gain functions for log-spectral magnitude MMSE estimators under an assumed Generalized-Gamma model for the clean speech magnitude DFT coefficients. For the theory behind these estimators and constraints on the parameters we refer to the article [1] R.C. Hendriks, R. Heusdens and J. Jensen, "Log-spectral magnitude MMSE estimators under super-Gaussian densities", Interspeech, 2009. Short description of the 2 main m-files (see the headers of the files for more info): For an assumed Generalized-Gamma prior density of the magnitude DFT coefficients with gamma=2 and specific nu parameter the p-file [G1]=TabulateGainGamma2logmmse(Rprior,Rpost,nu) tabulates the gain function for the log-spectral magnitude DFT coefficients. For mathematical expressions of the gain function...
MMSE based noise PSD tracking algorithm. Matlab implementation of the noise PSD tracking algorithm described in "MMSE BASED NOISE PSD TRACKING WITH LOW COMPLEXITY", by Richard C. Hendriks, Richard Heusdens and Jesper Jensen, IEEE International Conference on Acoustics, Speech and Signal Processing, 03/2010, Dallas, TX, p. 4266-4269 (2010). The algorithm can be run by starting the m-file "noise_psd_tracker" with the command [shat, noise_psd_matrix,T]=noise_psd_tracker(noisy,fs), where noisy is the noisy time-domain waveform and fs the sample frequency. The output noise_psd_matrix is a matrix where the columns contain the estimated noise PSDs per time-frame, shat contains the estimated clean signal, and T is the computation time. For more details on the algorithm, see the above referenced paper (included in the zip-file). Update details: 23/1/2012 V2: uploaded a computationally faster version where special functions are tabulated. In addition the script also computes the es...
Matlab implementation of the Short-Time Objective Intelligibility (STOI) measure described in C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, 'A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech', ICASSP 2010, Dallas, Texas.