TRUTH-TO-ESTIMATE RATIO MASK: A POST-PROCESSING METHOD FOR SPEECH ENHANCEMENT DIRECT AT LOW SIGNAL-TO-NOISE RATIOS
2020, International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)
This study proposes a bi-directional recurrent neural network (Bi-RNN) post-processing method for speech enhancement (SE) at low signal-to-noise ratios (SNRs). Current speech enhancement solutions perform poorly at low SNRs. Loizou and Kim proposed a solution that reduces speech distortion errors in the time-frequency (T-F) domain, but it requires knowledge of the ground truth. As the ground truth is unknown in real-life applications, the current study proposes to use a Bi-RNN to implement Loizou and Kim's solution as a post-processing method for SE engines. The proposed method does not require prior knowledge of the ground truth. The effectiveness of the proposed method is investigated with a spectral subtraction (SS) SE engine, a non-negative matrix factorization (NMF) SE engine, and a deep neural network ideal ratio mask (DNN-IRM) SE engine, under matched and mismatched noise and different SNR conditions. Experimental results demonstrate that the proposed post-processing method improves both perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) for all of these SE engines, especially at low SNRs.
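The abstract does not define the mask, so the following is only a minimal sketch of what the title suggests: a per-bin ratio of the true (clean) magnitude to the magnitude estimated by an SE engine, applied multiplicatively to that estimate. The function names, the `eps` guard, and the `max_gain` clipping are illustrative assumptions and are not taken from the paper. The sketch computes the oracle mask from a known clean signal; in the proposed method a Bi-RNN would have to predict this mask, since the ground truth is unavailable at run time.

```python
import numpy as np

def truth_to_estimate_mask(clean_mag, est_mag, eps=1e-8, max_gain=10.0):
    """Oracle ratio of clean magnitude to estimated magnitude per T-F bin.

    Assumption: `eps` avoids division by zero and `max_gain` bounds the
    ratio; neither value comes from the paper.
    """
    mask = clean_mag / (est_mag + eps)
    return np.clip(mask, 0.0, max_gain)

def apply_mask(est_mag, mask):
    """Post-process an SE engine's magnitude estimate with a ratio mask."""
    return est_mag * mask

# Toy 2x2 magnitude spectrograms (frequency x time), purely illustrative.
clean = np.array([[1.0, 0.5], [0.2, 0.0]])
estimate = np.array([[0.5, 1.0], [0.2, 0.1]])

mask = truth_to_estimate_mask(clean, estimate)
refined = apply_mask(estimate, mask)
```

With the oracle mask, `refined` recovers the clean magnitudes up to the `eps` guard. A mask value above 1 restores speech that the engine over-suppressed, and a value below 1 removes residual noise.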
Related papers
Speech Enhancement Using Deep Neural Network
2016
Speech is the main means of human interaction. In everyday life, understanding speech in noisy environments remains one of the major challenges for users. The quality and intelligibility of speech signals are generally degraded by surrounding background noise during communication, so corrupted speech signals must be enhanced to improve them. In the field of speech processing, various efforts have been made to develop speech enhancement techniques that enhance the speech signal by reducing the amount of noise. Speech enhancement deals with improving the quality and intelligibility of speech that has been degraded by surrounding background noise. In various everyday environments, the goal of speech enhancement methods is to improve the quality and intelligibility of speech, especially at low signal-to-noise ratios (SNRs). Regarding intelligibility, different machine learning methods that aim to estimate an ideal binary mask ...
WEIGHTED SPEECH DISTORTION LOSSES FOR NEURAL-NETWORK-BASED REAL-TIME SPEECH ENHANCEMENT
This paper investigates several aspects of training an RNN (recurrent neural network) that impact the objective and subjective quality of enhanced speech for real-time single-channel speech enhancement. Specifically, we focus on an RNN that enhances short-time speech spectra on a single-frame-in, single-frame-out basis, a framework adopted by most classical signal processing methods. We propose two novel mean-squared-error-based learning objectives that enable separate control over the importance of speech distortion versus noise reduction. The proposed loss functions are evaluated by widely accepted objective quality and intelligibility measures and compared to other competitive online methods. In addition, we study the impact of feature normalization and varying batch sequence lengths on the objective quality of enhanced speech. Finally, we show subjective ratings for the proposed approach and a state-of-the-art real-time RNN-based method.
Proceedings of the International Congress on Acoustics 2019, 2019
The intelligibility of noisy speech can be improved by applying an ideal binary or soft gain mask in the time-frequency domain for signal-to-noise ratios (SNRs) that are typically between -10 and +10 dB. In this study, two mask-based algorithms are compared when applied to speech mixed with white Gaussian noise (WGN) at low SNRs (from -29 to -5 dB). These comprise an Ideal Binary Mask (IBM) with a local criterion (LC) set to 0 dB and an Ideal Ratio Mask (IRM). The performance of Short-Time Objective Intelligibility (STOI), and a STOI variant (termed STOI+), is compared with that of other monaural intelligibility metrics that can be used before and after mask-based processing. The results show that IRMs can be used to obtain near maximal speech intelligibility (> 90% for sentence material) even at very low mixture SNRs, while IBMs with LC = 0 provide limited intelligibility gains for SNR < -14 dB. It is also shown that STOI+ is a suitable metric for speech mixed with WGN at low SNRs and processed by IBMs with LC = 0, even when the speech is high-pass filtered to flatten the spectral tilt.
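The two masks compared in this abstract can be sketched as follows. This is a minimal illustration that assumes the commonly used definitions: the IBM is 1 where the local SNR of a T-F bin exceeds the local criterion and 0 elsewhere, and the IRM is the speech-to-mixture power ratio raised to an exponent `beta`. The exact IRM form and exponent used in the study are not given in the abstract, so `beta=0.5` and the `eps` guard are assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0, eps=1e-12):
    """1 where the local SNR (dB) exceeds the local criterion, else 0."""
    local_snr_db = 10.0 * np.log10((speech_mag**2 + eps) / (noise_mag**2 + eps))
    return (local_snr_db > lc_db).astype(float)

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5, eps=1e-12):
    """Soft gain in [0, 1]; beta=0.5 is an assumed, commonly used exponent."""
    s2, n2 = speech_mag**2, noise_mag**2
    return (s2 / (s2 + n2 + eps)) ** beta

# Three toy T-F bins with local SNRs of about +6 dB, 0 dB and -20 dB.
speech = np.array([2.0, 1.0, 0.1])
noise = np.array([1.0, 1.0, 1.0])

ibm = ideal_binary_mask(speech, noise)  # keeps only the first bin
irm = ideal_ratio_mask(speech, noise)   # attenuates each bin gradually
```

The binary mask discards every bin at or below the criterion, whereas the ratio mask keeps a scaled copy of each bin.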
Robust DNN-Based Speech Enhancement with Limited Training Data
2018
In conventional speech enhancement, statistical models for speech and noise are used to derive clean speech estimators. The parameters of the models are estimated blindly from the noisy observation using carefully designed algorithms. These algorithms generalize well to unseen acoustic conditions, but are unable to reduce highly non-stationary noise types. This shortcoming motivated the usage of machine-learning-based (ML-based) algorithms, in particular deep neural networks (DNNs). But if only limited training data are available, the noise reduction performance in unseen acoustic conditions suffers. In this paper, motivated by conventional speech enhancement, we propose to use the a priori and a posteriori signal-to-noise ratios (SNRs) for DNN-based speech enhancement systems. Instrumental measures show that the proposed features increase the robustness in unknown noise types even if only limited training data are available.
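The two SNR features named in this abstract have standard definitions in statistical speech enhancement, which the sketch below illustrates. The a posteriori SNR is the noisy periodogram divided by the noise power spectral density, and the a priori SNR is the expected clean speech power divided by the same noise power. The abstract does not say which estimators the paper uses, so this sketch assumes a known noise PSD and the simple maximum-likelihood a priori estimate max(gamma - 1, 0); the function names and the dB floor are hypothetical.

```python
import numpy as np

def a_posteriori_snr(noisy_power, noise_psd, eps=1e-12):
    """gamma = |Y|^2 / noise PSD, per frequency bin."""
    return noisy_power / (noise_psd + eps)

def a_priori_snr_ml(gamma):
    """Maximum-likelihood a priori SNR estimate: xi = max(gamma - 1, 0)."""
    return np.maximum(gamma - 1.0, 0.0)

def to_db(x, floor=1e-3):
    """Convert to dB with an assumed floor so that zeros stay finite."""
    return 10.0 * np.log10(np.maximum(x, floor))

# Toy periodogram values for three bins; the noise PSD is assumed known here,
# whereas in practice it would be estimated blindly from the noisy signal.
noisy_power = np.array([4.0, 1.0, 0.5])
noise_psd = np.array([1.0, 1.0, 1.0])

gamma = a_posteriori_snr(noisy_power, noise_psd)
xi = a_priori_snr_ml(gamma)

# Stack both SNRs in dB as a feature matrix of shape (2, n_bins).
features = np.stack([to_db(gamma), to_db(xi)])
```

Because both features are normalized by the noise power, they depend less on the absolute level and spectral shape of the noise than raw spectra do, which is one plausible reason for the robustness the abstract reports.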
Speech Enhancement Using Deep Learning Methods: A Review
Jurnal Elektronika dan Telekomunikasi
Speech enhancement, which aims to recover clean speech from a corrupted signal, plays an important role in digital speech signal processing. Approaches to speech enhancement vary according to the type of degradation and noise in the speech signal. Thus, the research topic remains challenging in practice, specifically when dealing with highly non-stationary noise and reverberation. Recent advances in deep learning technologies have provided great support for progress in the speech enhancement research field. Deep learning has been known to outperform the statistical models used in conventional speech enhancement. Hence, it deserves a dedicated survey. In this review, we describe the advantages and disadvantages of recent deep learning approaches. We also discuss challenges and trends of this field. From the reviewed works, we conclude that the trend of the deep learning architecture has shifted from the standard deep neural network (DNN) to convolutional neural network ...