Audio enhancement with DNN autoencoders for speaker recognition

How to Leverage DNN-based speech enhancement for multi-channel speaker verification?

Cornell University - arXiv, 2022

Speaker verification (SV) suffers from unsatisfactory performance in far-field scenarios due to environmental noise and the adverse impact of room reverberation. This work presents a benchmark of multichannel speech enhancement for far-field speaker verification. One approach is purely deep neural network-based, and the other combines a deep neural network with signal processing. We integrated a DNN architecture with signal processing techniques to carry out various experiments, and compared our approach to existing state-of-the-art approaches. We examine the importance of pre-processing the enrollment data, which has been largely overlooked in previous studies. Experimental evaluation shows that pre-processing can improve SV performance as long as the enrollment files are processed similarly to the test data and the test and enrollment utterances fall within similar SNR ranges. Considerable improvement is obtained on the generated data and on all noise conditions of the VOiCES dataset.
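
A minimal sketch of the matched pre-processing point above: the same enhancement front-end is applied to both the enrollment and the test recordings before embedding extraction and cosine scoring. The function names (`enhance`, `extract_embedding`) are placeholders, not the paper's actual components.

```python
# Minimal sketch (not the paper's code): score a trial with the same
# enhancement front-end applied to the enrollment and test recordings.
# `enhance` and `extract_embedding` are hypothetical callables.
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trial(enroll_wav, test_wav, enhance, extract_embedding, match_enroll=True):
    """Enhance the test side, and optionally the enrollment side too, so that
    both embeddings are extracted from similarly processed audio."""
    test_emb = extract_embedding(enhance(test_wav))
    enroll_in = enhance(enroll_wav) if match_enroll else enroll_wav
    enroll_emb = extract_embedding(enroll_in)
    return cosine_score(enroll_emb, test_emb)
```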

On autoencoders in the i-vector space for speaker recognition

Odyssey 2016, 2016

We present a detailed empirical investigation of the speaker verification system based on a denoising autoencoder (DAE) in the i-vector space, first proposed in [1]. This paper describes the system and discusses practical issues of its training. The aim of this investigation is to study the properties of the DAE in the i-vector space and to analyze different strategies for initializing and training the back-end parameters. We also propose several improvements to the system to increase its accuracy. Finally, we demonstrate the potential of the proposed system in the case of domain mismatch: it achieves a considerable gain in performance compared to the baseline system in the unsupervised domain adaptation scenario on the NIST 2010 SRE task.
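
As an illustration of the underlying idea, the sketch below shows a small denoising autoencoder over i-vectors trained with an MSE objective in PyTorch. Layer sizes, activations, and the training step are assumptions for illustration, not the configuration described in the paper.

```python
# Illustrative PyTorch denoising autoencoder over i-vectors; layer sizes and
# the MSE training step are assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IVectorDAE(nn.Module):
    def __init__(self, dim: int = 400, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def train_step(model, optimizer, noisy_ivec, clean_ivec):
    """One step mapping corrupted i-vectors to their clean counterparts."""
    optimizer.zero_grad()
    loss = F.mse_loss(model(noisy_ivec), clean_ivec)
    loss.backward()
    optimizer.step()
    return loss.item()
```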

A Comprehensive Exploration of Noise Robustness and Noise Compensation in ResNet and TDNN-based Speaker Recognition Systems

2022 30th European Signal Processing Conference (EUSIPCO)

In this paper, a comprehensive exploration of the noise robustness and noise compensation of ResNet and TDNN speaker recognition systems is presented. First, the robustness of the TDNN and ResNet systems in the presence of noise, reverberation, and both distortions combined is explored. Our experimental results show that in all cases the ResNet system is more robust than the TDNN. Then, noise compensation is performed with a denoising autoencoder (DAE) over the x-vectors extracted from both systems. We explore two scenarios: 1) compensation of artificial noise with artificial data, and 2) compensation of real noise with artificial data. The second case is the most desirable scenario, because it makes noise compensation affordable without requiring real data to train the denoising techniques. The experimental results show that in the first scenario noise compensation gives a significant improvement with the TDNN, while the improvement with the ResNet is not significant. In the second scenario, we achieve a 15% EER improvement on the VOiCES Eval challenge with both the TDNN and ResNet systems. In most cases the performance of the ResNet without compensation is superior to that of the TDNN with noise compensation.
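
Since the results above are reported as EER improvements, the following sketch shows one common way to compute EER from verification scores with scikit-learn, along with the relative improvement between a baseline and a compensated system; it is a generic utility, not code from the paper.

```python
# Generic EER utility (not from the paper): EER is the operating point where
# the false-acceptance and false-rejection rates are equal.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """labels: 1 for target trials, 0 for non-target trials."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

def relative_improvement(eer_baseline: float, eer_compensated: float) -> float:
    """Relative EER reduction in percent, as typically reported."""
    return 100.0 * (eer_baseline - eer_compensated) / eer_baseline
```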

Discriminative autoencoders for speaker verification

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper presents a learning and scoring framework based on neural networks for speaker verification. The framework employs an autoencoder as its primary structure, while three factors are jointly considered in the objective function for speaker discrimination. The first, relating to the sample reconstruction error, makes the structure essentially a generative model, which helps it learn the most salient and useful properties of the data. Operating on the middlemost hidden layer, the other two factors attempt to ensure that utterances spoken by the same speaker are mapped to similar identity codes in the speaker-discriminative subspace, where the dispersion of all identity codes is maximized to some extent so as to avoid over-concentration. Finally, the decision score of each utterance pair is simply computed as the cosine similarity of their identity codes. Working with utterances represented by i-vectors, experiments conducted on the male portion of the core task in the NIST 2010 Speaker Recognition Evaluation (SRE) demonstrate the merits of our approach over the conventional PLDA method.
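
The sketch below illustrates, in hedged form, a three-term objective of the kind described above: a reconstruction term, a term pulling same-speaker identity codes together, and a dispersion term discouraging code collapse, with trial scoring by cosine similarity of identity codes. The exact loss formulations and weights in the paper may differ.

```python
# Hedged sketch of a three-term discriminative autoencoder objective and
# cosine-based trial scoring; weights and exact terms are assumptions.
import torch
import torch.nn.functional as F

def discriminative_ae_loss(x, x_hat, codes, speaker_ids, w_same=1.0, w_disp=0.1):
    # (1) generative term: reconstruct the input i-vector
    recon = F.mse_loss(x_hat, x)
    # (2) same-speaker term: identity codes of a speaker stay near its mean code
    speakers = torch.unique(speaker_ids)
    same = sum(((codes[speaker_ids == s] - codes[speaker_ids == s].mean(0)) ** 2)
               .sum(1).mean() for s in speakers) / len(speakers)
    # (3) dispersion term: discourage all codes from collapsing to one point
    disp = -codes.var(dim=0).mean()
    return recon + w_same * same + w_disp * disp

def trial_score(code_a, code_b):
    # decision score: cosine similarity of the two identity codes
    return F.cosine_similarity(code_a, code_b, dim=-1)
```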

Data augmentation versus noise compensation for x-vector speaker recognition systems in noisy environments

2020 28th European Signal Processing Conference (EUSIPCO), 2021

The explosion of available speech data and new speaker modeling methods based on deep neural networks (DNNs) have made it possible to develop more robust speaker recognition systems. Among DNN speaker modeling techniques, the x-vector system has shown a degree of robustness in noisy environments. Previous studies suggest that more robust speaker recognition systems in noisy environments are achievable by increasing the number of speakers in the training data and using data augmentation. In this work, we want to know whether explicit noise compensation techniques remain effective despite the general noise robustness of these systems. For this study, we use two different x-vector networks: the first is trained on VoxCeleb1 (Protocol1), and the second is trained on VoxCeleb1+VoxCeleb2 (Protocol2). We propose to add a denoising x-vector subsystem before scoring. Experimental results show that the x-vector system used in Protocol2 is more robust than the one used in Protocol1. Despite this observation, we show that explicit noise compensation gives almost the same relative EER gain in both protocols. For example, in Protocol2 we obtain a 21% to 66% EER improvement with the denoising techniques.
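
The snippet below only sketches where such a denoising x-vector subsystem would sit in the pipeline: extracted x-vectors are passed through a trained denoiser before scoring. The denoiser stands for any embedding-level compensation model, and cosine similarity stands in here for the back-end scorer.

```python
# Sketch of the pipeline placement only: x-vectors are denoised before scoring.
# `denoiser` is any embedding-level model; cosine stands in for the scorer.
import torch
import torch.nn.functional as F

@torch.no_grad()
def denoised_score(xvec_enroll: torch.Tensor, xvec_test: torch.Tensor, denoiser) -> torch.Tensor:
    return F.cosine_similarity(denoiser(xvec_enroll), denoiser(xvec_test), dim=-1)
```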

End-to-end recurrent denoising autoencoder embeddings for speaker identification

Neural Computing and Applications

Speech 'in-the-wild' is a handicap for speaker recognition systems due to the variability induced by real-life conditions, such as environmental noise and emotions in the speaker. Taking advantage of the principles of representation learning, we design a recurrent denoising autoencoder that extracts robust speaker embeddings from noisy spectrograms to perform speaker identification. The proposed end-to-end architecture uses a feedback loop to encode information about the speaker into low-dimensional representations extracted by a spectrogram denoising autoencoder. We apply data augmentation by additively corrupting clean speech with real-life environmental noise and make use of a database of real stressed speech. We show that joint optimization of the denoiser and speaker identification modules outperforms both independent optimization of the two modules and hand-crafted features under stress and noise distortions.
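
As a small illustration of the additive corruption used for augmentation, the sketch below mixes a noise segment into clean speech at a chosen SNR in dB; it is a generic recipe, not the paper's exact pipeline.

```python
# Generic additive-noise augmentation: mix a noise segment into clean speech
# at a target SNR (dB). Not the paper's exact pipeline.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(clean)]                      # align lengths (assumes noise is long enough)
    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```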

Denoising x-vectors for Robust Speaker Recognition

The Speaker and Language Recognition Workshop (Odyssey 2020), 2020

The use of deep learning methods has led to significant improvements in speaker recognition systems, and the introduction of x-vectors as a speaker modeling method has made these systems more robust. However, in challenging environments with noise and reverberation, the performance of x-vector systems still degrades significantly, so the demand for denoising techniques remains. In this paper, for the first time, we attempt to denoise the x-vector speaker embeddings themselves. Our focus is on additive noise. First, we use the i-MAP method, which assumes that both the noise and the clean x-vectors follow Gaussian distributions. Then, using denoising autoencoders (DAEs), we try to reconstruct the clean x-vector from the corrupted version. After that, we propose two hybrid systems composed of the statistical i-MAP and the DAE. Finally, we propose a novel DAE architecture, named Deep Stacked DAE, composed of several DAEs, where each DAE receives as input the output of its predecessor concatenated with the difference between the noisy x-vector and that predecessor's output. Experiments on the Fabiol corpus show that the hybrid DAE/i-MAP method outperforms the conventional DAE and i-MAP methods in several cases, and that Deep Stacked DAE is better than the other proposed methods in most cases. For utterances longer than 12 seconds we achieve a 51% EER improvement with Deep Stacked DAE, and for utterances shorter than 2 seconds Deep Stacked DAE gives an 18% improvement compared to the baseline system.
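
A hedged PyTorch sketch of the Deep Stacked DAE idea follows: each stage receives the previous stage's output concatenated with the difference between the original noisy x-vector and that output. Dimensions, depth, and the handling of the first stage (which sees a zero residual here) are assumptions.

```python
# Hedged PyTorch sketch of the Deep Stacked DAE structure; sizes, depth, and
# the zero residual fed to the first stage are assumptions for illustration.
import torch
import torch.nn as nn

class DeepStackedDAE(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024, n_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(n_stages)
        ])

    def forward(self, noisy_xvec: torch.Tensor) -> torch.Tensor:
        current = noisy_xvec
        for stage in self.stages:
            residual = noisy_xvec - current          # noisy input minus predecessor output
            current = stage(torch.cat([current, residual], dim=-1))
        return current
```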

Compensate multiple distortions for speaker recognition systems

2021 29th European Signal Processing Conference (EUSIPCO), 2021

The performance of speaker recognition systems degrades dramatically in severe conditions in the presence of additive noise and/or reverberation. In some cases there is only one kind of domain mismatch, such as additive noise or reverberation, but in many cases there is more than one distortion. Finding a solution for domain adaptation in the presence of different distortions is a challenge. In this paper we investigate situations in which none, one, or more of the following distortions are present: early reverberation, full reverberation, and additive noise. We propose two configurations to compensate for these distortions, as sketched below. In the first, a specific denoising autoencoder is used for each distortion. In the second, a single denoising autoencoder compensates for all of these distortions simultaneously. Our experiments show that, when noise and reverberation coexist, the second configuration gives better results. For example, with the second configuration we obtained a 76.6% relative EER improvement for utterances longer than 12 seconds. For the other situations, with only one distortion present, the second configuration gives almost the same results as using a specific model for each distortion.
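
The following sketch contrasts the two configurations: per-distortion denoisers selected by condition versus a single shared denoiser trained on all distortions pooled together. The models and condition labels are placeholders.

```python
# Sketch of the two compensation configurations; the models and condition
# labels ("noise", "early_reverb", "full_reverb", ...) are placeholders.
def compensate(xvec, condition, specific_models=None, shared_model=None):
    if specific_models is not None:
        # configuration 1: one denoising autoencoder per distortion type
        return specific_models[condition](xvec)
    # configuration 2: a single autoencoder trained on all distortions pooled
    return shared_model(xvec)
```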

Speech Enhancement Based on Denoising Autoencoder With Multi-Branched Encoders

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, both closely related to model generalizability to noisy conditions: (1) mismatched noisy conditions during testing, i.e., performance is generally sub-optimal when models are tested with unseen noise types that are not included in the training data; and (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noise cannot optimally remove a specific noise type even though that noise type is included in the training data. These problems are common in real applications. In this article, we propose a novel denoising autoencoder with a multi-branched encoder (termed DAEME) model to deal with these two problems. The DAEME model involves two stages: training and testing. In the training stage, we build multiple component models to form a multi-branched encoder based on a dynamically-sized decision tree (DSDT). The DSDT is built based on prior knowledge of speech and noisy conditions (speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along a branch of the DSDT. Finally, a decoder is trained on top of the multi-branched encoder. In the testing stage, noisy speech is first processed by each component model. The multiple outputs from these models are then integrated by the decoder to determine the final enhanced speech. Experimental results show that DAEME is superior to several baseline models in terms of objective evaluation metrics, automatic speech recognition results, and quality in subjective human listening tests.
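
A simplified PyTorch sketch of the DAEME structure is given below: several branch encoders, each intended to specialize in one region of the noisy-condition tree, and a decoder that fuses all branch outputs into the final enhanced features. Feature dimension, branch count, and the concatenation-based fusion are illustrative assumptions.

```python
# Simplified PyTorch sketch of a multi-branched encoder with a fusing decoder;
# feature dimension, branch count, and concatenation fusion are assumptions.
import torch
import torch.nn as nn

class DAEMESketch(nn.Module):
    def __init__(self, feat_dim: int = 257, hidden: int = 512, n_branches: int = 4):
        super().__init__()
        # each branch is meant to specialize in one region of the condition tree
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))
            for _ in range(n_branches)
        ])
        # decoder trained on top of the concatenated branch outputs
        self.decoder = nn.Sequential(
            nn.Linear(n_branches * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim)
        )

    def forward(self, noisy_frames: torch.Tensor) -> torch.Tensor:
        outputs = [branch(noisy_frames) for branch in self.branches]
        return self.decoder(torch.cat(outputs, dim=-1))
```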

Bottleneck features from SNR-adaptive denoising deep classifier for speaker identification

2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2015

In this paper, we explore the potential of using deep learning to extract speaker-dependent features for noise-robust speaker identification. More specifically, an SNR-adaptive denoising classifier is constructed by stacking two layers of restricted Boltzmann machines (RBMs) on top of a denoising deep autoencoder, where the top RBM layer is connected to a soft-max output layer that outputs the posterior probabilities of speakers, and the top RBM layer also provides speaker-dependent bottleneck features. Both the deep autoencoder and the RBMs are trained by contrastive divergence, followed by backpropagation fine-tuning. The autoencoder aims to reconstruct the clean spectra of a noisy test utterance using the spectra of the noisy test utterance and its SNR as input. With this denoising capability, the output from the bottleneck layer of the classifier can be considered a low-dimensional representation of denoised utterances. These frame-based bottleneck features are then used to train an i-vector extractor and a PLDA model for speaker identification. Experimental results based on a noisy YOHO corpus show that the bottleneck features slightly outperform conventional MFCCs under low SNR conditions and that fusion of the two features leads to further performance gains, suggesting that the two features are complementary.
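
The sketch below gives a simplified feed-forward view of the fine-tuned classifier: noisy spectra concatenated with the utterance SNR pass through a denoising block and a bottleneck layer whose activations serve as speaker-dependent features, with a soft-max head producing speaker posteriors. RBM pre-training with contrastive divergence is omitted, and all layer sizes are assumptions.

```python
# Simplified feed-forward view (RBM pre-training omitted); sizes are assumptions.
import torch
import torch.nn as nn

class SNRAdaptiveBottleneck(nn.Module):
    def __init__(self, spec_dim: int = 257, bottleneck: int = 64, n_speakers: int = 100):
        super().__init__()
        # denoising part: noisy spectrum + SNR value -> estimate of clean spectrum
        self.denoise = nn.Sequential(nn.Linear(spec_dim + 1, 1024), nn.Sigmoid(),
                                     nn.Linear(1024, spec_dim))
        # classifier part with a narrow bottleneck layer
        self.to_bottleneck = nn.Sequential(nn.Linear(spec_dim, 512), nn.Sigmoid(),
                                           nn.Linear(512, bottleneck), nn.Sigmoid())
        self.softmax_head = nn.Linear(bottleneck, n_speakers)

    def forward(self, noisy_spec: torch.Tensor, snr: torch.Tensor):
        x = torch.cat([noisy_spec, snr.unsqueeze(-1)], dim=-1)
        features = self.to_bottleneck(self.denoise(x))   # frame-level bottleneck features
        return self.softmax_head(features), features
```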