An investigation into the reliability of speaker recognition schemes: analysing the impact of environmental factors utilising deep learning techniques

Speaker Recognition through Deep Learning Techniques

Periodica Polytechnica Electrical Engineering and Computer Science

Deep learning has become an integral part of today's world, and the field has advanced rapidly. Because of its extensive use and fast growth, deep learning has captured the attention of researchers in speaker recognition. A detailed investigation of the process is therefore essential and helpful to researchers designing robust applications in speaker recognition, covering both speaker verification and identification. This paper reviews the field of speaker recognition in light of the recent deep learning advances that have boosted this technology. The paper proceeds as a systematic review, first giving a basic idea of deep learning, its architecture, and its fields of application, and then moving to the main focus of the paper, speaker recognition, which is one of the important applications of deep learning. Here we have mentioned its types, different proce...

Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition

Speaker recognition performs very well on a clean dataset or when there is no mismatch between the training and test sets. However, performance degrades with noise, channel variation, and physical and behavioral changes in the speaker. Studies have confirmed that features which represent speech on the Equivalent Rectangular Bandwidth (ERB) scale are more robust than Mel-scale features at low signal-to-noise ratio (SNR) levels. The Gammatone Frequency Cepstral Coefficient (GFCC), which represents speech on the ERB scale, is widely used in classical machine learning based speaker recognition under noisy conditions. Recently, deep learning models have been widely applied in speaker recognition and show better performance than classical machine learning. Previous deep learning based speaker recognition models used the Mel spectrogram as input rather than hand-crafted features. However, the performance of the Mel spectrogram degrades drastically at low SNR levels because it represents speech on the Mel scale. Cochle...
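To make the scale comparison above concrete, the following minimal sketch converts frequencies to the Mel scale and to the ERB-rate scale and computes an SNR in decibels. The formulas are commonly used textbook forms (the 2595/700 Mel formula and the Glasberg and Moore ERB-rate approximation); the abstract does not say which variants the authors used, so the constants and function names here are assumptions for illustration only.

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the Mel scale (common 2595/700 form; assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale (Glasberg-Moore form; assumed)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from two equal-length sample sequences."""
    if len(signal) != len(noise) or not signal:
        raise ValueError("signal and noise must be non-empty and equal length")
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise power is zero")
    return 10.0 * math.log10(p_signal / p_noise)

if __name__ == "__main__":
    # Compare how the two scales space a few frequencies.
    for f in (100.0, 1000.0, 4000.0, 8000.0):
        print(f, round(hz_to_mel(f), 1), round(hz_to_erb_rate(f), 2))
```

Both scales are roughly logarithmic at high frequencies; they differ mainly in how finely they resolve the low-frequency region, which is where the robustness argument in the abstract applies.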

Machine Learning for Speaker Recognition

2020

In the last years, many methods have been developed and deployed for real-world biometric applications and multimedia information systems. Machine learning has been playing a crucial role in these applications where the model parameters could be learned and the system performance could be optimized. As for speaker recognition, researchers and engineers have been attempting to tackle the most difficult challenges: noise robustness and domain mismatch. These efforts have now been fruitful, leading to commercial products starting to emerge, e.g., voice authentication for e-banking and speaker identification in smart speakers. Research in speaker recognition has traditionally been focused on signal processing (for extracting the most relevant and robust features) and machine learning (for classifying the features). Recently, we have witnessed the shift in the focus from signal processing to machine learning. In particular, many studies have shown that model adaptation can address both robustness and domain mismatch. As for robust feature extraction, recent studies also demonstrate that deep learning and feature learning can be a great alternative to traditional signal processing algorithms. This book has two perspectives: machine learning and speaker recognition. The machine learning perspective gives readers insights on what makes state-of-the-art systems perform so well. The speaker recognition perspective enables readers to apply machine learning techniques to address practical issues (e.g., robustness under adverse acoustic environments and domain mismatch) when deploying speaker recognition systems. The theories and practices of speaker recognition are tightly connected in the book. This book covers different components in speaker recognition including front-end feature extraction, back-end modeling, and scoring.
A range of learning models are detailed, from Gaussian mixture models, support vector machines, joint factor analysis, and probabilistic linear discriminant analysis (PLDA) to deep neural networks (DNN). The book also covers various learning algorithms, from Bayesian learning, unsupervised learning, discriminative learning, transfer learning, manifold learning, and adversarial learning to deep learning. A series of case studies and modern models based on PLDA and DNN are addressed. In particular, different variants of deep models and their solutions to different problems in speaker recognition are presented. In addition, the book highlights some of the new trends and directions for speaker recognition based on deep

Evaluating Deep Learning-Based Speaker Verification Systems: A Comparative Study Across Open-Source and Forensic Datasets

Speaker verification (SV) is the process of verifying whether the speech in two audio signals originates from the same speaker or from different speakers. Current state-of-the-art SV systems are based on deep neural networks, predominantly trained using the VoxCeleb dataset. This may lead to varying SV performance when using the models for inference on real-world data. To research these possible variations in performance, three established SV models, namely the ECAPA-TDNN, ResNet and WavLM, are evaluated on the UCLA variability, CommonVoice, FRIDA and Wyred datasets. The ECAPA-TDNN and ResNet models are found to perform slightly worse when compared with the VoxCeleb evaluation results, while the WavLM model performs significantly worse. The ResNet model shows the best performance on all four datasets. After evaluation, the ResNet model is improved by fine-tuning the model on the UCLA dataset and, further, by creating a Deep Weight Space Ensemble (WSE) model between the pre-trained and fine-tuned models. Among the pre-trained, fine-tuned and WSE models, the WSE model has the best overall performance, attaining the best scores on the UCLA test set. Its scores for the other three datasets show a smaller decrease than those of the fine-tuned model. This indicates that fine-tuning with WSE can alleviate the loss in model performance on real-world data.
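The weight space ensemble mentioned above can be sketched as follows, assuming it is a linear interpolation between the parameters of the pre-trained and fine-tuned models. The abstract does not state the mixing rule or coefficient, so `alpha`, the function name, and the dictionary-of-lists parameter layout are hypothetical choices for illustration, not the authors' implementation.

```python
def interpolate_weights(pretrained, finetuned, alpha=0.5):
    """Blend two parameter sets: (1 - alpha) * pretrained + alpha * finetuned.

    Both arguments map parameter names to flat lists of floats
    (a stand-in for real model tensors; assumed layout).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("models must have the same parameter names")
    blended = {}
    for name in pretrained:
        p, f = pretrained[name], finetuned[name]
        if len(p) != len(f):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(p, f)]
    return blended

if __name__ == "__main__":
    pre = {"layer.weight": [0.0, 2.0]}
    fine = {"layer.weight": [1.0, 4.0]}
    print(interpolate_weights(pre, fine, alpha=0.5))
```

With `alpha=0` the result is the pre-trained model and with `alpha=1` it is the fine-tuned model; intermediate values trade in-domain gains against retained performance on the original data.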

Investigation of Deep Neural Network for Speaker Recognition

International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2020

In this paper, deep neural networks are investigated for speaker recognition. Deep neural networks (DNN) have recently been proposed for this task. However, many architectural choices and training aspects involved in building such systems have not been studied carefully. We perform several experiments on datasets consisting of 10 speakers, 100 speakers and 300 speakers, with a complete training set of about 120 hours, to evaluate the effect of such choices. The models were evaluated on testing data from 10, 100 and 300 speakers with 2.5 hours for every speaker utterance. In our results, we compare the accuracy of GMM, GMM+UBM and i-vectors, as well as the time taken by the various modelling techniques. The DNN also outperforms the fundamental models, indicating the effectiveness of the DNN mechanism.

An Approach for Identification of Speaker using Deep Learning

International Journal of Artificial Intelligence & Mathematical Sciences

The volume of audio data is increasing daily across the world with the growth of telephone conversations, video conferences, podcasts and voice notes. This study presents a mechanism for identifying a speaker in an audio file, based on biometric features of the human voice such as frequency, amplitude, and pitch. We propose an unsupervised learning model that uses wav2vec 2.0, where the model learns speech representations from the dataset provided. We used the LibriSpeech dataset in our research and achieved an error rate of 1.8.

Deep Learning for Speaker Identification: Architectural Insights from AB-1 Corpus Analysis and Performance Evaluation

2024

In the fields of security systems, forensic investigations, and personalized services, the importance of speech as a fundamental human input outweighs text-based interactions. This research delves deeply into the complex field of Speaker Identification (SID), examining its essential components and emphasising Mel Spectrogram and Mel Frequency Cepstral Coefficients (MFCC) for feature extraction. Moreover, this study evaluates six slightly distinct model architectures using extensive analysis to evaluate their performance, with hyperparameter tuning applied to the best-performing model. This work performs a linguistic analysis to verify accent and gender accuracy, in addition to bias evaluation within the AB-1 Corpus dataset.

The 2016 NIST Speaker Recognition Evaluation

Interspeech 2017, 2017

In 2018, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). SRE18 was organized in a similar manner to SRE16, focusing on speaker detection over conversational telephony speech (CTS) collected outside North America. SRE18 also featured several new aspects, including two new data domains, namely voice over internet protocol (VoIP) and audio extracted from amateur online videos (AfV), as well as a new language (Tunisian Arabic). A total of 78 organizations (forming 48 teams) from academia and industry participated in SRE18 and submitted 129 valid system outputs under fixed and open training conditions first introduced in SRE16. This paper presents an overview of the evaluation and several analyses of system performance for all primary conditions in SRE18. The evaluation results suggest that 1) speaker recognition on AfV was more challenging than on telephony data, 2) speaker representations (aka embeddings) extracted using end-to-end neural network frameworks were most effective, 3) top performing systems exhibited similar performance, and 4) greatest performance improvements were largely due to data augmentation, use of extended and more complex models for data representation, as well as effective use of the provided development sets.

Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network

IEEE Access

Speaker identification refers to the process of recognizing human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is an important task to accurately identify speakers. Various features for speaker identification have been recently proposed by researchers. Most studies on speaker identification have utilized short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), due to their capability to capture the repetitive nature and efficiency of signals. Various studies have shown the effectiveness of MFCC features in correctly identifying speakers. However, the performances of these features degrade on complex speech datasets, and therefore, these features fail to accurately identify speaker characteristics. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the effectiveness of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with DNN outperformed existing baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, DNN obtained better classification results compared with five machine learning algorithms that were recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification. The experimental results showed that two-level classification presented better results than one-level classification. 
The proposed features and classification model for identifying a speaker can be widely applied to different types of speaker datasets.
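The feature fusion idea above can be illustrated with a minimal sketch that appends time-domain descriptors to a precomputed MFCC vector for one frame. The abstract does not list which time-domain features make up MFCCT, so the zero-crossing rate and RMS energy used here, along with all function names, are assumptions chosen only to show the concatenation step.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root-mean-square amplitude of the frame."""
    if not frame:
        raise ValueError("frame must be non-empty")
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def fuse_features(mfcc_vector, frame):
    """Concatenate MFCCs (assumed precomputed) with time-domain features."""
    return list(mfcc_vector) + [zero_crossing_rate(frame), rms_energy(frame)]

if __name__ == "__main__":
    mfcc = [0.1, 0.2]                 # placeholder values, not real MFCCs
    frame = [1.0, -1.0, 1.0, -1.0]    # toy waveform frame
    print(fuse_features(mfcc, frame))
```

The fused vector would then be the per-frame input to a classifier such as the DNN the abstract describes; in practice the two feature groups usually need normalization so that neither dominates.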

ASVtorch toolkit: Speaker verification with deep neural networks

SoftwareX

The human voice differs substantially between individuals. This facilitates automatic speaker verification (ASV), that is, recognizing a person from his/her voice. ASV accuracy has substantially increased throughout the past decade due to recent advances in machine learning, particularly deep learning methods. An unfortunate downside has been the substantially increased complexity of ASV systems. To help non-experts kick-start reproducible ASV development, a state-of-the-art toolkit implementing various ASV pipelines and functionalities is required. To this end, we introduce a new open-source toolkit, ASVtorch, implemented in Python using the widely used PyTorch machine learning framework.
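As a rough illustration of the verification decision itself, the sketch below scores two speaker embeddings with cosine similarity and compares the score with a threshold. This is not ASVtorch's API, which the abstract does not show; the function names, the toy embeddings, and the threshold of 0.5 are hypothetical, and real systems tune the threshold on held-out trials.

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(emb_a) != len(emb_b) or not emb_a:
        raise ValueError("embeddings must be non-empty and equal length")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial when the score reaches the (assumed) threshold."""
    return cosine_score(emb_a, emb_b) >= threshold

if __name__ == "__main__":
    enrolled = [1.0, 0.0]   # toy enrollment embedding
    test = [2.0, 0.0]       # toy test embedding, same direction
    print(cosine_score(enrolled, test), same_speaker(enrolled, test))
```

Raising the threshold lowers false acceptances at the cost of more false rejections, which is the trade-off that verification metrics summarize.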