Speaker identification under noisy conditions using hybrid convolutional neural network and gated recurrent unit

Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition

Speaker recognition performs very well on clean datasets, or when there is no mismatch between the training and test sets. However, performance degrades with noise, channel variation, and physical and behavioral changes in the speaker. Studies have confirmed that features representing speech on the Equivalent Rectangular Bandwidth (ERB) scale are more robust than Mel-scale features at low Signal-to-Noise Ratio (SNR) levels. The Gammatone Frequency Cepstral Coefficient (GFCC), which represents speech on the ERB scale, is widely used in classical machine learning based speaker recognition under noisy conditions. Recently, deep learning models have been widely applied to speaker recognition and show better performance than classical machine learning. Previous deep learning based speaker recognition models used the Mel spectrogram as input rather than hand-crafted features. However, the performance of the Mel spectrogram degrades drastically at low SNR levels because it represents speech on the Mel scale. Cochle...
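A minimal sketch of the noise-robustness setup this abstract describes, assuming librosa for feature extraction; the file name, sample rate, and spectrogram parameters are placeholders, and the cochleogram variant would swap the Mel filterbank for a gammatone (ERB-scaled) filterbank, which is not shown here.

import numpy as np
import librosa

def add_noise_at_snr(clean: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix white noise into `clean` so the result has roughly `snr_db` dB SNR."""
    noise = np.random.randn(len(clean))
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file
y_noisy = add_noise_at_snr(y, snr_db=0)           # 0 dB: noise as strong as speech

# Log-Mel spectrogram used as the 2-D input to the deep model.
mel = librosa.feature.melspectrogram(y=y_noisy, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)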

Speaker Identification Using a Convolutional Neural Network

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 2022

Speech, a mode of communication between humans and machines, has various applications, including biometric systems for identifying people who have access to secure systems. Feature extraction is an important factor in achieving highly accurate speech recognition. We therefore used spectrograms, pictorial representations of speech derived from the raw signal, to identify speakers. These features were fed into a convolutional neural network (CNN), and a CNN-Visual Geometry Group (CNN-VGG) architecture was used to recognize the speakers. We used 780 primary data samples from 78 speakers, each of whom uttered a number in Bahasa Indonesia. The proposed architecture, CNN-VGG-f, was trained with a learning rate of 0.001, a batch size of 256, and 100 epochs. The results indicate that this architecture can generate a suitable model for speaker identification. A spectrogram was used to determine the best features for identifying the speakers. The proposed method exhibited an accuracy of 98.78%, whic...
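A minimal sketch of a VGG-style CNN speaker classifier over spectrogram inputs, written in PyTorch; this is not the authors' exact CNN-VGG-f architecture, and the layer widths and input handling are assumptions, but the quoted hyperparameters (78 speakers, learning rate 0.001) are taken from the abstract.

import torch
import torch.nn as nn

class VGGStyleSpeakerNet(nn.Module):
    def __init__(self, n_speakers: int = 78):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_speakers),
        )

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x))

model = VGGStyleSpeakerNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate from the abstract
criterion = nn.CrossEntropyLoss()
# The training loop (batch size 256, 100 epochs per the abstract) would iterate
# over a DataLoader of (spectrogram, speaker_id) pairs and minimise `criterion`.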

An Approach for Identification of Speaker using Deep Learning

International Journal of Artificial Intelligence & Mathematical Sciences

The volume of audio data is increasing daily across the world with the growth of telephonic conversations, video conferences, podcasts, and voice notes. This study presents a mechanism for identifying the speaker in an audio file based on biometric features of the human voice such as frequency, amplitude, and pitch. We propose an unsupervised learning model based on wav2vec 2.0, in which the model learns speech representations from the provided dataset. We used the LibriSpeech dataset in our research and achieved an error rate of 1.8.
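A minimal sketch of extracting speaker representations with wav2vec 2.0, assuming the HuggingFace implementation; the checkpoint name and the mean-pooling step are assumptions rather than details given in the abstract.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def speaker_embedding(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Return a mean-pooled wav2vec 2.0 embedding for a 16 kHz mono waveform."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_frames, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

# Two utterances from the same speaker should yield embeddings with higher
# cosine similarity than utterances from different speakers.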

Speaker Recognition through Deep Learning Techniques

Periodica Polytechnica Electrical Engineering and Computer Science

Deep learning has become an integral part of today's world, and the field has seen enormous advancement. Owing to its extensive use and rapid growth, deep learning has captured the attention of researchers in the field of speaker recognition. A detailed investigation of the process is therefore essential and helpful to researchers designing robust speaker recognition applications, in both speaker verification and identification. This paper reviews the field of speaker recognition in light of the deep learning advances that are boosting this technology. The review first gives a basic idea of deep learning, its architectures, and its fields of application, then turns to the highlighted portion of the paper, i.e., speaker recognition, one of the important applications of deep learning. Here we mention its types, different proce...

Investigation of Deep Neural Network for Speaker Recognition

International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2020

In this paper, deep neural networks are investigated for speaker recognition. Deep neural networks (DNNs) have recently been proposed for this task; however, many of the architectural choices and training aspects involved in building such systems have not been studied carefully. We perform several experiments on datasets of 10, 100, and 300 speakers, with about 120 hours of training data in total, to evaluate the effect of such choices. Models were evaluated on test data from 10, 100, and 300 speakers, with 2.5 hours of utterances per speaker. In our results, we compare the accuracy of GMM, GMM-UBM, and i-vector systems, as well as the time taken by the various modelling techniques. The DNN outperforms these baseline models, indicating the effectiveness of the DNN approach.
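A minimal sketch of the classical GMM baseline that the abstract compares against: one Gaussian mixture per speaker fitted on that speaker's feature frames, with identification by the highest average log-likelihood. Feature extraction, the GMM-UBM and i-vector variants, and the DNN counterpart are omitted; the array shapes and component count are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(features_by_speaker: dict,
                       n_components: int = 16) -> dict:
    """features_by_speaker maps speaker id -> (n_frames, n_features) array."""
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type="diag").fit(feats)
            for spk, feats in features_by_speaker.items()}

def identify(test_features: np.ndarray, gmms: dict) -> str:
    """Return the speaker whose GMM gives the highest mean log-likelihood."""
    return max(gmms, key=lambda spk: gmms[spk].score(test_features))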

Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network

IEEE Access

Speaker identification refers to the process of recognizing human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is an important task to accurately identify speakers. Various features for speaker identification have been recently proposed by researchers. Most studies on speaker identification have utilized short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), due to their capability to capture the repetitive nature and efficiency of signals. Various studies have shown the effectiveness of MFCC features in correctly identifying speakers. However, the performances of these features degrade on complex speech datasets, and therefore, these features fail to accurately identify speaker characteristics. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the effectiveness of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with DNN outperformed existing baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, DNN obtained better classification results compared with five machine learning algorithms that were recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification. The experimental results showed that two-level classification presented better results than one-level classification. The proposed features and classification model for identifying a speaker can be widely applied to different types of speaker datasets.
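A minimal sketch of one plausible MFCC plus time-domain fusion using librosa; the abstract does not spell out which time-domain features are used, so zero-crossing rate and RMS energy are assumptions. The fused frame-level features would then be fed to a DNN classifier.

import numpy as np
import librosa

def mfcct_features(path: str, n_mfcc: int = 20) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    zcr = librosa.feature.zero_crossing_rate(y)               # (1, T)
    rms = librosa.feature.rms(y=y)                             # (1, T)
    # Align frame counts defensively before fusing.
    T = min(mfcc.shape[1], zcr.shape[1], rms.shape[1])
    fused = np.vstack([mfcc[:, :T], zcr[:, :T], rms[:, :T]])   # (n_mfcc + 2, T)
    return fused.T                                              # (T, n_mfcc + 2)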

Speaker Verification Using Deep Neural Networks: A Review

International Journal of Machine Learning and Computing

Speaker verification involves examining the speech signal to authenticate the claim of a speaker as true or false. Deep neural networks are among the most successful implementations of complex non-linear models for learning unique and invariant features from data.

Automatic Speaker Recognition using Transfer Learning Approach of Deep Learning Models

2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021

Speaker recognition has been one of the most interesting yet challenging problems in the fields of machine learning and artificial intelligence. It is used in human voice authentication for security purposes and in identifying a person within a group of speakers. It is a grinding task to teach a machine the differences between human voices when speakers come from different backgrounds, such as gender, language, and accent. In this paper, we use a deep learning approach to build and train two models, an ANN and a CNN, and compare their results. The former is a neural network fed with diverse features extracted from the audio collection. The latter is a convolutional neural network trained on spectrograms. Finally, we apply transfer learning to both to obtain viable output using less data.
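A minimal sketch of the transfer-learning idea on spectrograms; the abstract does not name the pretrained backbone, so VGG16 from torchvision is an assumption: reuse frozen ImageNet convolutional filters and retrain only the output head on spectrogram images of the target speakers.

import torch.nn as nn
from torchvision import models

def build_transfer_model(n_speakers: int) -> nn.Module:
    backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
    for p in backbone.features.parameters():
        p.requires_grad = False                            # keep pretrained filters
    backbone.classifier[6] = nn.Linear(4096, n_speakers)   # new output head
    return backbone

# Spectrograms would be resized to 224x224 and replicated to 3 channels so
# they match the backbone's expected image input.
model = build_transfer_model(n_speakers=30)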

An investigation into the reliability of speaker recognition schemes: analysing the impact of environmental factors utilising deep learning techniques

Journal of Engineering and Applied Science, 2024

This paper studies the performance and reliability of deep learning-based speaker recognition schemes under various recording situations and in the presence of background noise. The study uses the Speaker Recognition Dataset available on the Kaggle website, comprising audio recordings from different speakers, and considers four scenarios with various combinations of speakers. In the first scenario, without outside noise, the scheme achieves high accuracy and strong discriminating capability in identifying speakers, with an area under the ROC curve of roughly 1. In the second scenario, with background noise added to the recordings, accuracy decreases and misclassifications increase; however, the scheme still shows good discriminating power, with ROC areas ranging from 0.77 to 1.
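A minimal sketch of the reported evaluation metric, assuming a binary target-versus-impostor trial setup: area under the ROC curve computed from per-trial scores, repeated for the clean and noisy scenarios. The scores and labels below are placeholders, not data from the study.

import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([1, 1, 0, 0, 1, 0])  # 1 = target speaker, 0 = impostor
scores_clean = np.array([0.9, 0.8, 0.2, 0.1, 0.7, 0.3])
scores_noisy = np.array([0.7, 0.4, 0.6, 0.3, 0.5, 0.45])

print("clean AUC:", roc_auc_score(labels, scores_clean))  # 1.0: perfect separation
print("noisy AUC:", roc_auc_score(labels, scores_noisy))  # lower, as noise causes overlap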

Text-independent Speaker Verification Using Convolutional Deep Belief Network and Gaussian Mixture Model

2018

There has been much interest in new deep learning approaches for representing and extracting high-level features for audio processing. In this paper, a convolutional deep belief network was used to generate new speech features for text-independent speaker verification. The structure and parameters of the convolutional deep belief network are described, and new high-level speech features were extracted using the proposed method. The relevance of speaker verification systems for mobile authentication was considered. The Gaussian mixture model and universal background model speaker verification system used for the experiments is described. Speaker verification accuracy using the extracted features was evaluated on a 50-speaker set, and the results are presented. Different layers, and combinations of layers, of the convolutional deep belief network were used as features for text-independent speaker verification. The high-level features extracted by the convolutional deep belief network are illustrated and analyzed. Reasons of insuf...
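A minimal sketch of the GMM-UBM verification back-end described in this abstract; the convolutional deep belief network that produces the input features is omitted, and in a full system the speaker model would typically be MAP-adapted from the UBM rather than trained independently as assumed here.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features: np.ndarray,
              n_components: int = 64) -> GaussianMixture:
    """Fit the universal background model on features pooled over many speakers."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(pooled_features)

def verify(test_features: np.ndarray, speaker_gmm: GaussianMixture,
           ubm: GaussianMixture, threshold: float = 0.0) -> bool:
    """Accept the claim if the average log-likelihood ratio exceeds the threshold."""
    llr = speaker_gmm.score(test_features) - ubm.score(test_features)
    return llr > threshold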