Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition
Related papers
Journal of Engineering and Applied Science, 2024
This paper studies the performance and reliability of deep learning-based speaker recognition under various recording conditions and in the presence of background noise. The study uses the Speaker Recognition Dataset available on Kaggle, which contains audio recordings from different speakers, and considers four scenarios with different combinations of speakers. In the first scenario, without added noise, the scheme achieves strong discriminating capability and high accuracy in identifying speakers, with an area under the ROC curve of approximately 1. In the second scenario, with background noise added to the recordings, accuracy decreases and misclassifications increase; the scheme nevertheless retains good discriminating power, with ROC areas ranging from 0.77 to 1.
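The abstract reports per-scenario ROC areas. As a minimal sketch of how such one-vs-rest, per-speaker AUCs might be computed from a classifier's outputs, assuming scikit-learn is available; `scores` and `labels` here are hypothetical stand-ins for a trained model's softmax scores on a test split, not the paper's data:

```python
# Minimal sketch: per-speaker one-vs-rest ROC/AUC from softmax scores.
# `scores` and `labels` are synthetic placeholders for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_samples, n_speakers = 200, 4
scores = rng.random((n_samples, n_speakers))          # softmax-like scores
scores /= scores.sum(axis=1, keepdims=True)
labels = rng.integers(0, n_speakers, size=n_samples)  # true speaker ids

# One binary ROC area per speaker, the quantity the abstract reports
# (near 1.0 on clean speech, 0.77-1.0 with added noise).
y_bin = label_binarize(labels, classes=range(n_speakers))
for spk in range(n_speakers):
    auc = roc_auc_score(y_bin[:, spk], scores[:, spk])
    print(f"speaker {spk}: AUC = {auc:.3f}")
```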
Investigation of Deep Neural Network for Speaker Recognition
International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2020
In this paper, deep neural networks (DNNs) are investigated for speaker recognition. DNNs have recently been proposed for this task; however, many of the architectural choices and training aspects involved in building such systems have not been studied carefully. We perform several experiments on datasets of 10, 100, and 300 speakers, with total training data of about 120 hours, to evaluate the effect of these choices. Models were evaluated on test data for the 10-, 100-, and 300-speaker sets, with 2.5 hours of utterances per speaker. In our results, we compare the accuracy of GMM, GMM-UBM, and i-vector systems, as well as the time taken by the various modelling techniques. The DNN outperforms these baseline models, indicating the effectiveness of the DNN approach.
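As a rough illustration of the kind of GMM-versus-DNN comparison the abstract describes (not the paper's exact setup), the following sketch fits one Gaussian mixture per speaker and a small fully connected classifier on the same features, using scikit-learn; the data here is synthetic, whereas the paper uses real speech features:

```python
# Illustrative sketch: GMM baseline vs. a small DNN on synthetic
# MFCC-like feature vectors, one class per speaker.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_speakers, dim = 10, 20
X_train = np.vstack([rng.normal(i, 1.0, (100, dim)) for i in range(n_speakers)])
y_train = np.repeat(np.arange(n_speakers), 100)
X_test = np.vstack([rng.normal(i, 1.0, (20, dim)) for i in range(n_speakers)])
y_test = np.repeat(np.arange(n_speakers), 20)

# GMM baseline: one mixture per speaker, classify by max log-likelihood.
gmms = [GaussianMixture(n_components=4, random_state=0).fit(X_train[y_train == s])
        for s in range(n_speakers)]
ll = np.column_stack([g.score_samples(X_test) for g in gmms])
gmm_acc = (ll.argmax(axis=1) == y_test).mean()

# DNN: a small fully connected classifier over the same features.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
dnn.fit(X_train, y_train)
print(f"GMM acc: {gmm_acc:.3f}, DNN acc: {dnn.score(X_test, y_test):.3f}")
```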
The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space, 2023
A literature analysis identified the main methods for speaker identification from speech signals: statistical methods based on the Gaussian mixture model and a universal background model, and neural network methods, in particular convolutional or Siamese neural networks. The main characteristics of these methods are recognition performance, the number of parameters, and the training time. High recognition performance is achieved with convolutional neural networks, but their number of parameters is much higher than for statistical methods, although lower than for Siamese neural networks. A large number of parameters requires a large training set, which is not always available to the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important for devices with limited computing power, such as peripheral or mobile devices. Tuning the structure of existing convolutional neural networks is therefore a relevant research question. In this work, we performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim was to reduce the number of network parameters and, as a result, the training time, provided that recognition performance remains sufficient (correct recognition above 95%). The neural network proposed as a result of this structural tuning has fewer layers than the base architecture. Instead of the ReLU activation function, the related Leaky ReLU function with a parameter of 0.1 is used. The number of filters and the kernel sizes in the convolutional layers are changed, and the kernel size of the max pooling layer is increased. It is proposed to average the output of each convolution before feeding the two-dimensional convolution results into a fully connected layer with the Softmax activation function. Experiments showed that the proposed network has 29% fewer parameters than the base network while achieving almost the same speaker recognition performance. In addition, the training time of the proposed and base networks was evaluated on five audio datasets corresponding to different numbers of speakers; the training time of the proposed network was reduced by 10-39% compared to the base network. These results show the advisability of structural tuning of convolutional neural networks for devices with limited computing power, namely peripheral or mobile devices.
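A rough PyTorch sketch of the kind of trimmed VGG-style network the abstract describes: fewer layers, Leaky ReLU with parameter 0.1, enlarged max-pooling kernels, and per-channel averaging (global average pooling) before the softmax classifier. Filter counts, kernel sizes, and input shape are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of a trimmed VGG-style CNN over MFCC "images": LeakyReLU(0.1),
# large pooling kernels, and global average pooling before the classifier.
import torch
import torch.nn as nn

class TunedCNN(nn.Module):
    def __init__(self, n_speakers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=4),          # enlarged pooling kernel
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=4),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)       # average each feature map
        self.classifier = nn.Linear(64, n_speakers)

    def forward(self, x):                         # x: (batch, 1, n_mfcc, frames)
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)                 # softmax applied in the loss

model = TunedCNN(n_speakers=20)
mfcc_batch = torch.randn(8, 1, 40, 128)           # dummy MFCC feature maps
print(model(mfcc_batch).shape)                    # torch.Size([8, 20])
```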
Automatic Speaker Recognition using Transfer Learning Approach of Deep Learning Models
2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021
Speaker recognition has been one of the most interesting yet challenging problems in the fields of machine learning and artificial intelligence. It is used in human voice authentication for security purposes and for identifying a person from a group of speakers. Teaching a machine the differences between human voices is a demanding task when speakers vary in background, such as gender, language, and accent. In this paper, we use a deep learning approach to build and train two models, an ANN and a CNN, and compare their results. The former is a neural network fed with diverse features extracted from an audio collection; the latter is a convolutional neural network trained on spectrograms. Finally, we apply a transfer learning approach to both models to obtain viable output using less data.
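A hedged sketch of the transfer-learning idea: reuse a CNN pretrained on images, freeze its backbone, and retrain only the classification head on spectrogram "images". The abstract does not name a backbone; ResNet-18 and a 30-speaker head are assumptions for illustration, and a recent torchvision is assumed:

```python
# Sketch: transfer learning for speaker recognition on spectrograms.
# Backbone choice (ResNet-18) and head size are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():                      # freeze pretrained weights
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 30)    # new head: 30 speakers

# Only the new head's parameters are optimized on the small dataset.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
spectrograms = torch.randn(4, 3, 224, 224)        # dummy 3-channel spectrograms
logits = model(spectrograms)
print(logits.shape)                               # torch.Size([4, 30])
```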
IAES International Journal of Artificial Intelligence, 2024
Speaker identification is a biometric task that classifies or identifies a person among other speakers based on speech characteristics. Recently, deep learning models have outperformed conventional machine learning models in speaker identification. Spectrograms of speech have been used as input to deep learning-based speaker identification with clean speech; however, the performance of speaker identification systems degrades under noisy conditions. Cochleograms have shown better results than spectrograms in deep learning-based speaker recognition under noisy and mismatched conditions. Moreover, hybrid convolutional neural network (CNN) and recurrent neural network (RNN) variants have shown better performance than CNN or RNN variants alone in recent studies. However, no attempt has been made to use a hybrid of CNN and enhanced RNN variants for speaker identification with cochleogram input to improve performance under noisy and mismatched conditions. In this study, a speaker identification system using a hybrid CNN and gated recurrent unit (GRU) with cochleogram input is proposed for noisy conditions. The VoxCeleb1 audio dataset with real-world noises, white Gaussian noise (WGN), and without additive noise was used for the experiments. The experimental results and comparison with existing work show that the proposed model performs better than the other models in this study and existing works.
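An illustrative PyTorch sketch of a hybrid CNN-GRU identifier over a cochleogram input, in the spirit of the abstract. Layer sizes and the 64-band input shape are assumptions, and cochleogram extraction (e.g. via a gammatone filterbank) is outside the scope of this sketch:

```python
# Sketch: CNN front-end over a cochleogram, GRU over the resulting
# time sequence, linear classifier on the final hidden state.
import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    def __init__(self, n_speakers: int, n_channels: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, n_channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=n_channels * 16, hidden_size=128,
                          batch_first=True)
        self.fc = nn.Linear(128, n_speakers)

    def forward(self, x):                     # x: (batch, 1, 64 bands, frames)
        f = self.cnn(x)                       # (batch, C, 16, frames / 4)
        f = f.permute(0, 3, 1, 2).flatten(2)  # time-major sequence for the GRU
        _, h = self.gru(f)
        return self.fc(h[-1])

model = CNNGRU(n_speakers=40)
cochleogram = torch.randn(2, 1, 64, 200)      # dummy 64-band cochleogram
print(model(cochleogram).shape)               # torch.Size([2, 40])
```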
Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network
IEEE Access
Speaker identification refers to the process of recognizing a human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is an important task for accurately identifying speakers. Various features for speaker identification have recently been proposed by researchers. Most studies on speaker identification have utilized short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), due to their capability to capture the repetitive nature of signals efficiently. Various studies have shown the effectiveness of MFCC features in correctly identifying speakers. However, the performance of these features degrades on complex speech datasets, and therefore, these features fail to accurately identify speaker characteristics. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the effectiveness of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with the DNN outperformed existing baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, the DNN obtained better classification results than five machine learning algorithms recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification; the experimental results showed that two-level classification yields better results than one-level classification. The proposed features and classification model can be widely applied to different types of speaker datasets.
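A hedged sketch of feature fusion in the MFCCT spirit: concatenating frame-level MFCCs with simple time-domain descriptors using librosa. The exact time-domain features the paper fuses may differ; zero-crossing rate and RMS energy are common stand-ins, and the bundled example clip is a placeholder for real speech:

```python
# Sketch: fuse frame-level MFCCs with time-domain descriptors into one
# feature matrix, one fused vector per frame.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))              # placeholder audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # (13, frames)
zcr = librosa.feature.zero_crossing_rate(y)              # (1, frames)
rms = librosa.feature.rms(y=y)                           # (1, frames)

# Align frame counts and stack into one fused feature matrix.
n = min(mfcc.shape[1], zcr.shape[1], rms.shape[1])
fused = np.vstack([mfcc[:, :n], zcr[:, :n], rms[:, :n]])  # (15, frames)
print(fused.shape)   # per-frame vectors fed to the DNN classifier
```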
Speaker Recognition through Deep Learning Techniques
Periodica Polytechnica Electrical Engineering and Computer Science
Deep learning has now become an integral part of today's world, and the field has seen enormous advancement. Due to the extensive use and fast growth of deep learning, it has captured the attention of researchers in the field of speaker recognition. A detailed investigation of the process becomes essential and helpful for researchers designing robust applications in the field of speaker recognition, both in speaker verification and identification. This paper reviews the field of speaker recognition in light of the deep learning advances of the present era that have boosted this technology. The paper proceeds as a systematic review, first giving a basic idea of deep learning and its architectures with their fields of application, then entering the highlighted portion of the paper, i.e., speaker recognition, which is one of the important applications of deep learning. Here we describe its types, different proce...
Linear versus mel frequency cepstral coefficients for speaker recognition
2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011
Mel-frequency cepstral coefficients (MFCC) have been dominantly used in speaker recognition as well as in speech recognition. However, based on theories in speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high frequency range of speech. This insight suggests that a linear scale in frequency may provide some advantages in speaker recognition over the mel scale. Based on two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performances between MFCC and LFCC (Linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results in SRE10 show that, while they are complementary to each other, LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech. LFCC benefits more in female speech by better capturing the spectral characteristics in the high frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for the female trials, by the mainstream of the speaker recognition community.
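A minimal sketch contrasting the two feature extractions: the cepstral pipeline is identical, but LFCC replaces the mel filterbank with a linearly spaced triangular filterbank. Filterbank and cepstrum sizes are illustrative, librosa and scipy are assumed, and the example clip stands in for real speech:

```python
# Sketch: MFCC vs. LFCC from the same power spectrogram; the only
# difference is the frequency spacing of the triangular filterbank.
import numpy as np
import librosa
from scipy.fftpack import dct

y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=512)) ** 2        # power spectrogram

# MFCC path: mel filterbank -> log -> DCT.
mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=40)
mfcc = dct(np.log(mel_fb @ S + 1e-10), axis=0, norm="ortho")[:20]

# LFCC path: hand-built linearly spaced triangular filterbank -> log -> DCT.
edges = np.linspace(0, sr / 2, 42)                 # 40 filters + 2 edge points
freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
lin_fb = np.zeros((40, len(freqs)))
for i in range(40):
    lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
    lin_fb[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                   (hi - freqs) / (hi - mid)), 0, None)
lfcc = dct(np.log(lin_fb @ S + 1e-10), axis=0, norm="ortho")[:20]
print(mfcc.shape, lfcc.shape)                      # (20, frames) each
```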
The problem addressed in this paper is that the classical statistical approach to speaker recognition yields satisfactory results, but at the expense of long training and test utterances. Reducing the length of speaker samples is of great importance in the field of speaker recognition, since the statistical approach, due to this limitation, is usually precluded from use in real-time applications. A novel method of text-independent speaker recognition is proposed that uses only the correlations among MFCCs, computed over selected speech segments of very short length (approximately 120 ms). Three different neural networks, the Multi-Layer Perceptron (MLP), Steinbuch's Learnmatrix (SLM), and the Self-Organizing Feature Finder (SOFF), are evaluated in a speaker recognition task. The dimensionality-reduction ability of the SOFF paradigm is also discussed.
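A hedged sketch of the core feature idea: correlations among MFCCs over a very short segment. Frame and segment sizes are assumptions chosen so that one segment covers roughly 120 ms of speech, and the example clip is again a placeholder:

```python
# Sketch: MFCC correlation features over one ~120 ms segment.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), duration=2.0)
# 25 ms windows with a 10 ms hop: ~12 frames span roughly 120 ms.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

seg = mfcc[:, :12]                      # one short segment of frames
corr = np.corrcoef(seg)                 # (13, 13) correlations among MFCCs
# The upper triangle (excluding the diagonal) gives a compact feature vector.
feature = corr[np.triu_indices(13, k=1)]
print(feature.shape)                    # (78,)
```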