Speaker Recognition from Spectrogram Images
Related papers
Speaker Identification Using a Convolutional Neural Network
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 2022
Speech, a mode of communication between humans and machines, has various applications, including biometric systems for identifying people who have access to secure systems. Feature extraction is an important factor in speech recognition with high accuracy. We therefore used the spectrogram, a pictorial representation of speech in terms of raw features, to identify speakers. These features were input into a convolutional neural network (CNN), and a CNN-visual geometry group (CNN-VGG) architecture was used to recognize the speakers. We used 780 primary data samples from 78 speakers, each of whom uttered a number in Bahasa Indonesia. The proposed architecture, CNN-VGG-f, uses a learning rate of 0.001, a batch size of 256, and 100 epochs. The results indicate that this architecture can generate a suitable model for speaker identification. A spectrogram was used to determine the best features for identifying the speakers. The proposed method exhibited an accuracy of 98.78%, whic...
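A minimal sketch of the first step this paper describes — turning a waveform into a log-magnitude spectrogram image suitable as CNN input. This is an illustrative STFT pipeline, not the paper's actual implementation; frame length and hop size are assumed values.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Log-magnitude STFT spectrogram of a 1-D signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (frames, freq bins)
    return np.log1p(spec).T                     # (freq bins, frames), image-like

# Example: 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
img = spectrogram(np.sin(2 * np.pi * 440 * t))
print(img.shape)  # (257, 61)
```

The resulting 2-D array can be saved or resized as an image and fed to a CNN exactly like a photograph.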
IAES International Journal of Artificial Intelligence, 2024
Speaker identification is a biometric task that classifies or identifies a person among other speakers based on speech characteristics. Recently, deep learning models have outperformed conventional machine learning models in speaker identification. Spectrograms of speech have been used as input to deep learning-based speaker identification on clean speech. However, the performance of speaker identification systems degrades under noisy conditions. Cochleograms have shown better results than spectrograms in deep learning-based speaker recognition under noisy and mismatched conditions. Moreover, hybrid convolutional neural network (CNN) and recurrent neural network (RNN) variants have shown better performance than CNN or RNN variants alone in recent studies. However, no attempt has been made to use a hybrid of a CNN and enhanced RNN variants for speaker identification with cochleogram input to improve performance under noisy and mismatched conditions. In this study, a speaker identification model using a hybrid CNN and gated recurrent unit (GRU) is proposed for noisy conditions using cochleogram input. The VoxCeleb1 audio dataset with real-world noise, with white Gaussian noise (WGN), and without additive noise was employed for the experiments. The experimental results and a comparison with existing works show that the proposed model performs better than the other models in this study and existing works.
Automatic Speaker Recognition using Transfer Learning Approach of Deep Learning Models
2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021
Speaker recognition has been one of the most interesting yet challenging problems in the fields of machine learning and artificial intelligence. It is used in human voice authentication for security purposes and in identifying a person from a group of speakers. It has been a grinding task to teach a machine the differences in human voices when people belong to different backgrounds of gender, language, and accent. In this paper, we use a deep learning approach to build and train two models, an ANN and a CNN, and compare their results. In the former, the neural network is fed diverse features extracted from an audio collection. The latter is a convolutional neural network trained on spectrograms. Finally, we apply a transfer learning approach to both to obtain viable output using less data.
Speaker Recognition with Recurrent Neural Networks
We report on the application of recurrent neural nets to an open-set, text-dependent speaker identification task. The motivation for applying recurrent neural nets to this domain is to find out whether their ability to take short-term spectral features yet respond to long-term temporal events is advantageous for speaker identification. We use a feedforward net architecture adapted from that introduced by Robinson et al. We introduce a fully connected hidden layer between the input and state nodes and the output, and show that this hidden layer makes the learning of complex classification tasks more efficient. Training uses backpropagation through time. There is one output unit per speaker, with the training targets corresponding to speaker identity. For 12 speakers (a mixture of male and female) we obtain a true acceptance rate of 100% with a false acceptance rate of 4%. For 16 speakers these figures are 94% and 7%, respectively. We also investigate the sensitivity of identification ...
An Approach for Identification of Speaker using Deep Learning
International Journal of Artificial Intelligence & Mathematical Sciences
The volume of audio data across the world is increasing daily with the growth of telephone conversations, video conferences, podcasts, and voice notes. This study presents a mechanism for identifying a speaker in an audio file based on biometric features of the human voice such as frequency, amplitude, and pitch. We propose an unsupervised learning model that uses wav2vec 2.0, in which the model learns speech representations from the provided dataset. We used the LibriSpeech dataset in our research and achieved an error rate of 1.8.
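Once a representation model such as wav2vec 2.0 yields per-utterance embeddings, identification can reduce to a nearest-speaker search over enrolled embeddings. The sketch below illustrates that final matching step only, with synthetic embeddings standing in for learned ones; names and dimensions are assumptions.

```python
import numpy as np

def identify(test_emb, enrolled):
    """Return the enrolled speaker whose mean embedding is most cosine-similar."""
    best, best_score = None, -np.inf
    for spk, embs in enrolled.items():
        centroid = np.mean(embs, axis=0)
        score = np.dot(test_emb, centroid) / (
            np.linalg.norm(test_emb) * np.linalg.norm(centroid))
        if score > best_score:
            best, best_score = spk, score
    return best

# Synthetic clusters: each speaker's embeddings scatter around a random direction
rng = np.random.default_rng(0)
dirs = {f"spk{i}": rng.normal(size=16) for i in range(3)}
enrolled = {s: d + rng.normal(0, 0.1, size=(5, 16)) for s, d in dirs.items()}
probe = dirs["spk2"] + rng.normal(0, 0.1, size=16)  # utterance near spk2's cluster
print(identify(probe, enrolled))
```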
Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to traditional approaches that build their speaker embeddings from manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly to magnitude spectrograms. To compare our approach with the state of the art, we collect and publicly release an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces the diarization error rate by a large margin of over 30% with respect to the baseline.
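After embeddings are learned, diarization assigns each audio segment to a speaker by clustering the segment embeddings. The sketch below uses a greedy online centroid scheme for brevity — a deliberate simplification of the spectral or agglomerative clustering real systems (including this paper's) typically use; the threshold is an assumed value.

```python
import numpy as np

def diarize(segments, threshold=0.5):
    """Greedily cluster segment embeddings by cosine similarity; return labels."""
    centroids, labels = [], []
    for emb in segments:
        emb = emb / np.linalg.norm(emb)
        sims = [np.dot(emb, c / np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))
            centroids[k] = centroids[k] + emb  # fold segment into its cluster
        else:
            k = len(centroids)                 # new speaker detected
            centroids.append(emb.copy())
        labels.append(k)
    return labels

# Two orthogonal toy embeddings stand in for two speakers' segments
a, b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(diarize([a, a, b, a, b]))  # [0, 0, 1, 0, 1]
```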
Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network
IEEE Access
Speaker identification refers to the process of recognizing human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is an important task to accurately identify speakers. Various features for speaker identification have been recently proposed by researchers. Most studies on speaker identification have utilized short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), due to their capability to capture the repetitive nature and efficiency of signals. Various studies have shown the effectiveness of MFCC features in correctly identifying speakers. However, the performances of these features degrade on complex speech datasets, and therefore, these features fail to accurately identify speaker characteristics. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the effectiveness of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with DNN outperformed existing baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, DNN obtained better classification results compared with five machine learning algorithms that were recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification. The experimental results showed that two-level classification presented better results than one-level classification. 
The proposed features and classification model for identifying a speaker can be widely applied to different types of speaker datasets.
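The MFCCT idea above — concatenating MFCC spectral features with time-domain features per frame — can be sketched as follows. The mel filterbank and DCT here are simplified textbook constructions, not the paper's code, and RMS energy and zero-crossing rate are chosen as representative time-domain features.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filterbank (simplified)."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        if r > c:
            fb[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    return fb

def mfcct(frame, sr=16000, n_filters=26, n_ceps=13):
    """13 MFCC-like coefficients fused with 2 time-domain features, one frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    fb = mel_filterbank(n_filters, len(frame), sr)
    logmel = np.log(fb @ spec + 1e-10)
    # DCT-II via an explicit cosine basis keeps the sketch dependency-free
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    ceps = dct @ logmel
    rms = np.sqrt(np.mean(frame ** 2))                   # time-domain: energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # time-domain: crossings
    return np.concatenate([ceps, [rms, zcr]])

frame = np.sin(2 * np.pi * 300 * np.arange(512) / 16000)
print(mfcct(frame).shape)  # (15,)
```

Stacking these 15-dimensional frame vectors over an utterance gives the fused feature matrix a DNN classifier would consume.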
Identifying Speakers Using Deep Learning: A review
International Journal of Science and Business, 2021
With the advancement of technology and the increasing demand for smart systems and smart applications that improve quality of life, there has been a surge in demand for more conscious applications. Machine Learning (ML) is considered one of the driving forces behind implementing these types of applications, and one of its implementations is Speaker Identification (SID). Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) are two main types of Deep Learning used in the implementation of such applications. Speaker identification is being utilized more and more on a daily basis and, as a result of this demand, is a focus of the research community. In this paper, we review some of the most recent research conducted in this area and compare results while discussing outcomes.
Computational Intelligence, 2020
Speaker recognition is a major challenge across languages for researchers. For automatic speaker recognition systems trained on ordinary speech, shouting creates a mismatch between enrollment and test conditions, reducing identification performance because extreme vocal effort is involved in shouting. Long classification times, accuracy optimization, and a low root-mean-square error rate are the major problems in speaker recognition. The objective of this work is to develop an efficient speaker recognition system. In this work, an improved Wiener filter algorithm is applied for better noise reduction. To obtain the essential feature vector values, the Mel-frequency cepstral coefficient feature extraction method is used on the noise-removed signals. Furthermore, input samples are created from these extracted features after their dimensions have been reduced using probabilistic principal component analysis. Finally, recurrent neu...
2024
In the fields of security systems, forensic investigations, and personalized services, speech as a fundamental human input outweighs text-based interaction in importance. This research delves deeply into the complex field of Speaker Identification (SID), examining its essential components and emphasising Mel spectrograms and Mel Frequency Cepstral Coefficients (MFCC) for feature extraction. Moreover, this study analyses six slightly distinct model architectures to evaluate their performance, with hyperparameter tuning applied to the best-performing model. This work also performs a linguistic analysis to verify accent and gender accuracy, in addition to a bias evaluation on the AB-1 Corpus dataset.