Comparison of Distance Metrics for Phoneme Classification based on Deep Neural Network Features and Weighted k-NN Classifier

Investigation of Deep Neural Network for Speaker Recognition

International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2020

In this paper, deep neural networks are investigated for speaker recognition. Deep neural networks (DNNs) have recently been proposed for this task; however, many of the architectural choices and training decisions made while building such systems have not been studied carefully. We perform several experiments on datasets of 10, 100, and 300 speakers, with a total of about 120 hours of training data, to evaluate the effect of such choices. Models were evaluated on test data for the 10, 100, and 300 speakers, with 2.5 hours of utterances per speaker. In our results, we compare the accuracy of GMM, GMM-UBM, and i-vector systems, as well as the time taken by the various modelling techniques. The DNN outperforms these baseline models, indicating the effectiveness of the DNN approach.
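One of the baselines compared above, the per-speaker GMM, can be sketched in a few lines: enroll each speaker by fitting a Gaussian mixture to their acoustic feature frames, then identify a test utterance by the model with the highest average log-likelihood. This is an illustrative sketch with random stand-in features, not the paper's setup; real systems would use MFCCs and far more data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "acoustic features": 200 frames of 13-dim vectors per speaker,
# drawn from speaker-specific distributions (stand-ins for real MFCCs).
train = {
    "spk_a": rng.normal(0.0, 1.0, (200, 13)),
    "spk_b": rng.normal(2.0, 1.0, (200, 13)),
}

# Enroll: fit one GMM per speaker on that speaker's frames.
models = {
    spk: GaussianMixture(n_components=4, random_state=0).fit(feats)
    for spk, feats in train.items()
}

def identify(utterance):
    # Score the utterance under every speaker model and pick the one
    # with the highest average per-frame log-likelihood.
    return max(models, key=lambda s: models[s].score(utterance))

test_utt = rng.normal(2.0, 1.0, (50, 13))  # drawn like spk_b's data
print(identify(test_utt))
```

A GMM-UBM system refines this by adapting each speaker model from a single universal background model rather than fitting each GMM from scratch.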

An Approach for Identification of Speaker using Deep Learning

International Journal of Artificial Intelligence & Mathematical Sciences

The volume of audio data worldwide is increasing daily with the growth of telephone conversations, video conferences, podcasts, and voice notes. This study presents a mechanism for identifying the speaker in an audio file based on biometric features of the human voice such as frequency, amplitude, and pitch. We propose an unsupervised learning model that uses wav2vec 2.0, in which the model learns speech representations from the provided dataset. We used the LibriSpeech dataset in our research and achieved an error rate of 1.8.
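Once a representation model such as wav2vec 2.0 has produced an utterance embedding, the identification step itself reduces to nearest-neighbor matching against enrolled speakers. The sketch below shows only that downstream matching with synthetic embeddings; the names, dimensions, and threshold are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 256-dim utterance embeddings; in the paper these would
# come from a pretrained wav2vec 2.0 model. One enrollment vector per
# speaker, typically averaged over several utterances.
enrolled = {
    "alice": rng.normal(0.0, 1.0, 256),
    "bob": rng.normal(0.0, 1.0, 256),
}

def identify(embedding, threshold=0.5):
    # Pick the closest enrolled speaker by cosine similarity;
    # reject as unknown if the best score is below the threshold.
    spk = max(enrolled, key=lambda s: cosine(embedding, enrolled[s]))
    return spk if cosine(embedding, enrolled[spk]) >= threshold else None

# A probe embedding close to alice's enrollment vector.
probe = enrolled["alice"] + rng.normal(0.0, 0.1, 256)
print(identify(probe))
```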

Speaker Gender Classification based on an Improved Deep Learning Approach

2020

Abstract: With the great evolution of technology, speaker gender and age classification is one of the major problems for a wide range of applications in speech analysis and recognition. Speaker identification has become crucial in criminal investigation, speech recognition, speech emotion analysis, and computer-aided physiological applications. To improve the accuracy of speaker gender classification, we must generate robust features combined with a deep classifier. Given the promising results of machine learning on classification problems, our approach takes advantage of deep learning. In this paper, we propose a speaker gender classification method based on the Recurrent Neural Network (RNN), which is able to capture the long-term dependencies of a sequential speech signal. The most popular RNN is the Long Short-Term Memory (LSTM) model. However, it has a complex design that makes it difficult to implement, so we refine the LSTM model into our proposed Simplified Gated Recurrent Units...
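For reference, the standard GRU already simplifies the LSTM from three gates to two; the forward pass of one GRU cell is sketched below in plain NumPy. The truncated abstract does not specify the authors' exact simplified variant, so this shows the conventional GRU gating, not their proposal.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Standard GRU forward pass (the paper's simplified variant
    further reduces this design, but its exact gating is not given
    in the abstract)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_dim, input_dim + hidden_dim)
        self.Wz = rng.normal(0, 0.1, shape)  # update gate weights
        self.Wr = rng.normal(0, 0.1, shape)  # reset gate weights
        self.Wh = rng.normal(0, 0.1, shape)  # candidate state weights

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)  # how much of the state to update
        r = sigmoid(self.Wr @ xh)  # how much of the past to keep
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

# Run 50 frames of 40-dim features through the cell.
cell = GRUCell(input_dim=40, hidden_dim=16)
h = np.zeros(16)
for x in np.random.default_rng(1).normal(size=(50, 40)):
    h = cell.step(x, h)
print(h.shape)
```

For a gender classifier, the final hidden state `h` would feed a small dense layer with a binary output.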

A Study on the Performance Evaluation of Machine Learning Models for Phoneme Classification

ICMLC '19 Proceedings of the 2019 11th International Conference on Machine Learning and Computing - ACM, 2019

This paper provides a comparative performance analysis of shallow and deep machine learning classifiers on the speech recognition task of frame-level phoneme classification. Phoneme recognition is still a fundamental and crucial initial step toward automatic speech recognition (ASR) systems. Conventional classifiers often perform exceptionally well on domain-specific ASR systems with a limited vocabulary and training set, in contrast to deep learning approaches. It is thus imperative to compare systems using deep artificial networks with conventional state-of-the-art machine learning classifiers in terms of correctly recognizing atomic speech units, i.e., phonemes. Two deep learning models, DNN and LSTM, with multiple configurations obtained by varying the number of layers and the number of neurons per layer, are thoroughly studied on the OLLO speech corpus, along with six shallow machine learning classifiers, all using filterbank acoustic features. Additionally, features with three- and ten-frame temporal context are computed and compared with no-context features for the different models. Classifier performance is evaluated in terms of precision, recall, and F1 score on 14 consonant and 10 vowel classes for 10 speakers with 4 different dialects. High classification accuracy of 93% and an F1 score of 95% are obtained with DNN and LSTM networks, respectively, on context-dependent features with 3 hidden layers of 1024 nodes each. Surprisingly, SVM obtained an even higher classification score of 96.13%, with a misclassification error of less than 5% for consonants and 4% for vowels.
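The temporal-context features mentioned above are typically built by concatenating each filterbank frame with its neighbors, so a classifier sees a short window rather than a single frame. A minimal sketch of that splicing step (with random stand-in features and edge padding as an assumption; the paper does not state its padding scheme):

```python
import numpy as np

def add_context(frames, context):
    """Stack each frame with `context` frames on each side.
    Edges are padded by repeating the first/last frame."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack(
        [padded[i : i + len(frames)] for i in range(2 * context + 1)]
    )

# 100 frames of 26-dim filterbank features (random stand-ins).
fbank = np.random.default_rng(0).normal(size=(100, 26))

# A 3-frame window (center frame +/- 1) triples the feature dimension:
# 3 * 26 = 78 dims per classified frame.
ctx3 = add_context(fbank, 1)
print(ctx3.shape)
```

The spliced matrix can be fed unchanged to either a shallow classifier such as an SVM or to the input layer of a DNN.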

Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings

In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to traditional approaches that build their speaker embeddings from manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly to magnitude spectrograms. To compare our approach with the state of the art, we collect and publicly release an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces the diarization error rate by a large margin of over 30% with respect to the baseline.
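The network's input, a magnitude spectrogram, is just the frame-wise absolute value of a windowed short-time Fourier transform. A minimal NumPy sketch of that front end (FFT size and hop length are illustrative choices, not the paper's settings):

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window, and take the magnitude
    of the real FFT of each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Result: (n_frames, n_fft // 2 + 1), time on the first axis.
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a synthetic 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

In the paper's pipeline, such spectrograms feed the recurrent convolutional network directly, so the embedding extractor learns its own frequency weighting instead of relying on hand-crafted features.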