Development of a Text-Dependent Speaker Identification System with the OGI Toolkit

Text-Dependent Speaker Verification System Using Neural Network

International Journal of Emerging Technology and Advanced Engineering, 2015

This paper presents the use of a back-propagation neural network to implement voice recognition. The focus is to identify the voice patterns of different people so as to recognize their voices electronically. The signals corresponding to a text phrase spoken by a group of people are recorded into voice files on a computer using sound recording software. The information in these files is converted from the time domain to the frequency domain using digital signal processing techniques. The resulting preprocessed signal samples in the frequency domain are then used to train a neural network to identify each speaker's voice from among the other voice samples.
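
The pipeline described in this abstract (frequency-domain conversion of recorded phrases followed by back-propagation training) can be sketched roughly as follows; the frame length, feature averaging, and network size are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier  # trained with back-propagation

def spectrum_features(signal, frame_len=512, hop=256):
    """Convert a time-domain recording into an averaged magnitude-spectrum vector."""
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))   # time domain -> frequency domain
    return mags.mean(axis=0)                     # one fixed-length vector per recording

def train_identifier(recordings):
    """recordings: list of (samples, speaker_id) pairs loaded from the voice files."""
    X = np.array([spectrum_features(sig) for sig, _ in recordings])
    y = np.array([spk for _, spk in recordings])
    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
    return net.fit(X, y)                         # back-propagation training
```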

A Tutorial on Text-Independent Speaker Verification

EURASIP Journal on Advances in Signal Processing, 2004

This paper presents an overview of a state-of-the-art text-independent speaker verification system. First, an introduction proposes a modular scheme of the training and test phases of a speaker verification system. Then, the speech parameterization most commonly used in speaker verification, namely cepstral analysis, is detailed. Gaussian mixture modeling, which is the speaker modeling technique used in most systems, is then explained. A few speaker modeling alternatives, namely neural networks and support vector machines, are mentioned. Normalization of scores is then explained, as this is a very important step for dealing with real-world data. The evaluation of a speaker verification system is then detailed, and the detection error trade-off (DET) curve is explained. Several extensions of speaker verification are then enumerated, including speaker tracking and segmentation by speakers. Then, some applications of speaker verification are proposed, including on-site applications, remote applications, applications related to structuring audio information, and games. Issues concerning the forensic area are then recalled, as we believe it is very important to inform people about the actual performance and limitations of speaker verification systems. The paper concludes by giving a few research trends in speaker verification for the next couple of years.
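
A minimal sketch of the cepstral-plus-GMM recipe surveyed in the tutorial is shown below, assuming librosa and scikit-learn; it scores a test utterance with an average log-likelihood ratio between a speaker GMM and a universal background model. The component counts, the file lists (`background_files`, `enrolment_files`), and the use of a full speaker GMM instead of MAP adaptation are assumptions for illustration, not the tutorial's setup.

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T    # (n_frames, n_coeffs)

# Universal background model trained on pooled frames from many speakers.
ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=200)
ubm.fit(np.vstack([mfcc_frames(p) for p in background_files]))   # background_files: assumed path list

# Target-speaker model (real systems usually MAP-adapt the UBM instead of refitting).
spk = GaussianMixture(n_components=64, covariance_type="diag", max_iter=200)
spk.fit(np.vstack([mfcc_frames(p) for p in enrolment_files]))    # enrolment_files: assumed path list

# Verification score: average per-frame log-likelihood ratio; accept above a tuned threshold.
test = mfcc_frames("test_utterance.wav")
score = spk.score(test) - ubm.score(test)
```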

Text Dependent Speaker Verification with a Hybrid HMM/ANN System

2002

The aim of this report was to implement and evaluate a text-dependent speaker verification system using speaker-adapted neural networks. The idea was to use a hybrid HMM/ANN approach, i.e. artificial neural networks were used to estimate Hidden Markov Model emission posterior probabilities from speech data, and the system was implemented in C++ as a module for GIVES. The report also contains an overview of speaker verification. Methods and algorithms for network training and adaptation are explained, and the performance of the system is tested. Both multi-layer perceptrons and single-layer perceptrons are tested and compared to other speaker verification systems. The test results show that the hybrid HMM/ANN system does not perform as well as other speaker verification systems, but performance might increase if the system parameters are optimised further. Along with an analysis and summary of the project, possible improvements to the system are suggested. I would like to thank all the people at TMH who supported me in this thesis project, especially Håkan Melin, my supervisor.
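
In hybrid HMM/ANN systems of this kind, the network's state posteriors are typically divided by the state priors to obtain scaled likelihoods for the HMM decoder; the fragment below shows only that conversion, as a sketch assuming softmax outputs and prior estimates counted from the training alignments.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, state_priors, floor=1e-10):
    """Turn ANN state posteriors P(state | frame) into scaled emission scores
    log P(frame | state) - log P(frame), via Bayes' rule, for use in the HMM."""
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(state_priors, floor))

# posteriors:   (n_frames, n_states) softmax outputs of the MLP for one utterance
# state_priors: (n_states,) relative state frequencies from the training alignment
# The resulting scores feed the Viterbi pass of the speaker-adapted HMM.
```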

Neural network based speaker classification and verification systems with enhanced features

2017 Intelligent Systems Conference (IntelliSys), 2017

This work presents a novel framework based on a feed-forward neural network for text-independent speaker classification and verification, two related speaker recognition tasks. With optimized features and model training, it achieves a 100% classification rate and less than 6% Equal Error Rate (EER), using merely about 1 second and 5 seconds of data, respectively. Features extracted with stricter Voice Activity Detection (VAD) than is usual for speech recognition ensure that a stronger voiced portion is retained for speaker recognition, and speaker-level mean and variance normalization helps to eliminate the discrepancy between samples from the same speaker; both are shown to improve system performance. In building the neural network speaker classifier, the network structure parameters are optimized with grid search, and dynamically reduced regularization parameters are used to avoid training terminating in a local minimum, which enables training to go further at lower cost. In speaker verification, performance is improved with prediction score normalization, which rewards the speaker identity indices with distinct peaks and penalizes weak ones that have high scores but more competitors, and with speaker-specific thresholding, which significantly reduces the EER on the ROC curve. The TIMIT corpus with an 8 kHz sampling rate is used. The first 200 male speakers are used to train and test classification performance; their test files serve as in-domain registered speakers, while data from the remaining 126 male speakers are used as out-of-domain speakers, i.e. impostors, in speaker verification.
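
The speaker-level mean and variance normalization mentioned above amounts to standardizing each speaker's frames with that speaker's own statistics; a minimal sketch, assuming features are already grouped by speaker, is given below.

```python
import numpy as np

def speaker_level_mvn(features_by_speaker):
    """Normalize each speaker's feature frames with that speaker's own mean and
    variance, reducing the discrepancy between samples from the same speaker."""
    normalized = {}
    for spk, frames in features_by_speaker.items():   # frames: (n_frames, n_dims) array
        mu = frames.mean(axis=0)
        sigma = frames.std(axis=0) + 1e-8              # guard against zero variance
        normalized[spk] = (frames - mu) / sigma
    return normalized
```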

A Comparative Assessment of Text-independent Automatic Speaker Identification Methods Using Limited Data

European Journal of Science and Technology, 2021

Automatic Speaker Identification (ASI) is one of the active fields of research in signal processing. Various machine learning algorithms have been used for this purpose. With the recent developments in hardware technologies and data accumulation, Deep Learning (DL) methods have become the new state-of-the-art approach in several classification and identification tasks. In this paper, we evaluate the performance of traditional methods such as the Gaussian Mixture Model-Universal Background Model (GMM-UBM) and DL-based techniques such as the Factorized Time-Delay Neural Network (FTDNN) and Convolutional Neural Networks (CNN) for text-independent closed-set automatic speaker identification on two datasets with different conditions. LibriSpeech is one of the experimental datasets; it consists of clean audio signals from audiobooks, collected from a large number of speakers. The other dataset was collected and prepared by us and has rather limited speech data with a low signal-to-noise ratio, drawn from real-life conversations of customers with agents in a call center. The duration of the speech signals in the query phase is an important factor affecting the performance of ASI methods. In this work, a CNN architecture is proposed for automatic speaker identification from short speech segments. The architecture design aims at capturing the temporal nature of the speech signal in an optimal convolutional neural network with a low number of parameters compared to well-known CNN architectures. We show that the proposed CNN-based algorithm performs better on the large and clean dataset, whereas on the other dataset, with a limited amount of data, the traditional method outperforms all DL approaches. The top-1 accuracy achieved by the proposed model is 99.5% on 1-second voice instances from the LibriSpeech dataset.
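
The paper's exact architecture is not reproduced here; the block below is only a generic compact 1-D convolutional classifier over MFCC sequences of roughly one-second segments (about 100 frames at a 10 ms hop), written with PyTorch, with every layer size an assumption.

```python
import torch
import torch.nn as nn

class SmallSpeakerCNN(nn.Module):
    """Compact 1-D CNN over MFCC sequences; all layer sizes are illustrative assumptions."""
    def __init__(self, n_mfcc=20, n_speakers=200):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # collapse the time axis to a fixed-size embedding
        )
        self.fc = nn.Linear(128, n_speakers)  # closed-set speaker logits

    def forward(self, x):                     # x: (batch, n_mfcc, n_frames), ~100 frames ≈ 1 s
        return self.fc(self.conv(x).squeeze(-1))

# logits = SmallSpeakerCNN()(torch.randn(8, 20, 100))   # a batch of eight 1-second segments
```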

A novel neural feature for a text-dependent speaker identification system

2018

A novel feature based on the simulated neural response of the auditory periphery was proposed in this study for a speaker identification system. A well-known computational model of the auditory-nerve (AN) fiber by Zilany and colleagues, which incorporates most of the stages and the relevant nonlinearities observed in the peripheral auditory system, was employed to simulate neural responses to speech signals from different speakers. Neurograms were constructed from the responses of inner-hair-cell (IHC)-AN synapses with characteristic frequencies spanning the dynamic range of hearing. The synapse responses were passed through an analytical function to incorporate the effects of absolute and relative refractory periods. The proposed IHC-AN neurogram feature was then used to train and test the text-dependent speaker identification system using standard classifiers. The performance of the proposed method was compared to the results of existing baseline methods in both quiet and noisy conditions. While the performance using the proposed feature was comparable to that of existing methods in quiet environments, the neural feature exhibited substantially better classification accuracy in noisy conditions, especially with white Gaussian and street noises. Also, the performance of the proposed system was relatively independent of various types of distortions in the acoustic signals and of the choice of classifier. The proposed feature can also be employed to design a robust speech recognition system.
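
A neurogram, as described above, is essentially a time-frequency matrix of simulated fibre responses stacked across characteristic frequencies; the sketch below shows only that assembly step, with `simulate_ihc_an_response` a hypothetical placeholder for the auditory-periphery model output and the time binning chosen arbitrarily.

```python
import numpy as np

def build_neurogram(signal, fs, cfs, simulate_ihc_an_response, bin_size=8e-3):
    """Stack simulated IHC-AN synapse responses across characteristic frequencies (CFs)
    into a time-frequency neurogram. `simulate_ihc_an_response` is a hypothetical
    stand-in for the peripheral auditory model (e.g. the Zilany et al. model)."""
    rows = []
    samples_per_bin = int(bin_size * fs)
    for cf in cfs:                                            # e.g. CFs log-spaced over the hearing range
        rate = simulate_ihc_an_response(signal, fs, cf)       # instantaneous rate, sampled at fs
        n_bins = len(rate) // samples_per_bin
        binned = rate[:n_bins * samples_per_bin].reshape(n_bins, samples_per_bin).mean(axis=1)
        rows.append(binned)
    return np.vstack(rows)                                    # (n_cfs, n_time_bins) feature matrix
```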

A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network

This article presents the implementation of a text-independent speaker identification system. It involves two parts: "Speech Signal Processing" and "Artificial Neural Network". The speech signal processing uses the Mel Frequency Cepstral Coefficients (MFCC) acquisition algorithm, which extracts features from the speech signal in the form of vectors of coefficients. The back-propagation algorithm of the artificial neural network stores the extracted features in a database and then identifies the speaker based on this information. Raw speech does not work directly for identifying a voice or speaker: since the speech signal is not always periodic and only about half of the frames are voiced, it is not good practice to work with a mixture of voiced and unvoiced frames. The speech must therefore be preprocessed to successfully identify a voice or speaker. The major goal of this work is to derive a set of features that improves the accuracy of the text-independent speaker identification system.
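
As a rough illustration of the preprocessing argued for above, the sketch below keeps only higher-energy (roughly voiced) frames before averaging MFCC vectors; the energy-percentile rule, the librosa calls, and the coefficient count are assumptions standing in for the article's actual voiced/unvoiced handling.

```python
import librosa
import numpy as np

def voiced_mfcc_vector(path, sr=16000, n_mfcc=13, energy_percentile=50):
    """Extract MFCCs and average only the higher-energy (roughly voiced) frames."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, n_frames)
    energy = librosa.feature.rms(y=y)[0]                     # per-frame RMS energy
    n = min(mfcc.shape[1], energy.shape[0])                  # align the two frame counts
    keep = energy[:n] >= np.percentile(energy[:n], energy_percentile)
    return mfcc[:, :n][:, keep].mean(axis=1)                 # one feature vector per utterance
```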

A Study of Various Speech Features and Classifiers used in Speaker Identification

International Journal of Engineering Research and Technology (IJERT), 2016

https://www.ijert.org/a-study-of-various-speech-features-and-classifiers-used-in-speaker-identification
https://www.ijert.org/research/a-study-of-various-speech-features-and-classifiers-used-in-speaker-identification-IJERTV5IS020637.pdf

Speech processing consists of the analysis/synthesis, recognition, and coding of speech signals. The recognition field further branches into speech recognition, speaker recognition, and speaker identification. A speaker identification system is used to identify a speaker among many speakers. A good identification rate is a prerequisite for any speaker identification system, and it can be achieved by making an optimal choice among the available techniques. In this paper, different speech features and extraction techniques such as MFCC, LPCC, LPC, GLFCC, and PLPC, and different feature classification models such as VQ, GMM, DTW, HMM, and ANN for speaker identification systems are discussed.

Keywords: Linear Predictive Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), Gaussian Mixture Model (GMM), Vector Quantization (VQ), Hidden Markov Model (HMM), Artificial Neural Network (ANN)

The Innovative Approach for Text-Independent Human Speaker Identification Utilizing Concepts of Artificial Neural Network

This article presents the implementation of a text-independent speaker identification system. It involves two parts: "Speech Signal Processing" and "Artificial Neural Network". The speech signal processing uses the Mel Frequency Cepstral Coefficients (MFCC) acquisition algorithm, which extracts features from the speech signal in the form of vectors of coefficients. The back-propagation algorithm of the artificial neural network stores the extracted features in a database and then identifies the speaker based on this information. Raw speech does not work directly for identifying a voice or speaker: since the speech signal is not always periodic and only about half of the frames are voiced, it is not good practice to work with a mixture of voiced and unvoiced frames. Hence, the speech must be preprocessed to successfully identify a voice or speaker. The major goal of this work is to derive a set of features that improves the accuracy of the text-independent speaker identification system.