Automatic speaker recognition from speech signal using bidirectional long short-term memory recurrent neural network

Speaker Recognition with Recurrent Neural Networks

We report on the application of recurrent neural nets in an open-set, text-dependent speaker identification task. The motivation for applying recurrent neural nets to this domain is to find out whether their ability to take short-term spectral features yet respond to long-term temporal events is advantageous for speaker identification. We use a recurrent net architecture adapted from that introduced by Robinson et al. We introduce a fully connected hidden layer between the input and state nodes and the output, and show that this hidden layer makes the learning of complex classification tasks more efficient. Training uses backpropagation through time. There is one output unit per speaker, with the training targets corresponding to speaker identity. For 12 speakers (a mixture of male and female) we obtain a true acceptance rate of 100% with a false acceptance rate of 4%. For 16 speakers these figures are 94% and 7%, respectively. We also investigate the sensitivity of identification ...
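The architecture described above (short-term spectral input, recurrent state, an extra fully connected hidden layer, and one output per speaker) can be illustrated with a minimal forward pass. This is a toy sketch in pure Python, not the authors' implementation; all dimensions, the tanh/softmax choices, and the random initialization are illustrative assumptions.

```python
import math
import random

# Hypothetical dimensions, not from the paper: 12 spectral features per
# frame, 8 state units, 16 hidden units, 12 speakers (one output each).
N_IN, N_STATE, N_HID, N_SPK = 12, 8, 16, 12

random.seed(0)

def mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

W_in = mat(N_STATE, N_IN)            # input  -> state
W_rec = mat(N_STATE, N_STATE)        # state  -> state (the recurrence)
W_hid = mat(N_HID, N_IN + N_STATE)   # [input; state] -> hidden (the extra layer)
W_out = mat(N_SPK, N_HID)            # hidden -> per-speaker outputs

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def forward(frames):
    """Run one utterance (a list of feature frames) through the net and
    return per-speaker posteriors at the final frame."""
    state = [0.0] * N_STATE
    out = None
    for x in frames:
        state = tanh_vec([a + b for a, b in zip(matvec(W_in, x), matvec(W_rec, state))])
        hidden = tanh_vec(matvec(W_hid, x + state))  # hidden layer sees input AND state
        out = softmax(matvec(W_out, hidden))
    return out

frames = [[random.gauss(0, 1) for _ in range(N_IN)] for _ in range(20)]
post = forward(frames)
print(len(post), abs(sum(post) - 1.0) < 1e-9)  # 12 True
```

In a trained system the speaker decision would be the argmax of `post`, with an acceptance threshold for the open-set case.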

Combined Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients Using Autoencoder for Speaker Recognition

Applied Sciences

Recently, neural network technology has shown remarkable progress in speech recognition, including word classification, emotion recognition, and identity recognition. This paper introduces three novel speaker recognition methods to improve accuracy. The first method, called long short-term memory with mel-frequency cepstral coefficients for triplet loss (LSTM-MFCC-TL), utilizes MFCC as input features for the LSTM model and incorporates triplet loss and cluster training for effective training. The second method, bidirectional long short-term memory with mel-frequency cepstral coefficients for triplet loss (BLSTM-MFCC-TL), enhances speaker recognition accuracy by employing a bidirectional LSTM model. The third method, bidirectional long short-term memory with mel-frequency cepstral coefficients and autoencoder features for triplet loss (BLSTM-MFCCAE-TL), utilizes an autoencoder to extract additional AE features, which are then concatenated with MFCC and fed into the BLSTM model. The r...
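The triplet loss shared by the three methods above can be sketched in a few lines: it pushes same-speaker embeddings together and different-speaker embeddings apart by at least a margin. The margin value and the toy 3-D "embeddings" below are illustrative, not taken from the paper.

```python
import math

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(l2(anchor, positive) - l2(anchor, negative) + margin, 0.0)

a = [1.0, 0.0, 0.0]       # anchor utterance of speaker A
p = [0.9, 0.1, 0.0]       # another utterance of speaker A
n_far = [0.0, 1.0, 0.0]   # well-separated utterance of speaker B
n_hard = [0.8, 0.2, 0.0]  # "hard" negative close to the anchor
print(triplet_loss(a, p, n_far), round(triplet_loss(a, p, n_hard), 4))  # 0.0 0.0586
```

Easy triplets contribute zero loss, which is why triplet training pipelines (and the cluster-training scheme mentioned above) focus on selecting informative triplets.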

A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network

This article presents the implementation of a text-independent speaker identification system. It involves two parts: "Speech Signal Processing" and "Artificial Neural Network". The speech signal processing uses a Mel-Frequency Cepstral Coefficient (MFCC) acquisition algorithm that extracts features from the speech signal, which are vectors of coefficients. The backpropagation algorithm of the artificial neural network stores the extracted features in a database and then identifies the speaker based on this information. Raw speech does not work directly for voice or speaker identification. Since the speech signal is not always periodic and only half of the frames are voiced, it is not good practice to work with a mix of voiced and unvoiced frames. So the speech must be preprocessed before a voice or speaker can be identified successfully. The major goal of this work is to derive a set of features that improves the accuracy of the text-independent speaker identification system.
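The preprocessing step described above, screening out unvoiced material before feature extraction, can be approximated with a simple short-term-energy gate. This is a hedged stand-in, not the paper's actual algorithm; the frame size, hop, and threshold ratio are arbitrary illustrative values.

```python
import math

def frame_signal(x, frame_len=160, hop=80):
    """Split a sample sequence into overlapping frames."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def voiced_frames(x, ratio=0.5):
    """Keep only frames whose energy exceeds a fraction of the
    utterance-mean energy: a crude voiced/unvoiced screen."""
    frames = frame_signal(x)
    mean_e = sum(energy(f) for f in frames) / len(frames)
    return [f for f in frames if energy(f) > ratio * mean_e]

# Synthetic signal: 400 near-silent samples followed by a 400-sample tone.
sig = [0.001] * 400 + [math.sin(0.3 * i) for i in range(400)]
kept = voiced_frames(sig)
print(len(frame_signal(sig)), len(kept))  # 9 5
```

Only the frames overlapping the tone survive; MFCC extraction would then run on `kept` rather than the whole utterance.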

SPEAKER RECOGNITION WITH ARTIFICIAL NEURAL NETWORKS AND MEL-FREQUENCY CEPSTRAL COEFFICIENTS CORRELATIONS

The problem addressed in this paper is that the classical statistical approach to speaker recognition yields satisfactory results, but at the expense of long training and test utterances. Reducing the length of speaker samples is of great importance in the field, since the statistical approach, due to this limitation, is usually precluded from use in real-time applications. A novel method of text-independent speaker recognition is proposed which uses only the correlations among MFCCs, computed over selected speech segments of very short length (approximately 120 ms). Three different neural networks, the Multi-Layer Perceptron (MLP), Steinbuch's Learnmatrix (SLM), and the Self-Organizing Feature Finder (SOFF), are evaluated in a speaker recognition task. The dimensionality-reduction ability of the SOFF paradigm is also discussed.
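The core feature of the proposed method, correlations among MFCCs over a very short segment, can be sketched as below: treat each cepstral coefficient as a short time series across the segment's frames and take the pairwise Pearson correlations as the feature vector. The coefficient count and frame count are toy values, and the paper's segment-selection procedure is not reproduced.

```python
import math
import random

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def mfcc_correlation_features(frames):
    """frames: list of MFCC vectors covering one short segment.
    Returns the upper triangle of the coefficient-by-coefficient
    correlation matrix as a flat feature vector."""
    n_coef = len(frames[0])
    # Per-coefficient trajectories across the segment's frames.
    tracks = [[f[c] for f in frames] for c in range(n_coef)]
    return [pearson(tracks[i], tracks[j])
            for i in range(n_coef) for j in range(i + 1, n_coef)]

random.seed(1)
# Toy segment: 12 frames of 4 hypothetical coefficients (~120 ms at a 10 ms hop).
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(12)]
feats = mfcc_correlation_features(frames)
print(len(feats))  # 4*3/2 = 6 pairwise correlations
```

The resulting fixed-length vector is what would be fed to the MLP, SLM, or SOFF classifier.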

A Speaker Identification Method Based on Time-Delay Neural Network Methodology in Multi-Layer Perceptron

Speaker identification is a prevalent problem whose importance has increased dramatically in recent times due to its role in identification applications; the goal of this research is to obtain results good enough to complement other security methods used in banking identification. This paper presents neural-network-based text-independent speaker identification for the Persian language. We introduce a new approach to speaker recognition using Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC) with an Artificial Neural Network (ANN), based on how the features are applied to the network. In this method we combined the Time-Delay Neural Network (TDNN) methodology with a Multi-Layer Perceptron (MLP) and achieved good results in comparison with other methods such as Learning Vector Quantization (LVQ). We used a dataset consisting of digit utterances from 16 men and 4 women. The results of this approach can be used to improve banking security systems. We show that our approach has a high capability to recognize large numbers of speakers with feature extraction methods such as MFCC and LPCC. Within this context, the time-delay neural network methodology proved more reliable than the multi-layer perceptron architecture and learning vector quantization, in the best conditions yielding 96.2% and 99.8% for 13 MFCCs and 26 LPCCs, respectively.
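One common way to bring TDNN methodology into an MLP is to splice delayed copies of neighboring frames into each input vector, so a plain perceptron sees temporal context at once. The sketch below assumes this splicing interpretation; the window sizes and feature dimension are illustrative, not taken from the paper.

```python
def splice(frames, left=2, right=2):
    """TDNN-style input splicing: each output vector is the current frame
    concatenated with `left` past and `right` future frames, giving an
    ordinary MLP a window of temporal context (time delays)."""
    spliced = []
    for t in range(left, len(frames) - right):
        window = []
        for d in range(-left, right + 1):
            window.extend(frames[t + d])
        spliced.append(window)
    return spliced

# 10 frames of 13 hypothetical MFCCs each.
frames = [[float(t)] * 13 for t in range(10)]
out = splice(frames)
print(len(out), len(out[0]))  # 6 spliced frames, each 5 * 13 = 65 wide
```

The edge frames that lack full context are dropped here; padding them instead is an equally common choice.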

The Innovative Approach for Text-Independent Human Speaker Identification Utilizing Concepts of Artificial Neural Network

This article presents the implementation of a text-independent speaker identification system. It involves two parts: "Speech Signal Processing" and "Artificial Neural Network". The speech signal processing uses a Mel-Frequency Cepstral Coefficient (MFCC) acquisition algorithm that extracts features from the speech signal, which are vectors of coefficients. The backpropagation algorithm of the artificial neural network stores the extracted features in a database and then identifies the speaker based on this information. Raw speech does not work directly for voice or speaker identification. Since the speech signal is not always periodic and only half of the frames are voiced, it is not good practice to work with a mix of voiced and unvoiced frames. Hence, the speech must be preprocessed before a voice or speaker can be identified successfully. The major goal of this work is to derive a set of features that improves the accuracy of the text-independent speaker identification system.

Speaker identification using modular recurrent neural networks

1995

This paper demonstrates a speaker identification system based on recurrent neural networks trained with the Real-time Recurrent Learning algorithm (RTRL). A series of speaker identification experiments based on isolated digits has been conducted. The database contains four utterances of ten digits spoken by ten speakers over a period of nine months. The results suggest that recurrent networks can encode static and dynamic features of speech signals. They also show that the proposed system outperforms the traditional speaker identification systems in which Backpropagation networks are used. However, this paper demonstrates experimentally that the outputs of the RTRL networks are highly dependent on the initial portion of the input sequences. Removing the first few vectors from the input sequences will lead to a substantial reduction in identification accuracy.

Feature Selection Method for Speaker Recognition using Neural Network

International Journal of Computer Applications, 2014

The aim of this paper is to extract and select features from the speech signal that make an acceptable speaker recognition rate possible in real life. A variety of combinations among formants (F1, F2, F3), Linear Predictive Coefficients (LPC), Mel-Frequency Cepstral Coefficients (MFCC), and delta-MFCC features are considered, and their effect on speaker recognition is observed. Two data sets of similar volume but with different strings (words) are considered in the present study; they are prepared with two different sampling rates. The study also reveals the interesting fact that the selection of strings in the speaker enrollment process matters for accurate results: the speaker must be tested for authentication with the same string with which he was enrolled during his first access to the system.
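The delta-MFCC features mentioned above are conventionally computed as a regression over neighboring frames. A minimal sketch using the standard formula with a window of k = 2 (the paper's exact settings are not given):

```python
def delta(features, k=2):
    """Standard delta coefficients:
    d_t = sum_{n=1..k} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..k} n^2),
    with edge frames clamped to the first/last frame."""
    denom = 2 * sum(n * n for n in range(1, k + 1))
    T = len(features)
    deltas = []
    for t in range(T):
        vec = [0.0] * len(features[0])
        for n in range(1, k + 1):
            fwd = features[min(t + n, T - 1)]  # clamp at the right edge
            bwd = features[max(t - n, 0)]      # clamp at the left edge
            for i in range(len(vec)):
                vec[i] += n * (fwd[i] - bwd[i]) / denom
        deltas.append(vec)
    return deltas

# A linearly increasing single-coefficient track has a constant delta of 1.
track = [[float(t)] for t in range(8)]
d = delta(track)
print([round(v[0], 3) for v in d[2:6]])  # interior frames: [1.0, 1.0, 1.0, 1.0]
```

The delta vectors are typically concatenated with the static MFCCs to form the combined feature sets the paper compares.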

Speaker recognition using PCA-based feature transformation

Speech Communication

This paper introduces Weighted-Correlation Principal Component Analysis (WCR-PCA) for efficient transformation of speech features in speaker recognition. A Recurrent Neural Network (RNN) technique is also introduced to perform the weighted PCA. The weights are taken as the log-likelihood values from a fitted Single-Gaussian Background Model (SG-BM). For speech features, we show that there are large differences between feature variances, which makes covariance-based PCA less than optimal. A comparative study of speaker recognition performance is presented using weighted and unweighted correlation- and covariance-based PCA. Extensions to improve the extraction of MFCC and LPCC speech features are also proposed: Odd-Even filter-bank MFCC (OE-MFCC) and Multitaper-Fitted LPCC. The methodologies are evaluated on an i-vector speaker recognition system. A subset of the 2010 NIST Speaker Recognition Evaluation set is used in performance testing, in addition to evaluations on the VoxCeleb1 dataset. A relative improvement of 44% in terms of EER is found in system performance using the NIST data, and 18% using the VoxCeleb1 dataset.
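A weighted correlation matrix followed by PCA, in the spirit of WCR-PCA, can be sketched as below. The weights here are arbitrary positive numbers standing in for the SG-BM log-likelihoods, and the leading component is found by plain power iteration rather than the RNN technique the paper introduces; all data dimensions are toy values.

```python
import math
import random

def weighted_correlation(X, w):
    """Weighted correlation matrix of row-vectors X with per-sample
    weights w. Normalizing by standard deviations (correlation rather
    than covariance) removes the large variance differences between
    features noted above."""
    W = sum(w)
    d = len(X[0])
    mean = [sum(wi * x[j] for wi, x in zip(w, X)) / W for j in range(d)]
    cov = [[sum(wi * (x[i] - mean[i]) * (x[j] - mean[j]) for wi, x in zip(w, X)) / W
            for j in range(d)] for i in range(d)]
    sd = [math.sqrt(cov[i][i]) for i in range(d)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(d)] for i in range(d)]

def first_component(R, iters=200):
    """Leading eigenvector of R by power iteration: the first PCA axis."""
    v = [1.0] * len(R)
    for _ in range(iters):
        v = [sum(R[i][j] * v[j] for j in range(len(R))) for i in range(len(R))]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

random.seed(0)
# Toy 3-D features with very different per-dimension scales.
X = [[random.gauss(0, s) for s in (1.0, 5.0, 0.2)] for _ in range(200)]
w = [abs(random.gauss(1, 0.1)) for _ in X]  # stand-in for log-likelihood weights
R = weighted_correlation(X, w)
v1 = first_component(R)
print(len(v1), abs(R[0][0] - 1.0) < 1e-9)  # 3 True
```

Projecting features onto the leading eigenvectors of `R` gives the transformed features that would then feed the i-vector system.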