Dimension reduction of the modulation spectrogram for speaker verification

Speaker Identification using Spectrograms of Varying Frame Sizes

International Journal of Computer Applications, 2012

In this paper, a text-dependent speaker recognition algorithm based on spectrograms is proposed. The spectrograms are generated using the Discrete Fourier Transform for varying frame sizes with 25% and 50% overlap between speech frames. Feature vectors are extracted as the row mean vector of the spectrograms. For feature matching, two distance measures, Euclidean distance and Manhattan distance, are used. Results are computed on two databases: a locally created database and the CSLU speaker recognition database. The maximum accuracy is 92.52%, obtained with 50% overlap between speech frames and Manhattan distance as the similarity measure.
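The row-mean pipeline described in this abstract can be sketched roughly as follows; frame size, overlap, and all function names are illustrative choices, not taken from the paper:

```python
import numpy as np

def row_mean_features(signal, frame_size=256, overlap=0.5):
    """Row mean vector of a DFT magnitude spectrogram (parameters illustrative)."""
    hop = int(frame_size * (1 - overlap))
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_size] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T  # rows = frequency bins, columns = time
    return spec.mean(axis=1)                      # one value per frequency row

def manhattan(a, b):
    """City-block distance used here as the similarity measure."""
    return np.abs(a - b).sum()

def identify(test_feat, enrolled):
    """Return the enrolled speaker whose template is closest to the test feature."""
    return min(enrolled, key=lambda spk: manhattan(test_feat, enrolled[spk]))
```

The feature dimension equals the number of frequency rows, so matching cost is independent of utterance length, which is the appeal of row-mean features.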

Dimensionality reduction of modulation frequency features for speech discrimination

2008

We describe a dimensionality reduction method for modulation spectral features which keeps the time-varying information of interest to the classification task. Due to the varying degrees of redundancy and discriminative power of the acoustic and modulation frequency subspaces, we first employ a generalization of the SVD to tensors (Higher Order SVD) to reduce dimensions. Projecting the modulation spectral features onto the principal axes with the highest energy in each subspace yields a compact feature set. We further estimate the relevance of these projections to speech discrimination using their mutual information with the target class. Reconstructing modulation spectrograms from the "best" 22 features back to the initial dimensions shows that modulation spectral features close to syllable and phoneme rates, as well as the pitch values of speakers, are preserved.
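The core HOSVD truncation step can be sketched in a few lines of numpy: unfold the tensor along each mode, take the leading left singular vectors, and project. The function names and ranks below are illustrative, not the paper's implementation:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: the given mode becomes the rows of a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply matrix M along the given tensor mode."""
    return np.moveaxis(np.tensordot(M, T, axes=([1], [mode])), 0, mode)

def hosvd_truncate(T, ranks):
    """Truncated HOSVD: project each mode onto its leading singular vectors."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])          # top-r principal axes of this subspace
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors
```

The compact feature set corresponds to the entries of the reduced core tensor; reconstruction back to the initial dimensions multiplies the core by each factor matrix (untransposed) along its mode.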

Investigating the use of modulation spectral features within an i-vector framework for far-field automatic speaker verification

2014 International Telecommunications Symposium (ITS), 2014

It is known that channel variability compromises automatic speaker recognition accuracy. However, little attention has been given so far to the detrimental effects encountered in reverberant environments. In this paper, we focus on the issue of automatic speaker verification (ASV) under several levels of room reverberation. Alternative auditory-inspired features are explored. Specifically, we investigate whether the performance of the so-called modulation spectral features (MSFs) can outperform the well-known mel-frequency cepstral coefficients (MFCCs). Experiments were conducted with an ASV system based on the state-of-the-art i-vector approach. The main contribution of this paper is to verify whether MSFs combined with i-vectors achieve, for speaker verification, the performance gains reported in the literature for speech recognition and speaker identification systems in reverberant environments.

Joint Acoustic-Modulation Frequency for Speaker Recognition

2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006

We propose a method for computing a joint acoustic-modulation frequency feature for speaker recognition. This feature describes the amplitude modulation spectrum of each subband and results in a single feature vector per utterance. This vector is used directly as the speaker's modulation frequency template, eliminating the need for a separate training phase. The effects of analysis parameters and pattern matching are studied using the NIST 2001 corpus. When fusing the proposed feature with the baseline MFCC/GMM system, the EER is reduced from 18.2% to 16.7%.
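A joint acoustic-modulation frequency representation can be sketched as a second Fourier transform taken across time over each subband's amplitude envelope; all parameters and names below are illustrative, not the paper's:

```python
import numpy as np

def modulation_spectrum(signal, frame=256, hop=128, n_mod=32):
    """Joint acoustic-modulation frequency template (illustrative parameters).

    Rows of `env` index time frames, columns index acoustic frequency;
    a second FFT along time gives the modulation spectrum of each subband.
    """
    n = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop : i * hop + frame] for i in range(n)])
    env = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
    env = env - env.mean(axis=0)                       # remove each subband's DC
    mod = np.abs(np.fft.rfft(env, n=n_mod, axis=0))    # modulation freq x acoustic freq
    return mod.flatten()                               # one template vector per utterance
```

Because the whole utterance collapses into one vector, matching a test utterance against a speaker's template is a single distance computation, which is why no separate training phase is needed.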

Speaker Identification using Row Mean of Haar and Kekre’s Transform on Spectrograms of Different Frame Sizes

2011

In this paper, we propose speaker identification using two transforms, the Haar Transform and Kekre's Transform. The speech signal spoken by a particular speaker is converted into a spectrogram using 25% and 50% overlap between consecutive sample vectors, and the two transforms are applied to the spectrogram. The row mean of the transformed matrix forms the feature vector, which is used in both the training and matching phases. The results of the two transform techniques are compared. The Haar transform gives fairly good results, with a maximum accuracy of 69% for both 25% and 50% overlap. Kekre's Transform shows much better performance, with a maximum accuracy of 85.7% for 25% overlap and 88.5% for 50% overlap.
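The Haar-transform variant of the row-mean feature can be sketched with a recursively built orthonormal Haar matrix; this is a generic construction and the function names are illustrative:

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix for n a power of two (built recursively)."""
    if n == 1:
        return np.array([[1.0]])
    H = haar_matrix(n // 2)
    top = np.kron(H, [1.0, 1.0])                  # averaging (scaling) rows
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0]) # differencing (wavelet) rows
    return np.vstack([top, bottom]) / np.sqrt(2.0)

def haar_row_mean(spectrogram):
    """Row mean of the Haar-transformed spectrogram.

    Assumes the number of spectrogram rows is a power of two.
    """
    H = haar_matrix(spectrogram.shape[0])
    return (H @ spectrogram).mean(axis=1)
```

Applying an orthonormal transform before taking the row mean compacts the energy into a few coefficients, which is what lets these methods truncate the feature vector without losing much accuracy.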

Speaker Recognition from Spectrogram Images

2021 IEEE International Conference on Smart Information Systems and Technologies (SIST), 2021

Speaker identification identifies the owner of a voice among many people based on the uniqueness of each person's speaking style. In this paper, we combine a Convolutional Neural Network with a Recurrent Neural Network using Long Short-Term Memory models for speaker recognition, and apply the deep learning architecture to our dataset of spectrogram images of 77 different non-native speakers reading the same texts in Turkish. Using identical reading texts eliminates variation in the spectrograms due to vocabulary. Experiments show that the method is effective, achieving over 98% recognition accuracy.

Performance Comparison of Speaker Identification Using DCT, Walsh, Haar on Full and Row Mean of Spectrogram

International Journal of Computer Applications, 2010

This paper presents approaches to text-dependent speaker identification using transformation techniques (DCT, Walsh, and Haar transforms) applied to spectrograms. A set of spectrograms obtained from speech samples serves as the image database for the study, and this database is subjected to the various transforms. Using Euclidean distance as the measure of similarity, the closest match is declared the identified speaker. Each transform is applied to the spectrograms in two ways: on the full image and on the Row Mean of the image. In both cases, the effect of retaining different numbers of transform coefficients is observed. A comparison of the three techniques shows that the Walsh transform requires far fewer mathematical computations than the DCT, while the Haar transform reduces the computation drastically further with an almost equal identification rate. Transforms applied to the Row Mean give a better identification rate than transforms applied to the full image.

A further investigation on speech features for speaker characterization

6th International Conference on Spoken Language Processing (ICSLP 2000)

In this article, we investigate alternative speech features for speaker characterization. We study Line Spectrum Pairs features, Time-Frequency Principal Components, and Discriminant Components of the Spectrum. These alternative features are tested and compared on a speaker verification task, which consists of verifying a claimed identity from a speech segment. Systems are evaluated on a subset of the evaluation data of the NIST 1999 speaker recognition campaign. The new speech features are also compared to the classical cepstral coefficients, which remain, in our experiments, the best-performing features.

On Feature Selection for Speaker Verification

2002

This paper describes an HMM-based speaker verification system which verifies speakers in their own specific feature space. This 'individual' feature space is determined by a Dynamic Programming (DP) feature selection algorithm. A suitable criterion, correlated with the Equal Error Rate (EER), was developed and used for this feature selection. The algorithm was evaluated on a text-dependent database, and a significant improvement in verification results was demonstrated with the DP-selected individual feature space. An EER of 4.8% was achieved when the feature set was the "almost standard" Mel Frequency Cepstrum Coefficient (MFCC) space (12 MFCC + 12 ∆MFCC). Under the same conditions, a system based on the selected feature space yielded an EER of only 2.7%.

Robust speaker recognition using spectro-temporal autoregressive models

Interspeech 2013

Speaker recognition in noisy environments is challenging when there is a mismatch between the data used for enrollment and verification. In this paper, we propose a robust feature extraction scheme based on spectro-temporal modulation filtering using two-dimensional (2-D) autoregressive (AR) models. The first step is AR modeling of the sub-band temporal envelopes by applying linear prediction to the sub-band discrete cosine transform (DCT) components. These sub-band envelopes are stacked together and used for a second AR modeling step, in which the spectral envelope across the sub-bands is approximated and cepstral features are derived for speaker recognition. The use of AR models emphasizes the high-energy regions, which are relatively well preserved in the presence of noise. The degree of modulation filtering is controlled by the AR model order parameter. Experiments are performed using noisy versions of the NIST 2010 speaker recognition evaluation (SRE) data with a state-of-the-art speaker recognition system. In these experiments, the proposed features provide significant improvements over the baseline features (relative improvements of 20% in equal error rate (EER) and 35% in miss rate at 10% false alarm).
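The AR modeling step at the heart of this scheme is linear prediction, which can be illustrated with the standard autocorrelation method below. This is a generic sketch, not the paper's implementation (the paper applies LP to sub-band DCT components rather than to a raw sequence), and all names are illustrative:

```python
import numpy as np

def lpc(x, order):
    """Linear prediction coefficients via the autocorrelation method.

    Returns the prediction-error filter [1, -a1, ..., -ap].
    """
    # autocorrelation at lags 0..order
    r = np.array([x[: len(x) - k] @ x[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])          # normal equations R a = [r1..rp]
    return np.concatenate(([1.0], -a))

def ar_envelope(x, order, n_points=64):
    """Smoothed AR spectral envelope 1/|A(e^jw)|, emphasizing high-energy peaks."""
    a = lpc(x, order)
    A = np.fft.rfft(a, n=2 * (n_points - 1))
    return 1.0 / np.maximum(np.abs(A), 1e-8)
```

A low model order yields a heavily smoothed envelope dominated by the strongest peaks, which is the mechanism by which the order parameter controls the degree of modulation filtering.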