MFCC Research Papers - Academia.edu
- by Yu Hua Xiao • Computer Science, Speech, MFCC, Large Scale
- by Jahangir Alam and +1 • Cognitive Science, Linguistics, Speech Communication, MFCC
This paper describes the development of an efficient speech recognition system using techniques such as Mel Frequency Cepstrum Coefficients (MFCC), Vector Quantization (VQ) and Hidden Markov Models (HMM). It explains how speaker recognition followed by speech recognition is used to recognize speech faster, more efficiently and more accurately. MFCC is used to extract the characteristics of the input speech signal with respect to a particular word uttered by a particular speaker. An HMM is then applied to the quantized feature vectors to identify the word by evaluating the maximum log-likelihood value for the spoken word.
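As a rough illustration of the pipeline above, the sketch below extracts MFCC frames and quantizes them against a k-means codebook, producing the discrete symbol sequences that a per-word HMM would then be trained on. It is a minimal sketch assuming librosa and scikit-learn; the codebook size, MFCC count, and file names are illustrative assumptions, not the paper's settings.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

def extract_mfcc(path, sr=16000, n_mfcc=13):
    """Return an (n_frames, n_mfcc) matrix of MFCC feature vectors."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Train a VQ codebook on pooled training frames (k-means centroids).
train_frames = np.vstack([extract_mfcc(f) for f in ["word1.wav", "word2.wav"]])
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(train_frames)

# Quantize an utterance into a discrete symbol sequence; a per-word HMM
# is trained on such sequences, and recognition picks the word model with
# the maximum log-likelihood for the observed sequence.
symbols = codebook.predict(extract_mfcc("test.wav"))
```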
- by K. Ramakrishnan and +1 • MFCC, HMMs, VQR
Speaker verification using limited data is always a challenge for practical implementation as an application. An analysis of speaker verification studies for an i-vector based method using the Mel-Frequency Cepstral Coefficient (MFCC) feature shows that performance drops drastically as the duration of test data is reduced. This decrease in performance is due to insufficient phonetic coverage when only the vocal tract feature is captured. However, performance can be improved if some source characteristics are taken into consideration. This paper attempts to improve speaker verification performance using source characteristics. A recently proposed characterization of the voice source signal, the discrete cosine transform of the integrated linear prediction residual (DCTILPR), has been found to be useful as a speaker-specific feature. Speaker verification is performed over short test utterances in the NIST 2003 database using both the DCTILPR and MFCC features, and their score-level combination is found to give a significant performance improvement over the system using only the MFCC features.
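The score-level combination mentioned at the end might look like the following sketch: z-normalize each subsystem's scores and take a weighted sum. The normalization and the weight alpha are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import numpy as np

def znorm(scores):
    """Normalize a set of trial scores to zero mean and unit variance."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()

def fuse(mfcc_scores, dctilpr_scores, alpha=0.7):
    """Weighted sum of normalized MFCC and DCTILPR verification scores."""
    return alpha * znorm(mfcc_scores) + (1.0 - alpha) * znorm(dctilpr_scores)
```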
This thesis presents a new approach to the visualization of sound for deaf assistance that simultaneously illustrates important dynamic sound properties and the recognized sound icons in an easily readable view. In order to visualize general sounds efficiently, MFCC sound features were utilized to represent robust discriminant properties of the sound. The problem of visualizing a 39-dimensional MFCC vector was simplified to visualizing a one-dimensional value, obtained by comparing a single reference MFCC vector with the input MFCC vector. A new similarity measure for comparing MFCC feature vectors was proposed that outperforms existing local similarity measures, whose one-to-one attribute value calculation led to incorrect similarity decisions. Classification of the input sound was performed and attached to the visualization system to make the system more usable. Each time frame of sound is passed to a K-NN classification algorithm to detect short sound events. In addition, every second the input sound is buffered and forwarded to a Dynamic Time Warping (DTW) classification algorithm designed for dynamic time series classification. Both classifiers work at the same time and deliver their classification results to the visualization model. The application was implemented in the Java programming language for smartphones running Android OS, so many considerations related to algorithmic complexity were taken into account. The system utilizes the smartphone's GPU to guarantee smooth and fast rendering. The system design was based on interviews with five deaf persons, taking into account their preferred visualization style. The same deaf persons then tested the system, and its evaluation was carried out based on their interaction with it. Our approach yields illustrations of sound that are more accessible and more suitable for casual and less expert users.
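To make the dimensionality reduction step concrete, the sketch below collapses a 39-dimensional MFCC frame to one visualizable number by comparing it with a fixed reference frame. The thesis proposes its own similarity measure; plain cosine similarity here is only a hypothetical stand-in.

```python
import numpy as np

def similarity_to_reference(mfcc_vec, ref_vec):
    """Cosine similarity in [-1, 1] between an input MFCC frame and a
    fixed reference frame; the single value drives the visualization."""
    a, b = np.asarray(mfcc_vec, float), np.asarray(ref_vec, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```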
The automatic identification of a person's identity from their voice is a part of modern telecommunication services. In order to execute the identification task, the speech signal has to be transmitted to a remote server, so the performance of the recognition/identification system can be influenced by various distortions that occur when transmitting the speech signal through a communication channel. This paper studies the effect of the telecommunication channel, particularly the narrowband (NB) speech codecs commonly used in current telecommunication networks, on the performance of automatic speaker recognition in the context of a channel/codec mismatch between enrollment and test utterances. The influence of speech coding on speaker identification is assessed using the reference GMM-UBM method. The results show that the partially mismatched scenario offers better results than the fully matched scenario when speaker recognition is done on speech utterances degraded by the different NB codecs. Moreover, deploying the EVS and G.711 codecs in the training process of the recognition system provides the best success rate in the fully mismatched scenario. It should be noted that both the EVS and G.711 codecs offer the best speech quality among the codecs deployed in this study. This finding fully corresponds with the finding presented by Janicki & Staroszczyk in [1], which focuses on other speech codecs.
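For reference, GMM-UBM scoring reduces to a log-likelihood ratio between a speaker model and the universal background model, roughly as sketched below with scikit-learn. A full system would MAP-adapt the speaker model from the UBM rather than train it independently, and the component count is an illustrative assumption.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(frames, n_components=256):
    """Fit a diagonal-covariance GMM on (n_frames, n_dims) MFCC data."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           random_state=0).fit(frames)

def llr_score(test_frames, speaker_gmm, ubm):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return speaker_gmm.score(test_frames) - ubm.score(test_frames)
```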
- by Ovide Decroly • MFCC, MFCC features, Lvq
- by Jonathan Darch • MFCC, Formant
Emotional state recognition through speech is a very active research topic nowadays. Using subliminal information of speech, termed "prosody", it is possible to recognize the emotional state of a person. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns. This fact makes the learning process more difficult, due to the generalization problems that arise under these conditions. In this work we propose a solution to this problem that consists of enlarging the training set through the creation of new virtual patterns. In the case of emotional speech, most of the emotional information is carried by speed and pitch variations, so a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. Thus, we use this prior information to create new patterns by applying a gender-dependent pitch shift modification in the feature extraction process of the classification system. For this purpose, we propose a gender-dependent frequency scaling modification of the Mel Frequency Cepstral Coefficients used to classify the emotion. This process allows us to synthetically increase the number of available patterns in the training set, thus increasing the generalization capability of the system and reducing the test error. Results obtained with two classifiers of different generalization capability demonstrate the suitability of the proposal.
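The virtual-pattern idea can be approximated as below: pitch-shift an utterance by a gender-dependent number of semitones and extract MFCCs from the shifted copy. Note that the paper applies the frequency scaling inside the MFCC filterbank itself; shifting the waveform with librosa is a simpler stand-in for illustration, and the semitone values are assumptions.

```python
import librosa

def virtual_pattern(y, sr, n_steps):
    """Return MFCCs of a pitch-shifted copy of utterance y; n_steps
    (semitones) would be chosen per gender, e.g. +/-2 (an assumption)."""
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    return librosa.feature.mfcc(y=y_shifted, sr=sr, n_mfcc=13)
```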
In this paper, we investigate a speech recognition system for a Tajweed Rule Checking Tool. We propose a novel Mel-Frequency Cepstral Coefficient and Vector Quantization (MFCC-VQ) hybrid algorithm to help students learn and revise proper Al-Quran recitation by themselves. We describe a hybrid MFCC-VQ architecture to automatically point out the mismatch between the students' recitations and the correct recitation verified by an expert. The vector quantization algorithm is chosen for its data reduction capabilities and computationally efficient characteristics. We illustrate our component model and describe the MFCC-VQ procedure used to develop the Tajweed Rule Checking Tool. Two features, i.e., the hybrid algorithm and the Mel-Frequency Cepstral Coefficient alone, are compared to investigate their effect on the Tajweed Rule Checking Tool's performance. Experiments carried out on a dataset demonstrate that the speed performance of the hybrid MFCC-VQ is 86.928%, 94.495% and 64.683% faster than the Mel-Frequency Cepstral Coefficient for male, female and children's speech respectively.
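One plausible way such a VQ codebook could flag recitation mismatches is sketched below: frames of the student's MFCCs whose distance to every codeword of the expert-trained codebook exceeds a threshold are marked as deviations. The threshold and distance measure are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mismatch_frames(student_mfcc, expert_codebook, threshold=25.0):
    """Return indices of student frames far from every expert codeword.

    student_mfcc: (n_frames, n_dims); expert_codebook: (n_codewords, n_dims).
    """
    dists = cdist(student_mfcc, expert_codebook)  # pairwise Euclidean
    return np.where(dists.min(axis=1) > threshold)[0]
```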
Automatic Speaker Recognition (ASR) needs a robust acoustic feature for the representation of a speaker and an efficient modeling scheme to yield high recognition accuracy even under adverse conditions. This paper presents a noise study of an ASR system using Mel-Frequency Cepstral Coefficients (MFCC) and an Artificial Neural Network (ANN) classifier. Optimization in the feature space using Fisher's F-ratio score is performed in order to develop a reduced speaker model both without added noise (only ambient room noise is present) and in several noisy conditions. A new ranking scheme is also proposed to stabilize the rank of features across noise levels by taking the arithmetic mean of the F-ratio scores obtained at various Signal-to-Noise Ratio (SNR) levels. Results are presented for a text-dependent ASR system with a 25-speaker database.
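The F-ratio ranking might be sketched as follows: score each feature dimension by between-speaker variance over mean within-speaker variance, then rank dimensions by the arithmetic mean of these scores across SNR levels, as the paper proposes. The array shapes are assumptions.

```python
import numpy as np

def f_ratio(features_per_speaker):
    """Fisher's F-ratio per feature dimension.

    features_per_speaker: list of (n_frames_i, n_dims) arrays, one per speaker.
    """
    means = np.array([f.mean(axis=0) for f in features_per_speaker])
    within = np.mean([f.var(axis=0) for f in features_per_speaker], axis=0)
    return means.var(axis=0) / within  # between-speaker over within-speaker

def stable_rank(f_ratios_per_snr):
    """Rank dimensions by the arithmetic mean of F-ratios over SNR levels."""
    return np.argsort(np.mean(f_ratios_per_snr, axis=0))[::-1]
```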
An automatic method for recognizing bird species from their voices is presented in this paper. The selected bird species are detected by a hidden Markov model (HMM) classifier using Mel-frequency cepstral coefficients (MFCC). To support the recognition process, the analysed signals are appropriately filtered before classification in a so-called prefiltration process. The prefiltration strategy uses an n-th order IIR Butterworth filter bank, with each filter applied as a band-pass filter in the band specific to the bird species and signal type. An increase in recognition accuracy was observed when prefiltration used a properly chosen filter order. Experiments were carried out on a set of bird voices containing 30 bird species, one of which is endangered.
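The prefiltration step could look like the SciPy sketch below: a Butterworth band-pass filter applied in the species-specific band before feature extraction. The band edges and filter order here are illustrative, not taken from the paper.

```python
from scipy import signal

def prefilter(y, sr, low_hz, high_hz, order=4):
    """Zero-phase band-pass filtering in the species-specific band."""
    sos = signal.butter(order, [low_hz, high_hz],
                        btype="bandpass", fs=sr, output="sos")
    return signal.sosfiltfilt(sos, y)
```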
Nowadays, many beautiful recitations of Al-Quran are available. Quranic recitation has its own characteristics, and the problem of identifying the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop a Quran reciter identification system using Mel-frequency Cepstral Coefficients (MFCC) and Gaussian Mixture Models (GMM). A database of five Quranic reciters is developed and used in the training and testing phases. We carefully randomized the database over various surahs of the Quran so that the proposed system is sensitive only to the reciter, not to the recited verses. Around 15 Quranic audio samples from 5 reciters were collected and randomized, of which 10 samples were used for training the GMM and 5 samples for testing. Results showed that our proposed system has a 100% recognition rate for the five reciters tested. Even when tested with unknown samples, the proposed system is able to reject them.
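Identification with per-reciter GMMs, including the rejection of unknown samples, might be sketched as follows: pick the model with the highest average log-likelihood and reject if even that best score is too low. The threshold value is an illustrative assumption.

```python
def identify(test_mfcc, reciter_gmms, reject_below=-60.0):
    """reciter_gmms: dict mapping reciter name -> fitted GMM (e.g. a
    scikit-learn GaussianMixture); returns a name, or None to reject."""
    scores = {name: g.score(test_mfcc) for name, g in reciter_gmms.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= reject_below else None
```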
In this paper we report an experiment carried out on a recently collected speaker recognition database, the Arunachali Language Speech Database (ALS-DB), to make a comparative study of the performance of acoustic and prosodic features for the speaker verification task. The speech database consists of speech data recorded from 200 speakers with Arunachali languages of North East India as their mother tongue. The collected database is evaluated using a Gaussian Mixture Model-Universal Background Model (GMM-UBM) based speaker verification system. The acoustic feature considered in the present study is Mel-Frequency Cepstral Coefficients (MFCC) along with their derivatives. The performance of the system has been evaluated for the acoustic and prosodic features individually as well as in combination. It has been observed that the acoustic feature, when considered individually, provides better performance than the prosodic features. However, if prosodic features are combined with the acoustic feature, the combined system outperforms both systems where the features are considered individually. There is nearly a 5% improvement in recognition accuracy with respect to the system using only acoustic features, and nearly a 20% improvement with respect to the system using only prosodic features.
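The acoustic feature described, MFCC with its derivatives, is commonly built by stacking deltas as below with librosa; the MFCC count is an assumption.

```python
import librosa
import numpy as np

def mfcc_with_deltas(y, sr, n_mfcc=13):
    """Return a (3 * n_mfcc, n_frames) matrix: MFCCs with first- and
    second-order deltas stacked along the feature axis."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])
```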
- by Hemant Patil • MFCC
- by Huan Zhao • Speech Recognition, MFCC
- by Ishan Bhardwaj • Speech Recognition, Hindi, MFCC, K-means