Mel Frequency Cepstral Coefficient Research Papers
2025, Neural Computing and Applications
Emotion recognition in speech is a topic on which little research has been done to date. In this paper, we discuss why emotion recognition in speech is a significant and applicable research topic, and present a system for emotion recognition using one-class-in-one neural networks. By using a large database of phoneme-balanced words, our system is speaker- and context-independent. We achieve a recognition rate of approximately 50% when testing eight emotions.
2025, International Journal of Automation, Artificial Intelligence and Machine Learning
Biological indicators of ecosystem health often involve the investigation of various species of amphibians. Frogs (Order Anura) generate a variety of vocalizations (calls or croaks) to fend off predators and attract mates, and these can be automatically analyzed by applying machine learning methods to recordings from large repositories. Hidden Markov Models (HMMs) are widely used classifiers that have been successfully applied in both human speech processing and bioacoustics to study recorded vocalizations; however, there has been limited usage of HMMs in analyzing large-scale frog datasets, which highlights the need to evaluate their effectiveness for this application. Cepstral coefficients and their time derivatives were extracted as features from the 1459 vocalizations, and HMMs were applied to model both the temporal and spectral variations of acoustically comparable vocalizations. In experiments on the automatic classification of 9 species of frogs using leave-one-out cross-validation, the classification accuracy ranged from 87.3886% (9-element feature vector, 1275/1459 correct classifications) to 100.0000% (39-element feature vector, 1459/1459 correct classifications). For future work, the HMMs could be applied to other species of frogs for automatic classification and detection of vocalizations.
2025
Human emotion recognition by a computer system has been an active research area for more than a decade. Including emotion in an Automatic Speech Recognition (ASR) system can help make interaction between human and computer more natural. A lot of research effort has been devoted to recognizing emotions from speech. The aim of this paper is to review what has been addressed in the field of emotion recognition over more than the last decade. The paper is presented as a literature review on automatic speech emotion recognition with reference to the different types of speech features, databases and classifiers used in speech emotion recognition. Speech features such as Mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC), and pitch energy are considered the most prominent and efficient for emotion recognition. Different statistical models (GMM, HMM) and some other hybrid models (DNN-HMM, HMM-ANN, GM...
2025
The analysis of infant cry has become more prevalent due to advances in areas such as digital signal processing, pattern recognition and soft computing. It has changed the diagnostic ability of physicians to correctly diagnose newborns. This work presents an approach to decoding baby talk by classifying infant cry signals. We use normal infant cry signals from ages one day to six months. In particular, there are fixed cry attributes for a healthy infant cry, which can be classified into five groups: Neh, Eh, Owh, Eairh and Heh. The infant cry signal is segmented using pitch frequency, and features are extracted using mel-frequency cepstrum (MFC) coefficients in MATLAB. Statistical properties are calculated for the extracted MFCC features, and a KNN classifier is used to classify the cry signal. KNN is among the most successful classifiers for audio data when temporal structure is not important. This study is based on five different databases such as...
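As a rough illustration of the pipeline this abstract describes (statistics over MFCC frames fed to a KNN classifier), a minimal Python sketch follows. The use of librosa and scikit-learn, the file paths, the label set, and the choice of mean/standard-deviation statistics are all assumptions for illustration, not the authors' setup.

```python
# Hedged sketch: MFCC statistics + KNN classification of cry recordings.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def cry_feature_vector(wav_path, sr=16000, n_mfcc=13):
    """Summarize one recording as per-coefficient statistics over MFCC frames."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Statistical properties over time: mean and std per coefficient (assumed).
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical training data: paths and cry-class labels
# ("Neh", "Eh", "Owh", "Eairh", "Heh").
# X = np.array([cry_feature_vector(p) for p in train_paths])
# knn = KNeighborsClassifier(n_neighbors=5).fit(X, train_labels)
# print(knn.predict([cry_feature_vector("unknown_cry.wav")]))
```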
2025
Publication in the conference proceedings of EUSIPCO, Lausanne, Switzerland, 2008
2025, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.
2025, 2009 IEEE 9th Malaysia International Conference on Communications (MICC)
This paper proposes an emotion recognition system based on the Mel Frequency Cepstral Coefficients (MFCC) of bio-signals. Using data gathered in psychological emotion stimulation experiments, four types of emotions (happy, fear, sad and calm) are classified. An MLP is used as the classifier. Experimental results show the potential of this technique for verifying emotion from EEG signals: accuracy of up to 90% can be achieved. This demonstrates the potential of the MFCC-MLP approach to detect basic emotions using EEG signals from the brain scalp.
2025, Journal of Artificial Intelligence and System Modelling (JAISM)
This study introduces and evaluates novel concepts and techniques for automatic gender identification. Several approaches may be identified, including the utilization of a multilayer perceptron neural network in conjunction with the adaptive neuro-fuzzy inference system (ANFIS) algorithm and genetic algorithms to enhance the optimization of network weights. Additionally, the application of neural networks for gender recognition and their integration with the fuzzy C-means method is also noteworthy. The best outcome was achieved through the integration of the ANFIS network with the Fuzzy C-Means algorithm. Furthermore, alternative approaches proved to be highly efficacious. The highest accuracy was observed for the Texas Instruments/Massachusetts Institute of Technology dataset (97.5%) and for the Oregon Graduate Institute dataset (96.31%). The high accuracy achieved on the Oregon Graduate Institute data, characterized by its multilingual phone data and low signal-to-noise ratio, indicates the robustness of the suggested approaches in handling variations in speaker language and suboptimal speech data quality. Furthermore, employing the genetic neural network methodology facilitated the development of a high-speed network capable of attaining accuracy comparable to the multilayer perceptron neural network, albeit with a significantly reduced number of neurons in the middle layer, expressly limited to three.
2025, Proceedings of the 5th WSEAS …
Spoken language interfaces offer great potential for enhancing human-computer interaction, speech being the most natural and efficient manner of exchanging information for most of us. Such systems consist of automatic speech recognition and text-to-speech ...
2025, IEEE International Conference on Acoustics Speech and Signal Processing
Coping with inter-speaker variability (i.e., differences in the vocal tract characteristics of speakers) is still a major challenge for Automatic Speech Recognizers. In this paper, we discuss a method that compensates for differences in speaker characteristics. In particular, we demonstrate that when a continuous-density hidden Markov model based system is used as the back-end, a Knowledge-Based Front End (KBFE) can outperform the traditional Mel-Frequency Cepstral Coefficients (MFCCs), particularly when there is a mismatch in the gender and ages of the subjects used to train and test the recognizer.
2025, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding
Mel-frequency cepstral coefficients (MFCC) have been dominant in speaker recognition as well as in speech recognition. However, based on theories of speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high frequency range of speech. This insight suggests that a linear frequency scale may provide some advantages in speaker recognition over the mel scale. Based on two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performance of MFCC and LFCC (linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results on SRE10 show that, while the two are complementary, LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech. LFCC benefits more in female speech by better capturing the spectral characteristics in the high frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for the female trials, by the mainstream of the speaker recognition community.
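The MFCC/LFCC contrast in this paper comes down to how the triangular filter-bank centers are spaced. A minimal numpy sketch of that single difference follows; the filter count, frame length, and DCT normalization are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch: LFCC vs. MFCC differ only in filter-center spacing (Hz vs. mel).
import numpy as np
from scipy.fft import dct

def tri_filterbank(centers_hz, n_fft, sr):
    """Triangular filters whose peaks sit at the given center frequencies."""
    bins = np.floor((n_fft + 1) * centers_hz / sr).astype(int)
    fb = np.zeros((len(bins) - 2, n_fft // 2 + 1))
    for i in range(1, len(bins) - 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    return fb

def cepstra(frame, sr, n_filt=26, n_ceps=13, linear=True):
    """Cepstral coefficients from one frame; frame length = FFT size (assumed)."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2
    if linear:   # LFCC: centers equally spaced in Hz
        centers = np.linspace(0, sr / 2, n_filt + 2)
    else:        # MFCC: centers equally spaced on the mel scale
        mels = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filt + 2)
        centers = 700 * (10 ** (mels / 2595) - 1)
    energies = tri_filterbank(centers, n_fft, sr) @ spec
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_ceps]
```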
2025
In Machine Learning (ML) supervised classification problems, it is often beneficial to crunch down the input data in an initial pre-processing stage to improve the performance of pattern recognition systems such as Artificial Neural Networks (ANNs). But the success of this kind of approach relies on a viable choice of features that better represent the essence of the input data. In this scenario, most ML sound applications make use of acoustic models, but, regarding machine hearing, this works better if some basic properties of human hearing are considered, such as the variable widths of cochlear critical bands with respect to frequency. In this work, some techniques of cepstral feature extraction are investigated, based on Linear Predictive Coding (LPC), which can model the source-filter behavior of glottal speech, and the Short-Time Fourier Transform (STFT), using triangular and gammatone shaped filter banks to warp the short-time power spectrum into different frequency scales based on human auditory perception. In addition, a dimensional reduction of these representations is accomplished by applying a Discrete Cosine Transform (DCT). Subsequently, these feature extraction techniques are applied to a speech-only portion of an emotion recognition dataset, serving as a front-end to an ML model. The back-end consists of a shallow, fully connected architecture, known as a linear classifier, which has just one input layer and one output layer and is regularized via the Least Squares (LS) method. Finally, each model obtained is compared for accuracy. In validation experiments, the GammaTone Cepstral Coefficients (GTCC) were the most prominent front-end, with 55% overall accuracy. However, in test experiments, the Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC) and Bark Frequency Cepstral Coefficients (BFCC) features resulted in improved accuracy, with 98%, 94% and 94%, respectively.
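One reading of the back-end described above, a one-layer linear classifier regularized by least squares, is a closed-form ridge fit. A short sketch under that reading follows; the regularization weight and the one-hot label encoding are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: linear classifier fit by regularized least squares (ridge).
import numpy as np

def fit_ls_classifier(X, Y, lam=1e-2):
    """X: (n_samples, n_features) cepstral features; Y: one-hot (n_samples, n_classes)."""
    d = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + lam*I)^(-1) X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def predict(W, X):
    return np.argmax(X @ W, axis=1)  # highest linear score wins
```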
2025, 9th Conference Speech and …
This paper presents a method for extracting MFCC parameters from a normalised power spectral density. The underlying spectral normalisation method is based on the fact that speech regions with less energy need more robustness, since in these regions the noise is more dominant and thus the speech is more corrupted. Low-energy speech regions usually contain sounds of an unvoiced nature, including nearly half of the consonants, and are by nature the least reliable due to the effective presence of noise even when the speech is acquired under controlled conditions. This spectral normalisation was tested under additive artificial white noise in an isolated speech recogniser and showed very promising results. It is well known that, as far as speech representation is concerned, MFCC parameters appear to be more effective than power spectrum based features. This paper shows how the cepstral speech representation can take advantage of the above spectral normalisation and presents some results in the continuous speech recognition paradigm under clean and artificial noise conditions.
2025, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
Recent developments in large vocabulary continuous speech recognition (LVCSR) have shown the effectiveness of discriminative training approaches, employing three representative techniques: discriminative Gaussian training using the minimum phone error (MPE) criterion; discriminatively trained features estimated by multilayer perceptrons (MLPs); and discriminative feature transforms such as feature-level MPE (fMPE). Although MLP features, MPE models, and fMPE transforms have each been shown to improve recognition accuracy, no previous work has applied all three in a single LVCSR system. This paper uses a state-of-the-art Mandarin recognition system as a platform to study the interaction of all three techniques. Experiments in the broadcast news and broadcast conversation domains show that the contribution of each technique is nonredundant, and that the full combination yields the best performance and has good domain generalization.
2025
In this paper, we propose an emotion recognition system from speech signals using both spectral and prosodic features. Most traditional systems have focused on either spectral features or prosodic features. Since both spectral and prosodic features contain emotion information, it is believed that combining them will improve the performance of the emotion recognition system. Therefore, we propose to use both. For spectral features, a GMM supervector based SVM is applied. For prosodic features, a set of features clearly correlated with speech emotional states is used, again with an SVM for emotion recognition. Both feature types are then combined, and an SVM is trained on the combined feature vector. The emotion recognition accuracy of our experiments allows us to explain which features carry the most emotional information and why. It also allows us to develop criteria to...
2025, IEEE Signal Processing Letters
Usually the mel-frequency cepstral coefficients are estimated either from a periodogram or from a windowed periodogram. We state a general estimator which also includes multitaper estimators. We propose approximations of the variance and bias of the estimate of each coefficient. By using Monte Carlo computations, we demonstrate that the approximations are accurate. Using the proposed formulas, the peak matched multitaper estimator is shown to have low mean square error (squared bias + variance) on speech-like processes. It is also shown to perform slightly better in the NIST 2006 speaker verification task as compared to the Hamming window conventionally used in this context.
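For intuition, here is a hedged numpy/scipy sketch of the kind of multitaper spectrum estimate that can replace the single windowed periodogram before mel filtering. It uses uniformly weighted DPSS tapers purely as an illustration; the paper's peak-matched multitaper estimator uses differently designed tapers and weights.

```python
# Hedged sketch: multitaper spectrum estimate (DPSS tapers, equal weights assumed).
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(frame, n_tapers=6):
    """Average several tapered periodograms to lower the estimator variance."""
    tapers = dpss(len(frame), NW=(n_tapers + 1) / 2, Kmax=n_tapers)  # (K, N)
    specs = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2         # eigenspectra
    return specs.mean(axis=0)  # this estimate then feeds the usual MFCC chain
```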
2025, IEEE Transactions on Audio, Speech, and Language Processing
We present an efficient and effective nonlinear feature-domain noise suppression algorithm, motivated by the minimum mean-square-error (MMSE) optimization criterion, for noise-robust speech recognition. In contrast to the log-MMSE spectral amplitude noise suppressor proposed by Ephraim and Malah (E&M), our new algorithm aims to minimize the error expressed explicitly for the Mel-frequency cepstra instead of discrete Fourier transform (DFT) spectra, and it operates on the Mel-frequency filter bank's output. As a consequence, the statistics used to estimate the suppression factor become vastly different from those used in the E&M log-MMSE suppressor. Our algorithm is significantly more efficient than E&M's log-MMSE suppressor since the number of channels in the Mel-frequency filter bank is much smaller (23 in our case) than the number of bins (256) in the DFT. We have conducted extensive speech recognition experiments on the standard Aurora-3 task. The experimental results demonstrate a reduction of the recognition word error rate by 48% over the standard ICSLP02 baseline, 26% over the cepstral mean normalization baseline, and 13% over the popular E&M log-MMSE noise suppressor. The experiments also show that our new algorithm performs slightly better than the ETSI advanced front end (AFE) on the well-matched and mid-mismatched settings, and has 8% and 10% fewer errors than our earlier SPLICE (stereo-based piecewise linear compensation for environments) system on these settings, respectively.
2025, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Chroma-based audio features are a well-established tool for analyzing and comparing music data. By identifying spectral components that differ by a musical octave, chroma features show a high degree of invariance to variations in timbre. In this paper, we describe a novel procedure for making chroma features even more robust to changes in timbre and instrumentation while keeping their discriminative power. Our idea is based on the generally accepted observation that the lower mel-frequency cepstral coefficients (MFCCs) are closely related to timbre. Now, instead of keeping the lower coefficients, we will discard them and only keep the upper coefficients. Furthermore, using a pitch scale instead of a mel scale allows us to project the remaining coefficients onto the twelve chroma bins. Our systematic experiments show that the resulting chroma features have indeed gained a significant boost towards timbre invariance.
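A rough numpy sketch of the idea as described: compute cepstral-like coefficients over a pitch (not mel) spectrum, discard the lower, timbre-related coefficients, and fold the reconstruction into twelve chroma bins. The cutoff value and the per-frame input format are assumptions, not the paper's parameters.

```python
# Hedged sketch: timbre-robust chroma from the *upper* cepstral coefficients.
import numpy as np
from scipy.fft import dct, idct

def timbre_robust_chroma(pitch_energies, midi_pitches, n_drop=55):
    """pitch_energies: energy per semitone band; midi_pitches: their MIDI numbers."""
    c = dct(np.log(pitch_energies + 1e-10), norm="ortho")
    c[:n_drop] = 0.0                  # discard timbre-related lower coefficients
    v = idct(c, norm="ortho")         # back to the (log) pitch domain
    chroma = np.zeros(12)
    for p, val in zip(midi_pitches, v):
        chroma[p % 12] += val         # octave folding onto 12 chroma bins
    return chroma / (np.linalg.norm(chroma) + 1e-10)
```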
2025
Speaker verification systems have shown significant progress and have reached a level of performance that makes their use in practical applications possible. Nevertheless, large differences in performance are observed depending on the speaker or the speech excerpt used. This context emphasizes the importance of a deeper analysis of system performance beyond the average error rate. In this paper, the effect of the training excerpt is investigated using ALIZE/SpkDet on two different corpora: NIST-SRE 08 (conversational speech) and BREF 120 (controlled read speech). The results show that SVS performance is highly dependent on the voice samples used to train the speaker model: the overall Equal Error Rate (EER) ranges from 4.1% to 29.1% on NIST-SRE 08 and from 1.0% to 33.0% on BREF 120. The hypothesis that such performance differences are explained by the phonetic content of the voice samples is studied on BREF 120.
2025
A Novel Approach to Distinguishing Drones from Birds in Real Time examines how advanced machine learning algorithms can improve identification in surveillance systems. Given the rise of unmanned aerial vehicles, this is essential in military and civilian settings. This research was inspired by the misidentification of UAVs as birds, which can pose security problems and waste resources. This study tests advanced machine learning models such as deep neural networks, SVMs, random forests, and gradient boosting machines to distinguish drones from birds in different environments. The models are rigorously tested for accuracy, precision, recall, and computational efficiency before being used in real-time security operations. This study found that gradient-boosting machines and deep neural networks perform well, with high accuracy and low false positive rates. These models were resilient, reducing false alarms and improving monitoring system performance. The findings could improve security and pave the way for airborne monitoring technology. This study underlines the need for innovation in machine-learning applications to address modern security problems, highlighting that increased machine-learning capabilities will be crucial to national and global security initiatives. Global surveillance system efficiency can be improved by successfully implementing these advanced models.
2025
When people communicate, their states of mind are coupled with the explicit content of the messages being transmitted. The implicit information conveyed by mental states is essential to correctly understand and frame communication messages. In mediation, professional mediators consider empathy a fundamental skill when dealing with the relational and emotional aspects of a case. In court environments, emotion analysis aims to point out stress or fear as indicators of the truthfulness of certain assertions. In commercial environments, such as call centers, automatic emotional analysis of speech focuses on detecting deception or frustration. Computational analysis of emotions focuses on gathering information from speech, facial expressions, body poses and movements to predict emotional states. Specifically, speech analysis has been reported as a valuable procedure for emotional state recognition. While some studies focus on the analysis of speech features to classify emotional states, others concentrate on determining the optimal classification performance. In this paper we analyze current approaches to the computational analysis of emotions through speech and consider the replication of their techniques and findings in the domains of mediation and legal multimedia.
2025
Automatic word recognition systems keep evolving and delivering significant performance. In this paper, we present a speech processing approach intended for cochlear prostheses. The technique used for this approach is subband speech recognition; the general principle is to split the whole frequency domain into several subbands, on which statistical recognizers are independently applied. We also study which information is really used to recognize speakers. Nevertheless, the amount of computation is very large and complex, particularly in the classification phase. We are interested in the robustness of the cepstral-coding parameterisation technique and in classification with HMM models.
2025, International journal of engineering research and technology
Speech synthesis and speech recognition are areas of interest for computer scientists. More and more researchers are working to make computers understand naturally spoken language. For an international language like English, this technology has grown to a mature level. In this paper we present a model which recognizes Gujarati numerals spoken by a speaker and converts them into machine-editable text. The proposed model makes use of Mel-Frequency Cepstral Coefficients (MFCC) as the feature set and K-Nearest Neighbor (K-NN) as the classifier. The proposed model achieved an average success rate of about 78.13% on spoken Gujarati numerals.
2025
This paper presents an approach to the development of a speaker-independent, continuous word speech recognition system for a large vocabulary. Feature extraction is based on Mel-scaled Frequency Cepstral Coefficients (MFCC), and template matching employs Dynamic Time Warping (DTW). In general, the efficiency of the speech recognition system in a noise-free environment is impressive. But in the presence of environmental noise, its efficiency deteriorates drastically. As an attempt to overcome this drawback, Spectral Subtraction (SS) is used for de-noising the speech signal before feature extraction, and Convolutional Noise Removal is performed after feature extraction.
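A minimal sketch of the MFCC-plus-DTW matching step follows (the spectral subtraction and convolutional noise removal stages are omitted). It leans on librosa's built-in DTW rather than the authors' implementation, and the sample rate and template setup are assumptions.

```python
# Hedged sketch: template matching of MFCC sequences with DTW.
import librosa

def dtw_distance(wav_a, wav_b, sr=16000, n_mfcc=13):
    ya, _ = librosa.load(wav_a, sr=sr)
    yb, _ = librosa.load(wav_b, sr=sr)
    A = librosa.feature.mfcc(y=ya, sr=sr, n_mfcc=n_mfcc)
    B = librosa.feature.mfcc(y=yb, sr=sr, n_mfcc=n_mfcc)
    D, path = librosa.sequence.dtw(X=A, Y=B, metric="euclidean")
    return D[-1, -1] / len(path)  # path-length-normalized alignment cost

# Recognition = pick the template word with the smallest DTW distance:
# word = min(templates, key=lambda t: dtw_distance("input.wav", t))
```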
2025
A speaker identification system followed by speech recognition is developed. The system makes use of MFCC (mel frequency cepstrum coefficients) to process the input signal and extract features. VQ (Vector Quantization) is used to identify the speaker. LPC (Linear Predictive Coding) and the BNN (Back Propagation Neural Network) technique with a hyperbolic tangent function, under ANN (Artificial Neural Network), are used for the speech recognition system. The implementation is done using MATLAB. The developed system proved to be efficient and fast.
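The VQ-based speaker identification stage can be illustrated with k-means codebooks, one per enrolled speaker, identifying the speaker by minimum quantization distortion. The codebook size and the use of scikit-learn are assumptions for this sketch.

```python
# Hedged sketch: MFCC + VQ speaker identification via k-means codebooks.
from sklearn.cluster import KMeans

def train_codebook(mfcc_frames, size=32):
    """mfcc_frames: (n_frames, n_mfcc) from one speaker's enrollment speech."""
    return KMeans(n_clusters=size, n_init=10).fit(mfcc_frames)

def distortion(codebook, mfcc_frames):
    # Mean distance from each test frame to its nearest codeword.
    d = codebook.transform(mfcc_frames)  # (n_frames, size) distances
    return d.min(axis=1).mean()

# speaker = min(codebooks, key=lambda s: distortion(codebooks[s], test_mfcc))
```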
2024, International Journal of Advanced Computer Science and Applications
This paper presents a Bangla (also widely known as Bengali) automatic speech recognition (ASR) system that suppresses gender effects. Gender characteristics play an important role in the performance of ASR. If there is a suppression process that counteracts the loss of acoustic-likelihood differences among categories caused by gender factors, a robust ASR system can be realized. In the proposed method, we have designed a new ASR for Bangla incorporating Local Features (LFs) instead of standard mel frequency cepstral coefficients (MFCCs) as the acoustic feature while suppressing gender effects; it embeds three HMM-based classifiers for male, female and gender-independent (GI) characteristics. In experiments on a Bangla speech database prepared by us, the proposed system achieved a significant improvement in word correct rates (WCRs), word accuracies (WAs) and sentence correct rates (SCRs) in comparison with the method that incorporates standard MFCCs.
2024, Expert Systems with Applications
In the age of digital information, audio data has become an important part of many modern computer applications. Audio classification has become a focus in the research of audio processing and pattern recognition. Automatic audio classification is very useful for audio indexing, content-based audio retrieval and on-line audio distribution, but it is a challenge to extract the most common and salient themes from unstructured raw audio data. In this paper, we propose effective algorithms to automatically classify audio clips into one of six classes: music, news, sports, advertisement, cartoon and movie. For these categories a number of acoustic features, including linear predictive coefficients, linear predictive cepstral coefficients and mel-frequency cepstral coefficients, are extracted to characterize the audio content. Support vector machines are applied to classify audio into these classes by learning from training data. The proposed method then extends the application of a radial basis function neural network (RBFNN) to audio classification. RBFNN enables a nonlinear transformation followed by a linear transformation to achieve a higher dimension in the hidden space. Experiments on different genres of the various categories show that the classification results are significant and effective.
2024
Nasa Yuwe is an indigenous language of Colombia (South America) and is, to some extent, an endangered language. Different efforts have been made to revitalize it, the most important being the unification of the Nasa Yuwe alphabet. The Nasa Yuwe vowel system has 32 vowels contrasting in nasalization, length, aspiration and glottalization, causing great confusion for the learner. In order to support the correct learning of this language, three classifier models (K-nearest neighbor, multilayer neural networks and Hidden Markov Models) have been developed to detect confusion in the pronunciation of the 32 vowels. They were developed in three different experiments in order to reach the best accuracy rates. The selected strategy built binary classifiers using bagging, adding a number of negative samples for each vowel, with an accuracy rate of about 85%. With these trained classifiers, a Computer Assisted Language Learning (CALL) system prototype was designed to support the correct pronunciation of the language's vowels. Additionally, using this system, the acceptance-score distributions of native and non-native speakers were calculated, and vowel confusion was evaluated on the non-native speaker corpus.
2024, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
This study proposes a new set of feature parameters based on subband analysis of the speech signal for classification of speech under stress. The new speech features are Scale Energy (SE), Autocorrelation-Scale-Energy (ACSE), Subband-based cepstral parameters (SC), and Autocorrelation-SC (ACSC). The parameters' ability to capture different stress types is compared to widely used Mel-scale cepstrum based representations: Mel-frequency cepstral coefficients (MFCC) and Autocorrelation-Mel-scale (AC-Mel). Next, a feedforward neural network is formulated for speaker-dependent stress classification of 10 stress conditions: Angry, Clear, Cond50/70, Fast, Loud, Lombard, Neutral, Question, Slow, and Soft. The classification algorithm is evaluated using a previously established stressed speech database (SUSAS). Subband based features are shown to achieve +7.3% and +9.1% increases in classification rates over the MFCC based parameters for ungrouped and grouped stress closed-vocabulary test scenarios, respectively. Moreover, the average scores across the simulations of the new features are +8.6% and +13.6% higher than MFCC based features for the ungrouped and grouped stress test scenarios, respectively.
2024, Speech Communication
Interfering noise severely degrades the performance of a speech recognition system. The Parallel Model Compensation (PMC) technique is one of the most efficient techniques for dealing with such noise. Another approach is to use features local in the frequency domain, such as Mel-Frequency Discrete Wavelet Coefficients (MFDWCs). In this paper, we investigate the use of PMC and MFDWC features to take advantage of both noise compensation and local features (MFDWCs) to decrease the effect of noise on recognition performance. We also introduce a practical weighting technique based on the noise level of each coefficient. We evaluate the performance of several wavelet schemes using the NOISEX-92 database for various noise types and noise levels. Finally, we compare their performance versus Mel-Frequency Cepstral Coefficients (MFCCs), both using PMC. Experimental results show significant performance improvements for MFDWCs versus MFCCs, particularly after compensating the HMMs using the PMC technique. The best feature vector among the six MFDWCs we tried gave 13.72 and 5.29 points performance improvement, on average, over MFCCs for −6 and 0 dB SNR, respectively. This corresponds to 39.9% and 62.8% error reductions, respectively. Weighting the partial score of each coefficient based on the noise level further improves the performance. The average error rates for the best MFDWCs dropped from 19.57% to 16.71% and from 3.14% to 2.14% for −6 dB and 0 dB noise levels, respectively, using the weighting scheme. These improvements correspond to 14.6% and 31.8% error reductions for −6 dB and 0 dB noise levels, respectively.
2024, research.iiit.ac.in
In this paper, we describe the development of a unit selection voice for the Tamil language. We describe the build process and address the issue of speech segmentation using HMM based techniques. We report the comparison of automatically segmented labels of ...
2024
The present paper reports on the DFKI entry to the Blizzard challenge 2008. The main difference of our system compared to last year is a new join model inspired by last year's iFlytek paper; the effect seems small, but measurable in the sense that it leads to the selection of longer chunks of consecutive units. In interpreting the results of the listening test, we correlate the ratings to various measures of the system. This allows us to explain at least some part of the variance in MOS ratings.
2024, Evolution in Electrical and Electronic Engineering
The Quran is learned at an early stage by Muslim children and is usually taught by religious teachers. It must be recited with precise and correct tajweed in order to avoid misunderstanding of its meaning. Sometimes children recite the Quran without a teacher present and tend to recite it wrongly, since there is no guidance. Besides, different children have different learning styles: some are visual learners and others are audio learners. In order to help children learn the Quran in an attractive way, an Automated Tajweed Checking System for Children in Learning Quran is proposed. The system is not intended to replace the role of teachers but to attract children to learning the Quran and to help them learn without a teacher present. The project uses the concept of voice recognition, which involves a few steps: pre-processing, feature extraction, feature classification and recognition. The feature extraction technique used is the Mel-Frequency Cepstral Coefficient (MFCC), while the feature classification and recognition technique is the Hidden Markov Model (HMM). Once completed, the proposed system is expected to recognize recitation efficiently, thus helping children learn the Quran.
2024, Computer Science & Information Technology ( CS & IT )
This paper reports a word modeling algorithm for Malayalam isolated digit recognition that reduces the search time in the classification process. A recognition experiment is carried out on the 10 Malayalam digits using Mel Frequency Cepstral Coefficient (MFCC) feature parameters and the k-Nearest Neighbor (k-NN) classification algorithm. A word modeling scheme using the Hidden Markov Model (HMM) algorithm is developed. The experimental results show that, through first-digit recognition, the proposed algorithm reduces the search time of the classification process in telephony applications by 80%.
2024
Hermann Ney for his constant support, his valuable advice, and giving me the opportunity to realize this work at the Lehrstuhl für Informatik VI in Aachen. Prof. Dr. phil. nat. Harald Höge from Siemens AG Munich kindly took over the role of the second supervisor. I would like to thank him for his interest in this work and his suggestions. The joint work of my colleagues from the speech recognition group provided the necessary foundation, on which I could build my research. I would like to express my gratitude for the contributions of
2024
Autism is a well-defined clinical syndrome after the second year of life, but information about autism in the first two years of life is still scarce. Studies of home videos have described children with autism during the first year of life who did not exhibit the rigid pattern of symptoms. Therefore, developmental and environmental factors, in addition to genetic/biological factors, seem to influence the onset of autism. Here we describe (1) a hypothesis focusing on the possible implication of impoverished motherese ("manhês") during mother/infant interaction as a possible co-factor; (2) the methodological approach used to develop a computerized algorithm to detect motherese in home videos; and (3) the best detector configuration for extracting motherese from home video sequences (accuracy = 82% speaker-independent versus 87.5% speaker-dependent), to be used to test this hypothesis. KEYWORDS: Motherese; Autism; Automatic prosody detector.
2024
Emotion recognition and verification is the automated determination of the psychological state of the speaker. This paper discusses a method to extract features from a recorded speech sample and, using those features, to detect the emotion of the subject. The Mel-Frequency Cepstrum Coefficient (MFCC) method was used to extract these features. Every emotion comprises different vocal parameters exhibiting diverse characteristics of speech. These features result in different MFCC coefficients that are input to a trained Artificial Neural Network (ANN), which analyzes them against the stored database to recognize the emotion.
2024
Speech emotion recognition (SER) has an increasingly significant role in the interactions among human beings, as well as between human beings and computers. Emotions are an inherent part of even rational decision making. The correct recognition of the emotional content of an utterance assumes the same level of significance as the proper understanding of the semantic content and is an essential element of professional success. Prevalent speech emotion recognition methods generally use a large number of features and considerable signal processing effort. On the other hand, this work presents an approach for SER using minimal features extracted from appropriate, sociolinguistically designed and developed emotional speech databases. Whereas most of the reported works in SER are based on acted speech with its exaggerated display of emotions, this work focuses on elicited emotional speech in which emotions are induced. Since female speech is more expressive of emotions, this research inve...
2024, World Journal of Advanced Engineering Technology and Sciences
Convolutional neural networks (CNNs) lead the domain of sound recognition due to their flexibility and tunability through different parameters. The recognition of English alphabets spoken by different people using deep learning techniques has attracted the research community. In this paper, we explore the use of a CNN, a deep learner that can automatically learn features directly from the dataset during training, for the classification of sound signals of English alphabets. In this proposed work, we consider two CNN architectures. In the first architecture, we propose MFCC based features for a pretrained two-convolutional-layer CNN. In the second architecture, we propose a hybrid feature extraction method to train a block-based CNN. The proposed systems consist of two components, namely hybrid feature extraction and a CNN classifier. Five auditory features, the log-Mel spectrogram (LM), MFCC, chroma, spectral contrast and Tonnetz (CST), are extracted; LM and MFCC are combined as one feature set, and LM, MFCC, and CST features are aggregated as another, for training the two proposed CNNs, respectively. Sound samples of English alphabets were collected from different people of different age groups. The feature sets produced by the hybrid feature extraction methods are presented to both proposed CNNs, and experimental results are collected. The results indicate that the classification accuracy of the proposed architectures surpasses existing CNN methods with single feature extraction, and that the proposed second architecture performs more effectively than the first.
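A small sketch of the hybrid front-end as described, stacking log-Mel (LM) and MFCC maps as CNN input channels. Using equal band/coefficient counts so the two maps align, and librosa as the extractor, are assumptions of this sketch.

```python
# Hedged sketch: hybrid LM + MFCC feature maps as two CNN input channels.
import numpy as np
import librosa

def lm_mfcc_features(y, sr, n_bands=40):
    lm = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_bands)
    return np.stack([lm, mfcc])  # (2, n_bands, frames) -> channels for the CNN
```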
2024
The evolution of computer technology, including operating systems and applications, has resulted in the design of intelligent machines that can recognize a spoken word and find out its meaning. Different front-end models have specific processing times required for calculating the same number of coefficients used for pattern recognition. Over the years this has improved significantly, not only thanks to improvements in algorithms, but also thanks to the greater processing power of today's computers. In this paper we analyze the processing time and reconstructed speech quality of three common front-end methods (Linear Predictive Coding (LPC), Mel-Frequency Cepstrum (MFC), Perceptual Linear Prediction (PLP)) for calculating coefficients. Reconstructed speech quality is measured with the Perceptual Evaluation of Speech Quality (PESQ) score. Our analysis shows that, if required, a higher number of coefficients could be used without significant impact on processing time for MFC and PLP coefficients.
2024, SAKARYA UNIVERSITY JOURNAL OF COMPUTER AND INFORMATION SCIENCES
Musical instrument identification (MII) research has been studied as a subfield of the Music Information Retrieval (MIR) field. Conventional MII models are developed based on hierarchical models representing musical instrument families. However, for MII models to be used in the field of music production, they should be developed based on the arrangement-based functions of instruments in musical styles rather than these hierarchical models. This study investigates how the performance of machine learning based classification algorithms for the Guitar, Bass guitar and Drum classes changes with different feature selection algorithms, considering a popular music production scenario. To determine the effect of feature statistics on model performance, the Minimum Redundancy Maximum Relevance (mRMR), Chi-square (Chi2), ReliefF, Analysis of Variance (ANOVA) and Kruskal-Wallis feature selection algorithms were used. In the end, the neural network algorithm with wide hyperparameters (WNN) achieved the best classification accuracy (91.4%) when using the first 20 statistics suggested by the mRMR and ReliefF feature selection algorithms.
2024, International Journal of Innovative Research in Computer and Communication Engineering
The performance of a text-independent speaker verification (SV) system degrades when speaker model training is done in one environment while testing is done in another, due to mismatch in the phonetic content of speech utterances, recording environment, session variability and sensor variability between training and testing, which are major problems in speaker verification. The robustness of the SV system has been improved by applying different Voice Activity Detection (VAD) techniques, Cepstral Mean Normalization (CMN) and Cepstral Variance Normalization (CVN) at the feature level, and score normalization techniques at the score level. In this paper we report experiments carried out on a recently collected speaker recognition database, the Arunachali Language Speech Database (ALS-DB). The collected database is evaluated with a Gaussian mixture model and Universal Background Model (GMM-UBM) and Mel-Frequency Cepstral Coefficients (MFCC) with their first and second or...
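The feature-level normalizations mentioned, CMN and CVN, amount to per-utterance standardization of the MFCC matrix, as in this minimal sketch (the small epsilon floor is an assumption to avoid division by zero).

```python
# Hedged sketch: cepstral mean (CMN) and variance (CVN) normalization.
import numpy as np

def cmvn(feats, variance=True):
    """feats: (n_frames, n_coeffs) MFCC matrix for one utterance."""
    out = feats - feats.mean(axis=0)           # CMN: remove channel offset
    if variance:                               # CVN: equalize dynamic range
        out = out / (feats.std(axis=0) + 1e-10)
    return out
```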
2024
This study deals with a noise-robust distributed speech recognizer for real-world applications deploying a feature parameter compensation technique. To realize this objective, Mel-LP based speech analysis has been used in speech coding on the linear frequency scale by applying a first-order all-pass filter instead of a unit delay. To minimize the mismatch between training and test phases, Cepstral Mean Normalization (CMN) and Blind Equalization (BEQ) have been applied to enhance Mel-LP cepstral coefficients as an ...
2024, IEEE 10th INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS
This paper describes a robust feature extraction technique for continuous speech recognition. Central to the technique is the Minimum Variance Distortionless Response (MVDR) method of spectrum estimation. We incorporate perceptual information directly into the spectrum estimation. This provides improved robustness and computational efficiency when compared with the previously proposed MVDR-MFCC technique [10]. On an in-car speech recognition task this method, which we refer to as PMCC, is 15% more accurate in WER and requires approximately a factor of 4 less computation than the MVDR-MFCC technique. On the same task PMCC yields a 20% relative improvement over MFCC and an 11% relative improvement over PLP front-ends. Similar improvements are observed on the Aurora 2 database.
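For reference, here is a hedged numpy sketch of plain MVDR spectrum estimation, the core of the technique, without the perceptual warping that distinguishes PMCC: S(w) = 1 / (e(w)^H R^{-1} e(w)), with R the order-p autocorrelation matrix. The model order and the regularization term are assumptions.

```python
# Hedged sketch: MVDR spectral envelope from one speech frame.
import numpy as np
from scipy.linalg import toeplitz

def mvdr_spectrum(frame, p=12, n_freqs=257):
    # Autocorrelation lags r[0..p] and the (p+1)x(p+1) Toeplitz matrix R.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:p + 1]
    R = toeplitz(r)
    Rinv = np.linalg.inv(R + 1e-8 * np.eye(p + 1))    # regularized inverse
    w = np.linspace(0, np.pi, n_freqs)
    E = np.exp(-1j * np.outer(np.arange(p + 1), w))   # steering vectors e(w)
    denom = np.einsum("kf,kl,lf->f", E.conj(), Rinv, E).real
    return 1.0 / np.maximum(denom, 1e-12)             # S_MVDR at each frequency
```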
2024, Journal of Applied Sciences
2024
We address the problem of compactly representing the discrete spectral amplitudes of vowel sounds produced by a sinusoidal model. A study of frequency-warped all-pole model representation of spectral amplitudes is presented. It has been generally accepted that incorporating Bark-scale frequency warping in all-pole modeling improves the perceived accuracy of the modeled sound. However, our study suggests that whether such frequency-warped all-pole modeling improves the modeling accuracy depends on the nature of the vowel as well as the voice. We propose an alternative warping function which may be used to improve the modeling accuracy more universally.
2024, arXiv (Cornell University)
The automatic speaker identification procedure extracts features that help to identify the components of the acoustic signal while discarding everything else, such as background noise, emotion, and hesitation. The acoustic signal generated by a human is filtered by the shape of the vocal tract, including the tongue, teeth, etc. The shape of the vocal tract determines what signal comes out in real time, and it manifests itself in the envelope of the short-time power spectrum. ASR needs an efficient way of extracting features from the acoustic signal that can effectively model the shape of the individual vocal tract. To identify an acoustic signal in a large collection of acoustic signals, i.e. corpora, a compact, low-dimensional representation of the total variability space is needed, built from the GMM mean supervector. This work presents an efficient way to implement this dimensional compactness in the total variability space and uses cosine distance scoring to produce a fast output score for short utterances.
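Once low-dimensional vectors from the total variability space are in hand, cosine distance scoring reduces to a single normalized dot product, as in this sketch (the total-variability extraction itself is omitted, and the vectors and threshold are assumed given).

```python
# Hedged sketch: cosine distance scoring between speaker vectors (e.g., i-vectors).
import numpy as np

def cosine_score(w_test, w_enroll):
    return float(w_test @ w_enroll /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_enroll) + 1e-12))

# accept = cosine_score(w_test, w_target) > threshold  # fast, training-free scoring
```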
2024, Interspeech 2006
In this paper, we describe a prototype speaker identification system using an auto-associative neural network (AANN) and formant features. Our experiments demonstrate that formants extracted from the difference spectrum perform significantly better than formants extracted from the normal spectrum for the task of speaker identification. We also demonstrate that formants from the difference spectrum provide speaker identification performance comparable to that of features such as weighted linear predictive cepstral coefficients and Mel-frequency cepstral coefficients. Finally, we combine the results of the formant-based system and the linear predictive cepstral coefficient based system to achieve 100% identification performance.
2024, International Conference on Electronic Design
It is possible to identify voice disorders using certain features of speech signals. A complementary technique could be acoustic analysis of the speech signal, which has been shown to be a potentially useful tool to detect voice diseases [2]. The focus of this study is to compare the performance of mel-frequency cepstral coefficient (MFCC) and linear predictive cepstral coefficient (LPCC) features in the detection of vocal fold pathology and also bring out scale to