MFCC Research Papers - Academia.edu
An emotion is a mental and physiological state associated with a wide variety of feelings, thoughts, and behaviors. Emotions are subjective experiences, experienced from an individual point of view. Emotion is often associated with mood, temperament, personality, and disposition. Hence, this paper discusses a method for detecting human emotions based on acoustic features such as pitch and energy. The proposed system uses the traditional MFCC approach [2] and then a nearest neighbor algorithm for classification. Emotions are classified separately for male and female speakers, based on the fact that male and female voices have altogether different ranges [1][4], so the MFCCs vary considerably between the two. Keywords— Emotion Recognition from Speech, Fourier Transform, Mel Filter Bank, MFCC, Modern MFCC Approach, Nearest Neighbor Algorithm
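A minimal sketch of the MFCC plus nearest-neighbor pipeline this abstract describes, using librosa and scikit-learn; the file names, emotion labels, and choice of 13 coefficients are illustrative assumptions, not details from the paper.

```python
# Sketch: time-averaged MFCCs per utterance, classified by 1-nearest-neighbor.
import librosa
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path, n_mfcc=13):
    """Load a clip and return its time-averaged MFCC vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # one fixed-length vector per utterance

# Hypothetical training clips; the paper trains male and female models
# separately, so one such list would exist per gender.
train_paths = ["angry_01.wav", "happy_01.wav", "sad_01.wav"]
train_labels = ["angry", "happy", "sad"]

X = np.array([mfcc_features(p) for p in train_paths])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, train_labels)
print(clf.predict([mfcc_features("test_utterance.wav")]))
```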
We are surrounded by sounds; we hear various types on a day-to-day basis, whether music, different noises, etc. Urban life is filled with such sounds, which makes it important and highly useful to work on them and extract useful information so that it can be used efficiently. These sounds are continuously processed by the human mind to decipher information about the environment. The same can be done by a machine learning model. Convolutional neural networks have been very successful at classifying images, so it becomes a question of interest how well they work with different sounds. In this paper, we have experimented with different deep learning models to see which can be used for sound classification. We have used the UrbanSound8K dataset, which contains 8732 sound excerpts of urban sounds from 10 classes.
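A minimal sketch of a CNN of the kind such papers apply to UrbanSound8K clips, assuming inputs are log-mel spectrograms shaped 1 x 64 x 64; the layer sizes are illustrative, not the architecture used in the paper.

```python
# Sketch: a small 2-layer CNN over spectrogram "images" with 10 output classes.
import torch
import torch.nn as nn

class SoundCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 16 x 32 x 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32 x 16 x 16
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SoundCNN()
dummy = torch.randn(4, 1, 64, 64)   # a batch of 4 fake spectrograms
print(model(dummy).shape)           # torch.Size([4, 10])
```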
This paper describes the development of an efficient speech recognition system using different techniques such as Mel Frequency Cepstrum Coefficients (MFCC), Vector Quantization (VQ) and Hidden Markov Models (HMM). It explains how speaker recognition followed by speech recognition is used to recognize speech faster, more efficiently, and more accurately. MFCC is used to extract the characteristics from the input speech signal with respect to a particular word uttered by a particular speaker. HMMs are then applied to the quantized feature vectors to identify the word by evaluating the maximum log-likelihood value for the spoken word.
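A minimal sketch of the VQ-plus-HMM scoring step described above: MFCC frames are quantized against a k-means codebook, and a discrete HMM's forward algorithm gives the log-likelihood of the resulting symbol sequence. All model parameters and data here are toy stand-ins, not the paper's trained models.

```python
# Sketch: quantize frames to VQ symbols, then score with a discrete HMM.
import numpy as np
from scipy.cluster.vq import kmeans2

def forward_loglik(symbols, log_pi, log_A, log_B):
    """Log-likelihood of a symbol sequence under a discrete HMM (forward algorithm)."""
    alpha = log_pi + log_B[:, symbols[0]]
    for s in symbols[1:]:
        # log-sum-exp over previous states, then emit symbol s
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, s]
    return np.logaddexp.reduce(alpha)

frames = np.random.randn(200, 13)                     # stand-in for real MFCCs
codebook, symbols = kmeans2(frames, 16, minit="points", seed=0)

rng = np.random.default_rng(0)                        # toy 3-state HMM, 16 symbols
A = rng.dirichlet(np.ones(3), size=3)                 # transition matrix
B = rng.dirichlet(np.ones(16), size=3)                # emission matrix
pi = np.array([1.0, 0.0, 0.0]) + 1e-12                # left-to-right start

print(forward_loglik(symbols, np.log(pi), np.log(A), np.log(B)))
```

In a word recognizer, one such HMM is trained per vocabulary word and the word with the maximum log-likelihood wins.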
- by K. Ramakrishnan and +1
- MFCC, HMMs, VQR
In this work, Classical Turkish Music songs are classified into six makams. Makam is a modal framework for melodic development in Classical Turkish Music. The effect of sound clip length on system performance was also evaluated. Mel Frequency Cepstral Coefficients (MFCC) were used as features, and the data were classified using a Probabilistic Neural Network. The best correct recognition ratio, 89.4%, was obtained with a clip length of 6 s.
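A minimal Parzen-window formulation of the probabilistic neural network decision rule used for the makam classification above; the smoothing parameter, 2-D toy features, and makam labels are illustrative assumptions.

```python
# Sketch: PNN = sum of Gaussian kernel activations per class, pick the max.
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=1.0):
    """Classify x by the class whose training points give the highest
    summed Gaussian-kernel activation (the PNN decision rule)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    act = np.exp(-d2 / (2.0 * sigma ** 2))
    classes = np.unique(y_train)
    scores = [act[y_train == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

# Toy 2-D vectors standing in for clip-level MFCC features.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array(["hicaz", "hicaz", "rast", "rast"])  # hypothetical makam labels
print(pnn_predict(X, y, np.array([0.05, 0.1])))   # -> "hicaz"
```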
- by Réda Adjoudj and +1
- Speaker Recognition, Support Vector Machines, Databases, Speech
A standard speaker recognition system employs a pre-processed form of an acoustic signal, which provides information about the distribution of signal energy across time and frequency. However, different signal representations may be employed, either as genuine alternatives to the acoustic representation or as additional sources of information. The voice articulator approach demonstrates the viability and potential of this speech signal representation, especially in a Thai speaker recognition system. Applying the biometrical voice articulator together with a backpropagation multilayer perceptron attains high recognition accuracy. LPC and MFCC with several coefficient orders have been compared. The highest recognition accuracy with an efficient computation time is 97.24%, belonging to the bilabial articulator with the 16th coefficient order of MFCC. 1. Introduction Speaker recognition, a recognition area of speech processing, is one of the biometric identification...
In the last couple of years, emotion recognition has proven its significance in the areas of artificial intelligence and man-machine communication. Emotion recognition can be done using speech or images (facial expressions); this paper deals with SER (speech emotion recognition) only. An emotional speech database is essential for emotion recognition. In this paper we propose an emotional database developed in Gujarati, one of the official languages of India. The proposed speech corpus distinguishes six emotional states: sadness, surprise, anger, disgust, fear, and happiness. To observe the effect of different emotions, analysis of the proposed Gujarati speech database is carried out using speech parameters such as pitch, energy and MFCC in MATLAB.
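A minimal sketch of extracting the three parameters the paper analyses (pitch, energy, MFCC), shown here with librosa rather than MATLAB; the file name and the pitch search range are assumptions.

```python
# Sketch: frame-wise pitch, RMS energy, and MFCCs for one utterance.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=None)    # hypothetical recording

f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)     # frame-wise pitch (Hz)
energy = librosa.feature.rms(y=y)[0]              # frame-wise RMS energy
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print("mean pitch:", f0.mean())
print("mean energy:", energy.mean())
print("MFCC shape:", mfcc.shape)                  # (13, n_frames)
```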
- by Dr. S.K. Hadia
- Phonology, Phonetics, Energy, MFCC
This paper describes preliminary research results on speech recognition for speech-impaired children. Several Polish phonemes most often confused by speech-impaired children were investigated. The recordings included utterances that were examples of pathological speech. Part of the recorded material was artificially noised by a procedure generating white noise. Two of the most promising types of cepstral coefficients, standard (MFCC) and human factor (HFCC), were used for tracking speech content in the frequency domain. For recognition of a mispronounced phoneme embedded in a word, a classical dynamic time warping (DTW) algorithm as well as an HMM method were exploited. A phoneme-based approach to the DTW method has been proposed. Optimal HFCC parameters adjusted to the stated recognition task have been found. Superior HFCC performance was observed in the conducted recognition experiments, especially in strongly noised environments. The results of the research can be useful for modern logopedic (speech) therapy.
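A minimal dynamic time warping distance between two cepstral-feature sequences (rows are frames), the kind of template matching the DTW classifier above performs; the Euclidean local cost is a common choice, assumed here.

```python
# Sketch: classic DTW with the three standard step directions.
import numpy as np

def dtw_distance(A, B):
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy sequences standing in for MFCC/HFCC frame matrices of two words.
ref = np.random.randn(40, 13)
test = np.random.randn(55, 13)
print(dtw_distance(ref, test))
```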
A Brain Computer Interface (BCI) is a system which allows direct communication between the brain and a computer. It can be used to allow paralyzed as well as healthy individuals to interact with and control the surrounding environment or to communicate simply by the conscious modulation of thought patterns. Although Noninvasive Electroencephalogram-based (EEG-based) BCI is showing a lot of promise, it is faced with a number of difficult challenges, especially from the perspective of signal processing, because the signals being observed by EEG are extremely weak and typically contain very high levels of noise.
The asynchronous multiclass BCI problem is of particular importance because it closely matches realistic operating conditions (as opposed to synchronous problems).
However, it is a challenging problem and requires the development of appropriate machine learning and signal processing tools.
This thesis develops an asynchronous multiclass noninvasive EEG-based BCI system which is based on a novel feature extraction method for asynchronous BCI. A set of preprocessing, feature extraction and classification methods were implemented to help support and validate the proposed system and to benchmark it against the latest systems developed in the BCI literature.
The proposed system is tested on two separate datasets. The first is a well-known and publicly available dataset, while the second dataset was collected locally using a retail EEG kit.
The developed system showed robust and accurate classification results for both datasets.
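A minimal sketch of one common EEG preprocessing step in such pipelines: band-power features from a single channel via Welch's method. The band edges and sampling rate are illustrative; the thesis's actual (novel) feature extraction method is not reproduced here.

```python
# Sketch: alpha- and beta-band power features from one EEG channel.
import numpy as np
from scipy.signal import welch

def bandpower(x, fs, band):
    f, psd = welch(x, fs=fs, nperseg=fs * 2)
    lo, hi = band
    mask = (f >= lo) & (f <= hi)
    return np.trapz(psd[mask], f[mask])   # integrate PSD over the band

fs = 250                                  # assumed sampling rate (Hz)
eeg = np.random.randn(10 * fs)            # 10 s of fake single-channel EEG
features = [bandpower(eeg, fs, b) for b in [(8, 12), (13, 30)]]
print(features)                           # [alpha power, beta power]
```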
This thesis presents a new approach to the visualization of sound for deaf assistance that simultaneously illustrates important dynamic sound properties and the recognized sound icons in an easily readable view. In order to visualize general sounds efficiently, MFCC sound features were utilized to represent robust discriminant properties of the sound. The problem of visualizing the 39-dimensional MFCC vector was simplified by visualizing a one-dimensional value: the result of comparing one reference MFCC vector with the input MFCC vector. A new similarity measure for comparing MFCC feature vectors was proposed; it outperforms existing local similarity measures, whose one-to-one attribute-value calculation led to incorrect similarity decisions. Classification of the input sound was performed and attached to the visualization system to make the system more usable. Each time frame of sound is passed to a K-NN classification algorithm to detect short sound events. In addition, every second the input sound is buffered and forwarded to a Dynamic Time Warping (DTW) classification algorithm designed for dynamic time-series classification. Both classifiers work at the same time and deliver their results to the visualization model. The application was implemented in Java to run on smartphones with Android OS, so many considerations related to the complexity of the algorithms were taken into account, and it utilizes the smartphone's GPU to guarantee smooth, fast rendering. The system design was based on interviews with five deaf persons, taking into account their preferred visualization; the same persons then tested the system, and its evaluation was carried out based on their interaction with it. Our approach yields more accessible illustrations of sound that are more suitable for casual and less expert users.
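A minimal sketch of the dimensionality trick described above: each 39-dimensional MFCC frame is reduced to one value by comparing it with a single reference vector. Cosine similarity is used here as a stand-in; the thesis proposes its own similarity measure, which is not reproduced.

```python
# Sketch: map each MFCC frame to a scalar similarity with one reference vector.
import numpy as np

def frame_similarity(mfcc_frames, reference):
    """One similarity value per frame (row), ready to plot over time."""
    num = mfcc_frames @ reference
    den = np.linalg.norm(mfcc_frames, axis=1) * np.linalg.norm(reference)
    return num / np.maximum(den, 1e-12)

frames = np.random.randn(100, 39)       # stand-in for 39-D MFCC + deltas
ref = np.random.randn(39)               # the chosen reference MFCC vector
curve = frame_similarity(frames, ref)
print(curve.shape)                      # (100,) -- a 1-D visualizable signal
```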
In this paper, we investigate the speech recognition system for a Tajweed Rule Checking Tool. We propose a novel Mel-Frequency Cepstral Coefficient and Vector Quantization (MFCC-VQ) hybrid algorithm to help students learn and revise proper Al-Quran recitation by themselves. We describe a hybrid MFCC-VQ architecture to automatically point out the mismatch between the students' recitations and the correct recitation verified by the expert. The vector quantization algorithm is chosen due to its data reduction capabilities and computationally efficient characteristics. We illustrate our component model and describe the MFCC-VQ procedure used to develop the Tajweed Rule Checking Tool. Two features, i.e., the hybrid algorithm and the Mel-Frequency Cepstral Coefficient alone, are compared to investigate their effect on the Tajweed Rule Checking Tool performance. Experiments carried out on a dataset demonstrate that the speed performance of the hybrid MFCC-VQ is 86.928%, 94.495% and 64.683% faster than the Mel-Frequency Cepstral Coefficient for male, female and children respectively.
- by Hemant Patil
- MFCC
Results from preliminary research on recognition of Polish bird species are presented in the paper. Bird voices were recorded in a highly noisy municipal environment. A high sampling frequency of 96 kHz was used. Standard mel-frequency cepstral coefficients (MFCC) and the recently proposed human-factor cepstral coefficients (HFCC) were selected as the feature set. Superior performance of the HFCC features over the MFCC ones was observed. Properly limiting the maximal frequency during HFCC feature extraction increases the accuracy of bird species recognition. The good initial results are very promising for practical application of the described methods in the monitoring of protected bird areas.
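A minimal sketch of limiting the maximal analysis frequency during cepstral feature extraction, which the paper reports improves bird-species accuracy. librosa's MFCC is used as a stand-in for the HFCC implementation, and the file name and 12 kHz cut-off are illustrative values.

```python
# Sketch: MFCCs with and without an upper frequency limit on the filter bank.
import librosa

y, sr = librosa.load("bird_call.wav", sr=96000)   # hypothetical 96 kHz recording

# fmax is forwarded to the underlying mel filter bank.
mfcc_full = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_limited = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, fmax=12000)
print(mfcc_full.shape, mfcc_limited.shape)
```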
- by Robert Wielgat and +1
- Technology, Bioacoustics, Signal Processing, Human Factors
The article describes the results of research on automatic recognition of disordered speech. The study was carried out for several Polish phonemes that cause the greatest problems for children with speech impediments. Three types of cepstral coefficients were examined as speech-signal features: standard (CC), mel-cepstral (MFCC), and HFCC coefficients. The classical dynamic time warping (DTW) algorithm and the mean feature vector were used as classifiers. Applying HFCC features significantly improved recognition results. A wide range of parameter values in the HFCC computation process was examined in order to find their optimal values for different recognition tasks.
- by SABIQ P V
- Speaker Verification, MFCC, Hmm, Svm
The automatic identification of a person's identity from their voice is a part of modern telecommunication services. In order to execute the identification task, the speech signal has to be transmitted to a remote server, so the performance of the recognition/identification system can be influenced by various distortions that occur when transmitting the speech signal through a communication channel. This paper studies the effect of the telecommunication channel, particularly the narrowband (NB) speech codecs commonly used in current telecommunication networks, on the performance of automatic speaker recognition in the context of a channel/codec mismatch between enrollment and test utterances. The influence of speech coding on speaker identification is assessed using the reference GMM-UBM method. The results show that the partially mismatched scenario offers better results than the fully matched scenario when speaker recognition is done on speech utterances degraded by the different NB codecs. Moreover, deploying the EVS and G.711 codecs in the training process of the recognition system provides the best success rate in the fully mismatched scenario. It should be noted that the EVS and G.711 codecs offer the best speech quality among the codecs deployed in this study. This finding also fully corresponds with the finding presented by Janicki & Staroszczyk in [1], which focused on other speech codecs.
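A minimal sketch of GMM-based speaker scoring in the spirit of the GMM-UBM reference method: a universal background model and a speaker model are both trained on MFCC frames, and a test utterance is scored by their average log-likelihood ratio. Component counts and data are toy assumptions; a full GMM-UBM system would MAP-adapt the speaker model from the UBM rather than train it independently.

```python
# Sketch: log-likelihood-ratio verification with two diagonal-covariance GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm_frames = rng.normal(size=(2000, 13))            # pooled background MFCCs
spk_frames = rng.normal(loc=0.5, size=(500, 13))    # one speaker's MFCCs

ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(ubm_frames)
spk = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(spk_frames)

test = rng.normal(loc=0.5, size=(300, 13))          # claimed-speaker test data
llr = spk.score(test) - ubm.score(test)             # avg log-likelihood ratio
print("accept" if llr > 0.0 else "reject", llr)
```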
Speech is the most natural form of human communication, and speech processing has been one of the most inspiring areas of signal processing. Speech recognition is the process of automatically recognizing the spoken words of a person based on information in the speech signal. An Automatic Speech Recognition (ASR) system takes a human speech utterance as input and returns a string of words as output. This paper presents a brief survey of Automatic Speech Recognition and discusses the major subjects and improvements made in the past 60 years of research, providing a technological outlook and an appreciation of the fundamental achievements accomplished in this important area of speech communication. Definitions of the various types of speech classes, feature extraction techniques, speech classifiers and performance evaluation are issues that require attention in designing a speech recognition system. The objective of this review paper is to summarize some of the well-known methods used...
Speech has evolved as a primary form of communication between humans. The advent of digital technology gave us highly versatile digital processors with high speed, low cost and high power, enabling researchers to transform analog speech signals into digital signals that can be scientifically studied. Achieving higher recognition accuracy, a low word error rate, and addressing the sources of variability are the major considerations in developing an efficient Automatic Speech Recognition system. In speech recognition, feature extraction requires much attention because recognition performance depends heavily on this phase. In this paper, an effort has been made to highlight the progress made so far in the feature extraction phase of speech recognition systems, and an overview of the technological perspective of an Automatic Speech Recognition system is given.
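Since the survey centers on feature extraction, here is a minimal from-scratch version of the classic MFCC pipeline it covers: pre-emphasis, framing, windowing, FFT power spectrum, mel filter bank, log compression, and DCT. Frame sizes and filter counts are standard textbook values, assumed here.

```python
# Sketch: textbook MFCC extraction with numpy/scipy only.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs, n_filters=26, n_ceps=13, frame_len=0.025, hop=0.010):
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    flen, fhop = int(frame_len * fs), int(hop * fs)
    n_frames = 1 + max(0, (len(signal) - flen) // fhop)
    frames = np.stack([signal[i * fhop:i * fhop + flen] for i in range(n_frames)])
    frames *= np.hamming(flen)                                      # windowing
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft           # power spectrum

    # Triangular mel filter bank between 0 Hz and fs/2.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logfb = np.log(power @ fbank.T + 1e-10)                         # log energies
    return dct(logfb, type=2, axis=1, norm="ortho")[:, :n_ceps]     # cepstra

feats = mfcc(np.random.randn(16000), 16000)   # 1 s of fake 16 kHz audio
print(feats.shape)                            # (n_frames, 13)
```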
In this paper the recently proposed Human Factor Cepstral Coefficients (HFCC) are used for automatic recognition of pathological phoneme pronunciation in the speech of impaired children, and the efficiency of this approach is compared to application of the standard Mel-Frequency Cepstral Coefficients (MFCC) as the feature vector. Both dynamic time warping (DTW), working on whole words or embedded phoneme patterns, and hidden Markov models (HMM) are used as classifiers in the presented research. The obtained results demonstrate the superiority of combining HFCC features with the modified phoneme-based DTW classifier.
Automatic speaker recognition can achieve remarkable performance under matched training and test conditions. Conversely, results drop significantly under mismatched noisy conditions. Furthermore, feature extraction significantly affects performance. Mel-frequency cepstral coefficients (MFCCs) are the most commonly used features in this field. The literature has reported that training and testing conditions are highly correlated. Taken together, these facts support strong recommendations for using MFCC features under similar environmental conditions (train/test) for speaker recognition. However, with noise and reverberation present, MFCC performance is not reliable. To address this, we propose a new feature, 'entrocy', for accurate and robust speaker recognition, which we mainly employ to support MFCC coefficients in noisy environments. Entrocy is the Fourier transform of the entropy, a measure of the fluctuation of the information in sound segments over time. Entrocy features are combined with MFCCs to generate a composite feature set, which is tested using the Gaussian mixture model (GMM) speaker recognition method. The proposed method shows improved recognition accuracy over a range of signal-to-noise ratios.
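A minimal sketch of the 'entrocy' idea as described above: compute a spectral-entropy value per frame, then take the Fourier transform of that entropy trajectory. Frame sizes and the coefficient count are illustrative assumptions; the paper's exact formulation may differ.

```python
# Sketch: Fourier transform of the frame-wise spectral entropy of a segment.
import numpy as np

def entrocy(signal, frame_len=512, hop=256, n_coeffs=16):
    ent = []
    for start in range(0, len(signal) - frame_len, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + frame_len])) ** 2
        p = spec / max(spec.sum(), 1e-12)              # normalized spectrum
        ent.append(-np.sum(p * np.log2(p + 1e-12)))    # spectral entropy
    return np.abs(np.fft.rfft(np.array(ent)))[:n_coeffs]

x = np.random.randn(16000)          # stand-in for a sound segment
print(entrocy(x))                   # fluctuation-of-information features
```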
- by Ishan Bhardwaj
- Speech Recognition, Hindi, MFCC, K-means
- by Ovide Decroly
- MFCC, MFCC features, Lvq
Speech is the natural and primary means of human communication. It is easy, hands-free, quick, and requires no technical know-how, so communicating with a computer using speech is a simple and comfortable way for human beings. Speech-to-text has been made possible by recognition systems built on language and acoustic models, but mostly for the English language. There are many groups in India that cannot comprehend or speak English, so English-language speech recognition devices are of little use to these people. Here we have implemented isolated Hindi word recognition, which is part of an Automatic Speech Recognition (ASR) system. The primary purpose of the ASR system is to recognize a voice through a device or microphone and translate it into text so the necessary processing can be performed. In this article, we used Mel frequency cepstral coefficients (MFCC) as the feature extraction technique, with a Gaussian Mixture Model (GMM) and Vector Quantization (VQ) for recognition of isolated Hindi words. For our experiments, a Hindi word speech dataset from various speakers was prepared.
- by IJRASET Publication
- MFCC, ASR, GMM, Am
Speech-disordered persons are able to produce speech which sounds like whispering. The main objective of this work is to reconstruct abnormal speech into normal-sounding speech by using MFCC coefficients to extract features and using these to train cascaded Gaussian Mixture Models (GMM); objective measures are used to evaluate the performance of the work. The data used for the work are from the WTIMIT online corpus and speech signals recorded from speech-impaired subjects. In this work the STRAIGHT toolbox is not employed because of its complexity and muffled voice. The obtained SNR is reduced...
The article presents the results of research on the automatic detection of selected speech defects in children by means of automatic speech recognition. Detection of a speech defect can be carried out for the purposes of diagnosis or speech therapy. One frequently encountered type of speech defect is substitution, which consists in replacing the correct phoneme in a word with another phoneme of the same language. This work considered automatic detection of substitution for speech-therapy applications. In the therapy case, the automatic speech recognition task simplifies to recognizing two sounds, the correct one or the incorrect one, established on the basis of an earlier diagnosis. The following phoneme pairs occurring in Polish utterances were used as research material: {s, sz}, {si, sz}, {c, cz}, {ci, cz}, {dz, drz}, {dzi, drz}. The recordings came from children with speech defects and from people who imitated specific speech defects. The process of recognizing a speech defect consisted of two main stages: feature extraction from the speech signal and classification. For feature extraction, two methods were examined: the standard MFCC (mel-frequency cepstral coefficients) method and the relatively recently introduced HFCC (human-factor cepstral coefficients) method. At the classification stage, the effectiveness of recognizing speech defects was examined using four methods. The first was dynamic time warping (DTW). The standard DTW method is based on whole-word models. In the substitution-therapy problem considered here, words are distinguished by only a single phoneme; in such a situation the standard DTW method often fails, especially since the phonemes distinguishing two words are usually acoustically similar to each other. Moreover, segments outside the region of the distinguishing phonemes are often subject to various distortions or disturbances, which can result in a larger-than-usual DTW distance between words of the same class. A modification of the standard DTW method was therefore proposed, consisting in computing the DTW distance only between the phonemes that are components of a given word. The proposed solution assumes that the class of the recognized word is known and that the word may be pronounced correctly or incorrectly, in accordance with the previously made diagnosis. In addition to methods based on dynamic time warping, a classification method based on hidden Markov models (HMM) was also examined, both for whole-word models and for phoneme models. The following tendencies were observed in the conducted research:
- the DTW method recognizing phonemes gave higher accuracies than the DTW method recognizing words;
- the HMM method with phoneme models gave better results than the HMM method with whole-word models;
- recognition accuracies based on HFCC features were higher than those based on standard MFCC features;
- compared with the DTW classifier, the HMM method gave slightly worse results, but this issue requires further research.
The developed recognition methods can find application in the diagnosis and therapy of speech defects in children. In particular, they can be used to detect substitution of the phoneme pairs sz-s, cz-c and drz-dz for speech-therapy purposes. For the phoneme pairs sz-si, cz-ci and drz-dzi, other, more effective methods must be developed.
Potential future research directions include the application of principal component analysis and discriminant analysis in both the DTW and HMM methods. Further optimization of the HMM method's parameters is also planned. The research described in this work was sponsored by MNiI grant no. 1 H01F 046 28.
Biometric identification of individuals has been widely used as a security mechanism for accessing computer systems or restricted environments. Biometric systems have been developed to perform identification through fingerprint, iris, or voice, for example. Using the voice as a biometric identifier has become increasingly feasible due to significant advances in the area of digital processing of speech signals. This research aims to evaluate the efficiency of mel-frequency cepstral coefficients in representing the characteristics of a speaker in automatic speaker verification. The techniques used to construct the automatic speaker verification system, aimed at a hardware implementation, included: (i) mel-frequency cepstral coefficients as the feature vector; (ii) vector quantization for pattern modelling; and (iii) a decision rule based on Euclidean distance. The system used for evaluation is a modification of another automatic speaker verification system that uses linear predictive coding coefficients to represent the vocal characteristics of a speaker. It was implemented using C++ for the training phase and SystemVerilog for the verification phase. The results using mel-frequency cepstral coefficients were a 99.34% hit rate, a 0.17% error rate and a 0.49% unknown-response rate, compared respectively to a 96.52% hit rate, a 0.90% error rate and a 2.58% unknown-response rate using the linear predictive coding coefficients.
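A minimal sketch of the verification chain described above: a per-speaker VQ codebook, average minimum Euclidean distortion as the match score, and a fixed threshold as the decision rule. The threshold value and codebook size are illustrative assumptions.

```python
# Sketch: VQ-codebook speaker verification with a Euclidean decision rule.
import numpy as np
from scipy.cluster.vq import kmeans2

def train_codebook(mfcc_frames, size=32):
    codebook, _ = kmeans2(mfcc_frames, size, minit="points", seed=0)
    return codebook

def distortion(mfcc_frames, codebook):
    """Average distance from each frame to its nearest codeword."""
    d = np.linalg.norm(mfcc_frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

enroll = np.random.randn(1000, 13)        # stand-in enrollment MFCCs
test = np.random.randn(300, 13)           # stand-in test MFCCs
cb = train_codebook(enroll)
THRESHOLD = 4.0                            # hypothetical operating point
print("accept" if distortion(test, cb) < THRESHOLD else "reject")
```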
Feature-level multimodal biometric systems perform person recognition based on multiple sources of biometric information and are affected by problems like integrating evidence obtained from multiple cues and normalizing feature codes, since these are heterogeneous, in addition to the problems of monomodal biometric systems such as noisy sensor data, non-universality and lack of individuality of the chosen biometric trait, absence of an invariant representation for the biometric trait, and susceptibility to circumvention. Some of these problems can be alleviated by multimodal biometric systems that consolidate evidence from the scores of multiple biometric systems. In this work, we address two important issues related to score-level fusion. We have studied the performance of a score-level-fusion-based multimodal biometric system against different monomodal biometric systems based on voice and fingerprint modalities, and against a bimodal biometric system based on feature-level fusion of the ...
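A minimal sketch of score-level fusion as discussed above: each monomodal matcher's scores are min-max normalized (since raw scores are heterogeneous across modalities) and the fused score is a weighted sum. The weights and sample scores are illustrative assumptions.

```python
# Sketch: min-max normalization followed by weighted-sum score fusion.
import numpy as np

def minmax_norm(scores):
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / max(s.max() - s.min(), 1e-12)

voice_scores = np.array([0.2, 3.1, 2.8, 0.5])        # e.g. GMM log-likelihoods
finger_scores = np.array([40.0, 92.0, 88.0, 35.0])   # e.g. minutiae match scores

fused = 0.5 * minmax_norm(voice_scores) + 0.5 * minmax_norm(finger_scores)
print(fused)   # one comparable score per claimant, thresholded downstream
```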
The goal of this project is to design a system that reads the face and voice of a person in conjunction to detect the person's sentimental and emotional state from that data. Humans are said to express almost 50% of what they want to convey through nonverbal cues. This concept can be used to analyze tone and facial expressions, both nonverbal cues, to detect a person's sentimental and emotional state from facial expressions and speech features. Here we make use of preexisting databases, classify them according to their emotions, and create classifiers for the purpose of emotion recognition on new input data. There are many datasets available from previous surveys; the databases we use here are the Extended Cohn-Kanade database for facial expressions and the SAVEE database for speech data. We train an SVM classifier to assign the appropriate emotional label to each image, and likewise an SVM classifier able to classify the speech input given to it. The emotional states from the face and the voice of the user are found and then read in conjunction to get a more accurate representation of the user's emotional state. We then play appropriate music depending on the emotion of the user.
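A minimal sketch of reading the two modalities "in conjunction": one SVM per modality outputs class probabilities, which are averaged before the final decision. The features, label set, and equal fusion weights are toy stand-ins for the CK+- and SAVEE-derived data described above.

```python
# Sketch: late fusion of a face SVM and a speech SVM via averaged probabilities.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
face_X = rng.normal(size=(60, 20))         # stand-in facial-expression features
speech_X = rng.normal(size=(60, 13))       # stand-in speech (e.g. MFCC) features
y = rng.integers(0, 3, size=60)            # 3 hypothetical emotion classes

face_clf = SVC(probability=True).fit(face_X, y)
speech_clf = SVC(probability=True).fit(speech_X, y)

# Average the two probability distributions for one test sample.
p = 0.5 * face_clf.predict_proba(face_X[:1]) + \
    0.5 * speech_clf.predict_proba(speech_X[:1])
print("fused emotion class:", p.argmax(axis=1))
```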