Feature bandwidth extension for Persian conversational telephone speech recognition

Improving the performance of MFCC for Persian robust speech recognition

Journal of Artificial Intelligence and Data Mining, 2015

Mel-frequency cepstral coefficients (MFCCs) are the most widely used features in speech recognition, but they are very sensitive to noise. In this paper, to achieve satisfactory performance in Automatic Speech Recognition (ASR) applications, we introduce a new noise-robust set of MFCC vectors estimated through the following steps. First, spectral mean normalization is applied as a pre-processing step to the noisy original speech signal. The pre-emphasized speech is segmented into overlapping time frames and windowed with a modified Hamming window, and higher-order autocorrelation coefficients are extracted. Next, the lower-order autocorrelation coefficients are eliminated. The result is passed through an FFT block and the power spectrum of the output is calculated. A Gaussian-shaped filter bank is applied to the result, followed by a logarithm and two compensator blocks, one performing mean subtraction and the other a root operation; a DCT is the final step. We use an MLP neural network to evaluate the performance of the proposed MFCC method and to classify the results. Speech recognition experiments on various tasks indicate that the proposed algorithm is more robust than traditional ones in noisy conditions.
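The step sequence described in this abstract can be sketched as follows. This is a minimal illustration only: the frame size, hop, filter-bank shape, number of dropped autocorrelation lags, and the square-root compensation exponent are assumptions, not the paper's settings.

```python
import numpy as np

def robust_mfcc(signal, frame_len=256, hop=128, n_filters=20, n_ceps=13, drop_low=2):
    """Sketch of the autocorrelation-based robust MFCC pipeline (illustrative
    parameter values, not the paper's configuration)."""
    # Pre-emphasis
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Overlapping frames with a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = []
    for i in range(n_frames):
        frame = emphasized[i * hop:i * hop + frame_len] * window
        # Autocorrelation; zero out the noise-sensitive lower-order lags
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        ac[:drop_low] = 0.0
        # Power spectrum of the remaining autocorrelation sequence
        power = np.abs(np.fft.rfft(ac, n=frame_len)) ** 2
        # Gaussian-shaped filter bank (simplified: linearly spaced centres)
        centres = np.linspace(0, len(power) - 1, n_filters)
        width = len(power) / n_filters
        bank = np.exp(-0.5 * ((np.arange(len(power))[None, :] - centres[:, None]) / width) ** 2)
        energies = bank @ power
        # Log compression, mean subtraction, root compensation, then DCT-II
        logs = np.log(energies + 1e-10)
        centred = logs - logs.mean()
        comp = np.sign(centred) * np.abs(centred) ** 0.5
        dct = np.cos(np.pi / n_filters * (np.arange(n_filters) + 0.5)[None, :]
                     * np.arange(n_ceps)[:, None])
        feats.append(dct @ comp)
    return np.array(feats)
```

Each row of the returned array is one cepstral vector; a real front-end would use mel-spaced filter centres rather than the linear spacing used here for brevity.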

Persian Vowel recognition with MFCC and ANN on PCVC speech dataset

ArXiv, 2018

In this paper, a new method for recognizing consonant-vowel phoneme combinations is proposed on a new Persian speech dataset, PCVC (Persian Consonant-Vowel Combination), which is used to recognize Persian phonemes. The PCVC dataset contains 20 sets of audio samples from 10 speakers, covering combinations of the 23 consonant and 6 vowel phonemes of the Persian language. Each sample is a combination of one consonant and one vowel: the consonant phoneme is pronounced first, immediately followed by the vowel phoneme. Each sound sample is a 2-second frame of audio containing on average 0.5 seconds of speech, with the rest silence. The proposed method applies MFCC (Mel Frequency Cepstrum Coefficients) extraction to every partitioned sound sample; each training MFCC vector is then fed to a multilayer perceptron feed-forward ANN (Artificial Neural Network) for training. At the end, the test samples are...
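The MFCC-to-MLP step can be sketched with a minimal feed-forward network trained by softmax cross-entropy on synthetic MFCC-like vectors. The dimensions (13 coefficients, 32 hidden units, 6 vowel classes), learning rate, and data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_hidden, n_classes = 13, 32, 6

W1 = rng.standard_normal((n_feat, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_classes)) * 0.1
b2 = np.zeros(n_classes)

def forward(x):
    h = np.tanh(x @ W1 + b1)                     # hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)   # softmax posteriors

def train_step(x, y, lr=0.1):
    global W1, b1, W2, b2
    h, p = forward(x)
    grad = (p - np.eye(n_classes)[y]) / len(x)   # dL/dlogits for cross-entropy
    gh = (grad @ W2.T) * (1.0 - h ** 2)          # backprop through tanh
    W2 -= lr * (h.T @ grad); b2 -= lr * grad.sum(axis=0)
    W1 -= lr * (x.T @ gh);   b1 -= lr * gh.sum(axis=0)

# Synthetic "MFCC vectors": each vowel class clustered around its own mean.
X = rng.standard_normal((600, n_feat)) + np.repeat(np.arange(n_classes), 100)[:, None]
y = np.repeat(np.arange(n_classes), 100)
for _ in range(200):
    train_step(X, y)
accuracy = (forward(X)[1].argmax(axis=1) == y).mean()
```

On real data the training vectors would come from the MFCC extraction stage rather than from synthetic clusters.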

MFCCs and Gabor Features for Improving Continuous Arabic Speech Recognition in Mobile Communication Modified

2018

We argue that improved performance of automatic speech recognition (ASR) systems in mobile communication systems can be achieved through two modules: a front-end (feature extractor) and a back-end (recognizer). In the front-end we use Gabor features combined with MFCCs (GF-MFCC), motivated by their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals, thereby reducing spectral variations and correlations. In the back-end we investigate classification systems for speech using CHMMs (continuous hidden Markov models). Our findings show that the HMM system consistently achieves almost 1.93% on clean speech, 5.23% with the AMR-NB coder, and 1.1% with DSR coders. The system was trained on 440 sentences from 20 speakers, with labels generated by Viterbi alignment from a maximum likelihood (ML) trained CHMM system using the HTK toolkit.
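The Gabor part of a GF-MFCC front-end applies 2-D Gabor filters to a spectro-temporal representation. A minimal sketch, assuming a log-mel spectrogram input and an illustrative three-orientation filter bank (the kernel size, frequency, and bandwidth below are assumptions, not the paper's parameters):

```python
import numpy as np

def gabor_kernel(size=11, theta=0.0, freq=0.25, sigma=3.0):
    """2-D Gabor kernel: a Gaussian envelope times an oriented cosine carrier.
    Parameter values are illustrative assumptions."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def gabor_features(log_mel, thetas=(0.0, np.pi / 4, np.pi / 2)):
    """Valid 2-D convolution of a log-mel spectrogram (freq x time) with a
    small bank of oriented Gabor filters, stacking the response maps."""
    maps = []
    for th in thetas:
        k = gabor_kernel(theta=th)
        s = k.shape[0]
        out = np.zeros((log_mel.shape[0] - s + 1, log_mel.shape[1] - s + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(log_mel[i:i + s, j:j + s] * k)
        maps.append(out)
    return np.stack(maps)
```

The response maps would then be downsampled and concatenated with the MFCC stream before being handed to the CHMM back-end.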

Effect of Gender on Improving Speech Recognition System

International Journal of Computer Applications

Speech is the output of a time-varying excitation applied to a time-varying system, generating pulses with a fundamental frequency F0. This time-varying impulse train serves as one of the features, characterized by the fundamental frequency F0 and its formant frequencies. These features vary from one speaker to another and from gender to gender. In this paper the effect of gender on improving speech recognition is considered. Variation in F0 and formant frequencies is the main feature characterizing variation across speakers: the variation is very small within a speaker, moderate within the same gender, and very high across genders. This information can be exploited to recognize gender and to improve the performance of a speech recognition system by training separate models for each gender. Five sentences were selected for training, each spoken and recorded by 20 female speakers and 20 male speakers. The speech corpus will be
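The F0-based gender cue described above can be sketched with a simple autocorrelation pitch estimator followed by a threshold. The search range, sampling rate, and 165 Hz decision threshold are illustrative assumptions, not the paper's method.

```python
import numpy as np

def estimate_f0(frame, sr=16000, fmin=60.0, fmax=400.0):
    """Autocorrelation-based F0 estimate for a voiced frame: pick the lag
    with the strongest autocorrelation inside the plausible pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for fmax..fmin Hz
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def guess_gender(f0, threshold=165.0):
    """Crude threshold rule: typical male F0 is roughly 85-180 Hz and
    female roughly 165-255 Hz (the 165 Hz split is an assumption)."""
    return "female" if f0 > threshold else "male"
```

In a gender-dependent recognizer, this decision would route the utterance to the male or female acoustic model set.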

DeepMine Speech Processing Database: Text-Dependent and Independent Speaker Verification and Speech Recognition in Persian and English

Odyssey 2018 The Speaker and Language Recognition Workshop

In this paper, we introduce a new database for text-dependent, text-prompted and text-independent speaker recognition, as well as for speech recognition. DeepMine is a large-scale database in Persian and English, with its current version containing more than 1300 speakers and 360 thousand recordings overall. DeepMine has several appealing characteristics which make it one of a kind. First of all, it is the first large-scale speaker recognition database in Persian, enabling the development of voice biometrics applications in the native language of about 110 million people. Second, it is the largest text-dependent and text-prompted speaker recognition database in English, facilitating research on deep learning and other data-demanding approaches. Third, its unique combination of Persian and English makes it suitable for exploring domain adaptation and transfer learning approaches, which constitute some of the emerging tasks in speech and speaker recognition. Finally, the extensive annotation with respect to age, gender, province, and educational level, combined with the inherent variability of the Persian language in terms of different accents, makes it ideal for exploring the use of attribute information in utterance and speaker modeling. The presentation of the database is accompanied by several experiments using state-of-the-art algorithms. More specifically, we conduct experiments using HMM-based i-vectors, and we reaffirm their effectiveness in text-dependent speaker recognition. Furthermore, we conduct speech recognition experiments using the annotated text-independent part of the database for training and testing, and we demonstrate that the database can also serve for training robust speech recognition models in Persian.

Comparison of voice features for Arabic speech recognition

2011

Selection of speech features for speech recognition has been investigated for languages other than Arabic. The Arabic language has its own characteristics, so some speech features may be better suited to Arabic speech recognition than others. In this paper, several feature extraction techniques are explored to find the features that give the highest speech recognition rate. Our investigation showed that Mel-Frequency Cepstral Coefficients (MFCC) gave the best result. We also consider using an operator well known in the image processing field to modify the way MFCCs are calculated, resulting in a new feature that we call LBPCC. We describe how we apply this operator and then conduct experiments to test the proposed feature.
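The name LBPCC suggests the Local Binary Pattern operator from image processing; the abstract does not specify the exact variant, so the following is only an illustrative sketch of the basic 8-neighbour LBP, which replaces each interior pixel with an 8-bit code comparing it with its neighbours.

```python
import numpy as np

def lbp(image):
    """Basic 8-neighbour Local Binary Pattern: for each interior pixel,
    set bit k if the k-th neighbour is >= the centre pixel."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = image.shape
    centre = image[1:h - 1, 1:w - 1]
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for bit, (di, dj) in enumerate(offsets):
        neigh = image[1 + di:h - 1 + di, 1 + dj:w - 1 + dj]
        out |= (neigh >= centre).astype(np.uint8) << bit
    return out
```

Applied to a time-frequency representation, such codes capture local spectral texture before the cepstral stage.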

A phone-based approach to non-linguistic speech feature identification

Computer Speech & Language, 1995

In this paper we present a general approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The basic idea is to process the unknown speech signal with feature-specific phone model sets in parallel, and to hypothesize the feature value associated with the model set having the highest likelihood. This technique is shown to be effective for text-independent gender, speaker, and language identification. Text-independent speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with two utterances for both corpora. Experiments in which speaker-specific models were estimated without using the phonetic transcriptions for the TIMIT speakers achieved the same identification accuracies as those obtained with the transcriptions. French/English language identification is better than 99% with 2 s of read laboratory speech. On spontaneous telephone speech from the OGI corpus, the language can be identified as French or English with 82% accuracy from 10 s of speech. Ten-language identification on the OGI corpus is 59.7% with 10 s of signal.
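The parallel-model decision rule described here can be sketched as follows. Single diagonal Gaussians stand in for the paper's phone model sets, purely for illustration: the utterance is scored by each feature value's model set and the value with the highest total likelihood wins.

```python
import numpy as np

def loglik(frames, mean, var):
    """Total diagonal-Gaussian log-likelihood of all frames (frames: n x d)."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var))

def identify(frames, model_sets):
    """model_sets maps a feature value (e.g. 'male'/'female' or a language
    name) to its (mean, var) model; return the highest-likelihood value."""
    scores = {v: loglik(frames, m, s) for v, (m, s) in model_sets.items()}
    return max(scores, key=scores.get)
```

With real phone model sets, `loglik` would be replaced by a decoder pass over each set, but the argmax decision is the same.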

Speech Feature Evaluation for Bangla Automatic Speech Recognition

Technical Challenges and Design Issues in Bangla Language Processing, 2013

This chapter presents Bangla (widely known as Bengali) Automatic Speech Recognition (ASR) techniques by evaluating different speech features, such as Mel Frequency Cepstral Coefficients (MFCCs), Local Features (LFs), and phoneme probabilities extracted by time-delay artificial neural networks of different architectures. Moreover, canonicalization of speech features is performed for Gender-Independent (GI) ASR. In the canonicalization process, the authors design three classifiers for male, female, and GI speakers, and extract the output probabilities from these classifiers to take their maximum. Maximizing the output probabilities for each speech file provides higher correctness and accuracy for GI speech recognition. Besides, dynamic parameters (velocity and acceleration coefficients) are also used in the experiments to obtain higher accuracy in phoneme recognition. From the experiments, it is also shown that dynamic parameters with hybrid features also i...
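The canonicalization step above can be sketched as an element-wise maximum over the posterior vectors of the three classifiers, followed by the usual argmax decision. The posterior values here are synthetic placeholders; the chapter's classifiers produce these from speech features.

```python
import numpy as np

def canonicalize(p_male, p_female, p_gi):
    """Element-wise maximum over the three classifiers' phoneme posteriors."""
    return np.maximum.reduce([p_male, p_female, p_gi])

def decide(p_male, p_female, p_gi):
    """Pick the phoneme whose canonicalized posterior is largest."""
    return int(np.argmax(canonicalize(p_male, p_female, p_gi)))
```

Taking the maximum lets whichever classifier is most confident (typically the one matching the speaker's gender) dominate the decision.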

MirasVoice: A bilingual (English-Persian) speech corpus

2018

Speech and speaker recognition is one of the most important research and development areas and has received considerable attention in recent years. The desire to produce a natural form of communication between humans and machines can be considered the motivating factor behind such developments. Speech has the potential to influence numerous fields of research and development. In this paper, MirasVoice, a bilingual (English-Farsi) speech corpus, is presented. Over 50 native Iranian speakers able to speak both Farsi and English volunteered to help create this bilingual corpus. The volunteers read text documents and then answered questions spontaneously in both English and Farsi. A text-independent GMM-UBM speaker verification engine was designed in this study to validate and explore the performance of this corpus. This multilingual speech corpus can be used in a variety of language-dependent and language-independent applications. For exampl...
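The GMM-UBM engine mentioned above scores a verification trial by the log-likelihood ratio between a speaker model and a universal background model (UBM). A one-dimensional sketch with two-component GMMs, purely for illustration and not the study's configuration:

```python
import numpy as np

def gmm_logpdf(x, weights, means, vars_):
    """Log-density of 1-D samples x under a diagonal mixture of Gaussians."""
    comp = (weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / vars_)
            / np.sqrt(2 * np.pi * vars_))
    return np.log(comp.sum(axis=1))

def llr_score(x, spk, ubm):
    """Mean per-frame log-likelihood ratio log p(x|spk) - log p(x|UBM);
    a positive score supports the claimed speaker."""
    return np.mean(gmm_logpdf(x, *spk) - gmm_logpdf(x, *ubm))
```

In practice the speaker model is MAP-adapted from the UBM and the score is compared against a decision threshold tuned on development trials.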