Performance Evaluation of Deep Convolutional Maxout Neural Network in Speech Recognition
Related papers
MAXOUT BASED DEEP NEURAL NETWORKS FOR ARABIC PHONEMES RECOGNITION
Abstract—Arabic is widely spoken among the Malay community for several reasons, such as performing worship and reciting the Holy Book of Muslims. Recently, Maxout deep neural networks have brought substantial improvements to speech recognition systems. Hence, this paper introduces a fully connected feed-forward neural network with Maxout units. The proposed deep neural network comprises three hidden layers of 500 Maxout units with 2 neurons per unit, and uses Mel-Frequency Cepstral Coefficients (MFCC) extracted from the phoneme waveforms as features. The network is trained and tested on a corpus of consonant Arabic phonemes recorded from 20 Malay speakers. Each speaker was asked to pronounce the twenty-eight consonant phonemes, with three attempts to articulate all the letters; in each attempt, all the letters were captured in a single continuous recording. The recordings were made with a SAMSON C03U USB multi-pattern condenser microphone. The data are split into five waveforms for training the proposed Maxout network and fifteen waveforms for testing. Experimentally, training with Dropout showed considerably better performance than the Sigmoid and Rectified Linear Unit (ReLU) functions. Finally, in testing, the Maxout network produced considerably better results than a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Convolutional Neural Network (CNN), a conventional feed-forward neural network (NN), and a Convolutional Auto-Encoder (CAE).
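To make the layer layout above concrete, here is a minimal sketch of a 500-unit, 2-piece Maxout layer stacked three times, written in TensorFlow/Keras. The 39-dimensional MFCC input (13 coefficients plus deltas) is an assumption; the abstract specifies only MFCC features and the 28 consonant classes.

```python
import tensorflow as tf

def maxout_dense(x, units=500, pieces=2):
    """Maxout layer: each of `units` outputs is the maximum over
    `pieces` independent affine projections of the input."""
    z = tf.keras.layers.Dense(units * pieces)(x)      # all linear pieces at once
    z = tf.keras.layers.Reshape((units, pieces))(z)   # group pieces per unit
    return tf.keras.layers.Lambda(
        lambda t: tf.reduce_max(t, axis=-1))(z)       # max over the pieces

n_mfcc = 39     # assumed: 13 MFCCs plus delta and delta-delta
n_classes = 28  # the twenty-eight consonant phonemes
inputs = tf.keras.Input(shape=(n_mfcc,))
h = inputs
for _ in range(3):                                    # three hidden Maxout layers
    h = maxout_dense(h, units=500, pieces=2)
outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)
```

The Dropout training compared in the abstract would slot between the Maxout layers as `tf.keras.layers.Dropout` in this layout.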
Automatic Speech Recognition using different Neural Network Architectures-A Survey
Speech is the vocalized form of communication based on lexical syntax. Each spoken word is a phonetic combination of vowels and consonants. Automatic Speech Recognition can be defined as computer-driven transcription of speech into human-readable text. As an emerging technique, it has attracted many researchers, who have made progress to a certain extent in recent years. This survey paper explains the architectures of the Deep Neural Network, Convolutional Neural Network and Recurrent Neural Network and their performance in the field of Automatic Speech Recognition. We also summarise the main contributions of various researchers during 2010-2016 on Acoustic Modeling and Language Modeling (the main components of Automatic Speech Recognition) using these architectures and point out their impact on ASR. We conclude the paper with a comparative study of the advantages of the surveyed architectures with respect to Word Error Rate (WER) and Phone Error Rate...
The convolutional neural networks for Amazigh speech recognition system
TELKOMNIKA Telecommunication, Computing, Electronics and Control, 2021
In this paper, we present an approach based on convolutional neural networks to build an automatic speech recognition system for the Amazigh language. The system is built with TensorFlow and uses Mel-frequency cepstral coefficients (MFCC) for feature extraction. To test the effect of the speaker's gender and age on the accuracy of the model, the system was trained and tested on several datasets. In the first experiment, the dataset consists of 9240 audio files. In the second experiment, the 9240 audio files are distributed between female and male speakers. In the third experiment, the dataset consists of 13860 audio files distributed across the age groups 9-15, 16-30, and 30+. The results show that the model trained on the adult (age 30+) speaker category achieves the best accuracy, 93.9%.
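As a rough sketch of the pipeline this abstract describes (MFCC front end feeding a CNN classifier), the snippet below uses librosa for feature extraction and a small Keras model. librosa, the sample rate, the frame budget, the layer sizes, and the number of word classes are all assumptions, since the abstract names only TensorFlow and MFCC.

```python
import librosa
import numpy as np
import tensorflow as tf

def mfcc_features(path, n_mfcc=13, max_frames=100):
    """Load a wav file and return a fixed-size MFCC matrix,
    padded/truncated to max_frames so examples can be batched."""
    y, sr = librosa.load(path, sr=16000)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    m = m[:max_frames]
    pad = np.zeros((max_frames - len(m), n_mfcc))
    return np.vstack([m, pad]).astype("float32")

# A small 2-D CNN over the (time, mfcc) plane; sizes are illustrative.
n_words = 33  # hypothetical number of Amazigh word classes
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 13, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(n_words, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```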
Improving Large Vocabulary Urdu Speech Recognition System Using Deep Neural Networks
Interspeech 2019
Development of a Large Vocabulary Continuous Speech Recognition (LVCSR) system is a cumbersome task, especially for low-resource languages. Urdu is the national language and lingua franca of Pakistan, with 100 million speakers worldwide. Due to resource scarcity, limited work has been done in the domain of Urdu speech recognition. In this paper, the collection of an Urdu speech corpus and the development of an Urdu speech recognition system are presented. The Urdu LVCSR is developed using 300 hours of read speech data with a vocabulary size of 199K words. Microphone speech is recorded from 1671 Urdu and Punjabi speakers in both indoor and outdoor environments. Different acoustic modeling techniques such as Gaussian Mixture Model based Hidden Markov Models (GMM-HMM), Time Delay Neural Networks (TDNN), Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BLSTM) networks are investigated. Cross-entropy and Lattice-Free Maximum Mutual Information (LF-MMI) objective functions are employed during acoustic modeling. In addition, a Recurrent Neural Network Language Model (RNNLM) is used for re-scoring. The developed speech recognition system has been evaluated on 9.5 hours of collected test data, and a minimum Word Error Rate (WER) of 13.50% is achieved.
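Of the acoustic models listed, the TDNN is perhaps the least self-explanatory; it is commonly implemented as a stack of dilated 1-D convolutions over the frame sequence, each layer widening the temporal context. The sketch below illustrates that structure in Keras; the feature dimensionality, layer widths, dilation schedule, and senone count are illustrative guesses, not the paper's configuration.

```python
import tensorflow as tf

# TDNN-style acoustic model: stacked dilated 1-D convolutions over time.
feat_dim, n_senones = 40, 3000            # assumed sizes, not from the paper
inputs = tf.keras.Input(shape=(None, feat_dim))   # (frames, features)
h = inputs
for dilation in (1, 2, 3, 3):             # growing temporal receptive field
    h = tf.keras.layers.Conv1D(512, kernel_size=3, dilation_rate=dilation,
                               padding="same", activation="relu")(h)
# Per-frame posteriors over tied HMM states (senones)
outputs = tf.keras.layers.Dense(n_senones, activation="softmax")(h)
tdnn = tf.keras.Model(inputs, outputs)
```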
Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition
2012
Convolutional Neural Networks (CNN) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNNs to speech recognition within the framework of the hybrid NN-HMM model. We propose to use local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance. In our method, a pair of local filtering and max-pooling layers is added at the lowest end of the neural network (NN) to normalize spectral variations of speech signals. In our experiments, the proposed CNN architecture is evaluated in a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method achieves over 10% relative error reduction on the core TIMIT test sets compared with a regular NN using the same number of hidden layers and weights. Our results also show that the best result of the proposed CNN model is better than previously published results on the same TIMIT test sets obtained with a pre-trained deep NN model.
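The core idea, convolving locally along the frequency axis and then max-pooling over frequency only so that small formant shifts between speakers collapse to the same feature, can be sketched as below. This is a minimal illustration in Keras, not the paper's exact topology; the context window, filter sizes, and state count are assumptions.

```python
import tensorflow as tf

# One local-filtering + max-pooling pair applied along the FREQUENCY
# axis of a filterbank input, feeding a fully connected NN-HMM back end.
n_frames, n_bands = 15, 40   # context window x mel bands (assumed)
n_states = 1000              # HMM states (placeholder)
inputs = tf.keras.Input(shape=(n_frames, n_bands, 1))
# Each filter spans all frames but only a local band of frequencies
h = tf.keras.layers.Conv2D(64, kernel_size=(n_frames, 8),
                           activation="relu")(inputs)   # -> (1, 33, 64)
# Pool along frequency only, tolerating spectral shifts between speakers
h = tf.keras.layers.MaxPooling2D(pool_size=(1, 3))(h)
h = tf.keras.layers.Flatten()(h)
h = tf.keras.layers.Dense(1024, activation="relu")(h)
outputs = tf.keras.layers.Dense(n_states, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)
```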
Convolutional Neural Network for Arabic Speech Recognition
The Egyptian Journal of Language Engineering
This work is focused on single-word Arabic automatic speech recognition (AASR). Two techniques are used during the feature extraction phase: log Mel-frequency spectral coefficients (MFSC) and Gammatone-frequency cepstral coefficients (GFCC), with their first- and second-order derivatives. A convolutional neural network (CNN) is used to perform the feature learning and classification process. CNNs have achieved performance enhancements in automatic speech recognition (ASR); local connectivity, weight sharing, and pooling are the crucial properties of CNNs with the potential to improve ASR. We tested the CNN model using an Arabic speech corpus of isolated words. The corpus is synthetically augmented by applying different transformations such as changing the pitch, the speed and the dynamic range, adding noise, and shifting forward and backward in time. The maximum accuracy obtained when using GFCC with CNN is 99.77%. The results of this work are compared to previous reports and indicate that the CNN achieved better performance in AASR.
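Several of the augmentation transformations listed above have direct one-line counterparts in librosa, as the hedged sketch below shows. librosa itself, the parameter values, and the file name are assumptions, since the abstract does not say how the transformations were implemented.

```python
import librosa
import numpy as np

def augment(y, sr):
    """Yield simple augmented variants of a waveform: pitch shift,
    speed change, additive noise, and a time shift, mirroring several
    of the transformations listed above. Parameter values are arbitrary."""
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)     # raise pitch
    yield librosa.effects.time_stretch(y, rate=1.1)            # speed up
    yield y + 0.005 * np.random.randn(len(y)).astype(y.dtype)  # add noise
    yield np.roll(y, int(0.1 * sr))                            # shift in time

y, sr = librosa.load("word.wav", sr=16000)  # hypothetical corpus file
variants = list(augment(y, sr))
```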
Turkish Speech Recognition Based On Deep Neural Networks
Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, 2018
In this paper we develop a Turkish speech recognition (SR) system using deep neural networks and compare it with the previous state-of-the-art traditional Gaussian mixture model-hidden Markov model (GMM-HMM) method, using the same Turkish speech dataset and the same large-vocabulary Turkish corpus. Nowadays most SR systems deployed worldwide, and particularly in Turkey, use Hidden Markov Models to deal with temporal variation in speech. Gaussian mixture models are used to estimate how well each state of each HMM fits a short frame of coefficients representing the acoustic input. A deep neural network, that is, a feed-forward neural network with many hidden layers, is another way to estimate the fit; it takes several frames of coefficients as input and outputs posterior probabilities over HMM states. It has been shown that deep neural networks can outperform the traditional GMM-HMM in other languages such as English and German. The fact that Turkish is an agglutinative language, together with the lack of a large amount of speech data, complicates the design of a high-performing SR system. Using deep neural networks clearly improves performance, but the results still fall short of those for English because less speech data is available. We present various architectural and training techniques for the Turkish DNN-based models. The models are tested using a Turkish database collected from mobile devices. In the experiments, we observe that the Turkish DNN-HMM system decreases the word error rate by approximately 2.5% compared to the traditional GMM-HMM system.
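The hybrid DNN-HMM recipe described here, a feed-forward network fed a spliced window of consecutive frames and trained to emit posteriors over HMM states, can be sketched as follows. The layer sizes, context width, and state count are illustrative, not the paper's values.

```python
import numpy as np
import tensorflow as tf

# Feed-forward DNN acoustic model: input is a spliced window of
# consecutive feature frames, output is a posterior over HMM states.
feat_dim, context, n_states = 40, 11, 2000   # assumed sizes
inputs = tf.keras.Input(shape=(context * feat_dim,))
h = inputs
for _ in range(5):                           # several hidden layers
    h = tf.keras.layers.Dense(1024, activation="relu")(h)
outputs = tf.keras.layers.Dense(n_states, activation="softmax")(h)
dnn = tf.keras.Model(inputs, outputs)

def splice(frames, context=11):
    """Stack each frame with its +/- (context // 2) neighbours,
    edge-padding the ends of the utterance."""
    half = context // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + context].reshape(-1)
                     for i in range(len(frames))])
```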
Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU
Journal of Artificial Intelligence and Soft Computing Research, 2019
Deep Neural Networks (DNN) are neural networks with many hidden layers. DNNs are becoming popular in automatic speech recognition tasks, which combine a good acoustic model with a language model. Standard feed-forward neural networks cannot handle speech data well since they have no way to feed information from a later layer back to an earlier layer. Thus, Recurrent Neural Networks (RNNs) were introduced to take temporal dependencies into account. However, RNNs cannot handle long-term dependencies due to the vanishing/exploding gradient problem. Therefore, Long Short-Term Memory (LSTM) networks were introduced; they are a special case of RNNs that take long-term as well as short-term dependencies in speech into account. Similarly, GRU (Gated Recurrent Unit) networks are an improvement on LSTM networks that also consider long-term dependencies. In this paper, we evaluate RNN, LSTM, and GRU to compare their performance on a reduced TED-LIUM speech data set. The results show that LSTM achieves the best word error rates; however, GRU optimization is faster while achieving word error rates close to those of LSTM.
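The comparison the abstract draws, identical models differing only in the recurrent cell, can be set up in a few lines; the feature dimensionality, layer widths, and class count below are placeholders, not the paper's configuration.

```python
import tensorflow as tf

# Two otherwise-identical models differing only in the recurrent cell.
def make_model(cell, feat_dim=40, n_classes=50):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, feat_dim)),  # (frames, features)
        cell(256, return_sequences=True),
        cell(256),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

lstm_model = make_model(tf.keras.layers.LSTM)
gru_model = make_model(tf.keras.layers.GRU)  # fewer gates, fewer parameters
```

The speed difference the abstract reports is consistent with the GRU using one gating matrix fewer than the LSTM, so each GRU layer has fewer parameters to optimize.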
Performance analysis on speech recognition using neural networks
Proceedings of The IEEE, 2000
Abstract. The paper presents a neural network approach for speech recognition tasks in the Romanian language. We describe the structure of a speaker-independent system for isolated word recognition, based on a neural network paradigm combined with a dynamic programming algorithm. The experimental results demonstrate that the hybrid model leads to higher recognition rates than the classic technologies. Keywords: speech recognition, neural networks, hybrid systems, acoustic modeling.
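The abstract does not name its dynamic programming algorithm; the classic choice for isolated word recognition is dynamic time warping (DTW), which aligns two feature sequences of different lengths. The sketch below assumes DTW over per-frame feature vectors, purely as an illustration of the dynamic programming step.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (e.g. per-frame network outputs or cepstral vectors). Assumed
    here; the abstract does not name its DP algorithm."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],            # best warping path
                                 D[i, j - 1],
                                 D[i - 1, j - 1])
    return D[n, m]
```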
SPEECH RECOGNITION USING THE NN
IAEME PUBLICATION, 2020
Speech is a natural and primary mode of communication for people. Speech recognition technology gives machines the ability to identify and respond to spoken commands. Deep learning, a subfield of machine learning, uses neural networks to recognize spoken words and convert them to text. In Hidden Markov Models (HMM), states use a mixture of Gaussians to model a spectral representation of the waveform. HMMs have been used for speech recognition but with poor accuracy and low efficiency in the time domain. In the proposed system, the HMM is replaced by a one-dimensional convolutional neural network (Conv1D) to increase efficiency and accuracy. The limited efficiency of HMMs led to the use of Deep Neural Networks (DNN); DNN methods can handle nonlinear data for speech analysis and have the ability to minimize the error rate. The speech recognition model selects the best speech signal representation by extracting features from the audio signal in the time domain; since speech is one-dimensional, it is often processed using sliding windows that are fed into the network. Conv1D handles speech signals by providing a full frequency feature vector at every instant, which completely describes the sample.
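A minimal Conv1D model over windowed one-dimensional speech, of the kind this abstract describes, might look like the sketch below; the window length, layer sizes, and number of command classes are assumptions, since the abstract gives no configuration.

```python
import tensorflow as tf

# Conv1D classifier over a fixed-length window of raw audio samples.
window = 16000   # assumed: 1 s of 16 kHz audio per sliding window
n_commands = 10  # hypothetical number of spoken commands
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),
    tf.keras.layers.Conv1D(32, kernel_size=80, strides=4,
                           activation="relu"),   # wide first filters on raw audio
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(n_commands, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```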