Design and development of phonetically rich Urdu speech corpus (original) (raw)
Related papers
Large vocabulary continuous speech recognition for Urdu
Proceedings of the 8th International Conference on Frontiers of Information Technology - FIT '10, 2010
In this article, we focus on creating a large vocabulary speech recognition system for the Slovenian language. Currently, state-of-the-art recognition systems are able to use vocabularies with sizes of 20,000 to 100,000 words. These systems have mostly been developed for English, which belongs to a group of uninflectional languages. Slovenian, as a Slavic language, belongs to a group of inflectional languages. Its rich morphology presents a major problem in large vocabulary speech recognition. Compared to English, the Slovenian language requires a vocabulary approx. ten times greater for the same degree of text coverage. Consequently, the difference in vocabulary size causes a high degree of OOV (out-of-vocabulary words). Therefore OOV words have a direct impact on recognizer efficiency. The characteristics of inflectional languages have been considered when developing a new search algorithm with a method for restricting the correct order of sub-word units, and to use separate language models based on sub-words. This search algorithm combines the properties of sub-wordbased models (reduced OOV) and word-based models (the length of context). The algorithm also enables better search-space limitation for sub word models. Using sub-word models, we increase recognizer accuracy and achieve a comparable search space to that of a standard word-based recognizer. Our methods were evaluated in experiments on a SNABI speech database.
Speech Corpus Development for a Speaker Independent Spontaneous Urdu Speech Recognition System
2010
This paper reports the design and development of an 82 speaker Urdu speech corpus for speaker independent spontaneous speech recognition using the CMU Sphinx Open Source Toolkit for Speech Recognition. The corpus consists of 45 hours of spontaneous and read speech data from 82 speakers (42 male and 40 female), recorded over a microphone and a telephone line. The speech was collected from speakers ranging from 20 to 55 years of age. Recording sessions were conducted in office and home environments.
A Medium Vocabulary Urdu Isolated Words Balanced Corpus for Automatic Speech Recognition
Abstract—The role of a standard database in conducting and evaluating the speech recognition research is two-fold. Firstly, it provides a standard platform for the research by providing a balance amongst various aspects of speech recognition such as gender, dialect, and age. Secondly, it provides a common platform for comparing the performance of various speech recognition approaches. This paper presents the development of a Medium Vocabulary Speech Corpus for Urdu Language.
Urdu Speech Corpus and Preliminary Results on Speech Recognition
Language resources for Urdu language are not well developed. In this work, we summarize our work on the development of Urdu speech corpus for isolated words. The Corpus comprises of 250 isolated words of Urdu recorded by ten individuals. The speakers include both native and non-native, male and female individuals. The corpus can be used for both speech and speaker recognition tasks. We also report our results on automatic speech recognition task for the said corpus. The framework extracts Mel Frequency Cepstral Coefficients along with the velocity and acceleration coefficients, which are then fed to different classifiers to perform recognition task. The classifiers used are Support Vector Machines, Random Forest and Linear Discriminant Analysis. Experimental results show that the best results are provided by the Support Vector Machines with a test set accuracy of 73%. The results reported in this work may provide a useful baseline for future research on automatic speech recognition of Urdu.
A Speech Recognition System for Urdu Language
Communications in Computer and Information Science, 2009
This paper presents a speech processing and recognition system for individually spoken Urdu language words. The speech feature extraction was based on a dataset of 150 different samples collected from 15 different speakers. The data was pre-processed using normalization and by transformation into frequency domain by (discrete Fourier transform). The speech recognition feed-forward neural models were developed in MATLAB. The models exhibited reasonably high training and testing accuracies. Details of MATLAB implementation are included in the paper for use by other researchers in this field. Our ongoing work involves use of linear predictive coding and cepstrum analysis for alternative neural models. Potential applications of the proposed system include telecommunications, multi-media, and voice-activated tele-customer services.
Urdu speech corpus for travel domain
2016
Speech corpus is a collection of recorded speech data. It is a fundamental component to develop any speech based applications. This paper presents the design and development of an Urdu speech corpus for travel domain. The corpus consists of a total of 250 vocabulary items including city names, days, time and numbers. The corpus has been used to develop a speech recognition system with a laboratory accuracy of 95.6% and a field accuracy of 87.21%. The corpus can be used for domestic flight queries, domestic flight reservation, train inquiry service and reservation and bus information and reservation.
Design of an Urdu speech recognizer based upon acoustic phonetic modeling approach
2004
Airtonlatic Speech Kecognifion (ASR) is one of (he most developing.fields qf ihe mudem science. It has nzan!) impurtunt upp/icariuns in oiir saciai I@ UJ weli as nzuizy weus uf scientijc disciplines like compriter science, and media. The miin aim ofthis paper is to analyze and implement different techniques ofspeech Recognition like Pattern-Matching or Acoustic phonetic Modeling to Urdu Lungtiage. A profotype has heen developed which recognises the cnnfznirorts Urd1 Speech with 55 In 60% acctcrucy.
2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), 2020
This paper presents first step towards Large Vocabulary Continuous Speech Recognition (LVCSR) system for Urdu-English code-switched conversational speech. Urdu is the national language and lingua franca of Pakistan, with 100 million speakers worldwide. English, on the other hand, is official language of Pakistan and commonly mixed with Urdu in daily communication. Urdu, being under-resourced language, have no substantial Urdu-English code-switched corpus in hand to develop speech recognition system. In this research, readily available spontaneous Urdu speech corpus (25 hours) is revised to use it for enhancement of read speech Urdu LVCSR to recognize code-switched speech. This data set is split into 20 hours of train and 5 hours of test set. 10 hours of Urdu BroadCast (BC) data are collected and annotated in a semi-supervised way to enhance the system further. For acoustic modeling, state-of-the-art DNN-HMM modeling technique is used without any prior GMM-HMM training and alignments...
Speaker Independent Urdu Speech Recognition Using HMM
Automatic Speech Recognition (ASR) is one of the advanced fields of Natural Language Processing (NLP). Recent past has witnessed valuable research activities in ASR in English, European and East Asian languages. But unfortunately South Asian Languages in general and " Urdu " in particular have received very less attention. In this paper we present an approach to develop an ASR system for Urdu language. The proposed system is based on an open source speech recognition framework called Sphinx4 which uses statistical based approach (Hidden Markov Model) for developing ASR system. We present a Speaker Independent ASR system for small sized vocabulary, i.e. fifty two isolated most spoken Urdu words and suggest that this research work will form the basis to develop medium and large size vocabulary Urdu speech recognition system.
The development of isolated words corpus of Pashto for the automatic speech recognition research
2012 International Conference of Robotics and Artificial Intelligence, 2012
The availability of standard speech database is of paramount importance in the automatic speech recognition (ASR) research in the context of providing a baseline for comparing the performance of automatic speech recognition approaches. This paper presents the development of a Medium-Vocabulary Speech Corpus for Pashto language. The vocabulary encompasses 161 isolated words of Pashto language, consisting of most frequently used words of Pashto language, names of the days of the week and digits from 0 to 25. The words were uttered by 30 speakers of different ages and genders, including both native and non-native speakers of Pashto language. Recording of the corpus was performed in a noise free office environment. The Corpus developed is then used for the development of an automatic speech recognition system for Pashto language.