Code Switched and Code Mixed Speech Recognition for Indic languages
Related papers
Multilingual and code-switching ASR challenges for low resource Indian languages
2021
Recently, there has been increasing interest in multilingual automatic speech recognition (ASR), where a speech recognition system caters to multiple low-resource languages by taking advantage of small amounts of labeled corpora in multiple languages. With multilingualism becoming common in today’s world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching ASR often depends on the variety of languages in terms of their acoustics and linguistic characteristics, as well as the amount of data available and how carefully these factors are considered in building the ASR system. In this challenge, we would like to focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose,...
Adapting monolingual resources for code-mixed hindi-english speech recognition
2017 International Conference on Asian Language Processing (IALP), 2017
The paper presents an automatic speech recognition (ASR) system for code-mixed read speech in Hindi-English, developed by extrapolating monolingual training resources. A monolingual Hindi acoustic model, augmented with code-mixed speech data, has been used to train a neural network based speech recognition framework. The testing corpus follows a similar structure, containing data from both monolingual and code-mixed speech. The shared phonetic transcription, captured in WX notation, has been exploited to harness the commonality between the pooled phonesets of Hindi and English. The experiments have been conducted in two separate formulations of a trigram-based language model. 1) In the first experiment, the language model contains no out-of-vocabulary words, as the test utterances are included in the training of the language model. The word error rate in this case was 10.63%. 2) In the second experiment, the testing utterances have been excluded fr...
Hindi-English Code-Switching Speech Corpus
2018
Code-switching refers to the usage of two languages within a sentence or discourse. It is a global phenomenon among multilingual communities and has emerged as an independent area of research. With the increasing demand for the code-switching automatic speech recognition (ASR) systems, the development of a code-switching speech corpus has become highly desirable. However, for training such systems, very limited code-switched resources are available as yet. In this work, we present our first efforts in building a code-switching ASR system in the Indian context. For that purpose, we have created a Hindi-English code-switching speech database. The database not only contains the speech utterances with code-switching properties but also covers the session and the speaker variations like pronunciation, accent, age, gender, etc. This database can be applied in several speech signal processing applications, such as code-switching ASR, language identification, language modeling, speech synth...
Improving Speech Recognition for Indic Languages using Language Model
ArXiv, 2022
We study the effect of applying a language model (LM) on the output of Automatic Speech Recognition (ASR) systems for Indic languages. We fine-tune wav2vec 2.0 models for 18 Indic languages and adjust the results with language models trained on text derived from a variety of sources. Our findings demonstrate that the average Character Error Rate (CER) decreases by over 28% and the average Word Error Rate (WER) decreases by about 36% after decoding with LM. We show that a large LM may not provide a substantial improvement as compared to a diverse one. We also demonstrate that high quality transcriptions can be obtained on domain-specific data without retraining the ASR model, and show results on the biomedical domain.
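The relative CER/WER reductions this abstract reports are computed from edit-distance-based error rates. A minimal sketch of how WER and CER are measured (the function names and toy strings are mine, not from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, rolling-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def relative_reduction(before, after):
    """A 0.36 here corresponds to the ~36% WER drop the abstract cites."""
    return (before - after) / before
```

A "36% decrease in WER after decoding with LM" means `relative_reduction(wer_greedy, wer_with_lm) ≈ 0.36`, i.e. the reduction is relative, not absolute.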
Modeling code-Switching speech on under-resourced languages for language identification
2014
This paper presents an integration of phonotactic information to perform language identification (LID) in mixed-language speech. A single-pass front-end recognition system is employed to convert the spoken utterances into a statistical occurrence of phone sequences. To process such phone sequences, a hidden Markov model (HMM) is utilized to build robust acoustic models that can handle multiple languages within an utterance. A supervised Support Vector Machine (SVM) learns the language transitions of the phonotactic information given the recognized phone sequences. The back-end SVM-based decision classifies language identity given the likelihood scores of phone occurrences. The experiments are conducted on commonly mixed-language Northern Sotho and English speech utterances. We evaluate the system by measuring the performance of the phone recognition and LID portions separately. We obtained a phone error rate of 15.7% when a data-driven phoneme mapping approach is modeled with 16 Gaussian...
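The core idea in the pipeline above — turning recognized phone sequences into occurrence statistics that a back-end classifier can score — can be sketched as follows. This is a simplified stand-in: it uses phone-bigram counts and a nearest-centroid cosine decision rather than the paper's HMM front end and SVM back end, and the toy phone labels are invented for illustration:

```python
from collections import Counter
import math

def phone_bigrams(phones):
    """Phonotactic features: bigram occurrence counts over a phone sequence."""
    return Counter(zip(phones, phones[1:]))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy per-language reference statistics, standing in for trained models;
# the phone sequences here are made up, not real Northern Sotho/English phones.
train = {
    "sotho":   phone_bigrams(["l", "e", "b", "o", "a"] * 3),
    "english": phone_bigrams(["th", "e", "s", "p", "i", "ch"] * 3),
}

def identify(phones):
    """Classify language identity from phone-occurrence features."""
    feats = phone_bigrams(phones)
    return max(train, key=lambda lang: cosine(feats, train[lang]))
```

An SVM back end, as in the paper, would replace the cosine decision with a margin-based classifier trained on such feature vectors.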
Dual Script E2E framework for Multilingual and Code-Switching ASR
2021
India is home to multiple languages, and training automatic speech recognition (ASR) systems for these languages is challenging. Over time, each language has adopted words from other languages, such as English, leading to code-mixing. Most Indian languages also have their own unique scripts, which poses a major limitation in training multilingual and code-switching ASR systems. Inspired by results in text-to-speech synthesis, in this work, we use an in-house rule-based phoneme-level common label set (CLS) representation to train multilingual and code-switching ASR for Indian languages. We propose two end-to-end (E2E) ASR systems. In the first system, the E2E model is trained on the CLS representation, and we use a novel data-driven back-end to recover the native language script. In the second system, we propose a modification to the E2E model, wherein the CLS representation and the native language characters are used simultaneously for training. We show our results on the multilingual and...
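The common-label-set idea can be illustrated with a toy mapping: characters from different Indic scripts that realize the same phoneme map to one shared label, so a single E2E model can be trained across scripts. This is only a hand-picked illustration — the paper's CLS is a full rule-based phoneme inventory, and its script recovery is data-driven, not a lookup table:

```python
# Hypothetical fragment of a common label set: native characters from
# Devanagari, Tamil and Bengali that share a phoneme map to one CLS label.
to_cls = {
    "क": "ka",  # Devanagari (Hindi/Marathi)
    "க": "ka",  # Tamil
    "ক": "ka",  # Bengali
    "त": "ta",  # Devanagari
    "த": "ta",  # Tamil
}

# Per-language inverse maps, a stand-in for the data-driven back-end that
# recovers native script from CLS output.
from_cls = {
    "hindi": {"ka": "क", "ta": "त"},
    "tamil": {"ka": "க", "ta": "த"},
}

def to_common(text):
    """Map native-script characters to CLS labels (unknowns dropped here)."""
    return [to_cls[ch] for ch in text if ch in to_cls]

def to_native(labels, lang):
    """Recover native script from CLS labels for a given language."""
    return "".join(from_cls[lang][label] for label in labels)
```

Training on `to_common` output pools acoustically equivalent units across scripts; the real challenge the paper addresses is learning the `to_native` step from data rather than fixed rules.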
Multilingual Speech Recognition and Language Identification
Automatic speech recognition (ASR) is an important technology for enabling and improving human-human and human-computer interactions. Today, speech recognition technology is mature enough to be useful in many applications. The multilingual ASR presented here bases both LID and ASR on DNNs. Training a DNN-based ASR across many languages avoids the extra latency that an early language decision would introduce, and benefits from the extra scores from the recognizers to better decide which result to return to the user. These benefits come, however, with an increased processing cost, since the input is recognized multiple times. This architecture supports multiple languages, allowing users to interact with the system naturally in several languages.
A study of multilingual speech recognition
1997
This paper describes our work in developing multilingual (Swedish and English) speech recognition systems in the ATIS domain. The acoustic component of the multilingual systems is realized through sharing Gaussian codebooks across Swedish and English allophones. The language model (LM) components are constructed by training a statistical bigram model, with a common backoff node, on bilingual texts, and by combining two monolingual LMs into a probabilistic finite state grammar. This system uses a single decoder for Swedish and English sentences, and is capable of recognizing sentences with words from both languages. Preliminary experiments show that sharing acoustic models across the two languages has not resulted in improved performance, while sharing a backoff node at the LM component provides flexibility and ease in recognizing bilingual sentences at the expense of a slight increase in word error rate in some cases. As a by-product, the bilingual decoder also achieves good performance on language identification (LID).
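The LM construction described above — a bigram model over bilingual text in which unseen transitions, including cross-language ones, fall back to a shared node — can be sketched in miniature. This is a rough stupid-backoff-style approximation with invented toy corpora, not the paper's normalized backoff scheme:

```python
import math
from collections import Counter

# Toy bilingual corpora, stand-ins for the ATIS Swedish/English training texts.
swedish = "jag vill boka en resa".split()
english = "i want to book a trip".split()

bigrams = Counter(zip(swedish, swedish[1:])) + Counter(zip(english, english[1:]))
unigrams = Counter(swedish) + Counter(english)
total, vocab = sum(unigrams.values()), len(unigrams)

def logprob(prev, word, alpha=0.4):
    """Bigram log-probability; unseen bigrams -- including cross-language
    transitions -- back off to a shared, add-one-smoothed unigram node."""
    if (prev, word) in bigrams:
        return math.log(bigrams[(prev, word)] / unigrams[prev])
    return math.log(alpha * (unigrams[word] + 1) / (total + vocab))
```

Because the backoff node is shared across both languages, a decoder scoring `logprob("en", "trip")` gets a finite probability for switching mid-sentence, which is what lets a single decoder recognize bilingual utterances.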
IEEE Access, 2020
The end-to-end (E2E) framework has emerged as a viable alternative to conventional hybrid systems in the automatic speech recognition (ASR) domain. Unlike the monolingual case, the challenges faced by an E2E system in the code-switching ASR task include (i) the expansion of the target set to account for the multiple languages involved, (ii) the requirement of a robust target-to-word (T2W) transduction, and (iii) the need for more effective context modeling. In this paper, we aim to address those challenges for reliable training of the E2E ASR system on a limited amount of code-switching data. The main contribution of this work lies in the E2E target set reduction by exploiting acoustic similarity, and the proposal of a novel context-dependent T2W transduction scheme. Additionally, a novel textual feature has been proposed to enhance context modeling in the case of code-switching data. The experiments are performed on a recently created Hindi-English code-switching corpus. For contrast purposes, the existing combined target set based system is also evaluated. The proposed system outperforms the existing one and yields a target error rate of 18.1% along with a word error rate of 29.79%. INDEX TERMS: Code-switching, speech recognition, end-to-end system, factored language model, target-to-word transduction.
Interspeech 2018, 2018
Although isiZulu speakers code-switch with English as a matter of course, extremely little appropriate data is available for acoustic modelling. Recently, a small five-language corpus of code-switched South African soap opera speech was compiled. We used this corpus to evaluate the application of multilingual neural network acoustic modelling to English-isiZulu code-switched speech recognition. Our aim was to determine whether English-isiZulu speech recognition accuracy can be improved by incorporating three other language pairs in the corpus: English-isiXhosa, English-Setswana and English-Sesotho. Since isiXhosa, like isiZulu, belongs to the Nguni language family, while Setswana and Sesotho belong to the more distant Sotho family, we could also investigate the merits of additional data from within and across language groups. Our experiments using both fully connected DNN and TDNN-LSTM architectures show that English-isiZulu speech recognition accuracy, as well as language identification after code-switching, is improved more by the incorporation of English-isiXhosa data than by the incorporation of the other language pairs. However, additional data from the more distant language group remained beneficial, and the best overall performance was always achieved with a multilingual neural network trained on all four language pairs.