Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition
Related papers
Learning to Recognize Code-switched Speech Without Forgetting Monolingual Speech Recognition
2020
Recently, there has been significant progress made in Automatic Speech Recognition (ASR) of code-switched speech, leading to gains in accuracy on code-switched datasets in many language pairs. Code-switched speech co-occurs with monolingual speech in one or both of the languages being mixed. In this work, we show that fine-tuning ASR models on code-switched speech harms performance on monolingual speech. We point out the need to optimize models for code-switching while also ensuring that monolingual performance is not sacrificed. Monolingual models may be trained on thousands of hours of speech that may not be available for re-training a new model. We propose using the Learning Without Forgetting (LWF) framework for code-switched ASR when we only have access to a monolingual model and do not have the data it was trained on. We show that it is possible to train models using this framework that perform well on both code-switched and monolingual test sets. In cases where we have access to mo...
ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of 56% in English translation over the state-of-the-art and 9.3% in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability enables seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources.
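The consecutive (cascaded) speech-to-text translation system described above can be sketched as a two-stage pipeline: ASR first produces a code-switched transcript, which is then passed to MT. The sketch below is a minimal illustration; `transcribe` and `translate` are hypothetical placeholders standing in for the paper's Whisper-based ASR and LLM-based MT components, not the authors' actual interfaces.

```python
# Sketch of a cascaded speech-to-text translation pipeline:
# ASR transcribes code-switched audio, then MT translates the transcript.
# Both stages below are placeholders, not the paper's real models.

def transcribe(audio):
    # placeholder: a real system would run a fine-tuned Whisper model here
    return audio["mock_transcript"]

def translate(text, target_lang):
    # placeholder: a real system would prompt an LLM such as LLama or Gemma
    return f"[{target_lang}] {text}"

def speech_to_text_translation(audio, target_lang="en"):
    """Cascade: the ASR output feeds directly into the MT stage."""
    transcript = transcribe(audio)
    return translate(transcript, target_lang)
```

A cascade like this is easy to build from off-the-shelf components, but errors made by the ASR stage propagate into the MT stage, which is one motivation for evaluating the two stages jointly.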
Token Level Identification of Linguistic Code Switching
2013
Typically native speakers of Arabic mix dialectal Arabic and Modern Standard Arabic in the same utterance. This phenomenon is known as linguistic code switching (LCS). It is a very challenging task to identify these LCS points in written text where we don’t have an accompanying speech signal. In this paper, we address automatic identification of LCS points in Arabic social media text by identifying token-level dialectal words. We present an unsupervised approach that employs a set of dictionaries, sound-change rules, and language models to tackle this problem. We tune and test the performance of our approach against human-annotated Egyptian and Levantine discussion fora datasets. Two types of annotations on the token level are obtained for each dataset: context-sensitive and context-insensitive annotation. We achieve a token-level Fβ=1 score of 74% and 72.4% on the context-sensitive development and test datasets, respectively. On the context insensitive annotated data, we achieve...
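The dictionary component of such an unsupervised approach can be sketched as a token-level lookup that labels each word by lexicon membership. The word lists below are illustrative stand-ins, not the paper's actual resources, and a real system would additionally apply sound-change rules and back off to language models for ambiguous or out-of-vocabulary tokens.

```python
# Minimal sketch of token-level code-switch tagging via dictionary lookup.
# The lexicons here are tiny hypothetical examples (romanized for clarity).
MSA_WORDS = {"fi", "ala", "hatha"}         # stand-in MSA lexicon
DIALECT_WORDS = {"feen", "ezzay", "keda"}  # stand-in Egyptian dialect lexicon

def tag_tokens(tokens):
    """Label each token MSA, DA (dialectal), AMBIG, or OOV by lexicon membership."""
    labels = []
    for tok in tokens:
        in_msa = tok in MSA_WORDS
        in_da = tok in DIALECT_WORDS
        if in_msa and not in_da:
            labels.append("MSA")
        elif in_da and not in_msa:
            labels.append("DA")
        elif in_msa and in_da:
            labels.append("AMBIG")  # a real system would disambiguate with a language model
        else:
            labels.append("OOV")    # candidate for sound-change rules / LM back-off
    return labels
```

Sequences of token labels produced this way directly expose the LCS points: every transition from an MSA-labeled token to a DA-labeled one (or vice versa) is a candidate switch point.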
A Survey of Current Datasets for Code-Switching Research
2020
Code switching is a prevalent phenomenon in multilingual communities and social media interaction. In the past ten years, we have witnessed an explosion of code-switched data in social media that brings together languages, from low-resource to high-resource, in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation. The available corpora for code-switching research played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate the datasets and categorize them accordingly.
CEASR: A Corpus for Evaluating Automatic Speech Recognition
2020
In this paper, we present CEASR, a Corpus for Evaluating the quality of Automatic Speech Recognition (ASR). It is a data set based on public speech corpora, containing metadata along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure, consistent across all corpora and systems, with normalised transcript texts and metadata. We use CEASR to evaluate the quality of ASR systems by calculating an average Word Error Rate (WER) per corpus, per system and per corpus-system pair. Our experiments show a substantial difference in accuracy between commercial and open-source ASR tools, as well as differences of up to a factor of ten for single systems on different corpora. Using CEASR allowed us to obtain these results very efficiently and easily. Our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, i.e. without the need to collect, process and transcribe...
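The WER computation underlying such evaluations is the word-level Levenshtein distance between a reference transcript and a hypothesis, normalized by the reference length. A minimal self-contained implementation (libraries such as jiwer provide the same calculation):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER depends heavily on text normalisation (casing, punctuation, numerals), which is exactly why a corpus like CEASR with consistently normalised transcripts makes cross-system comparisons meaningful. For code-switched speech, normalisation and script choices matter even more, which motivates the benchmarking question of this paper.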
CECOS: A Chinese-English code-switching speech database
2011
With the increase in demand for code-switching automatic speech recognition (ASR), the design and development of a code-switching speech database becomes highly desirable. However, it is not easy to collect sufficient code-switched utterances for model training for code-switching ASR. This study presents the procedure and experience of designing and developing a Chinese-English COde-switching Speech database (CECOS). Two different methods for collecting Chinese-English code-switched utterances are employed in this work. The applications of the collected database are also introduced. The CECOS database contains not only speech data with code-switching properties but also accented speech from non-native speakers. This database can be applied to several applications, such as code-switching speech recognition, language identification, and named entity detection.
IEEE Access, 2020
The end-to-end (E2E) framework has emerged as a viable alternative to conventional hybrid systems in the automatic speech recognition (ASR) domain. Unlike the monolingual case, the challenges faced by an E2E system in the code-switching ASR task include (i) the expansion of the target set to account for the multiple languages involved, (ii) the requirement of a robust target-to-word (T2W) transduction, and (iii) the need for more effective context modeling. In this paper, we aim to address those challenges for reliable training of the E2E ASR system on a limited amount of code-switching data. The main contribution of this work lies in the E2E target set reduction by exploiting acoustic similarity and the proposal of a novel context-dependent T2W transduction scheme. Additionally, a novel textual feature has been proposed to enhance the context modeling in the case of code-switching data. The experiments are performed on a recently created Hindi-English code-switching corpus. For contrast purposes, the existing combined-target-set-based system is also evaluated. The proposed system outperforms the existing one and yields a target error rate of 18.1% along with a word error rate of 29.79%. INDEX TERMS: Code-switching, speech recognition, end-to-end system, factored language model, target-to-word transduction.
Code-switching input for machine translation: a case study of Vietnamese–English data
International Journal of Multilingualism, 2023
Multilingualism presents both a challenge and an opportunity for Natural Language Processing, with code-switching representing a particularly interesting problem for computational models trained on monolingual datasets. In this paper, we explore how codeswitched data affects the task of Machine Translation, a task which only recently has started to tackle the challenge of multilingual data. We test three Machine Translation systems on data from the Canberra Vietnamese-English Codeswitching Natural Speech Corpus (CanVEC) and evaluate translation output using both automatic and human metrics. We find that, perhaps counter-intuitively, Machine Translation performs better on codeswitching input than monolingual input. In particular, comparison of human and automatic evaluation suggests that codeswitching input may boost the semantic faithfulness of the translation output, an effect we term lexico-semantic enrichment. We also report two cases where this effect is most and least clear in Vietnamese-English, namely gender-neutral 3SG pronouns and interrogative constructions respectively. Overall, we suggest that Machine Translation, and Natural Language Processing more generally, ought to view multilingualism as an opportunity rather than an obstacle.
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
arXiv (Cornell University), 2022
End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space, and computes a score of the translation quality that can be used as a proxy for human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics, including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows that combining speech and text as inputs to BLASER does not increase the correlation with human scores, but the best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
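The core idea of scoring translations in a shared embedding space can be illustrated with a small sketch. This is not BLASER's actual scoring function (the trained variant uses a supervised regressor over the embeddings); it is a hypothetical unsupervised proxy that averages the translation's cosine similarity to the source and to the reference, assuming all three segments have already been embedded into the same space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def embedding_space_score(src_emb, mt_emb, ref_emb):
    """Hypothetical text-free score: average of the translation's similarity
    to the source and to the reference in a shared embedding space."""
    return (cosine(src_emb, mt_emb) + cosine(ref_emb, mt_emb)) / 2
```

Because the score is computed directly on speech embeddings, no transcription step is needed, which is precisely how a metric like this sidesteps the ASR-quality confound that affects ASR-BLEU-style evaluation of code-switched or low-resource speech.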