Fully Convolutional ASR for Less-Resourced Endangered Languages

Leveraging Pre-Trained Representations to Improve Access to Untranscribed Speech from Endangered Languages

2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021

Pre-trained speech representations like wav2vec 2.0 are a powerful tool for automatic speech recognition (ASR). Yet many endangered languages lack sufficient data for pretraining such models, or are predominantly oral vernaculars without a standardised writing system, precluding finetuning. Query-by-example spoken term detection (QbE-STD) offers an alternative for iteratively indexing untranscribed speech corpora by locating spoken query terms. Using data from 7 Australian Aboriginal languages and a regional variety of Dutch, all of which are endangered or vulnerable, we show that QbE-STD can be improved by leveraging representations developed for ASR (wav2vec 2.0: the English monolingual model and XLSR53 multilingual model). Surprisingly, the English model outperformed the multilingual model on 4 Australian language datasets, raising questions around how to optimally leverage self-supervised speech representations for QbE-STD. Nevertheless, we find that wav2vec 2.0 representations (either English or XLSR53) offer large improvements (56-86% relative) over state-of-the-art approaches on our endangered language datasets.
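
As a rough illustration of this kind of pipeline (not the paper's exact recipe), the sketch below extracts frame-level wav2vec 2.0 representations with the Hugging Face transformers library and scores a spoken query against a longer utterance using subsequence DTW over cosine distances; the checkpoint names, the use of the final hidden layer, and the scoring heuristic are all assumptions.

```python
# Hypothetical sketch: wav2vec 2.0 features + subsequence DTW for QbE-STD.
# Model names and the scoring heuristic are assumptions, not the paper's exact recipe.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL = "facebook/wav2vec2-large-xlsr-53"  # or an English model such as facebook/wav2vec2-large-960h
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
encoder = Wav2Vec2Model.from_pretrained(MODEL).eval()

def features(waveform_16k: np.ndarray) -> np.ndarray:
    """Return frame-level hidden states (T, D) for a 16 kHz mono waveform."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, T, D)
    return hidden.squeeze(0).numpy()

def qbe_std_score(query: np.ndarray, search: np.ndarray) -> float:
    """Subsequence DTW over a cosine-distance matrix; lower score = better match."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = search / np.linalg.norm(search, axis=1, keepdims=True)
    dist = 1.0 - q @ s.T                      # (Tq, Ts) frame-pair distances
    Tq, Ts = dist.shape
    acc = np.full((Tq + 1, Ts + 1), np.inf)
    acc[0, :] = 0.0                           # the query may start anywhere in the search utterance
    for i in range(1, Tq + 1):
        for j in range(1, Ts + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[Tq, 1:].min() / Tq)      # length-normalised cost of the best end point

# Usage: score = qbe_std_score(features(query_wav), features(utterance_wav))
```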

Deep neural network features and semi-supervised training for low resource speech recognition

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

We propose a new technique for training deep neural networks (DNNs) as data-driven feature front-ends for large vocabulary continuous speech recognition (LVCSR) in low resource settings. To circumvent the lack of sufficient training data for acoustic modeling in these scenarios, we use transcribed multilingual data and semi-supervised training to build the proposed feature front-ends. In our experiments, the proposed features provide an absolute improvement of 16% in a low-resource LVCSR setting with only one hour of in-domain training data. While close to three-fourths of these gains come from DNN-based features, the remainder comes from semi-supervised training.
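
A minimal sketch of a bottleneck-feature front-end of the general kind described above, written in PyTorch; the layer sizes, the 40-dimensional bottleneck, and the spliced-frame input are illustrative assumptions rather than the paper's configuration.

```python
# Hypothetical bottleneck-feature DNN front-end (PyTorch); dimensions are illustrative only.
import torch
import torch.nn as nn

class BottleneckFrontEnd(nn.Module):
    """Maps a spliced acoustic frame to senone posteriors via a narrow bottleneck layer."""
    def __init__(self, input_dim=440, hidden_dim=1024, bottleneck_dim=40, num_targets=3000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, bottleneck_dim),   # bottleneck: reused as features downstream
        )
        self.classifier = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(bottleneck_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, num_targets),      # trained with cross-entropy on (possibly multilingual) targets
        )

    def forward(self, x):
        bottleneck = self.encoder(x)
        return bottleneck, self.classifier(bottleneck)

# After training on transcribed multilingual plus semi-supervised data, only the encoder is kept
# and its bottleneck outputs feed the downstream LVCSR acoustic model.
frontend = BottleneckFrontEnd()
frames = torch.randn(8, 440)                 # batch of spliced frames (e.g. 11 x 40-dim filterbanks)
features, logits = frontend(frames)
```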

Recent Progresses in Deep Learning Based Acoustic Models

In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that can effectively exploit variable-length contextual information, and their various combinations with other models. We then describe models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence translation model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.
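
Since the survey highlights the connectionist temporal classification (CTC) criterion for end-to-end training, the snippet below shows a typical use of PyTorch's built-in nn.CTCLoss; the tensor shapes and label inventory are made up for illustration.

```python
# Illustrative use of the CTC criterion with PyTorch; sizes are invented for the example.
import torch
import torch.nn as nn

num_labels = 29                      # e.g. 26 letters + space + apostrophe + blank (index 0)
T, N, U = 120, 4, 20                 # input frames, batch size, max target length

log_probs = torch.randn(T, N, num_labels, requires_grad=True).log_softmax(dim=-1)  # stand-in for encoder output
targets = torch.randint(1, num_labels, (N, U))                                     # label indices (0 reserved for blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, U + 1, (N,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                      # gradients flow back into the encoder producing log_probs
```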

Enhancing ASR Systems for Under-Resourced Languages through a Novel Unsupervised Acoustic Model Training Technique

Advances in Electrical and Computer Engineering, 2015

Statistical speech and language processing techniques, requiring large amounts of training data, are currently state-of-the-art in automatic speech recognition. For high-resourced, international languages this data is widely available, while for under-resourced languages the lack of data poses serious problems. Unsupervised acoustic modeling can offer a cost- and time-effective way of creating a solid acoustic model for any under-resourced language. This study describes a novel unsupervised acoustic model training method and evaluates it on speech data in an under-resourced language: Romanian. The key novel factor of the method is the use of two complementary seed ASR systems to produce high-quality transcriptions, with a Character Error Rate (ChER) < 5%, for initially untranscribed speech data. The methodology leads to a relative Word Error Rate (WER) improvement of more than 10% when 100 hours of untranscribed speech are used.
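
One plausible reading of the selection step, sketched below, is to keep only those utterances where the two complementary seed systems nearly agree, measured by the character error rate between their hypotheses; the 5% threshold and the agreement rule itself are assumptions about the procedure, not the paper's exact algorithm.

```python
# Hypothetical selection of automatically transcribed utterances by cross-system agreement.
# Filtering on ChER between the two seed systems' hypotheses is an assumed reading of the
# approach, not the paper's exact rule.

def char_error_rate(hyp: str, ref: str) -> float:
    """Character-level Levenshtein distance, normalised by reference length."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (h != r)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def select_agreeing(hyps_a: dict, hyps_b: dict, threshold: float = 0.05) -> dict:
    """Return utterance_id -> transcript for utterances where both seed systems nearly agree."""
    selected = {}
    for utt, text_a in hyps_a.items():
        text_b = hyps_b.get(utt)
        if text_b is not None and char_error_rate(text_a, text_b) < threshold:
            selected[utt] = text_a            # use one system's output as the pseudo-label
    return selected

hyps_a = {"utt1": "la multi ani", "utt2": "buna ziua"}
hyps_b = {"utt1": "la multi ani", "utt2": "buna zioa"}
print(select_agreeing(hyps_a, hyps_b))        # only utt1 survives the 5% threshold
```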

Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments

ArXiv, 2021

There is growing interest in ASR systems that can recognize phones in a language-independent fashion, as well as in building language technologies for low-resource and endangered languages. However, there is a paucity of realistic data that can be used to test such systems and technologies. This paper presents a publicly available, phonetically transcribed corpus of 2255 utterances (words and short phrases) in East Tusom (no ISO 639-3 code), an endangered Tangkhulic (Tibeto-Burman) language variety spoken mostly in India. Because the dataset is transcribed in terms of phones, rather than phonemes, it is a better match for universal phone recognition systems than many larger (phonemically transcribed) datasets. This paper describes the dataset and the methodology used to produce it. It further presents basic benchmarks of state-of-the-art universal phone recognition systems on the dataset as baselines for future experiments.
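
For a sense of how such benchmarks are typically computed, the hedged sketch below runs a universal phone recognizer (here the open-source allosaurus package, assuming its documented read_recognizer interface) over a toy corpus and reports phone error rate; the recognizer choice, file layout, and reference format are assumptions, not the paper's protocol.

```python
# Hypothetical benchmarking loop: universal phone recognition + phone error rate (PER).
# The allosaurus package is used as one example of such a recognizer; paths and references
# are illustrative placeholders.
from allosaurus.app import read_recognizer

def phone_error_rate(hyp, ref):
    """Edit distance over phone tokens, normalised by reference length."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = curr
    return prev[-1] / max(len(ref), 1)

model = read_recognizer()                        # loads the default universal phone model
corpus = {"tusom_0001.wav": "t u s o m"}         # wav path -> reference phone string (illustrative)
errors, total = 0.0, 0
for wav, ref in corpus.items():
    hyp = model.recognize(wav).split()
    ref_phones = ref.split()
    errors += phone_error_rate(hyp, ref_phones) * len(ref_phones)
    total += len(ref_phones)
print(f"PER: {errors / total:.2%}")
```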

ASR2K: Speech Recognition for Around 2000 Languages without Audio

Interspeech 2022

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition systems for 1909 languages by combining our pipeline with Crúbadán, a large endangered-languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crúbadán statistics only, and improve these to 45% CER and 69% WER when using 10000 raw text utterances.
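
To make the language-model component concrete, here is a hedged sketch of turning raw text into a smoothed character-bigram model; the actual pipeline plugs Crúbadán n-gram statistics into a full decoder, and the add-one smoothing and example sentences below are purely illustrative.

```python
# Hypothetical character-bigram language model with add-one smoothing, built from raw text.
# This only illustrates turning raw text (or n-gram counts) into smoothed probabilities;
# the real pipeline combines such a model with multilingual acoustic and pronunciation models.
import math
from collections import Counter

def train_char_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed log probability of a sentence under the bigram model."""
    vocab = len(unigrams)
    chars = ["<s>"] + list(sentence) + ["</s>"]
    lp = 0.0
    for prev, cur in zip(chars, chars[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return lp

unigrams, bigrams = train_char_bigram(["hello world", "hello there"])
print(log_prob("hello", unigrams, bigrams))
```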

Deep Convolutional Neural Networks for Large-scale Speech Tasks

Neural Networks, 2015

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations that exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks; specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on 3 LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12-14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on these 3 tasks.
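
A hedged PyTorch sketch of a CNN acoustic model in this spirit is shown below: convolution over time-frequency patches of spliced log-mel frames, pooling along frequency only, and fully connected ReLU+dropout layers on top; the depth, filter counts, and senone inventory are illustrative, not the tuned configuration from the paper.

```python
# Hypothetical CNN acoustic model over log-mel inputs (PyTorch); all sizes are illustrative.
import torch
import torch.nn as nn

class CNNAcousticModel(nn.Module):
    def __init__(self, num_senones=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=(9, 9), padding=(4, 4)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),              # pool along frequency only
            nn.Conv2d(128, 256, kernel_size=(3, 4)), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(), nn.Dropout(0.5),   # ReLU + dropout in the fully connected layers
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, num_senones),
        )

    def forward(self, x):            # x: (batch, 1, context frames, mel bins)
        return self.classifier(self.conv(x))

model = CNNAcousticModel()
logits = model(torch.randn(8, 1, 11, 40))   # senone scores (pre-softmax) per spliced frame
```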

An open-source end-to-end ASR system for Brazilian Portuguese using DNNs built from newly assembled corpora

Journal of Communication and Information Systems, 2020

In this work, we present a baseline end-to-end system based on deep learning for automatic speech recognition in Brazilian Portuguese. To build such a model, we employ a speech corpus containing 158 hours of annotated speech, assembled from four individual datasets, three of them publicly available, and a text corpus containing 10.2 million sentences. We train an acoustic model based on the DeepSpeech 2 network, with two convolutional and five bidirectional recurrent layers. By adding a newly trained character-level 15-gram language model, we achieve a character error rate of only 10.49% and a word error rate of 25.45%, which are on a par with other works in different languages using a similar amount of training data.
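
A rough skeleton of such a DeepSpeech 2-style model (two convolutional layers, five bidirectional recurrent layers, character outputs for CTC) is sketched below in PyTorch; the kernel sizes, strides, GRU cells, and feature dimensions are assumptions rather than the exact architecture used in the paper.

```python
# Rough DeepSpeech 2-style skeleton (PyTorch); hyperparameters are assumptions.
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    def __init__(self, num_chars=40, num_mels=80, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)), nn.ReLU(),
        )
        freq_out = num_mels // 4                     # frequency bins left after the two stride-2 convs
        self.rnn = nn.GRU(32 * freq_out, hidden, num_layers=5,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_chars)  # characters + blank, for CTC training

    def forward(self, spec):                         # spec: (batch, 1, time, mel bins)
        x = self.conv(spec)                          # (batch, 32, time/2, mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.out(x).log_softmax(-1)           # (batch, time/2, num_chars) for nn.CTCLoss

model = DeepSpeech2Like()
log_probs = model(torch.randn(2, 1, 200, 80))        # toy batch of spectrograms
```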

Transfer Learning and Distillation Techniques to Improve the Acoustic Modeling of Low Resource Languages

Interspeech 2017

Deep neural networks (DNNs) require large amounts of training data to build robust acoustic models for speech recognition tasks. Our work aims to bring the acoustic model of a low-resource language close to the performance of a high-resource scenario with the help of data and model parameters from other high-resource languages. We explore transfer learning and distillation methods, where a complex high-resource model guides or supervises the training of the low-resource model. The techniques include (i) a multilingual framework that borrows data from a high-resource language while training the low-resource acoustic model, with KL-divergence-based constraints added to bias the model towards the low-resource language, and (ii) distilling knowledge from the complex high-resource model to improve the low-resource acoustic model. The experiments were performed on three Indian languages, namely Hindi, Tamil and Kannada. All the techniques improved performance, with the multilingual framework with KL-divergence regularization giving the best results. In all three languages, performance close to or better than the high-resource scenario was obtained.
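
A minimal sketch of the distillation idea, assuming a standard temperature-scaled KL objective interpolated with cross-entropy on the low-resource labels (the paper's exact objective and weights may differ):

```python
# Hypothetical teacher-student distillation step for a low-resource acoustic model (PyTorch).
# The interpolation weight and temperature are assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=2.0):
    """Weighted sum of cross-entropy on low-resource labels and KL divergence to the teacher."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl

student_logits = torch.randn(16, 3000, requires_grad=True)   # low-resource model senone scores
teacher_logits = torch.randn(16, 3000)                       # frozen high-resource model scores
targets = torch.randint(0, 3000, (16,))                      # low-resource frame labels
loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```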

A Combination of Deep Neural Networks for Acoustic modeling of Vietnamese LVCSR

In this work, we propose a deep neural network architecture that combines two popular applications of deep neural networks for Vietnamese large vocabulary continuous speech recognition. First, a deep neural network is trained to extract bottleneck features from frames of combined Mel-frequency cepstral coefficients (MFCCs) and tonal features. This network is then applied as a nonlinear discriminative feature-space transformation for hybrid network training, where acoustic modeling is performed with denoising auto-encoder pre-training and back-propagation. The experiments are carried out on a dataset containing speech from the Voice of Vietnam (VOV) channel. The results show that the system using the combined deep neural network architecture obtained relative improvements of 4.1% over the best hybrid HMM/DNN system and 51.4% over the baseline system. Adding the tonal feature as an input to the network gave around an 18% relative improvement in recognition performance.
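
A minimal sketch of the denoising auto-encoder pre-training step for a single hidden layer, written in PyTorch; the feature dimensionality (MFCC plus tonal), noise model, and optimizer settings are assumptions for illustration.

```python
# Hypothetical denoising auto-encoder pre-training of one hidden layer (PyTorch);
# dimensions and noise level are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim=42, hidden_dim=1024):          # e.g. 39 MFCC + 3 tonal features
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x, noise_std=0.1):
        corrupted = x + noise_std * torch.randn_like(x)          # corrupt the input frame
        return self.decoder(self.encoder(corrupted))             # reconstruct the clean frame

dae = DenoisingAutoencoder()
optimizer = torch.optim.SGD(dae.parameters(), lr=0.01)
frames = torch.randn(32, 42)                                     # batch of feature frames
for _ in range(10):                                              # a few illustrative updates
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(dae(frames), frames)
    loss.backward()
    optimizer.step()
# After pre-training, the encoder weights initialise one hidden layer of the hybrid network,
# which is then fine-tuned with back-propagation on the acoustic-model targets.
```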