On the use of phone-gram units in recurrent neural networks for language identification
Related papers
Smart Technologies, Systems and Applications, 2020
Language Identification (LID) is an essential research topic in the Automatic Speech Recognition area. One of the most important characteristics of a language is context information. In this article, we propose a novel technique that introduces such context information through a phonotactic approach based on phonetic units called "phone-grams". Language-discriminative information is incorporated into the generation of Recurrent Neural Network Language Models (RNNLMs) at the weight initialization stage to improve the Language Identification task. The technique has been evaluated on the KALAKA-3 database, which contains 108 h of audio in the six languages to be recognized. The metric used in this work is the Average Detection Cost, Cavg. To incorporate context information into the features used to train the RNNLM, phone-grams of two elements ("2phone-grams") and three elements ("3phone-grams") have been considered, obtaining relative improvements of up to 17% and 15.44%, respectively, compared to the results obtained using RNNLMs.
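The abstract above does not spell out how phone-grams are built. As a minimal sketch, assuming that an n-phone-gram is formed by joining n consecutive phones with a sliding window of stride 1 (the function name `phone_grams`, the `_` separator, and the toy phone sequence are all hypothetical and not taken from the paper), the construction could look like this:

```python
from typing import List


def phone_grams(phones: List[str], n: int, sep: str = "_") -> List[str]:
    """Join every run of n consecutive phones into a single unit.

    Assumption: a sliding window with stride 1; the paper may use a
    different grouping scheme.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [sep.join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


# Toy phone sequence, invented for illustration only.
phones = ["o", "l", "a", "k", "e"]
print(phone_grams(phones, 2))  # 2phone-grams
print(phone_grams(phones, 3))  # 3phone-grams
```

Under this assumption each unit carries its immediate neighbours with it, which is one way the context information mentioned in the abstract could reach the language model input.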
On the use of Phone-based Embeddings for Language Recognition
IberSPEECH 2018, 2018
Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR), but instead of phonemes, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multi-class logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. As a baseline, we propose the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to 34.1% improvement over the acoustic system alone.
Phonotactic language identification using high quality phoneme recognition
… European Conference on …, 2005
Phoneme Recognizers followed by Language Modeling (PRLM) have consistently yielded top performance in the language identification (LID) task. Arranging PRLMs in parallel (PPRLM) improves performance even more. Since the tokenizer is the most important part of an LID system, a high-quality phoneme recognizer is employed. Two different multilingual databases for training phoneme recognizers are compared, and the amount of training data needed is studied. Results are reported on data from the NIST 2003 LID evaluation. Our four PRLM systems have an Equal Error Rate (EER) of 2.4% on the 12-language task. This result compares favorably to the best known result for this task.
Language identification with language-independent acoustic models
Fifth European …, 1997
In this paper we explore the use of language-independent acoustic models for language identification (LID). The phone sequence output by a single language-independent phone recognizer is rescored with language-dependent phonotactic models approximated by phone bigrams. The language-independent phoneme inventory was obtained by Agglomerative Hierarchical Clustering, using a measure of similarity between phones. This system is compared with a parallel language-dependent phone architecture, which optimally uses the acoustic log likelihood and the phonotactic score for language identification. Experiments were carried out on the 4-language telephone speech corpus IDEAL, containing calls in British English, Spanish, French and German. Results show that the language-independent approach performs as well as the language-dependent one: error rates of 9% versus 10% on 10-second chunks for the 4-language task.
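The rescoring step described in this abstract, where one phone sequence is scored by a phone-bigram model per language, can be sketched as follows. This is an illustrative toy, not the authors' system: the class `BigramModel`, the function `identify`, the add-one smoothing, the `<s>` start symbol, and the training sequences are all assumptions introduced here.

```python
import math
from collections import Counter
from typing import Dict, List


class BigramModel:
    """Phone-bigram model with add-one smoothing (smoothing is an assumption)."""

    def __init__(self, sequences: List[List[str]], vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        for seq in sequences:
            padded = ["<s>"] + seq  # hypothetical start-of-sequence symbol
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, seq: List[str]) -> float:
        """Log-likelihood of a phone sequence under this language's bigrams."""
        padded = ["<s>"] + seq
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def identify(seq: List[str], models: Dict[str, BigramModel]) -> str:
    """Return the language whose phonotactic model scores the sequence highest."""
    return max(models, key=lambda lang: models[lang].log_prob(seq))


# Invented toy data: two "languages" over a two-phone inventory.
training = {
    "lang_a": [["a", "b", "a", "b"], ["a", "b", "a"]],
    "lang_b": [["b", "b", "a", "a"], ["b", "b", "b", "a"]],
}
models = {lang: BigramModel(seqs, vocab_size=2) for lang, seqs in training.items()}
print(identify(["a", "b", "a", "b"], models))
```

In a real system the phone sequence would come from the phone recognizer, and the phonotactic score would typically be combined with the acoustic log likelihood, as the abstract notes for the language-dependent architecture.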
Context-dependent phone models and models adaptation for phonotactic language recognition
Ninth Annual Conference …, 2008
The performance of a PPRLM language recognition system depends on the quality and the consistency of phone decoders. To improve the performance of the decoders, this paper investigates the use of context-dependent instead of context-independent phone models, and the use of CMLLR for model adaptation. This paper also discusses several improvements to the LIMSI 2007 NIST LRE system, including the use of a 4-gram language model, score calibration and fusion using the FoCal Multi-class toolkit (with large development data) and better decoding parameters such as phone insertion penalty. The improved system is evaluated on the NIST LRE-2005 and the LRE-2007 evaluation data sets. Despite its simplicity, the system achieves for the 30s condition a Cavg of 2.4% and 1.6% on these data sets, respectively.
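Several abstracts in this list report Cavg without defining it. The sketch below follows the average detection cost as it is commonly defined in the NIST LRE evaluation plans: for each target language, a weighted miss cost plus the false-alarm costs against every other language, averaged over all targets. The defaults `c_miss = c_fa = 1` and `p_target = 0.5` are assumptions, since none of the abstracts states the values used, and the function name `c_avg` and the example error rates are hypothetical.

```python
from typing import Dict, Tuple


def c_avg(p_miss: Dict[str, float],
          p_fa: Dict[Tuple[str, str], float],
          p_target: float = 0.5,
          c_miss: float = 1.0,
          c_fa: float = 1.0) -> float:
    """Average detection cost over all target languages.

    p_miss[t]     : miss rate when t is the target language
    p_fa[(t, o)]  : false-alarm rate for target t on speech in language o
    """
    langs = sorted(p_miss)
    n = len(langs)
    if n < 2:
        raise ValueError("at least two languages are required")
    p_non_target = (1.0 - p_target) / (n - 1)
    total = 0.0
    for target in langs:
        cost = c_miss * p_target * p_miss[target]
        for other in langs:
            if other != target:
                cost += c_fa * p_non_target * p_fa[(target, other)]
        total += cost
    return total / n


# Invented error rates for a two-language toy task.
miss = {"x": 0.1, "y": 0.3}
false_alarm = {("x", "y"): 0.2, ("y", "x"): 0.4}
print(c_avg(miss, false_alarm))
```

Reported figures such as "a Cavg of 2.4%" correspond to this quantity multiplied by 100.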
Interspeech 2022
We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of "phonotactic" embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multitask optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices imply that our proposed method achieves higher performance on languages of the same cluster in NIST LRE 2017 data than the CNN-Trans model. A comparison between predicted phoneme boundaries and corresponding audio spectrograms illustrates the leveraging of phoneme information for LID.
Improved n-gram phonotactic models for language recognition
2010
This paper investigates various techniques to improve the estimation of n-gram phonotactic models for language recognition using single-best phone transcriptions and phone lattices. More precisely, we first report on the impact of the so-called acoustic scale factor on the system accuracy when using lattice-based training, and then we report on the use of n-gram cutoff and entropy pruning techniques. Several system configurations are explored, such as the use of context-independent and context-dependent phone models, the use of single-best phone hypotheses versus phone lattices, and the use of various n-gram orders. Experiments are conducted using the LRE 2007 evaluation data and the results are reported using the a posteriori EER. The results show that the impact of these techniques on the system accuracy is highly dependent on the training conditions and that careful optimization can lead to performance improvements.
Language Recognition on Albayzin 2010 LRE using PLLR features
Abstract: The so-called Phone Log-Likelihood Ratios (PLLR) have been introduced as alternative features to MFCC-SDC for language recognition (LR) systems based on iVectors. In this article, after a brief description of these features, new evidence of their usefulness for LR tasks is provided through a new set of experiments on the Albayzin 2010 LRE database, which contains wideband multi-speaker speech in six different languages: Basque, Catalan, Galician, Spanish, Portuguese and English. iVector systems trained with PLLRs obtain significant relative improvements over phonotactic systems and over iVector systems trained with MFCC-SDC features, under both clean and noisy speech conditions. Fusing the PLLR systems with the phonotactic systems and/or the MFCC-SDC-based systems provides additional performance improvements, which reveals that PLLR features contribute complementary information in both cases.
Extended phone log-likelihood ratio features and acoustic-based i-vectors for language recognition
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
This paper presents new techniques that bring relevant improvements to the primary system submitted by our group to the Albayzin 2012 LRE competition, where the use of any additional corpora for training or optimizing the models was forbidden. In this work, we present the incorporation of an additional phonotactic subsystem based on phone log-likelihood ratio (PLLR) features extracted from different phonotactic recognizers, which improves the accuracy of the system by 21.4% in terms of Cavg (we also present results for Fact, the official metric of the evaluation). We show how using these features at the phone state level provides significant improvements when combined with dimensionality reduction techniques, especially PCA. We have also experimented with applying alternative SDC-like configurations to these PLLR features, with additional improvements. We also describe some modifications to the MFCC-based acoustic i-vector system that have contributed further improvements. The final fused system outperformed the baseline by 27.4% in Cavg.