ISIS and NISIS: New bilingual dual-channel speech corpora for robust speaker recognition

Text Independent Speaker Identification in Multilingual Environments

2008

Speaker identification and verification systems perform poorly when model training is done in one language while testing is done in another. This situation is not unusual in multilingual environments, where people should be able to access the system in whichever language they prefer at any given moment, without noticing a performance drop. In this work we study the possibility of using features derived from prosodic parameters to reinforce the language robustness of these systems. First, the features' properties in terms of language and session variability are studied, predicting an increase in language robustness when frame-wise intonation and energy values are combined with traditional MFCC features. The experimental results confirm that these features improve speaker recognition rates under language-mismatch conditions. The whole study is carried out in the Basque Country, a bilingual region in which Basque and Spanish coexist.
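The idea of augmenting MFCCs with frame-wise prosodic values can be sketched in plain numpy. This is an illustrative sketch, not the authors' pipeline: the framing parameters, the crude autocorrelation pitch estimator, and the helper names (`prosody_augmented_features`, `f0_autocorr`) are all assumptions for demonstration, and real MFCCs would come from a feature-extraction library.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Split a signal into overlapping frames (25 ms / 10 ms at 16 kHz).
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def log_energy(frames):
    # Frame-wise log-energy, floored to avoid log(0).
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def f0_autocorr(frames, sr=16000, fmin=60, fmax=400):
    # Crude frame-wise pitch estimate: pick the autocorrelation peak
    # inside the plausible lag range for human F0.
    lo, hi = sr // fmax, sr // fmin
    f0 = np.zeros(len(frames))
    for i, f in enumerate(frames):
        f = f - f.mean()
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]
        if ac[0] > 0:
            lag = lo + np.argmax(ac[lo:hi])
            f0[i] = sr / lag
    return f0

def prosody_augmented_features(x, mfcc):
    # Append per-frame log-energy and F0 columns to an MFCC matrix.
    frames = frame_signal(x)
    n = min(len(frames), len(mfcc))
    extra = np.column_stack([log_energy(frames[:n]), f0_autocorr(frames[:n])])
    return np.hstack([mfcc[:n], extra])

# Example: 1 s of a 150 Hz synthetic tone with placeholder 13-dim "MFCCs".
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 150 * t)
mfcc = np.random.randn(98, 13)          # stand-in for real MFCC frames
feats = prosody_augmented_features(x, mfcc)
print(feats.shape)                       # (98, 15)
```

On the synthetic tone the F0 column recovers roughly 150 Hz, and the resulting 15-dimensional frames are what a downstream speaker model would consume.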

The Mixer Corpus of Multilingual, Multichannel Speaker Recognition Data

This paper describes efforts to create corpora to support and evaluate systems that perform speaker recognition where channel and language may vary. Beyond the ongoing evaluation of speaker recognition systems, these corpora target the bilingual and cross-channel dimensions. We report on specific data collection efforts at the Linguistic Data Consortium and on research ongoing at the US Federal Bureau of Investigation and MIT Lincoln Laboratory. We cover the design and requirements, the collection process, and the final properties of the corpus, integrating discussion of data preparation, research, technology development, and evaluation on a grand scale.

The MMSR Bilingual and Crosschannel Corpora for Speaker Recognition Research and Evaluation

… -The Speaker and …, 2004

1 MIT Lincoln Laboratory, Lexington, MA, USA; 2 Federal Bureau of Investigation, Quantico, VA, USA; 3 University of Pennsylvania, Linguistic Data Consortium, Philadelphia, PA, USA; 4 National Institute of Standards and Technology, Gaithersburg, MD, USA. j.campbell@ieee. ...

Investigating the use of multiple languages for crisp and fuzzy speaker identification

11th International Conference of Pattern Recognition Systems (ICPRS 2021), 2021

The use of speech for speaker identification is an important and relevant topic. There are several ways of doing it, but most are dependent on the language the user speaks. However, if the idea is to create an all-inclusive and reliable system that uses speech as its input, we must take into account that people can and will speak different languages and have different accents. Thus, this research evaluates speaker identification systems in a multilingual setup. Our experiments use three widely spoken languages: Portuguese, English, and Chinese. Initial tests indicated that the systems have a certain robustness across multiple languages. Adding more languages decreases accuracy, but our investigation suggests these impacts are related to the number of classes.

NIST Speaker Recognition Evaluations Utilizing the Mixer Corpora—2004, 2005, 2006

IEEE Transactions on Audio, Speech and Language Processing, 2007

NIST has coordinated annual evaluations of text-independent speaker recognition from 1996 to 2006. This paper discusses the last three of these, which utilized conversational speech data from the Mixer Corpora recently collected by the Linguistic Data Consortium. We review the evaluation procedures, the matrix of test conditions included, and the performance trends observed. While most of the data is collected over telephone channels, one multichannel test condition utilizes a subset of Mixer conversations recorded simultaneously over multiple microphone channels and a telephone line. The corpus also includes some non-English conversations involving bilingual speakers, allowing an examination of the effect of language on performance results. On the various test conditions involving English language conversational telephone data, considerable performance gains are observed over the past three years.

Call My Net Corpus: A Multilingual Corpus for Evaluation of Speaker Recognition Technology

Interspeech 2017

The Call My Net 2015 (CMN15) corpus presents a new resource for Speaker Recognition Evaluation and related technologies. The corpus includes conversational telephone speech recordings for a total of 220 speakers spanning 4 languages: Tagalog, Cantonese, Mandarin and Cebuano. The corpus includes 10 calls per speaker made under a variety of noise conditions. Calls were manually audited for language, speaker identity and overall quality. The resulting data has been used in the NIST 2016 SRE Evaluation and will be published in the Linguistic Data Consortium catalog. We describe the goals of the CMN15 corpus, including details of the collection protocol and auditing procedure and discussion of the unique properties of this corpus compared to prior NIST SRE evaluation corpora.

Speaker and Session Variability in GMM-Based Speaker Verification

IEEE Transactions on Audio, Speech and Language Processing, 2007

We present a corpus-based approach to speaker verification in which maximum likelihood II criteria are used to train a large-scale generative model of speaker and session variability which we call joint factor analysis. Enrolling a target speaker consists of calculating the posterior distribution of the hidden variables in the factor analysis model, and verification tests are conducted using a new type of likelihood II ratio statistic. Using the NIST 1999 and 2000 speaker recognition evaluation data sets, we show that the effectiveness of this approach depends on the availability of a training corpus which is well matched with the evaluation set used for testing. Experiments on the NIST 1999 evaluation set using a mismatched corpus to train factor analysis models did not result in any improvement over standard methods, but we found that, even with this type of mismatch, feature warping performs extremely well in conjunction with the factor analysis model, and this enabled us to obtain very good results (equal error rates of about 6.2%).
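Several abstracts here report results as equal error rates. The EER is the operating point at which the false-acceptance rate on impostor trials equals the false-rejection rate on genuine trials; a minimal sketch of computing it from two score lists (with synthetic Gaussian scores standing in for real trial scores) looks like this:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    # Sweep the decision threshold over all observed scores and find
    # where false-acceptance and false-rejection rates cross.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(0)
targets = rng.normal(2.0, 1.0, 1000)    # synthetic genuine-trial scores
impostors = rng.normal(0.0, 1.0, 1000)  # synthetic impostor-trial scores
eer = equal_error_rate(targets, impostors)
print(f"EER = {eer:.1%}")
```

For two unit-variance Gaussians whose means are two standard deviations apart, the EER lands near 16%; better-separated score distributions drive it toward the low single digits reported in these papers.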

A Study of Interspeaker Variability in Speaker Verification

IEEE Transactions on Audio, Speech, and Language Processing, 2008

We propose a new approach to the problem of estimating the hyperparameters which define the inter-speaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10-15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task.
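The joint factor analysis model underlying the two Kenny et al. abstracts decomposes a speaker- and session-dependent GMM mean supervector as s = m + Vy + Ux + Dz. The toy sketch below shows only that decomposition with random matrices; the dimensions are illustrative (real supervectors have tens of thousands of components), and nothing here performs the hyperparameter estimation the paper is about.

```python
import numpy as np

rng = np.random.default_rng(1)

F = 1000   # toy supervector dimension (real: #mixtures x feature dim)
NV = 300   # speaker factors, as in the paper's cross-channel system
NU = 200   # channel factors, likewise

m = rng.normal(size=F)          # universal background mean supervector
V = rng.normal(size=(F, NV))    # eigenvoice matrix (speaker subspace)
U = rng.normal(size=(F, NU))    # eigenchannel matrix (session subspace)
D = np.abs(rng.normal(size=F))  # diagonal residual term

y = rng.normal(size=NV)         # speaker factors: fixed for a speaker
z = rng.normal(size=F)          # speaker-specific residual

def session_supervector(y, z, x):
    # JFA decomposition: s = m + V y + U x + D z
    return m + V @ y + U @ x + D * z

# Two sessions of the same speaker share (y, z) but differ in the
# channel factors x, which is what makes the model session-robust.
s1 = session_supervector(y, z, rng.normal(size=NU))
s2 = session_supervector(y, z, rng.normal(size=NU))
print(s1.shape)  # (1000,)
```

The point of the decomposition is visible even at toy scale: the difference s1 - s2 lies entirely in the column space of U, so channel variability can be projected away when comparing speakers.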

Speech2Phone: A Multilingual and Text Independent Speaker Identification Model

ArXiv, 2020

Voice recognition is an area with wide application potential. Speaker identification is useful in several voice recognition tasks, as seen in voice-based authentication, transcription systems, and intelligent personal assistants. Some tasks benefit from open-set models which can handle new speakers without the need for retraining. Audio embeddings for speaker identification are one proposal to solve this issue. However, choosing a suitable model is a difficult task, especially when training resources are scarce. Besides, it is not always clear whether embeddings are as good as more traditional methods. In this work, we propose Speech2Phone and compare several embedding models for open-set speaker identification, as well as traditional closed-set models. The models were investigated in the scenario of small datasets, which makes them more applicable to languages in which data scarcity is an issue. The results show that embeddings generated by artificial neural networks are com...
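Open-set identification with embeddings typically reduces to a nearest-neighbour search with a rejection threshold: score the probe embedding against each enrolled speaker and return "unknown" if no score is high enough. This is a generic sketch of that pattern, not Speech2Phone itself; the random 64-dim vectors, the cosine scorer, and the 0.5 threshold are placeholder assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_open_set(embedding, enrolled, threshold=0.5):
    # Return the best-matching enrolled speaker, or None when no
    # similarity clears the threshold (i.e. an unknown speaker).
    scores = {name: cosine(embedding, ref) for name, ref in enrolled.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

rng = np.random.default_rng(2)
alice, bob = rng.normal(size=64), rng.normal(size=64)
enrolled = {"alice": alice, "bob": bob}

probe = alice + 0.1 * rng.normal(size=64)   # noisy sample of alice
stranger = rng.normal(size=64)              # embedding of an unseen speaker

print(identify_open_set(probe, enrolled))     # alice
print(identify_open_set(stranger, enrolled))  # None (rejected as unknown)
```

Because enrollment is just storing a reference vector, new speakers are added without retraining, which is exactly the property the abstract highlights for open-set models.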

Corpora for the Evaluation of Robust Speaker Recognition Systems

Interspeech 2016, 2016

The goal of this paper is to describe significant corpora available to support speaker recognition research and evaluation, along with details about their collection and design. We describe the attributes of high-quality speaker recognition corpora. Considerations of application, domain, and performance metrics are also discussed. Additionally, a literature survey of corpora used in speaker recognition research over the last 10 years is presented. Finally, we present the most common corpora used in the research community and review their success in enabling meaningful speaker recognition research.