Javier Franco-Pedroso | Universidad Autónoma de Madrid

Papers by Javier Franco-Pedroso

Gaussian Mixture Models of Between-Source Variation for Likelihood Ratio Computation from Multivariate Data

PloS one, 2016

In forensic science, trace evidence found at a crime scene and on a suspect has to be evaluated from the measurements performed on it, usually in the form of multivariate data (for example, several chemical compounds or physical characteristics). In order to assess the strength of that evidence, the likelihood ratio framework is being increasingly adopted. Several methods have been derived in order to obtain likelihood ratios directly from univariate or multivariate data by modelling both the variation appearing between observations (or features) coming from the same source (within-source variation) and that appearing between observations coming from different sources (between-source variation). In the widely used multivariate kernel likelihood ratio, the within-source distribution is assumed to be normally distributed and constant among different sources, and the between-source variation is modelled through a kernel density function (KDF). In order to better fit the observed distribution...
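For illustration only, the sketch below shows one way such a likelihood ratio could be approximated when the between-source distribution of source means is modelled with a GMM rather than a KDF; the constant within-source covariance W, the Monte Carlo approximation, and all function names are assumptions of this sketch, not the paper's implementation.

```python
# Hypothetical sketch: likelihood ratio with a GMM between-source model.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_between_source_gmm(source_means, n_components=4, seed=0):
    # Model the between-source variation of per-source mean vectors with a GMM.
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed)
    gmm.fit(source_means)
    return gmm

def likelihood_ratio(x_mean, n_x, y_mean, n_y, W, gmm, n_samples=20000):
    # Monte Carlo approximation of
    #   LR = p(x_mean, y_mean | same source) / [ p(x_mean) * p(y_mean) ]
    # assuming a common within-source covariance W and source means drawn from the GMM.
    mu, _ = gmm.sample(n_samples)  # candidate source means
    lik_x = np.array([multivariate_normal.pdf(x_mean, mean=m, cov=W / n_x) for m in mu])
    lik_y = np.array([multivariate_normal.pdf(y_mean, mean=m, cov=W / n_y) for m in mu])
    numerator = np.mean(lik_x * lik_y)             # both sets share one (unknown) source mean
    denominator = np.mean(lik_x) * np.mean(lik_y)  # independent source means
    return numerator / denominator
```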

Linguistically-constrained formant-based i-vectors for automatic speaker recognition

Speech Communication, 2016

Albayzín-2014 evaluation: audio segmentation and classification in broadcast news domains

EURASIP Journal on Audio, Speech, and Music Processing, 2015

ATVS-UAM NIST LRE 2011 System Description

What are we missing with i-vectors? A perceptual analysis of i-vector-based falsely accepted trials

ATVS-QUT NIST SRE 2012 System Description

Fine-grained automatic speaker recognition using cepstral trajectories in phone units

In this paper, the contributions to speaker identity from different phone units are explored through the analysis of the temporal trajectories of their Mel-Frequency Cepstral Coefficients (MFCC). Inspired by successful work in forensic speaker identification, we extend the approach based on temporal contours of formant frequencies in linguistic units to design a fully automatic system, bringing together the forensic and automatic speaker recognition worlds. The combination of MFCC feature extraction and variable-length unit-dependent trajectory coding provides a powerful tool to extract individualizing information. At a fine-grained level, we provide a calibrated likelihood ratio per linguistic unit under analysis (extremely useful in applications such as forensics), and at a coarse-grained level, we combine the individual contributions of different units to obtain a single calibrated likelihood ratio per trial. With development data extracted from 367 male speakers in 1,808 conversations from the NIST SRE 2004 and 2005 datasets, the proposed approach has been tested on the NIST SRE 2006 dataset and protocol, consisting of 9,720 English-only 1side-1side trials from 219 male speakers.
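As a hedged illustration of unit-dependent trajectory coding (not the exact parameterisation used in the paper), the sketch below summarises each MFCC contour within a phone occurrence by a few low-order DCT coefficients, so occurrences of different durations yield comparable fixed-length vectors.

```python
# Hypothetical sketch: fixed-length encoding of MFCC trajectories within a phone segment.
import numpy as np
from scipy.fftpack import dct

def encode_trajectory(mfcc_segment, n_dct=5):
    # mfcc_segment: (n_frames, n_ceps) MFCCs of one phone occurrence.
    # Keep the first n_dct DCT coefficients of each cepstral coefficient's temporal contour.
    coded = dct(mfcc_segment, type=2, norm="ortho", axis=0)[:n_dct, :]
    if coded.shape[0] < n_dct:  # pad very short segments
        coded = np.vstack([coded, np.zeros((n_dct - coded.shape[0], coded.shape[1]))])
    return coded.T.flatten()    # shape: (n_ceps * n_dct,)

# Example: two occurrences of the same phone with different durations
# map to vectors of identical dimensionality and can be compared directly.
seg_a = np.random.randn(32, 19)   # 32 frames, 19 cepstral coefficients
seg_b = np.random.randn(57, 19)
assert encode_trajectory(seg_a).shape == encode_trajectory(seg_b).shape
```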

ATVS-QUT Submission Overview

Calibration and weight of the evidence by human listeners. The ATVS-UAM submission to NIST Human-Aided Speaker Recognition 2010

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011

This work analyzes the performance of speaker recognition when carried out by human lay listeners. In forensics, judges and jurors usually hold the intuition that people are proficient at distinguishing other people by their voices, and therefore opinions about speech evidence are easily elicited just by listening to it, or by means of panels of listeners. There is a danger, however, since little attention has been paid to scientifically measuring the performance of human listeners, as well as the strength with which they should express their opinions. In this work we perform such a rigorous analysis in the context of NIST Human-Aided Speaker Recognition 2010 (HASR). We have recruited a panel of listeners who have expressed opinions in the form of scores. We have then calibrated those scores using a development set in order to generate calibrated likelihood ratios. Thus, the discriminating power and the strength with which human lay listeners should express their opinions about the speech evidence can be assessed, giving a measure of the amount of information contributed by human listeners to the speaker recognition process.
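One standard way to map raw scores to calibrated log-likelihood ratios is an affine transformation trained by logistic regression on a development set (the general idea behind toolkits like FoCal); the sketch below illustrates that generic recipe under the stated assumptions, not the submission's exact procedure.

```python
# Hypothetical sketch: calibrating raw (e.g., listener) scores into log-likelihood ratios.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_calibration(dev_scores, dev_labels):
    # dev_scores: raw scores for development trials;
    # dev_labels: 1 for same-speaker trials, 0 for different-speaker trials.
    model = LogisticRegression()
    model.fit(np.asarray(dev_scores).reshape(-1, 1), np.asarray(dev_labels))
    # decision_function returns posterior log-odds at the development-set target prior;
    # subtracting that prior's log-odds gives an approximate log-likelihood ratio.
    p_target = float(np.mean(dev_labels))
    offset = np.log(p_target / (1.0 - p_target))
    return model, offset

def score_to_llr(model, offset, scores):
    scores = np.asarray(scores, dtype=float).reshape(-1, 1)
    return model.decision_function(scores) - offset
```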

Cepstral Trajectories in Linguistic Units for Text-Independent Speaker Recognition

Communications in Computer and Information Science, 2012

In this paper, the contributions of different linguistic units to the speaker recognition task are explored by means of the temporal trajectories of their MFCC features. Inspired by successful work in forensic speaker identification, we extend the approach based on temporal contours of formant frequencies in linguistic units to design a fully automatic system that brings together the forensic and automatic speaker recognition worlds. The combination of MFCC features and unit-dependent trajectories provides a powerful tool to extract individualizing information. At a fine-grained level, we provide a calibrated likelihood ratio per linguistic unit under analysis (extremely useful in applications such as forensics), and at a coarse-grained level, we combine the individual contributions of the different units to obtain a highly discriminative single system. This approach has been tested with the NIST SRE 2006 datasets and protocols, consisting of 9,720 trials from 219 male speakers for the 1side-1side English-only task, with development data extracted from 367 male speakers in 1,808 conversations from the NIST SRE 2004 and 2005 datasets.

Formant trajectories in linguistic units for text-independent speaker recognition

2013 International Conference on Biometrics (ICB), 2013

A linguistically-motivated speaker recognition front-end through session variability compensated cepstral trajectories in phone units

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

In this paper, a new linguistically-motivated front-end is presented, showing major performance improvements from the use of session-variability-compensated cepstral trajectories in phone units. Extending our recent work on temporal contours in linguistic units (TCLU), we have combined the potential of those unit-dependent trajectories with the ability of feature-domain factor analysis techniques to compensate session variability effects, which has resulted in consistent and discriminant phone-dependent trajectories across different recording sessions. Evaluating on the NIST SRE04 English-only 1s1s task, we report EERs as low as 5.40% from the trajectories of a single phone, with 29 different phones each producing EERs below 10%, while additionally showing excellent calibration performance per unit. The combination of different units shows significant complementarity, yielding EERs of 1.63% (100×DCF=0.732) from a simple sum fusion of the 23 best phones, or 0.68% (100×DCF=0.304) when fusing them through logistic regression.
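The two fusion strategies mentioned above (plain sum versus logistic regression) can be contrasted with the generic sketch below; the per-phone score matrix and all function names are assumptions made for illustration, not the evaluation code.

```python
# Hypothetical sketch: combining per-phone scores for each trial.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sum_fusion(phone_scores):
    # phone_scores: (n_trials, n_phones) matrix of per-phone scores for each trial.
    return phone_scores.sum(axis=1)

def train_logistic_fusion(dev_phone_scores, dev_labels):
    # Learn one weight per phone plus an offset on labelled development trials.
    fuser = LogisticRegression()
    fuser.fit(dev_phone_scores, dev_labels)
    return fuser

def logistic_fusion(fuser, phone_scores):
    # Weighted combination in the log-odds domain; weights reflect each phone's reliability.
    return fuser.decision_function(phone_scores)
```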

Eficiencia Computacional y Alto Rendimiento en Reconocimiento Automático de Locutor: el Sistema ATVS-UAM en NIST SRE 2010 (Computational Efficiency and High Performance in Automatic Speaker Recognition: the ATVS-UAM System at NIST SRE 2010)

Multilevel and channel-compensated language recognition: ATVS-UAM systems at NIST LRE 2009

ATVS Biometric Recognition Group, Universidad Autonoma de Madrid, Spain ({javier.gonzalez, ignacio.lopez, javier.franco, daniel.ramos, doroteo.torre, joaquin.gonzalez}@uam.es). The toolkit at http://niko.brummer.googlepages.com/focal has been used in order to train logistic regression.

ATVS-UAM NIST SRE 2010 System

This paper describes the system submitted by ATVS-UAM to the 2010 edition of the NIST Speaker Recognition Evaluation (SRE). Instead of focusing on multiple, complex and heavy systems, our submission is based on a fast, light and efficient single system. Sample development results with English SRE08 data (the data used in the previous evaluation, in 2008) are 0.53% EER (Equal Error Rate) on tel-tel male data (telephone data used for training and testing; an optimistic evaluation), going up to 3.5% (tel-tel) and 5.1% EER (tel-mic, telephone data for training and microphone data for testing) in pessimistic cross-validation experiments. These results are achieved with a system that is extremely light in computational resources, running 77 times faster than real time.

ATVS-UAM System Description for the Audio Segmentation and Speaker Diarization Albayzin 2010 Evaluation

This paper describes the ATVS-UAM systems submitted to the Audio Segmentation and Speaker Diarization Albayzin 2010 Evaluation. The ATVS-UAM audio segmentation system is based on a five-state HMM whose state models are GMMs trained with MMI. Testing utterances are aligned with the model by means of the Viterbi algorithm. Spurious changes in the state sequence are removed by a mode-filtering step and, finally, too-short segments are discarded. The ATVS-UAM speaker diarization system is a novel approach based on cosine-distance clustering of the Total Variability speech factors (the so-called i-vectors), performed in two steps, followed by a Viterbi decoding of probabilities derived from the distances between the candidate speaker centroids and the i-vector stream.
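A minimal sketch of cosine-distance clustering of segment-level i-vectors is given below, assuming the i-vectors are already extracted and the number of speakers is known; the two-step procedure and Viterbi decoding of the submitted system are not reproduced here.

```python
# Hypothetical sketch: grouping segment-level i-vectors by cosine distance
# and assigning each segment to the nearest resulting speaker centroid.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize_ivectors(ivectors, n_speakers=2):
    # ivectors: (n_segments, dim) matrix of segment-level i-vectors.
    # Length-normalise so dot products equal cosine similarities.
    iv = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    clustering = AgglomerativeClustering(n_clusters=n_speakers,
                                         metric="cosine", linkage="average")
    labels = clustering.fit_predict(iv)
    # Build speaker centroids and re-assign each segment to its closest centroid.
    centroids = np.vstack([iv[labels == k].mean(axis=0) for k in range(n_speakers)])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return np.argmax(iv @ centroids.T, axis=1)
```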

Multilevel and Session Variability Compensated Language Recognition: ATVS-UAM Systems at NIST LRE 2009

IEEE Journal of Selected Topics in Signal Processing, 2010

The 2013 speaker recognition evaluation in mobile environment

2013 International Conference on Biometrics (ICB), 2013
