Javier Franco-Pedroso | Universidad Autónoma de Madrid
Papers by Javier Franco-Pedroso
PLoS ONE, 2016
In forensic science, trace evidence found at a crime scene and on a suspect has to be evaluated from the measurements performed on it, usually in the form of multivariate data (for example, several chemical compounds or physical characteristics). In order to assess the strength of that evidence, the likelihood ratio framework is being increasingly adopted. Several methods have been derived to obtain likelihood ratios directly from univariate or multivariate data by modelling both the variation appearing between observations (or features) coming from the same source (within-source variation) and that appearing between observations coming from different sources (between-source variation). In the widely used multivariate kernel likelihood ratio, the within-source distribution is assumed to be normally distributed and constant among different sources, and the between-source variation is modelled through a kernel density function (KDF). In order to better fit the observed distribu...
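The two-level model sketched in the abstract above (normal within-source variation with a common covariance, between-source variation captured by a kernel density over the source means) can also be evaluated numerically instead of through the closed-form multivariate kernel formula. Below is a minimal, hypothetical Python sketch that approximates the likelihood ratio by Monte Carlo integration over the between-source density; the bandwidth rule, the data layout and every variable name are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: two-level likelihood ratio for multivariate trace evidence,
# with Gaussian within-source variation and a kernel density (Gaussian KDE)
# between-source model, evaluated by Monte Carlo integration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def two_level_lr(control, recovered, background, n_samples=20000):
    """control, recovered: (n_i, d) measurements on the two compared items.
    background: dict mapping source id -> (n_s, d) reference measurements."""
    # Pooled within-source covariance and per-source means from the background data.
    W = np.mean([np.cov(x, rowvar=False) for x in background.values()], axis=0)
    means = np.stack([x.mean(axis=0) for x in background.values()])
    d = means.shape[1]

    # Between-source model: Gaussian kernel density over the source means
    # (Silverman-style bandwidth applied to the between-source covariance).
    B = np.cov(means, rowvar=False)
    h = (4.0 / ((d + 2) * len(means))) ** (1.0 / (d + 4))
    kernel_cov = (h ** 2) * B

    # Sample candidate source means from the kernel density: pick a background
    # mean at random, then add kernel noise.
    idx = rng.integers(len(means), size=n_samples)
    theta = means[idx] + rng.multivariate_normal(np.zeros(d), kernel_cov, size=n_samples)

    # Within-source likelihood of each item's sample mean given a candidate source mean.
    yc, yr = control.mean(axis=0), recovered.mean(axis=0)
    lik_c = multivariate_normal.pdf(theta, mean=yc, cov=W / len(control))
    lik_r = multivariate_normal.pdf(theta, mean=yr, cov=W / len(recovered))

    numerator = np.mean(lik_c * lik_r)             # same-source hypothesis
    denominator = np.mean(lik_c) * np.mean(lik_r)  # different-source hypothesis
    return numerator / denominator
```

A returned value well above 1 supports the same-source hypothesis; well below 1, the different-source hypothesis.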
Speech Communication, 2016
EURASIP Journal on Audio, Speech, and Music Processing, 2015
In this paper, the contributions to speaker identity from different phone units are explored through the analysis of the temporal trajectories of their Mel-Frequency Cepstral Coefficients (MFCC). Inspired by successful work in forensic speaker identification, we extend the approach based on temporal contours of formant frequencies in linguistic units to design a fully automatic system, bringing together the forensic and automatic speaker recognition worlds. The combination of MFCC feature extraction and variable-length unit-dependent trajectory coding provides a powerful tool to extract individualizing information. At a fine-grained level, we provide a calibrated likelihood ratio per linguistic unit under analysis (extremely useful in applications such as forensics), and at a coarse-grained level, we combine the individual contributions of different units to obtain a single calibrated likelihood ratio per trial. With development data extracted from 367 male speakers in 1,808 conversations from the NIST SRE 2004 and 2005 datasets, the proposed approach has been tested on the NIST SRE 2006 dataset and protocol, consisting of 9,720 English-only 1side-1side trials from 219 male speakers.
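As a rough illustration of the trajectory coding described above, the sketch below turns the variable-length sequence of MFCC vectors from one phone occurrence into a fixed-length vector by keeping the first few DCT coefficients of each cepstral contour. The abstract does not specify the parameterization, so the DCT, the number of coefficients and the function name are assumptions made for the example.

```python
# Hedged sketch: fixed-length coding of a variable-length MFCC trajectory
# within one phone occurrence, using a truncated DCT along the time axis.
import numpy as np
from scipy.fft import dct

def code_trajectory(mfcc_segment, n_coeffs=5):
    """mfcc_segment: (n_frames, n_mfcc) MFCCs of one phone occurrence.
    Returns a (n_mfcc * n_coeffs,) fixed-length trajectory vector."""
    # The DCT along time captures the smooth temporal contour of each cepstral
    # coefficient; keeping only the first terms discards fine detail and makes
    # occurrences of different durations comparable.
    coeffs = dct(mfcc_segment, axis=0, norm="ortho")[:n_coeffs]
    if coeffs.shape[0] < n_coeffs:  # pad very short segments with zeros
        coeffs = np.pad(coeffs, ((0, n_coeffs - coeffs.shape[0]), (0, 0)))
    return coeffs.T.reshape(-1)

# Usage: group frames by phone label from an ASR alignment, code every
# occurrence, and model the resulting vectors per phone (e.g. with a GMM).
toy_segment = np.random.randn(23, 19)             # 23 frames x 19 MFCCs of synthetic data
trajectory_vector = code_trajectory(toy_segment)  # shape (95,)
```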
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
This work analyzes the performance of speaker recognition when carried out by human lay listeners. In forensics, judges and jurors usually share the intuition that people are proficient at distinguishing other people by their voices, and therefore opinions about speech evidence are easily elicited just by listening to it, or by means of panels of listeners. There is a danger, however, since little attention has been paid to scientifically measuring the performance of human listeners, or the strength with which they should express their opinions. In this work we perform such a rigorous analysis in the context of the NIST Human-Aided Speaker Recognition 2010 (HASR) evaluation. We recruited a panel of listeners who expressed their opinions in the form of scores. We then calibrated those scores using a development set in order to generate calibrated likelihood ratios. Thus, the discriminating power and the strength with which human lay listeners should express their opinions about the speech evidence can be assessed, giving a measure of the amount of information contributed by human listeners to the speaker recognition process.
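The calibration step mentioned above, which maps the raw panel scores into calibrated likelihood ratios using a development set, is commonly implemented as linear logistic regression. The following sketch uses scikit-learn as a stand-in for a dedicated calibration toolkit; the variable names and the nearly unregularized setting are assumptions made for illustration.

```python
# Hedged sketch: linear logistic-regression calibration of raw scores into
# log-likelihood ratios, trained on a development set of labelled trials.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_calibration(dev_scores, dev_labels):
    """dev_scores: 1-D array of raw scores; dev_labels: 1 for same-speaker trials, 0 otherwise.
    Returns a function mapping raw scores to calibrated log-likelihood ratios."""
    model = LogisticRegression(C=1e6)  # very weak regularization: a plain logistic fit
    model.fit(dev_scores.reshape(-1, 1), dev_labels)
    # decision_function gives log posterior odds under the development prior;
    # removing the log prior odds of the development set leaves a log-likelihood ratio.
    log_prior_odds = np.log(dev_labels.mean() / (1.0 - dev_labels.mean()))
    return lambda s: model.decision_function(np.asarray(s, dtype=float).reshape(-1, 1)) - log_prior_odds

# calibrated_llrs = train_calibration(dev_scores, dev_labels)(listener_scores)
```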
Communications in Computer and Information Science, 2012
In this paper, the contributions of different linguistic units to the speaker recognition task are explored by means of the temporal trajectories of their MFCC features. Inspired by successful work in forensic speaker identification, we extend the approach based on temporal contours of formant frequencies in linguistic units to design a fully automatic system that brings together the forensic and automatic speaker recognition worlds. The combination of MFCC features and unit-dependent trajectories provides a powerful tool to extract individualizing information. At a fine-grained level, we provide a calibrated likelihood ratio per linguistic unit under analysis (extremely useful in applications such as forensics), and at a coarse-grained level, we combine the individual contributions of the different units to obtain a highly discriminative single system. This approach has been tested with the NIST SRE 2006 datasets and protocols, consisting of 9,720 trials from 219 male speakers for the 1side-1side English-only task, with development data extracted from 367 male speakers in 1,808 conversations from the NIST SRE 2004 and 2005 datasets.
2013 International Conference on Biometrics (ICB), 2013
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
In this paper a new linguistically-motivated front-end is presented, showing major performance improvements from the use of session-variability-compensated cepstral trajectories in phone units. Extending our recent work on temporal contours in linguistic units (TCLU), we have combined the potential of those unit-dependent trajectories with the ability of feature-domain factor analysis techniques to compensate session variability effects, which results in consistent and discriminant phone-dependent trajectories across different recording sessions. Evaluating on the NIST SRE04 English-only 1s1s task, we report EERs as low as 5.40% from the trajectories of a single phone, with 29 different phones each producing EERs below 10%, while also showing excellent calibration performance per unit. The combination of different units shows significant complementarity, with EERs of 1.63% (100×DCF=0.732) from a simple sum fusion of the 23 best phones, or 0.68% (100×DCF=0.304) when fusing them through logistic regression.
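The two fusion schemes compared above, a simple sum of per-phone scores and a logistic-regression fusion trained on development trials, can be written compactly. The sketch below is a hypothetical illustration with an assumed matrix layout, not the evaluation code behind the reported numbers.

```python
# Hedged sketch: fusing per-phone scores into a single trial score, either by
# summation or by logistic regression trained on development trials.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sum_fusion(phone_scores):
    """phone_scores: (n_trials, n_phones) matrix of per-phone scores."""
    return phone_scores.sum(axis=1)

def train_lr_fusion(dev_phone_scores, dev_labels):
    """Learns one weight per phone plus an offset; returns a fusion function."""
    fuser = LogisticRegression(C=1e6)
    fuser.fit(dev_phone_scores, dev_labels)
    return lambda scores: fuser.decision_function(scores)  # fused score per trial

# fused_eval_scores = train_lr_fusion(dev_matrix, dev_labels)(eval_matrix)
```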
ATVS Biometric Recognition Group, Universidad Autonoma de Madrid, Spain. {javier.gonzalez, ignacio.lopez, javier.franco, daniel.ramos, doroteo.torre, joaquin.gonzalez}@uam.es ... has been used in order to train logistic regression (http://niko.brummer.googlepages.com/focal). ...
This paper describes the system submitted by ATVS-UAM to the 2010 edition of the NIST Speaker Recognition Evaluation (SRE). Instead of focusing on multiple, complex and heavy systems, our submission is based on a fast, light and efficient single system. Sample development results with English SRE08 data (the data used in the previous evaluation in 2008) are 0.53% EER (Equal Error Rate) on tel-tel (telephone data used for training and testing) male data, an optimistic evaluation, rising to 3.5% (tel-tel) and 5.1% EER (tel-mic, telephone data for training and microphone data for testing) in pessimistic cross-validation experiments. These results are achieved with a system that is extremely light in computational resources, running 77 times faster than real time.
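For reference, the Equal Error Rate quoted throughout these results is the operating point where the false-acceptance and false-rejection rates coincide. The helper below is a hypothetical, simplified way to compute it from separated target and non-target score arrays; variable names are illustrative.

```python
# Hedged sketch: approximate Equal Error Rate (EER, in %) from target and
# non-target scores by finding where FAR and FRR cross.
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejections
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(frr - far))
    return 100.0 * (frr[i] + far[i]) / 2.0
```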
This paper describes the ATVS-UAM systems submitted to the Audio Segmentation and Speaker Diarization Albayzin 2010 Evaluation. The ATVS-UAM audio segmentation system is based on a 5-GMM-MMI-state HMM model. Test utterances are aligned with the model by means of the Viterbi algorithm. Spurious changes in the state sequence are removed by a mode-filtering step, and finally, too-short segments are discarded. The ATVS-UAM speaker diarization system is a novel approach based on cosine-distance clustering of the Total Variability speech factors (the so-called iVectors), performed in two steps and followed by a Viterbi decoding of probabilities based on the distances between the candidate speaker centroids and the iVector stream.
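A minimal sketch of the clustering idea is given below: cosine-distance grouping of per-segment i-vectors, implemented here by length-normalizing the vectors and running k-means, a common stand-in for cosine clustering. The use of scikit-learn, the assumption that the number of speakers is known, and the function name are illustrative choices, not the paper's exact two-step procedure.

```python
# Hedged sketch: cosine-distance clustering of i-vectors via length
# normalization followed by k-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster_ivectors(ivectors, n_speakers):
    """ivectors: (n_segments, dim) array; returns (labels, centroids)."""
    normed = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    km = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit(normed)
    return km.labels_, km.cluster_centers_

# The segment labels can then be smoothed with a Viterbi pass over the
# distances between each i-vector and the candidate speaker centroids,
# as described in the abstract above.
```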
"Multilevel and Session Variability Compensated Language Recognition: ATVS-UAM Systems at NIST LRE 2009", IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, December 2010.