Roy Wallace - Academia.edu
Papers by Roy Wallace
Bob is a free signal processing and machine learning toolbox originally developed by the Biometrics group at Idiap Research Institute, Switzerland. The toolbox is designed to meet the needs of researchers by reducing development time and efficiently processing data. Firstly, Bob provides a researcher-friendly Python environment for rapid development. Secondly, efficient processing of large amounts of multimedia data is provided by fast C++ implementations of identified bottlenecks. The Python environment is integrated seamlessly with the C++ library, which ensures the library is easy to use and extensible. Thirdly, Bob supports reproducible research through its integrated experimental protocols for several databases. Finally, a strong emphasis is placed on code clarity, documentation, and thorough unit testing. Bob is thus an attractive resource for researchers due to this unique combination of ease of use, efficiency, extensibility and transparency. Bob is an open-source library and an ongoing community effort.
2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009
The use of the PC and Internet for placing telephone calls will present new opportunities to capture vast amounts of un-transcribed speech for a particular speaker. This paper investigates how to best exploit this data for speaker-dependent speech recognition. Supervised and unsupervised experiments in acoustic model and language model adaptation are presented. Using one hour of automatically transcribed speech per speaker with a word error rate of 36.0%, unsupervised adaptation resulted in an absolute gain of 6.3%, equivalent to 70% of the gain from the supervised case, with additional adaptation data likely to yield further improvements. LM adaptation experiments suggested that although there seems to be a small degree of speaker idiolect, adaptation to the speaker alone, without considering the topic of the conversation, is in itself unlikely to improve transcription accuracy.
IET Biometrics, 2012
This study presents the first detailed study of total variability modelling (TVM) for face verification. TVM was originally proposed for speaker verification, where it has been accepted as state-of-the-art technology. Also referred to as front-end factor analysis, TVM uses a probabilistic model to represent a speech recording as a low-dimensional vector known as an 'i-vector'. This representation has been successfully applied to a wide variety of speech-related pattern recognition applications, and remains a hot topic in biometrics. In this work, the authors extend the application of i-vectors beyond the domain of speech to a novel representation of facial images for the purpose of face verification. Extensive experimentation on several challenging and publicly available face recognition databases demonstrates that TVM generalises well to this modality, providing between 17 and 39% relative reduction in verification error rate compared to a baseline Gaussian mixture model system. Several i-vector session compensation and scoring techniques were evaluated, including source-normalised linear discriminant analysis (SN-LDA), probabilistic LDA and within-class covariance normalisation. Finally, this study provides a detailed comparison of the complexity of TVM, highlighting some important computational advantages with respect to related state-of-the-art techniques.
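For context, the TVM/i-vector model referred to here can be sketched in its standard form from the speaker-verification literature (the notation below is conventional, not quoted from this abstract). The client- and session-dependent GMM mean supervector is modelled as

$$ M = m + T w, \qquad w \sim \mathcal{N}(0, I), $$

where $m$ is the universal background model (UBM) mean supervector, $T$ is a low-rank total variability matrix learned from training data, and the i-vector used as the fixed-length representation of a recording (or, in this work, a face image) is the posterior mean of the latent variable $w$ given the observed features.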
Proceedings of the third workshop on Searching spontaneous conversational speech - SSCS '09, 2009
Spoken term detection (STD) popularly involves performing word or sub-word level speech recognition and indexing the result. This work challenges the assumption that improved speech recognition accuracy implies better indexing for STD. Using an index derived from phone lattices, this paper examines the effect of language model selection on the relationship between phone recognition accuracy and STD accuracy. Results suggest …
2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009
While spoken term detection (STD) systems based on word indices provide good accuracy, there are several practical applications where it is infeasible or too costly to employ an LVCSR engine. An STD system is presented, which is designed to incorporate a fast phonetic decoding front-end and be robust to decoding errors whilst still allowing for rapid search speeds. This …
2011 International Joint Conference on Biometrics (IJCB), 2011
This paper applies inter-session variability modelling and joint factor analysis to face authentication using Gaussian mixture models. These techniques, originally developed for speaker authentication, aim to explicitly model and remove detrimental within-client (inter-session) variation from client models. We apply the techniques to face authentication on the publicly available BANCA, SCface and MOBIO databases. We propose a face authentication protocol for the challenging SCface database, and provide the first results on the MOBIO still face protocol. The techniques provide relative reductions in error rate of up to 44%, using only limited training data. On the BANCA database, our results represent a 31% reduction in error rate when benchmarked against previous work. * The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7) under grant agreements 238803 (BBfor2) and 257289 (TABULA RASA).
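As a hedged sketch of the two models named above (the conventional formulation from the speaker-recognition literature; symbols are illustrative, not taken from this abstract): joint factor analysis (JFA) models the session- and client-dependent mean supervector as

$$ \mu_{i,j} = m + V y_i + U x_{i,j} + D z_i, $$

where $m$ is the UBM mean supervector, $V y_i$ and $D z_i$ capture between-client variation for client $i$, and $U x_{i,j}$ captures within-client (session) variation for session $j$. Inter-session variability modelling (ISV) is the simplification without the $V y_i$ term,

$$ \mu_{i,j} = m + U x_{i,j} + D z_i, $$

and at test time the estimated session offset $U x$ is discarded so that scoring is performed against the session-independent client model.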
2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010
This paper introduces a novel technique to directly optimise the Figure of Merit (FOM) for phonetic spoken term detection. The FOM is a popular measure of STD accuracy, making it an ideal candidate for use as an objective function. A simple linear model is introduced to transform the phone log-posterior probabilities output by a phone classifier to produce enhanced log-posterior …
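For reference, the FOM used in keyword-spotting and STD work is conventionally defined (this is the standard definition, not a formula quoted from this abstract) as the detection rate averaged over operating points between 0 and 10 false alarms per term per hour of searched speech:

$$ \mathrm{FOM} = \frac{1}{10} \int_{0}^{10} P_{\mathrm{det}}(f)\, df \;\approx\; \frac{1}{10} \sum_{k=1}^{10} P_{\mathrm{det}}(k), $$

where $P_{\mathrm{det}}(f)$ is the proportion of true term occurrences detected when the decision threshold is set to allow $f$ false alarms per term per hour.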
2011 9th International Workshop on Content-Based Multimedia Indexing (CBMI), 2011
The design and evaluation of subword-based spoken term detection (STD) systems depends on various factors, such as language, type of the speech to be searched and application scenario. The choice of the subword unit and search approach, however, is oftentimes made regardless of these factors. Therefore, we evaluate two subword STD systems across two data sets with varying properties to …
IET Biometrics, 2014
This paper presents an evaluation of the verification and calibration performance of a face recognition system based on inter-session variability modeling. As an extension to calibration through linear transformation of scores, categorical calibration is introduced as a way to include additional information about images for calibration. The cost of likelihood ratio, which is a well-known measure in the speaker recognition field, is used as a calibration performance metric. Results on the challenging MOBIO and SCface databases indicate that linearly calibrated face recognition scores are less misleading in their likelihood ratio interpretation than uncalibrated scores. In addition, the categorical calibration experiments show that calibration can be used not only to improve the likelihood ratio interpretation of scores, but also to improve the verification performance of a face recognition system.
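The cost of likelihood ratio ($C_{\mathrm{llr}}$) mentioned above has a standard definition in the speaker-recognition literature (given here for context, not quoted from this abstract):

$$ C_{\mathrm{llr}} = \frac{1}{2} \left( \frac{1}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{tar}} \log_2\!\left(1 + \frac{1}{\mathrm{LR}_i}\right) + \frac{1}{N_{\mathrm{non}}} \sum_{j \in \mathrm{non}} \log_2\!\left(1 + \mathrm{LR}_j\right) \right), $$

where $\mathrm{LR}_i$ are the likelihood ratios assigned to target trials and $\mathrm{LR}_j$ those assigned to non-target (impostor) trials; well-calibrated scores drive $C_{\mathrm{llr}}$ towards its minimum.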
IET Biometrics, 2013
This study examines session variability modelling for face authentication using Gaussian mixture models. Session variability modelling aims to explicitly model and suppress detrimental within-class (inter-session) variation. The authors examine two techniques to do this, inter-session variability modelling (ISV) and joint factor analysis (JFA), which were initially developed for speaker authentication. They present a self-contained description of these two techniques and demonstrate that they can be successfully applied to face authentication. In particular, they show that using ISV leads to significant error rate reductions of, on average, 26% on the challenging and publicly available databases SCface, BANCA, MOBIO and Multi-PIE. Finally, the authors show that a limitation of both ISV and JFA for face authentication is that the session variability model captures and suppresses a significant portion of between-class variation.
IEEE Transactions on Information Forensics and Security, 2000
This paper applies score and feature normalisation techniques to parts-based Gaussian mixture model (GMM) face authentication. In particular, we propose to utilise techniques that are well established in state-of-the-art speaker authentication, and apply them to the face authentication task. For score normalisation, T-, Z- and ZT-norm techniques are evaluated. For feature normalisation, we propose a generalisation of feature warping to 2D images, which is applied to discrete cosine transform (DCT) features prior to modelling. Evaluation is performed on a range of challenging databases relevant to forensics and security, including surveillance and access control scenarios. The normalisation techniques are shown to generalise well to the face authentication task, resulting in relative improvements in half total error rate (HTER) of between 17% and 62%.
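As a minimal sketch of what these normalisations involve (generic Z-/T-norm and a 1D analogue of feature warping as used in speaker recognition; the function names and the 1D formulation are illustrative assumptions, not the paper's 2D implementation):

```python
import numpy as np
from scipy.stats import norm

def z_norm(score, impostor_scores_vs_model):
    """Z-norm: centre and scale a raw score using impostor scores against the client model."""
    return (score - impostor_scores_vs_model.mean()) / impostor_scores_vs_model.std()

def t_norm(score, probe_scores_vs_cohort):
    """T-norm: centre and scale a raw score using the probe's scores against a cohort of models.
    ZT-norm applies Z-norm followed by T-norm."""
    return (score - probe_scores_vs_cohort.mean()) / probe_scores_vs_cohort.std()

def feature_warp_1d(features):
    """Rank-based warping of each feature dimension to a standard normal distribution
    (1D analogue of the 2D feature warping proposed for image blocks in the paper)."""
    n, d = features.shape
    warped = np.empty_like(features, dtype=float)
    for k in range(d):
        ranks = np.argsort(np.argsort(features[:, k]))   # ranks 0 .. n-1
        warped[:, k] = norm.ppf((ranks + 0.5) / n)       # inverse standard-normal CDF
    return warped
```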
IEEE Transactions on Audio, Speech, and Language Processing, 2000
This paper proposes to improve spoken term detection (STD) accuracy by optimizing the figure of merit (FOM). In this paper, the index takes the form of a phonetic posterior-feature matrix. Accuracy is improved by formulating STD as a discriminative training problem and directly optimizing the FOM, through its use as an objective function to train a transformation of …
This paper details the submission from the Speech and Audio Research Lab of Queensland University of Technology (QUT) to the inaugural 2006 NIST Spoken Term Detection Evaluation. The task involved accurately locating the occurrences of a specified list of English terms in a given corpus of broadcast news and conversational telephone speech. The QUT system uses phonetic decoding and Dynamic Match Lattice Spotting to rapidly locate search terms, combined with a neural network-based verification stage. The use of phonetic search means the system is open vocabulary and performs usefully (Actual Term-Weighted Value of 0.23) whilst avoiding the cost of a large vocabulary speech recognition engine.
We present a state-of-the-art bi-modal authentication system for mobile environments, using session variability modelling. We examine inter-session variability modelling (ISV) and joint factor analysis (JFA) for both face and speaker authentication and evaluate our system on the largest bi-modal mobile authentication database available, the MOBIO database, with over 61 hours of audio-visual data captured by 150 people in uncontrolled environments on a mobile phone. Our system achieves 2.6% and 9.7% half total error rate for male and female trials respectively, corresponding to relative improvements of 78% and 27% compared to previous results.
Lecture Notes in Computer Science, 2012
In this paper we introduce the facereclib, the first software library that allows a variety of face recognition algorithms to be compared on most of the known facial image databases and that permits rapid prototyping of novel ideas and testing of meta-parameters of face recognition algorithms. The facereclib is built on the open-source signal processing and machine learning library Bob. It uses well-specified face recognition protocols to ensure that results are comparable and reproducible. We show that the face recognition algorithms implemented in Bob, as well as third-party face recognition libraries, can be used to run face recognition experiments within the framework of the facereclib. As a proof of concept, we execute four different state-of-the-art face recognition algorithms: local Gabor binary pattern histogram sequences (LGBPHS), Gabor graph comparisons with a Gabor phase based similarity measure, inter-session variability modeling (ISV) of DCT block features, and linear discriminant analysis on two different color channels (LDA-IR), on two different databases: The Good, The Bad & The Ugly, and the BANCA database, in all cases using their fixed protocols. The results show that no single face recognition algorithm outperforms all others; rather, the results are strongly dependent on the employed database.