Driss Matrouf - Academia.edu
Papers by Driss Matrouf
Cornell University - arXiv, May 12, 2016
In this paper, we propose a speaker-verification system based on maximum likelihood linear regression (MLLR) super-vectors, for which speakers are characterized by m-vectors. These vectors are obtained by a uniform segmentation of the speaker MLLR super-vector using an overlapped sliding window. We consider three approaches for MLLR transformation, based on the conventional 1-best automatic transcription, on the lattice word transcription, or on a simple global universal background model (UBM). Session variability compensation is performed in a post-processing module with probabilistic linear discriminant analysis (PLDA) or the eigen factor radial (EFR). Alternatively, we propose a cascade post-processing for the MLLR super-vector based speaker-verification system. In this case, the m-vectors or MLLR super-vectors are first projected onto a lower-dimensional vector space generated by linear discriminant analysis (LDA). Next, PLDA session variability compensation and scoring are applied to the reduced-dimensional vectors. This approach combines the advantages of both techniques and makes the estimation of PLDA parameters easier. Experimental results on telephone conversations of the NIST 2008 and 2010 speaker recognition evaluation (SRE) indicate that the proposed m-vector system performs significantly better than the conventional system based on the full MLLR super-vectors. Cascade post-processing further reduces the error rate in all cases. Finally, we present the results of fusion with a standard i-vector system in the feature as well as the score domain, demonstrating that the m-vector system is both competitive and complementary with it.
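To make the segmentation concrete, here is a minimal sketch (not the authors' code) of cutting one MLLR super-vector into overlapped m-vectors with a sliding window; the window size and shift below are illustrative assumptions, since the paper tunes these empirically.

```python
import numpy as np

def extract_m_vectors(mllr_supervector, win=250, shift=125):
    """Split an MLLR super-vector into overlapped m-vectors.

    win and shift are illustrative; an incomplete final window is
    simply dropped here, though padding is another option.
    """
    m_vectors = []
    for start in range(0, len(mllr_supervector) - win + 1, shift):
        m_vectors.append(mllr_supervector[start:start + win])
    return np.stack(m_vectors)

# Example: a 1000-dimensional super-vector yields 7 overlapped m-vectors.
sv = np.random.randn(1000)
print(extract_m_vectors(sv).shape)  # (7, 250)
```

With a disjoint segmentation (the other criterion mentioned in the EUSIPCO paper below), the shift would simply equal the window size.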
2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2012
In this paper, we propose a speaker verification system called the m-vector system, where speakers are represented by a uniform segmentation of their Maximum Likelihood Linear Regression (MLLR) super-vectors, denoted m-vectors. The MLLR super-vectors are extracted with respect to a Universal Background Model (UBM), with MLLR adaptation using the speaker's data. Two criteria are followed to segment the MLLR super-vector: one is a disjoint segmentation technique and the other uses overlapped windows. Afterward, the m-vectors are conditioned by our recently proposed session variability compensation algorithm [1] before scores are calculated during the test phase. Notably, the proposed method is not based on any total variability space concept and uses a simple MLLR transformation to extract m-vectors, without considering any transcription of the speech segment. The proposed system shows promising performance compared to the conventional i-vector system. This indicates that session variability compensation plays...
ArXiv, 2019
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and the same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing att...
Interspeech 2009, 2009
Video genre classification is a challenging task in a global context of fast-growing video collections available on the Internet. This paper presents a new method for video genre identification by audio analysis. Our approach relies on the combination of low- and high-level audio features. We investigate the discriminative capacity of features related to acoustic instability, speaker interactivity, speech quality and acoustic space characterization. Genre identification is performed on these features using an SVM classifier. Experiments are conducted on a corpus composed of cartoons, movies, news, commercials and music, on which we obtain an identification rate of 91%.
Interspeech 2007, 2007
Spoken document retrieval (SDR) systems must be vocabulary-free in order to deal with arbitrary query words, because a user often searches for the section where a query word is spoken, and query words are liable to be special terms that are not included in a speech recognizer's dictionary. We have previously proposed new subword models, such as the 1/2 phone model, the 1/3 phone model, and the sub-phonetic segment (SPS) model, and have confirmed the effectiveness of these models for SDR [1]. These models are more sophisticated on the time axis than phoneme models such as the triphone model. The present paper proposes a method for integrating the plural retrieval results obtained from each subword model and demonstrates the performance improvement through experiments using an actual presentation speech corpus.
The Speaker and Language Recognition Workshop (Odyssey 2020), 2020
Using deep learning methods has led to significant improvements in speaker recognition systems. The introduction of x-vectors as a speaker modeling method has made these systems more robust. However, since the performance of x-vector systems degrades significantly in challenging environments with noise and reverberation, the demand for denoising techniques remains. In this paper, for the first time, we try to denoise the x-vector speaker embedding. Our focus is on additive noise. First, we use the i-MAP method, which assumes that both the noise and the clean x-vectors have a Gaussian distribution. Then, leveraging denoising autoencoders (DAE), we try to reconstruct the clean x-vector from the corrupted version. After that, we propose two hybrid systems composed of the statistical i-MAP and a DAE. Finally, we propose a novel DAE architecture, named Deep Stacked DAE, composed of several DAEs where each DAE receives as input the output of its predecessor concatenated with the difference between the noisy x-vector and its predecessor's output. Experiments on the Fabiol corpus show that the results given by the hybrid DAE/i-MAP method in several cases outperform the conventional DAE and i-MAP methods, and that Deep Stacked DAE in most cases is better than the other proposed methods. For utterances longer than 12 seconds, we achieved a 51% improvement in terms of EER with Deep Stacked DAE, and for utterances shorter than 2 seconds, Deep Stacked DAE gives an 18% improvement compared to the baseline system.
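The Deep Stacked DAE description above maps naturally onto a residual-style stack. Below is a hedged PyTorch sketch of that architecture; the layer sizes, activations, number of stages, and the first stage taking the raw noisy x-vector are all assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class DAE(nn.Module):
    """A single denoising autoencoder block (sizes are assumptions)."""
    def __init__(self, in_dim, xvec_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, xvec_dim),
        )

    def forward(self, x):
        return self.net(x)

class DeepStackedDAE(nn.Module):
    """Stack of DAEs: each later stage sees its predecessor's output
    concatenated with (noisy x-vector - predecessor's output)."""
    def __init__(self, xvec_dim=512, n_stages=3):
        super().__init__()
        first = DAE(xvec_dim, xvec_dim)
        rest = [DAE(2 * xvec_dim, xvec_dim) for _ in range(n_stages - 1)]
        self.stages = nn.ModuleList([first] + rest)

    def forward(self, noisy):
        out = self.stages[0](noisy)
        for stage in self.stages[1:]:
            residual = noisy - out  # what the previous stage left behind
            out = stage(torch.cat([out, residual], dim=-1))
        return out

model = DeepStackedDAE()
clean_estimate = model(torch.randn(8, 512))  # a batch of noisy x-vectors
```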
Computer Speech & Language, 2020
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof challenge initiative was created to foster research on anti-spoofing and to provide common platforms for the assessment and comparison of spoofing countermeasures. The first edition, ASVspoof 2015, focused upon the study of countermeasures for the detection of text-to-speech synthesis (TTS) and voice conversion (VC) attacks. The second edition, ASVspoof 2017, focused instead upon replay spoofing attacks and countermeasures. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and the same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than was previously possible. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment of spoofed data in the logical access scenario. It was demonstrated that the spoofed data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona fide utterances even by human subjects. It is expected that the ASVspoof 2019 database, with its varied coverage of different types of spoofing data, could further foster research on anti-spoofing.
Computer Speech & Language, 2019
Speech recordings are a rich source of personal, sensitive data that can be used to support a plethora of diverse applications, from health profiling to biometric recognition. It is therefore essential that speech recordings are adequately protected so that they cannot be misused. Such protection, in the form of privacy-preserving technologies, is required to ensure that: (i) the biometric profiles of a given individual (e.g., across different biometric service operators) are unlinkable; (ii) leaked, encrypted biometric information is irreversible; and (iii) biometric references are renewable. Whereas many privacy-preserving technologies have been developed for other biometric characteristics, very few solutions have been proposed to protect privacy in the case of speech signals. This is despite privacy preservation now being mandated by recent European and international data protection regulations. With the aim of fostering progress and collaboration between researchers in the speech, biometrics and applied cryptography communities, this survey article provides an introduction to the field, starting with a legal perspective on privacy preservation in the case of speech data. It then establishes the requirements for effective privacy preservation and reviews generic cryptography-based solutions, followed by specific techniques that are applicable to speaker characterisation (biometric applications) and speech characterisation (non-biometric applications). Turning to non-biometric applications, methods are presented to avoid function creep, preventing the exploitation of biometric information, e.g., to single out an identity in speech-assisted health care.
Computer Speech & Language, 2017
Since the i-vector paradigm was introduced in the field of speaker recognition, many techniques have been proposed to deal with additive noise within this framework. Due to the complexity of its effect in the i-vector space, a lot of effort has been put into dealing with noise in other domains (speech enhancement, feature compensation, robust i-vector extraction and robust scoring). As far as we know, there has been no serious attempt to handle the noise problem directly in the i-vector space without relying on data distributions computed on a prior domain. The aim of this paper is twofold. First, it proposes a full-covariance Gaussian modeling of the clean i-vectors and of the noise distribution in the i-vector space, and introduces a technique to estimate a clean i-vector given the noisy version and the noise density function, using the MAP approach. Based on NIST data, we show that it is possible to improve the baseline system performance by up to 60%. Second, in order to make this algorithm usable in a real application and to reduce the computational time needed by i-MAP, we propose an extension that builds a noise distribution database in the i-vector space in an off-line step and uses it later in the test phase. We show that it is possible to achieve comparable results with this approach (up to 57% relative EER improvement) given a sufficiently large noise distribution database.
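Under the stated Gaussian assumptions, the MAP estimate of the clean i-vector has a closed form. Here is a minimal NumPy sketch, assuming the additive model y = x + n with x ~ N(mu_x, cov_x) and n ~ N(mu_n, cov_n); the paper's exact estimator and parameter-estimation procedure may differ in details.

```python
import numpy as np

def i_map_denoise(y, mu_x, cov_x, mu_n, cov_n):
    """MAP estimate of a clean i-vector x from its noisy version y = x + n.

    Setting the gradient of the log-posterior to zero gives
    x_hat = (cov_x^-1 + cov_n^-1)^-1 (cov_x^-1 mu_x + cov_n^-1 (y - mu_n)).
    """
    p_x = np.linalg.inv(cov_x)           # clean i-vector precision
    p_n = np.linalg.inv(cov_n)           # noise precision
    post_cov = np.linalg.inv(p_x + p_n)  # posterior covariance
    return post_cov @ (p_x @ mu_x + p_n @ (y - mu_n))

d = 4  # toy dimension; real i-vectors are typically 400-600 dimensional
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
cov_x = A @ A.T + d * np.eye(d)  # a well-conditioned SPD covariance
x_hat = i_map_denoise(rng.standard_normal(d), np.zeros(d), cov_x,
                      np.zeros(d), 0.5 * np.eye(d))
```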
Encyclopedia of Biometrics, 2009
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
Speech recognition applications are known to require a significant amount of resources (memory, computing power). However, embedded speech recognition systems, such as those in mobile phones, allow only a few KB of memory and a few MIPS. In the context of HMM-based speech recognizers, each HMM-state distribution is modeled independently of the others and has a large number of parameters. In spite of state-tying techniques, the acoustic models stay large and a certain redundancy remains between states. In this paper, we investigate the capacity of the Subspace Gaussian Mixture approach to reduce the acoustic model size while keeping good performance. We introduce a simplification concerning the estimation of state-specific Gaussian weights, which is a very complex and time-consuming procedure in the original approach. With this approach, we show that the acoustic model size can be reduced by 92% with almost the same performance as standard acoustic modeling.
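For intuition about where the size reduction comes from, here is a toy sketch of SGMM-style parameter sharing; the dimensions are illustrative, and the weight computation shown is the standard softmax form rather than the paper's simplification. Each state stores only a low-dimensional vector, while the mean and weight projections are shared globally.

```python
import numpy as np

# Illustrative dimensions for an SGMM-style acoustic model.
feat_dim, subspace_dim, n_gauss, n_states = 39, 50, 400, 2000
M = np.random.randn(n_gauss, feat_dim, subspace_dim)  # shared mean projections
w = np.random.randn(n_gauss, subspace_dim)            # shared weight projections
v = np.random.randn(n_states, subspace_dim)           # per-state vectors

def state_params(j):
    """Derive the GMM parameters of state j from its subspace vector."""
    means = M @ v[j]                       # (n_gauss, feat_dim)
    logits = w @ v[j]                      # (n_gauss,)
    weights = np.exp(logits - logits.max())
    return means, weights / weights.sum()  # softmax over Gaussians

means, weights = state_params(0)
```

Per state, storage drops from hundreds of full Gaussian parameter sets to just subspace_dim floats, which illustrates why the total model size can shrink so dramatically.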
Lecture Notes in Computer Science
The LIA developed a speech recognition toolkit providing most of the components required by speech-to-text systems. This toolbox was used to build a Broadcast News (BN) transcription system that was involved in the ESTER evaluation campaign [3], on the unconstrained transcription and real-time transcription tasks. In this paper, we describe the techniques we used to reach real time, starting from our baseline 10xRT system. We focus on some aspects of the A* search algorithm which are critical for both efficiency and accuracy. Then, we evaluate the impact of the different system components (lexicon, language models and acoustic models) on the trade-off between efficiency and accuracy. Experiments are carried out in the framework of the ESTER evaluation campaign. Our results show that the real-time system performs about 5.6% absolute WER (Word Error Rate) worse than the standard 10xRT system, with an absolute WER of about 26.8%.
2006 IEEE International Conference on Acoustics Speed and Signal Processing Proceedings
A method is described for predicting acoustic feature variability by analyzing the consensus and relative entropy of phoneme posterior probability distributions obtained with different acoustic models having the same type of observations. Variability prediction is used for diagnosis of automatic speech recognition (ASR) systems. When errors are likely to occur, different feature sets are considered for correcting recognition results. Experimental results are provided on the CH1 Italian portion of AURORA3.
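As a concrete illustration of the divergence measure involved, here is a small sketch of the relative entropy (KL divergence) between two phoneme posterior distributions; any thresholding or symmetrisation used in the paper is not reproduced here.

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) between two phoneme posterior
    distributions (a sketch; the paper may use a symmetrised variant)."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Posteriors from two acoustic models over the same 3 phonemes.
# Low divergence plus the same argmax suggests consensus (a stable
# frame); high divergence flags likely feature variability.
p_model_a = [0.7, 0.2, 0.1]
p_model_b = [0.6, 0.3, 0.1]
print(relative_entropy(p_model_a, p_model_b))
```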
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
The performance of state-of-the-art speaker recognition systems degrades considerably in noisy environments, even though they achieve very good results in clean conditions. In order to deal with this strong limitation, we aim in this work to remove the noisy part of an i-vector directly in the i-vector space. Our approach offers the advantage of operating only at the i-vector extraction level, leaving the other steps of the system unchanged. A maximum a posteriori (MAP) procedure is applied in order to obtain a clean version of the noisy i-vector, taking advantage of prior knowledge about the clean i-vector distribution. To perform this MAP estimation, Gaussian assumptions over the clean and noise i-vector distributions are made. Operating on NIST 2008 data, we show a relative improvement of up to 60% compared with the baseline system. Our approach also outperforms the "multi-style" backend training technique. The efficiency of the proposed method comes at the price of a relatively high computational cost; we conclude with some ideas to improve this aspect.
Encyclopedia of Biometrics, 2015
All biometric recognition systems are based on similarity metrics that enable decisions of "same" or "different" to be made. Such metrics require normalizations in order to make them commensurable across comparison cases that may differ greatly in the quantity of data available or in the quality of the data. Is a "perfect match" based only on a small amount of data better or worse than a less perfect match based on more data? Another need for score normalization arises when interpreting the best match found after an exhaustive search, in terms of the size of the database searched. The likelihood of a good match arising just by chance between unrelated templates must increase with the size of the search database, simply because there are more opportunities. How should a given "best match" score be interpreted? Addressing these questions on a principled basis requires models of the underlying probability distributions that describe the likelihood of a given degree of similarity arising by chance from unrelated sources. Likewise, if comparisons are required over an increasing range of image orientations because of uncertainty about image tilt, the probability of a good similarity score arising just by chance from unrelated templates again grows automatically, because there are more opportunities. In all these respects, biometric similarity score normalization is needed, and it plays a critical role in the avoidance of false matches in the publicly deployed algorithms for iris recognition.
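The database-size effect described here is easy to quantify under an independence assumption; the numbers below are purely illustrative.

```python
# If a single unrelated comparison yields a "good" similarity score with
# probability p, the chance that at least one of N unrelated templates
# does so by chance grows with the database size:
p, N = 1e-6, 10_000_000
p_best_by_chance = 1 - (1 - p) ** N
print(f"{p_best_by_chance:.4f}")  # ~0.9999: near-certain without renormalisation
```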
2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008
With the purpose of improving Spoken Language Understanding (SLU) performance, a combination of different automatic speech recognition (ASR) systems is proposed. State a-posteriori probabilities obtained with systems using different acoustic feature sets are combined with log-linear interpolation. In order to perform a coherent combination of these probabilities, the acoustic models must have the same topology (i.e., the same set of states). For this purpose, a fast and efficient twin model training protocol is proposed. By a wise choice of acoustic feature sets and log-linear interpolation of their likelihood ratios, a substantial Concept Error Rate (CER) reduction has been observed on the test part of the French MEDIA corpus.
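Log-linear interpolation of posteriors over a shared state set can be sketched in a few lines; the interpolation weights here are illustrative and would in practice be tuned on held-out data.

```python
import numpy as np

def log_linear_combine(posteriors, lambdas):
    """Log-linear interpolation of state posteriors from systems sharing
    the same state set: p(s|x) proportional to prod_i p_i(s|x)**lambda_i."""
    log_p = sum(lam * np.log(np.asarray(p, float) + 1e-12)
                for p, lam in zip(posteriors, lambdas))
    p = np.exp(log_p - log_p.max())  # renormalise in a stable way
    return p / p.sum()

# Two systems with different feature sets, same 4 HMM states:
combined = log_linear_combine(
    [[0.5, 0.3, 0.1, 0.1], [0.4, 0.4, 0.1, 0.1]], [0.6, 0.4])
```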
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
In the supervector UBM/GMM paradigm, each acoustic file is represented by the mean parameters of a GMM model. This supervector space is used as a data representation space, which has a high dimensionality. Moreover, this space is not intrinsically discriminant, and a complete speech segment is represented by only one vector, largely removing the possibility of taking temporal or sequential information into account. This work proposes a new approach where each acoustic frame is represented in a discriminant binary space. The proposed approach relies on a UBM to structure the acoustic space into regions. Each region is then populated with a set of Gaussian models, denoted "specificities", able to emphasize speaker-specific information. Each acoustic frame is mapped into the discriminant binary space, turning "on" or "off" all the specificities to create a large binary vector. All the following steps (speaker reference extraction, likelihood estimation and decision) take place in this binary space. Even though this work is a first step in this direction, experiments based on the NIST SRE 2008 framework demonstrate the potential of the proposed approach. Moreover, this approach opens the opportunity to rethink all the classical processes using a discrete, binary view.
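A simplified sketch of the frame binarisation idea follows, assuming one UBM region is selected per frame by nearest mean and that each specificity Gaussian toggles a bit via a log-likelihood threshold; the paper's actual region-selection and activation rules may differ.

```python
import numpy as np
from scipy.stats import multivariate_normal

def binarize_frame(frame, ubm_means, specificities, threshold):
    """Map one acoustic frame to a binary vector.

    specificities[r] is a list of (mean, cov) pairs populating region r;
    the nearest-mean region choice and the threshold are assumptions."""
    region = np.argmin(np.linalg.norm(ubm_means - frame, axis=1))
    bits = np.zeros(sum(len(s) for s in specificities), dtype=np.uint8)
    offset = sum(len(specificities[r]) for r in range(region))
    for k, (mean, cov) in enumerate(specificities[region]):
        if multivariate_normal.logpdf(frame, mean, cov) > threshold:
            bits[offset + k] = 1  # this specificity fires for the frame
    return bits

rng = np.random.default_rng(0)
ubm_means = rng.standard_normal((4, 2))          # 4 regions, 2-dim features
specs = [[(rng.standard_normal(2), np.eye(2)) for _ in range(8)]
         for _ in range(4)]                      # 8 specificities per region
bits = binarize_frame(rng.standard_normal(2), ubm_means, specs, threshold=-3.0)
```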
Encyclopedia of Biometrics, 2009
The intrinsic characteristics of a biometric signal may be used to determine its suitability for further processing by the biometric system or to assess its conformance to pre-established standards. The quality of a biometric signal is a numerical value (or a vector) that measures this intrinsic attribute (see also Biometric Sample Quality).