Gang Liu | University of Texas at Dallas

Papers by Gang Liu

Research paper thumbnail of WEIGHTED TRAINING FOR SPEECH UNDER LOMBARD EFFECT FOR SPEAKER RECOGNITION

The presence of Lombard Effect in speech is proven to have severe effects on the performance of speech systems, especially speaker recognition. Varying kinds of Lombard speech are produced by speakers under the influence of varying noise types [1]. This study proposes a high-accuracy classifier using deep neural networks for detecting various kinds of Lombard speech against neutral speech, independent of the noise levels causing the Lombard Effect. Lombard Effect detection accuracies as high as 95.7% are achieved using this novel model. The deep neural network based classification is further exploited through validation-based weighted training of robust i-Vector based speaker identification systems. The proposed weighted training achieves a relative EER improvement of 28.4% over an i-Vector baseline system, confirming the effectiveness of deep neural networks in modeling the Lombard Effect.

Research paper thumbnail of Frequency Offset Correction in Single Sideband (SSB) Speech by Deep Neural Network for Speaker Verification

Communication system mismatch represents a major source of loss in speaker recognition performance. This paper considers a type of nonlinear communication system mismatch: modulation/demodulation (Mod/DeMod) carrier drift in single sideband (SSB) speech signals. We focus on the problem of estimating the frequency offset in SSB speech in order to improve speaker verification performance on the drifted speech. Based on a two-step framework from previous work, we propose using a multi-layered neural network architecture, the stacked denoising autoencoder (SDA), to determine the unique interval of the offset value in the first step. Experimental results demonstrate that the SDA based system can produce up to a +16.1% relative improvement in frequency offset estimation accuracy. A speaker verification evaluation shows a +65.9% relative improvement in EER when the SSB speech signal is compensated with the frequency offset value estimated by the proposed method.

Research paper thumbnail of I-vector Based Physical Task Stress Detection with Different Fusion Strategies

It is common for subjects to produce speech while performing a physical task where speech technology may be used. Variabilities are introduced to speech since physical tasks can influence human speech production, and these variabilities degrade the performance of most speech systems. It is therefore vital to detect speech under physical stress variabilities before subsequent algorithm processing. This study presents a method for detecting physical task stress from speech. Inspired by the fact that i-vectors can generally model total factors from speech, a state-of-the-art i-vector framework is investigated with MFCCs and our previously formulated TEO-CB-Auto-Env features for neutral/physical task stress detection. Since MFCCs are derived from a linear speech production model and TEO-CB-Auto-Env features employ a nonlinear operator, these two features are believed to have complementary effects on physical task stress detection. Two alternative fusion strategies (feature-level and score-level fusion) are investigated to validate this hypothesis. Experiments over the UT-Scope Physical Corpus demonstrate that a relative accuracy gain of 2.68% is obtained when fusing the different feature-based i-vectors. An additional relative performance boost of 6.52% in accuracy is achieved using score-level fusion.
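The two fusion strategies can be sketched with a toy example. Everything here is a stand-in: the "i-vectors" are random vectors with a class-dependent shift, and logistic regression is a placeholder classifier rather than the back-end used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 50
# Toy stand-ins for utterance-level i-vectors from two feature streams
# (MFCC-based and TEO-CB-Auto-Env-based); labels: 0 = neutral, 1 = stress.
labels = rng.integers(0, 2, size=n)
iv_mfcc = rng.normal(size=(n, d)) + 0.5 * labels[:, None]
iv_teo = rng.normal(size=(n, d)) + 0.3 * labels[:, None]

# Feature-level fusion: concatenate the i-vectors, train a single classifier.
clf_feat = LogisticRegression(max_iter=1000).fit(np.hstack([iv_mfcc, iv_teo]), labels)

# Score-level fusion: train one classifier per stream, then combine their
# posterior scores with a weight (tuned on held-out data in practice).
clf_mfcc = LogisticRegression(max_iter=1000).fit(iv_mfcc, labels)
clf_teo = LogisticRegression(max_iter=1000).fit(iv_teo, labels)
w = 0.6
fused = (w * clf_mfcc.predict_proba(iv_mfcc)[:, 1]
         + (1 - w) * clf_teo.predict_proba(iv_teo)[:, 1])
```

Feature-level fusion lets the classifier learn cross-stream interactions, while score-level fusion keeps the streams independent and only blends their decisions, which is why the two can behave differently on the same data.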

Research paper thumbnail of An i-Vector PLDA based Gender Identification Approach for Severely Distorted and Multilingual DARPA RATS Data

This study proposes an i-Vector based approach to gender identification. Gender-labeled utterances from the Fisher English (FE) corpus are used to formulate an i-Vector extraction framework, and a Probabilistic Linear Discriminant Analysis (PLDA) back-end is employed to compute the scores for gender identification. A novel duration mismatch compensation strategy is also presented that offers very little degradation in identification accuracy even with a large reduction in the duration of the test segment. The proposed method is shown to consistently outperform a GMM-UBM based gender identification scheme on several test sets created from a held-out portion of the FE corpus, and is able to achieve an identification accuracy of up to 97.63%. On the severely distorted and multilingual DARPA RATS (Robust Automatic Transcription of Speech) corpora, the proposed approach achieves an identification accuracy of 76.48% using only the FE data in training. Next, a novel unsupervised domain adaptation strategy is also presented that utilizes only unlabeled RATS data to adapt the out-of-domain PLDA parameters derived from the FE training data. The strategy is able to offer a 6.8% relative improvement in identification accuracy and a 14.75% relative reduction in Equal Error Rate (EER) compared to using the out-of-domain PLDA model on the RATS test utterances. These improvements are significant since: 1) the RATS test utterances are severely distorted, and 2) no labeled data of any kind is used for 4 of the 5 languages present in the test utterances.

Research paper thumbnail of Unsupervised accent classification for deep data fusion of accent and language information

Automatic Dialect Identification (DID) has recently gained substantial interest in the speech processing community. Studies have shown that variation in speech due to dialect is a factor which significantly impacts speech system performance. Dialects differ in various ways, such as acoustic traits (phonetic realization of vowels and consonants, rhythmical characteristics, prosody) and content-based word selection (grammar, vocabulary, phonetic distribution, lexical distribution, semantics). The traditional DID classifier is usually based on Gaussian Mixture Modeling (GMM), which is employed here as the baseline system. We investigate various methods of improving DID based on acoustic and text language subsystems to further boost performance. For the acoustic approach, we propose to use an i-Vector system. For text-based dialect classification, a series of natural language processing (NLP) techniques are explored to address word selection and grammar factors, which cannot be modeled using an acoustic modeling system. These NLP techniques include two traditional approaches, N-Gram modeling and Latent Semantic Analysis (LSA), and a novel approach based on Term Frequency–Inverse Document Frequency (TF-IDF) and logistic regression classification. Due to the sparsity of training data, the traditional text approaches do not offer superior performance. However, the proposed TF-IDF approach shows performance comparable to the i-Vector acoustic system, and fusing it with the i-Vector system yields a final audio-text combined solution that is more discriminative. Compared with the GMM baseline system, the proposed audio-text DID system provides a relative improvement in dialect classification performance of +40.1% and +47.1% on the self-collected corpus (UT-Podcast) and NIST LRE-2009 data, respectively. The experimental results validate the feasibility of leveraging both acoustic and textual information to achieve improved DID performance.
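The TF-IDF plus logistic regression text branch can be sketched as follows. The transcripts and dialect labels below are invented placeholders, not UT-Podcast or LRE data, and scikit-learn stands in for whatever toolkit the authors actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy transcripts labeled by dialect (placeholders, not corpus data).
transcripts = [
    "y'all fixin to head down yonder",
    "wicked good chowda down by the harbah",
    "y'all reckon it's fixin to rain",
    "pahk the cah over by the yahd",
]
dialects = ["southern", "northeastern", "southern", "northeastern"]

# TF-IDF maps each transcript to a weighted term vector (unigrams + bigrams);
# logistic regression then scores the dialect classes from those vectors.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(transcripts, dialects)
pred = model.predict(["y'all fixin to leave"])
```

Because TF-IDF down-weights words shared across dialects and up-weights dialect-specific vocabulary, this branch captures exactly the word-selection cues that an acoustic model cannot.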

Research paper thumbnail of JOINT INFORMATION FROM NONLINEAR AND LINEAR FEATURES FOR SPOOFING DETECTION: AN I-VECTOR/DNN BASED APPROACH

Protecting automatic speaker verification (ASV) systems from spoofing attacks remains an essential challenge, even though significant progress in ASV has been achieved in recent years. In this study, an automatic spoofing detection approach using an i-vector framework is proposed. Two approaches are used for frame-level feature extraction: cepstral-based Perceptual Minimum Variance Distortionless Response (PMVDR), and nonlinear speech-production-motivated Teager Energy Operator (TEO) Critical Band (CB) Autocorrelation Envelope (Auto-Env). An utterance-level i-vector for each recording is formed by concatenating PMVDR and TEO-CB-Auto-Env i-vectors, followed by linear discriminant analysis (LDA) to maximize the ratio of between-class to within-class scatter. A Gaussian classifier and a DNN are also investigated for back-end scoring. Experiments using the ASVspoof 2015 corpus show that our proposed method successfully detects spoofing attacks. By combining the TEO-CB-Auto-Env and PMVDR features, a relative 76.7% improvement in terms of EER is obtained compared with the best single-feature system.

Research paper thumbnail of UNCERTAINTY PROPAGATION IN FRONT END FACTOR ANALYSIS FOR NOISE ROBUST SPEAKER RECOGNITION

In this study, we explore the propagation of uncertainty in a state-of-the-art speaker recognition system. Specifically, we incorporate the uncertainty associated with observation features into the i-Vector extraction framework. To prove the concept, both oracle and practically estimated uncertainties are used for evaluation. The oracle uncertainty is calculated assuming knowledge of the clean speech features, while the estimated uncertainties are obtained using SPLICE and joint-GMM based methods. We evaluate the proposed framework on both the YOHO and NIST 2010 Speaker Recognition Evaluation (SRE) corpora by artificially introducing noise at different SNRs. In the speaker verification experiments, we confirm that the proposed uncertainty-based i-Vector extraction framework shows significant robustness against noise.

Research paper thumbnail of Acoustic Feature Transformation using UBM-based LDA for Speaker Recognition

In state-of-the-art speaker recognition systems, the universal background model (UBM) plays the role of acoustic space division: each Gaussian mixture of the trained UBM represents one distinct acoustic region. The posterior probabilities of features belonging to each region are further used as core components of the Baum-Welch statistics. Therefore, the quality of the estimated Baum-Welch statistics depends highly on how separable the acoustic regions are from each other. In this paper, we propose to transform the front-end acoustic features into a space where the separability of the mixtures of the trained UBM is optimized. To achieve this, a UBM is first trained from the acoustic features, and a transformation matrix is estimated using linear discriminant analysis (LDA) by treating each mixture of the trained UBM as an independent class. The proposed method, named UBM-based LDA (uLDA), therefore does not require any speaker labels or other supervised information. The obtained transformation matrix is then applied to the acoustic features for i-Vector extraction. Experimental results on the male part of the core conditions of the NIST SRE 2010 dataset confirm the improved performance of the proposed method.
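The uLDA idea (treat each UBM mixture as a pseudo-class, then fit LDA) can be sketched as follows, using scikit-learn and synthetic clustered features in place of real acoustic frames; the mixture count and dimensions are arbitrary choices, not the paper's configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic "acoustic frames": 8 well-separated clusters standing in for
# distinct acoustic regions (real systems would use MFCC-like features).
centers = rng.normal(scale=5.0, size=(8, 20))
frames = np.vstack([rng.normal(loc=c, size=(250, 20)) for c in centers])

# Step 1: train a small UBM (GMM) on the unlabeled frames.
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(frames)

# Step 2: treat each frame's most likely mixture component as its
# pseudo-class label.
pseudo_labels = ubm.predict(frames)

# Step 3: estimate an LDA transform from these pseudo-classes; no speaker
# labels (or any supervision) are required anywhere.
lda = LinearDiscriminantAnalysis().fit(frames, pseudo_labels)
transformed = lda.transform(frames)  # features for later i-vector extraction
```

The key point the sketch illustrates is that the "class" labels fed to LDA come from the UBM itself, so the whole transform is learned without any speaker metadata.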

Research paper thumbnail of An Investigation into Back-end Advancements for Speaker Recognition in Multi-Session and Noisy Enrollment Scenarios

This study aims to explore the case of robust speaker recognition with multi-session enrollments and noise, with an emphasis on the optimal organization and utilization of speaker information presented in the enrollment and development data. The study has two core objectives. First, we investigate more robust back-ends to address noisy multi-session enrollment data for speaker recognition; this is achieved by proposing novel back-end algorithms. Second, we construct a highly discriminative speaker verification framework; this is achieved through intrinsic and extrinsic back-end algorithm modification, resulting in complementary sub-systems. Evaluation of the proposed framework is performed on the NIST SRE2012 corpus. The results not only confirm individual sub-system advancements over an established baseline; the final grand fusion solution also represents a comprehensive overall advancement for the NIST SRE2012 core tasks. Compared with state-of-the-art SID systems on NIST SRE2012, the novel parts of this study are: 1) exploring a more diverse set of solutions for low-dimensional i-Vector based modeling; and 2) diversifying the information configuration before modeling. These two parts work together, resulting in very competitive performance with reasonable computational cost.

Research paper thumbnail of Investigating State-of-the-Art Speaker Verification in the case of Unlabeled Development Data

In this study, we describe the systems developed by the Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, for the NIST i-vector challenge. Given that the emphasis of this challenge is on utilizing unlabeled development data, our system development focuses on: 1) leveraging the channel variation in the unlabeled development data through unsupervised clustering; 2) investigating different classifiers containing complementary information that can be used in fusion; and 3) extracting meta-data information for test and model i-vectors. Our results indicate substantial improvement in performance by incorporating one or more of the aforementioned techniques.

Research paper thumbnail of SUPRA-SEGMENTAL FEATURE BASED SPEAKER TRAIT DETECTION

It is well known that speech utterances convey a rich diversity of information concerning the speaker in addition to the related semantic content. Such information may include speaker traits such as personality, likability, health/pathology, etc. Detecting speaker traits in a human-computer interface is an important task toward formulating more efficient and natural computer engagement. This study proposes two groups of supra-segmental features for improving speaker trait detection performance. Compared with a baseline system based on 6125-dimensional features, the proposed supra-segmental system not only improves performance by 9.0%, but is also computationally attractive and suitable for real-life applications, since it derives fewer than 63 feature dimensions, 99% fewer than the baseline system.

Research paper thumbnail of ROBUST LANGUAGE RECOGNITION BASED ON DIVERSE FEATURES

In real scenarios, robust language identification (LID) is usually hindered by factors such as background noise, channel, and speech duration mismatches. To address these issues, this study focuses on advancements in diverse acoustic features and back-ends, and their influence on LID system fusion. There is little research on the selection of complementary features for multiple-system fusion in LID. A set of distinct features is considered, which can be grouped into three categories: classical features, innovative features, and extensional features. In addition, both front-end concatenation and back-end fusion are considered. The results suggest that no single feature type is universally vital across all LID tasks and that a fusion of a diverse set is needed to ensure sustained LID performance in challenging scenarios. Moreover, back-end fusion also consistently and significantly enhances system performance. More specifically, the proposed hybrid fusion method improves system performance by +38.5% and +46.2% on the DARPA RATS and the NIST LRE09 data sets, respectively.

Research paper thumbnail of Device-Free People Counting and Localization

Device-free passive (DfP) localization has been proposed as an emerging technique for localizing people without requiring them to carry any devices. Potential applications include elder care, security enforcement, building occupancy statistics, etc.

Research paper thumbnail of Crowd++: Unsupervised Speaker Count with Smartphones

Smartphones are excellent mobile sensing platforms, with the microphone in particular being exercised in several audio inference applications. We take smartphone audio inference a step further and demonstrate for the first time that it is possible to accurately estimate the number of people talking in a given place, with an average error distance of 1.5 speakers, through unsupervised machine learning analysis of audio segments captured by the smartphones. Inference occurs transparently to the user, and no human intervention is needed to derive the classification model. Our results are based on the design, implementation, and evaluation of a system called Crowd++, involving 120 participants in 10 very different environments. We show that no dedicated external hardware or cumbersome supervised learning approaches are needed, only off-the-shelf smartphones used in a transparent manner. We believe our findings have profound implications for many research fields, including social sensing and personal wellbeing assessment.
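The core counting idea (cluster segment-level features without labels, then read off the number of clusters) can be sketched as follows. The embeddings are synthetic stand-ins, and agglomerative clustering with a distance threshold is one plausible choice, not necessarily the exact Crowd++ algorithm.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# Synthetic segment-level "speaker embeddings": three speakers, ten segments
# each (a real system would derive these from MFCC/pitch features).
speakers = rng.normal(scale=10.0, size=(3, 16))
segments = np.vstack([rng.normal(loc=s, size=(10, 16)) for s in speakers])

# Merge segments whose embeddings are close; the number of resulting
# clusters is the unsupervised speaker-count estimate.
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=20.0,
                                    linkage="average")
seg_labels = clusterer.fit_predict(segments)
estimated_speakers = int(seg_labels.max()) + 1
```

Because the threshold, not a preset cluster count, decides when merging stops, the method needs no prior knowledge of how many people are talking, which is what makes the approach fully unsupervised.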

Research paper thumbnail of Robust Speech Enhancement Techniques for ASR in Non-stationary Noise and Dynamic Environments

In current ASR systems, the presence of competing speakers greatly degrades recognition performance. This phenomenon becomes even more prominent in hands-free, far-field ASR systems such as "Smart TV" systems, where reverberation and non-stationary noise pose additional challenges. Furthermore, speakers are most often not standing still while speaking. To address these issues, we propose a cascaded system that includes Time Difference of Arrival estimation, multi-channel Wiener filtering, nonnegative matrix factorization (NMF), multi-condition training, and robust feature extraction, where each stage additively improves the overall performance. The final cascaded system achieves an average of 50% and 45% relative improvement in ASR word accuracy for the CHiME 2011 (non-stationary noise) and CHiME 2012 (non-stationary noise plus speaker head movement) tasks, respectively. Index Terms: array signal processing, automatic speech recognition, robustness, acoustic noise, non-negative matrix factorization
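The NMF stage of such a pipeline can be sketched as follows. The spectrogram here is random stand-in data, and the speech/noise basis split is omitted, so this shows only the factorization step itself rather than the full enhancement cascade.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Stand-in magnitude spectrogram (frequency bins x time frames); real input
# would come from an STFT of the noisy microphone signal.
spectrogram = np.abs(rng.normal(size=(64, 100)))

# Factor the spectrogram into nonnegative spectral bases W and time
# activations H; in an enhancement pipeline, bases matched to speech would
# be kept and noise bases discarded before reconstruction.
nmf = NMF(n_components=8, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(spectrogram)  # (64, 8) spectral basis vectors
H = nmf.components_                 # (8, 100) activations over time
approximation = W @ H               # rank-8 nonnegative approximation
```

The nonnegativity of both factors is what makes NMF natural for magnitude spectrograms: each basis behaves like an additive spectral template, so sources can be separated by partitioning the bases.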

Research paper thumbnail of The CRSS systems for the 2010 NIST speaker recognition evaluation

This document briefly describes the systems submitted by the Center for Robust Speech Systems (CRSS) from The University of Texas at Dallas (UTD) to the 2010 NIST Speaker Recognition Evaluation. Our systems primarily use factor analysis as a feature extractor [1] and a support vector machine (SVM) classification framework. Our main focus in the evaluation is on the telephone trials in the core condition and the 10 second train-test condition. Novel elements in our system include a supervised probabilistic principal component analysis (SPPCA) based approach for factor analysis, and an algorithm for optimal selection of the negative samples for training the SVM.

Research paper thumbnail of UTD-CRSS SYSTEMS FOR NIST LANGUAGE RECOGNITION EVALUATION 2011

Research paper thumbnail of A Linguistic Data Acquisition Front-End for Language Recognition Evaluation

One of the major challenges for language identification (LID) systems comes from sparse training data. Manually collecting linguistic data in a controlled studio is usually expensive and impractical, but multilingual broadcast programs (Voice of America, for instance) can be collected as a reasonable alternative for linguistic data acquisition. However, unlike studio-collected linguistic data, broadcast programs usually contain much content other than pure linguistic data: foreground/background music, commercials, and real-life noise. In this study, a systematic processing approach is proposed to extract the linguistic data from the broadcast media. Experimental results obtained on NIST LRE 2009 data show that the proposed method provides a 22.2% relative improvement in segmentation accuracy and a 20.5% relative improvement in LID accuracy.

Research paper thumbnail of AN INVESTIGATION ON BACK-END FOR SPEAKER RECOGNITION IN MULTI-SESSION ENROLLMENT

This study explores various back-end classifiers for robust speaker recognition with multi-session enrollment, with emphasis on the optimal utilization and organization of speaker information present in the development data. Our objective is to construct a highly discriminative back-end framework by fusing several back-ends within an i-vector system framework. It is demonstrated that, by using different information/data configurations and modeling schemes, the performance of the fused system can be significantly improved compared to an individual system using a single front-end and back-end. Averaged across both genders, we obtain relative improvements in EER and minDCF of 56.5% and 49.4%, respectively. The consistent performance gains obtained using the proposed strategy validate its effectiveness. This system is part of the CRSS NIST SRE 2012 submission.

Research paper thumbnail of WEIGHTED TRAINING FOR SPEECH UNDER LOMBARD EFFECT FOR SPEAKER RECOGNITION

The presence of Lombard Effect in speech is proven to have severe effects on the performance of s... more The presence of Lombard Effect in speech is proven to have severe effects on the performance of speech systems, especially speaker recognition. Varying kinds of Lombard speech are produced by speakers under influence of varying noise types [1]. This study proposes a high-accuracy classifier using deep neural networks for detecting various kinds of Lom-bard speech against neutral speech, independent of the noise levels causing the Lombard Effect. Lombard Effect detection accuracies as high as 95.7% are achieved using this novel model. The deep neural network based classification is further exploited by validation based weighted training of robust i-Vector based speaker identification systems. The proposed weighted training achieves a relative EER improvement of 28.4% over an i-Vector baseline system, confirming the effectiveness of deep neural networks in modeling Lombard Effect.

Research paper thumbnail of Frequency Offset Correction in Single Sideband(SSB) Speech by Deep Neural Network for Speaker Verification

Communication system mismatch represents a major influence for loss in speaker recognition perfor... more Communication system mismatch represents a major influence for loss in speaker recognition performance. This paper considers a type of nonlinear communication system mismatch-mod-ulation/demodulation (Mod/DeMod) carrier drift in single side-band (SSB) speech signals. We focus on the problem of estimating frequency offset in SSB speech in order to improve speaker verification performance of the drifted speech. Based on a two-step framework from previous work, we propose using a multi-layered neural network architecture, stacked denoising autoencoder (SDA), to determine the unique interval of the offset value in the first step. Experimental results demonstrate that the SDA based system can produce up to a +16.1% relative improvement in frequency offset estimation accuracy. A speaker verification evaluation shows a +65.9% relative improvement in EER when SSB speech signal is compensated with the frequency offset value estimated by the proposed method.

Research paper thumbnail of I-vector Based Physical Task Stress Detection with Different Fusion Strategies

It is common for subjects to produce speech while performing a physical task where speech technol... more It is common for subjects to produce speech while performing a physical task where speech technology may be used. Variabil-ities are introduced to speech since physical task can influence human speech production. These variabilities degrade the performance of most speech systems. It is vital to detect speech under physical stress variabilities for subsequent algorithm pro-cesssing. This study presents a method for detecting physical task stress from speech. Inspired by the fact that i-vectors can generally model total factors from speech, a state-of-the-art i-vector framework is investigated with MFCCs and our previously formulated TEO-CB-Auto-Env features for neutral/physical task stress detection. Since MFCCs are derived from a linear speech production model and TEO-CB-Auto-Env features employ a nonlinear operator, these two features are believed to have complementary effects on physical task stress detection. Two alternative fusion strategies (feature-level and score-level fusion) are investigated to validate this hypothesis. Experiments over the UT-Scope Physical Corpus demonstrate that a relative accuracy gain of 2.68% is obtained when fusing different feature based i-vectors. An additional relative performance boost with of 6.52% in accuracy is achieved using score level fusion.

Research paper thumbnail of An i-Vector PLDA based Gender Identification Approach for Severely Distorted and Multilingual DARPA RATS Data Why female and male speech differ? Motivations for i-Vector based Gender ID approach

This study proposes an i-Vector based approach to gender identification. Gender-labeled utterances... more This study proposes an i-Vector based approach to gender
identification. Gender-labeled utterances from the Fisher English
(FE) corpus are used to formulate an i-Vector extraction
framework, and a Probabilistic Linear Discriminant Analysis
(PLDA) back-end is employed to compute the scores
for gender identification. A novel duration mismatch compensation
strategy is also presented that offers very little
degradation in identification accuracy even with a large reduction
in the duration of the test-segment. The proposed
method is shown to consistently outperform a GMM-UBM
based gender-identification scheme on several test-sets created
from a held-out portion of the FE corpus, and is able to
achieve an identification accuracy of up to 97.63%. On the
severely distorted and multilingual DARPA-RATS (Robust
Automatic Transcription of Speech) corpora, the proposed
approach achieves an identification accuracy of 76.48% using
only the FE data in training. Next, a novel unsupervised
domain adaptation strategy is also presented that utilizes only
unlabeled RATS data to adapt the out-of-domain PLDA parameters
derived from the FE training data. The strategy is
able to offer a 6.8% relative improvement in identification
accuracy, and a 14.75% relative reduction in Equal Error Rate
(EER) compared to using the out-of-domain PLDA model on
the RATS test-utterances. These improvements are significant
since: 1) RATS test-utterances are severely distorted, 2)
No labeled data of any kind is used for 4 of the 5 languages
present in the test-utterances.

Research paper thumbnail of Unsupervised accent classification for deep data fusion of accent and language information

Automatic Dialect Identification (DID) has recently gained substantial interest in the speech pro... more Automatic Dialect Identification (DID) has recently gained substantial interest in the speech processing community. Studies have shown that the variation in speech due to dialect is a factor which significantly impacts speech system performance. Dialects differ in various ways such as acoustic traits (phonetic realization of vowels and consonants, rhythmical characteristics, prosody) and content based word selection (grammar, vocabulary, phonetic distribution, lexical distribution, semantics). The traditional DID classifier is usually based on Gaussian Mixture Modeling (GMM), which is employed as baseline system. We investigate various methods of improving the DID based on acoustic and text language subsystems to further boost the performance. For acoustic approach, we propose to use i-Vector system. For text language based dialect classification, a series of natural language processing (NLP) techniques are explored to address word selection and grammar factors, which cannot be modeled using an acoustic modeling system. These NLP techniques include: two traditional approaches, including N-Gram modeling and Latent Semantic Analysis (LSA), and a novel approach based on Term Frequency–Inverse Document Frequency (TF-IDF) and logistic regression classification. Due to the sparsity of training data, traditional text approaches do not offer superior performance. However, the proposed TF-IDF approach shows comparable performance to the i-Vector acoustic system, which when fused with the i-Vector system results in a final audio-text combined solution that is more discriminative. Compared with the GMM baseline system, the proposed audio-text DID system provides a relative improvement in dialect classification performance of +40.1% and +47.1% on the self-collected corpus (UT-Podcast) and NIST LRE-2009 data, respectively. 
The experiment results validate the feasibility of leveraging both acoustic and textual information in achieving improved DID performance.

Research paper thumbnail of JOINT INFORMATION FROM NONLINEAR AND LINEAR FEATURES FOR SPOOFING DETECTION: AN I-VECTOR/DNN BASED APPROACH

Protecting automatic speaker verification (ASV) systems from spoofing attacks remains an essential challenge, even though significant progress in ASV has been achieved in recent years. In this study, an automatic spoofing detection approach using an i-vector framework is proposed. Two approaches are used for frame-level feature extraction: cepstral-based Perceptual Minimum Variance Distortionless Response (PMVDR), and the non-linear, speech-production-motivated Teager Energy Operator (TEO) Critical Band (CB) Autocorrelation Envelope (Auto-Env). An utterance-level i-vector for each recording is formed by concatenating the PMVDR and TEO-CB-Auto-Env i-vectors, followed by linear discriminant analysis (LDA) to maximize the ratio of between-class to within-class scatter. A Gaussian classifier and a DNN are also investigated for back-end scoring. Experiments using the ASVspoof 2015 corpus show that the proposed method successfully detects spoofing attacks. By combining the TEO-CB-Auto-Env and PMVDR features, a relative 76.7% improvement in terms of EER is obtained compared with the best single-feature system.

Research paper thumbnail of UNCERTAINTY PROPAGATION IN FRONT END FACTOR ANALYSIS FOR NOISE ROBUST SPEAKER RECOGNITION

In this study, we explore the propagation of uncertainty in a state-of-the-art speaker recognition system. Specifically, we incorporate the uncertainty associated with observation features into the i-Vector extraction framework. To prove the concept, both oracle and practically estimated uncertainties are used for evaluation. The oracle uncertainty is calculated assuming knowledge of the clean speech features, while the estimated uncertainties are obtained using SPLICE and joint-GMM based methods. We evaluate the proposed framework on both the YOHO and NIST 2010 Speaker Recognition Evaluation (SRE) corpora by artificially introducing noise at different SNRs. In the speaker verification experiments, we confirm that the proposed uncertainty-based i-Vector extraction framework shows significant robustness against noise.

Research paper thumbnail of Acoustic Feature Transformation using UBM-based LDA for Speaker Recognition

In state-of-the-art speaker recognition systems, the universal background model (UBM) plays the role of acoustic space division. Each Gaussian mixture of the trained UBM represents one distinct acoustic region. The posterior probabilities of features belonging to each region are further used as core components of the Baum-Welch statistics. Therefore, the quality of the estimated Baum-Welch statistics depends highly on how separable the acoustic regions are from each other. In this paper, we propose to transform the front-end acoustic features into a space where the separability of the mixtures of the trained UBM is optimized. To achieve this, a UBM is first trained on the acoustic features, and a transformation matrix is estimated using linear discriminant analysis (LDA) by treating each mixture of the trained UBM as an independent class. The proposed method, named UBM-based LDA (uLDA), therefore does not require speaker labels or any other supervised information. The obtained transformation matrix is then applied to the acoustic features for i-Vector extraction. Experimental results on the male part of the core conditions of the NIST SRE 2010 dataset confirm the improved performance of the proposed method.
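The uLDA idea, using UBM mixtures as surrogate classes so that LDA needs no speaker labels, can be sketched as below. This is a toy illustration with synthetic 3-D features and a 3-mixture "UBM", not the paper's configuration; a real system would use acoustic features and hundreds to thousands of mixtures, and soft posteriors rather than hard assignments.

```python
# Sketch of UBM-based LDA (uLDA): train a GMM as the "UBM", label each
# frame with its most likely mixture, then fit LDA on those pseudo-labels.
# Features and mixture count here are synthetic and illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(loc=m, scale=0.5, size=(200, 3))
                   for m in ([0, 0, 0], [3, 3, 0], [0, 3, 3])])

ubm = GaussianMixture(n_components=3, random_state=0).fit(feats)
mix_labels = ubm.predict(feats)          # hard mixture assignment per frame

# LDA treats each mixture as an independent class: no speaker labels needed.
ulda = LinearDiscriminantAnalysis(n_components=2).fit(feats, mix_labels)
transformed = ulda.transform(feats)      # projected features for i-Vector extraction
print(transformed.shape)
```

The learned projection would then be applied to all features before sufficient-statistics accumulation and i-Vector extraction.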

Research paper thumbnail of An Investigation into Back-end Advancements for Speaker Recognition in Multi-Session and Noisy Enrollment Scenarios

This study aims to explore robust speaker recognition with multi-session enrollments and noise, with an emphasis on the optimal organization and utilization of the speaker information presented in the enrollment and development data. The study has two core objectives. First, we investigate more robust back-ends to address noisy multi-session enrollment data for speaker recognition; this is achieved by proposing novel back-end algorithms. Second, we construct a highly discriminative speaker verification framework; this is achieved through intrinsic and extrinsic back-end algorithm modifications, resulting in complementary sub-systems. Evaluation of the proposed framework is performed on the NIST SRE2012 corpus. The results not only confirm individual sub-system advancements over an established baseline; the final grand fusion solution also represents a comprehensive overall advancement for the NIST SRE2012 core tasks. Compared with state-of-the-art SID systems on NIST SRE2012, the novel parts of this study are: 1) exploring a more diverse set of solutions for low-dimensional i-Vector based modeling; and 2) diversifying the information configuration before modeling. These two parts work together, resulting in very competitive performance at reasonable computational cost.

Research paper thumbnail of Investigating State-of-the-Art Speaker Verification in the case of Unlabeled Development Data

In this study, we describe the systems developed by the Center for Robust Speech Systems (CRSS), University of Texas at Dallas, for the NIST i-vector challenge. Given that the emphasis of this challenge is on utilizing unlabeled development data, our system development focuses on: 1) leveraging the channel variation in the unlabeled development data through unsupervised clustering; 2) investigating different classifiers containing complementary information that can be used in fusion; and 3) extracting meta-data information for test and model i-vectors. Our results indicate substantial improvement in performance when incorporating one or more of the aforementioned techniques.
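The first focus above, unsupervised clustering of unlabeled development data, can be sketched as below. This is a toy example with synthetic "i-vectors" and k-means; the actual challenge system's clustering algorithm and dimensions may differ. The resulting cluster IDs can serve as pseudo-labels for training label-dependent back-ends such as LDA or PLDA.

```python
# Sketch: k-means clustering of unlabeled development i-vectors to
# obtain pseudo-labels. Vectors below are synthetic (three hidden
# sources, 20-D), not real challenge data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
dev_ivectors = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 20))
                          for c in (-1.0, 0.0, 1.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dev_ivectors)
pseudo_labels = km.labels_            # usable as class labels downstream
print(len(set(pseudo_labels)))
```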

Research paper thumbnail of SUPRA-SEGMENTAL FEATURE BASED SPEAKER TRAIT DETECTION

It is well known that speech utterances convey a rich diversity of information concerning the speaker in addition to the semantic content. Such information may include speaker traits such as personality, likability, health/pathology, etc. Detecting speaker traits in human-computer interfaces is an important task toward formulating more efficient and natural computer engagement. This study proposes two groups of supra-segmental features for improving speaker trait detection performance. Compared with the baseline system based on 6125-dimensional features, the proposed supra-segmental system not only improves performance by 9.0%, but is also computationally attractive and suitable for real-life application, since it derives fewer than 63 feature dimensions, 99% fewer than the baseline system.
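To make the notion of compact supra-segmental features concrete, the sketch below computes a few utterance-level statistics of the frame energy contour with plain NumPy. The specific statistics, frame sizes, and the synthetic "utterance" are illustrative assumptions, not the feature set actually proposed in the paper, which would also cover pitch and rhythm.

```python
# Sketch: utterance-level supra-segmental statistics from a frame-wise
# log-energy contour. Frame/hop sizes (25 ms / 10 ms at 16 kHz) and the
# random "audio" are illustrative.
import numpy as np

def suprasegmental_stats(signal, frame_len=400, hop=160):
    """Summarize the frame log-energy contour by mean, std, and range."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    log_e = np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])
    return np.array([log_e.mean(), log_e.std(), log_e.max() - log_e.min()])

rng = np.random.default_rng(3)
utterance = rng.normal(0, 0.1, 16000)   # 1 s of fake audio at 16 kHz
feats = suprasegmental_stats(utterance)
print(feats.shape)
```

A handful of such utterance-level statistics is what keeps the proposed feature dimension under 63, versus thousands of frame-level functionals in the baseline.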

Research paper thumbnail of ROBUST LANGUAGE RECOGNITION BASED ON DIVERSE FEATURES

In real scenarios, robust language identification (LID) is usually hindered by factors such as background noise, channel, and speech duration mismatches. To address these issues, this study focuses on advancements in diverse acoustic features and back-ends, and their influence on LID system fusion. There is little research on the selection of complementary features for multiple-system fusion in LID. A set of distinct features is considered, which can be grouped into three categories: classical features, innovative features, and extensional features. In addition, both front-end concatenation and back-end fusion are considered. The results suggest that no single feature type is universally vital across all LID tasks, and that a fusion of a diverse set is needed to ensure sustained LID performance in challenging scenarios. Moreover, back-end fusion also consistently and significantly enhances system performance. More specifically, the proposed hybrid fusion method improves system performance by +38.5% and +46.2% on the DARPA RATS and the NIST LRE09 data sets, respectively.
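Back-end (score-level) fusion of complementary sub-systems is commonly done with a learned linear combiner such as logistic regression. The sketch below illustrates this on synthetic scores from two hypothetical sub-systems; it is not the paper's hybrid fusion recipe.

```python
# Sketch: score-level fusion of two sub-systems via logistic regression.
# The sub-system scores are synthetic, correlated with a binary label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
labels = rng.integers(0, 2, size=n)          # 1 = target trial
sys1 = labels + rng.normal(0, 0.8, size=n)   # sub-system A scores
sys2 = labels + rng.normal(0, 1.0, size=n)   # sub-system B scores

# Stack per-trial scores and learn fusion weights on development data.
X = np.column_stack([sys1, sys2])
fuser = LogisticRegression().fit(X, labels)
fused = fuser.decision_function(X)           # fused detection scores
print(fused.shape)
```

In practice the fusion weights are trained on held-out development trials and then applied to evaluation scores.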

Research paper thumbnail of Device-Free People Counting and Localization

Device-free passive (DfP) localization has been proposed as an emerging technique for localizing people, without requiring them to carry any devices. Potential applications include elder-care, security enforcement, building occupancy statistics, etc.

Research paper thumbnail of Crowd++: Unsupervised Speaker Count with Smartphones

Smartphones are excellent mobile sensing platforms, with the microphone in particular being exercised in several audio inference applications. We take smartphone audio inference a step further and demonstrate for the first time that it is possible to accurately estimate the number of people talking in a given place, with an average error distance of 1.5 speakers, through unsupervised machine learning analysis of audio segments captured by the smartphones. Inference occurs transparently to the user, and no human intervention is needed to derive the classification model. Our results are based on the design, implementation, and evaluation of a system called Crowd++, involving 120 participants in 10 very different environments. We show that no dedicated external hardware or cumbersome supervised learning approaches are needed, only off-the-shelf smartphones used in a transparent manner. We believe our findings have profound implications for many research fields, including social sensing and personal wellbeing assessment.
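The unsupervised counting idea, clustering speech segments and reading the speaker count off the number of clusters, can be sketched as below. Crowd++ itself works on MFCC-based segment distances with its own merging rules; the agglomerative clustering, synthetic 12-D segment features, and distance threshold here are illustrative substitutes.

```python
# Sketch: estimate the number of speakers by agglomerative clustering
# of per-segment feature vectors; segments from the same speaker merge,
# segments from different speakers stay apart. Data are synthetic.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)
# 30 segments from each of two hypothetical speakers, 12-D features
segments = np.vstack([rng.normal(loc=m, scale=0.2, size=(30, 12))
                      for m in (0.0, 2.0)])

# Merge segments whose average distance falls under a tuned threshold;
# the surviving cluster count is the speaker estimate.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="average").fit(segments)
estimated_speakers = agg.n_clusters_
print(estimated_speakers)
```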

Research paper thumbnail of Robust Speech Enhancement Techniques for ASR in Non-stationary Noise and Dynamic Environments

In current ASR systems, the presence of competing speakers greatly degrades recognition performance. This phenomenon becomes even more prominent in hands-free, far-field ASR systems such as "Smart-TV" systems, where reverberation and non-stationary noise pose additional challenges. Furthermore, speakers are most often not standing still while speaking. To address these issues, we propose a cascaded system that includes Time Difference of Arrival estimation, multi-channel Wiener filtering, non-negative matrix factorization (NMF), multi-condition training, and robust feature extraction, each of which additively improves the overall performance. The final cascaded system achieves average relative improvements in ASR word accuracy of 50% and 45% on the CHiME 2011 (non-stationary noise) and CHiME 2012 (non-stationary noise plus speaker head movement) tasks, respectively. Index Terms: array signal processing, automatic speech recognition, robustness, acoustic noise, non-negative matrix factorization
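The NMF stage in such a cascade factorizes a nonnegative magnitude spectrogram into basis spectra and activations, after which noise-associated bases can be suppressed. The sketch below shows only the factorization step on a synthetic matrix; it is not the paper's enhancement pipeline, and the matrix is random rather than a real spectrogram.

```python
# Sketch: NMF factorization of a magnitude-spectrogram-like matrix
# V (freq x time) into basis spectra W and activations H, V ~ W @ H.
# The matrix and component count are synthetic/illustrative.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(5)
V = np.abs(rng.normal(size=(257, 100)))   # 257 freq bins x 100 frames

model = NMF(n_components=8, init="nndsvda", random_state=0, max_iter=400)
W = model.fit_transform(V)   # basis spectra, shape (257, 8)
H = model.components_        # activations,   shape (8, 100)
print(W.shape, H.shape)
```

In a speech-enhancement setting, bases would typically be pre-trained on speech and noise separately, and the speech-only reconstruction W_speech @ H_speech used for resynthesis or feature extraction.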

Research paper thumbnail of The CRSS systems for the 2010 NIST speaker recognition evaluation

This document briefly describes the systems submitted by the Center for Robust Speech Systems (CRSS) at The University of Texas at Dallas (UTD) to the 2010 NIST Speaker Recognition Evaluation. Our systems primarily use factor analysis as a feature extractor [1] and a support vector machine (SVM) classification framework. Our main focus in the evaluation is on the telephone trials in the core condition and the 10-second train-test condition. Novel elements in our system include a supervised probabilistic principal component analysis (SPPCA) based approach for factor analysis, and an algorithm for optimal selection of negative samples for training the SVM.

Research paper thumbnail of UTD-CRSS SYSTEMS FOR NIST LANGUAGE RECOGNITION EVALUATION 2011

Research paper thumbnail of A Linguistic Data Acquisition Front-End for Language Recognition Evaluation

One of the major challenges for language identification (LID) systems comes from sparse training data. Manually collecting linguistic data in a controlled studio is usually expensive and impractical, but multilingual broadcast programs (Voice of America, for instance) can be collected as a reasonable alternative. However, unlike studio-collected linguistic data, broadcast programs usually contain much content other than pure linguistic data: foreground/background music, commercials, and everyday noise. In this study, a systematic processing approach is proposed to extract the linguistic data from broadcast media. Experimental results obtained on NIST LRE 2009 data show that the proposed method provides a 22.2% relative improvement in segmentation accuracy and a 20.5% relative improvement in LID accuracy.
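One simple cue for separating music-like content from wideband speech/noise in broadcast audio is spectral flatness: tonal, music-like frames have peaky spectra and low flatness, while wideband frames are flatter. The sketch below illustrates this cue only; it is a toy discriminator on synthetic frames, not the systematic segmentation approach proposed in the paper.

```python
# Sketch: spectral flatness (geometric / arithmetic mean of the
# magnitude spectrum) as a music-vs-wideband cue. Frames are synthetic:
# white noise vs a 440 Hz tone; thresholds/frame sizes are illustrative.
import numpy as np

def spectral_flatness(frame):
    mag = np.abs(np.fft.rfft(frame)) + 1e-10
    return np.exp(np.mean(np.log(mag))) / np.mean(mag)

rng = np.random.default_rng(6)
noise_frame = rng.normal(0, 1, 512)                             # flat spectrum
tone_frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)   # peaky spectrum

# Wideband content yields much higher flatness than tonal content.
print(spectral_flatness(noise_frame) > spectral_flatness(tone_frame))
```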

Research paper thumbnail of AN INVESTIGATION ON BACK-END FOR SPEAKER RECOGNITION IN MULTI-SESSION ENROLLMENT

This study explores various back-end classifiers for robust speaker recognition with multi-session enrollment, with an emphasis on optimal utilization and organization of the speaker information present in the development data. Our objective is to construct a highly discriminative back-end framework by fusing several back-ends within an i-vector system framework. It is demonstrated that, by using different information/data configurations and modeling schemes, the performance of the fused system can be significantly improved compared to an individual system using a single front-end and back-end. Averaged across both genders, we obtain relative improvements in EER and minDCF of 56.5% and 49.4%, respectively. The consistent performance gains obtained using the proposed strategy validate its effectiveness. This system is part of CRSS' NIST SRE 2012 submission.