John Hansen - Academia.edu

Papers by John Hansen

DSP for In-Vehicle and Mobile Systems

An improved cluster model selection method for agglomerative hierarchical speaker clustering using incremental Gaussian mixture models

Interspeech 2010, 2010

In this paper, we improve our previous cluster model selection method for agglomerative hierarchical speaker clustering (AHSC) based on incremental Gaussian mixture models (iGMMs). In the previous work, we measured the likelihood of all the data points in a given cluster for each mixture component of the GMM modeling the cluster. Then, we selected the N-best component Gaussians with the highest likelihoods to refine the GMM for better cluster representation. N was chosen empirically, but it is difficult to set an optimal N universally. In this work, we propose an improved method that adaptively selects component Gaussians from the GMM under consideration by measuring the degree of representativeness of each Gaussian component, which we define in this paper. Experiments on two data sets including 17 meeting speech excerpts verify that the proposed approach improves the overall clustering performance by approximately 20% and 10% (relative), respectively, compared to the previous method. Index Terms: agglomerative hierarchical speaker clustering, incremental Gaussian mixture model, cluster model selection, degree of representativeness. (A single data point in this paper corresponds to a mel-frequency cepstral coefficient (MFCC).)
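
A minimal sketch of the adaptive selection idea, assuming the paper's "degree of representativeness" can be proxied by each component's average posterior responsibility over the cluster's frames (the paper's own definition is not reproduced here, and the keep-above-the-mean rule is an illustrative choice, not the published one):

```python
# Illustrative proxy only: average responsibility stands in for the paper's
# "degree of representativeness"; the adaptive keep-rule replaces a fixed N-best.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_representative_components(frames, n_components=8):
    """frames: (n_frames, n_dims) MFCC vectors belonging to one cluster."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(frames)
    rep = gmm.predict_proba(frames).mean(axis=0)   # per-component responsibility
    keep = rep >= rep.mean()                       # adaptive, not a fixed N
    weights = gmm.weights_[keep] / gmm.weights_[keep].sum()  # renormalise
    return gmm.means_[keep], gmm.covariances_[keep], weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 13))                # stand-in 13-dim MFCC frames
means, covs, w = select_representative_components(frames)
print(f"kept {len(w)} of 8 components")
```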

Bilateral Cochlear Implant Processing of Coding Strategies With CCi-MOBILE, an Open-Source Research Platform

IEEE/ACM Transactions on Audio, Speech, and Language Processing

While speech understanding for cochlear implant (CI) users in quiet is relatively effective, listeners experience difficulty identifying speaker and sound location. To support better residual hearing abilities and speech intelligibility, bilateral and bimodal forms of assisted hearing are becoming popular among CI users. Effective bilateral processing calls for testing precise algorithm synchronization and fitting between the left and right ear channels in order to capture interaural time and level difference cues (ITDs and ILDs). This work demonstrates bilateral implant algorithm processing using a custom-made CI research platform, CCi-MOBILE, which is capable of capturing precise source localization information and supports researchers in testing bilateral CI processing in real-time naturalistic environments. Simulation-based, objective, and subjective testing has been performed to validate the accuracy of the platform. The subjective test results produced an RMS error of ±8.66° for source localization, which is comparable to the performance of commercial CI processors.
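
The ITD/ILD cues the platform is validated on can be illustrated with a toy estimator; the cross-correlation peak and RMS ratio below are textbook approximations, not CCi-MOBILE code:

```python
import numpy as np

def estimate_itd_ild(left, right, fs):
    """Estimate ITD (s, positive = left ear leads) and ILD (dB) from two channels."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)       # lag of the correlation peak
    itd = -lag / fs                                # positive when left leads
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    ild = 20 * np.log10(rms(left) / rms(right))    # interaural level difference
    return itd, ild

rng = np.random.default_rng(0)
fs = 16000
left = rng.normal(size=1600)                       # 100 ms noise burst
d = int(0.0005 * fs)                               # 0.5 ms interaural delay
right = np.concatenate([np.zeros(d), left[:-d]]) * 10 ** (-6 / 20)  # -6 dB, delayed
itd, ild = estimate_itd_ild(left, right, fs)
print(f"ITD {itd * 1e3:.2f} ms, ILD {ild:.1f} dB")
```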

Score-Aging Calibration for Speaker Verification

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016

The gradual changes that occur in the human voice due to aging create challenges for speaker verification. This study presents an approach to calibrating the output scores of a speaker verification system using the time interval between comparison samples as additional information. Several functions are proposed for incorporating this time information, which is viewed as aging information, into a conventional linear score calibration transformation. Experiments are presented on data with short-term aging intervals ranging between 2 months and 3 years, and long-term aging intervals of up to 30 years. The aging calibration proposal is shown to offset the decreased discrimination and calibration performance for both short- and long-term intervals, and to extrapolate well to unseen aging intervals. Relative reductions in Cllr (cost of log-likelihood ratio) of 1-4% and 10-43% are obtained at short- and long-term intervals, respectively. Assuming that a system has knowledge of the time interval between the samples under comparison, this approach represents a straightforward means of compensating for the detrimental impact of aging on speaker verification performance.
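
One of the simplest functions in this family, augmenting linear score calibration with a term in the elapsed time Δt, can be sketched as follows; the synthetic data, the trained weights, and the linear-in-Δt choice are illustrative assumptions, not the paper's proposals:

```python
# Linear score calibration with an added aging term, trained as logistic
# regression so its log-odds output is a calibrated log-likelihood ratio.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
dt = rng.uniform(0, 30, n)                         # years between the two samples
target = rng.integers(0, 2, n)                     # 1 = same-speaker trial
raw = np.where(target == 1, 2.0 - 0.05 * dt, -2.0) + rng.normal(0, 1, n)

cal = LogisticRegression().fit(np.column_stack([raw, dt]), target)
w_score, w_age = cal.coef_[0]
w0 = cal.intercept_[0]

def calibrated_llr(score, interval_years):
    return w0 + w_score * score + w_age * interval_years

print(f"llr at 0.2 y: {calibrated_llr(1.0, 0.2):.2f}, at 25 y: {calibrated_llr(1.0, 25):.2f}")
```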

Phonetic variability constrained bottleneck features for joint speaker recognition and physical task stress detection

The Journal of the Acoustical Society of America

Noise update modeling for speech enhancement: when do we do enough?

In speech enhancement, it is generally assumed that updating the noise estimate on a frame-by-frame basis yields the highest level of enhancement performance. However, for many noise types and environmental conditions, a frame-by-frame update is not necessary to achieve superior performance if the noise structure does not change rapidly. For applications where compute/memory resources are limited, better overall speech performance could be achieved if a more reasonable update rate is estimated, so that the available compute/memory resources can be devoted to the enhancement algorithm itself. In this study, we propose a framework to model the noise structure with the goal of determining the best update rate required to achieve a given quality for speech enhancement. Speech systems generally develop specialized solutions for noise which are unique to each application (e.g., recognition, speaker ID, enhancement, etc.). Here we propose a model to predict the noise update rate required to achieve a given quality for enhancement. We evaluate the algorithm across a corpus of four noise types under different levels of degradation. The error between the mean observed and the mean predicted Itakura-Saito (IS) quality values is typically between 0.06 and 1.78 IS for our model-selected noise frame update rate of 1 frame every 5 frames using the Log-MMSE enhancement scheme. Finally, we consider mobile and resource-limited applications where such a framework would be useful.
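
The core trade-off can be sketched as an enhancement loop that refreshes its noise estimate only every K frames; for brevity a power spectral-subtraction gain stands in for Log-MMSE here, and the leaky-average update and noise-only initialization are illustrative assumptions:

```python
import numpy as np

def enhance(frames_psd, update_every=5, alpha=0.9, floor=0.05):
    """frames_psd: (n_frames, n_bins) noisy power spectra."""
    noise = frames_psd[:10].mean(axis=0)           # assume first 10 frames are noise
    out = np.empty_like(frames_psd)
    for t, psd in enumerate(frames_psd):
        if t % update_every == 0:                  # periodic, not per-frame, update
            noise = alpha * noise + (1 - alpha) * psd
        gain = np.maximum(1.0 - noise / np.maximum(psd, 1e-12), floor)
        out[t] = gain * psd
    return out

rng = np.random.default_rng(0)
noisy = rng.chisquare(2, size=(100, 257))          # stand-in noisy power spectra
print(enhance(noisy, update_every=5).shape)        # 1 update every 5 frames
```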

In-set/out-of-set speaker identification based on discriminative speech frame selection

Interspeech 2005, 2005

In this paper, we propose a novel discriminative speech frame selection (DSFS) scheme for the problem of in-set/out-of-set speaker identification, which seeks to decrease the similarity between speaker models and the background model (or anti-speaker model) and increase the accuracy of speaker identification. The DSFS scheme consists of two steps: speech frame analysis and discriminative frame selection. Two methods are used to perform DSFS: (i) a Teager Energy Operator (TEO) energy based method and (ii) a MELP pitch based method. An evaluation using both clean and noisy corpora that include single and multiple recording sessions shows that both the TEO energy based and MELP pitch based DSFS schemes can reduce the EER (equal error rate) dramatically over a traditional GMM-UBM baseline system. Compared with traditional GMM speaker identification, DSFS selects only discriminative speech frames, and therefore considers only discriminative features. This selection decreases the overlap between the speaker models and the background model, and improves the performance of in-set/out-of-set speaker identification.
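
The TEO energy underlying the first method is ψ[x](n) = x(n)² − x(n−1)·x(n+1); the frame-selection rule below (keep frames above the median TEO energy) is only a stand-in for the paper's discriminative criterion:

```python
import numpy as np

def teager_energy(x):
    """Teager Energy Operator applied sample-wise to a frame."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def select_frames(signal, frame_len=400, hop=160):
    starts = range(0, len(signal) - frame_len, hop)
    frames = np.stack([signal[s:s + frame_len] for s in starts])
    teo = np.array([teager_energy(f).mean() for f in frames])
    return frames[teo > np.median(teo)]            # keep high-TEO frames

rng = np.random.default_rng(0)
sig = rng.normal(size=16000)                       # 1 s stand-in signal at 16 kHz
print(select_frames(sig).shape)
```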

A study on deep neural network acoustic model adaptation for robust far-field speech recognition

Interspeech 2015, 2015

Even though deep neural network acoustic models provide an increased degree of robustness in automatic speech recognition, there is still a large performance drop in the task of far-field speech recognition in reverberant and noisy environments. In this study, we explore DNN adaptation techniques to achieve improved robustness to environmental mismatch for far-field speech recognition. In contrast to many recent studies investigating the role of feature processing in DNN-HMM systems, we focus on adaptation of a clean-trained DNN model to speech data captured by a distant-talking microphone in a target environment with substantial reverberation and noise. We show that significant performance gains can be obtained by discriminatively estimating a set of adaptation parameters to compensate for the mismatch between a clean-trained model and a small set of noisy and reverberant adaptation data. Using various adaptation strategies, relative word error rate improvements of up to 16% could be obtained on the single-channel task of the recent ASpIRE challenge.
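
One strategy in this family, a linear input network, freezes the clean-trained DNN and discriminatively estimates only a small input transform on the adaptation data; the dimensions, data, and the choice of this particular strategy are illustrative, not the paper's exact setup:

```python
import torch
import torch.nn as nn

feat_dim, n_senones = 40, 500
clean_dnn = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                          nn.Linear(256, n_senones))   # stands in for the clean AM
for p in clean_dnn.parameters():
    p.requires_grad = False                            # clean model stays fixed

lin = nn.Linear(feat_dim, feat_dim)                    # the adaptation parameters
nn.init.eye_(lin.weight); nn.init.zeros_(lin.bias)     # start from identity
opt = torch.optim.SGD(lin.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

feats = torch.randn(512, feat_dim)                     # noisy/reverberant features
labels = torch.randint(0, n_senones, (512,))           # aligned senone targets
for _ in range(10):                                    # few passes: adaptation set is small
    opt.zero_grad()
    loss = ce(clean_dnn(lin(feats)), labels)           # discriminative objective
    loss.backward(); opt.step()
print(f"adaptation loss: {loss.item():.3f}")
```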

Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model

IEEE Transactions on Audio, Speech, and Language Processing, 2010

Generalized parametric spectral subtraction using weighted Euclidean distortion

Interspeech 2008, 2008

An improved version of the original parametric formulation of the generalized spectral subtraction method is presented in this study. The original formulation uses parameters that minimize the mean-square error (MSE) between the estimated and true speech spectral amplitudes. However, the MSE does not take any perceptual measure into account. We propose two new short-time spectral amplitude estimators based on a perceptual error criterion, the weighted Euclidean distortion. The error function is easily adaptable to penalize spectral peaks and valleys differently. Performance evaluations were performed using two noise types over four SNR levels and compared to the original parametric formulation. Results demonstrate that in most cases the proposed estimators achieve greater noise suppression without introducing speech distortion.
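
The parametric form being improved is |Ŝ|ᵃ = max(|Y|ᵃ − α·|N̂|ᵃ, β·|N̂|ᵃ); the study derives its parameters from the weighted Euclidean distortion rather than MSE, whereas the fixed α and β below are placeholders:

```python
import numpy as np

def generalized_spectral_subtraction(noisy_mag, noise_mag, a=2.0, alpha=2.0, beta=0.01):
    """Generalized (parametric) spectral subtraction on magnitude spectra."""
    sub = noisy_mag ** a - alpha * noise_mag ** a
    floor = beta * noise_mag ** a                  # spectral floor vs. musical noise
    return np.maximum(sub, floor) ** (1.0 / a)

rng = np.random.default_rng(0)
Y = np.abs(rng.normal(1.0, 0.3, 257))              # stand-in noisy magnitudes
N = np.full(257, 0.4)                              # stand-in noise estimate
print(generalized_spectral_subtraction(Y, N)[:5])
```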

Tagging child-adult interactions in naturalistic, noisy, daylong school environments using i-vector based diarization system

SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in Education, 2019

Assessing child growth in terms of speech and language is a crucial indicator of long-term learning ability and lifelong progress. Since the preschool classroom provides a potent opportunity for monitoring growth in young children's interactions, analyzing such data has come into prominence for early childhood researchers. The foremost task in any analysis of such naturalistic recordings is parsing and tagging the interactions between adults and young children. An automated tagging system provides child interaction metrics and is important for any further processing. This study investigates the language environment of 3-5 year old children using a CRSS-based diarization strategy employing an i-vector-based baseline that captures adult-to-child or child-to-child rapid conversational turns in a naturalistic, noisy early childhood setting. We provide analysis of various loss functions and learning algorithms using deep neural networks to separate child speech from adult speech. Performance is measured in terms of diarization error rate and Jaccard error rate, and shows good results for tagging adult vs. child speech. Distinguishing between the primary and secondary child would be useful for monitoring a given child, and analysis is provided for this as well. Our diarization system provides insights into directions for preprocessing and analyzing challenging naturalistic daylong child speech recordings.

Active Learning Based Constrained Clustering For Speaker Diarization

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017

Most speaker diarization research has focused on unsupervised scenarios, where no human supervision is available. However, in many real-world applications, a certain amount of human input could be expected, especially when minimal human supervision brings significant performance improvement. In this study, we propose an active learning based bottom-up speaker clustering algorithm to effectively improve speaker diarization performance with limited human input. Specifically, the proposed active learning based speaker clustering has two stages: explore and constrained clustering. The explore stage quickly discovers at least one sample for each speaker, boosting the speaker clustering process with reliable initial speaker clusters. After discovering all, or a majority of, the involved speakers during the explore stage, constrained clustering is performed. Constrained clustering is similar to the traditional bottom-up clustering process, with the important difference that the clusters created during the explore stage are restricted from merging with each other. Constrained clustering continues until only the clusters generated from the explore stage are left. Since the objective of the active learning based speaker clustering algorithm is to provide good initial speaker models, performance saturates as soon as sufficient examples are ensured for each cluster. To further improve diarization performance with increasing human input, we propose a second method which actively selects the speech segments that account for the largest expected speaker error in the existing cluster assignments for human evaluation and reassignment. The algorithms are evaluated on our recently created Apollo Mission Control Center dataset as well as the augmented multiparty interaction (AMI) meeting corpus. The results indicate that the proposed active learning algorithms are able to reduce the diarization error rate significantly with a relatively small amount of human supervision. Index Terms: active learning, bottom-up clustering, speaker diarization.
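
The cannot-link constraint at the heart of the constrained stage can be sketched with average-link clustering over toy embeddings; the stopping rule and the ban on merging explore-stage clusters follow the abstract's description, everything else (similarity measure, data) is illustrative:

```python
import numpy as np

def constrained_ahc(emb, seeds):
    """emb: (n, d) segment embeddings; seeds: one verified index list per speaker."""
    seeded = set()
    clusters = []                                  # each entry: (indices, is_seeded)
    for s in seeds:
        clusters.append((set(s), True)); seeded.update(s)
    clusters += [({i}, False) for i in range(len(emb)) if i not in seeded]

    def sim(a, b):                                 # average-link cosine similarity
        ca, cb = emb[list(a)].mean(0), emb[list(b)].mean(0)
        return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

    while any(not s for _, s in clusters):         # stop when only seeds remain
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i][1] and clusters[j][1]:
                    continue                       # cannot-link: both are seeded
                score = sim(clusters[i][0], clusters[j][0])
                if score > best:
                    best, pair = score, (i, j)
        i, j = pair
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] or clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c, _ in clusters]

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(m, 0.1, (5, 8)) for m in (1.0, -1.0)])  # two speakers
print(constrained_ahc(emb, seeds=[[0], [5]]))      # seeds from the explore stage
```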

Dialect separation assessment using log-likelihood score distributions

Interspeech 2008, 2008

Dialect differences within a given language represent major challenges for sustained speech system performance. For speech recognition, little if any knowledge exists on differences between dialects (e.g., vocabulary, grammar, prosody, etc.). Effective dialect classification can contribute to improved ASR, speaker ID, and spoken document retrieval. This study presents an approach to establishing a metric that estimates the separation between dialects and provides some sense of expected speech system performance. The proposed approach compares dialects based on their log-likelihood score distributions. From the score distributions, a numerical measure is obtained to assess the separation between the resulting GMM dialect models. The proposed scheme is evaluated on a corpus of Arabic dialects. The sensitivity of the dialect separation score is also quantified based on controlled mixing of dialect data for the case of measuring dialect training data purity. The resulting scheme is shown to be effective in measuring dialect distance, and represents an important objective way of assessing dialect differences within a common language.
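
A toy version of the measurement: score matched- and mismatched-dialect data against a dialect's GMM, then summarize how far apart the two score distributions sit. d-prime serves as the separation statistic here; the paper's actual measure may differ, and the scores below are synthetic:

```python
import numpy as np

def d_prime(scores_same, scores_diff):
    """Separation between matched- and mismatched-dialect score sets."""
    m1, m2 = scores_same.mean(), scores_diff.mean()
    v1, v2 = scores_same.var(), scores_diff.var()
    return (m1 - m2) / np.sqrt(0.5 * (v1 + v2))

rng = np.random.default_rng(0)
same = rng.normal(-48.0, 2.0, 1000)    # frame-avg log-likelihoods, matched dialect
diff = rng.normal(-52.0, 2.5, 1000)    # mismatched dialect
print(f"dialect separation (d'): {d_prime(same, diff):.2f}")
```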

Blind Spectral Weighting for Robust Speaker Identification under Reverberation Mismatch

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014

Room reverberation poses various deleterious effects on the performance of automatic speech systems. Speaker identification (SID) performance, in particular, degrades rapidly as reverberation time increases. Reverberation causes two forms of spectro-temporal distortion in speech signals: (i) self-masking, which is due to early reflections, and (ii) overlap-masking, which is due to late reverberation. The overlap-masking effect of reverberation has been shown to have the greater adverse impact on the performance of speech systems. Motivated by this fact, this study proposes a blind spectral weighting (BSW) technique for suppressing the reverberation overlap-masking effect on SID systems. The technique is blind in the sense that prior knowledge of neither the anechoic signal nor the room impulse response is required. Performance of the proposed technique is evaluated on speaker verification tasks under simulated and actual reverberant mismatched conditions. Evaluations are conducted in the context of conventional GMM-UBM as well as state-of-the-art i-vector based systems. The GMM-UBM experiments are performed using speech material from a new data corpus well suited to speaker verification experiments under actual reverberant mismatched conditions, entitled MultiRoom8. The i-vector experiments are carried out with microphone (interview and phone call) data from the NIST SRE 2010 extended evaluation set, digitally convolved with three different measured room impulse responses from the Aachen impulse response (AIR) database. Experimental results show that incorporating the proposed blind technique into the standard MFCC feature extraction framework yields significant improvement in SID performance under reverberation mismatch.
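
BSW's exact weighting is not reproduced here; the sketch below shows the general blind recipe such methods follow: predict late-reverberant (overlap-masking) energy from delayed, decayed past frames and down-weight the bins it dominates. The delay and decay constants are assumptions for illustration:

```python
import numpy as np

def suppress_overlap_masking(psd, delay_frames=6, decay=0.7, floor=0.1):
    """psd: (n_frames, n_bins) power spectrogram of reverberant speech."""
    late = np.zeros_like(psd)
    for t in range(delay_frames, psd.shape[0]):
        # Late reverberation modelled as a decayed copy of energy D frames back.
        late[t] = decay * late[t - 1] + (1 - decay) * psd[t - delay_frames]
    weight = np.maximum(1.0 - late / np.maximum(psd, 1e-12), floor)
    return weight * psd                            # weighted spectra feed the MFCCs

rng = np.random.default_rng(0)
psd = rng.chisquare(2, size=(200, 64))             # stand-in power spectrogram
print(suppress_overlap_masking(psd).shape)
```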

Single-channel speech separation using Soft-minimum Permutation Invariant Training

ArXiv, 2021

The goal of speech separation is to extract multiple speech sources from a single microphone recording. Recently, with the advancement of deep learning and the availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches aim to learn discriminative patterns of speech, speakers, and background noise using a supervised learning algorithm, typically a deep neural network. A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal, referred to as label permutation ambiguity. Permutation ambiguity refers to the problem of determining the output-label assignment between the separated sources and the available single-speaker speech labels. Finding the best output-label assignment is required for calculating the separation error, which is later used to update the parameters of the model. Recently, Permutation Invariant Training (PIT) has been shown to be a promising solution in ha...
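
For reference, standard (hard-minimum) PIT computes the loss as the minimum over all output-label permutations; the paper's contribution is a soft relaxation of this minimum, which is not shown here. Signals below are toy stand-ins:

```python
from itertools import permutations
import numpy as np

def pit_mse(est, ref):
    """est, ref: (n_src, n_samples). Returns min-permutation MSE and that permutation."""
    n = est.shape[0]
    return min(((np.mean((est[list(p)] - ref) ** 2), p)
                for p in permutations(range(n))), key=lambda x: x[0])

rng = np.random.default_rng(0)
ref = rng.normal(size=(2, 1000))                   # two reference sources
est = ref[[1, 0]] + 0.05 * rng.normal(size=ref.shape)  # outputs in swapped order
loss, perm = pit_mse(est, ref)
print(f"PIT loss {loss:.4f} with permutation {perm}")
```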

In-vehicle based speech processing for hearing impaired subjects

In-Vehicle Based Speech Processing for Hearing Impaired Listeners. Xianxian Zhang, John H.L. Hansen, Kathryn Arehart, Jessica Rossi-Katz. Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado, Boulder ...

Missing-feature method for speaker recognition in band-restricted conditions

In this study, the missing-feature method is considered to address band-limited speech for speaker recognition. In an effort to mitigate possible degradation due to the general speaker-independent model, a two-step reconstruction scheme is developed in which speaker class independent/dependent models are used separately. An advanced marginalization in the cepstral domain is proposed, employing a high-order extension method in order to address the loss of model accuracy in the conventional method due to cepstrum truncation. To detect the cut-off regions of incoming speech, a blind mask estimation scheme is employed which uses a synthesized band-limited speech model. Experimental results on band-limited conditions indicate that our two-step reconstruction scheme with missing-feature processing is effective in improving in-set/out-of-set speaker recognition performance for band-limited speech, particularly in severely band-restricted conditions (i.e., 4.72% EER improvement in 2, 3, and 4 kHz band-limited conditions over a conventional data-driven method). The improvement of the proposed marginalization method proves its effectiveness for acoustic model conversion by employing high-order extension, showing 0.57% EER improvement over conventional marginalization.
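
Classic marginalization, the baseline the cepstral-domain variant extends: with a diagonal-covariance GMM, masked (out-of-band) dimensions are integrated out, i.e., simply dropped from each component's log-likelihood. The model, mask, and feature below are synthetic, and the high-order extension itself is not shown:

```python
import numpy as np
from scipy.stats import norm

def marginal_loglik(x, mask, weights, means, stds):
    """x: (d,) feature; mask: (d,) bool, True = reliable (in-band) dimension."""
    comp = np.array([
        np.sum(norm.logpdf(x[mask], means[k][mask], stds[k][mask]))
        for k in range(len(weights))
    ])
    return np.logaddexp.reduce(np.log(weights) + comp)   # log sum_k w_k * p_k

rng = np.random.default_rng(0)
d, K = 20, 4
weights = np.full(K, 1 / K)
means, stds = rng.normal(size=(K, d)), np.ones((K, d))
x = means[2] + 0.1 * rng.normal(size=d)
mask = np.arange(d) < 12                           # upper bins lost to band-limiting
print(marginal_loglik(x, mask, weights, means, stds))
```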

Audio-Visual Isolated Digit Recognition For Whispered Speech

Proceedings of EUSIPCO, Barcelona, Spain, 2011

A Systematic Strategy For Robust Automatic Dialect Identification

Proceedings of EUSIPCO, Barcelona, Spain, 2011

Improved "TEO" feature-based automatic stress detection using physiological and acoustic speech sensors

The acoustic pressure microphone has served as the primary instrument for collecting speech data for automatic speech recognition systems. The acoustic microphone suffers from limitations, such as sensitivity to background noise and relatively distant placement from the speech production organs. Alternative speech collection sensors may serve to enhance the effectiveness of automatic speech recognition systems. In this study, we first consider an experimental evaluation of the TEO-CB-AutoEnv feature in an actual law enforcement training scenario, examining the feature's relation to stress level assessment over time. Next, we explore the use of the physiological microphone, a gel-based device placed next to the vocal folds on the outside of the throat to measure vibrations of the vocal tract while minimizing background noise, and investigate the effectiveness of a TEO-CB-AutoEnv-based automatic stress recognition system. We employ the acoustic and physiological sensors both as stand-alone speech data collection devices and concurrently. For the latter, we devise a weighted composite decision scheme using both the acoustic and physiological microphone data that yields relative average error rate reductions of 32% and 6% versus sole employment of the acoustic and physiological microphone data, respectively, in a realistic stressful environment.
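
The weighted composite decision can be sketched as a convex combination of the two sensors' stress scores; the weight, threshold, and score scale below are illustrative, not the paper's tuned fusion:

```python
import numpy as np

def fused_decision(score_acoustic, score_pmic, w=0.6, threshold=0.5):
    """Scores in [0, 1] from each sensor's stress classifier; returns (stressed?, fused)."""
    fused = w * score_acoustic + (1 - w) * score_pmic
    return fused >= threshold, fused

rng = np.random.default_rng(0)
acoustic = rng.uniform(size=5)                     # fake per-utterance scores
pmic = rng.uniform(size=5)                         # physiological-microphone scores
for a, p in zip(acoustic, pmic):
    print(fused_decision(a, p))
```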
