Ravichander Vipperla - Academia.edu

Papers by Ravichander Vipperla

Robust speech recognition in multi-source noise environments using convolutive non-negative matrix factorization

Convolutive non-negative matrix factorization (CNMF) is an effective approach for supervised audio source separation. It relies on the availability of sufficient training data to learn a set of bases for each acoustic source. For automatic speech recognition (ASR) in a multi-source noise environment, the varied nature of background noise makes it a challenging task to learn the noise bases and thereby to suppress it from the speech signal using CNMF. A large amount of training data is required to reliably capture noise variation, but this generally leads to an unacceptable computational burden. Here, we address this problem by learning the noise bases using a computationally efficient, online CNMF approach. By learning the noise bases from several hours of ambient noise data and over a few seconds of local acoustic context, we show that background noise can be effectively attenuated from noisy speech. ASR accuracies on the CHiME corpus with the denoised speech show relative impr...
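
As a rough illustration of what "convolutive" bases mean here, the sketch below reconstructs a spectrogram-like matrix from a set of time-shifted basis matrices and one activation matrix, using the commonly stated CNMF model V ≈ Σ_t W(t) · shift(H, t). The function names, the matrix layout and the toy numbers are assumptions made for this example; the abstract does not describe the authors' implementation, and no learning step is shown.

```python
def shift_right(H, t):
    """Shift every row of H right by t columns, zero-filling on the left."""
    n = len(H[0])
    return [[0.0] * min(t, n) + row[: max(n - t, 0)] for row in H]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def cnmf_reconstruct(W, H):
    """Approximate V (F x N) from W, a list of T matrices (F x K), and H (K x N).

    W[t] holds the t-th time slice of every basis pattern, so each basis
    spans T consecutive frames instead of a single frame as in plain NMF.
    """
    F, N = len(W[0]), len(H[0])
    V = [[0.0] * N for _ in range(F)]
    for t, W_t in enumerate(W):
        part = matmul(W_t, shift_right(H, t))
        for f in range(F):
            for n in range(N):
                V[f][n] += part[f][n]
    return V


# Toy case: one basis (K=1) spanning two frames (T=2) over two frequency bins.
W = [[[1], [0]],   # slice t=0: energy in bin 0
     [[0], [2]]]   # slice t=1: energy in bin 1, one frame later
H = [[1, 0, 3]]    # the pattern is activated at frames 0 and 2
V_hat = cnmf_reconstruct(W, H)
```

In a separation setting of the kind the abstract describes, separate sets of W would be learned for speech and for noise, and the noise part of the reconstruction would be attenuated; that step is not shown here.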

Phone adaptive training for speaker diarization

The linguistic content of a speech signal is a source of unwanted variation which can degrade speaker diarization performance. This paper presents our latest work to reduce its impact. The new approach, referred to as Phone Adaptive Training (PAT), is analogous to speaker adaptive training used in automatic speech recognition. We report an oracle experiment which shows that PAT has the potential to deliver a 33% relative improvement in the diarization error rate of our baseline system. Practical experiments show significant improvements across two standard, independent evaluation datasets.
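
Several abstracts on this page quote "relative" improvements in an error rate, as opposed to absolute differences. The small sketch below shows the arithmetic. The baseline and improved values are hypothetical numbers chosen only to produce a 33% relative figure; they are not results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Fraction of the baseline error that was removed."""
    return (baseline_error - new_error) / baseline_error


def absolute_improvement(baseline_error, new_error):
    """Plain difference between the two error rates, in the same units."""
    return baseline_error - new_error


# Hypothetical diarization error rates in percent, for illustration only.
baseline_der = 20.0
improved_der = 13.4
rel = relative_improvement(baseline_der, improved_der)   # about 0.33
diff = absolute_improvement(baseline_der, improved_der)  # about 6.6 points
```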

Convolutive non-negative sparse coding and new features for speech overlap handling in speaker diarization

The effective handling of overlapping speech is at the limits of the current state of the art in speaker diarization. This paper presents our latest work in overlap detection. We report the combination of features derived through convolutive non-negative sparse coding and new energy, spectral and voicing-related features within a conventional HMM system. Overlap detection results are fully integrated into our top-down diarization system through the application of overlap exclusion and overlap labeling. Experiments on a subset of the AMI corpus show that the new system delivers significant reductions in missed speech and speaker error. Through overlap exclusion and labelling the overall diarization error rate is shown to improve by 6.4% relative.

A new speaker verification spoofing countermeasure based on local binary patterns

This paper presents a new countermeasure for the protection of automatic speaker verification systems from spoofed, converted voice signals. The new countermeasure is based on the analysis of a sequence of acoustic feature vectors using Local Binary Patterns (LBPs). Compared to existing approaches the new countermeasure is less reliant on prior knowledge and affords robust protection from not only voice conversion, for which it is optimised, but also spoofing attacks from speech synthesis and artificial signals, all of which otherwise provoke significant increases in false acceptance. The work highlights the difficulty in detecting converted voice and also discusses the need for formal evaluations to develop new countermeasures which are less reliant on prior knowledge and thus more reflective of practical use cases.
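
For readers unfamiliar with Local Binary Patterns, the sketch below applies the generic 8-neighbour LBP operator from image analysis to a small matrix, which could be read as feature dimensions by frames. This is an assumption made for illustration: the abstract does not say how the operator is configured or how the resulting codes are used, and the bit ordering chosen here is arbitrary.

```python
# Neighbour offsets (row, col), clockwise starting at the top-left neighbour.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(matrix):
    """Return an 8-bit LBP code for every interior cell of a 2-D matrix.

    Bit i is set when the i-th neighbour is greater than or equal to
    the centre value. Border cells are skipped.
    """
    rows, cols = len(matrix), len(matrix[0])
    codes = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            center = matrix[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(OFFSETS):
                if matrix[r + dr][c + dc] >= center:
                    code |= 1 << bit
            out_row.append(code)
        codes.append(out_row)
    return codes


# Toy 3x3 patch: only the centre cell has a full neighbourhood.
patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
codes = lbp_codes(patch)
```

A histogram of such codes over an utterance is one common way to turn them into a fixed-length descriptor, though whether the paper does this is not stated in the abstract.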

Speech overlap detection and attribution using convolutive non-negative sparse coding

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

This paper presents recent advances in the application of convolutive non-negative sparse coding (CNSC) to the problem of overlap detection in the context of conference meetings and speaker diarization. CNSC is used to project a mixed speaker signal onto separate speaker bases and hence to detect intervals of competing speech. We present new energy ratio and total energy features which give significant improvements over our previous work. The system is assessed using a subset of the AMI meeting corpus. We report results which are comparable to the state of the art and which support the potential of a new approach to overlap detection. An analysis of system performance highlights the importance of further work to address weaknesses in detecting particularly short segments of overlapping speech.

Speech Input from Older Users in Smart Environments: Challenges and Perspectives

Lecture Notes in Computer Science, 2009

Although older people are an important user group for smart environments, there has been relatively little work on adapting natural language interfaces to their requirements. In this paper, we focus on a particularly thorny problem: processing speech input from older users. Our experiments on the MATCH corpus show clearly that we need age-specific adaptation in order to recognize older users' speech reliably. Language models need to cover typical interaction patterns of older people, and acoustic models need to accommodate older voices. Further research is needed into intelligent adaptation techniques that will allow existing large, robust systems to be adapted with relatively small amounts of in-domain, age-appropriate data. In addition, older users need to be supported with adequate strategies for handling speech recognition errors.

Evolutionary discriminative confidence estimation for spoken term detection

Multimedia Tools and Applications, 2013

Spoken term detection (STD) is the task of searching for occurrences of spoken terms in audio archives. It relies on robust confidence estimation to make a hit/false alarm (FA) decision. In order to optimize the decision in terms of the STD evaluation metric, the confidence has to be discriminative. Multi-layer perceptrons (MLPs) and support vector machines (SVMs) exhibit good performance in producing discriminative confidence; however, they are severely limited by the continuous objective functions, and are therefore less capable of dealing with complex decision tasks. This leads to a substantial performance reduction when measuring detection of out-of-vocabulary (OOV) terms, where the high diversity in term properties usually leads to a complicated decision boundary. In this paper we present a new discriminative confidence estimation approach based on evolutionary discriminant analysis (EDA). Unlike MLPs and SVMs, EDA uses the classification error as its objective function, resulting in a model optimized towards the evaluation metric. In addition, EDA combines heterogeneous projection functions and classification strategies in decision making, leading to a highly flexible classifier that is capable of dealing with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence with a state-of-the-art phoneme-based STD system on an English meeting domain corpus, which employs a phoneme speech recognition system to produce lattices within which the phoneme sequences corresponding to the enquiry terms are searched. The test corpora comprise 11 h of speech data recorded with individual head-mounted microphones from 30 meetings carried out at several institutes including ICSI; NIST; ISL; LDC; the Virginia Polytechnic Institute and State University; and the University of Edinburgh.
The experimental results demonstrate that EDA considerably outperforms MLPs and SVMs on both classification and confidence measurement in STD, and the advantage is found to be more significant on OOV terms than on in-vocabulary (INV) terms. In terms of classification performance, EDA achieved an equal error rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and SVMs respectively; for INV terms, an EER of 15% was obtained with EDA compared to 17% obtained with MLPs and SVMs. In terms of STD performance for OOV terms, EDA presented a significant relative improvement of 1.4% and 2.5% in terms of average term-weighted value (ATWV) over MLPs and SVMs respectively.

Online Non-Negative Convolutive Pattern Learning for Speech Signals

IEEE Transactions on Signal Processing, 2000

The unsupervised learning of spectro-temporal patterns within speech signals is of interest in a broad range of applications. Where patterns are non-negative and convolutive in nature, relevant learning algorithms include convolutive non-negative matrix factorization (CNMF) and its sparse alternative, convolutive non-negative sparse coding (CNSC). Both algorithms, however, place unrealistic demands on computing power and memory which prohibit their application in large scale tasks. This paper proposes a new online implementation of CNMF and CNSC which processes input data piece-by-piece and updates learned patterns gradually with accumulated statistics. The proposed approach facilitates pattern learning with huge volumes of training data that are beyond the capability of existing alternatives. We show that, with unlimited data and computing resources, the new online learning algorithm almost surely converges to a local minimum of the objective cost function. In more realistic situations, where the amount of data is large and computing power is limited, online learning tends to obtain lower empirical cost than conventional batch learning.

Ageing Voices: The Effect of Changes in Voice Parameters on ASR Performance

EURASIP Journal on Audio, Speech, and Music Processing, 2010

With ageing, human voices undergo several changes which are typically characterized by increased hoarseness and changes in articulation patterns. In this study, we have examined the effect on Automatic Speech Recognition (ASR) and found that the Word Error Rate (WER) on older voices is 10% absolute higher than that on adult voices. Subsequently, we compared several voice source parameters including fundamental frequency, jitter, shimmer, harmonicity, and cepstral peak prominence of adult and older ...

Convolutive Non-Negative Sparse Coding and New Features for Speech Overlap Handling in Speaker Diarization

The effective handling of overlapping speech is at the limits of the current state of the art in speaker diarization. This paper presents our latest work in overlap detection. We report the combination of features derived through convolutive non-negative sparse coding and new energy, spectral and voicing-related features within a conventional HMM system. Overlap detection results are fully integrated into our top-down diarization system through the application of overlap exclusion and overlap labeling. Experiments on a subset of the ...

Spoofing countermeasures for the protection of automatic speaker recognition systems against attacks with artificial signals

The vulnerability of automatic speaker verification systems to imposture or spoofing is widely acknowledged. This paper shows that extremely high false alarm rates can be provoked by simple spoofing attacks with artificial, non-speech-like signals and highlights the need for spoofing countermeasures. We show that two new, but trivial countermeasures based on higher-level, dynamic features and voice quality assessment offer varying degrees of protection and that further work is needed to develop more robust spoofing countermeasure mechanisms. Finally, we show that certain classifiers are inherently more robust to such attacks than others which strengthens the case for fused-system approaches to automatic speaker verification.

On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals

Automatic speaker verification (ASV) systems are increasingly being used for biometric authentication even if their vulnerability to imposture or spoofing is now widely acknowledged. Recent work has proposed different spoofing approaches which can be used to test vulnerabilities. This paper introduces a new approach based on artificial, tone-like signals which provoke higher ASV scores than genuine client tests. Experimental results show degradations in the equal error rate from 8.5% to 77.3% and from 4.8% to 64.3% for standard Gaussian mixture model and factor analysis based ASV systems respectively. These findings demonstrate the importance of efforts to develop dedicated countermeasures, some of them trivial, to protect ASV systems from spoofing.
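
The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below estimates it from two lists of scores by sweeping the observed scores as thresholds. It is a simple approximation written for this page, not the evaluation tool used in the paper, and all scores are made-up values.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER by trying every observed score as the accept threshold.

    A trial is accepted when its score is >= threshold. The function
    returns the mean of FAR and FRR at the threshold where they are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores: ordinary impostors mostly score below genuine clients.
genuine_scores = [0.9, 0.8, 0.7, 0.3]
impostor_scores = [0.6, 0.4, 0.2, 0.1]
baseline_eer = equal_error_rate(genuine_scores, impostor_scores)

# Made-up spoofing trials that all score above the genuine clients.
spoofed_eer = equal_error_rate([0.5, 0.6], [0.9, 0.95])
```

When attack scores exceed genuine scores, as the abstract reports for tone-like signals, no threshold can separate the two classes and the estimated EER rises above 50%.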

Age recognition for spoken dialogue systems: Do we need it?

Proc. Interspeech, 2009

When deciding whether to adapt relevant aspects of the system to the particular needs of older users, spoken dialogue systems often rely on automatic detection of chronological age. In this paper, we show that vocal ageing as measured by acoustic features is an ...

Direct posterior confidence for out-of-vocabulary spoken term detection

ACM Transactions on Information Systems, 2012

Spoken term detection (STD) is a fundamental task in spoken information retrieval. Compared to conventional speech transcription and keyword spotting, STD is an open-vocabulary task and is necessarily required to address out-of-vocabulary (OOV) terms. Approaches based on subword units, e.g. phonemes, are widely used to solve the OOV issue; however, performance on OOV terms is still significantly inferior to that for in-vocabulary (INV) terms.
