Chalapathy Neti - Academia.edu
Papers by Chalapathy Neti
2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), 2000
Introduces a practical system that aims to detect a user's intent to speak to a computer by considering both audio and visual cues. The whole system is designed to intuitively turn on the microphone for speech recognition without needing to click on a ...
Proceedings of the IEEE, 2003
Visual speech information from the speaker's mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and the incorporation of modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audio-visual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large-vocabulary tasks.
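The fusion ideas reviewed above are commonly realized by combining per-stream log-likelihoods with exponents that reflect each modality's reliability. The Python sketch below illustrates only that generic idea; the function and variable names are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """Weighted combination of per-class log-likelihoods from two streams.

    audio_weight is the audio stream exponent in [0, 1]; the visual stream
    receives 1 - audio_weight.
    """
    lam = float(np.clip(audio_weight, 0.0, 1.0))
    return lam * np.asarray(audio_ll) + (1.0 - lam) * np.asarray(visual_ll)

# Example: three candidate classes, with noisy audio down-weighted to 0.6.
audio_ll = np.array([-12.3, -15.1, -14.0])
visual_ll = np.array([-20.5, -18.2, -22.7])
combined = fuse_log_likelihoods(audio_ll, visual_ll, audio_weight=0.6)
best_class = int(np.argmax(combined))  # index of the best-scoring class
```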
2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004
Much progress has been achieved during the past two decades in audio-visual automatic speech recognition (AVASR). However, challenges persist that hinder AVASR deployment in practical situations, most notably robust and fast extraction of visual speech features. We review our effort in overcoming this problem, based on an appearance-based visual feature representation of the speaker's mouth region. In particular: (a) We discuss AVASR in realistic, visually challenging domains, where lighting, background, and head pose vary significantly. To enhance visual front-end robustness in such environments, we employ an improved statistics-based face detection algorithm that significantly outperforms our baseline scheme. However, visual-only recognition remains inferior to that on visually "clean" (studio-like) data, thus demonstrating the importance of accurate mouth region extraction. (b) We then consider a wearable audio-visual sensor to directly capture the mouth region, thus eliminating face detection. Its use improves visual-only recognition, even over full-face videos recorded in the studio-like environment. (c) Finally, we address the speed issue in visual feature extraction by discussing our real-time AVASR prototype implementation. The reported progress demonstrates the feasibility of practical AVASR.
2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), 2004
This paper looks into the information fusion problem in the context of audio-visual speech recognition. Existing approaches to audio-visual fusion typically address the problem in either the feature domain or the decision domain. In this work, we consider a hybrid approach that aims to take advantage of both the feature fusion and the decision fusion methodologies. We introduce a general formulation to facilitate information fusion at multiple stages, followed by an experimental study of a set of fusion schemes allowed by the framework. The proposed method is implemented on a real-time audio-visual speech recognition system and evaluated on connected-digit recognition tasks under varying acoustic conditions. The results show that the multistage fusion system consistently achieves lower word error rates than the reference feature fusion and decision fusion systems. It is further shown that removing the audio-only channel from the multistage system leads to only minimal degradation in recognition performance while providing a noticeable reduction in computational load.
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
This paper addresses the problem of audio-visual information fusion to provide highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams, and we propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be trained jointly based on maximum likelihood estimation. Experiments, performed for a speaker-independent large-vocabulary continuous speech recognition task and different integration methods, show that the best performance is obtained by asynchronous stream integration. This system reduces the error rate at an 8.5 dB SNR with additive speech "babble" noise by 27% relative over audio-only models and by 12% relative over traditional audio-visual models using concatenative feature fusion.
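A composite (product) HMM of the kind discussed can be pictured as a state space of paired audio and visual states in which the two streams are allowed to drift apart by a bounded number of states. The sketch below builds such a paired state space and scores it with weighted per-stream emission log-probabilities; the names, the asynchrony bound, and the weighting are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def composite_states(n_audio_states, n_visual_states, max_async=1):
    """Enumerate composite (audio_state, visual_state) pairs whose indices
    differ by at most max_async, the allowed degree of stream asynchrony."""
    return [(i, j)
            for i in range(n_audio_states)
            for j in range(n_visual_states)
            if abs(i - j) <= max_async]

def composite_emission_logprob(audio_logb, visual_logb, state, audio_weight=0.7):
    """Weighted emission log-probability of one composite state for one frame.

    audio_logb[i], visual_logb[j] are the per-stream emission log-probabilities.
    """
    i, j = state
    return audio_weight * audio_logb[i] + (1.0 - audio_weight) * visual_logb[j]

# Example: a 3-state audio HMM paired with a 3-state visual HMM.
states = composite_states(3, 3, max_async=1)
audio_logb = np.log([0.6, 0.3, 0.1])
visual_logb = np.log([0.5, 0.4, 0.1])
scores = {s: composite_emission_logprob(audio_logb, visual_logb, s) for s in states}
```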
IEEE International Conference on Multimedia and Expo, 2001. ICME 2001., 2001
Four different visual speech parameterisation methods are compared on a large-vocabulary, continuous, audio-visual speech recognition task using the IBM ViaVoice™ audio-visual speech database. Three are direct transforms of the mouth image region: the discrete cosine and wavelet transforms, and principal component analysis. The fourth uses a statistical model of shape and appearance, called an active appearance model, to track the entire face and obtain model parameters describing it.
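Appearance-based parameterisations of this kind typically apply a 2-D image transform to the mouth region of interest and retain the lowest-order coefficients as visual features. A minimal sketch of a DCT-based variant follows; the ROI size, number of coefficients, simple row-major coefficient selection, and function names are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dctn

def dct_mouth_features(roi, n_coeffs=24):
    """Compute a 2-D DCT of a grayscale mouth ROI and keep the low-order
    coefficients (here selected in simple row-major order for brevity)."""
    coeffs = dctn(roi.astype(np.float64), norm="ortho")
    return coeffs.flatten()[:n_coeffs]

# Example: a 32x32 grayscale mouth region (random stand-in for a video frame crop).
roi = np.random.rand(32, 32)
features = dct_mouth_features(roi)
```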
2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004
Improved face and feature finding for audio-visual speech recognition in visually challenging environments. Jintao Jiang, Gerasimos Potamianos, Harriet Nock, Giridharan Iyengar, Chalapathy Neti ...
2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698), 2003
We present a prototype for the automatic recognition of audio-visual speech, developed to augment the IBM ViaVoice™ speech recognition system. Frontal-face, full-frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On average, the required computations utilize approximately 67% of a Pentium™ 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice™ engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front-end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
Proceedings of the IEEE, 2003
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
In this work we demonstrate an improvement in state-of-the-art large-vocabulary continuous speech recognition (LVCSR) performance, under clean and noisy conditions, by the use of visual information in addition to the traditional audio information. We take a decision fusion approach to the audio-visual information, where the single-modality (audio- and visual-only) HMM classifiers are combined to recognize audio-visual speech. More specifically, we tackle the problem of estimating the appropriate combination weights for each of the modalities. Two different techniques are described: the first uses an automatically extracted estimate of the audio stream reliability in order to modify the weights for each modality (both clean and noisy audio results are reported), while the second is a discriminative model combination approach where weights on pre-defined model classes are optimized to minimize WER (clean audio results only).
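One simple way to realize reliability-driven weighting of the kind used in the first technique is to map an estimate of the audio stream's quality (for example, a frame-level SNR) to the audio exponent through a smooth monotone function. The sigmoid mapping below is purely an illustrative assumption, not the estimator used in the paper.

```python
import numpy as np

def audio_weight_from_snr(snr_db, midpoint_db=10.0, slope=0.3,
                          w_min=0.3, w_max=0.9):
    """Map an estimated SNR (dB) to an audio stream weight in [w_min, w_max]:
    high SNR -> trust audio more, low SNR -> shift weight to the visual stream."""
    s = 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))
    return w_min + (w_max - w_min) * s

for snr in (0.0, 10.0, 20.0):
    print(snr, round(float(audio_weight_from_snr(snr)), 3))
```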
Workshop on Multimedia Information Systems, 2002
In this paper we describe methods for automatic labeling of high-level semantic concepts in documentary-style videos. The emphasis of this paper is on audio processing and on fusing information from multiple modalities. The work described represents initial work towards a trainable system that acquires a collection of generic "intermediate" semantic ...
International Conference on Acoustics, Speech, and Signal Processing, 2002
In this paper we present our approach to detecting monologues in video shots. A monologue shot is defined as a shot containing a talking person in the video channel with the corresponding speech in the audio channel. Whilst motivated by the TREC 2002 Video Retrieval Track (VT02), the underlying approach of synchrony between audio and video signals is also ...
Pattern Recognition, 2004
Orthogonal information present in the video signal associated with the audio helps in improving the accuracy of a speech recognition system. Audio-visual speech recognition involves extraction of both audio and visual features from the input signal. Extraction of visual parameters is done by the recognition of speech-dependent features from the video sequence. This paper uses geometrical features to describe the lip shapes. Curve-based Active Shape Models are used to extract the geometry. These geometrically represented visual parameters are used along with the audio cepstral features to perform an audio-visual classification. It is shown that the bimodal system presented here gives an improvement in classification results over classification using only the audio features.
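Geometric lip-shape descriptors of this kind reduce a set of tracked lip contour points to a few measurements such as mouth width, height, and area. The sketch below shows one plausible reduction from landmark coordinates; the specific measurements and point layout are illustrative assumptions rather than the paper's parameterisation.

```python
import numpy as np

def lip_geometry_features(landmarks):
    """Reduce lip contour landmarks (N x 2 array of x, y points, as might be
    produced by an active shape model tracker) to simple geometric features."""
    pts = np.asarray(landmarks, dtype=float)
    width = pts[:, 0].max() - pts[:, 0].min()    # horizontal mouth opening
    height = pts[:, 1].max() - pts[:, 1].min()   # vertical mouth opening
    # Polygon area via the shoelace formula over the ordered contour points.
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return np.array([width, height, area])

# Example with a crude 6-point lip contour.
contour = [(0, 2), (2, 0), (4, 0), (6, 2), (4, 4), (2, 4)]
print(lip_geometry_features(contour))
```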
In this paper we present a Hindi speech recognition system which has been trained on 40 hours of audio data and has a trigram language model trained on 3 million words. For a vocabulary size of 65,000 words, the system gives a word accuracy of 75% to 95%.
Lecture Notes in Computer Science, 2003
... drumbeats and an image sequence showing the person beating that drum would be ... Suggestions include [14], which uses Canonical Correlation Analysis on a set of training data to find ... in broadcast video would allow wider application of IBM's audio-visual speech recognition ...
Proceedings of the second international conference on Human Language Technology Research -, 2002
... have met limited success in severely degraded environments, mismatched to system training [2-4]. Clearly, novel, non-traditional ... A number of audio-visual integration strategies appear in the literature that can be grouped into two broad categories (see ... [4] R. Stern, A. Acero, F ...
IEEE International Conference on Acoustics Speech and Signal Processing, 2002
In this paper, we propose a new fast and flexible algorithm based on the maximum entropy (MAXENT) criterion to estimate stream weights in a state-synchronous multi-stream HMM. The technique is compared to the minimum classification error (MCE) criterion and to a brute-force, grid-search optimization of the WER on both a small- and a large-vocabulary audio-visual continuous speech recognition task. When estimating global stream weights, the MAXENT approach gives results comparable to the grid search and the MCE. Estimation of state-dependent weights is also considered: we observe significant improvements in both the MAXENT and MCE criteria, which, however, do not result in significant WER gains.
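The brute-force baseline mentioned here, a grid search over a single global stream weight, is straightforward to sketch. The snippet below assumes a hypothetical decode_wer(weight) helper that decodes a held-out audio-visual set with the given audio exponent and returns its word error rate; it is illustrative only.

```python
import numpy as np

def grid_search_stream_weight(decode_wer, step=0.05):
    """Brute-force search for the global audio stream weight minimizing WER.

    decode_wer: callable mapping an audio weight in [0, 1] to the word error
    rate measured on a held-out set (hypothetical helper, not a real API).
    """
    weights = np.arange(0.0, 1.0 + 1e-9, step)
    wers = [decode_wer(w) for w in weights]
    best = int(np.argmin(wers))
    return weights[best], wers[best]

# Usage with a toy stand-in for a real decoder (pretend WER curve, minimum at 0.7).
toy_decode = lambda w: (w - 0.7) ** 2 + 0.15
best_w, best_wer = grid_search_stream_weight(toy_decode)
```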
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
We propose the use of a hierarchical, two-stage discriminant transformation for obtaining audio-visual features that improve automatic speech recognition. Linear discriminant analysis (LDA), followed by a maximum likelihood linear transform (MLLT), is first applied on MFCC-based audio-only features, as well as on visual-only features obtained by a discrete cosine transform of the video region of interest. Subsequently, a second stage of LDA and MLLT is applied on the concatenation of the resulting single-modality features. The obtained audio-visual features are used to train a traditional HMM-based speech recognizer. Experiments on the IBM ViaVoice™ audio-visual database demonstrate that the proposed feature fusion method improves speaker-independent, large-vocabulary, continuous speech recognition for both the clean and noisy audio conditions considered. A 24% relative word error rate reduction over an audio-only system is achieved in the latter case.
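The hierarchical fusion described above can be pictured as two cascaded discriminant projections: one per modality, then one on the concatenated outputs. The sketch below uses scikit-learn's LDA as a stand-in for the paper's LDA+MLLT cascade (MLLT is omitted), with made-up dimensionalities and random data; it is a rough illustration under those assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy training data: per-frame audio (39-dim) and visual (24-dim) features with
# sub-phonetic class labels y (here, 10 fake classes).
rng = np.random.default_rng(0)
n, n_classes = 2000, 10
audio, visual = rng.normal(size=(n, 39)), rng.normal(size=(n, 24))
y = rng.integers(0, n_classes, size=n)

# Stage 1: per-modality discriminant projections.
lda_a = LinearDiscriminantAnalysis(n_components=n_classes - 1).fit(audio, y)
lda_v = LinearDiscriminantAnalysis(n_components=n_classes - 1).fit(visual, y)
fused_in = np.hstack([lda_a.transform(audio), lda_v.transform(visual)])

# Stage 2: discriminant projection on the concatenated single-modality features.
lda_av = LinearDiscriminantAnalysis(n_components=n_classes - 1).fit(fused_in, y)
audio_visual_features = lda_av.transform(fused_in)
```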
Proceedings. IEEE International Conference on Multimedia and Expo, 2002
We describe methods for automatic labeling of high-level semantic concepts in documentary-style videos. The emphasis of this paper is on audio processing and on fusing information from multiple modalities. The work described represents initial work towards a trainable system that acquires a collection of generic "intermediate" semantic concepts across modalities (such as audio, video, text) and combines information from ...