A comparison of model and transform-based visual features for audio-visual LVCSR
Related papers
2004
Abstract We compare two different groups of visual features that can be used in addition to audio to improve automatic speech recognition (ASR): high- and low-level visual features. Facial animation parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, are used as high-level visual features. Principal component analysis (PCA) based projection weights of the intensity images of the mouth area are used as low-level visual features. PCA is also applied to the FAPs.
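As a rough illustration of the low-level feature stream described above, the sketch below projects flattened grayscale mouth-region images onto a PCA basis and uses the projection weights as per-frame visual features. The ROI handling, component count, and function names are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of PCA-based low-level visual features, assuming the mouth
# region has already been cropped and resized to a fixed-size grayscale image.
# n_components and the array shapes are assumptions, not the paper's settings.
import numpy as np
from sklearn.decomposition import PCA

def pca_mouth_features(train_rois, test_rois, n_components=32):
    """train_rois, test_rois: arrays of shape (num_frames, H, W), grayscale."""
    # Flatten each mouth ROI into a single intensity vector.
    X_train = train_rois.reshape(len(train_rois), -1).astype(np.float64)
    X_test = test_rois.reshape(len(test_rois), -1).astype(np.float64)

    # Learn the eigen-mouth basis on the training frames only.
    pca = PCA(n_components=n_components)
    pca.fit(X_train)

    # The projection weights serve as the low-level visual feature vectors.
    return pca.transform(X_train), pca.transform(X_test)
```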
Appearance Feature Extraction Versus Image Transform-Based Approach for Visual Speech Recognition
International Journal of Computational Intelligence and Applications, 2006
In this paper we propose a new appearance-based system that consists of two stages, visual speech feature extraction and classification followed by recognition of the extracted features, so that the result is a complete lipreading system. The system employs our Hyper Column Model (HCM) approach to extract and classify the visual features and uses a Hidden Markov Model (HMM) for recognition. This paper addresses mainly the first stage, i.e. feature extraction and classification. We investigate the performance of HCM for feature extraction and classification and then compare it with the performance obtained when HCM is replaced by the Fast Discrete Cosine Transform (FDCT). Unlike FDCT, HCM can extract the entire feature set without loss. The experiments have also shown that HCM is generally better than FDCT and provides a good distribution of the phonemes in the feature space for recognition purposes. For a fair comparison, two databases are used, each at three different resolutions. One of the two databases is designed to include shifted and scaled objects. Experiments reveal that HCM is able to recover from and cope with such image variations, whereas the effectiveness of FDCT drops drastically, especially for new subjects.
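For reference, here is a minimal sketch of the kind of FDCT front end the HCM approach is compared against: the mouth image is transformed with a separable 2-D DCT and only a low-frequency block of coefficients is retained. The block size and the number of retained coefficients are assumptions for illustration, not the settings used in the paper.

```python
# A hedged sketch of a 2-D FDCT feature extractor for a fixed-size grayscale
# mouth image; `keep` (the retained low-frequency block) is an assumption.
import numpy as np
from scipy.fftpack import dct

def fdct_features(mouth_img, keep=8):
    """Return the top-left keep x keep block of 2-D DCT coefficients."""
    img = mouth_img.astype(np.float64)
    # Separable 2-D DCT-II with orthonormal scaling.
    coeffs = dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")
    # Low-frequency coefficients carry most of the mouth-appearance energy.
    return coeffs[:keep, :keep].flatten()
```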
A comparison of visual features for audiovisual automatic speech recognition
2008
Abstract The use of visual information from the speaker's mouth region has been shown to improve the performance of Automatic Speech Recognition (ASR) systems. This is particularly useful in the presence of noise, which even in moderate amounts severely degrades the recognition performance of systems using only audio information. Various sets of features extracted from the speaker's mouth region have been used to improve the performance of ASR systems in such challenging conditions, with considerable success.
Comparison of visual features for audio-visual speech recognition using the AURORA-2J-AV database
The Journal of the Acoustical Society of America, 2006
The use of visual information from the speaker's mouth region has been shown to improve the performance of Automatic Speech Recognition (ASR) systems. This is particularly useful in the presence of noise, which even in moderate amounts severely degrades the recognition performance of systems using only audio information. Various sets of features extracted from the speaker's mouth region have been used to improve the performance of ASR systems in such challenging conditions, with considerable success. To the best of the authors' knowledge, however, the effect of these techniques on phoneme-level recognition performance has not yet been investigated. This paper presents a comparison of phoneme recognition performance using visual features extracted from the mouth region of interest with the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). New DCT and DWT features have also been extracted and compared with those used previously. These features were used along with audio features based on Mel-frequency cepstral coefficients (MFCCs). This work will help in selecting suitable features for different applications and in identifying the limitations of these methods in the recognition of individual phonemes.
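A minimal sketch of a DWT-based mouth-ROI feature is given below, assuming the PyWavelets package; the wavelet family, decomposition depth, and the choice of keeping only the approximation sub-band are illustrative assumptions rather than the configuration evaluated in the paper.

```python
# A hedged sketch of DWT mouth-ROI features; wavelet and level are assumptions.
import numpy as np
import pywt

def dwt_features(mouth_img, wavelet="haar", level=2):
    """Return the flattened approximation sub-band after `level` 2-D DWT steps."""
    coeffs = mouth_img.astype(np.float64)
    for _ in range(level):
        # Keep only the low-low (approximation) sub-band at each level.
        coeffs, _ = pywt.dwt2(coeffs, wavelet)
    return coeffs.flatten()
```

A per-frame visual vector of this kind would then be combined with the MFCC audio stream, for example by frame-level concatenation or by model-level (multistream) fusion.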
A New Visual Speech Modelling Approach for Visual Speech Recognition
2011
In this paper we propose a new learning-based representation, referred to as the Visual Speech Unit (VSU), for visual speech recognition (VSR). The VSU concept extends the standard viseme model currently applied in VSR by including in the representation not only the data associated with the visemes but also the transitory information between consecutive visemes. The developed speech recognition system consists of several computational stages: (a) lip segmentation, (b) construction of Expectation-Maximization Principal Component Analysis (EM-PCA) manifolds from the input video images, (c) registration between the VSU models and the EM-PCA data constructed from the input image sequence, and (d) recognition of the VSUs using a standard Hidden Markov Model (HMM) classification scheme. In this paper we were particularly interested in evaluating the classification accuracy obtained for the new VSU models compared with that attained for standard (MPEG-4) viseme models. The experimental results indicate a 90% recognition rate when the system was applied to the identification of 60 classes of VSUs, whereas the recognition rate for the standard set of MPEG-4 visemes was only 52%.
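A minimal sketch of the EM-PCA step (expectation-maximization estimation of a principal subspace, in the style of Roweis) used to build manifolds from the lip-image sequences is shown below; the data layout and the number of components are assumptions for illustration, not the paper's settings.

```python
# A hedged sketch of EM-PCA: alternate projecting the data onto the current
# subspace (E-step) and re-estimating the subspace from the projections (M-step).
import numpy as np

def em_pca(Y, n_components, n_iter=50, seed=0):
    """Y: (d, n) data matrix with one column per flattened lip image."""
    rng = np.random.default_rng(seed)
    Y = Y - Y.mean(axis=1, keepdims=True)        # centre the data
    d = Y.shape[0]
    W = rng.standard_normal((d, n_components))   # random initial subspace
    for _ in range(n_iter):
        # E-step: latent coordinates X = (W^T W)^{-1} W^T Y.
        X = np.linalg.solve(W.T @ W, W.T @ Y)
        # M-step: subspace W = Y X^T (X X^T)^{-1}.
        W = Y @ X.T @ np.linalg.inv(X @ X.T)
    return W, X
```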
Analysis of lip geometric features for audio-visual speech recognition
IEEE Transactions on Systems, Man, and Cybernetics, 2004
Audio-visual speech recognition, which employs both acoustic and visual speech information, is a novel extension of acoustic speech recognition and significantly improves recognition accuracy in noisy environments. Although various audio-visual speech recognition systems have been developed, a rigorous and detailed comparison of the potential geometric visual features from speakers' faces is essential. Thus, in this paper the geometric visual features are compared and analyzed rigorously for their importance in audio-visual speech recognition. Experimental results show that, among the geometric visual features analyzed, the vertical lip aperture is the most relevant, and that the visual feature vector formed by the vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results. Speech signals are modeled by hidden Markov models (HMMs), and using the optimized HMMs and geometric visual features, the accuracies of acoustic-only, visual-only, and audio-visual speech recognition are compared. The audio-visual speech recognition scheme achieves a much higher recognition accuracy than acoustic-only and visual-only recognition, especially at high noise levels. The experimental results showed that a set of as few as three labial geometric features is sufficient to improve the recognition rate by as much as 20% (from 62% with acoustic-only information to 82% with audio-visual information at a signal-to-noise ratio of 0 dB).
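As a rough illustration of that three-feature set, the sketch below derives vertical and horizontal lip apertures from tracked lip landmarks and differentiates a lip-corner angle across frames. The landmark naming and the exact angle definition are hypothetical, since the paper's lip-tracking setup is not reproduced here.

```python
# A hedged sketch of geometric lip features from per-frame landmarks; the
# corner-angle definition below is a stand-in, not the paper's exact measure.
import numpy as np

def lip_geometry(upper, lower, left, right):
    """Each argument is an (x, y) lip landmark for one video frame."""
    upper, lower, left, right = map(np.asarray, (upper, lower, left, right))
    v_aperture = np.linalg.norm(upper - lower)   # vertical lip aperture
    h_aperture = np.linalg.norm(left - right)    # horizontal lip aperture
    # Illustrative lip-corner angle: orientation of the upper-lip point as seen
    # from the left mouth corner.
    corner_angle = np.arctan2(upper[1] - left[1], upper[0] - left[0])
    return v_aperture, h_aperture, corner_angle

def corner_angle_delta(corner_angles, frame_period=1.0):
    """First-order derivative of the lip-corner angle across frames."""
    return np.gradient(np.asarray(corner_angles, dtype=np.float64), frame_period)
```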
Audio-visual speech modeling for continuous speech recognition
Multimedia, IEEE Transactions on, 2000
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: 1) a visual module; 2) an acoustic module; and 3) a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration and hence enables the modeling of temporal dependencies more accurately than traditional approaches. We present two different methods for learning the asynchrony between the two modalities and show how to incorporate it in the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (Relative Spectra) acoustic features to a 7.2% error rate, and the combination of noise-robust acoustic and visual features to a 2.5% error rate.
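A minimal sketch of the state-level combination behind multistream HMMs follows: per-state emission log-likelihoods from the audio and visual streams are mixed with stream weights. The weight value and function names are placeholders, and the asynchrony modelling described in the paper is not shown.

```python
# A hedged sketch of per-state multistream score combination; the stream
# weight (w_audio) is a placeholder, typically tuned to the acoustic SNR.
import numpy as np

def multistream_log_likelihood(log_b_audio, log_b_visual, w_audio=0.7):
    """log_b_*: (num_states,) per-state emission log-likelihoods for one frame."""
    w_visual = 1.0 - w_audio
    # Weighted log-linear combination, applied independently in each HMM state.
    return w_audio * np.asarray(log_b_audio) + w_visual * np.asarray(log_b_visual)
```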
Audio-Visual Automatic Speech Recognition Using Dynamic Visual Features
2009
Human speech recognition is bimodal in nature, and the addition of visual information from the speaker's mouth region has been shown to improve the performance of automatic speech recognition (ASR) systems. The performance of audio-only ASR deteriorates rapidly in the presence of even moderate noise, but can be improved by including visual information from the speaker's mouth region. The new approach taken in this paper is to incorporate dynamic information captured from the speaker's mouth across successive video frames of the uttered speech. Audio-only, visual-only and audio-visual recognisers were studied in the presence of noise, and the results show that the audio-visual recogniser performs more robustly.
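A minimal sketch of dynamic (delta) visual features computed over successive video frames, in the spirit of the approach above, is given below; the regression window length is an assumption, and the same formula is widely used for audio feature streams.

```python
# A hedged sketch of regression-based delta (dynamic) features over a sequence
# of static per-frame visual features; `window` is an assumption.
import numpy as np

def delta_features(static_feats, window=2):
    """static_feats: (num_frames, dim) per-frame visual features."""
    T = static_feats.shape[0]
    padded = np.pad(static_feats, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(static_feats, dtype=np.float64)
    for k in range(1, window + 1):
        # Standard regression formula: weighted differences of future and past frames.
        deltas += k * (padded[window + k : window + k + T]
                       - padded[window - k : window - k + T])
    return deltas / denom
```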
Extraction of Visual Features for Lipreading
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002
The multimodal nature of speech is often ignored in human-computer interaction, but lip deformations and other body motion, such as those of the head, convey additional information. We integrate speech cues from many sources and this improves intelligibility, especially when the acoustic signal is degraded. This paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three methods for parameterizing lip image sequences for recognition using hidden Markov models are compared. Two of these are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape or shape and appearance, respectively. The third, bottom-up, method uses a nonlinear scale-space analysis to form features directly from the pixel intensity. All methods are compared on a multitalker visual speech recognition task of isolated letters.
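As a rough sketch of the second top-down parameterization (a joint PCA of shape and appearance), the snippet below rescales the shape block, concatenates it with the appearance block, and applies a further PCA; the balancing weight and dimensionalities are illustrative assumptions, not the paper's choices.

```python
# A hedged sketch of combining shape and appearance parameters before a joint
# PCA; the variance-balancing weight is an assumption for illustration.
import numpy as np
from sklearn.decomposition import PCA

def shape_appearance_features(shape_params, app_params, n_components=20):
    """shape_params: (n_frames, ks) shape weights; app_params: (n_frames, ka)."""
    # Scale the shape block so both blocks contribute comparable variance.
    w = np.sqrt(app_params.var() / max(shape_params.var(), 1e-12))
    combined = np.hstack([w * shape_params, app_params])
    # A second PCA over the concatenated parameters yields the joint features.
    return PCA(n_components=n_components).fit_transform(combined)
```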