A real-time prototype for small-vocabulary audio-visual ASR
Related papers
Audio-Visual Speech Recognition
2000
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding dramatic ASR improvements. Visual speech
Towards practical deployment of audio-visual speech recognition
2004
Much progress has been achieved during the past two decades in audio-visual automatic speech recognition (AVASR). However, challenges persist that hinder AVASR deployment in practical situations, most notably robust and fast extraction of visual speech features. We review our effort to overcome this problem, based on an appearance-based visual feature representation of the speaker's mouth region. In particular: (a) We discuss AVASR in realistic, visually challenging domains, where lighting, background, and head pose vary significantly. To enhance visual front-end robustness in such environments, we employ an improved statistics-based face detection algorithm that significantly outperforms our baseline scheme. However, visual-only recognition remains inferior to that achieved on visually "clean" (studio-like) data, demonstrating the importance of accurate mouth-region extraction. (b) We then consider a wearable audio-visual sensor that directly captures the mouth region, thus eliminating the need for face detection. Its use improves visual-only recognition, even over full-face videos recorded in the studio-like environment. (c) Finally, we address the speed of visual feature extraction by discussing our real-time AVASR prototype implementation. The reported progress demonstrates the feasibility of practical AVASR.
2016 12th International Computer Engineering Conference (ICENCO), 2016
Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone suffers from deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. The combination of auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality. Multimodal recognition is therefore acknowledged as a vital component of the next generation of spoken language systems. This paper builds a connected-words audio-visual speech recognition (AV-ASR) system for English that uses both acoustic and visual speech information to improve recognition performance. First, Mel-frequency cepstral coefficients (MFCCs) are used to extract audio features from the speech files. For the visual counterpart, Discrete Cosine Transform (DCT) coefficients are extracted from the speaker's mouth region, and Principal Component Analysis (PCA) is used for dimensionality reduction. These visual features are then concatenated with the traditional audio ones, and the resulting feature vectors are used to train word-level hidden Markov models (HMMs). The system is developed with the Hidden Markov Model Toolkit (HTK). The potential of the approach is demonstrated in a preliminary experiment on the GRID sentence database, one of the largest databases available for audio-visual recognition, which contains continuous English voice commands for a small-vocabulary task. The experimental results show that the proposed audio-visual speech recognizer achieves a higher recognition rate than an audio-only recognizer and exhibits robust performance. An increase in success rate of 4% over all speakers is achieved for the grammar-based word recognition system in the speaker-independent test.
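A minimal Python sketch of the feature pipeline this abstract describes (MFCC audio features, DCT coefficients of the mouth region reduced with PCA, then frame-level concatenation), assuming librosa, SciPy, and scikit-learn; frame sizes and coefficient counts are illustrative, not the paper's settings:

```python
import numpy as np
import librosa                       # assumed available for MFCC extraction
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def audio_features(wav_path, n_mfcc=13):
    """13 MFCCs per frame, as commonly used with HTK word-level HMMs."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # (frames, 13)

def visual_features(mouth_rois, n_dct=64, n_components=20):
    """2-D DCT of each grayscale mouth ROI, keep low-order coefficients,
    then reduce with PCA (dimensions are illustrative assumptions)."""
    feats = []
    for roi in mouth_rois:                                        # roi: 2-D uint8 array
        coeffs = dct(dct(roi.astype(float).T, norm='ortho').T, norm='ortho')
        feats.append(coeffs[:8, :8].flatten()[:n_dct])            # low-frequency block
    feats = np.vstack(feats)
    return PCA(n_components=n_components).fit_transform(feats)

def fuse(audio, visual):
    """Feature-level fusion: concatenate per frame (assumes matching frame rates)."""
    n = min(len(audio), len(visual))
    return np.hstack([audio[:n], visual[:n]])
```

In a real system the two streams would first be resampled to a common frame rate before concatenation; the resulting vectors are what HTK-style word HMMs would be trained on.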
Recent advances in the automatic recognition of audiovisual speech
2003
Visual speech information from the speaker's mouth region has been successfully shown to improve the noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and second, audio-visual speech integration. On the latter topic, we discuss new work on combining feature and decision fusion, modeling audio-visual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon audio-visual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large-vocabulary tasks.
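A minimal sketch of reliability-weighted decision fusion of the kind this abstract discusses: per-modality log-likelihoods are combined with exponent weights reflecting each stream's estimated reliability. The weighting rule and names here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def fuse_log_likelihoods(ll_audio, ll_visual, lambda_audio):
    """log P_av(w) = lambda * log P_a(w) + (1 - lambda) * log P_v(w)."""
    return lambda_audio * ll_audio + (1.0 - lambda_audio) * ll_visual

def decide(word_ll_audio, word_ll_visual, lambda_audio=0.7):
    """Pick the candidate word with the highest weighted combined score.
    word_ll_*: dict mapping candidate word -> log-likelihood from that stream."""
    scores = {w: fuse_log_likelihoods(word_ll_audio[w], word_ll_visual[w], lambda_audio)
              for w in word_ll_audio}
    return max(scores, key=scores.get)
```

In practice the audio weight would be tied to an acoustic reliability estimate (e.g. SNR or stream confidence) rather than fixed.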
EV/MA: An Architecture for Audio-Visual Speech Recognition
bswerd.com
This paper presents an architecture for audio-visual speech recognition. It is a hidden Markov model based system that uses completely separate models for the auditory and visual domains, with an integration stage that combines the classifications made by the two modalities. The visual domain uses the parameters of an ellipse approximating the shape of the speaker's lips as features for recognition (EV), and the auditory domain uses classical Mel-cepstrum based features (MA). Techniques for extracting these elliptical features are also discussed. The architecture is shown to provide a substantial improvement over classical auditory-only recognition when the auditory signal is noisy.
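A minimal sketch of extracting elliptical lip features of the kind the "EV" stream uses, with OpenCV: fit an ellipse to a detected lip contour and use its axes and orientation as the per-frame feature vector. The crude hue-based lip mask and thresholds are illustrative assumptions; the paper's own extraction technique may differ:

```python
import cv2
import numpy as np

def ellipse_features(mouth_roi_bgr):
    """Return (major_axis, minor_axis, angle) of an ellipse fit to the largest
    lip-like contour, or None if no contour with enough points is found."""
    hsv = cv2.cvtColor(mouth_roi_bgr, cv2.COLOR_BGR2HSV)
    # crude red-hue lip mask; a real system would use a trained lip segmenter
    mask = cv2.inRange(hsv, (0, 60, 40), (15, 255, 255))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if len(c) >= 5]        # fitEllipse needs >= 5 points
    if not contours:
        return None
    lip = max(contours, key=cv2.contourArea)
    (_, _), (ax1, ax2), angle = cv2.fitEllipse(lip)
    return np.array([max(ax1, ax2), min(ax1, ax2), angle], dtype=float)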
A segment-based audio-visual speech recognizer
Proceedings of the 6th international conference on Multimodal interfaces - ICMI '04, 2004
This paper presents the development and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. To support this research, we have collected a new video corpus, called Audio-Visual TIMIT (AV-TIMIT), which consists of 4 hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our AVSR system, which incorporates a novel audio-visual integration scheme using segment-constrained Hidden Markov Models (HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when visual information is incorporated into the speech recognition process.
An audio-visual corpus for multimodal automatic speech recognition
Journal of Intelligent Information Systems, 2017
A review of available audiovisual speech corpora and a description of a new multimodal corpus of English speech recordings are provided. The new corpus, containing 31 hours of recordings, was created specifically to assist in the development of audiovisual speech recognition (AVSR) systems. The database includes high-resolution, high-framerate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded with both a microphone array and a microphone built into a mobile computer. To support the training of AVSR systems, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Owing to the inclusion of recordings made in noisy conditions, the corpus can also be used to test the robustness of speech recognition systems in the presence of acoustic background noise. The paper describes the process of building the corpus, including the recording, labeling, and post-processing phases. Results achieved with the developed audiovisual automatic speech recognition (ASR) engine, trained and tested on the material contained in the corpus, are presented and discussed together with comparative results from a state-of-the-art commercial ASR engine. To demonstrate its practical use, the corpus is made publicly available.
2014
In this paper we present a system for audio-visual speech recognition based on a hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) approach. To set up the system it was necessary to record a new audio-visual database, and we describe its recording and labeling. The fusion of audio and video data is a key aspect of the paper. We focus on three conditions: when only the audio data is reliable, when only the video data is reliable, and when both are equally reliable. A method to combine the video and audio information based on these three conditions is presented, and it is extended into an automatic fusion scheme that adapts to the noise level in the audio channel. The performance of the complete system is demonstrated using two types of additive noise at varying SNR.
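A minimal sketch of the kind of automatic, noise-dependent fusion this abstract describes: an audio stream weight derived from an SNR estimate is used to combine per-class posteriors from the audio and visual ANNs. The SNR-to-weight mapping and thresholds are assumptions for illustration, not the paper's actual rule:

```python
import numpy as np

def audio_weight_from_snr(snr_db, low=0.0, high=20.0):
    """Map estimated SNR (dB) to an audio stream weight in [0, 1]:
    0 when the audio is unusable, 1 when it is clean (illustrative bounds)."""
    return float(np.clip((snr_db - low) / (high - low), 0.0, 1.0))

def fuse_posteriors(p_audio, p_visual, snr_db):
    """Combine per-class posteriors from the two ANNs in the log domain and
    renormalize, as would feed the states of a hybrid ANN/HMM decoder."""
    w = audio_weight_from_snr(snr_db)
    log_p = w * np.log(p_audio + 1e-12) + (1.0 - w) * np.log(p_visual + 1e-12)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()
```

The three conditions in the abstract correspond to weights near 0, near 1, and around 0.5 in this scheme.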
Audio-Visual Automatic Speech Recognition Using Dynamic Visual Features
2009
Human speech recognition is bimodal in nature, and adding visual information from the speaker's mouth region has been shown to improve the performance of automatic speech recognition (ASR) systems. The performance of audio-only ASR deteriorates rapidly in the presence of even moderate noise, but can be improved by including visual information from the speaker's mouth region. The new approach taken in this paper is to incorporate dynamic information captured from the speaker's mouth across successive video frames of the uttered speech. Audio-only, visual-only and audio-visual recognisers were studied in the presence of noise; the results show that the audio-visual recogniser performs most robustly.
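A minimal sketch of adding dynamic visual information of the kind this abstract describes: first- and second-order differences (delta and delta-delta) computed over successive frames of static mouth-region features. The static features and window length are placeholders, not the paper's exact configuration:

```python
import numpy as np

def deltas(static, window=2):
    """Regression-style delta features over a +/- `window` frame context,
    analogous to the standard delta formula used for MFCCs.
    static: array of shape (frames, dims)."""
    padded = np.pad(static, ((window, window), (0, 0)), mode='edge')
    num = sum(k * (padded[window + k:len(static) + window + k]
                   - padded[window - k:len(static) + window - k])
              for k in range(1, window + 1))
    den = 2 * sum(k * k for k in range(1, window + 1))
    return num / den

def dynamic_visual_features(static):
    """Stack static, delta, and delta-delta features per frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```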
Simple and Effective Visual Speech Recognition System
Journal of emerging technologies and innovative research, 2020
Visual Speech Recognition (VSR) is the process of deciphering speech without any audio input. This is a technique employed by people with hearing impairments, and the ability to lip-read enables such people to interact with others and engage in conversation. In this paper the Viola-Jones algorithm is employed to detect and capture the face of the person speaking. The region of interest (ROI), i.e. the mouth region, is defined relative to the nose region and can thus be identified and extracted. Specific methods are employed for feature extraction: the Discrete Cosine Transform (DCT) is used to extract the visual features and obtain the final vector used to train the model. A speaker-dependent VSR system is presented that recognizes isolated letters and digits from an input video.
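A minimal sketch of the front end this abstract describes: Viola-Jones face detection with OpenCV, a mouth ROI cut from the detected face (the paper places it relative to the nose region; here a fixed geometric rule stands in for that step), and 2-D DCT coefficients as the visual feature vector. The cascade file and ROI proportions are illustrative assumptions:

```python
import cv2
import numpy as np
from scipy.fftpack import dct

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def mouth_dct_features(frame_bgr, n_coeffs=36):
    """Detect the largest face, crop an approximate mouth ROI, and return
    low-order 2-D DCT coefficients as the per-frame visual feature vector."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])       # largest detected face
    # mouth ROI: lower third of the face, horizontally centered (illustrative rule)
    roi = gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    roi = cv2.resize(roi, (32, 32)).astype(float)
    coeffs = dct(dct(roi.T, norm='ortho').T, norm='ortho')
    return coeffs[:6, :6].flatten()[:n_coeffs]
```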