Audio-visual speech perception without speech cues

Top-down and bottom-up modulation of audiovisual integration in speech

European Journal of Cognitive Psychology, 2005

This research assesses how audiovisual speech integration mechanisms are modulated by sensory and cognitive variables. For this purpose, the McGurk effect was used as an experimental paradigm. This effect occurs when participants are exposed to incongruent auditory and visual speech signals. For example, when an auditory /b/ is dubbed onto a visual /g/, listeners are led to perceive a fused phoneme such as /d/. With the reverse presentation, they experience a combination such as /bg/. In two experiments, auditory intensity (40 dB, 50 dB, 60 dB, and 70 dB), face size (large: 19 × 23 cm; small: 1.8 × 2 cm) and instructions ("multiple choice" ...

Evidence of correlation between acoustic and visual features of speech

This paper examines the degree of correlation between lip and jaw configuration and speech acoustics. The lip and jaw positions are characterised by a system of measurements taken from video images of the speaker's face and profile, and the acoustics are represented using line spectral pair parameters and a measure of RMS energy. A correlation is found between the measured acoustic parameters and a linear estimate of the acoustics recovered from the visual data. This correlation exists despite the simplicity of the visual representation and is in rough agreement with correlations measured in earlier work by Yehia et al. using different techniques. However, analysis of the estimation errors suggests that the visual information, as parameterised in our experiment, offers only a weak constraint on the acoustics. Results are discussed from the perspective of models of early audiovisual integration.
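As a rough illustration of the analysis style described above, the sketch below fits a least-squares linear mapping from visual measurements to acoustic parameters and then correlates the measured acoustics with the linear estimate recovered from the visual data. All arrays and dimensions (the visual feature matrix and the acoustic parameters standing in for line spectral pairs plus RMS energy) are hypothetical placeholders, not the paper's data or code.

```python
import numpy as np

# Synthetic stand-ins for per-frame visual measurements (lip/jaw positions)
# and acoustic parameters (line spectral pairs plus RMS energy).
rng = np.random.default_rng(0)
n_frames, n_visual, n_acoustic = 500, 6, 11      # e.g. 10 LSPs + RMS energy
V = rng.normal(size=(n_frames, n_visual))        # visual feature matrix
true_map = rng.normal(size=(n_visual, n_acoustic))
A = V @ true_map + 0.5 * rng.normal(size=(n_frames, n_acoustic))  # acoustics partly driven by V

# Least-squares linear estimate of the acoustics from the visual data,
# with a bias term appended to the visual features.
V1 = np.hstack([V, np.ones((n_frames, 1))])
W, *_ = np.linalg.lstsq(V1, A, rcond=None)       # mapping: visual -> acoustic
A_hat = V1 @ W                                   # linearly estimated acoustics

# Per-parameter Pearson correlation between measured and estimated acoustics,
# analogous to the correlations reported in the abstract.
r = np.array([np.corrcoef(A[:, k], A_hat[:, k])[0, 1] for k in range(n_acoustic)])
print("correlation per acoustic parameter:", np.round(r, 2))
```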

Speech intelligibility derived from asynchronous processing of auditory-visual information

Proceedings of the Workshop on Audio-Visual …, 2001

The current study examines the temporal parameters associated with cross-modal integration of auditory-visual information for sentential material. The speech signal was filtered into 1/3-octave channels, all of which were discarded except for a low-frequency (298-375 Hz) and a high-frequency (4762-6000 Hz) band. The intelligibility of this audio-only signal ranged between 9% and 31% for nine normal-hearing subjects. Visual-alone presentation of the same material ranged between 1% and 22% intelligibility. When the audio and video signals were combined and presented in synchrony, intelligibility climbed to an average of 63%. When the audio signal led the video, intelligibility declined appreciably even at the shortest asynchrony of 40 ms, falling to an asymptotic level of performance for asynchronies of approximately 120 ms and longer. In contrast, when the video signal led the audio, intelligibility remained relatively stable for onset asynchronies up to 160-200 ms. Hence, there is a marked asymmetry in the integration of audio and visual information that has important implications for sensory-based models of auditory-visual speech processing.
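The audio processing described here can be sketched as follows, under assumed parameters: a placeholder waveform is band-pass filtered to the two reported 1/3-octave regions (298-375 Hz and 4762-6000 Hz), the bands are summed into the sparse audio-only stimulus, and an audio-leading asynchrony is imposed by shifting the audio relative to the shared timeline. The sampling rate, filter order, and signal are assumptions, not the study's actual implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000                                   # sampling rate (assumed)
t = np.arange(0, 2.0, 1 / fs)
speech = np.random.default_rng(1).normal(size=t.size)   # placeholder waveform

def third_octave_band(x, lo_hz, hi_hz, fs):
    """Band-pass filter x to the [lo_hz, hi_hz] band (zero-phase Butterworth)."""
    sos = butter(4, [lo_hz, hi_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

# Keep only the two reported 1/3-octave bands and sum them.
low_band = third_octave_band(speech, 298, 375, fs)
high_band = third_octave_band(speech, 4762, 6000, fs)
sparse_audio = low_band + high_band          # two-band audio-only stimulus

# Audio leading the video by 40 ms: advance the audio by discarding its
# first 40 ms relative to the (unchanged) video timeline.
lead_ms = 40
shift = int(fs * lead_ms / 1000)
audio_leading = np.concatenate([sparse_audio[shift:], np.zeros(shift)])
```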

Audio–visual speech perception is special

Cognition, 2005

In face-to-face conversation speech is perceived by ear and eye. We studied the prerequisites of audio-visual speech perception by using perceptually ambiguous sine wave replicas of natural speech as auditory stimuli. When the subjects were not aware that the auditory stimuli were speech, they showed only negligible integration of auditory and visual stimuli. When the same subjects learned to perceive the same auditory stimuli as speech, they integrated the auditory and visual stimuli in a similar manner as natural speech. These results demonstrate the existence of a multisensory speech-specific mode of perception.
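Sine wave replicas of speech are typically built by replacing the first few formants of an utterance with time-varying sinusoids that follow the formant frequency and amplitude tracks. The sketch below illustrates that construction with invented, smoothly varying tracks; it is not the stimulus-generation code used in the study.

```python
import numpy as np

fs = 16000
dur = 1.0
t = np.arange(0, dur, 1 / fs)

# Hypothetical smooth formant tracks (Hz) and amplitude envelopes for F1-F3.
freqs = [500 + 150 * np.sin(2 * np.pi * 2.0 * t),
         1500 + 300 * np.sin(2 * np.pi * 1.5 * t),
         2500 + 200 * np.sin(2 * np.pi * 1.0 * t)]
amps = [0.6 * np.ones_like(t), 0.3 * np.ones_like(t), 0.15 * np.ones_like(t)]

# Each replica tone follows its formant track via the integrated phase;
# the summed tones form the sine wave replica.
replica = np.zeros_like(t)
for f, a in zip(freqs, amps):
    phase = 2 * np.pi * np.cumsum(f) / fs
    replica += a * np.sin(phase)
replica /= np.max(np.abs(replica))           # normalise the summed tones
```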

Visual analog of the acoustic amplitude envelope benefits speech perception in noise

The Journal of the Acoustical Society of America

The nature of the visual input that integrates with the audio signal to yield speech processing advantages remains controversial. This study tests the hypothesis that the information extracted for audiovisual integration includes co-occurring suprasegmental dynamic changes in the acoustic and visual signal. English sentences embedded in multitalker babble noise were presented to native English listeners in audio-only and audiovisual modalities. A significant intelligibility enhancement with the visual analogs congruent to the acoustic amplitude envelopes was observed. These results suggest that dynamic visual modulation provides speech rhythmic information that can be integrated online with the audio signal to enhance speech intelligibility.
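A minimal sketch of how such a visual analog could be derived, under assumed parameters: extract the acoustic amplitude envelope (magnitude of the analytic signal, low-pass smoothed), normalise it, and map one envelope value per video frame onto a simple visual property such as a radius. The sampling rate, smoothing cutoff, frame rate, and radius mapping are illustrative assumptions, not the study's method.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

fs = 16000
video_fps = 60
speech = np.random.default_rng(2).normal(size=fs * 2)    # placeholder sentence

# Amplitude envelope: magnitude of the analytic signal, smoothed below ~10 Hz.
env = np.abs(hilbert(speech))
sos = butter(2, 10, btype="lowpass", fs=fs, output="sos")
env = sosfiltfilt(sos, env)
env = np.clip(env, 0, None) / env.max()                  # normalise to [0, 1]

# Resample the envelope to one value per video frame and map it to a radius,
# which would drive the size of the visual analog on each frame.
frame_idx = np.arange(0, env.size, fs / video_fps).astype(int)
radius_px = 20 + 80 * env[frame_idx]                      # hypothetical mapping
```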

The effects of separating auditory and visual sources on audiovisual integration of speech

When the image of a speaker saying the bisyllable /aga/ is presented in synchrony with the sound of a speaker saying /aba/, subjects tend to report hearing the sound /ada/. The present experiment explores the effects of spatial separation on this class of perceptual illusion known as the McGurk effect. Synchronous auditory and visual speech signals were presented from different locations. The auditory signal was presented from positions 0°, 30°, 60° and 90° in azimuth away from the visual signal source. The results show that spatial incongruencies do not substantially influence the multimodal integration of speech signals.

Effects of Mouth-Only and Whole-Face Displays on Audio-Visual Speech Perception in Noise: Is the Vision of a Talker’s Full Face Truly the Most Efficient Solution?

The goal of the present study was to establish the nature of visual input (featural vs. holistic) and the mode of its presentation that best facilitate audio-visual speech perception. Sixteen participants were asked to repeat strongly and mildly acoustically degraded syllables, presented in an auditory condition and in three audio-visual conditions, one containing holistic and two containing featural visual information. The featural audio-visual conditions differed in how the talker's mouth was presented. Data on correct repetitions and on the duration of participants' fixations in the talker's mouth area were collected. The results showed that the facilitative effect of visual information on speech perception depended on both the degradation level of the auditory input and the visual presentation format, whereas eye-movement behavior was affected only by the visual input format. Featural information, when presented in a format containing no high-contrast elements, was overall the most efficient visual ...

Visual Influences on Perception of Speech and Nonspeech Vocal-Tract Events

Language and Speech, 2006

We report four experiments designed to determine whether visual information affects judgments of acoustically specified nonspeech events as well as speech events (the "McGurk effect"). Previous findings have shown only weak McGurk effects for nonspeech stimuli, whereas strong effects are found for consonants. We used click sounds that serve as consonants in some African languages but are perceived as nonspeech by American English listeners. We found a significant McGurk effect for clicks presented in isolation, although it was much smaller than that found for stop-consonant-vowel syllables. In subsequent experiments, we found strong McGurk effects, comparable to those found for English syllables, for click-vowel syllables, and weak effects, comparable to those found for isolated clicks, for excised release bursts of stop consonants presented in isolation. We interpret these findings as evidence that the potential contributions of speech-specific processes to the McGurk effect are limited, and discuss the results in relation to current explanations for the McGurk effect.