THE EFFECTS OF SEPARATING AUDITORY AND VISUAL SOURCES ON AUDIOVISUAL INTEGRATION OF SPEECH

When the image of a speaker saying the bisyllable /aga/ is presented in synchrony with the sound of a speaker saying /aba/, subjects tend to report hearing the sound /ada/. The present experiment explores the effects of spatial separation on this class of perceptual illusion known as the McGurk effect. Synchronous auditory and visual speech signals were presented from different locations. The auditory signal was presented from positions 0°, 30°, 60° and 90° in azimuth away from the visual signal source. The results show that spatial incongruencies do not substantially influence the multimodal integration of speech signals.

Psychophysics of the McGurk and other audiovisual speech integration effects

Journal of Experimental Psychology: Human Perception and Performance, 2011

When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses: auditory correct, visual correct, fusion (the so-called McGurk effect), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for the distribution of these response types across 384 different stimuli from four talkers. The measures included the mutual information between the presented acoustic signal and the acoustic signal originally recorded with the presented video, and the correlation between the presented acoustic and video stimuli. In Experiment 1, open-set perceptual responses were obtained for acoustic /bA/ or /lA/ dubbed onto video /bA, dA, gA, vA, zA, lA, wA, ðA/. The talker, the video syllable, and the acoustic syllable significantly influenced the type of response. In Experiment 2, the best predictors of response-category proportions were a subset of the physical stimulus measures, which accounted for between 17% and 52% of the variance in those proportions. That audiovisual stimulus relationships can account for response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships.
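
To make the kind of physical stimulus measures described above concrete, the following is a minimal sketch of a histogram-based mutual-information estimate and a Pearson correlation between two signal tracks, used as predictors in a linear regression on response-category proportions. The synthetic signals, the binning scheme, the variable names, and the use of scikit-learn's LinearRegression are illustrative assumptions; this is not the authors' actual measurement pipeline.

```python
# Minimal sketch (assumed names, synthetic data): stimulus-based predictors
# of perceptual response-category proportions.
import numpy as np
from sklearn.linear_model import LinearRegression

def mutual_information(x, y, bins=16):
    """Histogram-based estimate (in bits) of the MI between two 1-D signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def stimulus_predictors(presented_audio, recorded_audio, lip_aperture):
    """Two example predictors: audio-audio MI and audio-lip correlation."""
    mi = mutual_information(presented_audio, recorded_audio)
    r = float(np.corrcoef(presented_audio, lip_aperture)[0, 1])
    return np.array([mi, r])

# Synthetic stand-ins for a set of dubbed stimuli and their fusion-response rates.
rng = np.random.default_rng(0)
stimuli = [(rng.standard_normal(2000), rng.standard_normal(2000),
            rng.standard_normal(2000)) for _ in range(24)]
fusion_proportions = rng.uniform(0.0, 1.0, size=24)

X = np.vstack([stimulus_predictors(a, ra, lips) for a, ra, lips in stimuli])
model = LinearRegression().fit(X, fusion_proportions)
print("variance accounted for (R^2):", round(model.score(X, fusion_proportions), 3))
```

Here model.score returns R² on the same data, the analogue of the 17% to 52% variance-accounted-for range reported above.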

Timing of auditory-visual integration in the McGurk effect

Society for Neuroscience annual meeting, San Diego, CA, 2001

In the following set of experiments, we used the McGurk illusion, first reported by McGurk and MacDonald (1976), to examine multisensory integration. In its "fusion" variant, the illusion emerges when a participant is presented with an auditory bilabial (e.g., /ba/) dubbed onto a visual velar (e.g., the articulatory movement for /ga/). Under these conditions, participants consistently report hearing an alveolar /da/ or /ða/, a virtual percept resulting from the AV fusion.

When a /Bi/g /Gi/g becomes a /Di/g: Explorations of the McGurk effect in speech perception

Australian Journal of Psychology

Although the McGurk effect is a well-researched illusory phenomenon arising from discrepant auditory and visual speech information, little is known about the influence of lexical processes on it. We therefore investigated the McGurk effect using three-letter consonant-vowel-consonant real-word and pseudoword pairs with an audiovisual discrepancy positioned at either stimulus onset or offset. The results demonstrated that the frequency of illusions was similar for real words and pseudowords when the discrepancy was at stimulus onset but was significantly lower for real words when the audiovisual discrepancy was positioned at stimulus offset. The position of the audiovisual discrepancy was not important for accurate auditory perception of pseudowords. These results suggest that the McGurk illusion results from audiovisual integration that occurs early in perception, prior to word identification, and that these early audiovisual integrative processes are modulated by lexical knowledge.

Binding and unbinding the McGurk effect in audiovisual speech fusion: Follow-up experiments on a new paradigm

The McGurk effect demonstrates the existence of a fusion process in audiovisual speech perception: the combination of the sound "ba" with the face of a speaker pronouncing "ga" is frequently perceived as "da". We assume that, upstream of this phonetic fusion process, there is a "binding" process that controls the combination of image and sound and can block or reduce it when the audio and video are incoherent (a conditional binding process), as in the case of a dubbed film. To test and explore this binding hypothesis, we designed experiments in which a coherent or incoherent audiovisual context is presented before McGurk stimuli, and we show that an incoherent contextual stimulus can significantly reduce the McGurk effect.

Multisensory integration of speech signals: the relationship between space and time

Experimental Brain Research, 2006

Integrating audiovisual cues for simple events is affected when the sources are separated in space and time. By contrast, audiovisual perception of speech appears resilient when either spatial or temporal disparities exist. We investigated whether speech perception is sensitive to the combination of spatial and temporal inconsistencies. Participants heard the bisyllable /aba/ while seeing a face produce the incongruent bisyllable /ava/. We tested the level of visual influence over auditory perception when the sound was asynchronous with respect to facial motion (from −360 to +360 ms) and emanated from one of five locations equidistant from the participant. Although an interaction was observed, it was not related to participants' perception of synchrony, nor did it indicate a linear relationship between the effects of spatial and temporal discrepancies. We conclude that either the complexity of the signal or the nature of the task reduces reliance on spatial and temporal contiguity for audiovisual speech perception.

Temporal window of integration in auditory-visual speech perception

Neuropsychologia, 2007

Forty-three normal-hearing participants were tested in two experiments focusing on temporal coincidence in auditory-visual (AV) speech perception. In these experiments, audio recordings of /pa/ and /ba/ were dubbed onto video recordings of /ka/ and /ga/, respectively (ApVk, AbVg), to produce the illusory "fusion" percepts /ta/ or /da/ [McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748]. In Experiment 1, an identification task was conducted using McGurk pairs with asynchronies ranging from −467 ms (auditory lead) to +467 ms (auditory lag). Fusion responses were prevalent over temporal asynchronies from −30 ms to +170 ms and were more robust for audio lags. In Experiment 2, simultaneity judgments were collected for incongruent McGurk pairs and congruent audiovisual tokens (AdVd, AtVt). McGurk pairs were more readily judged as asynchronous than congruent pairs. The characteristics of the temporal window over which simultaneity and fusion responses were maximal were quite similar, suggesting the existence of an approximately 200 ms wide, asymmetric bimodal temporal integration window.
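
As a toy illustration of how such a window can be characterized, the sketch below takes fusion-response rates at a grid of stimulus onset asynchronies and reports the asynchrony range over which the rate stays above a criterion. The SOA grid, the rates, and the 75%-of-peak criterion are invented for illustration; they are not the study's data.

```python
# Hypothetical sketch: estimating an asymmetric temporal integration window
# from fusion-response rates measured at several audiovisual asynchronies.
import numpy as np

# Assumed illustrative data: negative SOA = auditory lead, positive = auditory lag (ms).
soa_ms = np.array([-467, -333, -200, -133, -67, -30, 0, 30, 100, 170, 267, 333, 467])
fusion_rate = np.array([0.15, 0.25, 0.45, 0.60, 0.75, 0.82, 0.85,
                        0.86, 0.84, 0.80, 0.55, 0.35, 0.20])

criterion = 0.75 * fusion_rate.max()            # assumed 75%-of-peak criterion
above = soa_ms[fusion_rate >= criterion]
lower, upper = int(above.min()), int(above.max())
print(f"window: {lower:+d} ms to {upper:+d} ms "
      f"(width {upper - lower} ms, audio-lag skew {upper + lower:+d} ms)")
```

With these made-up numbers the window runs from −67 ms to +170 ms, i.e., it extends further on the audio-lag side, mirroring the asymmetry described above.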

Brain activity during audiovisual speech perception: an fMRI study of the McGurk effect

Neuroreport, 2003

fMRI was used to assess the relationship between brain activation and the degree of audiovisual integration of speech information during a phoneme categorization task. Twelve subjects heard a speaker say the syllable /aba/ paired either with video of the speaker saying the same consonant or a different one (/ava/). To manipulate the degree of audiovisual integration, the audio was either synchronous or ±400 ms out of phase with the visual stimulus. Subjects reported whether they heard the consonant /b/ or another consonant; fewer /b/ responses when the audio and visual stimuli were mismatched indicated higher levels of visual influence on speech perception (the McGurk effect). Active brain regions during presentation of the incongruent stimuli included the superior temporal and inferior frontal gyri, as well as extrastriate, premotor and posterior parietal cortex. A regression analysis related the strength of the McGurk effect to levels of brain activation. Paradoxically, higher numbers of /b/ responses were positively correlated with activation in the left occipito-temporal junction, an area often associated with processing visual motion. This activation suggests that auditory information modulates visual processing to affect perception.

Audiovisual integration in speech perception

Frontiers in Systems Neuroscience, 2009

Auditory and visual speech recognition unfolds in real time and occurs effortlessly for normal-hearing listeners. However, model-theoretic descriptions of the systems-level cognitive processes responsible for "integrating" auditory and visual speech information are currently lacking, primarily because they rely too heavily on accuracy rather than reaction-time predictions. Speech and language researchers have argued about whether audiovisual integration occurs in a parallel or a coactive fashion, and also about the extent to which audiovisual integration occurs in an "efficient" manner. The Double Factorial Paradigm, introduced in Section 1, is an experimental paradigm equipped to address dynamic processing issues related to architecture (parallel vs. coactive processing) as well as efficiency (capacity). Experiment 1 employed a simple word-discrimination task to assess both architecture and capacity in high-accuracy settings. Experiments 2 and 3 assessed these same issues using auditory and visual distractors in divided-attention and focused-attention tasks, respectively. Experiment 4 investigated audiovisual integration efficiency across different auditory signal-to-noise ratios. The results can be summarized as follows: integration typically occurs in parallel with an efficient stopping rule; integration occurs automatically in both focused- and divided-attention versions of the task; and audiovisual integration is only efficient (in the time domain) when the clarity of the auditory signal is relatively poor, although considerable individual differences were observed. In Section 3, these results are captured within the framework of parallel linear dynamic processing models with cross-channel interactions. Finally, Section 4 discusses broader implications of this research, including applications to clinical research and neural-biological models of audiovisual convergence.
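
For readers unfamiliar with the capacity construct mentioned above, one standard formulation is Townsend and Nozawa's capacity coefficient for first-terminating (OR) designs, C(t) = H_AV(t) / (H_A(t) + H_V(t)), where H(t) = −ln S(t) is the integrated hazard of the response-time survivor function; C(t) > 1 indicates super-capacity (efficient integration) and C(t) < 1 limited capacity. The sketch below estimates it from raw reaction times; the synthetic RT distributions and variable names are assumptions for illustration, not the dissertation's data or code.

```python
# Hypothetical sketch: capacity coefficient C(t) for an OR (first-terminating) design.
import numpy as np

def integrated_hazard(rts, t):
    """H(t) = -ln S(t), with S(t) taken from the empirical survivor function."""
    rts = np.asarray(rts, dtype=float)
    survivor = np.array([(rts > ti).mean() for ti in t])
    survivor = np.clip(survivor, 1e-6, 1.0)     # avoid log(0) in the right tail
    return -np.log(survivor)

def capacity_or(rt_av, rt_a, rt_v, t):
    """C(t) = H_AV / (H_A + H_V); > 1 super-capacity, < 1 limited capacity."""
    denom = integrated_hazard(rt_a, t) + integrated_hazard(rt_v, t)
    return integrated_hazard(rt_av, t) / np.where(denom > 0, denom, np.nan)

# Synthetic illustration: redundant (AV) trials modeled as a race between modalities.
rng = np.random.default_rng(1)
rt_a = rng.gamma(shape=8, scale=60, size=400)                      # auditory-only RTs (ms)
rt_v = rng.gamma(shape=8, scale=70, size=400)                      # visual-only RTs (ms)
rt_av = np.minimum(rng.gamma(8, 60, 400), rng.gamma(8, 70, 400))   # audiovisual RTs (ms)
t = np.linspace(200, 900, 15)
print(np.round(capacity_or(rt_av, rt_a, rt_v, t), 2))
```

Because the synthetic AV trials are generated by an independent race between the two single-modality distributions, the printed C(t) values should hover near 1 (unlimited capacity); genuinely efficient integration would push them above 1.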

Perception of congruent and incongruent audiovisual speech stimuli

AVSP, 2005

Previous studies of audiovisual (AV) speech integration have used behavioral methods to examine perception of congruent and incongruent AV speech stimuli. Such studies have investigated responses to a relatively limited set of the possible incongruent combinations of AV speech stimuli. A central issue for examining a wider range of incongruent AV speech stimuli is developing a systematic method for alignment that will work with a wide variety of segments. In the present study, we investigated the use of three ...