The auditory and the visual percept evoked by the same audiovisual stimuli

2007

In analyses and models of audiovisual speech perception, it has been common to consider three percepts: (1) the auditory percept evoked by acoustic stimuli, (2) the visual percept evoked by optic stimuli and (3) a common percept evoked by synchronous optic and acoustic stimuli. Here, it is shown that a vocal percept that is heard and influenced by vision has to be distinguished from a gestural percept that is seen and influenced by audition. In the two experiments reported, syllables distinguished solely by their vowels [i], [y] or [e] were presented to phonetically sophisticated subjects auditorily, visually and in incongruently cross-dubbed audiovisual form. In the first experiment, the subjects rated the roundedness, lip spreading, openness and backness of the vowels they heard; in the second, of the vowels they saw. The results confirmed that roundedness is mainly heard by eye while openness is heard by ear. Heard backness (retraction) varied with the acoustic and optic presence of roundedness. Seen openness was substantially influenced by acoustic cues, while there was no such influence on seen roundedness. The results are discussed in the context of theories and models of audiovisual speech perception.

Audiovisual perception of openness and lip rounding in front vowels

Journal of Phonetics, 2007

Swedish nonsense syllables /ɡiɡ/, /ɡyɡ/, /ɡeɡ/ and /ɡøɡ/, produced by four speakers, were video-recorded and presented to male and female subjects in auditory, visual and audiovisual mode and also in cross-dubbed audiovisual form with incongruent cues to vowel openness, roundedness, or both. With audiovisual stimuli, subjects perceived openness nearly always by ear. Most subjects perceived roundedness by eye rather than by ear, although the auditory conditions were optimal and the sensation was an auditory one. This resulted in fused percepts: an acoustic /ɡeɡ/ dubbed onto an optic /ɡyɡ/, for example, was predominantly perceived as /ɡøɡ/. Since the acoustic cues to openness are prominent, while those to roundedness are less reliable, this lends support to the "information reliability hypothesis" in multisensory perception: the perception of a feature is dominated by the modality that provides the more reliable information. A mostly male minority relied less on vision; the between-gender difference was significant. Presence of lip rounding (a visibly marked feature) was noticed more easily than its absence. The influence of optic information was not fully explicable on the basis of the subjects' success rates in lipreading compared with auditory perception, and it was highest in stimuli produced by a speaker who smiled.
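
To make the "information reliability hypothesis" concrete: it is often formalized as inverse-variance (maximum-likelihood) weighting, in which each modality's estimate counts in proportion to its reliability. The sketch below is a minimal illustration of that idea, not the authors' model; all feature values and variances are hypothetical.

```python
import numpy as np

def fuse(estimates, variances):
    """Inverse-variance (maximum-likelihood) fusion of unimodal estimates.

    Each modality contributes in proportion to its reliability 1/variance,
    so the more reliable channel dominates the fused percept.
    """
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    weights /= weights.sum()
    return float(weights @ estimates)

# Hypothetical rounding estimates on a 0-1 scale (0 = unrounded, 1 = rounded).
# Audition gives a weak, noisy rounding cue; vision a strong, reliable one.
rounded_auditory, var_auditory = 0.2, 0.09  # acoustic [e]: unrounded, unreliable
rounded_visual, var_visual = 0.9, 0.01      # optic [y]: rounded, reliable

print(fuse([rounded_auditory, rounded_visual], [var_auditory, var_visual]))
# -> 0.83: vision dominates rounding, consistent with roundedness "by eye"
```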

The Effect of Incongruent Visual Cues on the Heard Quality of Front Vowels

2008

Swedish nonsense syllables, distinguished solely by their vowels [i], [y] or [e], were presented to phonetically sophisticated subjects auditorily, visually and in cross-dubbed audiovisual form with incongruent cues to openness, roundedness or both. Acoustic [y] dubbed onto optic [i] or [e] was heard as a retracted [i], while acoustic [i] or [e] dubbed onto optic [y] was perceived as rounded and slightly fronted. This confirms that the more reliable information carries the greater weight and that intermodal integration occurs at the level of phonetically informative properties, prior to any categorization.
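
The claim that integration precedes categorization can be illustrated with a fuse-then-categorize toy model: continuous per-feature estimates from ear and eye are blended first (openness weighted toward audition, roundedness toward vision), and only the fused representation is mapped to a vowel category. The prototypes and weights below are hypothetical, chosen only to reproduce the acoustic-[e]-plus-optic-[y] pattern reported in these studies.

```python
# A minimal fuse-then-categorize sketch: integrate continuous feature
# estimates across modalities first, categorize second. All feature
# values and weights are hypothetical, for illustration only.

# Feature space: (openness, roundedness), each on a 0-1 scale.
PROTOTYPES = {
    "i": (0.0, 0.0),  # close, unrounded
    "y": (0.0, 1.0),  # close, rounded
    "e": (0.5, 0.0),  # mid, unrounded
    "ø": (0.5, 1.0),  # mid, rounded
}

def perceive(auditory, visual, w_audio=(0.9, 0.2)):
    """Fuse per-feature estimates, then pick the nearest vowel prototype.

    w_audio gives audition's weight per feature: high for openness
    (a reliable acoustic cue), low for roundedness (vision dominates).
    """
    fused = tuple(w * a + (1 - w) * v
                  for a, v, w in zip(auditory, visual, w_audio))
    return min(PROTOTYPES,
               key=lambda s: sum((f - p) ** 2
                                 for f, p in zip(fused, PROTOTYPES[s])))

# Acoustic [e] dubbed onto optic [y]: openness from ear, rounding from eye.
print(perceive(PROTOTYPES["e"], PROTOTYPES["y"]))  # -> 'ø'
```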

Visual perception of vowels from static and dynamic cues

Journal of the Acoustical Society of America, 2018

The purpose of the study was to analyse human identification of Polish vowels from static and dynamic, durationally slowed visual cues. A total of 152 participants identified 6 Polish vowels produced by 4 speakers from static (still images) and dynamic (video) cues. The results show that 59% of static vowels and 63% of dynamic vowels were successfully identified. Vowels were strongly confused within the front, central, and back classes. Finally, correct identification strongly depended on the speaker, showing that speakers differ significantly in how “clearly” they produce vowel configurations.

Audiovisual perception of Swedish vowels with and without conflicting cues

2004

Auditory, visual and audiovisual syllables with and without conflicting vowel cues (/i y e ø/), presented to men and women, showed that (1) most subjects perceived roundedness by eye rather than by ear, (2) a mostly male minority relied less on vision, (3) the presence of lip rounding was noticed more easily than its absence, and (4) all subjects perceived openness by ear rather than by eye.

Visual speech discrimination and identification of natural and synthetic consonant stimuli

Frontiers in psychology, 2015

From phonetic features to connected discourse, every level of psycholinguistic structure including prosody can be perceived through viewing the talking face. Yet a longstanding notion in the literature is that visual speech perceptual categories comprise groups of phonemes (referred to as visemes), such as /p, b, m/ and /f, v/, whose internal structure is not informative to the visual speech perceiver. This conclusion has not to our knowledge been evaluated using a psychophysical discrimination paradigm. We hypothesized that perceivers can discriminate the phonemes within typical viseme groups, and that discrimination measured with d-prime (d') and response latency is related to visual stimulus dissimilarities between consonant segments. In Experiment 1, participants performed speeded discrimination for pairs of consonant-vowel spoken nonsense syllables that were predicted to be same, near, or far in their perceptual distances, and that were presented as natural or synthesized v...
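
For reference, d' in the equal-variance Gaussian model is the difference of the z-transformed hit and false-alarm rates, d' = z(H) − z(F). Below is a minimal computation with hypothetical response counts; this is the standard yes/no formula, not necessarily the exact same-different variant used in the study.

```python
from scipy.stats import norm

def d_prime(hits, false_alarms, n_signal, n_noise):
    """Equal-variance Gaussian d' = z(hit rate) - z(false-alarm rate).

    Rates of exactly 0 or 1 are nudged inward (a common correction)
    so the inverse normal transform stays finite.
    """
    h = min(max(hits / n_signal, 0.5 / n_signal), 1 - 0.5 / n_signal)
    f = min(max(false_alarms / n_noise, 0.5 / n_noise), 1 - 0.5 / n_noise)
    return norm.ppf(h) - norm.ppf(f)

# Hypothetical counts: 45 "different" responses on 50 different trials,
# 10 "different" responses on 50 same trials.
print(round(d_prime(45, 10, 50, 50), 2))  # -> 2.12
```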

The perceptual basis of the feature vowel height

2015

The present study investigated whether listeners perceptually map phonetic information to phonological feature categories or to phonemes. The test case is a phonological feature that occurs in most of the world’s languages, namely vowel height, and its acoustic correlate, the first formant (F1). We first simulated vowel discrimination in virtual listeners who perceive speech sounds through phonological features and virtual listeners who perceive through phonemes. The simulations revealed that feature listeners differed from phoneme listeners in their perceptual discrimination of F1 along a front-back boundary continuum as compared to a front (or back) continuum. The competing predictions of phoneme-based versus feature-based vowel discrimination were explicitly tested in real human listeners. The real listeners’ vowel discrimination did not resemble that of the simulated phoneme listeners, but was compatible with that of the simulated feature listeners. The findings suggest that humans perc...
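
The logic of such a simulation can be sketched as follows: under perceptual noise, a phoneme listener's labels depend jointly on F1 (height) and F2 (backness), so an F1 continuum placed at the front-back boundary yields extra label variability, whereas a feature listener's height labels depend on F1 alone and are unaffected by where the continuum sits along F2. The toy simulation below illustrates this contrast; the boundaries, noise levels and stimulus values are hypothetical, and this is not the authors' actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical category boundaries (Hz); not the authors' actual values.
F1_HEIGHT_BOUNDARY = 500     # low F1 = higher vowel, high F1 = lower vowel
F2_BACKNESS_BOUNDARY = 1500  # high F2 = front, low F2 = back

def labels(f1, f2, noise=60, n=10_000, by_feature=True):
    """Sample category labels for a stimulus under Gaussian perceptual noise."""
    h = (f1 + rng.normal(0, noise, n)) > F1_HEIGHT_BOUNDARY
    if by_feature:
        return h                      # feature listener: height only
    b = (f2 + rng.normal(0, noise, n)) > F2_BACKNESS_BOUNDARY
    return h.astype(int) * 2 + b      # phoneme listener: height x backness

def p_different(f1a, f1b, f2, **kw):
    """Discriminability proxy: how often the two stimuli get different labels."""
    return np.mean(labels(f1a, f2, **kw) != labels(f1b, f2, **kw))

for f2, where in [(2200, "front continuum"), (1500, "front-back boundary")]:
    pf = p_different(460, 540, f2, by_feature=True)
    pp = p_different(460, 540, f2, by_feature=False)
    print(f"{where}: feature={pf:.2f} phoneme={pp:.2f}")
# Feature listeners predict identical F1 discrimination on both continua;
# phoneme listeners predict a different pattern at the front-back boundary.
```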

Visual Influences on Perception of Speech and Nonspeech Vocal-Tract Events

Language and Speech, 2006

We report four experiments designed to determine whether visual information affects judgments of acoustically-specified nonspeech events as well as speech events (the "McGurk effect"). Previous findings have shown only weak McGurk effects for nonspeech stimuli, whereas strong effects are found for consonants. We used click sounds that serve as consonants in some African languages, but that are perceived as nonspeech by American English listeners. We found a significant McGurk effect for clicks presented in isolation that was much smaller than that found for stop-consonant-vowel syllables. In subsequent experiments, we found strong McGurk effects, comparable to those found for English syllables, for click-vowel syllables, and weak effects, comparable to those found for isolated clicks, for excised release bursts of stop consonants presented in isolation. We interpret these findings as evidence that the potential contributions of speech-specific processes to the McGurk effect are limited, and discuss the results in relation to current explanations for the McGurk effect.

A universal bias in adult vowel perception – By ear or by eye

Cognition, 2017

Speech perceivers are universally biased toward “focal” vowels (i.e., vowels whose adjacent formants are close in frequency, which concentrates acoustic energy into a narrower spectral region). This bias is demonstrated in phonetic discrimination tasks as a directional asymmetry: a change from a relatively less to a relatively more focal vowel results in significantly better performance than a change in the reverse direction. We investigated whether the critical information for this directional effect is limited to the auditory modality, or whether visible articulatory information provided by the speaker's face also plays a role. Unimodal auditory and visual as well as bimodal (auditory-visual) vowel stimuli were created from video recordings of a speaker producing variants of /u/, differing in both their degree of focalization and visible lip rounding (i.e., lip compression and protrusion). In Experiment 1, we confirmed that subjects showed an asymmetry while discriminating the auditory vowel stimuli. We then found, in Experiment 2, a similar asymmetry when subjects lipread those same vowels. In Experiment 3, we found asymmetries, comparable to those found for unimodal vowels, for bimodal vowels when the audio and visual channels were phonetically congruent. In contrast, when the audio and visual channels were phonetically incongruent (as in the “McGurk effect”), this asymmetry was disrupted. These findings collectively suggest that the perceptual processes underlying the “focal” vowel bias are sensitive to articulatory information available across sensory modalities, and raise foundational issues concerning the extent to which vowel perception derives from general-auditory or speech-gesture-specific processes.
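
A simple way to operationalize focalization (in the spirit of dispersion-focalization accounts, though not necessarily the metric used in this study) is to score a vowel by how close its adjacent formants sit on an auditory scale, for example as summed inverse squared Bark distances. The sketch below uses hypothetical formant values for two /u/ variants.

```python
import math

def bark(f_hz: float) -> float:
    """Traunmüller's (1990) Hz-to-Bark approximation."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def focalization(formants_hz):
    """Score focalization as summed inverse squared Bark distances between
    adjacent formants: closer formant pairs -> a more 'focal' vowel."""
    z = [bark(f) for f in sorted(formants_hz)]
    return sum(1.0 / (z2 - z1) ** 2 for z1, z2 in zip(z, z[1:]))

# Hypothetical /u/ variants (F1-F4, Hz): the first has a lower F2,
# i.e. F1 and F2 closer together, so it should score as more focal.
more_focal = [290, 650, 2250, 3300]
less_focal = [290, 950, 2250, 3300]
print(focalization(more_focal) > focalization(less_focal))  # -> True
```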