The word superiority effect in audiovisual speech perception
Related papers
Visual speech discrimination and identification of natural and synthetic consonant stimuli
Frontiers in psychology, 2015
From phonetic features to connected discourse, every level of psycholinguistic structure including prosody can be perceived through viewing the talking face. Yet a longstanding notion in the literature is that visual speech perceptual categories comprise groups of phonemes (referred to as visemes), such as /p, b, m/ and /f, v/, whose internal structure is not informative to the visual speech perceiver. This conclusion has not to our knowledge been evaluated using a psychophysical discrimination paradigm. We hypothesized that perceivers can discriminate the phonemes within typical viseme groups, and that discrimination measured with d-prime (d') and response latency is related to visual stimulus dissimilarities between consonant segments. In Experiment 1, participants performed speeded discrimination for pairs of consonant-vowel spoken nonsense syllables that were predicted to be same, near, or far in their perceptual distances, and that were presented as natural or synthesized v...
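For readers unfamiliar with the measure, the d-prime (d') index mentioned above is derived from hit and false-alarm rates under standard signal-detection assumptions. A minimal sketch, assuming a same/different discrimination design; the function name and example rates are illustrative, not taken from the paper:

```python
# Minimal sketch of discrimination sensitivity d' from hit and false-alarm rates.
from scipy.stats import norm

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """d' = z(hit rate) - z(false-alarm rate), using the inverse normal CDF."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Example: 85% hits on "different" pairs, 20% false alarms on "same" pairs.
print(d_prime(0.85, 0.20))  # ~1.88; larger values indicate better discrimination
```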
Visual Influences on Perception of Speech and Nonspeech Vocal-Tract Events
Language and Speech, 2006
We report four experiments designed to determine whether visual information affects judgments of acoustically-specified nonspeech events as well as speech events (the "McGurk effect"). Previous findings have shown only weak McGurk effects for nonspeech stimuli, whereas strong effects are found for consonants. We used click sounds that serve as consonants in some African languages, but that are perceived as nonspeech by American English listeners. We found a significant McGurk effect for clicks presented in isolation that was much smaller than that found for stop-consonant-vowel syllables. In subsequent experiments, we found strong McGurk effects, comparable to those found for English syllables, for click-vowel syllables, and weak effects, comparable to those found for isolated clicks, for excised release bursts of stop consonants presented in isolation. We interpret these findings as evidence that the potential contributions of speech-specific processes to the McGurk effect are limited, and discuss the results in relation to current explanations for the McGurk effect.
Audiovisual vowel monitoring and the word superiority effect in children
International Journal of Behavioral Development, 2012
The goal of this study was to explore whether viewing the speaker's articulatory gestures contributes to lexical access in children (ages 5-10) and in adults. We conducted a vowel monitoring task with words and pseudo-words in audio-only (AO) and audiovisual (AV) contexts with white noise masking the acoustic signal. The results indicated that children clearly benefited from visual speech from age 6-7 onwards. However, unlike adults, the word superiority effect was not greater in the AV than the AO condition in children, suggesting that visual speech mostly contributes to phonemic, rather than lexical, processing during childhood, at least until the age of 10.
The processing of audio-visual speech: empirical and neural bases
Philosophical Transactions of the Royal Society B: Biological Sciences, 2008
In this selective review, I outline a number of ways in which seeing the talker affects auditory perception of speech, including, but not confined to, the McGurk effect. To date, studies suggest that all linguistic levels are susceptible to visual influence, and that two main modes of processing can be described: a complementary mode, whereby vision provides information more efficiently than hearing for some under-specified parts of the speech stream, and a correlated mode, whereby vision partially duplicates information about dynamic articulatory patterning.
Scandinavian Journal of Psychology, 1991
Two aspects of visual speech processing in speechreading (word decoding and word discrimination) were tested in a group of 24 normal-hearing and a group of 20 hearing-impaired subjects. Word decoding and word discrimination performance were independent of factors related to the impairment, both in a quantitative and a qualitative sense. Decoding skill, but not discrimination skill, was associated with sentence-based speechreading. The results were interpreted such that, in order to represent a critical component process in sentence-based speechreading, the visual speech perception task must entail lexically induced processing as a task-demand. The theoretical status of the word decoding task as one operationalization of a speech decoding module was discussed (Fodor, 1983). An error analysis of performance in the word decoding/discrimination tasks suggested that the perception of heard stimuli, as well as the perception of lipped stimuli, was critically dependent on the same features; that is, the temporally initial phonetic segment of the word (cf. Marslen-Wilson, 1987). Implications for a theory of visual speech perception were discussed.
Audio-visual speech perception without speech cues
1996
A series of experiments was conducted in which listeners were presented with audio-visual sentences in a transcription task. The visual components of the stimuli consisted of a male talker's face. The acoustic components consisted of: (1) natural speech, (2) envelope-shaped noise which preserved the duration and amplitude of the original speech waveform, and (3) various types of sinewave speech signals that followed the formant frequencies of a natural utterance. Sinewave speech is a skeletonized version of a natural utterance which contains the frequency and amplitude variation of the formants, but lacks the fine-grained acoustic structure of speech. Intelligibility of the present set of sinewave sentences was relatively low in contrast to previous findings. However, intelligibility was greatly increased when visual information from a talker's face was presented along with the auditory stimuli. Further experiments demonstrated that the intelligibility of single tones increased differentially depending on which formant analog was presented. It was predicted that the increase in intelligibility for the sinewave speech with an added video display would be greater than the gain observed with envelope-shaped noise. This prediction is based on the assumption that the information-bearing phonetic properties of spoken utterances are preserved in the audio+visual sinewave conditions. The prediction was borne out for the tonal analog of the second formant (T2), but not the tonal analogs of the first formant (T1) or third formant (T3), suggesting that the information contained in the T2 analog is relevant for audiovisual integration.
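The sinewave speech described above can be approximated by replacing each formant with a single tone that tracks its frequency and amplitude. A toy sketch of that idea, with an assumed sample rate and invented formant tracks (my own illustration, not the authors' stimuli):

```python
# Toy sketch: replace each formant with a tone whose frequency follows the
# formant track, discarding the fine-grained acoustic structure of speech.
import numpy as np

fs = 16000                                   # sample rate (Hz), assumed
t = np.arange(0, 0.5, 1 / fs)                # 500 ms stimulus
# Hypothetical formant tracks (Hz) drifting over time, standing in for F1-F3
f_tracks = [500 + 100 * t, 1500 + 400 * t, 2500 + 200 * t]
amps = [1.0, 0.6, 0.3]                       # relative amplitude per tone analog

signal = np.zeros_like(t)
for f, a in zip(f_tracks, amps):
    phase = 2 * np.pi * np.cumsum(f) / fs    # integrate frequency to get phase
    signal += a * np.sin(phase)              # tone analog of one formant
signal /= np.max(np.abs(signal))             # normalize before playback
```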
A phoneme effect in visual word recognition
Cognition, 1998
In alphabetic writing systems like English or French, many words are composed of more letters than phonemes (e.g. BEACH is composed of five letters and three phonemes, i.e. /bitʃ/). This is due to the presence of higher order graphemes, that is, groups of letters that map into a single phoneme (e.g. EA and CH in BEACH map into the single phonemes /i/ and /tʃ/, respectively). The present study investigated the potential role of these subsyllabic components for the visual recognition of words in a perceptual identification task. In Experiment 1, we manipulated the number of phonemes in monosyllabic, low frequency, five-letter, English words, and found that identification times were longer for words with a small number of phonemes than for words with a large number of phonemes. In Experiment 2, this 'phoneme effect' was replicated in French for low frequency, but not for high frequency, monosyllabic words. These results suggest that subsyllabic components, also referred to as functional orthographic units, play a crucial role as elementary building blocks of visual word recognition.
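The letter/phoneme mismatch described above arises because multi-letter graphemes map to single phonemes. A hypothetical sketch of counting phonemes with a greedy grapheme lookup; the grapheme table and function are illustrative, not the study's materials:

```python
# Greedy grapheme-to-phoneme lookup: collapse multi-letter graphemes
# (EA, CH, ...) into single phonemes, otherwise treat a letter as one phoneme.
GRAPHEMES = {"ea": "i", "ch": "tʃ", "sh": "ʃ", "th": "θ"}

def phonemes(word: str) -> list[str]:
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] in GRAPHEMES:       # prefer two-letter graphemes
            out.append(GRAPHEMES[word[i:i + 2]])
            i += 2
        else:                                # rough one-letter fallback
            out.append(word[i])
            i += 1
    return out

print(len("beach"), phonemes("beach"))       # 5 letters, ['b', 'i', 'tʃ'] = 3 phonemes
```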
Time course of audio–visual phoneme identification: A cross-modal gating study
Seeing and Perceiving, 2012
When both are present, visual and auditory information are combined in order to decode the speech signal. Past research has addressed the extent to which visual information helps distinguish confusable speech sounds, but has usually ignored the continuous nature of speech perception. Here we examine the temporal course of the contribution of visual and auditory information during the process of speech perception. To this end, we designed an audio–visual gating task with videos recorded with a high-speed camera. Participants were asked to identify gradually longer fragments of pseudowords varying in the central consonant. Different Spanish consonant phonemes with different degrees of visual and acoustic saliency were included, and tested on visual-only, auditory-only and audio–visual trials. The data showed different patterns of contribution of unimodal and bimodal information during identification, depending on the visual saliency of the presented phonemes. In particular, for phonemes whic...
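The gating procedure described above presents progressively longer fragments of the same item. A rough sketch of how such fragments might be cut, assuming pre-recorded audio arrays and an arbitrary 40 ms gate step (not the authors' implementation):

```python
# Build gating stimuli: progressively longer fragments from item onset,
# in fixed-duration increments.
import numpy as np

def make_gates(samples: np.ndarray, fs: int, step_ms: float = 40.0) -> list[np.ndarray]:
    """Return gradually longer audio fragments (gates) of a recorded item."""
    step = int(fs * step_ms / 1000)
    return [samples[:end] for end in range(step, len(samples) + 1, step)]

# Example: a 300 ms pseudoword at 16 kHz yields gates of 40, 80, ..., 280 ms
item = np.random.randn(int(0.3 * 16000))
gates = make_gates(item, fs=16000)
print([len(g) / 16000 for g in gates])
```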