Romi Zäske - Academia.edu
Papers by Romi Zäske
The Oxford Handbook of Voice Perception
This chapter describes the putative representation of voices in human memory by looking at perceptual, acoustical, and cerebral correlates for the processing of speaker identity, gender, and age. Specifically, it outlines the factors that contribute to the variability in recognition accuracy for these speaker attributes, highlighting the importance of learning mechanisms. The chapter presents behavioural and neurological evidence for the notion that voices, similar to faces, are represented in memory relative to an average prototypical voice within a multidimensional acoustic space. While this account presents a useful framework to understand the neural coding of individual voices, it will take further research to elucidate how gender, age, and other speaker attributes are represented in this putative ‘voice space’. Furthermore, it remains to be determined whether and how the processing of various vocal cues may interact to shape everyday speaker perception.
Royal Society Open Science
Facial attractiveness has been linked to the averageness (or typicality) of a face and, more tentatively, to a speaker's vocal attractiveness, via the ‘honest signal’ hypothesis, holding that attractiveness signals good genes. In four experiments, we assessed ratings for attractiveness and two common measures of distinctiveness (‘distinctiveness-in-the-crowd’, DITC, and ‘deviation-based distinctiveness’, DEV) for faces and voices (simple vowels, or more naturalistic sentences) from 64 young adult speakers (32 female). Consistent and substantial negative correlations between attractiveness and DEV generally supported the averageness account of attractiveness, for both voices and faces. By contrast, and indicating that both measures of distinctiveness reflect different constructs, correlations between attractiveness and DITC were numerically positive for faces (though small and non-significant), and significant for voices in sentence stimuli. Between faces and voices, distincti...
The Oxford Handbook of Voice Perception
This chapter reviews both the development and current trends of research into human abilities to recognize other speakers by the voice, considering convergent evidence from behavioural psychological experiments, clinical case studies, and studies using methods from the cognitive neurosciences. First, substantial evidence suggests that the recognition and identification of voices of well-known speakers and the discrimination of speaker identity for unfamiliar voices represent separate abilities. Second, and unlike for other social signals such as vocal emotion, there is no unitary set of acoustic parameters that is crucial to voice-identity recognition. Third, although much current research points to the possibility that voice identity is represented in a norm-based manner, we currently lack a detailed computational model of the representation of individual known voices. Rapid technological progress may soon promote better understanding of dimensions of statistical variation between ...
Ear & Hearing, 2022
OBJECTIVES Research on cochlear implants (CIs) has focused on speech comprehension, with little research on perception of vocal emotions. We compared emotion perception in CI users and normal-hearing (NH) individuals, using parameter-specific voice morphing. DESIGN Twenty-five CI users and 25 NH individuals (matched for age and gender) performed fearful-angry discriminations on bisyllabic pseudoword stimuli from morph continua across all acoustic parameters (Full), or across selected parameters (F0, Timbre, or Time information), with other parameters set to a noninformative intermediate level. RESULTS Unsurprisingly, CI users as a group showed lower performance in vocal emotion perception overall. Importantly, while NH individuals used timbre and fundamental frequency (F0) information to equivalent degrees, CI users were far more efficient in using timbre (compared to F0) information for this task. Thus, under the conditions of this task, CIs were inefficient in conveying emotion based on F0 alone. There was enormous variability between CI users, with low performers responding close to guessing level. Echoing previous research, we found that better vocal emotion perception was associated with better quality of life ratings. CONCLUSIONS Some CI users can utilize timbre cues remarkably well when perceiving vocal emotions.
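Parameter-specific voice morphing interpolates each acoustic parameter independently between two reference recordings, so a continuum can vary one cue (e.g., F0) while others are held at a noninformative intermediate level. The sketch below illustrates only the underlying weighting idea with invented parameter summaries; the actual stimuli were created with dedicated morphing software operating on full speech signals.

```python
# Sketch of parameter-specific morphing (hypothetical values, not the
# study's stimuli): each acoustic parameter is interpolated independently
# between a fearful and an angry reference.

def interpolate(fearful, angry, w):
    """Linear interpolation; w=0 -> fearful value, w=1 -> angry value."""
    return (1 - w) * fearful + w * angry

def morph(params_fearful, params_angry, weights):
    """Blend each named parameter with its own morph weight."""
    return {name: interpolate(params_fearful[name], params_angry[name], weights[name])
            for name in params_fearful}

# Toy parameter summaries for one pseudoword (invented numbers).
fearful = {"f0_hz": 260.0, "timbre": 0.0, "duration_s": 0.55}
angry = {"f0_hz": 180.0, "timbre": 1.0, "duration_s": 0.70}

# An 'F0-only' morph step: F0 fully angry, other cues held at 50%.
step = morph(fearful, angry, {"f0_hz": 1.0, "timbre": 0.5, "duration_s": 0.5})
print(step)  # {'f0_hz': 180.0, 'timbre': 0.5, 'duration_s': 0.625}
```

Stepping the F0 weight through a series of values while keeping the other weights at 0.5 yields the kind of F0 continuum used in the discrimination task.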
Journal of Speech, Language, and Hearing Research, 2020
Purpose In their letter, Meister et al. (2020) appropriately point to a potential influence of stimulus type, arguing that cochlear implant (CI) users may have the ability to use timbre cues only for complex stimuli such as sentences, but not for brief stimuli such as vowel–consonant–vowel syllables or single words. While we cannot exclude this possibility on the basis of Skuk et al. (2020) alone, we hold that there is a strong need to consider the type of social signal (e.g., gender, age, emotion, speaker identity) to assess the profile of preserved and impaired aspects of voice processing in CI users. We discuss directions for further research to systematically consider interactive effects of stimulus type and social signal. In our view, this is crucial to understand and enhance nonverbal vocal perception skills that are relevant to successful communication with a CI.
The ability to recognize someone’s voice exists on a broad spectrum, with phonagnosia on the low end and super recognition at the high end. Yet there is no standardized test to measure an individual’s ability to learn and recognize newly learnt voices from samples with speech-like phonetic variability. We have developed the Jena Voice Learning and Memory Test (JVLMT), a 22-min test based on item response theory and applicable across languages. The JVLMT consists of three phases in which participants first become familiarized with eight speakers and then perform a three-alternative forced-choice recognition task, using pseudo-sentences devoid of semantic content. Acoustic (dis)similarity analyses were used to create items with different levels of difficulty. Test scores are based on 22 Rasch-conform items. Items were selected based on data from 232 participants and validated based on 454 participants in an online study. Mean accuracy is .51 (SD = .18). The JVLMT showed high and moderate correlations with...
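The item response theory behind "Rasch-conform items" can be summarized in one formula: the probability of a correct response depends only on the difference between a person's ability and an item's difficulty. A minimal sketch (with invented parameter values; the JVLMT's actual calibration is more involved):

```python
import math

# One-parameter (Rasch) item response model: P(correct) is a logistic
# function of ability minus item difficulty. Values are illustrative.

def rasch_p(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the model predicts 50%.
print(round(rasch_p(0.0, 0.0), 2))  # 0.5
# A harder item (higher difficulty) lowers the predicted probability.
print(rasch_p(0.0, 1.0) < rasch_p(0.0, -1.0))  # True
```

Note that for a three-alternative forced-choice task, chance performance is 1/3 rather than 0, so extensions of this basic model can add a guessing floor; the sketch above shows only the core Rasch relationship.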
The Handbook of Speech Perception, 2021
Behavior Research Methods, 2019
Here we describe the Jena Speaker Set (JESS), a free database for unfamiliar adult voice stimuli, comprising voices from 61 young (18-25 years) and 59 old (60-81 years) female and male speakers uttering various sentences, syllables, read text, semi-spontaneous speech, and vowels. Listeners rated two voice samples (short sentences) per speaker for attractiveness, likeability, two measures of distinctiveness ("deviation"-based [DEV] and "voice in the crowd"-based [VITC]), regional accent, and age. Interrater reliability was high, with Cronbach's α between .82 and .99. Young voices were generally rated as more attractive than old voices, but particularly so when male listeners judged female voices. Moreover, young female voices were rated as more likeable than both young male and old female voices. Young voices were judged to be less distinctive than old voices according to the DEV measure, with no differences in the VITC measure. In age ratings, listeners almost perfectly discriminated young from old voices; additionally, young female voices were perceived as being younger than young male voices. Correlations between the rating dimensions above demonstrated (among other things) that DEV-based distinctiveness was strongly negatively correlated with rated attractiveness and likeability. By contrast, VITC-based distinctiveness was uncorrelated with rated attractiveness and likeability in young voices, although a moderate negative correlation was observed for old voices. Overall, the present results demonstrate systematic effects of vocal age and gender on voice-based impressions and inform the selection of suitable voice stimuli for further research into voice perception, learning, and memory.
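The interrater reliability reported above is Cronbach's α, which treats raters as "items" and speakers as cases. A minimal sketch of the computation, on an invented ratings matrix (not JESS data):

```python
from statistics import variance

# Cronbach's alpha over raters: k = number of raters; the ratio compares
# the sum of per-rater variances to the variance of the summed scores.

def cronbach_alpha(ratings):
    """ratings: list of rater score lists, one inner list per rater."""
    k = len(ratings)
    rater_vars = sum(variance(r) for r in ratings)
    totals = [sum(scores) for scores in zip(*ratings)]
    return (k / (k - 1)) * (1 - rater_vars / variance(totals))

# Three hypothetical raters judging five voices on a 1-6 scale.
raters = [
    [5, 3, 4, 2, 6],
    [5, 2, 4, 3, 6],
    [4, 3, 5, 2, 6],
]
alpha = cronbach_alpha(raters)
print(round(alpha, 2))  # 0.95: high agreement, in the range reported above
```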
Psychological Research, 2019
The use of signs as a major means for communication affects other functions such as spatial processing. Intriguingly, this is true even for functions which are less obviously linked to language processing. Speakers using signs outperform non-signers in face recognition tasks, potentially as a result of a lifelong focus on the mouth region for speechreading. Against this background, we hypothesized that the processing of emotional faces is altered in persons using mostly signs for communication (henceforth named deaf signers). While the mouth region is more crucial for the recognition of happiness, the eye region matters more for recognizing anger. Using morphed faces, we created facial composites in which either the upper or lower half of an emotional face was kept neutral while the other half varied in intensity of the expressed emotion, being either happy or angry. As expected, deaf signers were more accurate at recognizing happy faces than non-signers. The reverse effect was found for angry faces. These differences between groups were most pronounced for facial expressions of low intensities. We conclude that the lifelong focus on the mouth region in deaf signers leads to more sensitive processing of happy faces, especially when expressions are relatively subtle.
Brain Research, 2019
Recent electrophysiological evidence suggests a rapid acquisition of novel speaker representations during intentional voice learning. We investigated effects of learning intention on voice recognition, using a variant of the directed forgetting paradigm. In an old/new recognition task following voice learning, we compared performance and event-related brain potentials (ERPs) for studied voices, half of which had been prompted to be remembered (TBR) or forgotten (TBF). Furthermore, to assess incidental encoding of episodic information, participants indicated for each recognized test voice the ear of presentation during study. During study, TBR voices elicited more positive ERPs than TBF voices (from ~250 ms), possibly reflecting deeper voice encoding. In parallel, subsequent recognition performance was higher for TBR than for TBF voices. Importantly, above-chance recognition for both learning conditions nevertheless suggested a degree of non-intentional voice learning. In a surprise episodic memory test for voice location, above-chance performance was observed for TBR voices only, suggesting that episodic memory for ear of presentation depended on intentional voice encoding. At test, a left posterior ERP OLD/NEW effect for both TBR and TBF voices (from ~500 ms) reflected recognition of studied voices under both encoding conditions. By contrast, a right frontal ERP OLD/NEW effect for TBF voices only (from ~800 ms) possibly reflected additional elaborative retrieval processes. Overall, we show that ERPs are sensitive 1) to strategic voice encoding during study (from ~250 ms), and 2) to voice recognition at test (from ~500 ms), with the specific pattern of ERP OLD/NEW effects partly depending on previous encoding intention.
Facial attractiveness has been linked to the averageness (or typicality) of a face. More tentatively, it has also been linked to a speaker’s vocal attractiveness, via the “honest signal” hypothesis, holding that attractiveness signals good genes. In four experiments, we assessed ratings for attractiveness and two common measures of distinctiveness (“distinctiveness-in-the-crowd”, DITC, and “deviation-based distinctiveness”, DEV) for faces and voices (vowels or sentences) from 64 young adult speakers (32 female). Consistent and strong negative correlations between attractiveness and DEV generally supported the averageness account of attractiveness for both voices and faces. By contrast, indicating that both measures of distinctiveness reflect different constructs, correlations between attractiveness and DITC were numerically positive for faces (though small and non-significant), and significant for voices in sentence stimuli. As the only exception, voice ratings based on vowels exhibite...
It has been hypothesized that visual perspective-taking, a basic Theory of Mind mechanism, might operate quite automatically, particularly in terms of ‘what’ someone else sees. As such, we were interested in whether different social categories of an agent (e.g., gender, race, nationality) influence this mental state ascription mechanism. We tested this assumption by investigating the Samson level-1 visual perspective-taking paradigm using agents with different ethnic nationality appearances. A group of self-identified Turkish and German participants were asked to make visual perspective judgments from their own perspective (self-judgment) as well as from the perspective of a prototypical Turkish or German agent (other-judgment). The respective related interference effects - altercentric and egocentric interferences - were measured. When making other-judgments, German participants showed increased egocentric interferences for Turkish compared to German agents. Turkish participants show...
Attention, Perception, & Psychophysics, 2016
Adaptation to female voices causes subsequent voices to be perceived as more male, and vice versa. This contrastive aftereffect disappears under spatial inattention to adaptors, suggesting that voices are not encoded automatically. According to Lavie, Hirst, de Fockert, and Viding (2004), the processing of task-irrelevant stimuli during selective attention depends on perceptual resources and working memory. Possibly due to their social significance, faces may be an exceptional domain: that is, task-irrelevant faces can escape perceptual load effects. Here we tested voice processing, to study whether voice gender aftereffects (VGAEs) depend on low or high perceptual (Exp. 1) or working memory (Exp. 2) load in a relevant visual task. Participants adapted to irrelevant voices while either searching digit displays for a target (Exp. 1) or recognizing studied digits (Exp. 2). We found that the VGAE was unaffected by perceptual load, indicating that task-irrelevant voices, like faces, can also escape perceptual-load effects. Intriguingly, the VGAE was increased under high memory load. Therefore, visual working memory load, but not general perceptual load, determines the processing of task-irrelevant voices.
PLOS ONE, 2015
Recognition of personally familiar voices benefits from the concurrent presentation of the corresponding speakers' faces. This effect of audiovisual integration is most pronounced for voices combined with dynamic articulating faces. However, it is unclear if learning unfamiliar voices also benefits from audiovisual face-voice integration or, alternatively, is hampered by attentional capture of faces, i.e., "face-overshadowing". In six study-test cycles we compared the recognition of newly-learned voices following unimodal voice learning vs. bimodal face-voice learning with either static (Exp. 1) or dynamic articulating faces (Exp. 2). Voice recognition accuracies significantly increased for bimodal learning across study-test cycles while remaining stable for unimodal learning, as reflected in numerical costs of bimodal relative to unimodal voice learning in the first two study-test cycles and benefits in the last two cycles. This was independent of whether faces were static images (Exp. 1) or dynamic videos (Exp. 2). In both experiments, slower reaction times to voices previously studied with faces compared to voices only may result from visual search for faces during memory retrieval. A general decrease of reaction times across study-test cycles suggests facilitated recognition with more speaker repetitions. Overall, our data suggest two simultaneous and opposing mechanisms during bimodal face-voice learning: while attentional capture of faces may initially impede voice learning, audiovisual integration may facilitate it thereafter.
Journal of Neuroscience, 2014
Listeners can recognize familiar human voices from variable utterances, suggesting the acquisition of speech-invariant voice representations during familiarization. However, the neurocognitive mechanisms mediating learning and recognition of voices from natural speech are currently unknown. Using electrophysiology, we investigated how representations are formed during intentional learning of initially unfamiliar voices that were later recognized among novel voices. To probe the acquisition of speech-invariant voice representations, we compared a "same sentence" condition, in which speakers repeated the study utterances at test, and a "different sentence" condition. Although recognition performance was higher for same compared with different sentences, substantial voice learning also occurred for different sentences, with recognition performance increasing across consecutive study-test cycles. During study, event-related potentials elicited by voices subsequently remembered showed a larger sustained parietal positivity (~250-1400 ms) compared with subsequently forgotten voices. This difference due to memory was unaffected by test sentence condition and may thus reflect the acquisition of speech-invariant voice representations. At test, voices correctly classified as "old" elicited a larger late positive component (300-700 ms) at Pz than voices correctly classified as "new". This event-related potential OLD/NEW effect was limited to the same sentence condition and may thus reflect speech-dependent retrieval of voices from episodic memory. Importantly, a speech-independent effect for learned compared with novel voices was found in beta-band oscillations (16-17 Hz) between 290 and 370 ms at central and right temporal sites. Our results are a first step toward elucidating the electrophysiological correlates of voice learning and recognition.
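An ERP OLD/NEW effect like the late positive component above is typically quantified as the mean amplitude in a time window, old condition minus new condition. A minimal sketch with invented single-channel averages (not data from the study):

```python
# Sketch (invented numbers) of quantifying an ERP OLD/NEW effect as a
# mean-amplitude difference in a fixed time window.

def mean_amplitude(samples, times, t_start, t_end):
    """Average voltage of samples whose timestamps fall in [t_start, t_end)."""
    window = [v for v, t in zip(samples, times) if t_start <= t < t_end]
    return sum(window) / len(window)

# Toy condition averages (microvolts) sampled every 100 ms at one site.
times = [0, 100, 200, 300, 400, 500, 600, 700]
erp_old = [0.0, 0.2, 0.5, 2.0, 2.4, 2.2, 1.0, 0.5]
erp_new = [0.0, 0.1, 0.4, 0.8, 1.0, 0.9, 0.6, 0.4]

# Late-positive-component window, 300-700 ms.
old_new_effect = (mean_amplitude(erp_old, times, 300, 700)
                  - mean_amplitude(erp_new, times, 300, 700))
print(old_new_effect > 0)  # positive: larger positivity for 'old' voices
```

In practice this difference is computed per participant and electrode and then tested statistically; the sketch shows only the windowed amplitude measure itself.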
Wiley Interdisciplinary Reviews: Cognitive Science, 2013
While humans use their voice mainly for communicating information about the world, paralinguistic cues in the voice signal convey rich dynamic information about a speaker's arousal and emotional state, and extralinguistic cues reflect more stable speaker characteristics including identity, biological sex and social gender, socioeconomic or regional background, and age. Here we review the anatomical and physiological bases for individual differences in the human voice, before discussing how recent methodological progress in voice morphing and voice synthesis has promoted research on current theoretical issues, such as how voices are mentally represented in the human brain. Special attention is dedicated to the distinction between the recognition of familiar and unfamiliar speakers, in everyday situations or in the forensic context, and to the processes and representational changes that accompany the learning of new voices. We describe how specific impairments and individual differences in voice perception could relate to specific brain correlates. Finally, we consider that voices are produced by speakers who are often visible during communication, and review recent evidence that shows how speaker perception involves dynamic face-voice integration. The representation of para- and extralinguistic vocal information plays a major role in person perception and social communication, could be neuronally encoded in a prototype-referenced manner, and is subject to flexible adaptive recalibration as a result of specific perceptual experience. WIREs Cogn Sci 2014, 5:15-25. doi: 10.1002/wcs.1261
Vision Research, 2010
Adaptation influences perception not only of simple stimulus qualities such as motion or colour, but also of complex stimuli such as faces. Here we demonstrate contrastive aftereffects of adaptation to facial age. In Experiment 1, participants adapted to either young or old faces, and subsequently estimated the age of morphed test faces with interpolated ages of 30, 40, 50 or 60 years. Following adaptation to old adaptors, test faces were classified as much younger when compared to classifications of the same test faces following adaptation to young faces, which in turn caused subjective test face "aging". These aftereffects were reduced but remained clear even when facial gender changed between adaptor and test faces. In Experiment 2, we induced simultaneous opposite age aftereffects for female and male faces. Overall, these results demonstrate interactions in the perception of facial age and gender, and support dissociable neuronal coding of male and female faces.
The Oxford Handbook of Voice Perception
This chapter describes the putative representation of voices in human memory by looking at percep... more This chapter describes the putative representation of voices in human memory by looking at perceptual, acoustical, and cerebral correlates for the processing of speaker identity, gender, and age. Specifically, it outlines the factors that contribute to the variability in recognition accuracy for these speaker attributes, highlighting the importance of learning mechanisms. The chapter presents behavioural and neurological evidence for the notion that voices, similar to faces, are represented in memory relative to an average prototypical voice within a multidimensional acoustic space. While this account presents a useful framework to understand the neural coding of individual voices, it will take further research to elucidate how gender, age, and other speaker attributes are represented in this putative ‘voice space’. Furthermore, it remains to be determined whether and how the processing of various vocal cues may interact to shape everyday speaker perception.
Royal Society Open Science
Facial attractiveness has been linked to the averageness (or typicality) of a face and, more tent... more Facial attractiveness has been linked to the averageness (or typicality) of a face and, more tentatively, to a speaker's vocal attractiveness, via the ‘honest signal’ hypothesis, holding that attractiveness signals good genes. In four experiments, we assessed ratings for attractiveness and two common measures of distinctiveness (‘distinctiveness-in-the-crowd’, DITC and ‘deviation-based distinctiveness', DEV) for faces and voices (simple vowels, or more naturalistic sentences) from 64 young adult speakers (32 female). Consistent and substantial negative correlations between attractiveness and DEV generally supported the averageness account of attractiveness, for both voices and faces. By contrast, and indicating that both measures of distinctiveness reflect different constructs, correlations between attractiveness and DITC were numerically positive for faces (though small and non-significant), and significant for voices in sentence stimuli. Between faces and voices, distincti...
The Oxford Handbook of Voice Perception
This chapter reviews both the development and current trends of research into human abilities to ... more This chapter reviews both the development and current trends of research into human abilities to recognize other speakers by the voice, considering convergent evidence from behavioural psychological experiments, clinical case studies, and studies using methods from the cognitive neurosciences. First, substantial evidence suggests that the recognition and identification of voices of well-known speakers and the discrimination of speaker identity for unfamiliar voices represent separate abilities. Second, and unlike for other social signals such as vocal emotion, there is no unitary set of acoustic parameters that is crucial to voice-identity recognition. Third, although much current research points to the possibility that voice identity is represented in a norm-based manner, we currently lack a detailed computational model of the representation of individual known voices. Rapid technological progress may soon promote better understanding of dimensions of statistical variation between ...
Facial attractiveness has been linked to the averageness (or typicality) of a face and, more tent... more Facial attractiveness has been linked to the averageness (or typicality) of a face and, more tentatively, to a speaker's vocal attractiveness, via the 'honest signal' hypothesis, holding that attractiveness signals good genes. In four experiments, we assessed ratings for attractiveness and two common measures of distinctiveness ('distinctiveness-in-the-crowd', DITC and 'deviation-based distinctiveness', DEV) for faces and voices (simple vowels, or more naturalistic sentences) from 64 young adult speakers (32 female). Consistent and substantial negative correlations between attractiveness and DEV generally supported the averageness account of attractiveness, for both voices and faces. By contrast, and indicating that both measures of distinctiveness reflect different constructs, correlations between attractiveness and DITC were numerically positive for faces (though small and non-significant), and significant for voices in sentence stimuli. Between faces a...
Ear & Hearing, 2022
OBJECTIVES Research on cochlear implants (CIs) has focused on speech comprehension, with little r... more OBJECTIVES Research on cochlear implants (CIs) has focused on speech comprehension, with little research on perception of vocal emotions. We compared emotion perception in CI users and normal-hearing (NH) individuals, using parameter-specific voice morphing. DESIGN Twenty-five CI users and 25 NH individuals (matched for age and gender) performed fearful-angry discriminations on bisyllabic pseudoword stimuli from morph continua across all acoustic parameters (Full), or across selected parameters (F0, Timbre, or Time information), with other parameters set to a noninformative intermediate level. RESULTS Unsurprisingly, CI users as a group showed lower performance in vocal emotion perception overall. Importantly, while NH individuals used timbre and fundamental frequency (F0) information to equivalent degrees, CI users were far more efficient in using timbre (compared to F0) information for this task. Thus, under the conditions of this task, CIs were inefficient in conveying emotion based on F0 alone. There was enormous variability between CI users, with low performers responding close to guessing level. Echoing previous research, we found that better vocal emotion perception was associated with better quality of life ratings. CONCLUSIONS Some CI users can utilize timbre cues remarkably well when perceiving vocal emotions.
Journal of Speech, Language, and Hearing Research, 2020
Purpose In their letter, Meister et al. (2020) appropriately point to a potential influence of st... more Purpose In their letter, Meister et al. (2020) appropriately point to a potential influence of stimulus type, arguing cochlear implant (CI) users may have the ability to use timbre cues only for complex stimuli such as sentences but not for brief stimuli such as vowel–consonant–vowel or single words. While we cannot exclude this possibility on the basis of Skuk et al. (2020) alone, we hold that there is a strong need to consider type of social signal (e.g., gender, age, emotion, speaker identity) to assess the profile of preserved and impaired aspects of voice processing in CI users. We discuss directions for further research to systematically consider interactive effects of stimulus type and social signal. In our view, this is crucial to understand and enhance nonverbal vocal perception skills that are relevant to successful communication with a CI.
The ability to recognize someone’s voice exists on a broad spectrum with phonagnosia on the low e... more The ability to recognize someone’s voice exists on a broad spectrum with phonagnosia on the low end and super recognition at the high end. Yet there is no standardized test to measure an individual’s ability of learning and recognizing newly-learnt voices with samples of speech-like phonetic variability. We have developed the Jena Voice Learning and Memory Test (JVLMT), a 22min-test based on item response theory and applicable across languages. The JVLMT consists of three phases in which participants first become familiarized with eight speakers and then perform a three-alternative forced choice recognition task, using pseudo sentences devoid of semantics. Acoustic (dis)similarity analyses were used to create items with different levels of difficulty. Test scores are based on 22 Rasch-conform items. Items were selected based on 232 and validated based on 454 participants in an online study. Mean accuracy is 0.51 with an SD of .18. The JVLMT showed high and moderate correlations with...
The Handbook of Speech Perception, 2021
Behavior Research Methods, 2019
Here we describe the Jena Speaker Set (JESS), a free database for unfamiliar adult voice stimuli,... more Here we describe the Jena Speaker Set (JESS), a free database for unfamiliar adult voice stimuli, comprising voices from 61 young (18-25 years) and 59 old (60-81 years) female and male speakers uttering various sentences, syllables, read text, semi-spontaneous speech, and vowels. Listeners rated two voice samples (short sentences) per speaker for attractiveness, likeability, two measures of distinctiveness (Bdeviation^-based [DEV] and Bvoice in the crowd^-based [VITC]), regional accent, and age. Interrater reliability was high, with Cronbach's α between .82 and .99. Young voices were generally rated as more attractive than old voices, but particularly so when male listeners judged female voices. Moreover, young female voices were rated as more likeable than both young male and old female voices. Young voices were judged to be less distinctive than old voices according to the DEV measure, with no differences in the VITC measure. In age ratings, listeners almost perfectly discriminated young from old voices; additionally, young female voices were perceived as being younger than young male voices. Correlations between the rating dimensions above demonstrated (among other things) that DEV-based distinctiveness was strongly negatively correlated with rated attractiveness and likeability. By contrast, VITC-based distinctiveness was uncorrelated with rated attractiveness and likeability in young voices, although a moderate negative correlation was observed for old voices. Overall, the present results demonstrate systematic effects of vocal age and gender on impressions based on the voice and inform as to the selection of suitable voice stimuli for further research into voice perception, learning, and memory.
Psychological Research, 2019
The use of signs as a major means for communication affects other functions such as spatial processing. Intriguingly, this is true even for functions which are less obviously linked to language processing. Speakers using signs outperform non-signers in face recognition tasks, potentially as a result of a lifelong focus on the mouth region for speechreading. Against this background, we hypothesized that the processing of emotional faces is altered in persons using mostly signs for communication (henceforth named deaf signers). While for the recognition of happiness the mouth region is more crucial, the eye region matters more for recognizing anger. Using morphed faces, we created facial composites in which either the upper or lower half of an emotional face was kept neutral while the other half varied in intensity of the expressed emotion, being either happy or angry. As expected, deaf signers were more accurate at recognizing happy faces than non-signers. The reverse effect was found for angry faces. These differences between groups were most pronounced for facial expressions of low intensities. We conclude that the lifelong focus on the mouth region in deaf signers leads to more sensitive processing of happy faces, especially when expressions are relatively subtle.
Brain Research, 2019
Recent electrophysiological evidence suggests a rapid acquisition of novel speaker representations during intentional voice learning. We investigated effects of learning intention on voice recognition, using a variant of the directed forgetting paradigm. In an old/new recognition task following voice learning, we compared performance and event-related brain potentials (ERPs) for studied voices, half of which had been prompted to be remembered (TBR) or forgotten (TBF). Furthermore, to assess incidental encoding of episodic information, participants indicated for each recognized test voice the ear of presentation during study. During study, TBR voices elicited more positive ERPs than TBF voices (from ~250 ms), possibly reflecting deeper voice encoding. In parallel, subsequent recognition performance was higher for TBR than for TBF voices. Importantly, above-chance recognition for both learning conditions nevertheless suggested a degree of non-intentional voice learning. In a surprise episodic memory test for voice location, above-chance performance was observed for TBR voices only, suggesting that episodic memory for ear of presentation depended on intentional voice encoding. At test, a left posterior ERP OLD/NEW effect for both TBR and TBF voices (from ~500 ms) reflected recognition of studied voices under both encoding conditions. By contrast, a right frontal ERP OLD/NEW effect for TBF voices only (from ~800 ms) possibly reflected additional elaborative retrieval processes. Overall, we show that ERPs are sensitive 1) to strategic voice encoding during study (from ~250 ms), and 2) to voice recognition at test (from ~500 ms), with the specific pattern of ERP OLD/NEW effects partly depending on previous encoding intention.
Facial attractiveness has been linked to the averageness (or typicality) of a face. More tentatively, it has also been linked to a speaker’s vocal attractiveness, via the “honest signal” hypothesis, holding that attractiveness signals good genes. In four experiments, we assessed ratings for attractiveness and two common measures of distinctiveness (“distinctiveness-in-the-crowd”, DITC, and “deviation-based distinctiveness”, DEV) for faces and voices (vowels or sentences) from 64 young adult speakers (32 female). Consistent and strong negative correlations between attractiveness and DEV generally supported the averageness account of attractiveness for both voices and faces. By contrast, indicating that both measures of distinctiveness reflect different constructs, correlations between attractiveness and DITC were numerically positive for faces (though small and non-significant), and significant for voices in sentence stimuli. As the only exception, voice ratings based on vowels exhibited…
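Deviation-based distinctiveness of the DEV kind is commonly operationalized as a stimulus's distance from the sample average in some feature space. A toy sketch of that idea (the two-dimensional feature space and the Euclidean metric are illustrative assumptions, not the authors' exact measure, which used listener/viewer ratings):

```python
import numpy as np

def deviation_distinctiveness(features):
    """DEV-style distinctiveness: Euclidean distance of each
    stimulus from the sample average ('prototype') in feature space."""
    features = np.asarray(features, dtype=float)
    prototype = features.mean(axis=0)  # the average face/voice
    return np.linalg.norm(features - prototype, axis=1)

# The averageness account predicts that stimuli far from the
# prototype (high DEV) are rated as less attractive.
feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
dev = deviation_distinctiveness(feats)
```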
It has been hypothesized that visual perspective-taking, a basic Theory of Mind mechanism, might operate quite automatically, particularly in terms of ‘what’ someone else sees. As such, we were interested in whether different social categories of an agent (e.g., gender, race, nationality) influence this mental state ascription mechanism. We tested this assumption by investigating the Samson level-1 visual perspective-taking paradigm using agents with different ethnic nationality appearances. A group of self-identified Turkish and German participants were asked to make visual perspective judgments from their own perspective (self-judgment) as well as from the perspective of a prototypical Turkish or German agent (other-judgment). The respective related interference effects - altercentric and egocentric interferences - were measured. When making other-judgments, German participants showed increased egocentric interferences for Turkish compared to German agents. Turkish participants show…
Attention, Perception, & Psychophysics, 2016
Adaptation to female voices causes subsequent voices to be perceived as more male, and vice versa. This contrastive aftereffect disappears under spatial inattention to adaptors, suggesting that voices are not encoded automatically. According to Lavie, Hirst, de Fockert, and Viding (2004), the processing of task-irrelevant stimuli during selective attention depends on perceptual resources and working memory. Possibly due to their social significance, faces may be an exceptional domain: that is, task-irrelevant faces can escape perceptual-load effects. Here we tested voice processing, to study whether voice gender aftereffects (VGAEs) depend on low or high perceptual (Exp. 1) or working memory (Exp. 2) load in a relevant visual task. Participants adapted to irrelevant voices while either searching digit displays for a target (Exp. 1) or recognizing studied digits (Exp. 2). We found that the VGAE was unaffected by perceptual load, indicating that task-irrelevant voices, like faces, can also escape perceptual-load effects. Intriguingly, the VGAE was increased under high memory load. Therefore, visual working memory load, but not general perceptual load, determines the processing of task-irrelevant voices.
PLOS ONE, 2015
Recognition of personally familiar voices benefits from the concurrent presentation of the corresponding speakers' faces. This effect of audiovisual integration is most pronounced for voices combined with dynamic articulating faces. However, it is unclear if learning unfamiliar voices also benefits from audiovisual face-voice integration or, alternatively, is hampered by attentional capture of faces, i.e., "face-overshadowing". In six study-test cycles we compared the recognition of newly-learned voices following unimodal voice learning vs. bimodal face-voice learning with either static (Exp. 1) or dynamic articulating faces (Exp. 2). Voice recognition accuracies significantly increased for bimodal learning across study-test cycles while remaining stable for unimodal learning, as reflected in numerical costs of bimodal relative to unimodal voice learning in the first two study-test cycles and benefits in the last two cycles. This was independent of whether faces were static images (Exp. 1) or dynamic videos (Exp. 2). In both experiments, slower reaction times to voices previously studied with faces compared to voices only may result from visual search for faces during memory retrieval. A general decrease of reaction times across study-test cycles suggests facilitated recognition with more speaker repetitions. Overall, our data suggest two simultaneous and opposing mechanisms during bimodal face-voice learning: while attentional capture of faces may initially impede voice learning, audiovisual integration may facilitate it thereafter.
Journal of Neuroscience, 2014
Listeners can recognize familiar human voices from variable utterances, suggesting the acquisition of speech-invariant voice representations during familiarization. However, the neurocognitive mechanisms mediating learning and recognition of voices from natural speech are currently unknown. Using electrophysiology, we investigated how representations are formed during intentional learning of initially unfamiliar voices that were later recognized among novel voices. To probe the acquisition of speech-invariant voice representations, we compared a "same sentence" condition, in which speakers repeated the study utterances at test, and a "different sentence" condition. Although recognition performance was higher for same compared with different sentences, substantial voice learning also occurred for different sentences, with recognition performance increasing across consecutive study-test cycles. During study, event-related potentials elicited by voices subsequently remembered showed a larger sustained parietal positivity (~250-1400 ms) compared with subsequently forgotten voices. This difference due to memory was unaffected by test sentence condition and may thus reflect the acquisition of speech-invariant voice representations. At test, voices correctly classified as "old" elicited a larger late positive component (300-700 ms) at Pz than voices correctly classified as "new." This event-related potential OLD/NEW effect was limited to the same sentence condition and may thus reflect speech-dependent retrieval of voices from episodic memory. Importantly, a speech-independent effect for learned compared with novel voices was found in beta band oscillations (16-17 Hz) between 290 and 370 ms at central and right temporal sites. Our results are a first step toward elucidating the electrophysiological correlates of voice learning and recognition.
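An ERP OLD/NEW effect of the kind reported here is typically quantified as the difference in mean amplitude between old- and new-item trials within a time window at a given electrode. A schematic sketch with simulated single-electrode data (the 500 Hz sampling rate and the waveforms are made up for illustration):

```python
import numpy as np

def mean_amplitude(erp, times, t_start, t_end):
    """Mean ERP amplitude (in microvolts) within [t_start, t_end] ms."""
    mask = (times >= t_start) & (times <= t_end)
    return erp[mask].mean()

# Simulated averaged ERPs sampled every 2 ms (500 Hz), -200 to 998 ms.
times = np.arange(-200, 1000, 2)
old_erp = np.where((times >= 300) & (times <= 700), 2.0, 0.0)
new_erp = np.zeros_like(times, dtype=float)

# OLD/NEW effect: old minus new in the 300-700 ms window.
effect = (mean_amplitude(old_erp, times, 300, 700)
          - mean_amplitude(new_erp, times, 300, 700))  # 2.0
```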
Wiley Interdisciplinary Reviews: Cognitive Science, 2013
While humans use their voice mainly for communicating information about the world, paralinguistic cues in the voice signal convey rich dynamic information about a speaker's arousal and emotional state, and extralinguistic cues reflect more stable speaker characteristics including identity, biological sex and social gender, socioeconomic or regional background, and age. Here we review the anatomical and physiological bases for individual differences in the human voice, before discussing how recent methodological progress in voice morphing and voice synthesis has promoted research on current theoretical issues, such as how voices are mentally represented in the human brain. Special attention is dedicated to the distinction between the recognition of familiar and unfamiliar speakers, in everyday situations or in the forensic context, and on the processes and representational changes that accompany the learning of new voices. We describe how specific impairments and individual differences in voice perception could relate to specific brain correlates. Finally, we consider that voices are produced by speakers who are often visible during communication, and review recent evidence that shows how speaker perception involves dynamic face-voice integration. The representation of para- and extralinguistic vocal information plays a major role in person perception and social communication, could be neuronally encoded in a prototype-referenced manner, and is subject to flexible adaptive recalibration as a result of specific perceptual experience. WIREs Cogn Sci 2014, 5:15-25. doi: 10.1002/wcs.1261
Vision Research, 2010
Adaptation influences perception not only of simple stimulus qualities such as motion or colour, but also of complex stimuli such as faces. Here we demonstrate contrastive aftereffects of adaptation to facial age. In Experiment 1, participants adapted to either young or old faces, and subsequently estimated the age of morphed test faces with interpolated ages of 30, 40, 50 or 60 years. Following adaptation to old adaptors, test faces were classified as much younger when compared to classifications of the same test faces following adaptation to young faces, which in turn caused subjective test face “aging”. These aftereffects were reduced but remained clear even when facial gender changed between adaptor and test faces. In Experiment 2, we induced simultaneous opposite age aftereffects for female and male faces. Overall, these results demonstrate interactions in the perception of facial age and gender, and support dissociable neuronal coding of male and female faces.
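Morphing between a young and an old exemplar can be thought of as linear interpolation in a face parameter space. A toy sketch under that assumption (the scalar 'age parameter' stands in for a full morphable-face representation):

```python
import numpy as np

def morph(young, old, weight):
    """Linear interpolation of face parameters: weight=0 returns
    the young exemplar, weight=1 the old exemplar."""
    young, old = np.asarray(young, dtype=float), np.asarray(old, dtype=float)
    return (1.0 - weight) * young + weight * old

# Interpolating between hypothetical exemplar ages 20 and 70 at
# equal steps yields intermediate 'ages' of 30, 40, 50 and 60.
ages = [float(morph(20.0, 70.0, w)) for w in (0.2, 0.4, 0.6, 0.8)]
```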