Interactional Adequacy as a Factor in the Perception of Synthesized Speech
Related papers
ProSynth: an integrated prosodic approach to device-independent, natural-sounding speech synthesis
Computer Speech & Language, 2000
This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the acoustic richness of the speech signal reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by paying attention to systematic phonetic detail in the spectral, temporal and intonational domains produces a perceptually robust signal that is intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech. We present examples of our approach to modelling systematic segmental, temporal and intonational detail and show how all are integrated in the prosodic structure. Preliminary tests to evaluate the effects of modelling systematic fine spectral detail, timing, and intonation suggest that the approach increases intelligibility and naturalness.
Sixth ISCA Workshop on Speech Synthesis
2007
In this paper, we report on an experiment that tested users' ability to understand the content of spoken auditory reminders. Users heard meeting reminders and medication reminders spoken in both a natural and a synthetic voice. Our results show that older users can understand synthetic speech as well as younger users provided that the prompt texts are well-designed, using familiar words and contextual cues. As soon as unfamiliar and complex words are introduced, users' hearing affects how well they can understand the synthetic voice, even if their hearing would pass common screening tests for speech synthesis experiments. Although hearing thresholds correlate best with users' performance, central auditory processing may also influence performance, especially when complex errors are made.
Synthesis of listener vocalizations: towards interactive speech synthesis
2011
Spoken and multi-modal dialogue systems are starting to use listener vocalizations, such as uh-huh and mm-hm, for natural interaction. Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: Where should a listener vocalization be synthesized? What meaning should be conveyed through the synthesized vocalization? And how can an appropriate listener vocalization with the intended meaning be realized? This thesis addresses the last question. The investigation starts by proposing a three-stage approach: (i) data collection, (ii) annotation, and (iii) realization. The first stage presents a method to collect natural listener vocalizations from German and British English professional actors in a recording studio. In the second stage, we explore a methodology for annotating listener vocalizations: meaning and behavior (form) annotation. The third stage proposes a reali...
INPRO_iSS: a component for just-in-time incremental speech synthesis
Proceedings of the ACL 2012 System Demonstrations, 2012
We present a component for incremental speech synthesis (iSS) and a set of applications that demonstrate its capabilities. This component can be used to increase the responsivity and naturalness of spoken interactive systems. While iSS can show its full strength in systems that generate output incrementally, we also discuss how even otherwise unchanged systems may profit from its capabilities.
2019
Speech synthesis applications have become ubiquitous in navigation systems, digital assistants, and screen or audiobook readers. Despite their impact on the acceptability of the systems in which they are embedded, and despite the fact that different applications probably need different types of TTS voices, TTS evaluation is still largely treated as an isolated problem. Even though there is strong agreement among researchers that the mainstream approaches to Text-to-Speech (TTS) evaluation are often insufficient and may even be misleading, there exist few clear-cut suggestions as to (1) how TTS evaluations may be realistically improved on a large scale, and (2) how such improvements may lead to informed feedback for system developers and, ultimately, better systems relying on TTS. This paper reviews the current state of the art in TTS evaluation and suggests a novel user-centered research program for this area.
Evaluating prosodic processing for incremental speech synthesis
2012
Incremental speech synthesis (iSS) accepts input and produces output in consecutive chunks that only together result in a full utterance. Systems that use iSS thus have the ability to adapt their utterances while they are ongoing. However, starting to process with less than the full utterance available prohibits global optimization, leading to potentially suboptimal solutions. In this paper, we present a method for incrementalizing the symbolic pre-processing component of speech synthesis and assess the influence of varying "lookahead", i.e., knowledge about the rest of the utterance, on prosodic quality. We found that high-quality incremental output can be achieved even with a lookahead of less than one phrase, allowing for timely system reaction.
Perception of synthetic speech generated by rule
Proceedings of the IEEE, 1985
As the use of voice response systems employing synthetic speech becomes more widespread in consumer products, industrial and military applications, and aids for the handicapped, it will be necessary to develop reliable methods of comparing different synthesis systems and of assessing how human observers perceive and respond to the speech generated by these systems. The selection of a specific voice response system for a particular application depends on a wide variety of factors, only one of which is the inherent intelligibility of the speech generated by the synthesis routines. In this paper, we describe the results of several studies that applied measures of phoneme intelligibility, word recognition, and comprehension to assess the perception of synthetic speech. Several techniques were used to compare performance of different synthesis systems with natural speech and to learn more about how humans perceive synthetic speech generated by rule. Our findings suggest that the perception of synthetic speech depends on an interaction of several factors including the acoustic-phonetic properties of the speech signal, the requirements of the perceptual task, and the previous experience of the listener. Differences in perception between natural speech and high-quality synthetic speech appear to be related to the redundancy of the acoustic-phonetic information encoded in the speech signal.