Editorial Special Section on Expressive Speech Synthesis

Speech synthesis and emotions: a compromise between flexibility and believability

2008

The synthesis of emotional speech is still an open question. The principal issue is how to introduce expressivity without compromising the naturalness of the synthetic speech provided by state-of-the-art technology. In this paper two concatenative synthesis systems are described and several approaches to this problem are proposed. For example, exploiting the intrinsic expressivity of certain speech acts, i.e. the correlation between affective states and communicative functions, has proven to be an effective solution. This implies a different approach to the design of the speech databases as well as to the labelling and selection of the "expressive" units: beyond phonetic and prosodic criteria, linguistic and pragmatic aspects should also be considered. The management of units of different types (neutral vs. expressive) is also an important issue.
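As a rough illustration of how neutral and expressive units might be managed during selection, the sketch below extends a toy target cost with an expressivity-match term alongside phonetic and prosodic mismatches. The Unit/Target classes, feature set and weights are hypothetical assumptions for illustration, not the systems described in the paper.

```python
# Illustrative sketch (not the authors' implementation): a unit-selection
# target cost extended with an expressivity term, so that "expressive"
# units are preferred only when the requested communicative function matches.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str          # phonetic identity of the candidate unit
    f0_mean: float      # mean F0 of the unit (Hz)
    duration: float     # duration in seconds
    style: str          # "neutral" or an expressive label, e.g. "good_news"

@dataclass
class Target:
    phone: str
    f0_mean: float
    duration: float
    style: str          # style requested by the linguistic/pragmatic analysis

def target_cost(unit: Unit, target: Target,
                w_phone=10.0, w_f0=1.0, w_dur=1.0, w_style=5.0) -> float:
    """Weighted sum of phonetic, prosodic and expressive mismatches.
    Weights are placeholders, not values from the paper."""
    cost = 0.0
    cost += w_phone * (unit.phone != target.phone)
    cost += w_f0 * abs(unit.f0_mean - target.f0_mean) / max(target.f0_mean, 1.0)
    cost += w_dur * abs(unit.duration - target.duration) / max(target.duration, 1e-3)
    cost += w_style * (unit.style != target.style)   # penalize neutral/expressive mismatch
    return cost

# Example: a neutral unit competing with an expressive one for a "good_news" target.
t = Target("a", 220.0, 0.09, "good_news")
neutral = Unit("a", 200.0, 0.08, "neutral")
expressive = Unit("a", 235.0, 0.10, "good_news")
best = min([neutral, expressive], key=lambda u: target_cost(u, t))
print(best.style)  # -> "good_news"
```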

Synthesis of Speech with Emotions

Proc. International Conference on Communication, Computers and Devices

This paper describes our proposed methodology for synthesizing speech with emotion. The work starts with a pitch-synchronous analysis of single-phoneme utterances carrying natural emotion to obtain the linear prediction (LP) parameters. To synthesize speech with emotion, we modify the pitch contour of a normal utterance of a single phoneme and subsequently filter this signal using the LP parameters. The proposed technique can be used to improve the naturalness of voice in a text-to-speech system.
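To make the pipeline concrete, here is a hedged sketch of the general idea only: LP analysis of an emotional phoneme, an impulse-train excitation driven by a modified pitch contour, and all-pole resynthesis. The file name, LP order and scaling values are illustrative assumptions, not the authors' settings.

```python
# Sketch of the general LP resynthesis idea (an illustration, not the
# authors' exact pipeline): estimate LP parameters from an emotional
# phoneme, build an impulse-train excitation with a modified pitch
# contour, and filter it through the LP synthesis filter.
import numpy as np
import librosa
import scipy.signal as sig

sr = 16000
y, _ = librosa.load("phoneme_a_emotional.wav", sr=sr)   # hypothetical file name
a_lpc = librosa.lpc(y, order=16)                         # LP coefficients, a[0] = 1

def impulse_train(f0_contour, sr, frame_len=160):
    """Excitation whose local period follows a frame-wise F0 contour (Hz)."""
    exc = np.zeros(len(f0_contour) * frame_len)
    pos = 0.0
    while pos < len(exc):
        frame = min(int(pos) // frame_len, len(f0_contour) - 1)
        exc[int(pos)] = 1.0
        pos += sr / max(f0_contour[frame], 50.0)         # guard against F0 = 0
    return exc

# Raise and tilt the pitch contour to mimic a more aroused rendition.
n_frames = 50
base_f0 = np.full(n_frames, 120.0)
modified_f0 = base_f0 * 1.3 + np.linspace(0.0, 30.0, n_frames)

excitation = impulse_train(modified_f0, sr)
synthetic = sig.lfilter([1.0], a_lpc, excitation)        # all-pole LP synthesis filter
```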

Improving TTS synthesis for emotional expressivity by a prosodic parameterization of affect based on linguistic analysis

Affective speech synthesis is important for various applications such as storytelling, speech-based user interfaces and computer games. However, several studies have revealed that Text-To-Speech (TTS) systems tend not to convey suitable emotional expressivity in their outputs. Owing to the recent convergence of several analytical studies on affect and human speech, this problem can now be tackled from a new angle that has at its core an appropriate prosodic parameterization based on an intelligent detection of the affective clues in the input text. This, allied with recent findings on affective speech analysis, allows a suitable assignment of pitch accents, other prosodic parameters and F0-related signal properties that match the optimal parameterization for the emotion detected in the input text. Such an approach allows the input text to be enriched with meta-information that efficiently assists the TTS system. Furthermore, the output of the TTS system is post-processed to enhance its affective content. Several preliminary tests confirm the validity of our approach and encourage us to continue its exploration.
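The enrichment step can be pictured with a minimal sketch: a hypothetical emotion-to-prosody table emitted as standard SSML <prosody> mark-up. The detect_emotion stub and the numeric settings are placeholders, not the parameterization derived in the paper.

```python
# Minimal sketch of text enrichment with prosodic meta-information.
# The mapping values and detect_emotion() are illustrative assumptions.
EMOTION_PROSODY = {
    "joy":     {"pitch": "+15%", "rate": "110%", "volume": "+2dB"},
    "sadness": {"pitch": "-10%", "rate": "85%",  "volume": "-3dB"},
    "anger":   {"pitch": "+5%",  "rate": "115%", "volume": "+4dB"},
    "neutral": {"pitch": "+0%",  "rate": "100%", "volume": "+0dB"},
}

def detect_emotion(text: str) -> str:
    """Placeholder for the linguistic affect analysis; returns one label."""
    return "joy" if "!" in text else "neutral"

def enrich_with_prosody(text: str) -> str:
    p = EMOTION_PROSODY[detect_emotion(text)]
    return (f'<prosody pitch="{p["pitch"]}" rate="{p["rate"]}" '
            f'volume="{p["volume"]}">{text}</prosody>')

print(enrich_with_prosody("What wonderful news!"))
```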

A corpus-based approach to expressive speech synthesis

Fifth ISCA Workshop …, 2004

Human speech communication can be thought of as comprising two channels: the words themselves, and the style in which they are spoken. Each of these channels carries information. Today's most advanced text-to-speech (TTS) systems, such as [1], [2], [3], [4], fall far short of human speech because they offer only a single, fixed style of delivery, independent of the message. In this paper, we describe the IBM Expressive TTS Engine, which adds another channel by offering five speaking styles: neutral declarative, conveying good news, conveying bad news, asking a question, and showing contrastive emphasis. In addition to generating speech in these five styles, our TTS system is also able to generate paralinguistic events such as sighs, breaths and filled pauses, which further enrich the style channel. We describe our methods for generating and evaluating expressive synthetic speech and paralinguistic effects, and show significant perceptual differences between expressive and neutral synthetic speech for each of our speaking styles. In addition, we describe how users are empowered to easily communicate the desired expression to the TTS engine through our extensions [5] of the Speech Synthesis Markup Language (SSML) [6].
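Since the exact extension syntax of [5] is not reproduced in the abstract, the following is only a hypothetical illustration of how a user might request one of the five styles around otherwise standard SSML.

```python
# Hypothetical illustration only: the style attribute below stands in for
# the paper's actual SSML extension, whose syntax is not given here.
request = """
<speak version="1.0">
  <!-- hypothetical style attribute, not the extension defined in [5] -->
  <p style="conveying-good-news">Your paper has been accepted.</p>
  <break time="300ms"/>
  <p style="neutral-declarative">The reviews are attached.</p>
</speak>
"""
print(request.strip())
```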

Emotional speech synthesis: Applications, history and possible future

Proc. ESSV, 2009

Emotional speech synthesis is an important piece of the puzzle on the long road to human-like artificial human-machine interaction. Along the way, milestones such as emotional audio messages or believable characters in gaming will be reached. This paper discusses technical aspects of emotional speech synthesis, shows practical applications based on a higher-level framework, and highlights new developments concerning the realization of affective speech with non-uniform unit-selection synthesis and voice transformation techniques.

Expressive Speech Synthesis for Critical Situations

Computing and Informatics

The presence of appropriate acoustic cues to affective features in synthesized speech can be a prerequisite for the proper evaluation of the semantic content by the message recipient. In recent work the authors have focused on expressive speech synthesis capable of generating naturally sounding synthetic speech at various levels of arousal. Automatic information and warning systems can be used to inform, warn, instruct and navigate people in dangerous, critical situations, and to increase the effectiveness of crisis management and rescue operations. One of the activities within the EU SF project CRISIS was called “Extremely expressive (hyper-expressive) speech synthesis for urgent warning messages generation”. It was aimed at the research and development of speech synthesizers with high naturalness and intelligibility, capable of generating messages with various expressive loads. The synthesizers will be applicable to generating public alert and warning messages...

Can we Generate Emotional Pronunciations for Expressive Speech Synthesis?

IEEE Transactions on Affective Computing

In the field of expressive speech synthesis, much work has been conducted on suprasegmental prosodic features while little has been done on pronunciation variants. However, prosody is highly related to the sequence of phonemes to be expressed. This article raises two issues in the generation of emotional pronunciations for TTS systems. The first issue is the design of an automatic pronunciation generation method from text, while the second addresses the very existence of emotional pronunciations through experiments conducted on emotional speech. To this end, an innovative pronunciation adaptation method is presented which automatically adapts canonical phonemes, first to those labeled in the corpus used to create a synthetic voice, then to those labeled in an expressive corpus. The method consists of training conditional random field pronunciation models with prosodic, linguistic, phonological and articulatory features. The analysis of emotional pronunciations reveals strong dependencies between prosody and phoneme assimilations or elisions. According to perceptual tests, the double adaptation makes it possible to synthesize expressive speech samples of good quality, but emotion-specific pronunciations are too subtle to be perceived by the testers.
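The pronunciation-model idea can be sketched with a generic sequence-labelling CRF. The example below uses the third-party sklearn-crfsuite package with toy features and data, which stand in for, but are not, the paper's feature set or corpora.

```python
# Illustration of the general technique (a CRF pronunciation model);
# the features and the single toy utterance are placeholders.
import sklearn_crfsuite

def phoneme_features(seq, i):
    """Features for the i-th canonical phoneme of an utterance."""
    return {
        "phone": seq[i]["phone"],
        "prev": seq[i - 1]["phone"] if i > 0 else "<s>",
        "next": seq[i + 1]["phone"] if i < len(seq) - 1 else "</s>",
        "stressed": seq[i]["stressed"],          # phonological feature
        "f0_slope": seq[i]["f0_slope"],          # prosodic feature (quantized)
        "emotion": seq[i]["emotion"],            # utterance-level label
    }

# One toy utterance: canonical phonemes with context, and realized phonemes
# as labels ("<del>" marks an elision).
utt = [
    {"phone": "d", "stressed": False, "f0_slope": "flat",   "emotion": "anger"},
    {"phone": "@", "stressed": False, "f0_slope": "flat",   "emotion": "anger"},
    {"phone": "n", "stressed": True,  "f0_slope": "rising", "emotion": "anger"},
]
X_train = [[phoneme_features(utt, i) for i in range(len(utt))]]
y_train = [["d", "<del>", "n"]]                  # realized pronunciation

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])
```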

Integrating a Voice Analysis-Synthesis System with a TTS Framework for Controlling Affect and Speaker Identity

2021

This paper reports an experiment exploring how a voice analysis-synthesis system, GlórCáil, can be used to add expressiveness to the synthetic voice in TTS systems. The implementation focuses on the Irish ABAIR text-to-speech (TTS) voices, where such voice control would facilitate many current and envisaged applications. GlórCáil allows voice control of synthesized speech and was integrated into a DNN-based TTS framework for this experiment. Utterances were generated with f0, voice quality and vocal tract parameter manipulations targeting shifts in speaker identity and in the affective coloring of utterances. The scaling factors used for the manipulations were suggested in an earlier study. They involved global changes without sentence-internal dynamic variation, with a view to ascertaining whether such global shifts might alter listeners' perception of speaker identity and affect. Results demonstrate affect shifts compatible with expectations; however, there were confounding factors. The female and child voices were poorly differentiated, which was expected given the similarity of the scaling factors used. The affect transformations suggest that the baseline voice had an intrinsically sad quality, so there is weak differentiation between the sad and no-emotion stimuli. The male angry voice was the least successful, suggesting that dynamic, within-utterance variation is essential for signaling certain affects.
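GlórCáil itself is not documented in this abstract, so the sketch below uses the WORLD vocoder (pyworld) as a stand-in to show what a purely global manipulation of f0, plus a crude voice-quality proxy, looks like. The file names and scaling factors are assumptions, not the values from the study.

```python
# Stand-in illustration using the WORLD vocoder rather than GlórCáil:
# apply a single global scale to f0 and aperiodicity, then resynthesize.
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("abair_utterance.wav")           # hypothetical ABAIR utterance
x = x.astype(np.float64)

f0, t = pw.dio(x, fs)                            # raw f0 estimate
f0 = pw.stonemask(x, f0, t, fs)                  # refined f0
sp = pw.cheaptrick(x, f0, t, fs)                 # spectral envelope
ap = pw.d4c(x, f0, t, fs)                        # aperiodicity

f0_scaled = f0 * 1.25                            # global pitch raise, no within-utterance dynamics
ap_scaled = np.clip(ap * 1.1, 0.0, 1.0)          # slightly breathier voice quality (crude proxy)

y = pw.synthesize(f0_scaled, sp, ap_scaled, fs)  # resynthesize the shifted utterance
sf.write("abair_utterance_shifted.wav", y, fs)
```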

EmoVoice: a System to Generate Emotions in Speech

Generating emotions in speech is currently an important research topic, given the requirement for modern human-machine interaction systems to produce expressive speech. We present the EmoVoice system, which implements acoustic rules to simulate seven basic emotions in neutral speech. It uses pitch-synchronous time-scaling (PSTS) of the excitation signal to change the prosody and the most relevant glottal source parameters related to voice quality. The system also transforms other parameters of the vocal source signal to produce different types of irregular voicing. The correlation of the speech parameters with the basic emotions was derived from measurements of glottal parameters and from results reported by other authors. The evaluation of the system showed that it can generate recognizable emotions, but improvements are still necessary to discriminate some pairs of emotions.
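The rule-based mechanism can be illustrated with a small table of per-emotion scale factors applied to an F0 contour. The numbers below are placeholders for illustration, not the rules derived from the EmoVoice measurements.

```python
# Toy acoustic-rule table: each emotion maps to multiplicative changes of
# prosodic and voice-quality parameters (values are illustrative only).
import numpy as np

RULES = {
    "anger":   {"f0_mean": 1.20, "f0_range": 1.40, "rate": 1.15, "tenseness": 1.3},
    "joy":     {"f0_mean": 1.15, "f0_range": 1.30, "rate": 1.10, "tenseness": 1.1},
    "sadness": {"f0_mean": 0.90, "f0_range": 0.70, "rate": 0.85, "tenseness": 0.8},
    "fear":    {"f0_mean": 1.25, "f0_range": 1.10, "rate": 1.20, "tenseness": 1.2},
    "neutral": {"f0_mean": 1.00, "f0_range": 1.00, "rate": 1.00, "tenseness": 1.0},
}

def apply_rules(f0_contour: np.ndarray, emotion: str):
    """Rescale an F0 contour's mean and range according to the emotion rules."""
    r = RULES[emotion]
    mean = np.mean(f0_contour[f0_contour > 0])               # ignore unvoiced frames
    shifted = (f0_contour - mean) * r["f0_range"] + mean * r["f0_mean"]
    shifted[f0_contour <= 0] = 0.0                           # keep unvoiced frames unvoiced
    return shifted, r["rate"], r["tenseness"]

f0 = np.array([0.0, 110.0, 118.0, 125.0, 120.0, 0.0])
angry_f0, rate_factor, tenseness = apply_rules(f0, "anger")
print(angry_f0, rate_factor, tenseness)
```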

On the impact of excitation and spectral parameters for expressive statistical parametric speech synthesis

Computer Speech & Language, 2013

This paper presents a study on the importance of short-term speech parameterizations for expressive statistical parametric synthesis. Assuming a source-filter model of speech production, the analysis is conducted over spectral parameters, defined here as features which represent a minimum-phase synthesis filter, and excitation parameters, which are features used to construct a signal that is fed to the minimum-phase synthesis filter to generate speech. In the first part, different spectral and excitation parameters applicable to statistical parametric synthesis are tested to determine which ones are the most emotion-dependent. The analysis is performed with two methods proposed to measure the relative emotion dependency of each feature: one based on K-means clustering, and another based on Gaussian mixture modeling for emotion identification. Two commonly used representations of the short-term speech spectral envelope, the Mel cepstrum and the Mel line spectrum pairs, are utilized. As excitation parameters, the anti-causal cepstrum, the time-smoothed group delay and band-aperiodicity coefficients are considered. According to this analysis, the line spectrum pairs are the most emotion-dependent parameters, and among the excitation features the band-aperiodicity coefficients present the highest correlation with the speaker's emotion. The most emotion-dependent parameters were then selected to train an expressive statistical parametric synthesizer using a speaker and language factorization framework. Subjective test results indicate that the considered spectral parameters have a larger impact on the emotion of the synthesized speech than the excitation ones.
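The two dependency measures can be sketched as follows, using random toy data in place of real spectral and excitation features: an adjusted-Rand score for K-means clusters against the emotion labels, and identification accuracy from one GMM fitted per emotion.

```python
# Sketch of the two emotion-dependency measures (illustrative toy data only):
# (1) agreement between unsupervised K-means clusters and emotion labels,
# (2) identification accuracy of per-emotion Gaussian mixture models.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(200, 24)) for i in range(3)])
y = np.repeat(np.arange(3), 200)                     # true emotion labels (3 emotions)

# (1) clustering-based score: higher agreement = more emotion-dependent feature
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("ARI:", adjusted_rand_score(y, clusters))

# (2) GMM identification score: one GMM per emotion, classify by log-likelihood
gmms = [GaussianMixture(n_components=2, random_state=0).fit(X[y == k]) for k in range(3)]
loglik = np.stack([g.score_samples(X) for g in gmms], axis=1)
pred = loglik.argmax(axis=1)
print("GMM identification accuracy:", (pred == y).mean())
```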