Integrating a Voice Analysis-Synthesis System with a TTS Framework for Controlling Affect and Speaker Identity

Affect Expression: Global and Local Control of Voice Source Parameters

Proc. Speech Prosody, 2022

This paper explores how the acoustic characteristics of the voice convey affect. It considers the proposition that the cueing of affect relies on variations in voice source parameters (including f0) involving both global, uniform shifts across an utterance and local, within-utterance changes at prosodically relevant points. To test this, a perception test was conducted with stimuli in which the voice source parameters of a synthesised baseline utterance were modified to target angry and sad renditions. The baseline utterance was generated with the ABAIR Irish TTS system, for one male and one female voice. The parameter manipulations drew on earlier production and perception experiments and involved three stimulus series: those with global adjustments, those with local adjustments, and those combining both. 65 listeners judged each stimulus as one of the following: angry, interested, no emotion, relaxed or sad, and indicated how strongly the affect was perceived. Results broadly support the initial proposition, in that the most effective signalling of both angry and sad affect tended to involve the stimuli which combined global and local adjustments. However, stimuli targeting angry were often judged as interested, indicating that negative valence is not consistently cued by the manipulations in these stimuli.
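As a concrete illustration of the global/local distinction, consider the sketch below. It is a minimal, hypothetical example and not the paper's actual procedure: the study manipulated several voice source parameters, whereas this sketch handles only an f0 contour, and the spans, shift sizes and function names are assumptions. A global adjustment applies one uniform change to the whole contour; a local adjustment alters it only inside prosodically relevant spans.

```python
import numpy as np

def apply_global_shift(f0, shift_st):
    """Uniform pitch shift of the whole contour by shift_st semitones."""
    return f0 * 2.0 ** (shift_st / 12.0)

def apply_local_shift(f0, times, spans, shift_st):
    """Shift f0 by shift_st semitones only inside the given (start, end) spans."""
    out = f0.copy()
    for start, end in spans:
        mask = (times >= start) & (times < end)
        out[mask] *= 2.0 ** (shift_st / 12.0)
    return out

# Toy contour: a 2 s utterance sampled every 10 ms, declining from 120 Hz.
times = np.arange(0.0, 2.0, 0.01)
f0 = 120.0 - 15.0 * times

angry_global = apply_global_shift(f0, +4.0)                       # whole utterance raised
angry_local = apply_local_shift(f0, times, [(0.4, 0.7)], +6.0)    # one accent boosted
angry_combined = apply_local_shift(angry_global, times, [(0.4, 0.7)], +6.0)
```

The three variables at the end mirror the paper's three stimulus series: global-only, local-only, and combined adjustments.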

Voice quality and f0 cues for affect expression: implications for synthesis

… European Conference on …, 2005

Synthesised stimuli were used to investigate how two notionally separable dimensions of tone-of-voice, voice quality and fundamental frequency, are involved in the expression of affect. Listeners were presented with three series of stimuli: (1) stimuli exemplifying different voice qualities, (2) stimuli all with modal voice quality but with different affect-related f0 contours, and (3) stimuli incorporating variation in both voice quality and affect-related f0 contours. A total of 15 stimuli were rated for 12 different affective attributes. Voice quality differentiation appears to account for the highest affect ratings overall, as indicated by the scores obtained for stimulus series (1) and (3). The relatively weaker affect signalling of stimuli differentiated by f0 alone corroborates earlier findings. It also suggests that for the generation of expressive, affectively coloured speech synthesis it is not sufficient to manipulate only f0; the voice quality dimension of the voice source must also be captured.
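The three-series design can be pictured as a small factorial sketch. This is illustrative only: the quality and contour labels below are hypothetical and do not reproduce the paper's 15 stimuli.

```python
from itertools import product

voice_qualities = ["modal", "breathy", "tense", "lax-creaky"]  # hypothetical labels
f0_contours = ["neutral", "high-wide", "low-narrow"]           # hypothetical labels

# Series (1): voice quality varies, f0 contour held neutral.
series1 = [(vq, "neutral") for vq in voice_qualities if vq != "modal"]
# Series (2): modal voice quality, affect-related f0 contours vary.
series2 = [("modal", c) for c in f0_contours if c != "neutral"]
# Series (3): both dimensions vary together.
series3 = [(vq, c) for vq, c in product(voice_qualities, f0_contours)
           if vq != "modal" and c != "neutral"]

stimuli = series1 + series2 + series3
```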

Emotional Speech Datasets for English Speech Synthesis Purpose: A Review

In this paper, we review the publicly available emotional speech datasets and their usability for state-of-the-art speech synthesis. This usability is conditioned by several characteristics of these datasets: the quality of the recordings, the quantity of the data, and the emotional content captured in the data. We then present a dataset that was recorded to meet the needs observed in this area. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes, so it could be suitable for building synthesis and voice transformation systems with the potential to control the emotional dimension.

The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

arXiv, 2018

In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes, so it could be suitable for building synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous way. We demonstrate the usefulness of the data by building a simple MLP system that converts neutral speech to an angry style and evaluating it via a CMOS perception test. Even though the system is very simple, the test shows the effectiveness of the data, which is promising for future work.
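A minimal sketch of such a neutral-to-angry mapping is given below, assuming that parallel neutral/angry utterances have already been analysed into time-aligned per-frame acoustic features; the feature dimensionality, architecture and training details are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

FEAT_DIM = 60  # hypothetical per-frame feature dimensionality (e.g. mel-cepstra + f0)

# Simple MLP mapping a neutral-style frame to the corresponding angry-style frame.
mlp = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, FEAT_DIM),
)

optimiser = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(neutral_frames, angry_frames):
    """One gradient step on a batch of aligned (neutral, angry) frame pairs."""
    optimiser.zero_grad()
    pred = mlp(neutral_frames)
    loss = loss_fn(pred, angry_frames)
    loss.backward()
    optimiser.step()
    return loss.item()
```

At conversion time, the trained MLP would be run frame by frame on neutral features before resynthesis; the feature extraction, alignment (e.g. DTW) and vocoding stages are not shown.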

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech - a Deep Learning Approach

In this project, we aim to build a Text-to-Speech system able to produce speech with controllable emotional expressiveness. We propose a methodology for solving this problem in three main steps. The first is the collection of emotional speech data. We discuss the various formats of existing datasets and their usability in speech generation. The second step is the development of a system to automatically annotate data with emotion/expressiveness features. We compare several transfer-learning techniques for extracting such a representation through other tasks and propose a method to visualize and interpret the correlation between vocal and emotional features. The third step is the development of a deep learning-based system taking text and emotion/expressiveness as input and producing speech as output. We study the impact of fine-tuning from a neutral TTS towards an emotional TTS in terms of intelligibility and perception of the emotion.
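For the second step, one common transfer-learning route, shown here only as a hedged sketch with a hypothetical pretrained model, checkpoint path and attribute names, is to reuse a network trained on another task (e.g. speech emotion recognition) and take an intermediate layer as the expressiveness representation.

```python
import torch

# Hypothetical pretrained speech-emotion-recognition network; the checkpoint
# name and the `backbone` attribute are placeholders, not a real API.
ser_model = torch.load("ser_pretrained.pt")
ser_model.eval()

@torch.no_grad()
def expressiveness_embedding(frame_features):
    """Use an intermediate layer of the SER network as an emotion/expressiveness
    descriptor for automatically annotating TTS training data.

    frame_features: tensor of shape (time, feat_dim)."""
    hidden = ser_model.backbone(frame_features)  # shared representation, (time, hidden_dim)
    return hidden.mean(dim=0)                    # utterance-level embedding
```

Such embeddings could then condition the third-step TTS model alongside the text input.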

Improving TTS synthesis for emotional expressivity by a prosodic parameterization of affect based on linguistic analysis

Affective speech synthesis is quite important for various applications such as storytelling, speech-based user interfaces, and computer games. However, some studies have revealed that Text-To-Speech (TTS) systems tend not to convey a suitable emotional expressivity in their outputs. Owing to the recent convergence of several analytical studies pertaining to affect and human speech, this problem can now be tackled from a new angle, with at its core an appropriate prosodic parameterization based on intelligent detection of the affective clues in the input text. Allied with recent findings on affective speech analysis, this allows a suitable assignment of pitch accents, other prosodic parameters and F0-related signal properties that match the optimal parameterization for the emotion detected in the input text. Such an approach allows the input text to be enriched with metainformation that efficiently assists the TTS system. Furthermore, the output of the TTS system is postprocessed in order to enhance its affective content. Several preliminary tests confirm the validity of our approach and encourage us to continue its exploration.
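By way of illustration, this kind of text-driven enrichment can be realised as markup around the input text. The sketch below uses standard SSML <prosody> attributes, but the affect labels and parameter values are hypothetical and do not reproduce the paper's parameterization.

```python
# Illustrative mapping from a detected affect label to global prosodic settings.
AFFECT_PROSODY = {
    "anger":   {"pitch": "+15%", "rate": "110%", "volume": "+6dB"},
    "sadness": {"pitch": "-10%", "rate": "85%",  "volume": "-3dB"},
    "neutral": {"pitch": "+0%",  "rate": "100%", "volume": "+0dB"},
}

def enrich_with_prosody(text, affect):
    """Wrap the input text in SSML <prosody> metainformation for the TTS system."""
    p = AFFECT_PROSODY.get(affect, AFFECT_PROSODY["neutral"])
    return (f'<prosody pitch="{p["pitch"]}" rate="{p["rate"]}" '
            f'volume="{p["volume"]}">{text}</prosody>')

print(enrich_with_prosody("I told you not to do that!", "anger"))
```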

Expressive speech synthesis: Evaluation of a voice quality centered coder on the different acoustic dimensions

Proc. Speech Prosody, 2006

Expressive speech is intrinsically multi-dimensional, and each acoustic dimension carries a specific weight depending on the nature of the expressed affect. The quantity of expressive information carried by each dimension separately (using Praat algorithms), as well as the processing implied to carry it (global value vs. contour), was perceptually measured for a set of natural mono-syllabic utterances. It was shown that no parameter alone is able to carry the whole emotional information: F0 contours or global values brought more information on positive expressions, voice quality and duration conveyed more information on negative expressions, and the intensity contours did not bring any significant information when used alone. These selected stimuli, expressing anxiety, disappointment, disgust, disquiet, joy, resignation and sadness, were resynthesized with an LF-ARX algorithm and evaluated in the same perceptive protocol, extended to the three voice quality parameters (source, filter and residue). The comparison of results between natural, TD-PSOLA-resynthesized and LF-ARX-resynthesized stimuli (1) globally confirms the relative weights of each dimension, (2) diagnoses minor local resynthesis artifacts, (3) validates the efficiency of the LF-ARX algorithm, and (4) measures the relative importance of each of the three LF-ARX parameters.
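To make the ARX part of the coder concrete: in an ARX (autoregressive with exogenous input) model, each output sample is a linear combination of past output samples plus a scaled glottal-source input. The sketch below is a rough illustration only; the pulse shape is a crude stand-in for the LF (Liljencrants-Fant) glottal model, and the filter coefficients are assumptions.

```python
import numpy as np

def arx_synthesise(source, a, b0):
    """All-pole ARX filter: s[n] = b0*u[n] - sum_k a[k]*s[n-1-k]."""
    p = len(a)
    s = np.zeros(len(source))
    for n in range(len(source)):
        ar = sum(a[k] * s[n - 1 - k] for k in range(min(p, n)))
        s[n] = b0 * source[n] - ar
    return s

# Crude stand-in for an LF glottal-flow-derivative pulse train (the real LF
# model has exponential/sinusoidal open and return phases, omitted here).
fs, f0 = 16000, 120
t = np.arange(fs) / fs
source = -np.abs(np.sin(np.pi * f0 * t)) ** 3   # hypothetical pulse shape

# Stable second-order resonance (poles at radius ~0.985) as a toy vocal tract.
voiced = arx_synthesise(source, a=[-1.8, 0.97], b0=1.0)
```

In the actual coder, the source, filter and residue are the three estimated components whose perceptual importance the paper measures.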

Speech synthesis and emotions: a compromise between flexibility and believability

2008

The synthesis of emotional speech is still an open question. The principal issue is how to introduce expressivity without compromising the naturalness of the synthetic speech provided by state-of-the-art technology. In this paper two concatenative synthesis systems are described and some approaches to address this topic are proposed. For example, exploiting the intrinsic expressivity of certain speech acts, i.e. the correlation between affective states and communicative functions, has proven an effective solution. This implies a different approach in the design of the speech databases, as well as in the labelling and selection of the "expressive" units. In fact, beyond phonetic and prosodic criteria, linguistic and pragmatic aspects should also be considered. The management of units of different types (neutral vs. expressive) is also an important issue.
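One way to picture the selection of neutral vs. expressive units is to add a style/communicative-function mismatch term to the usual unit-selection target cost. This is a hedged sketch: the cost terms, field names and weights below are hypothetical and are not the systems described in the paper.

```python
def target_cost(unit, target, w_style=1.0, w_prosody=0.5):
    """Toy unit-selection target cost with an expressive-style mismatch term."""
    # Phonetic mismatch dominates: a wrong phone makes the unit nearly unusable.
    phonetic = 0.0 if unit["phone"] == target["phone"] else 10.0
    # Relative prosodic (f0) deviation from the target specification.
    prosodic = abs(unit["f0"] - target["f0"]) / target["f0"]
    # Penalise units whose labelled style (neutral vs. expressive) does not
    # match the communicative function requested for this position.
    style = 0.0 if unit["style"] == target["style"] else 1.0
    return phonetic + w_prosody * prosodic + w_style * style

unit = {"phone": "a", "f0": 140.0, "style": "expressive"}
target = {"phone": "a", "f0": 130.0, "style": "neutral"}
print(target_cost(unit, target))
```

A full system would combine this with join costs and a Viterbi search over the unit lattice; the point here is only that style labels become a first-class selection criterion alongside phonetic and prosodic ones.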