GMM-based voice conversion applied to emotional speech synthesis

Emotional Speech Conversion Using Pitch-Synchronous Harmonic and Non-harmonic Modeling of Speech

Communications in Computer and Information Science, 2013

In this paper, an emotional speech conversion method using pitch-synchronous harmonic and non-harmonic (PS-HNH) modeling of speech is proposed. The proposed method converts neutral speech into expressive speech by controlling emotional parameters for each syllable of the neutral speech. To this end, the proposed method first carries out syllable labeling by Viterbi decoding using acoustic hidden Markov models of the neutral corpus. Next, PS-HNH analysis is performed on the neutral speech, and the emotional parameters are modified syllable by syllable using the linear modification model of the target emotion. Finally, the modified parameters are synthesized back into emotional speech by PS-HNH synthesis. The performance of the proposed method is evaluated by a subjective AB preference test for four types of target emotions (fear, sadness, anger, and happiness). The preference test shows that the proposed method gives better speech quality than the conventional method based on speech transformation and representation using adaptive interpolation of weighted spectrum (STRAIGHT).
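The syllable-wise linear modification step can be illustrated with a minimal sketch. It assumes that the emotional parameter is a frame-level F0 track, that syllable boundaries are already available (for example from the Viterbi labeling), and that each syllable has its own slope and offset. The function name, the boundary format, and the coefficient values are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical syllable segmentation: (start_frame, end_frame), end exclusive.
Syllable = Tuple[int, int]
# Hypothetical per-syllable linear model: (slope a, offset b in Hz).
LinearParams = Tuple[float, float]


def modify_f0_per_syllable(
    f0: List[float],
    syllables: List[Syllable],
    params: List[LinearParams],
) -> List[float]:
    """Apply f0' = a * f0 + b to the voiced frames of each syllable."""
    if len(syllables) != len(params):
        raise ValueError("one (a, b) pair is required per syllable")
    out = list(f0)
    for (start, end), (a, b) in zip(syllables, params):
        for t in range(start, end):
            if f0[t] > 0.0:  # frames with F0 == 0 are treated as unvoiced
                out[t] = a * f0[t] + b
    return out


# Toy input: five frames, two syllables, one unvoiced frame.
f0_track = [100.0, 110.0, 0.0, 120.0, 130.0]
syllable_bounds = [(0, 3), (3, 5)]
emotion_params = [(1.2, 0.0), (1.0, 20.0)]  # illustrative values only
converted = modify_f0_per_syllable(f0_track, syllable_bounds, emotion_params)
```

In this toy case the first syllable is scaled by 1.2 and the second is shifted up by 20 Hz, while the unvoiced frame is left untouched. A real system would apply comparable models to duration and energy as well, which this sketch does not cover.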

Voice GMM modelling for FESTIVAL/MBROLA emotive TTS synthesis

Interspeech 2006

Voice quality is recognized to play an important role in the rendering of emotions in verbal communication. In this paper we explore the effectiveness of a processing framework for voice transformations aimed at the analysis and synthesis of emotive speech. We use a GMM-based model to compute the differences between an MBROLA voice and an anger voice, and we address the modification of the MBROLA voice spectra by using a set of spectral conversion functions trained on the data. We propose to organize the speech data for training in such a way that the target emotive speech data and the diphone database used for the text-to-speech synthesis both come from the same speaker. A copy-synthesis procedure is used to produce synthesized speech utterances in which pitch patterns, phoneme durations, and principal speaker characteristics are the same as in the target emotive utterances. This results in a better isolation of the voice quality differences due to emotive arousal. Three different models to represent voice quality differences are applied and compared. The models are all based on a GMM representation of the acoustic space. The performance of these models is discussed, and the experimental results and assessment are presented.
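As an illustration of what a GMM-based spectral conversion function can look like, the sketch below implements the widely used joint-density regression form, F(x) = sum_i p(i|x) [mu_y_i + (cov_yx_i / var_x_i)(x - mu_x_i)], for a single scalar feature. This is an assumption made for illustration: the abstract does not say which of its three models takes this form, and real systems operate on multi-dimensional spectral vectors with parameters estimated by EM. All names and parameter values here are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """One mixture component of a joint (source x, target y) scalar GMM."""
    weight: float
    mean_x: float
    mean_y: float
    var_x: float
    cov_yx: float


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x: float, gmm: List[Component]) -> float:
    """Posterior-weighted sum of per-component linear regressions."""
    likelihoods = [c.weight * gaussian_pdf(x, c.mean_x, c.var_x) for c in gmm]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("x is too far from every component (underflow)")
    return sum(
        (lik / total) * (c.mean_y + (c.cov_yx / c.var_x) * (x - c.mean_x))
        for lik, c in zip(likelihoods, gmm)
    )


# Toy two-component model with symmetric source means (illustrative values).
toy_gmm = [
    Component(weight=0.5, mean_x=-1.0, mean_y=0.0, var_x=1.0, cov_yx=0.5),
    Component(weight=0.5, mean_x=1.0, mean_y=4.0, var_x=1.0, cov_yx=0.5),
]
midpoint_output = convert(0.0, toy_gmm)
```

At x = 0 both components have equal posterior probability, so the output is the average of the two component regressions. Far from all source means the likelihoods underflow, which the sketch reports as an error instead of handling it in the log domain.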

Emotion transplantation through adaptation in HMM-based speech synthesis

Computer Speech & Language, 2015

This paper proposes an emotion transplantation method capable of modifying a synthetic speech model through the use of CSMAPLR adaptation in order to incorporate emotional information learned from a different speaker model while maintaining the identity of the original speaker as much as possible. The proposed method relies on learning both emotional and speaker identity information by means of their adaptation function from an average voice model, and combining them into a single cascade transform capable of imbuing the desired emotion into the target speaker. This method is then applied to the task of transplanting four emotions (anger, happiness, sadness and surprise) into three male and three female speakers and evaluated in a number of perceptual tests. The results of the evaluations show that the perceived naturalness for emotional text significantly favors the proposed transplanted emotional speech synthesis over traditional neutral speech synthesis, evidenced by a large increase in the perceived emotional strength of the synthesized utterances at a slight cost in speech quality. A final evaluation with a robotic laboratory assistant application shows that using emotional speech can significantly increase the students' satisfaction with the dialog system, indicating that the proposed emotion transplantation system provides benefits in real applications.

On the limitations of voice conversion techniques in emotion identification tasks

2007

The growing interest in emotional speech synthesis calls for the exploration of effective emotion conversion techniques. This paper estimates the relevance of three speech components (spectral envelope, residual excitation and prosody) for synthesizing identifiable emotional speech, in order to be able to customize the voice conversion techniques to the specific characteristics of each emotion. The analysis is based on listening to a set of synthetic mixed-emotional utterances that draw their speech components from emotional and neutral recordings. Results prove the importance of transforming residual excitation for the identification of emotions that are not fully conveyed through prosodic means (such as cold anger or sadness in our Spanish corpus).

Syllabic Pitch Tuning for Neutral-to-Emotional Voice Conversion

2015

Prosody plays an important role in both identification and synthesis of emotionalized speech. Prosodic features like pitch are usually estimated and altered at a segmental level based on short windows of speech (where the signal is expected to be quasi-stationary). This results in a frame-wise change of acoustical parameters for synthesizing emotionalized speech. In order to convert neutral speech to emotional speech from the same user, it might be better to alter the pitch parameters at the suprasegmental level, such as the syllable level, since the changes in the signal are more subtle and smooth. In this paper we aim to show that the pitch transformation in a neutral-to-emotional voice conversion system may result in better output speech quality if the transformations are performed at the suprasegmental (syllable) level rather than at the frame level. Subjective evaluation results are shown to demonstrate if the naturalness, speaker similarity and the emotion recognition ...
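The contrast between frame-level and syllable-level pitch modification can be sketched as follows. The sketch assumes that some model has already produced one pitch scale factor per frame, and that the syllable-level variant replaces those factors with their average within each syllable, so that a single factor applies to the whole syllable. Both the averaging rule and the names are hypothetical. They illustrate the general idea only and are not the method of the paper.

```python
from typing import List, Tuple


def syllable_level_factors(
    frame_factors: List[float],
    syllables: List[Tuple[int, int]],
) -> List[float]:
    """Replace per-frame scale factors by their mean within each syllable.

    Syllables are (start_frame, end_frame) pairs with end exclusive.
    Frames outside every syllable keep their original factor.
    """
    out = list(frame_factors)
    for start, end in syllables:
        if end <= start:
            raise ValueError("empty syllable span")
        mean = sum(frame_factors[start:end]) / (end - start)
        for t in range(start, end):
            out[t] = mean
    return out


def max_jump(factors: List[float]) -> float:
    """Largest change in scale factor between consecutive frames."""
    return max(abs(b - a) for a, b in zip(factors, factors[1:]))


# Toy frame-wise factors that fluctuate inside each syllable.
frame_wise = [1.0, 1.4, 1.2, 0.8, 1.0]
syllable_wise = syllable_level_factors(frame_wise, [(0, 3), (3, 5)])
```

With these toy values the factor is constant inside each syllable and changes only at the syllable boundary, so the largest frame-to-frame jump shrinks. Whether this actually improves perceived quality is the question the paper addresses through listening tests.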

Emotional speech synthesis: Applications, history and possible future

Proc. ESSV, 2009

Emotional speech synthesis is an important part of the puzzle on the long way to human-like artificial human-machine interaction. Along the way, many intermediate milestones, such as emotional audio messages or believable characters in gaming, will be reached. This paper discusses technical aspects of emotional speech synthesis, shows practical applications based on a higher-level framework and highlights new developments concerning the realization of affective speech with non-uniform unit selection based synthesis and voice transformation techniques.

TRANSFORMATION CODING FOR EMOTION SPEECH TRANSLATION: A REVIEW

This paper presents a brief review of current developments in speech feature extraction and its use in speech transformation for emotional speech translation. The process of prosody modeling and its application to speech transformation is reviewed. Past approaches to transforming neutral speech into emotional speech in order to create a synthetic speech signal are also reviewed. The transformation process and its related stages of recognition, processing, and transformation are presented. A proposed system architecture for the transformation of a given neutral speech signal into its equivalent emotional speech is presented. The proposed architecture of the speech transformation model focuses on robustness in speech coding and transformation and on accuracy in coding.

High quality voice conversion based on Gaussian mixture model with dynamic frequency warping

2001

In the voice conversion algorithm based on the Gaussian Mixture Model (GMM), the quality of the converted speech is degraded because the converted spectrum is excessively smoothed. In this paper, we propose a GMM-based algorithm with Dynamic Frequency Warping (DFW) to avoid this over-smoothing. We also propose that the converted spectrum be calculated by mixing the GMM-based converted spectrum and the DFW-based converted spectrum, to avoid a deterioration of conversion accuracy on speaker individuality. Results of the evaluation experiments show that, in the proposed algorithm with a proper weight for mixing the spectra, the converted speech quality is better than that of the GMM-based algorithm, and the conversion accuracy on speaker individuality is the same as that of the GMM-based algorithm.
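The mixing step can be sketched as a per-bin weighted combination of the two converted spectra. The sketch assumes that both spectra are log-magnitude values on the same frequency grid and that one scalar weight in [0, 1] is applied to the DFW-based spectrum. The abstract does not specify the domain, the weight convention, or whether the weight varies with frequency, so all of these are assumptions, and the function name is hypothetical.

```python
from typing import List


def mix_spectra(
    gmm_spec: List[float],
    dfw_spec: List[float],
    dfw_weight: float,
) -> List[float]:
    """Per-bin convex combination of two converted spectra.

    dfw_weight = 0.0 returns the GMM-based spectrum unchanged,
    dfw_weight = 1.0 returns the DFW-based spectrum unchanged.
    """
    if len(gmm_spec) != len(dfw_spec):
        raise ValueError("spectra must share the same frequency grid")
    if not 0.0 <= dfw_weight <= 1.0:
        raise ValueError("dfw_weight must lie in [0, 1]")
    return [
        dfw_weight * d + (1.0 - dfw_weight) * g
        for g, d in zip(gmm_spec, dfw_spec)
    ]


# Toy log-magnitude spectra over four frequency bins (illustrative values).
gmm_converted = [10.0, 8.0, 6.0, 4.0]   # smooth, as GMM outputs tend to be
dfw_converted = [12.0, 6.0, 8.0, 2.0]   # keeps more spectral detail
mixed = mix_spectra(gmm_converted, dfw_converted, dfw_weight=0.5)
```

The weight trades the two properties named in the abstract against each other: a larger weight on the DFW-based spectrum preserves more spectral detail, and a larger weight on the GMM-based spectrum stays closer to the statistically mapped target.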

STATISTICAL SPECTRAL ENVELOPE TRANSFORMATION APPLIED TO EMOTIONAL SPEECH

pd.istc.cnr.it, 2010

Transformation of sound by statistical techniques is a promising method for a new range of digital audio effects. In this paper, a data-driven voice transformation algorithm is used to alter the timbre of a neutral (non-emotional) voice in order to reproduce a particular emotional vocal timbre.