Can we Generate Emotional Pronunciations for Expressive Speech Synthesis?
Related papers
Synthesis of Speech with Emotions
Proc. International Conference on Communication, Computers and Devices
This paper describes our proposed methodology for synthesizing speech with emotion. Our work starts with a pitch-synchronous analysis of single-phoneme utterances spoken with natural emotion to obtain the linear prediction (LP) parameters. To synthesize speech with emotion, we modify the pitch contour of a normal utterance of a single phoneme and then filter this signal using the LP parameters. The proposed technique can be used to improve the naturalness of voice in a text-to-speech system.
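A minimal sketch of the LP analysis–synthesis idea, assuming a voiced single-phoneme recording in a hypothetical file `phoneme.wav` and a constant target F0. The paper's method is pitch-synchronous and frame-wise; this collapses it to one global LP filter for illustration.

```python
# LP-based resynthesis with a modified pitch: analyze the vocal tract with LPC,
# then drive the synthesis filter with an impulse train at the new pitch period.
import numpy as np
import librosa
from scipy.signal import lfilter
import soundfile as sf

y, sr = librosa.load("phoneme.wav", sr=16000)   # hypothetical input file

order = 16                       # LP order, a common choice for 16 kHz speech
a = librosa.lpc(y, order=order)  # all-pole vocal-tract model coefficients

target_f0 = 180.0                # modified pitch in Hz (e.g., raised)
period = int(sr / target_f0)     # pitch period in samples

# Impulse-train excitation at the new pitch period.
excitation = np.zeros(len(y))
excitation[::period] = 1.0

# Filter the new excitation through the LP synthesis filter 1/A(z).
out = lfilter([1.0], a, excitation)
out /= np.max(np.abs(out)) + 1e-9  # normalize to avoid clipping

sf.write("phoneme_emotional.wav", out, sr)
```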
Affective speech synthesis is important for applications such as storytelling, speech-based user interfaces, and computer games. However, studies have shown that text-to-speech (TTS) systems tend not to convey suitable emotional expressivity in their output. Owing to the recent convergence of several analytical studies of affect in human speech, this problem can now be tackled from a new angle, with an appropriate prosodic parameterization at its core, driven by intelligent detection of the affective clues in the input text. Combined with recent findings on affective speech analysis, this allows a suitable assignment of pitch accents, other prosodic parameters, and F0-related signal properties that match the optimal parameterization for the emotion detected in the input text. This approach enriches the input text with meta-information that efficiently assists the TTS system. Furthermore, the output of the TTS system is post-processed to enhance its affective content. Several preliminary tests confirm the validity of our approach and encourage us to continue exploring it.
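A minimal sketch of the text-enrichment idea: a keyword-based affect detector maps the input text to an emotion, which selects prosodic parameters emitted as SSML-style meta-information for the TTS stage. The lexicon entries and parameter values are invented for illustration, not taken from the paper.

```python
# Emotion detection from text, then prosodic parameterization as markup.
PROSODY = {
    # emotion: (pitch shift in semitones, rate factor, volume factor)
    "joy":     (+3.0, 1.15, 1.1),
    "sadness": (-2.0, 0.85, 0.9),
    "anger":   (+1.5, 1.10, 1.3),
    "neutral": ( 0.0, 1.00, 1.0),
}

LEXICON = {"wonderful": "joy", "happy": "joy",
           "alone": "sadness", "cry": "sadness",
           "furious": "anger", "hate": "anger"}

def detect_emotion(text: str) -> str:
    """Majority vote over affect keywords found in the text."""
    votes = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return max(set(votes), key=votes.count) if votes else "neutral"

def enrich(text: str) -> str:
    """Wrap the sentence in SSML-style prosody meta-information."""
    pitch, rate, volume = PROSODY[detect_emotion(text)]
    return (f'<prosody pitch="{pitch:+.1f}st" rate="{rate:.0%}" '
            f'volume="{volume:.1f}">{text}</prosody>')

print(enrich("What a wonderful, happy day"))
```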
Emotion transplantation through adaptation in HMM-based speech synthesis
Computer Speech & Language, 2015
This paper proposes an emotion transplantation method capable of modifying a synthetic speech model through CSMAPLR adaptation in order to incorporate emotional information learned from a different speaker's model while preserving the identity of the original speaker as much as possible. The proposed method learns both the emotional and the speaker-identity information as adaptation functions from an average voice model, and combines them into a single cascade transform capable of imbuing the desired emotion into the target speaker. The method is applied to the task of transplanting four emotions (anger, happiness, sadness and surprise) into three male and three female speakers and evaluated in a number of perceptual tests. The results show that, for emotional text, perceived naturalness significantly favors the proposed transplanted emotional speech synthesis over traditional neutral speech synthesis, with a large increase in the perceived emotional strength of the synthesized utterances at a slight cost in speech quality. A final evaluation with a robotic laboratory-assistant application shows that using emotional speech significantly increases students' satisfaction with the dialog system, demonstrating that the proposed emotion transplantation system provides benefits in real applications.
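A minimal numpy sketch of the cascade-transform idea: an emotion transform and a speaker-identity transform, both expressed relative to an average-voice model, are composed into a single affine transform and applied to a Gaussian mean vector. The random matrices below are stand-ins for the transforms that CSMAPLR adaptation would actually estimate.

```python
import numpy as np

dim = 40                                   # spectral feature dimension
rng = np.random.default_rng(0)

def random_affine(dim):
    """Stand-in for an adaptation transform (A, b) near the identity."""
    A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))
    b = 0.1 * rng.standard_normal(dim)
    return A, b

A_spk, b_spk = random_affine(dim)          # average voice -> target speaker
A_emo, b_emo = random_affine(dim)          # average voice -> emotion

# Compose: apply the emotion transform first, then the speaker transform.
# A_spk (A_emo x + b_emo) + b_spk = (A_spk A_emo) x + (A_spk b_emo + b_spk)
A_cascade = A_spk @ A_emo
b_cascade = A_spk @ b_emo + b_spk

mu_avg = rng.standard_normal(dim)          # a mean from the average-voice model
mu_target = A_cascade @ mu_avg + b_cascade # emotional mean in the target voice
```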
Speech synthesis and emotions: a compromise between flexibility and believability
2008
The synthesis of emotional speech is still an open question. The principal issue is how to introduce expressivity without compromising the naturalness of the synthetic speech provided by state-of-the-art technology. In this paper two concatenative synthesis systems are described and some approaches to this topic are proposed. For example, exploiting the intrinsic expressivity of certain speech acts, via the correlation between affective states and communicative functions, has proven an effective solution. This implies a different approach to the design of the speech databases as well as to the labelling and selection of the "expressive" units: beyond phonetic and prosodic criteria, linguistic and pragmatic aspects should also be considered. The management of units of different types (neutral vs. expressive) is also an important issue.
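A minimal sketch of how a neutral/expressive distinction could enter unit-selection costs, alongside the usual prosodic target cost and join cost. The unit representation, cost terms, and weights are illustrative assumptions, not the systems described in the paper.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str
    f0: float        # mean F0 (Hz)
    dur: float       # duration (s)
    style: str       # "neutral" or "expressive"

def target_cost(spec: Unit, cand: Unit, w_style: float = 2.0) -> float:
    """Prosodic mismatch plus a penalty for a neutral/expressive mismatch."""
    cost = abs(spec.f0 - cand.f0) / 50.0 + abs(spec.dur - cand.dur) / 0.05
    if spec.style != cand.style:
        cost += w_style
    return cost

def join_cost(left: Unit, right: Unit) -> float:
    """Crude F0-continuity proxy at the concatenation point."""
    return abs(left.f0 - right.f0) / 50.0

spec = Unit("a", f0=220.0, dur=0.12, style="expressive")   # target specification
cand = Unit("a", f0=200.0, dur=0.10, style="neutral")      # database candidate
print(target_cost(spec, cand))                             # mismatch adds w_style
```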
Synthesis of Emotional Speech by Prosody Modification of Vowel Segments of Neutral Speech
SSRN Electronic Journal, 2019
Speech can be viewed as a combination of voiced and unvoiced regions. Voiced speech is produced by vibration of the vocal cords, and the vibration pattern differs across emotions. During the production of some consonant sounds the vocal cords do not vibrate, so consonants contribute less to the expression of emotion in the speech signal. In this paper, we therefore consider only vowel regions for emotion synthesis, using three prosody parameters: duration, intensity and pitch. Vowel-like regions (VLRs) are identified using vowel onset and offset points, which mark the start and end of each region. We observe that when emotional speech is synthesized from neutral speech, it is mainly the vowel regions of the utterance that are modified significantly. Our experimental results show that emotion synthesis using prosody modification of VLRs alone is significantly better than prosody modification at the syllable level, and it is also more efficient in terms of processing time. The average mean opinion scores, computed using vowel-level prosody modification only, are 3.85, 3.60 and 4.03 for angry, happy and fearful speech, respectively. These scores exceed those for syllable-level prosody modification, which are 3.56, 3.17 and 3.92 for the same emotions.
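A minimal sketch of vowel-region prosody modification, assuming VLR boundaries (onset/offset sample indices) have already been detected. Intensity is scaled by a gain and duration by resampling; pitch modification within the VLRs (e.g., via PSOLA) is omitted for brevity.

```python
import numpy as np
from scipy.signal import resample

def modify_vlrs(y, vlrs, gain=1.4, dur_factor=1.2):
    """Apply intensity and duration scaling to vowel-like regions only.

    y    : 1-D waveform
    vlrs : list of (onset, offset) sample indices, non-overlapping, sorted
    """
    out, prev = [], 0
    for onset, offset in vlrs:
        out.append(y[prev:onset])                    # consonants/silence: untouched
        seg = y[onset:offset] * gain                 # intensity modification
        new_len = int(round(len(seg) * dur_factor))  # duration modification
        out.append(resample(seg, new_len))
        prev = offset
    out.append(y[prev:])
    return np.concatenate(out)
```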
Emotion extractor: A methodology to implement prosody features in speech synthesis
Electronic Computer Technology …, 2010
This paper presents a methodology to extract emotion from text in real time and to add the corresponding expression to document contents during speech synthesis. To establish the presence of emotions, a self-assessment test was carried out on a set of documents, and preliminary rules were formulated for three affective dimensions: Pleasure, Arousal and Dominance (PAD). These rules are used in an automated procedure that assigns emotional-state values to document contents. These values are then used by the speech synthesizer to add emotion to the speech. The system is language-independent and content-free.
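A minimal sketch of rule-based PAD scoring for a text span; the keyword-to-PAD table below is an invented stand-in for the rules the paper derives from its self-assessment study.

```python
PAD_RULES = {                   # word: (pleasure, arousal, dominance) in [-1, 1]
    "win":   ( 0.8,  0.6,  0.5),
    "lose":  (-0.7,  0.4, -0.4),
    "calm":  ( 0.3, -0.6,  0.1),
    "storm": (-0.3,  0.7, -0.2),
}

def pad_score(text):
    """Average the PAD values of all rule-matched words (neutral if none)."""
    hits = [PAD_RULES[w] for w in text.lower().split() if w in PAD_RULES]
    if not hits:
        return (0.0, 0.0, 0.0)
    n = len(hits)
    return tuple(sum(dim) / n for dim in zip(*hits))

print(pad_score("the storm made us lose"))  # negative pleasure, raised arousal
```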
Emotional speech synthesis: Applications, history and possible future
Proc. ESSV, 2009
Emotional speech synthesis is an important piece of the puzzle on the long road to human-like human-machine interaction. Along the way, many milestones, such as emotional audio messages or believable characters in games, will be reached. This paper discusses technical aspects of emotional speech synthesis, shows practical applications based on a higher-level framework, and highlights new developments in the realization of affective speech with non-uniform unit-selection synthesis and voice-transformation techniques.
Emotional Text to Speech Synthesis: A Review
IJARCCE, 2017
Several attempts have been made to add emotional effects to synthesized speech, and several prototypes and fully operational systems have been built on different synthesis techniques. For Indian languages, however, there is still no fully operational text-to-speech synthesis system with emotional effects. This paper gives an overview of what has been done in this field for some of the Indian languages and highlights the issues faced during development.
Articulatory features for expressive speech synthesis
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
This paper describes some of the results from the project "New Parameterization for Emotional Speech Synthesis" held at the Summer 2011 JHU CLSP workshop. We describe experiments on using articulatory features as a meaningful intermediate representation for speech synthesis. This parameterization not only allows us to reproduce natural-sounding speech but also to generate stylistically varying speech. We show methods for deriving articulatory features from speech, predicting articulatory features from text, and reconstructing natural-sounding speech from the predicted features. The methods were tested on clean speech databases in English and German, as well as on databases of speech varying in emotion and personality. The resulting speech was evaluated both objectively, using techniques normally used for emotion identification, and subjectively, using crowdsourcing.
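A minimal sketch of why an articulatory parameterization lends itself to style control: articulatory trajectories (random stand-ins here for tract variables such as lip aperture or tongue-tip position) can be scaled about their mean to hyper- or hypo-articulate before waveform reconstruction. The feature names and scaling rule are illustrative assumptions, not the workshop's models.

```python
import numpy as np

FEATURES = ["lip_aperture", "tongue_tip", "tongue_body", "velum", "glottis"]

rng = np.random.default_rng(1)
traj = rng.standard_normal((200, len(FEATURES)))   # frames x articulatory dims

def restyle(traj, effort=1.3):
    """Exaggerate (effort > 1) or reduce (effort < 1) articulatory movement."""
    mean = traj.mean(axis=0, keepdims=True)
    return mean + effort * (traj - mean)

hyper = restyle(traj, effort=1.3)   # e.g., emphatic, clearly articulated speech
hypo  = restyle(traj, effort=0.7)   # e.g., relaxed, casual speech
```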