Validation of an acoustical modelling of emotional expression in Spanish using speech synthesis techniques

Development of an emotional speech synthesiser in Spanish

1999

A key issue in current speech synthesis is addressing the variability of human speech. One of the main sources of this diversity is the emotional state of the speaker. Most of the recent work in this area has focused on the prosodic aspects of speech and on rule-based formant-synthesis experiments. Even with an improved voice source, such systems cannot achieve a smiling happy voice or the menacing quality of cold anger. For this reason, we have performed two experiments aimed at developing a concatenative emotional synthesiser: a synthesiser that can copy the quality of an emotional voice without an explicit mathematical model.
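A concatenative synthesiser of this kind strings together recorded units from an emotional corpus rather than generating the waveform from a model, so the voice quality is inherited from the recordings. As a minimal illustration of the joining step (not the authors' implementation; the unit list and sample rate are hypothetical inputs):

```python
import numpy as np

def concatenate_units(units, sr, fade_ms=10):
    """Join waveform units with a short linear crossfade to mask the joins.

    units: list of 1-D float arrays (pre-selected phones/diphones cut
    from an emotional corpus); sr: sample rate in Hz.
    """
    fade = int(sr * fade_ms / 1000)
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for u in units[1:]:
        u = u.astype(float)
        # Crossfade the tail of the output with the head of the next unit
        out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out
```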

Analysis and modelling of emotional speech in Spanish

1999

The importance of speech prosody for conveying emotional information has been extensively underlined in the literature. Major elements such as pitch, tempo and stress are presented as the main acoustic correlates of emotion in human speech. Nevertheless, as several authors have shown, voice quality is also a relevant feature in emotion recognition. In this paper, we present the prosodic analysis, modelling and evaluation of the Spanish Emotional Speech Database, which includes four emotions: happiness, sadness, cold anger and surprise. Our results show that, for Spanish, the contribution of prosody to the recognisability of the uttered emotion varies greatly from one emotion to another, with sadness and surprise being more suprasegmental, and happiness and cold anger being rather segmental.
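The prosodic correlates named above (pitch, tempo, stress/energy) are straightforward to measure from a labelled corpus. A minimal sketch, assuming a mono WAV file per utterance and using librosa rather than the paper's own tooling:

```python
import numpy as np
import librosa

def prosodic_profile(wav_path):
    """Summarise the main prosodic correlates of emotion for one utterance:
    F0 statistics (pitch), RMS energy (stress) and a rough tempo proxy."""
    y, sr = librosa.load(wav_path, sr=None)
    # F0 track; pyin returns NaN for unvoiced frames, which we drop
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    rms = librosa.feature.rms(y=y)[0]
    return {
        "f0_mean_hz": float(np.mean(f0)),
        "f0_range_hz": float(np.max(f0) - np.min(f0)),
        "energy_mean": float(np.mean(rms)),
        "duration_s": len(y) / sr,  # tempo proxy given a fixed text
    }
```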

Spanish emotional speech: towards concatenative synthesis

1998

A key issue in current recognition and synthesis tasks is addressing the variability of human speech. One of the main sources of this diversity is the emotional state of the speaker. Speech under emotional conditions can be modelled as a deviation from neutral voice. Most of the recent work in emotional synthesis has focused on the prosodic aspects of this kind of speech. In a paper at ICSLP'98 [1], we presented a thorough study of emotional speech in Spanish and its application to TTS, including a prototype system that simulates emotional speech using a commercial synthesiser.
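The "deviation from neutral voice" view lends itself to a simple rule table: each emotion is expressed as multiplicative offsets applied to the neutral prosodic targets before they are sent to the synthesiser. A sketch of that idea follows; the factors are illustrative placeholders, not the paper's measured values:

```python
# Illustrative deviation rules: multiplicative offsets on neutral prosody.
# The numbers are placeholders, not values reported in the paper.
EMOTION_RULES = {
    "happiness":  {"f0_mean": 1.15, "f0_range": 1.3, "rate": 1.1},
    "sadness":    {"f0_mean": 0.9,  "f0_range": 0.7, "rate": 0.8},
    "cold_anger": {"f0_mean": 1.0,  "f0_range": 1.2, "rate": 1.05},
    "surprise":   {"f0_mean": 1.2,  "f0_range": 1.5, "rate": 1.0},
}

def apply_emotion(neutral, emotion):
    """Scale a neutral prosodic specification by the emotion's deviation rules."""
    rules = EMOTION_RULES[emotion]
    return {k: v * rules.get(k, 1.0) for k, v in neutral.items()}

# e.g. apply_emotion({"f0_mean": 120.0, "f0_range": 40.0, "rate": 1.0}, "sadness")
```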

Spanish Expressive Voices: corpus for emotion research in Spanish

2008

A new emotional multimedia database has been recorded and aligned. The database comprises speech and video recordings of one actor and one actress simulating a neutral state and the Big Six emotions: happiness, sadness, anger, surprise, fear and disgust. Thanks to its careful design and its size (more than 100 minutes per emotion), the recorded database allows comprehensive studies on emotional speech synthesis, prosodic modelling, speech conversion, far-field speech recognition, and speech- and video-based emotion identification. The database has been automatically labelled for prosodic purposes (5% was manually revised). The whole database has been validated through objective and perceptual tests, achieving a validation score as high as 89%.
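A perceptual validation score of this kind is typically the rate at which listeners identify the intended emotion in a forced-choice test. A minimal sketch of that computation, with a hypothetical response format (the paper does not publish its scoring code):

```python
from collections import Counter

def validation_scores(responses):
    """Per-emotion recognition rate from a forced-choice listening test.

    responses: list of (intended_emotion, perceived_emotion) pairs,
    one per listener judgement (hypothetical format).
    """
    totals, hits = Counter(), Counter()
    for intended, perceived in responses:
        totals[intended] += 1
        if perceived == intended:
            hits[intended] += 1
    return {emo: hits[emo] / totals[emo] for emo in totals}

# e.g. validation_scores([("anger", "anger"), ("anger", "disgust")])
# -> {"anger": 0.5}
```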

Towards a Gendered Mexican Spanish Emotive Speech Synthetic Voice

International Journal of Signal Processing Systems, 2016

A new Mexican Spanish voice was created from a set of emotive recordings (neutral, happy, sad and angry) taken from two speakers (male and female). All recordings were used to generate a single database; from this database we extracted the emotional information of each phrase and added new tags to the phonetic transcription to select the correct gender and emotion during training and synthesis time.

Index Terms: emotive speech synthesis, HTS synthesis technique, hidden Markov models, Mel frequency cepstral coefficients

I. INTRODUCTION

At the beginning of this century, a synthesis system was created using hidden Markov models (HMMs). This system selects subphonemes from the centroid subphonemes of VQ clustering. It was created by Dr. Tokuda and his group and was called 'HMM-based Text to Speech' (HTS) [1]-[3]. The FESTIVAL group also created a synthesis system based on HMMs, called CLUSTERGEN [4], with voice naturalness similar to the Tokuda system [5], [6]. The most important difference between HTS and CLUSTERGEN is that the latter does not use an impulse-excited filter; it takes the subphonemes directly from the corpus. In HTS, the subphonemes are represented by MFCCs, F0 and duration. Another significant change in this century was the use of parametrisations other than MFCCs. The best known is STRAIGHT [7], [8], which uses a set of parameters derived from the spectral envelope of the subphoneme stored in the database. A natural voice was created for Mexican Spanish at our laboratory using HTS [5], [6], obtaining good results. Once this baseline had been generated, a new set of recordings was made to obtain a new synthetic voice with emotive components (neutral, happy, sad and angry). Two sets of emotive recordings, one male and one female, were used, as well as one purely neutral set. Three databases were created: one for each speaker (male and female) and an additional one combining all voices. In the combined database, gender and mood tags were used to select the appropriate set of parameters at synthesis time.
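Adding gender and emotion tags to the transcription usually means extending each full-context label so the decision trees can split on them during training and synthesis. A minimal sketch of that idea (the field names and format are hypothetical, not the exact HTS question set used in the paper):

```python
def tag_label(fullcontext_label, gender, emotion):
    """Append hypothetical gender/emotion context fields to an HTS-style
    full-context label so decision trees can condition on them."""
    return f"{fullcontext_label}/G:{gender}/E:{emotion}"

# e.g. tag_label("sil^a-b+c=d", "female", "happy")
# -> "sil^a-b+c=d/G:female/E:happy"
```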

Speech synthesis and emotions: a compromise between flexibility and believability

2008

The synthesis of emotional speech is still an open question. The principal issue is how to introduce expressivity without compromising the naturalness of the synthetic speech provided by state-of-the-art technology. In this paper, two concatenative synthesis systems are described and some approaches to this problem are proposed. For example, exploiting the intrinsic expressivity of certain speech acts, via the correlation between affective states and communicative functions, has proven an effective solution. This implies a different approach in the design of the speech databases, as well as in the labelling and selection of the "expressive" units. In fact, beyond phonetic and prosodic criteria, linguistic and pragmatic aspects should also be considered. The management of units of different types (neutral vs. expressive) is also an important issue.
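Managing neutral and expressive units together is commonly handled in the unit-selection target cost: a candidate whose expressive label mismatches the requested style pays a penalty. A toy sketch of that idea (the feature names and weights are assumptions, not taken from the systems described here):

```python
def target_cost(target, candidate, style_penalty=1.0):
    """Toy unit-selection target cost: prosodic distance plus a penalty
    when the candidate's expressive label mismatches the request.
    Feature names and weights are illustrative assumptions."""
    cost = 0.0
    cost += 0.5 * abs(target["f0"] - candidate["f0"]) / target["f0"]
    cost += 0.3 * abs(target["dur"] - candidate["dur"]) / target["dur"]
    if target["style"] != candidate["style"]:  # e.g. "neutral" vs "expressive"
        cost += style_penalty
    return cost
```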

Verification of acoustical correlates of emotional speech using formant-synthesis

… of the ISCA Workshop on Speech and …, 2000

This paper explores the perceptual relevance of acoustical correlates of emotional speech by means of speech synthesis. Besides, the research aims at the development of »emotion rules« which enable an optimized speech synthesis system to generate emotional speech. Two investigations using this synthesizer are described: 1) the systematic variation of selected acoustical features to gain a preliminary impression regarding the importance of certain acoustical features for emotional expression, and 2) the specific manipulation of a stimulus spoken under emotionally neutral conditions to investigate further the effect of certain features and the overall ability of the synthesizer to generate recognizable emotional expression. It is shown that this approach is indeed capable of generating emotional speech that is recognized almost as well as utterances realized by actors.
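Systematic variation of selected acoustical features amounts to generating stimuli over a grid of synthesizer settings. A minimal sketch of producing such a stimulus grid (the parameter names and value ranges are illustrative, not the paper's rule set):

```python
import itertools

# Illustrative parameter grid for systematic variation of acoustic features;
# the parameter names and value ranges are assumptions, not the paper's rules.
GRID = {
    "f0_mean_scale":  [0.85, 1.0, 1.15],
    "f0_range_scale": [0.7, 1.0, 1.4],
    "tempo_scale":    [0.8, 1.0, 1.2],
}

def stimulus_settings():
    """Yield one synthesizer configuration per cell of the factorial grid."""
    keys = list(GRID)
    for values in itertools.product(*(GRID[k] for k in keys)):
        yield dict(zip(keys, values))

# 3 x 3 x 3 = 27 stimuli per base utterance
```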

Analysis of emotions in Mexican Spanish speech

Classification of emotions was conducted for Mexican Spanish. Four different sets of features were used to find the best differentiation of eight emotions taken from recordings of three poems spoken by a professional announcer. The feature sets included statistics of the fundamental frequency and of the first four formants of the speech signal, the duration of pauses, and the time-frame intensity. The classification was made using an unsupervised neural network architecture based on a self-organized map. Considering each feature set separately, the 30 ms time-frame intensity gave the best results, separating similar emotions such as sadness, contempt and melancholy from emotional states such as happiness, anger and derision. The results were improved by adding the mean value of the fundamental frequency to the time-frame intensity; the results in each poem showed the eight emotional states, including an emotion defined as normal, but the performance...
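A self-organized map clusters the feature vectors without labels; emotions then appear as regions of the trained map. A compact numpy sketch of SOM training (a generic illustration, not the paper's architecture or parameters):

```python
import numpy as np

def train_som(data, rows=8, cols=8, iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self-organizing map on feature vectors (n_samples, n_dims).
    Generic illustration; not the paper's architecture or parameters."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(rows, cols, data.shape[1]))
    # Grid coordinates used by the neighbourhood function
    gy, gx = np.mgrid[0:rows, 0:cols]
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit
        d = np.linalg.norm(w - x, axis=2)
        by, bx = np.unravel_index(np.argmin(d), d.shape)
        # Decaying learning rate and neighbourhood radius
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
        w += lr * h[..., None] * (x - w)
    return w

# afterwards, map each utterance's feature vector to its best-matching unit
```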

Synthesis of Speech with Emotions

Proc. International Conference on Communication, Computers and Devices

This paper describes our proposed methodology for synthesizing speech with emotion. The work starts with pitch-synchronous analysis of single-phoneme utterances carrying natural emotion to obtain the linear prediction (LP) parameters. To synthesize speech with emotion, we modify the pitch contour of a normal utterance of a single phoneme and subsequently filter this signal using the LP parameters. The proposed technique can be used to improve the naturalness of voice in a text-to-speech system.
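The core operation is source-filter resynthesis: estimate LP coefficients from the emotional recording, then drive that LP filter with an excitation whose pitch contour has been modified. A minimal, non-pitch-synchronous sketch using librosa and scipy (the paper's own analysis is pitch-synchronous and more elaborate):

```python
import librosa
from scipy.signal import lfilter

def lp_resynthesis(emotional_wav, neutral_wav, order=16):
    """Filter a (pitch-modified) neutral signal through LP parameters
    estimated from an emotional utterance. Whole-signal sketch, not the
    paper's pitch-synchronous analysis."""
    y_emo, sr = librosa.load(emotional_wav, sr=None)
    y_neu, _ = librosa.load(neutral_wav, sr=sr)
    # LP coefficients of the emotional phoneme (spectral envelope), a[0] = 1
    a = librosa.lpc(y_emo, order=order)
    # Residual of the neutral signal via the inverse filter A(z)
    residual = lfilter(a, [1.0], y_neu)
    # ...the pitch contour of `residual` would be modified at this point...
    # Resynthesis through the emotional all-pole filter 1/A(z)
    return lfilter([1.0], a, residual), sr
```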