Towards a standard set of acoustic features for the processing of emotion in speech

A state of the art review on emotional speech databases

2003

Thirty-two emotional speech databases are reviewed. Each database consists of a corpus of human speech pronounced under different emotional conditions. A basic description of each database and its applications is provided. The study draws three conclusions. First, automated emotion recognition on these databases cannot achieve a correct classification rate exceeding 50% for the four basic emotions, i.e., twice the rate of random selection. Second, natural emotions cannot be classified as easily as simulated (i.e., acted) ones. Third, the emotions most commonly searched for, in decreasing order of frequency, are anger, sadness, happiness, fear, disgust, joy, surprise, and boredom.

Speech Emotion Analysis: Exploring the Role of Context

IEEE Transactions on Multimedia, 2000

Automated analysis of human affective behavior has attracted increasing attention in recent years. With the research shift toward spontaneous behavior, many challenges have come to the surface, ranging from database collection strategies to the use of new feature sets (e.g., lexical cues apart from prosodic features). The use of contextual information, however, is rarely addressed in the field of affect expression recognition, even though affect recognition by humans is evidently influenced to a large degree by context. Our contribution in this paper is threefold. First, we introduce a novel set of features based on cepstrum analysis of pitch and intensity contours. We evaluate the usefulness of these features on two different databases: the Berlin Database of Emotional Speech (EMO-DB) and a locally collected audiovisual database in car settings (CVRRCar-AVDB). The overall recognition accuracy achieved is over 84% for seven emotions in EMO-DB and over 87% for three emotion classes in CVRRCar-AVDB, based on tenfold stratified cross-validation. Second, we introduce the collection of a new audiovisual database in an automobile setting (CVRRCar-AVDB). In the current study, we use only the audio channel of the database. Third, we systematically analyze the effects of different contexts on the two databases. We present context analysis of subject and text based on speaker/text-dependent/-independent analysis on EMO-DB. Furthermore, we perform context analysis based on gender information on EMO-DB and CVRRCar-AVDB. The results based on these analyses are promising.
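As a rough illustration of what "cepstrum analysis of a contour" can mean, the sketch below computes a real cepstrum (the inverse DFT of the log-magnitude spectrum) of a short pitch contour and keeps the first few coefficients as features. The abstract does not give the authors' exact feature definition, so the function names, the synthetic contour, the magnitude floor, and the choice of eight coefficients are all assumptions made for illustration only.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for short contours."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(
            values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
            for t in range(n)
        )
        result.append(total / n if inverse else total)
    return result


def real_cepstrum(contour, floor=1e-8):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum.

    `floor` (an assumed value) keeps log() away from zero-magnitude bins.
    """
    spectrum = dft(contour)
    log_magnitude = [math.log(max(abs(c), floor)) for c in spectrum]
    return [c.real for c in dft(log_magnitude, inverse=True)]


# Hypothetical pitch contour in Hz (64 frames), not data from the paper.
pitch_contour = [120 + 20 * math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]

# Keep the first few cepstral coefficients as a compact contour descriptor.
features = real_cepstrum(pitch_contour)[:8]
```

The same function could be applied to an intensity contour; how many coefficients to keep, and how to treat unvoiced frames, would need to follow the paper itself.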

Techniques for the phonetic description of emotional speech

… Tutorial and Research Workshop (ITRW) on Speech …, 2000

It is inconceivable that there could be information present in the speech signal that could be detected by the human auditory system but which is not accessible to acoustic analysis and phonetic categorisation. We know that humans can reliably recognise a range of emotions produced by speakers of their own language on the basis of the acoustic signal alone, yet it appears that our ability to identify the relevant acoustic correlates is at present rather limited. This paper proposes that we have to build a bridge between the human perceptual experience and the measurable properties of the acoustic signal by developing an analytic framework based partly on auditory analysis. A possible framework is outlined which is based on the work of the Reading/Leeds Emotional Speech Database. The project was funded by ESRC Grant no. R000235285.

Emotion from Speakers to Listeners: Perception and Prosodic Characterization of Affective Speech

Speaker Classification II, 2007

... specific phenomenon of irregular phonation or laryngealization and thereby point out the inherent problem of speaker-dependency, which relates the problems of speaker identification and emotion recognition to each other. The juristic implications of acquiring knowledge about the speaker on the basis of his or her speech in the context of emotion recognition are addressed by Erik Eriksson and his co-authors, discussing, "inter alia, assessment of emotion in others, witness credibility, forensic investigation, and training of law enforcement officers."

Emotional speech: Towards a new generation of databases

Speech Communication

Research on speech and emotion is moving from a period of exploratory research into one where there is a prospect of substantial applications, notably in human–computer interaction. Progress in the area relies heavily on the development of appropriate databases. This paper ...

Acoustic markers impacting discrimination accuracy of emotions in voice

2022

The quality of communication depends on how accurately the listener perceives the intended message. In addition to understanding the words, listeners are expected to interpret the speaker's accompanying emotional tone. However, it is not always clear why a neutral voice can be perceived as affective or vice versa. The present study aimed to investigate the differences between the acoustic profiles of angry, happy, and neutral emotions and to identify the acoustic markers that can lead to misperception of emotions conveyed through the voice. The study employed an encoding-decoding approach. Ten professional actors recorded the Latvian word /laba:/ in neutral, happy, and angry intonations, and thirty-two age-matched respondents were asked to identify the emotion conveyed in the heard voice sample. A complete acoustic analysis was conducted for each voice sample using PRAAT, which included fundamental frequency (F0), intensity level (IL), spectral (HNR) and cepstral parameters (CPPs), and duration of a produced word (DPW). The vocal expressions of emotions were analyzed from both encoding and decoding perspectives. The results showed statistically significant differences in the acoustic parameters that distinguish vocally expressed happy and angry emotions from neutral voices and acoustic parameters that were different between happy and angry emotions.
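To make three of the listed parameters concrete, the sketch below estimates F0 by a plain autocorrelation peak search, an RMS level in dB relative to full scale, and the duration of a sample. This is not PRAAT's algorithm and not the study's procedure: the function names, the 75 to 500 Hz search range, the synthetic 200 Hz test tone, and the dB reference are assumptions chosen for illustration.

```python
import math


def rms_level_db(samples):
    """RMS level in dB relative to full scale (1.0), not dB SPL."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))  # floor avoids log10(0) on silence


def estimate_f0(samples, sample_rate, f_min=75.0, f_max=500.0):
    """Estimate F0 as the lag with the highest raw autocorrelation.

    The search range (f_min, f_max) is an assumed, typical speech range.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(
            samples[t] * samples[t + lag] for t in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic stand-in for a voiced sample: 200 Hz sine, amplitude 0.5, 50 ms.
SAMPLE_RATE = 8000
signal = [
    0.5 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(400)
]

f0_hz = estimate_f0(signal, SAMPLE_RATE)
level_db = rms_level_db(signal)
duration_s = len(signal) / SAMPLE_RATE
```

On this clean test tone the estimator recovers the 200 Hz fundamental; real speech would need voicing detection and octave-error handling, which dedicated tools such as PRAAT provide.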

Interdependencies among Voice Source Parameters in Emotional Speech

IEEE Transactions on Affective Computing, 2011

Emotions have strong effects on the voice production mechanisms and consequently on voice characteristics. The magnitude of these effects, measured using voice source parameters, and the interdependencies among parameters have not been examined. To better understand these relationships, voice characteristics were analyzed in 10 actors' productions of a sustained /a/ vowel in five emotions. Twelve acoustic parameters were studied and grouped according to their physiological backgrounds: three related to subglottal pressure, five related to the transglottal airflow waveform derived from inverse filtering the audio signal, and four related to vocal fold vibration. Each emotion appeared to possess a specific combination of acoustic parameters reflecting a specific mixture of physiologic voice control parameters. Features related to subglottal pressure showed strong within-group and between-group correlations, demonstrating the importance of accounting for vocal loudness in voice analyses. Multiple discriminant analysis revealed that a parameter selection that was based, in a principled fashion, on production processes could yield rather satisfactory discrimination outcomes (87.1 percent based on 12 parameters and 78 percent based on three parameters). The results of this study suggest that systems to automatically detect emotions should use a hypothesis-driven approach to selecting parameters that directly reflect the physiological parameters underlying voice and speech production.

On the Influence of Phonetic Content Variation for Acoustic Emotion Recognition

Lecture Notes in Computer Science, 2008

Acoustic modeling in today's emotion recognition engines employs general models independent of the spoken phonetic content. This seems to work well enough given sufficient instances to cover a broad variety of phonetic structures and emotions at the same time. However, data is usually sparse in the field, and the question arises whether unit-specific models, such as word emotion models, could outperform the typical general models. In this respect, this paper tries to answer the question of how strongly acoustic emotion models depend on the textual and phonetic content. We investigate the influence on the turn and word level using state-of-the-art techniques for frame and word modeling on the well-known public Berlin Emotional Speech and Speech Under Simulated and Actual Stress databases. The results clearly show that the phonetic structure strongly influences the accuracy of emotion recognition.