Andreas Wendemuth | Otto-von-Guericke-Universität Magdeburg
Papers by Andreas Wendemuth
2008 IEEE Spoken Language Technology Workshop, 2008
An utterance can be conceived as a hidden sequence of semantic concepts expressed in words or phrases. The problem of understanding the meaning underlying a spoken utterance in a dialog system can be partly solved by decoding the hidden sequence of semantic concepts from the observed sequence of words. In this paper, we describe a hierarchical HMM-based semantic concept labeling model trained on semantically unlabeled data. The hierarchical model is compared with a flat-concept based model in terms of performance, ambiguity resolution ability and expressive power of the output. It is shown that the proposed method outperforms the flat-concept model in these points.
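As a hedged illustration of the decoding step described above, the following Python sketch runs Viterbi decoding over a toy concept set; the concepts, transition matrix and word-given-concept probabilities are invented placeholders, not the hierarchical model trained in the paper.

```python
import numpy as np

# Minimal Viterbi sketch: map a word sequence to a hidden sequence of
# semantic concepts.  All probabilities below are hypothetical placeholders.
concepts = ["greeting", "date", "destination"]
A = np.log(np.array([[0.6, 0.2, 0.2],     # concept-to-concept transitions
                     [0.1, 0.7, 0.2],
                     [0.2, 0.2, 0.6]]))
pi = np.log(np.array([0.5, 0.25, 0.25]))  # initial concept probabilities

def emission_logprob(word, concept_idx):
    # Placeholder word-given-concept model; a real system would estimate
    # this from (semantically unlabeled) training data, e.g. via EM.
    vocab = {"hello": [0.9, 0.05, 0.05],
             "tomorrow": [0.05, 0.9, 0.05],
             "berlin": [0.05, 0.05, 0.9]}
    return np.log(vocab.get(word, [1 / 3, 1 / 3, 1 / 3])[concept_idx])

def viterbi(words):
    T, N = len(words), len(concepts)
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    for j in range(N):
        delta[0, j] = pi[j] + emission_logprob(words[0], j)
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + A[:, j]
            back[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[back[t, j]] + emission_logprob(words[t], j)
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):       # backtrace to the first word
        path.append(back[t, path[-1]])
    return [concepts[i] for i in reversed(path)]

print(viterbi(["hello", "tomorrow", "berlin"]))
```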
User satisfaction is an important aspect of human-computer interaction (HCI): if a user is not satisfied, he or she might not be willing to use such a system. Therefore, it is crucial for HCI applications to be able to recognise the user satisfaction level in order to react in an appropriate way. For such recognition tasks, data-driven methods have proven to deliver useful and robust results. But a data-driven user satisfaction model needs labelled and reliable data, which is not easy to obtain in the case of user satisfaction, since it is not directly accessible. In the investigation presented here, the users are asked directly about their satisfaction level regarding their performance in a task during a close-to-real-life HCI. This results in a one-to-one mapping between a satisfaction level and the expressed vocal characteristics in the user's utterance. This data is then used to build a model to recognise satisfied and dissatisfied user states, as a first step towards a general model of the user's satisfaction state.
One objective of affective computing is the automatic processing of human emotions. Considering human speech, filled pauses are one of the cues giving insight into the emotional state of a human being. Filled pauses are short speech events without a specified semantic meaning, but they have a variety of communicative and affective functions. The detection and processing of such speech events can help a technical system to recognise the affective state of the user. To solve this task using machine learning methods, huge amounts of annotated data and thus human resources are necessary. In this paper we introduce an efficient approach for semi-automatic labelling of filled pauses, aiming at finding as many of them as possible with minimal effort. We investigate to which extent such an approach can reduce the effort of manual transcription of filled pauses. By using our approach, we could for the first time quantify that the time necessary for the human-supervised verification can be reduced by up to 85% compared to a full manual annotation.
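A minimal sketch of such a semi-automatic labelling loop is given below, assuming a hypothetical high-recall candidate detector whose proposals are only verified, not transcribed, by a human; the function names, scores and time stamps are illustrative and do not reflect the detector used in the paper.

```python
# Semi-automatic labelling sketch: an automatic detector proposes
# filled-pause candidates, a human only confirms or rejects them.

def detect_candidates(recording):
    # Hypothetical stand-in for the actual detector: returns
    # (start_s, end_s, confidence) tuples of likely filled pauses.
    return [(1.20, 1.55, 0.91), (4.80, 5.02, 0.64), (7.10, 7.45, 0.88)]

def verify(candidates, threshold=0.5):
    accepted = []
    for start, end, score in candidates:
        if score < threshold:
            continue                        # discard unlikely proposals
        answer = input(f"Filled pause at {start:.2f}-{end:.2f} s? [y/n] ")
        if answer.strip().lower() == "y":   # human-supervised verification
            accepted.append((start, end))
    return accepted

labels = verify(detect_candidates("session_01.wav"))
print(labels)
```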
As the recognition of emotion from speech has matured to a degree where it becomes applicable in real-life settings, it is time for a realistic view on obtainable performances. Most studies tend to overestimation in this respect: acted data is often used rather than spontaneous data, results are reported on pre-selected prototypical data, and true speaker-disjunctive partitioning is still less common than simple cross-validation. A considerably more realistic impression can be gathered by inter-set evaluation: we therefore show results employing six standard databases in a cross-corpora evaluation experiment. To better cope with the observed high variances, different types of normalization are investigated. In total, 1.8k individual evaluations indicate the crucial performance inferiority of inter- to intra-corpus testing.
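One normalization type commonly investigated in this setting is corpus-wise z-normalisation of the feature space; the sketch below shows the idea on random placeholder feature matrices and is not tied to the specific setup or databases of the paper.

```python
import numpy as np

# Corpus-wise z-normalisation sketch: each corpus is normalised to zero
# mean / unit variance per feature before cross-corpus training and testing.
rng = np.random.default_rng(0)
train_corpus = rng.normal(loc=2.0, scale=1.5, size=(200, 40))   # e.g. corpus A
test_corpus = rng.normal(loc=-1.0, scale=0.8, size=(150, 40))   # e.g. corpus B

def corpus_norm(features):
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12   # avoid division by zero
    return (features - mu) / sigma

train_n = corpus_norm(train_corpus)
test_n = corpus_norm(test_corpus)
print(train_n.mean(), test_n.std())        # roughly 0 and 1 per corpus
```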
Computer Speech & Language, Sep 1, 2018
Markov Models, and Support Vector Machines (SVMs) (for both cf. e.g. Anagnostopoulos et al., 2012; Schuller et al., 2011a), applied to larger datasets. An overview on such improvements is given for instance in Baimbetov et al. (2015) and Schuller (2015).
Frontiers in Computer Science, Jul 14, 2022
Objective: Acoustic addressee detection is a challenge that arises in human group interactions, as well as in interactions with technical systems. The research domain is relatively new, and no structured review is available. Especially due to the recent growth in the usage of voice assistants, this topic has received increased attention. To allow a natural interaction on the same level as human interactions, many studies focused on the acoustic analysis of speech. The aim of this survey is to give an overview of the different studies and compare them in terms of utilized features, datasets, as well as classification architectures, which has not been done so far. Methods: The survey followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We included all studies analyzing acoustic characteristics of speech utterances to automatically detect the addressee. For each study, we describe the used dataset, feature set, classification architecture, performance, and other relevant findings. Results: 1,581 studies were screened, of which 23 studies met the inclusion criteria. The majority of studies utilized German or English speech corpora. Twenty-six percent of the studies were tested on in-house datasets, for which only limited information is available. Nearly 40% of the studies employed hand-crafted feature sets; the other studies mostly relied on the Interspeech ComParE 2013 feature set or Log-FilterBank Energy and Log Energy of Short-Time Fourier Transform features. Twelve out of 23 studies used deep-learning approaches, the other 11 studies used classical machine learning methods. Nine out of 23 studies furthermore employed classifier fusion. Conclusion: Speech-based automatic addressee detection is a relatively new research domain. Especially when vast amounts of material or sophisticated models are used, device-directed speech can be distinguished from non-device-directed speech. Furthermore, a clear distinction between in-house datasets and pre-existing ones can be drawn, and a clear trend toward pre-defined larger feature sets (with partly used feature selection methods) is apparent.
ITG Symposium of Speech Communication, 2016
Cognitive Technologies, 2017
A demonstration of a successful multimodal dynamic human-computer interaction (HCI), in which the system adapts to the current situation and the user's state, is provided using the scenario of purchasing a train ticket. This scenario demonstrates that Companion Systems are facing the challenge of analyzing and interpreting explicit and implicit observations obtained from sensors under changing environmental conditions. In a dedicated experimental setup, a wide range of sensors was used to capture the situative context and the user, comprising video and audio capturing devices, laser scanners, a touch screen, and a depth sensor. Explicit signals describe the user's direct interaction with the system, such as interaction gestures, speech and touch input. Implicit signals are not directly addressed to the system; they comprise the user's situative context, his or her gesture, speech, body pose, facial expressions and prosody. Both the multimodally fused explicit signals and the interpreted information from implicit signals steer the application component, which was kept deliberately robust. The application offers stepwise dialogs gathering the most relevant information for purchasing a train ticket, where the dialog steps are sensitive and adaptable within processing time to the interpreted signals and data. We further highlight the system's potential for a fast-track ticket purchase when several pieces of information indicate a hurried user.
Cognitive Technologies, 2017
In general, humans interact with each other using multiple modalities. The main channels are speech, facial expressions, and gesture. However, bio-physiological data such as biopotentials can also convey valuable information which can be used to interpret the communication in a dedicated way. A Companion-System can use these modalities to perform an efficient human-computer interaction (HCI). To do so, the multiple sources need to be analyzed and combined in technical systems. However, so far only a few studies have been published dealing with the fusion of three or even more such modalities. This chapter addresses the necessary processing steps in the development of a multimodal system applying fusion approaches.
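As a rough illustration of one possible combination step, the sketch below performs a weighted late fusion of class posteriors from three modalities; the posteriors and weights are placeholders, and the chapter itself covers a broader range of fusion approaches.

```python
import numpy as np

# Weighted late-fusion sketch for three modalities.  The per-modality class
# posteriors and the weights are illustrative placeholders only.
speech = np.array([0.6, 0.3, 0.1])   # P(class | speech features)
face = np.array([0.5, 0.4, 0.1])     # P(class | facial expressions)
bio = np.array([0.2, 0.5, 0.3])      # P(class | biopotentials)

weights = np.array([0.5, 0.3, 0.2])  # e.g. chosen from validation accuracy

fused = weights[0] * speech + weights[1] * face + weights[2] * bio
fused /= fused.sum()                 # renormalise to a distribution
print("fused posterior:", fused, "-> decision:", int(np.argmax(fused)))
```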
Cognitive Technologies, 2017
During system interaction, the user’s emotions and intentions shall be adequately determined and predicted to recognize tendencies in his or her interests and dispositions. This allows for the design of an evolving search user interface (ESUI) which adapts to changes in the user’s emotional reaction and the user’s needs and claims.
Lecture Notes in Computer Science, 2017
Most technical communication systems use speech compression codecs to save transmission bandwidth. Much development effort has gone into guaranteeing high speech intelligibility, resulting in different compression techniques: Analysis-by-Synthesis, psychoacoustic modeling and a hybrid mode of both. Our first assumption is that the hybrid mode improves the speech intelligibility. But enabling a natural spoken conversation also requires the affective, namely emotional, information contained in spoken language to be intelligibly transmitted. Usually, compression methods are avoided for emotion recognition problems, as it is feared that compression degrades the acoustic characteristics needed for an accurate recognition [1]. By contrast, in our second assumption we state that the combination of psychoacoustic modeling and Analysis-by-Synthesis codecs could actually improve speech-based emotion recognition by removing certain parts of the acoustic signal that are considered “unnecessary”, while still containing the full emotional information. To test both assumptions, we conducted an ITU-recommended POLQA measurement as well as several emotion recognition experiments employing two different datasets to verify the generality of this assumption. We compared our results on the hybrid mode with Analysis-by-Synthesis-only and psychoacoustic-modeling-only codecs. The hybrid mode does not show remarkable differences regarding the speech intelligibility, but it outperforms all other compression settings in the multi-class emotion recognition experiments and even achieves an approximately 3.3% absolute higher performance than the uncompressed samples.
Springer eBooks, 2017
An emotion is a mental and physiological state associated with a wide variety of feelings, thoughts, and behavior. Emotions are subjective experiences, or experienced from an individual point of view. Emotion is often associated with mood, temperament, personality, and disposition. Hence, in this paper a method for the detection of human emotions is discussed, based on acoustic features like pitch and energy. The proposed system uses the traditional MFCC approach [2] and then a nearest neighbor algorithm for the classification. Emotions are classified separately for male and female speakers, based on the fact that male and female voices have altogether different ranges [1][4], so the MFCCs vary considerably between the two.
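A minimal sketch of such an MFCC plus nearest-neighbour pipeline could look as follows; it assumes librosa and scikit-learn are available, and the file names and labels are hypothetical placeholders rather than the data used in the paper.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

# MFCC + nearest-neighbour sketch: one utterance-level vector per file,
# classified with 1-NN.  File names and labels are placeholders.
def mfcc_vector(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Collapse frame-level MFCCs to one fixed-length utterance vector
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

train_files = ["happy_01.wav", "angry_01.wav", "neutral_01.wav"]  # placeholders
train_labels = ["happy", "angry", "neutral"]

X_train = np.vstack([mfcc_vector(f) for f in train_files])
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)

print(knn.predict([mfcc_vector("test_utterance.wav")]))
```

In practice one would train two such models, one per gender, following the paper's observation that male and female voice ranges differ.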
Emotion recognition in far-field speech is challenging due to various acoustic factors. The present contribution especially considers dominant low-frequency room modes, which are often found in small rooms and cause variations in the low-frequency acoustical response at various listening locations. The impact of this spatial variation on low-level descriptors, used for feature sets in speech emotion recognition, has not been analysed in detail so far. This shortfall is addressed in this paper by utilising the well-known benchmark dataset EMO-DB, providing emotionally coloured speech of high quality. The measured room response of a speaker cabin is compared with artificial approximations of its frequency response in the low-frequency range. Two techniques were applied to obtain the approximations: the first technique uses multiple resonant filters in the low-frequency region, whose parameters are determined by a least-squares fit. The second technique uses a modified version of the cabin's amplitude spectrum, which was set to unity for higher frequencies and transformed to minimum phase and to the time domain. To be able to identify the impact of room modes on the low-level descriptors, correlation coefficients between the "clean" and modified EMO-DB utterances are calculated and compared to each other. Furthermore, a speech emotion recognition system is used to identify the impact on the recognition performance.
2020 IEEE International Conference on Human-Machine Systems (ICHMS), Sep 1, 2020
One of the core problems of machine learning applications, and in turn of recognizing emotions from speech, is the difficulty of deciding which measurable features are the ones containing the relevant information for the emotion classification task. As there is a wide variety of feature sets extractable from audio signals, which all have different origins as well as advantages and disadvantages, one should choose a method that provides an easy search for relevant features and helps with the selection process. The novelty in our contribution is using methods concentrating on the visual distinction of spectrograms generated from emotionally loaded utterances. The aim was to improve the search for strongly emotion-dependent areas in spectrograms, which can be used for easy and efficient emotion classification tasks. For this we employed methods proven to work for similar problems with spectrograms. In this research, Oriented FAST and Rotated BRIEF (ORB) was selected as the feature extraction algorithm, a method which is based on the Binary Robust Independent Elementary Features (BRIEF) extraction. The local features were computed from the Smartkom database, whose audio recordings were translated into visual spectrogram representations. Afterwards a Support Vector Machine (SVM) classifier was trained to recognize the emotions for a seven-class (emotion classes) and a two-class (valence/arousal distinction) problem. The proposed method attained high recall scores on this specific database, comparable to or exceeding similar methods in the literature. Different parameter settings, like window length, step size in spectrogram creation, denoising of spectrograms and the number of keypoints per spectrogram, were analyzed to validate their impact on the classification performance.
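The following sketch illustrates the general idea (spectrogram image, ORB descriptors, SVM), not the paper's exact configuration; the file names, labels and STFT parameters are placeholders, and mean-pooling the descriptors is a simplification of typical descriptor encodings such as bag-of-visual-words.

```python
import cv2
import numpy as np
import librosa
from sklearn.svm import SVC

# ORB-on-spectrogram sketch: render an utterance as a spectrogram image,
# extract ORB keypoint descriptors, pool them into one vector, classify
# with an SVM.  File names, labels and parameters are placeholders.
def spectrogram_image(path, n_fft=512, hop=160):
    y, sr = librosa.load(path, sr=16000)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    S_db = librosa.amplitude_to_db(S, ref=np.max)
    img = cv2.normalize(S_db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return img

orb = cv2.ORB_create(nfeatures=200)

def orb_feature_vector(path):
    img = spectrogram_image(path)
    _, descriptors = orb.detectAndCompute(img, None)
    if descriptors is None:              # no keypoints found in this image
        return np.zeros(32)
    return descriptors.mean(axis=0)      # pool the 32-byte BRIEF descriptors

files = ["anger_01.wav", "joy_01.wav"]   # hypothetical training files
labels = ["anger", "joy"]
X = np.vstack([orb_feature_vector(f) for f in files])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict([orb_feature_vector("test.wav")]))
```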
A new dataset, the Restaurant Booking Corpus (RBC), is introduced, comprising 90 telephone dialogs of 30 German-speaking students (10 males, 20 females) interacting either with one out of two different technical dialogue systems or with a human conversational partner. The aim of the participants was to reserve a table at each of three different restaurants for four persons under certain constraints (late dinner time for one day, sitting outside, reachable via public transport, availability of vegetarian food, getting the directions to the restaurant). The purpose of these constraints was to enable a longer and realistic conversation over all three calls. This dataset is explicitly designed to eliminate certain factors influencing the role model of the interlocutor: the effect of a visible counterpart, the speech content, and the dialog domain. Furthermore, AttrakDiff is used to evaluate the correct implementation of the conversational systems. A human annotation and an automatic recognition are pursued to verify that the speech characteristics are indistinguishable for the human-directed and the device-directed calls.
In automatic analyses of speech and emotion recognition, it has to be ensured that training and test conditions are similar. The presented study aims to investigate the influence of certain room acoustics on common features used for emotion recognition. As a benchmark database, this study focuses on the Berlin Database of Emotional Speech. The following rooms were analysed: a) modern lecture hall, b) older lecture hall, and c) staircase. For all rooms and their different recording setups, different acoustic measures were captured. The speech recordings analysed in this paper were realized only at the ideal locations within the rooms. Afterwards, 52 features (LLDs of emobase) were automatically extracted using OpenSMILE and a sample-wise statistical analysis (paired t-test) was carried out. Thereby, the number of acoustically degraded features and its effect size can be linked to the acoustic parameters of the different recording experiments. As a result, 15% of the degraded samples show a highly significant difference regarding all considered rooms. Especially MFCCs account for approximately 50% of the degradation. Furthermore, the degradation is analysed depending on the emotion and room acoustics.
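A sketch of the sample-wise paired t-test over features is given below; the feature matrices are random placeholders standing in for the 52 emobase LLD values of the clean and room-degraded recordings, and the significance threshold is only an example.

```python
import numpy as np
from scipy.stats import ttest_rel

# Paired t-test sketch: test each feature dimension for a significant
# difference between "clean" and room-influenced recordings.
rng = np.random.default_rng(1)
clean = rng.normal(size=(500, 52))                        # utterances x features
# Placeholder degradation: systematic shift plus noise mimicking the room
room = clean + 0.1 + rng.normal(scale=0.3, size=clean.shape)

alpha = 0.001                                             # "highly significant"
degraded = []
for i in range(clean.shape[1]):
    t_stat, p_val = ttest_rel(clean[:, i], room[:, i])
    if p_val < alpha:
        degraded.append((i, t_stat, p_val))

print(f"{len(degraded)} of {clean.shape[1]} features differ significantly")
```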
Intelligent systems reference library, 2019
Contemporary technical devices obey the paradigm of naturalistic multimodal interaction and user-centric individualisation. Users expect devices to interact intelligently, to anticipate their needs, and to adapt to their behaviour. To do so, companion-like solutions have to take into account the affective and dispositional state of the user, and therefore have to be trained and modified using interaction data and corpora. We argue that, in this context, big data alone is not purposeful, since important effects are obscured, and since high-quality annotation is too costly. We encourage the collection and use of enriched data. We report on recent trends in this field, presenting methodologies for collecting data with rich disposition variety and predictable classifications based on a careful design and standardised psychological assessments. Besides socio-demographic information and personality traits, we also use speech events to improve user state models. Furthermore, we present possibilities to increase the amount of enriched data in a cross-corpus or intra-corpus way based on recent learning approaches. Finally, we highlight particular recent neural recognition approaches that are feasible for smaller datasets and cover temporal aspects.
Lecture Notes in Computer Science, 2018
As technical systems around us aim at a more natural interaction, the task of automatic emotion recognition from speech receives ever-growing attention. One important question still remains unresolved: the definition of the most suitable features across different data types. In the present paper, we employed a random-forest based feature selection known from other research fields in order to select the most important features for three benchmark datasets. Investigating feature selection on the same corpus as well as across corpora, we achieved an increase in performance using only 40 to 60% of the features of the well-known emobase feature set.
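A minimal sketch of random-forest based feature ranking and selection is shown below, assuming scikit-learn; the data, the feature dimensionality and the 50% cut-off are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random-forest feature selection sketch: rank features by impurity-based
# importance and keep roughly the top half.  X and y are random placeholders
# for utterance-level acoustic features and emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 988))            # e.g. emobase-sized feature vectors
y = rng.integers(0, 4, size=500)           # four emotion classes (placeholder)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]

keep = ranking[: int(0.5 * X.shape[1])]    # retain the ~50% most important features
X_selected = X[:, keep]
print(X_selected.shape)
```

The reduced matrix X_selected would then be used to train the actual emotion classifier, either on the same corpus or across corpora.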