Ingo Siegert | Otto-von-Guericke-Universität Magdeburg

Papers by Ingo Siegert

Automatic differentiation of form-function-relations of the discourse particle "hm" in a naturalistic human-computer interaction

The development of speech-controlled assistance systems has gained importance in recent years, with applications ranging from driver assistance systems in the automotive sector to everyday use in mobile devices such as smartphones or tablets. To ensure the reliability of these systems, not only the meaning of the spoken text but also meta-information about the user or dialogue functions such as attention or turn-taking has to be perceived and processed. This further information is transmitted through the intonation of words or sentences. In human communication, discourse particles serve to forward information without interrupting the speaker. For the German language, J. E. Schmidt empirically discovered seven types of form-function concurrences of the isolated DP "hm". To make use of these distinctions in human-computer interaction as well, it is useful to be able to distinguish the different meanings of the DP "hm" automatically. In this paper we present an automatic classification method...
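
For illustration, a minimal sketch of how such a contour-based distinction could be automated: the F0 contour of an isolated "hm" is extracted, length-normalised, and assigned to the nearest form prototype. The prototype shapes, parameter values and nearest-neighbour rule are assumptions for the example, not the classification method of the paper.

```python
# Sketch: assign the pitch contour of an isolated "hm" to a form prototype.
# The prototypes below are invented stand-ins for Schmidt's form types.
import librosa
import numpy as np

def pitch_contour(path, n_points=50):
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]                       # keep voiced frames only
    idx = np.linspace(0, len(f0) - 1, n_points)  # resample to fixed length
    contour = np.interp(idx, np.arange(len(f0)), f0)
    return contour - contour.mean()              # remove register, keep shape

t = np.linspace(-1, 1, 50)
PROTOTYPES = {"rising": 30 * t, "falling": -30 * t, "level": 0 * t}  # Hz offsets

def classify(path):
    c = pitch_contour(path)
    return min(PROTOTYPES, key=lambda k: np.linalg.norm(c - PROTOTYPES[k]))
```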

Classification of Functional-Meanings of Non-isolated Discourse Particles in Human-Human-Interaction

Lecture Notes in Computer Science, 2016

Discourse Particles in Human-Human and Human-Computer Interaction – Analysis and Evaluation

Lecture Notes in Computer Science, 2016

Audio Compression and Its Impact on Emotion Recognition in Affective Computing

Enabling a natural (human-like) spoken conversation with technical systems requires that the affective information contained in spoken language be intelligibly transmitted. This study investigates the role of speech and music codecs for affect intelligibility. Affective speech from the well-known EMO-DB corpus was encoded and decoded using four state-of-the-art acoustic codecs at different bit-rates; the spectral error and the human affect recognition ability in labelling experiments were investigated and set in relation to results of automatic recognition of basic emotions. Through this approach, the general affect intelligibility as well as the emotion-specific intelligibility was analysed. Considering the results of the conducted automatic recognition experiments, the SPEEX codec configuration with a bit-rate of 6.6 kbit/s is recommended to achieve high compression together with overall good unweighted average recalls (UARs) for all emotions.
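
The UAR (unweighted average recall) used as the evaluation measure here is the macro-average of per-class recalls, so rare emotions count as much as frequent ones. A minimal sketch with scikit-learn, using invented labels:

```python
# UAR = mean of per-emotion recalls, independent of class frequencies.
from sklearn.metrics import recall_score

y_true = ["anger", "anger", "joy", "sadness", "joy", "neutral"]
y_pred = ["anger", "joy",   "joy", "sadness", "joy", "anger"]

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")  # (0.5 + 1.0 + 0.0 + 1.0) / 4 = 0.625
```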

Simulated group meetings – insights from sociology and engineering

In the last decades, tremendous improvements in acoustic modelling of speech for automatic speech recognition have been made. Nonetheless, interaction between humans and computer systems via an automatic speech recognition interface still regularly leads to unsatisfied users. In this paper, recordings of three simulated group meetings with limited vocabulary are analysed from two perspectives to find starting points for overcoming this problem: one perspective is given by a computer engineer and the other by a sociologist. Our intention is to provide insights into the dynamics of group meetings which might lead to more robust and adaptive automatic speech recognition systems and to an enhanced sociological understanding of group situations from a different point of view.

Describing Human Emotions Through Mathematical Modelling

To design a companion technology, we focus on the appraisal theory model to predict emotions and to determine the appropriate system behaviour to support human-computer interaction. Until now, the implementation of emotion processing has been hindered by the fact that the required theories originate from diverging research areas, so divergent research techniques and result representations are present. Since this difficulty arises repeatedly in interdisciplinary research, we investigated the use of mathematical modelling as a unifying language to translate the coherence of appraisal theory. We found that mathematical category theory supports the modelling of human emotions according to the appraisal theory model and hence assists the implementation.

Appropriate emotional labelling of non-acted speech using Basic Emotions, Geneva Emotion Wheel and Self Assessment Manikins

2011 IEEE International Conference on Multimedia and Expo, 2011

In emotion recognition from speech, a good transcription and annotation of the given material is crucial. Moreover, the question of how to find good emotional labels for new data material is a basic issue. It is not only a question of which emotion labels to choose; it is also a matter of how well labellers can cope with the annotation methods. In this paper, we present our investigations of emotional labelling with three different methods (Basic Emotions, Geneva Emotion Wheel and Self Assessment Manikins) and compare them in terms of emotion coverage and usability. We show that emotion labels derived from the Geneva Emotion Wheel or Self Assessment Manikins fulfil our requirements, whereas Basic Emotions are not feasible for emotion labelling of spontaneous speech.

Multimodal affect recognition in spontaneous HCI environment

2012 IEEE International Conference on Signal Processing, Communication and Computing (ICSPCC 2012), 2012

Human-computer interaction (HCI) is known to be a multimodal process. In this paper we show results of affect recognition experiments with non-acted, affective multimodal data from the new Last Minute Corpus (LMC). This corpus is closer to real HCI applications than other known data sets, in which affective behaviour is elicited in ways untypical for HCI. We utilize features from three modalities: facial expressions, prosody and gesture. The results show that even simple fusion architectures can reach respectable results compared to other approaches. Further, we show that probably not all features and modalities contribute substantially to the classification process; prosody and eye-blink frequency appear to contribute most in the analysed dataset.
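
To make the notion of a "simple fusion architecture" concrete, here is a minimal decision-level fusion sketch that averages the class-probability outputs of per-modality classifiers. The probability values and the uniform weighting are illustrative assumptions, not the paper's exact setup.

```python
# Late fusion: average per-modality posteriors over [negative, neutral, positive].
import numpy as np

p_face    = np.array([0.2, 0.5, 0.3])   # facial-expression classifier output
p_prosody = np.array([0.1, 0.7, 0.2])   # prosody classifier output
p_gesture = np.array([0.3, 0.4, 0.3])   # gesture classifier output

fused = (p_face + p_prosody + p_gesture) / 3
print("fused posterior:", fused, "-> class index", fused.argmax())
```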

Investigating the form-function-relation of the discourse particle “hm” in a naturalistic human-computer interaction

A verbal human interaction consists of several information layers. Apart from the pure textual information, further details regarding the speaker's feelings, beliefs and social relations are transmitted. This additional information is encoded through the acoustics, e.g. through speaking style or intonation. Especially the intonation conveys specific information about the speaker's communicative relation and his attitude towards the actual dialogue. Since intonation is influenced by semantic and grammatical information, it is advisable to investigate the intonation of so-called discourse particles such as “hm” or “uhm”: they cannot be inflected but can be emphasized, and they occur at crucial communicative points. Discourse particles exhibit the same intonation curves (pitch contours) as whole sentences and thus may indicate the same functional meanings. For the German language, J. E. Schmidt empirically discovered seven types of form-function concurrences of the isolated discourse particle “hm”. For successful speech-controlled naturalistic human-computer interaction (HCI), the pure textual information as well as the individual skills, preferences, and affective states of the user have to be known. Therefore, it seems reasonable to consider discourse particles in HCI as well. To determine the dialogue function, methods are needed that preserve the pitch contours and can assign them to defined form prototypes. Furthermore, it has to be investigated whether different pitch contours occur in naturalistic HCI and whether they are congruent with the findings of linguists. In this paper we present first results on the extraction and correlation of pitch contours, investigate the different form-function relations of the discourse particle “hm” in the naturalistic LAST MINUTE corpus, and answer the question of which form-function relations can be expected in naturalistic HCI.

Annotation and Classification of Changes of Involvement in Group Conversation

2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013

The detection of involvement in a conversation is important to assess the degree to which humans are participating in either a human-human or a human-computer interaction. In particular, detecting changes in a group's involvement in a multi-party interaction is of interest to distinguish several constellations within the group itself. This information can further be used in situations where technical support of meetings is favoured, for instance for focusing a camera or switching microphones. Moreover, it could also help to improve the performance of technical systems applied in human-machine interaction. In this paper, we concentrate on video material from the Table Talk corpus. We introduce a way of annotating and classifying changes of involvement and discuss the reliability of the annotation. Further, we present classification results based on video features using multi-layer networks.
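
As an illustration of the classification step, the following sketch trains a multi-layer network on per-segment video features to decide whether the group's involvement changed. Feature layout, labels and data are invented stand-ins, not the Table Talk features.

```python
# Toy multi-layer network for detecting involvement changes from video features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))      # e.g. motion/gaze statistics per segment
y = rng.integers(0, 2, size=200)    # 1 = involvement change, 0 = no change

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```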

The Influence of Context Knowledge for Multi-modal Affective Annotation

Lecture Notes in Computer Science, 2013

In emotion recognition from speech, a good transcription and annotation of the given material is crucial. Moreover, the question of how to find good emotional labels for new data material is a basic issue. An important question is how context influences the decision of the annotator. In this paper, we present our investigations of emotional labelling of natural multimodal data with and without the computer's responses, one main piece of context information within a natural human-computer interaction. We show that for emotional labels the computer's responses did not influence the decisions of the annotators.

Inter-rater reliability for emotion annotation in human–computer interaction: comparison and methodological improvements

Journal on Multimodal User Interfaces, 2013

To enable naturalistic human-computer interaction, the recognition of emotions and intentions is receiving increased attention, and several modalities are combined to cover all human communication abilities. For this reason, naturalistic material is recorded in which the subjects are guided through an interaction with crucial points but with the freedom to react individually. Such material captures realistic user reactions but lacks clear labels, so a good transcription and annotation of the material is essential; for this, the assignment of human annotators has become widely accepted. A good measure of the reliability of labelled material is the inter-rater agreement. In this paper we investigate the inter-rater agreement achieved on emotionally annotated interaction corpora, utilizing Krippendorff's alpha, and present methods to improve the reliability. We show that the reliabilities obtained with different methods do not differ much, so the choice can rest on other aspects. Furthermore, a multimodal presentation of the items in their natural order increases the reliability.
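
For reference, Krippendorff's alpha handles multiple raters and missing annotations directly; it can be computed, for instance, with the third-party krippendorff package (pip install krippendorff). The rater-by-item matrix below is invented, with np.nan marking items a rater did not label.

```python
# Sketch: alpha for a nominal emotion annotation, 3 raters x 6 items.
import numpy as np
import krippendorff  # third-party package

ratings = np.array([
    [0, 0, 1, 2, 2, np.nan],   # emotion categories coded as integers
    [0, 1, 1, 2, 2, 0],
    [0, 0, 1, 2, 1, 0],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.2f}")  # values above ~0.8 are commonly taken as reliable
```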

Emotion and Disposition Detection in Medical Machines: Chances and Challenges

Intelligent Systems, Control and Automation: Science and Engineering, 2014

Machines designed for medical applications beyond usual data acquisition and processing need to cooperate with and adapt to humans in order to fulfil their supportive tasks. Technically, medical machines are therefore considered affective systems, capable of detecting, assessing and adapting to emotional states and dispositional changes in users. One of the upcoming applications of affective systems is their use as supportive machines involved in the diagnosis and therapy of psychiatric disorders. These machines have the additional requirement of being capable of controlling persuasive dialogues in order to obtain relevant patient data despite disadvantageous set-ups. These automated abilities of technical systems, combined with enhanced processing, storage and observational capabilities, raise both chances and challenges in medical applications. We focus on analyzing the objectivity, reliability and validity of current techniques used to determine the emotional states of speakers from speech, and the implications that arise. We discuss the underlying technical and psychological models and analyze recent machine assessment results of emotional states obtained through dialogues. Finally, we discuss the involvement of affective systems as medical machines in the psychiatric diagnostics process and in therapy sessions with respect to the technical and ethical circumstances.

Audio-Based Pre-classification for Semi-automatic Facial Expression Coding

Lecture Notes in Computer Science, 2013

Human Behaviour in HCI: Complex Emotion Detection through Sparse Speech Features

Lecture Notes in Computer Science, 2013

To obtain a more human-like interaction with technical systems, these have to be adaptable to the users' individual skills, preferences, and current emotional state. In human-human interaction (HHI), the behaviour of the speaker is characterised by semantic and prosodic cues given as short feedback signals, so-called discourse particles (DPs). These signals minimally communicate certain dialogue functions such as attention, understanding, confirmation, or other attitudinal reactions, and thus play an important role in the progress and coordination of interaction. They allow the dialogue partners to inform each other of their behavioural or affective state without interrupting the ongoing dialogue. Vocal communication provides acoustic details revealing the speaker's feelings, beliefs, and social relations. Incorporating DPs into human-computer interaction (HCI) systems will allow the detection of complex emotions, which are currently hard to access. Complex emotions in turn are closely related to human behaviour. Hence, integrating automatic DP detection and complex emotion assignment into HCI systems provides a first approach to the integration of human behaviour understanding in HCI systems. In this article we present methods to extract the pitch contour of DPs and to assign complex emotions to observed DPs. We investigate the occurrence of DPs in naturalistic HCI and show that DPs may be assigned to complex emotions automatically. Furthermore, we show that DPs are indeed related to behaviour, exhibiting age- and gender-specific usage during naturalistic HCI. Additionally, we show that DPs may be used to automatically detect and classify complex emotions during HCI.

Investigation of Speaker Group-Dependent Modelling for Recognition of Affective States from Speech

Cognitive Computation, 2014

For successful human-machine interaction (HCI), the pure textual information and the individual skills, preferences, and affective states of the user must be known. Therefore, as a starting point, the user's current affective state has to be recognized. In this work we investigated how additional knowledge, for example the age and gender of the user, can be used to improve the recognition of affective states. Two methods from automatic speech recognition are used to incorporate age and gender differences: speaker group-dependent (SGD) modelling and vocal tract length normalisation (VTLN). The investigations were performed on four corpora with acted and natural affective speech. Different features and two classification methods (Gaussian mixture models (GMMs) and multi-layer perceptrons (MLPs)) were used. In addition, the effects of channel compensation and contextual characteristics were analysed. The results are compared with our own baseline results and with results reported in the literature. Two hypotheses were tested: first, that incorporating age information further improves speaker group-dependent modelling; second, that acoustic normalisation does not achieve the same improvement as speaker group-dependent modelling, because the age and gender of a speaker affect the way emotions are expressed.
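
A minimal sketch of the speaker group-dependent (SGD) idea: one GMM per emotion is trained separately for each age/gender group, and a test utterance is scored only against the models of its own group. Function names, feature layout, group labels and the component count are assumptions for the example, not the paper's configuration.

```python
# SGD modelling sketch: per-group, per-emotion GMMs; argmax log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_sgd_models(features, emotions, groups, n_components=4):
    """features: (n, d) array; emotions, groups: length-n label arrays."""
    models = {}
    for g in np.unique(groups):
        for e in np.unique(emotions):
            X = features[(groups == g) & (emotions == e)]
            models[(g, e)] = GaussianMixture(n_components, random_state=0).fit(X)
    return models

def predict(models, x, group):
    """Score x (d,) only against the GMMs of the speaker's own group."""
    scores = {e: m.score(x[None, :]) for (g, e), m in models.items() if g == group}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)                        # synthetic demo data
feats = rng.normal(size=(120, 6))
emos = rng.choice(["anger", "neutral"], size=120)
grps = rng.choice(["young_female", "older_male"], size=120)
models = train_sgd_models(feats, emos, grps)
print(predict(models, feats[0], grps[0]))
```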

Fusion of Fragmentary Classifier Decisions for Affective State Recognition

Lecture Notes in Computer Science, 2013

Real human-computer interaction systems based on different modalities face the problem that not all information channels are always available at regular time steps. Nevertheless, an estimate of the current user state is required at any time to enable the system to interact instantaneously based on the available modalities. A novel approach to decision fusion of fragmentary classifications is therefore proposed and empirically evaluated on audio and video signals of a corpus of non-acted user behavior. It is shown that visual and prosodic analyses successfully complement each other, leading to an outstanding performance of the fusion architecture.
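
The core of the approach can be sketched in a few lines: at each time step, fuse only the classifier decisions that are actually available and skip missing channels. The uniform averaging and the two-class posteriors are illustrative assumptions, not the proposed architecture itself.

```python
# Fuse fragmentary decisions: ignore channels that are unavailable right now.
import numpy as np

def fuse(decisions):
    """decisions: dict channel -> probability vector, or None if unavailable."""
    available = [p for p in decisions.values() if p is not None]
    if not available:
        return None                      # no evidence at this time step
    return np.mean(available, axis=0)    # average over present channels only

step = {"audio": np.array([0.6, 0.4]), "video": None}  # video dropped out
print(fuse(step))                                      # -> [0.6 0.4]
```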

Discourse Particles and User Characteristics in Naturalistic Human-Computer Interaction

Lecture Notes in Computer Science, 2014

Emotion Detection in HCI: From Speech Features to Emotion Space

12th IFAC/IFIP/IFORS/IEA Symposium on Analysis, Design, and Evaluation of Human-Machine Systems, 2013

Control mechanisms in modern human-computer interaction (HCI) underwent a paradigm shift from textual or display-based control to more intuitive control mechanisms such as speech, gesture and facial expression. Speech in particular provides a high information density, delivering information about the speaker's inner state as well as his intention and demand. While word-based analyses allow the speaker's request to be understood, further speech characteristics reveal the speaker's emotion, intention and motivation. Therefore, emotion detection from speech has become significant in modern HCI applications. However, the results from the disciplines involved in emotion detection are not easily merged. Engineers developing voice-controlled HCI systems work in "feature spaces", relying on technically measurable acoustic and spectral features. Psychologists analysing and identifying emotions work with emotion categories, schemes or dimensional emotion spaces, describing emotions in terms of quantities and qualities of humanly notable expressions. While engineering methods register the slightest variations in speech, emotion theories allow emotions to be compared and identified but must rely on human judgements. However, both perspectives are essential and must be combined to allow machines to assess affective states during HCI. To provide a link between machine-measurable variations in emotional speech and dimensional emotion theory, significant features describing emotions must be identified and analysed regarding their transferability to emotion space. In this article we present a justifiable feature selection for emotion detection from speech and show how to relate measurable features to emotions. We discuss our transformation model and validate both the feature selection and the model on a selection of the Emo-DB corpus.
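
As a toy illustration of such a transformation, a linear map from acoustic features into a two-dimensional valence-arousal space can be fitted by regression. The features and annotations below are synthetic stand-ins for Emo-DB material, and the linear model is an assumption for the example, not the paper's transformation model.

```python
# Fit a linear feature-space -> emotion-space mapping on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                  # e.g. pitch, energy, spectral features
W = rng.normal(size=(8, 2))                    # hidden ground-truth mapping
Y = X @ W + 0.1 * rng.normal(size=(100, 2))    # [valence, arousal] annotations

model = LinearRegression().fit(X, Y)
print("predicted (valence, arousal):", model.predict(X[:1])[0])
```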