AVEC 2012 – The Continuous Audio/Visual Emotion Challenge

AVEC 2011 – The First International Audio/Visual Emotion Challenge

The Audio/Visual Emotion Challenge and Workshop (AVEC 2011) is the first competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and audiovisual emotion analysis, with all participants competing under strictly the same conditions. This paper first describes the challenge participation conditions. Next follows the data used – the SEMAINE corpus – and its partitioning into train, development, and test partitions for the challenge, with labelling in four dimensions, namely activity, expectation, power, and valence. Further, audio and video baseline features are introduced, as well as baseline results that use these features for the three sub-challenges of audio, video, and audiovisual emotion recognition.

An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild

2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021

In this work we tackle the task of video-based audiovisual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW2). Poor illumination conditions, head/body orientation and low image resolution constitute factors that can potentially hinder performance in case of methodologies that solely rely on the extraction and analysis of facial features. In order to alleviate this problem, we leverage both bodily and contextual features, as part of a broader emotion recognition framework. We choose to use a standard CNN-RNN cascade as the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the RGB input modality, we construct an aural stream which operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Aff-Wild2 dataset verify the validity of our intuitive multi-stream and multi-modal approach towards emotion recognition "in-the-wild". Emphasis is laid on the beneficial influence of the human body and scene context, as aspects of the emotion recognition process that have been left relatively unexplored up to this point. All the code was implemented using PyTorch and is publicly available.
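Concretely, the kind of two-stream CNN-RNN cascade described above might be sketched as follows. This is a minimal illustrative sketch assuming per-frame RGB crops and per-frame mel-spectrogram slices as inputs; the layer sizes, GRU recurrence and simple concatenation fusion are assumptions, not the authors' released implementation.

```python
# Minimal sketch of a two-stream CNN-RNN cascade for seq2seq emotion
# recognition; layer sizes and the fusion scheme are illustrative
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Per-frame CNN encoder followed by a GRU over the frame sequence."""
    def __init__(self, in_channels, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)

    def forward(self, x):                               # x: (B, T, C, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).flatten(1)    # (B*T, 64)
        out, _ = self.rnn(feats.view(b, t, -1))         # (B, T, hidden)
        return out

class AudioVisualSeq2Seq(nn.Module):
    """Fuses an RGB stream and a mel-spectrogram stream at each time step."""
    def __init__(self, n_outputs=2, hidden=128):
        super().__init__()
        self.visual = StreamEncoder(in_channels=3, hidden=hidden)
        self.aural = StreamEncoder(in_channels=1, hidden=hidden)
        self.head = nn.Linear(2 * hidden, n_outputs)    # e.g. valence/arousal

    def forward(self, rgb, mel):    # rgb: (B,T,3,H,W), mel: (B,T,1,F,W)
        fused = torch.cat([self.visual(rgb), self.aural(mel)], dim=-1)
        return self.head(fused)     # (B, T, n_outputs) per-frame predictions
```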

Audio-Visual Emotion Recognition in Video Clips

IEEE Transactions on Affective Computing, 2017

This paper presents a multimodal emotion recognition system, which is based on the analysis of audio and visual cues. From the audio channel, Mel-Frequency Cepstral Coefficients, Filter Bank Energies and prosodic features are extracted. For the visual part, two strategies are considered. First, facial landmarks' geometric relations, i.e. distances and angles, are computed. Second, we summarize each emotional video into a reduced set of key-frames, and a convolutional neural network is applied to these key-frames to visually discriminate between the emotions. Finally, confidence outputs of all the classifiers from all the modalities are used to define a new feature space to be learned for final emotion label prediction, in a late fusion/stacking fashion. The experiments conducted on the SAVEE, eNTERFACE'05, and RML databases show significant performance improvements by our proposed system in comparison to current alternatives, defining the current state of the art on all three databases.
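The late fusion/stacking step described above, where per-modality classifier confidences form a new feature space for a final classifier, can be sketched roughly as below. The SVM base learners, the logistic-regression meta-learner and the 5-fold out-of-fold scheme are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative sketch of late fusion by stacking: per-modality confidence
# outputs are concatenated into a new feature space and learned by a
# second-level classifier. Classifier choices here are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def stacked_fusion(modality_feats_train, modality_feats_test, y_train):
    """Each list entry is an (n_samples, n_feats) array for one modality."""
    z_train, z_test = [], []
    for X_tr, X_te in zip(modality_feats_train, modality_feats_test):
        base = SVC(probability=True)
        # Out-of-fold confidences avoid leaking the base fit into the meta level.
        z_train.append(cross_val_predict(base, X_tr, y_train,
                                         cv=5, method="predict_proba"))
        z_test.append(base.fit(X_tr, y_train).predict_proba(X_te))
    # Stacked feature space = concatenated confidence outputs.
    meta = LogisticRegression(max_iter=1000).fit(np.hstack(z_train), y_train)
    return meta.predict(np.hstack(z_test))
```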

The INTERSPEECH 2009 emotion challenge

2009

The last decade has seen a substantial body of literature on the recognition of emotion from speech. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test conditions exist to compare performances under exactly the same conditions. Instead, a multiplicity of evaluation strategies is employed – such as cross-validation or percentage splits without proper instance definition – which prevents exact reproducibility. Further, in order to face more realistic scenarios, the community is in desperate need of more spontaneous and less prototypical data. The INTERSPEECH 2009 Emotion Challenge aims at bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results. The FAU Aibo Emotion Corpus [1] serves as basis, with clearly defined test and training partitions incorporating speaker independence and different room acoustics as needed in most real-life settings. This paper introduces the challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech.

EmotiW 2016: video and group-level emotion recognition challenges

Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016

This paper discusses the baseline for the Emotion Recognition in the Wild (EmotiW) 2016 challenge. Continuing on the theme of automatic affect recognition 'in the wild', the EmotiW 2016 challenge consists of two sub-challenges: an audio-video based emotion recognition sub-challenge and a new group-based emotion recognition sub-challenge. The audio-video based sub-challenge is based on the Acted Facial Expressions in the Wild (AFEW) database. The group-based emotion recognition sub-challenge is based on the Happy People Images (HAPPEI) database. We describe the data, baseline method, challenge protocols and the challenge results. A total of 22 and 7 teams participated in the audio-video based emotion and group-based emotion sub-challenges, respectively.

Affective state level recognition in naturalistic facial and vocal expressions

2014

Naturalistic affective expressions change at a rate much slower than the typical rate at which video or audio is recorded. This increases the probability that consecutive recorded instants of expressions represent the same affective content. In this paper, we exploit such a relationship to improve the recognition performance of continuous naturalistic affective expressions. Using datasets of naturalistic affective expressions (AVEC 2011 audio and video dataset, PAINFUL video dataset) continuously labeled over time and over different dimensions, we analyze the transitions between levels of those dimensions (e.g., transitions in pain intensity level). We use an information theory approach to show that the transitions occur very slowly and hence suggest modeling them as first-order Markov models. The dimension levels are considered to be the hidden states in the Hidden Markov Model (HMM) framework. Their discrete transition and emission matrices are trained by using the labels provided with the training set. The recognition problem is converted into a best path-finding problem to obtain the best hidden states sequence in HMMs. This is a key difference from previous use of HMMs as classifiers. Modeling of the transitions between dimension levels is integrated in a multistage approach, where the first level performs a mapping between the affective expression features and a soft decision value (e.g., an affective dimension level), and further classification stages are modeled as HMMs that refine that mapping by taking into account the temporal relationships between the output decision labels. The experimental results for each of the unimodal datasets show overall performance to be significantly above that of a standard classification system that does not take into account temporal relationships. In particular, the results on the AVEC 2011 audio dataset outperform all other systems presented at the international competition.
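The second-stage idea summarised above, treating quantised dimension levels as HMM hidden states whose transition and emission matrices are estimated from training labels, with the best level sequence recovered by best-path decoding, might look roughly like this. The counting-with-smoothing estimate and the plain Viterbi decoder are illustrative assumptions rather than the authors' exact formulation.

```python
# Sketch: affect levels as HMM hidden states, transition matrix counted from
# training label tracks, and Viterbi decoding of the best level sequence.
# Estimation and smoothing details are assumptions for illustration.
import numpy as np

def transition_matrix(label_sequences, n_levels, eps=1e-6):
    """Estimate log level-to-level transitions from training label tracks."""
    counts = np.full((n_levels, n_levels), eps)
    for seq in label_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def viterbi(log_init, log_trans, log_emit, observations):
    """log_init: (S,), log_trans: (S,S), log_emit: (S,O), observations: (T,) ints."""
    S, T = log_init.shape[0], len(observations)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans        # (S_prev, S_next)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, observations[t]]
    # Backtrace the most likely sequence of affect levels.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```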

Visual and audio analysis of movies video for emotion detection @ Emotional Impact of Movies task, MediaEval 2018

2018

This work reports the methodology that the CERTH-ITI team developed so as to recognize the emotional impact that movies have on their viewers in terms of valence/arousal and fear. More specifically, deep convolutional neural networks and several machine learning techniques are utilized to extract visual features and classify them with the trained models, while audio features are also taken into account in the fear scenario, leading to highly accurate recognition rates.

iVectors for Continuous Emotion Recognition

This work proposes the use of the iVectors paradigm for performing continuous emotion recognition. To do so, a segmentation of the audio stream with a fixed-length sliding window is performed, in order to obtain a temporal context that is sufficient for capturing the emotional information of the speech. These segments are projected into the iVectors space, and the continuous emotion labels are learnt by canonical correlation analysis. A voice activity detection strategy is incorporated into the system in order to ignore the non-speech segments, which do not provide any information about the emotional state of the speaker, and to recreate a real-world scenario. Results on the framework of the Audiovisual Emotion Challenge (AVEC) 2013 show the potential of this approach for the emotion recognition task, obtaining promising results while using a low-dimensional data representation.
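A rough sketch of the learning step described above: fixed-length sliding windows over frame-level audio features yield per-segment vectors (here a simple stand-in for the i-vector extraction), which are related to the continuous labels by canonical correlation analysis. The window/hop handling, the stand-in embedding, and the ridge regression applied in the canonical space are assumptions for illustration; the labels are assumed multi-dimensional (e.g., valence and arousal).

```python
# Sketch: sliding-window segmentation plus CCA between segment embeddings
# and continuous emotion labels. The embedding and regressor are stand-ins.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

def sliding_windows(features, win, hop):
    """features: (T, D) frame-level features -> (N, win*D) segment vectors."""
    segs = [features[s:s + win].ravel()
            for s in range(0, len(features) - win + 1, hop)]
    return np.stack(segs)

def fit_cca_regressor(segment_embeddings, labels, n_components=2):
    """Project embeddings with CCA, then regress the affect dimensions."""
    cca = CCA(n_components=n_components).fit(segment_embeddings, labels)
    reg = Ridge().fit(cca.transform(segment_embeddings), labels)
    return cca, reg

def predict(cca, reg, segment_embeddings):
    return reg.predict(cca.transform(segment_embeddings))
```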

The motion in emotion — A CERT based approach to the FERA emotion challenge

Face and Gesture 2011, 2011

This paper assesses the performance of measures of facial expression dynamics derived from the Computer Expression Recognition Toolbox (CERT) for classifying emotions in the Facial Expression Recognition and Analysis (FERA) Challenge. The CERT system automatically estimates facial action intensity and head position using learned appearance-based models on single frames of video. CERT outputs were used to derive a representation of the intensity and motion in each video, consisting of the extremes of displacement, velocity and acceleration. Using this representation, emotion detectors were trained on the FERA training examples. Experiments on the released portion of the FERA dataset are presented, as well as results on the blind test. No consideration of subject identity was taken into account in the blind test. The F1 scores were well above the baseline criterion for success.
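The video-level representation described above, built from the extremes of displacement, velocity and acceleration of per-frame CERT outputs, can be sketched as below. Which summary statistics are taken (here, per-channel maxima and minima of each derivative order) is an assumption for illustration.

```python
# Sketch: per-frame CERT outputs (AU intensities, head pose) are differenced
# to obtain velocity and acceleration, and each signal is summarised by its
# extremes to give one fixed-length descriptor per video.
import numpy as np

def motion_descriptor(tracks):
    """tracks: (T, K) per-frame CERT outputs for one video -> 1-D feature vector."""
    displacement = tracks
    velocity = np.diff(tracks, n=1, axis=0)
    acceleration = np.diff(tracks, n=2, axis=0)
    feats = []
    for signal in (displacement, velocity, acceleration):
        feats.extend([signal.max(axis=0), signal.min(axis=0)])
    return np.concatenate(feats)   # 6*K values: max/min per derivative order
```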

Continuous Analysis of Affect from Voice and Face

Human affective behavior is multimodal, continuous and complex. Despite major advances within the affective computing research field, modeling, analyzing, interpreting and responding to human affective behavior still remains a challenge for automated systems. Therefore, affective and behavioral computing researchers have recently invested increased effort in exploring how to best model, analyze and interpret the subtlety, complexity and continuity of affective behavior in terms of latent dimensions (e.g., arousal, power and valence) and appraisals, rather than in terms of a small number of discrete emotion categories (e.g., happiness and sadness). This chapter aims to (i) give a brief overview of the existing efforts and the major accomplishments in modeling and analysis of emotional expressions in dimensional and continuous space while focusing on open issues and new challenges in the field, and (ii) introduce a representative approach for multimodal continuous analysis of affect from voice and face.