AVEC 2012 – The Continuous Audio/Visual Emotion Challenge
AVEC 2011 – The First International Audio/Visual Emotion Challenge
The Audio/Visual Emotion Challenge and Workshop (AVEC 2011) is the first competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and audiovisual emotion analysis, with all participants competing under strictly the same conditions. This paper first describes the challenge participation conditions. Next follows the data used – the SEMAINE corpus – and its partitioning into train, development, and test partitions for the challenge, with labelling in four dimensions, namely activity, expectation, power, and valence. Further, audio and video baseline features are introduced, as well as baseline results that use these features for the three sub-challenges of audio, video, and audiovisual emotion recognition.
IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021
In this work we tackle the task of video-based audiovisual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW2). Poor illumination conditions, head/body orientation and low image resolution are factors that can hinder performance for methodologies that rely solely on the extraction and analysis of facial features. To alleviate this problem, we leverage both bodily and contextual features as part of a broader emotion recognition framework. We choose a standard CNN-RNN cascade as the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the RGB input modality, we construct an aural stream which operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Aff-Wild2 dataset verify the validity of our intuitive multi-stream and multi-modal approach towards emotion recognition "in-the-wild". Emphasis is laid on the beneficial influence of the human body and scene context, aspects of the emotion recognition process that have been left relatively unexplored up to this point. All the code was implemented using PyTorch and is publicly available.
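As a rough illustration of the multi-stream CNN-RNN cascade described above, the sketch below pairs a small visual encoder over RGB frames with an aural encoder over mel-spectrogram chunks, runs each stream through a GRU, and fuses them by concatenation before a per-timestep regression head. The encoder sizes, the GRU choice, and the concatenation fusion are simplifying assumptions for illustration, not the exact architecture of the ABAW2 submission.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Tiny CNN that maps one image-like input to a feature vector."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):  # x: (B*T, C, H, W)
        return self.net(x)

class TwoStreamSeq2Seq(nn.Module):
    """CNN-RNN cascade over RGB frames plus an aural stream over mel-spectrograms."""
    def __init__(self, feat_dim=128, hidden=64, num_outputs=2):
        super().__init__()
        self.rgb_enc = FrameEncoder(3, feat_dim)
        self.mel_enc = FrameEncoder(1, feat_dim)        # spectrogram chunk as a 1-channel image
        self.rgb_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mel_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_outputs)  # per-timestep valence/arousal

    def forward(self, rgb, mel):
        # rgb: (B, T, 3, H, W); mel: (B, T, 1, F, S) -- one spectrogram chunk per frame
        B, T = rgb.shape[:2]
        v = self.rgb_enc(rgb.flatten(0, 1)).view(B, T, -1)
        a = self.mel_enc(mel.flatten(0, 1)).view(B, T, -1)
        v, _ = self.rgb_rnn(v)
        a, _ = self.mel_rnn(a)
        return self.head(torch.cat([v, a], dim=-1))     # (B, T, num_outputs)

model = TwoStreamSeq2Seq()
out = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 1, 64, 64))
print(out.shape)  # torch.Size([2, 8, 2])
```

Swapping the toy FrameEncoder for a pretrained backbone and adding further body/context streams would follow the same concatenate-then-predict pattern.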
Audio-Visual Emotion Recognition in Video Clips
IEEE Transactions on Affective Computing, 2017
This paper presents a multimodal emotion recognition system based on the analysis of audio and visual cues. From the audio channel, Mel-Frequency Cepstral Coefficients, Filter Bank Energies and prosodic features are extracted. For the visual part, two strategies are considered. First, geometric relations between facial landmarks, i.e. distances and angles, are computed. Second, each emotional video is summarized into a reduced set of key-frames, from which a convolutional neural network learns to visually discriminate between the emotions. Finally, confidence outputs of all the classifiers from all the modalities are used to define a new feature space to be learned for final emotion label prediction, in a late fusion/stacking fashion. The experiments conducted on the SAVEE, eNTERFACE'05, and RML databases show significant performance improvements by our proposed system in comparison to current alternatives, defining the state of the art on all three databases.
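The late fusion/stacking step can be sketched generically: each modality's classifier outputs per-class confidences, the confidences are concatenated into a new feature space, and a second-level classifier is trained on it. The snippet below uses synthetic features and scikit-learn estimators purely as an illustration; the paper's actual base classifiers and meta-learner may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, n_classes = 200, 6                      # e.g. six basic emotion categories
y = rng.integers(0, n_classes, n)
X_audio = rng.normal(size=(n, 40))         # stand-in for MFCC / prosodic features
X_video = rng.normal(size=(n, 60))         # stand-in for landmark geometry / CNN features

# First level: one classifier per modality, each producing confidence (probability) outputs.
clf_audio = SVC(probability=True).fit(X_audio, y)
clf_video = LogisticRegression(max_iter=1000).fit(X_video, y)

# New feature space: concatenated per-class confidences from all modalities.
# (In practice these should be held-out / cross-validated predictions to avoid
# overfitting the stack; training-set probabilities are used here only for brevity.)
Z = np.hstack([clf_audio.predict_proba(X_audio), clf_video.predict_proba(X_video)])

# Second level (stacking): a meta-classifier predicts the final emotion label from Z.
meta = LogisticRegression(max_iter=1000).fit(Z, y)
print(meta.predict(Z[:5]))
```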
The INTERSPEECH 2009 emotion challenge
2009
The last decade has seen a substantial body of literature on the recognition of emotion from speech. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test conditions exist to compare performances under exactly the same conditions. Instead, a multiplicity of evaluation strategies – such as cross-validation or percentage splits without proper instance definition – prevents exact reproducibility. Further, in order to face more realistic scenarios, the community is in desperate need of more spontaneous and less prototypical data. The INTERSPEECH 2009 Emotion Challenge aims at bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results. The FAU Aibo Emotion Corpus [1] serves as the basis, with clearly defined test and training partitions incorporating speaker independence and different room acoustics, as needed in most real-life settings. This paper introduces the challenge, the corpus, the features, and benchmark results of two popular approaches to emotion recognition from speech.
EmotiW 2016: video and group-level emotion recognition challenges
Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016
This paper discusses the baseline for the Emotion Recognition in the Wild (EmotiW) 2016 challenge. Continuing the theme of automatic affect recognition 'in the wild', the EmotiW 2016 challenge consists of two sub-challenges: an audio-video based emotion recognition sub-challenge and a new group-based emotion recognition sub-challenge. The audio-video based sub-challenge is based on the Acted Facial Expressions in the Wild (AFEW) database. The group-based emotion recognition sub-challenge is based on the Happy People Images (HAPPEI) database. We describe the data, baseline method, challenge protocols and the challenge results. A total of 22 and 7 teams participated in the audio-video based emotion and group-based emotion sub-challenges, respectively.
AVEC 2018 Workshop and Challenge
Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop - AVEC'18
The Audio/Visual Emotion Challenge and Workshop (AVEC 2018) "Bipolar disorder, and cross-cultural affect recognition" is the eighth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: bipolar disorder classification, cross-cultural dimensional emotion recognition, and emotional label generation from individual ratings.
Video and Image based Emotion Recognition Challenges in the Wild
Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015
The third Emotion Recognition in the Wild (EmotiW) challenge 2015 consists of an audio-video based emotion recognition sub-challenge and a static image based facial expression recognition sub-challenge, which mimic real-world conditions. The two sub-challenges are based on the Acted Facial Expressions in the Wild (AFEW) 5.0 and the Static Facial Expressions in the Wild (SFEW) 2.0 databases, respectively. The paper describes the data, baseline method, challenge protocol and the challenge results. A total of 12 and 17 teams participated in the video based emotion and image based expression sub-challenges, respectively.
A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities, such as audio, visual, and biosignals. Most state-of-the-art methods for audiovisual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between features. In particular, it computes cross-attention weights based on the correlation between joint feature representations and those of the individual modalities. By deploying a joint A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It achieves a concordance correlation coefficient (CCC) of 0.374 (0.663) and 0.363 (0.584) for valence and arousal, respectively, on the test (validation) set. This is a significant improvement over the baseline of the third Affective Behavior Analysis in-the-wild (ABAW3) competition, with a CCC of 0.180 (0.310) and 0.170 (0.170).
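The concordance correlation coefficient (CCC) reported above is a standard agreement metric that penalises both decorrelation and bias between predictions and labels. A minimal reference implementation is sketched below; the function name and test values are ours.

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D arrays.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    mx, my = pred.mean(), gold.mean()
    vx, vy = pred.var(), gold.var()
    cov = ((pred - mx) * (gold - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Perfect agreement gives 1.0; a constant shift lowers CCC even though
# the Pearson correlation stays 1.
t = np.linspace(-1, 1, 100)
print(concordance_cc(t, t), concordance_cc(t + 0.5, t))
```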
Emotion Recognition In The Wild Challenge 2014 (Baseline, Data and Protocol)
Proceedings of the 16th ACM International Conference on Multimodal Interaction, 2014
The Second Emotion Recognition In The Wild Challenge (EmotiW) 2014 consists of an audio-video based emotion classification challenge, which mimics real-world conditions. Traditionally, emotion recognition has been performed on data captured in constrained, lab-controlled environments. While such data was a good starting point, lab-controlled data poorly represents the environment and conditions faced in real-world situations. With the exponential increase in the number of video clips being uploaded online, it is worthwhile to explore the performance of emotion recognition methods that work 'in the wild'. The goal of this Grand Challenge is to carry forward the common platform defined during EmotiW 2013 for the evaluation of emotion recognition methods in real-world conditions. The database used in the 2014 challenge is the Acted Facial Expressions in the Wild (AFEW) 4.0, which has been collected from movies showing close-to-real-world conditions. The paper describes the data partitions, the baseline method and the experimental protocol.
Subjective Evaluation of Basic Emotions from Audio–Visual Data
Sensors
Understanding of the perception of emotions or affective states in humans is important to develop emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio–visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio–visual emotion database collected from TV broadcasts such as soap-operas and movies, called the IIIT-H Audio–Visual Emotion (IIIT-H AVE) database, is used. The database consists of audio-alone, video-alone, and audio–visual data in English. Using data of all three modes, perceptual tests are conducted for four basic emotions (angry, happy, neutral, and sad) based on category labeling and for two dimensions, namely arousal (active or passive) and valence (positive or negative), based on dimensional labeling. The results indicated that the participants’ perception of emotions was remarkably different between the audio-alone, video-alone, and audio–video data. Th...