Attention-based word-level contextual feature extraction and cross-modality fusion for sentiment analysis and emotion classification

Multi-level feature optimization and multimodal contextual fusion for sentiment analysis and emotion classification

Computational Intelligence, 2020

With the humongous amount of multimodal content available on the internet, multimodal sentiment classification and emotion detection have become among the most researched topics. Feature selection, context extraction, and multimodal fusion are the most important challenges in multimodal sentiment classification and affective computing. To address these challenges, this paper presents a multilevel feature optimization and multimodal contextual fusion technique. Evolutionary-computing-based feature selection models extract a subset of features from multiple modalities. The contextual information between neighboring utterances is extracted using bidirectional long short-term memory at multiple levels. Initially, bimodal fusion is performed by fusing a combination of two unimodal modalities at a time, and finally trimodal fusion is performed by fusing all three modalities. The result of the proposed method is demonstrated using two publicly available datasets, CMU-MOSI for sentiment classification and IEMOCAP for affective computing. Incorporating a subset of features and contextual information, the proposed model obtains better classification accuracy than the two ...
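
As a rough illustration of the pipeline described above (contextual BiLSTMs per modality, bimodal fusion of modality pairs, then trimodal fusion), the following PyTorch sketch shows one plausible arrangement; the module names, layer sizes, and concatenation-based fusion are assumptions for the example, not the authors' implementation.

```python
# Minimal PyTorch sketch of BiLSTM-based contextual encoding followed by
# bimodal-then-trimodal fusion. Module and dimension names are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim_t, dim_a, dim_v, hidden=64, n_classes=2):
        super().__init__()
        # One contextual BiLSTM per modality over the utterance sequence.
        self.ctx_t = nn.LSTM(dim_t, hidden, batch_first=True, bidirectional=True)
        self.ctx_a = nn.LSTM(dim_a, hidden, batch_first=True, bidirectional=True)
        self.ctx_v = nn.LSTM(dim_v, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden
        # Bimodal fusion layers (text-audio, text-video, audio-video).
        self.f_ta = nn.Linear(2 * d, d)
        self.f_tv = nn.Linear(2 * d, d)
        self.f_av = nn.Linear(2 * d, d)
        # Trimodal fusion over the three bimodal representations.
        self.f_tav = nn.Linear(3 * d, d)
        self.clf = nn.Linear(d, n_classes)

    def forward(self, t, a, v):            # each: (batch, utterances, dim_m)
        t, _ = self.ctx_t(t)
        a, _ = self.ctx_a(a)
        v, _ = self.ctx_v(v)
        ta = torch.relu(self.f_ta(torch.cat([t, a], dim=-1)))
        tv = torch.relu(self.f_tv(torch.cat([t, v], dim=-1)))
        av = torch.relu(self.f_av(torch.cat([a, v], dim=-1)))
        tav = torch.relu(self.f_tav(torch.cat([ta, tv, av], dim=-1)))
        return self.clf(tav)               # per-utterance sentiment logits

# Example call with arbitrary feature dimensions.
model = HierarchicalFusion(dim_t=100, dim_a=73, dim_v=100)
logits = model(torch.randn(8, 20, 100), torch.randn(8, 20, 73), torch.randn(8, 20, 100))
```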

Attention-based Multi-modal Sentiment Analysis and Emotion Detection in Conversation using RNN

International Journal of Interactive Multimedia and Artificial Intelligence, 2020

With the availability of an enormous quantity of multimodal data and its widespread applications, automatic sentiment analysis and emotion classification in conversation have become interesting research topics in the research community. The interlocutor state, the contextual state between neighboring utterances, and multimodal fusion play an important role in multimodal sentiment analysis and emotion detection in conversation. In this article, a recurrent neural network (RNN) based method is developed to capture the interlocutor state and the contextual state between utterances. A pair-wise attention mechanism is used to understand the relationship between the modalities and their importance before fusion. First, combinations of two modalities are fused at a time, and finally all the modalities are fused to form the trimodal representation feature vector. The experiments are conducted on three standard datasets: IEMOCAP, CMU-MOSEI, and CMU-MOSI. The proposed model is evaluated using two metrics, accuracy and F1-score, and the results demonstrate that the proposed model performs better than the standard baselines.
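
The pair-wise attention idea can be sketched as re-expressing each modality sequence in terms of the other before fusing them; in the sketch below, the GRU context encoders, the dot-product affinity, and all dimensions are assumptions for illustration rather than the paper's exact formulation.

```python
# Illustrative PyTorch sketch of pairwise attention between two modality
# sequences before fusion; names and dimensions are assumptions made for
# the example, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseAttentionFusion(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        # GRUs capture the contextual state across the utterance sequence.
        self.ctx_x = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.ctx_y = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * hidden, 2 * hidden)

    def forward(self, x, y):               # (batch, utterances, dim) each
        x, _ = self.ctx_x(x)
        y, _ = self.ctx_y(y)
        # Cross-modal affinity between every pair of utterances.
        scores = torch.matmul(x, y.transpose(1, 2))                       # (B, U, U)
        x_att = torch.matmul(F.softmax(scores, dim=-1), y)                # y attended by x
        y_att = torch.matmul(F.softmax(scores.transpose(1, 2), dim=-1), x)
        # Fuse the two attended views into a bimodal representation.
        return torch.relu(self.out(torch.cat([x_att, y_att], dim=-1)))

fuse = PairwiseAttentionFusion(dim=100)
bimodal = fuse(torch.randn(4, 30, 100), torch.randn(4, 30, 100))   # (4, 30, 128)
```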

Multi-level Multiple Attentions for Contextual Multimodal Sentiment Analysis

Multimodal sentiment analysis involves identifying sentiment in videos and is a developing field of research. Unlike current works, which model utterances individually, we propose a recurrent model that is able to capture contextual information among utterances. In this paper, we also introduce attention-based networks for improving both context learning and dynamic feature fusion. Our model shows a 6-8% improvement over the state of the art on a benchmark dataset.

Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In recent times, multi-modal analysis has been an emerging and highly sought-after field at the intersection of natural language processing, computer vision, and speech processing. The prime objective of such studies is to leverage the diversified information (e.g., textual, acoustic, and visual) for learning a model. Effective interaction among these modalities often leads to a better system in terms of performance. In this paper, we introduce a recurrent neural network based approach for multi-modal sentiment and emotion analysis. The proposed model learns the inter-modal interaction among the participating modalities through an auto-encoder mechanism. We employ a context-aware attention module to exploit the correspondence among the neighboring utterances. We evaluate our proposed approach on five standard multi-modal affect analysis datasets. Experimental results suggest the efficacy of the proposed model for both sentiment and emotion analysis over various existing state-of-the-art systems.
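
A minimal sketch of the two ingredients named above, an auto-encoder over the concatenated modality features for inter-modal interaction and an attention over the utterance context, is given below; the layer sizes, attention form, and loss are assumptions, not the authors' released model.

```python
# Hedged sketch: an auto-encoder over concatenated modality features to model
# inter-modal interaction, plus a context-aware attention over neighbouring
# utterances. All layer sizes and names are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareAE(nn.Module):
    def __init__(self, dim_t, dim_a, dim_v, bottleneck=64):
        super().__init__()
        d_in = dim_t + dim_a + dim_v
        self.enc = nn.Linear(d_in, bottleneck)       # inter-modal encoder
        self.dec = nn.Linear(bottleneck, d_in)       # reconstruction head
        self.att = nn.Linear(bottleneck, 1)          # context attention scores

    def forward(self, t, a, v):                      # (batch, utterances, dim_m)
        x = torch.cat([t, a, v], dim=-1)
        z = torch.tanh(self.enc(x))                  # (B, U, bottleneck)
        recon = self.dec(z)                          # used for the AE loss
        # Attention over the utterance context; each utterance is paired with
        # a weighted summary of the neighbouring utterances' latent codes.
        w = F.softmax(self.att(z), dim=1)            # (B, U, 1)
        context = (w * z).sum(dim=1, keepdim=True).expand_as(z)
        fused = torch.cat([z, context], dim=-1)      # latent + context summary
        ae_loss = F.mse_loss(recon, x)
        return fused, ae_loss
```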

Multi-Modal Sequence Fusion via Recursive Attention for Emotion Recognition

Proceedings of the 22nd Conference on Computational Natural Language Learning

Natural human communication is nuanced and inherently multi-modal. Humans possess specialised sensoria for processing vocal, visual, linguistic, and para-linguistic information, but form an intricately fused percept of the multi-modal data stream to provide a holistic representation. Analysis of emotional content in face-to-face communication is a cognitive task to which humans are particularly attuned, given its sociological importance, and poses a difficult challenge for machine emulation due to the subtlety and expressive variability of cross-modal cues. Inspired by the empirical success of recent so-called End-To-End Memory Networks (Sukhbaatar et al., 2015), we propose an approach based on recursive multi-attention with a shared external memory updated over multiple gated iterations of analysis. We evaluate our model across several large multimodal datasets and show that global contextualised memory with a gated memory update can effectively achieve emotion recognition.
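
The gated, multi-hop reading of a shared external memory can be sketched roughly as follows; the number of hops, the gating form, and the dot-product addressing are illustrative assumptions in the spirit of memory networks, not the paper's specification.

```python
# Rough sketch of recursive attention over a shared external memory with a
# gated update; hyper-parameters and the exact gating form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMemoryFusion(nn.Module):
    def __init__(self, dim, hops=3):
        super().__init__()
        self.hops = hops
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, memory, query):
        # memory: (batch, slots, dim) multimodal features; query: (batch, dim)
        m = query
        for _ in range(self.hops):
            # Dot-product addressing of the memory slots.
            scores = torch.bmm(memory, m.unsqueeze(-1)).squeeze(-1)               # (B, slots)
            read = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), memory).squeeze(1)
            cand = torch.tanh(self.proj(torch.cat([m, read], dim=-1)))
            g = torch.sigmoid(self.gate(torch.cat([m, read], dim=-1)))
            m = g * cand + (1 - g) * m       # gated state update per hop
        return m                             # contextualised representation
```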

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

2018

The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. Much previous work has focused on instantaneous emotion recognition. The 2018 One-Minute Gradual-Emotion Recognition (OMG-Emotion) challenge, which was held in conjunction with the IEEE World Congress on Computational Intelligence, encouraged participants to address long-term emotion recognition by integrating cues from multiple modalities, including facial expression, audio and language. Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single modality network. We describe here a multi-modal neural architecture that integrates visual information over time using an LSTM, and combines it with utterance level audio and text cues to recognize human sentiment from multimodal clips. Our model outperforms t...
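
One plausible reading of this architecture, assuming pre-extracted per-frame visual features plus utterance-level audio and text vectors, is sketched below; the layer sizes and output head are illustrative, not the authors' configuration.

```python
# Minimal sketch: an LSTM integrates visual features over time, and the final
# visual state is combined with utterance-level audio and text cues. Sizes
# and the output head are assumptions for illustration.
import torch
import torch.nn as nn

class ClipAffectModel(nn.Module):
    def __init__(self, dim_v, dim_a, dim_t, hidden=128, n_out=1):
        super().__init__()
        self.visual_lstm = nn.LSTM(dim_v, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + dim_a + dim_t, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_out),        # e.g., a sentiment/affect score
        )

    def forward(self, frames, audio, text):
        # frames: (batch, time, dim_v); audio/text: (batch, dim_a) / (batch, dim_t)
        _, (h, _) = self.visual_lstm(frames)
        visual = h[-1]                       # final hidden state (batch, hidden)
        return self.head(torch.cat([visual, audio, text], dim=-1))
```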

Multimodal Emotion Recognition Based on Deep Temporal Features Using Cross-Modal Transformer and Self-Attention

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Multimodal speech emotion recognition (MSER) is an emerging and challenging field of research due to its more robust characteristics than unimodal approaches. However, in multimodal approaches, the interactive relations between the different modalities of speech representations used for emotion recognition have not yet been well investigated. To address this issue, we introduce a new approach to capturing the deep temporal features of audio and text. The audio features are learned with a convolutional neural network (CNN) and a Bi-directional Gated Recurrent Unit (Bi-GRU) network. The textual features are represented by GloVe word embeddings along with a Bi-GRU. A cross-modal transformer block is designed for multimodal learning to better capture inter- and intra-modal interactions and temporal information between the audio and textual features. Further, a self-attention (SA) network is employed to select the most important emotional information from the fused multimodal features. We evaluate the proposed method on the IEMOCAP dataset on four emotion classes (i.e., angry, neutral, sad, and happy). The proposed method performs significantly better than the most recent state-of-the-art MSER methods.
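
The described pipeline (CNN + Bi-GRU over audio frames, Bi-GRU over GloVe-style word embeddings, a cross-modal attention block, then self-attention over the fused features) might be sketched as follows; the head counts, dimensions, and pooling are assumptions for illustration rather than the paper's exact design.

```python
# Hedged sketch of a CNN + Bi-GRU audio stream, a Bi-GRU text stream,
# cross-modal attention between them, and self-attention over the fusion.
# Dimensions and block details are assumptions for illustration.
import torch
import torch.nn as nn

class CrossModalSER(nn.Module):
    def __init__(self, dim_a=40, dim_t=300, hidden=64, n_classes=4):
        super().__init__()
        self.audio_cnn = nn.Conv1d(dim_a, hidden, kernel_size=3, padding=1)
        self.audio_gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.text_gru = nn.GRU(dim_t, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden
        self.cross_att = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.self_att = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.clf = nn.Linear(d, n_classes)

    def forward(self, audio, text):
        # audio: (batch, frames, dim_a); text: (batch, words, dim_t)
        a = torch.relu(self.audio_cnn(audio.transpose(1, 2))).transpose(1, 2)
        a, _ = self.audio_gru(a)
        t, _ = self.text_gru(text)
        # Text queries attend to audio keys/values (cross-modal block).
        fused, _ = self.cross_att(t, a, a)
        # Self-attention selects the most emotion-relevant fused features.
        fused, _ = self.self_att(fused, fused, fused)
        return self.clf(fused.mean(dim=1))   # utterance-level emotion logits
```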

Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data

Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)

Emotion recognition has become a popular topic of interest, especially in the field of human-computer interaction. Previous works involve unimodal analysis of emotion, while recent efforts focus on multimodal emotion recognition from vision and speech. In this paper, we propose a new method of learning hidden representations from just speech and text data using convolutional attention networks. Compared to a shallow model that employs simple concatenation of feature vectors, the proposed attention model performs much better in classifying emotion from the speech and text data contained in the CMU-MOSEI dataset.

Multimodal Sentiment Analysis with Multi-perspective Fusion Network Focusing on Sense Attentive Language

Lecture Notes in Computer Science, 2020

Multimodal sentiment analysis aims to learn a joint representation of multiple features. Previous studies have shown that the language modality may contain more semantic information than the other modalities. Based on this observation, we propose a Multi-perspective Fusion Network (MPFN) focusing on Sense Attentive Language for multimodal sentiment analysis. Different from previous studies, we use the language modality as the main part of the final joint representation, and propose a multi-stage and uni-stage fusion strategy to obtain a fusion representation of the multiple modalities that assists the final language-dominated multimodal representation. In our model, a Sense-Level Attention Network is proposed to dynamically learn the word representation, guided by the fusion of the multiple modalities. In turn, the learned language representation can also help the multi-stage and uni-stage fusion of the different modalities. In this way, the model can jointly learn a well-integrated final representation that focuses on the language modality and on the interactions between the multiple modalities at both the multi-stage and uni-stage levels. Several experiments are carried out on the CMU-MOSI, CMU-MOSEI, and YouTube public datasets. The experiments show that our model achieves better or competitive results compared with the baseline models.
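
The sense-attentive, language-dominated fusion could look roughly like the sketch below, where a fused multimodal vector guides attention over word-level language features; the layer names and sizes are hypothetical, not taken from the MPFN paper.

```python
# Sketch of a language-dominated fusion: a fused multimodal vector guides an
# attention over word-level language features, and the attended language
# representation forms the core of the final joint vector. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SenseAttentiveLanguage(nn.Module):
    def __init__(self, dim_l, dim_fused, hidden=128):
        super().__init__()
        self.query = nn.Linear(dim_fused, hidden)    # fused modalities -> query
        self.key = nn.Linear(dim_l, hidden)          # word features -> keys
        self.out = nn.Linear(dim_l + dim_fused, hidden)

    def forward(self, words, fused):
        # words: (batch, n_words, dim_l); fused: (batch, dim_fused)
        q = self.query(fused).unsqueeze(1)                       # (B, 1, H)
        k = self.key(words)                                      # (B, W, H)
        att = F.softmax((q * k).sum(-1, keepdim=True), dim=1)    # (B, W, 1)
        language = (att * words).sum(dim=1)                      # attended words
        # Language is the main part of the final joint representation.
        return torch.tanh(self.out(torch.cat([language, fused], dim=-1)))
```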

Multimodal Sentiment Analysis: A Systematic review of History, Datasets, Multimodal Fusion Methods, Applications, Challenges and Future Directions

Zenodo (CERN European Organization for Nuclear Research), 2022

Sentiment analysis (SA), a buzzword in the fields of artificial intelligence (AI) and natural language processing (NLP), is gaining popularity. Due to numerous SA applications, there is an increasing need to automate the procedure of analysing users' feelings concerning products or services. Multimodal Sentiment Analysis (MSA), a branch of sentiment analysis that uses many modalities, is a rapidly growing topic of study as more and more opinions are expressed through videos rather than just text. MSA builds on recent advances in machine learning: at each stage of the MSA pipeline, the most recent machine learning and deep learning developments are used, including sentiment polarity recognition, multimodal feature extraction, and multimodal fusion, with reduced error rates and increased speed. This research paper categorises several recent developments in MSA designs into 10 categories and focuses mostly on the primary taxonomy and recently published multimodal fusion architectures. The 10 categories are: early fusion, late fusion, hybrid, model-level fusion, tensor fusion, hierarchical, bi-modal, attention-based, quantum-based, and word-level fusion. The primary contribution of this manuscript is a study of the advantages and disadvantages of various architectural developments in MSA fusion. It also discusses future scope, applications in other industries, and research gaps.