Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers

Effectively leveraging Multi-modal Features for Movie Genre Classification

2022

Movie genre classification has been widely studied in recent years due to its various applications in video editing, summarization, and recommendation. Prior work has typically addressed this task by predicting genres based solely on the visual content. As a result, these methods often perform poorly on genres such as documentary or musical, where non-visual modalities like audio or language play an important role in correct classification. In addition, analyzing long videos at the frame level incurs a high computational cost and makes prediction less efficient. To address these two issues, we propose a Multi-Modal approach leveraging shot information, MMShot, to classify video genres in an efficient and effective way. We evaluate our method on MovieNet and Condensed Movies for genre classification, achieving a 17%∼21% improvement in mean Average Precision (mAP) over the state-of-the-art. Extensive experiments demonstrate the ability of MMShot to analyze long videos and uncover the correlations between genres and multiple movie elements. We also demonstrate our approach's ability to generalize by evaluating it on the scene boundary detection task, achieving a 1.1% improvement in Average Precision (AP) over the state-of-the-art.
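
The abstract does not spell out MMShot's pipeline, but the core efficiency idea (working at shot level rather than per frame) can be sketched roughly as follows; the shot detector, feature extractor, and classifier below are placeholders, not the paper's actual components.

```python
# Hypothetical sketch: genre classification at shot level instead of frame level.
# All components are stand-ins, not MMShot's actual code.
import numpy as np

def detect_shots(num_frames, shot_len=48):
    """Stand-in shot detector: splits the trailer into fixed-length shots."""
    return [(s, min(s + shot_len, num_frames)) for s in range(0, num_frames, shot_len)]

def shot_feature(frame_features, start, end):
    """Stand-in feature extractor: mean-pools per-frame features of one shot."""
    return frame_features[start:end].mean(axis=0)

def classify_trailer(frame_features, classifier):
    """Aggregate shot-level features, then predict per-genre scores.

    frame_features: (num_frames, feat_dim) array of per-frame features.
    classifier: any fitted model exposing predict_proba (e.g. scikit-learn style).
    """
    shots = detect_shots(len(frame_features))
    shot_feats = np.stack([shot_feature(frame_features, s, e) for s, e in shots])
    video_feat = shot_feats.mean(axis=0, keepdims=True)   # one vector per trailer
    return classifier.predict_proba(video_feat)           # per-genre probabilities
```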

A multimodal approach for multi-label movie genre classification

Multimedia Tools and Applications, 2020

Movie genre classification is a challenging task that has increasingly attracted the attention of researchers. The number of movie consumers interested in taking advantage of automatic movie genre classification is growing rapidly thanks to the popularization of media streaming service providers. In this paper, we addressed the multi-label classification of movie genres in a multimodal way. For this purpose, we created a dataset composed of trailer video clips, subtitles, synopses, and movie posters taken from 152,622 movie titles from The Movie Database (TMDb). The dataset was carefully curated and organized, and it was also made available as a contribution of this work. Each movie in the dataset was labeled according to a set of eighteen genre labels. We extracted features from these data using different kinds of descriptors, namely Mel-Frequency Cepstral Coefficients (MFCCs), the Statistical Spectrum Descriptor (SSD), Local Binary Patterns (LBP) computed on spectrograms, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs). The descriptors were evaluated using different classifiers, such as BinaryRelevance and ML-kNN. We also investigated the performance of combining different classifiers/features using a late fusion strategy, which obtained encouraging results. Based on the F-score metric, our best result, 0.628, was obtained by fusing a classifier created using an LSTM on the synopses with a classifier created using a CNN on movie trailer frames. When considering the AUC-PR metric, the best result, 0.673, was also achieved by combining those representations, with the addition of a classifier based on an LSTM created from the subtitles. These results corroborate the existence of complementarity among classifiers based on different sources of information in this field of application. As far as we know, this is the most comprehensive study developed in terms of the diversity of multimedia sources of information used to perform movie genre classification.
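
As a rough illustration of the late fusion strategy mentioned above, each per-modality classifier produces per-genre scores that are then combined; the weights, shapes, and random scores below are illustrative placeholders, not the paper's trained LSTM/CNN models.

```python
# Hedged sketch of late fusion: each modality gets its own multi-label classifier,
# and their per-genre scores are averaged afterwards.
import numpy as np

def late_fusion(score_matrices, weights=None, threshold=0.5):
    """score_matrices: list of (n_movies, n_genres) arrays of scores in [0, 1]."""
    weights = np.ones(len(score_matrices)) if weights is None else np.asarray(weights)
    fused = np.average(np.stack(score_matrices), axis=0, weights=weights)
    return fused, (fused >= threshold).astype(int)   # scores and multi-label decisions

# e.g. fuse synopsis-LSTM scores with trailer-CNN scores for 3 movies x 18 genres
lstm_scores = np.random.rand(3, 18)
cnn_scores = np.random.rand(3, 18)
fused_scores, labels = late_fusion([lstm_scores, cnn_scores], weights=[0.5, 0.5])
```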

Multimodal KDK Classifier for Automatic Classification of Movie Trailers

International Journal of Recent Technology and Engineering (IJRTE), 2019

Movie trailer classification automates the analysis of movie trailers and assigns each trailer to one of several genres. In this paper, we propose a classifier that identifies the genre of a movie trailer by analyzing its audio and visual features simultaneously. Our approach decomposes a trailer video into frames and an audio file and then analyzes them based on specific features to categorize the trailer into one of four genres. Our aim was to minimize the number of parameters involved in analyzing the trailer, since other approaches rely on many parameters, which is impractical. The proposed classifier was trained on four audio and two broad visual features extracted from over 900 movie trailers distributed across four genres, namely Drama, Horror, Romance, and Action. The classifier model was trained using neural networks and convolutional neural networks. Our classifier model can be used in recommendation systems and on websites such as IMDb to automate genre classification.
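
A minimal sketch of the first preprocessing step described here (splitting a trailer into frames and an audio track) might look like the following; it assumes OpenCV and an ffmpeg binary are available and is not the authors' actual code.

```python
# Rough sketch: split a trailer into sampled frames and a WAV audio track
# before feature extraction. Paths and sampling settings are illustrative.
import subprocess
import cv2  # pip install opencv-python

def extract_frames(video_path, every_n=30):
    """Keep every n-th frame (roughly one frame per second at 30 fps)."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def extract_audio(video_path, wav_path="trailer_audio.wav"):
    """Dump the audio track to WAV with ffmpeg (assumed to be installed)."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-acodec", "pcm_s16le", wav_path], check=True)
    return wav_path
```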

Video Genre Classification using Convolutional Recurrent Neural Networks

International Journal of Advanced Computer Science and Applications

A large amount of media on the internet is in the form of video files with different formats and encodings. Identifying and sorting videos manually is a mammoth task. With an ever-increasing demand for video streaming and download, the video classification problem has come to the forefront for managing such large and unstructured data, both over the internet and locally. We present a solution for classifying videos by genre and locality by training a Convolutional Recurrent Neural Network. It involves extracting features from video files in the form of frames and audio. The neural network then makes a prediction, and the final output layer places the video in a particular genre. This problem applies to a vast number of applications including, but not limited to, search optimization, grouping, critic reviews, piracy detection, and targeted advertisements. We expect our fully trained model to identify, with acceptable accuracy, any video or video clip on the internet and thus eliminate the cumbersome problem of manual video classification.
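
A convolutional recurrent network of the kind described can be sketched as a per-frame CNN feeding a recurrent layer; the layer sizes and toy input below are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch of a convolutional recurrent network for genre classification,
# assuming clips arrive as (batch, time, channels, height, width) tensors.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_genres, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                          # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),         # -> (batch*time, 32)
        )
        self.rnn = nn.GRU(32, hidden, batch_first=True)    # temporal modelling
        self.head = nn.Linear(hidden, num_genres)          # final genre layer

    def forward(self, clips):                              # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, last = self.rnn(feats)                          # last hidden state (1, B, hidden)
        return self.head(last.squeeze(0))                  # genre logits

logits = CRNN(num_genres=10)(torch.randn(2, 8, 3, 64, 64))  # toy forward pass
```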

Affect Recognition in a Realistic Movie Dataset Using a Hierarchical Approach

Affective content analysis has gained great attention in recent years and is an important challenge in content-based multimedia information retrieval. In this paper, a hierarchical approach is proposed for affect recognition in movie datasets. The approach has been verified on the AFEW dataset, showing an improvement in classification results compared to the baseline. In order to use all the visual sentiment aspects contained in the movie excerpts of a realistic dataset such as FilmStim, deep learning features trained on a large set of emotional images are added to the standard audio and visual features. The proposed approach will be integrated into a system that communicates the emotions of a movie to impaired people and contributes to improving their television experience.

The ICL-TUM-PASSAU Approach for the MediaEval 2015 "Affective Impact of Movies" Task

2015

In this paper we describe the Imperial College London, Technische Universität München and University of Passau (ICL+TUM+PASSAU) team approach to MediaEval's "Affective Impact of Movies" challenge, which consists of the automatic detection of affective (arousal and valence) and violent content in movie excerpts. In addition to the baseline features, we computed spectral and energy-related acoustic features, as well as the probability of various objects being present in the video. Random Forests, AdaBoost and Support Vector Machines were used as classification methods. The best results show that the dataset is highly challenging for both the affect and violence detection tasks, mainly because of issues in inter-rater agreement and data scarcity.
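
The classifier comparison can be illustrated with a small scikit-learn sketch on synthetic acoustic features; the feature dimensions, labels, and hyperparameters are placeholders, not the challenge data or the team's settings.

```python
# Sketch of comparing Random Forests, AdaBoost and SVMs on placeholder features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))        # e.g. spectral/energy acoustic descriptors
y = rng.integers(0, 3, size=200)      # e.g. valence class: negative/neutral/positive

for name, clf in [("RandomForest", RandomForestClassifier(n_estimators=200)),
                  ("AdaBoost", AdaBoostClassifier()),
                  ("SVM", SVC(kernel="rbf", C=1.0))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```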

Rethinking movie genre classification with fine-grained semantic clustering

ArXiv, 2020

Movie genre classification is an active research area in machine learning. However, due to the limited labels available, there can be large semantic variations between movies within a single genre definition. We expand these 'coarse' genre labels by identifying 'fine-grained' semantic information within the multi-modal content of movies. By leveraging pre-trained 'expert' networks, we learn the influence of different combinations of modes for multi-label genre classification. Using a contrastive loss, we continue to fine-tune this 'coarse' genre classification network to identify high-level intertextual similarities between the movies across all genre labels. This leads to a more 'fine-grained' and detailed clustering, based on semantic similarities while still retaining some genre information. Our approach is demonstrated on a newly introduced multi-modal 37,866,450 frame, 8,800 movie trailer dataset, MMX-Trailer-20, which includes pre-computed...
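
A generic form of the contrastive fine-tuning idea is sketched below: embedding pairs judged similar are pulled together and dissimilar pairs pushed apart. The specific loss form, margin, and dimensions are assumptions, not the paper's exact objective.

```python
# Hedged illustration of contrastive fine-tuning on top of a genre network's embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """emb_a, emb_b: (B, D) embeddings; same: (B,) with 1.0 if the pair is similar."""
    d = F.pairwise_distance(emb_a, emb_b)
    pos = same * d.pow(2)                           # pull similar pairs together
    neg = (1 - same) * F.relu(margin - d).pow(2)    # push dissimilar pairs apart
    return (pos + neg).mean()

# Toy usage on random 256-d embeddings for two similar and two dissimilar pairs.
a, b = torch.randn(4, 256), torch.randn(4, 256)
loss = contrastive_loss(a, b, torch.tensor([1., 0., 1., 0.]))
```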

MMTF-14K: a multifaceted movie trailer feature dataset for recommendation and retrieval

Proceedings of the 9th ACM Multimedia Systems Conference (MMSys' 18), 2018

In this paper we propose a new dataset, i.e., the MMTF-14K multifaceted dataset. It is primarily designed for the evaluation of video-based recommender systems, but it also supports the exploration of other multimedia tasks such as popularity prediction, genre classification and auto-tagging (aka tag prediction). The data consists of 13,623 Hollywood-type movie trailers, rated by 138,492 users, generating a total of almost 12.5 million ratings. To address a broader community, metadata, audio and visual descriptors are also pre-computed and provided, along with several baseline benchmarking results for uni-modal and multi-modal recommendation systems. This creates a rich collection of data for benchmarking and supports future development of the field.

Deep Features for Multimodal Emotion Classification

HAL (Le Centre pour la Communication Scientifique Directe), 2016

Understanding human emotion when perceiving audiovisual content is an exciting and important research avenue. Thus, there have recently been emerging attempts to predict the emotion elicited by video clips or movies. While most existing approaches either focus on a single modality, i.e., only audio or visual data is exploited, or build on a multimodal scheme with late fusion, we propose a multimodal framework with an early fusion scheme and target an emotion classification task. Our proposed mechanism offers the advantages of handling (1) the variation in video length, (2) the imbalance of audio and visual feature sizes, and (3) the middle-level fusion of audio and visual information, such that a higher-level feature representation can be learned jointly from the two modalities for classification. We evaluate the performance of the proposed approach on the international benchmark, i.e., the MediaEval 2015 Affective Impact of Movies task, and show that it outperforms most state-of-the-art systems on arousal accuracy while using a much smaller feature size.
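
A minimal sketch of the early fusion idea, assuming each modality has already been pooled into a fixed-size vector: both modalities are projected to a common size (addressing the feature-size imbalance) and combined before classification. The dimensions and layers are illustrative, not the paper's network.

```python
# Rough sketch of early fusion: join modalities before the classifier, rather than
# combining separate per-modality predictions afterwards.
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    def __init__(self, audio_dim, visual_dim, num_classes, common=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, common)     # balance feature sizes
        self.visual_proj = nn.Linear(visual_dim, common)
        self.joint = nn.Sequential(nn.ReLU(), nn.Linear(2 * common, num_classes))

    def forward(self, audio, visual):                      # features pooled over time
        fused = torch.cat([self.audio_proj(audio), self.visual_proj(visual)], dim=-1)
        return self.joint(fused)                           # class logits (e.g. arousal)

logits = EarlyFusionNet(64, 2048, 3)(torch.randn(8, 64), torch.randn(8, 2048))
```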

BOUN-NKU in MediaEval 2017 Emotional Impact of Movies Task

2017

In this paper, we present our approach for the Emotional Impact of Movies task of the MediaEval 2017 Challenge, which involves multimodal fusion for predicting arousal and valence for movie clips. Our system has two pipelines. In the first, we extracted audio/visual features and used a combination of PCA, Fisher vector encoding, feature selection, and extreme learning machine classifiers. In the second, we focused on the classifiers rather than on feature selection.
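
The shape of the first pipeline can be sketched with scikit-learn; Fisher vector encoding is omitted and ridge regression stands in for the extreme learning machine, so this is a structural illustration on synthetic data, not the authors' system.

```python
# Sketch of a dimensionality-reduction -> feature-selection -> regressor pipeline
# for continuous valence/arousal prediction. All data and settings are synthetic.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 500))      # pooled audio/visual descriptors per clip
y = rng.uniform(-1, 1, size=300)     # e.g. continuous valence annotations

pipeline = Pipeline([
    ("pca", PCA(n_components=100)),
    ("select", SelectKBest(f_regression, k=50)),
    ("reg", Ridge(alpha=1.0)),        # stand-in for the extreme learning machine
])
print(cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error").mean())
```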