Effectively leveraging Multi-modal Features for Movie Genre Classification

A multimodal approach for multi-label movie genre classification

Multimedia Tools and Applications, 2020

Movie genre classification is a challenging task that has increasingly attracted the attention of researchers. The number of movie consumers interested in taking advantage of automatic movie genre classification is growing rapidly thanks to the popularization of media streaming service providers. In this paper, we address the multi-label classification of movie genres in a multimodal way. For this purpose, we created a dataset composed of trailer video clips, subtitles, synopses, and movie posters taken from 152,622 movie titles from The Movie Database (TMDb). The dataset was carefully curated and organized, and it is also made available as a contribution of this work. Each movie in the dataset was labeled according to a set of eighteen genre labels. We extracted features from these data using different kinds of descriptors, namely Mel-Frequency Cepstral Coefficients (MFCCs), the Statistical Spectrum Descriptor (SSD), Local Binary Patterns (LBP) computed on spectrograms, Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs). The descriptors were evaluated using different classifiers, such as Binary Relevance and ML-kNN. We also investigated the performance of combining different classifiers/features using a late fusion strategy, which obtained encouraging results. Based on the F-score metric, our best result, 0.628, was obtained by fusing a classifier created using an LSTM on the synopses and a classifier created using a CNN on movie trailer frames. When considering the AUC-PR metric, the best result, 0.673, was also achieved by combining those representations, with the addition of a classifier based on an LSTM created from the subtitles. These results corroborate the existence of complementarity among classifiers based on different sources of information in this field of application. To the best of our knowledge, this is the most comprehensive study in terms of the diversity of multimedia information sources used to perform movie genre classification.
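
As a concrete illustration of the late fusion strategy described above, the sketch below averages the per-genre probability scores of two hypothetical modality classifiers (stand-ins for the LSTM-on-synopses and CNN-on-frames models) and evaluates the fused prediction with a multi-label F-score. The random data, shapes, and 0.5 threshold are illustrative assumptions, not the authors' setup.

```python
# Minimal late-fusion sketch (hypothetical data, not the authors' code):
# each modality classifier outputs per-genre probabilities for every movie,
# and fusion simply averages the scores before thresholding.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_movies, n_genres = 8, 18          # 18 genre labels, as in the paper

# Stand-ins for the outputs of an LSTM-on-synopses model and a
# CNN-on-trailer-frames model; in practice these come from predict_proba().
p_synopsis = rng.random((n_movies, n_genres))
p_frames = rng.random((n_movies, n_genres))

# Late fusion: average the per-genre scores of the two classifiers.
p_fused = (p_synopsis + p_frames) / 2.0
y_pred = (p_fused >= 0.5).astype(int)

# Multi-label F-score against (here, random) ground-truth labels.
y_true = rng.integers(0, 2, size=(n_movies, n_genres))
print(f1_score(y_true, y_pred, average="samples", zero_division=0))
```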

An in-depth evaluation of multimodal video genre categorization

2013 11th International Workshop on Content-Based Multimedia Indexing (CBMI), 2013

In this paper we propose an in-depth evaluation of the performance of video descriptors for multimodal video genre categorization. We discuss the perspective of designing appropriate late fusion techniques that can attain very high categorization accuracy, close to that achieved with user-generated text information. Evaluation is carried out in the context of the 2012 Video Genre Tagging Task of the MediaEval Benchmarking Initiative for Multimedia Evaluation, using a data set of up to 15,000 videos (3,200 hours of footage) and 26 video genre categories specific to web media. Results show that the proposed approach significantly improves genre categorization performance, outperforming other existing approaches. The main contribution of this paper lies in the experimental part: several valuable findings are reported that motivate further research on video genre classification.

Content-Based Video Description for Automatic Video Genre Categorization

Lecture Notes in Computer Science, 2012

In this paper, we propose an audio-visual approach to video genre categorization. Audio information is extracted at block level, which has the advantage of capturing local temporal information. At the temporal structural level, we assess action content with respect to human perception. Further, color perception is quantified with statistics of color distribution, elementary hues, color properties, and relationships between colors. The last category of descriptors captures statistics of contour geometry. An extensive evaluation of this multi-modal approach, based on more than 91 hours of video footage, is presented. We obtain average precision and recall ratios within 87%-100% and 77%-100%, respectively, while average correct classification is up to 97%. Additionally, movies displayed according to feature-based coordinates in a virtual 3D browsing environment tend to regroup with respect to genre, which has potential application in real content-based browsing systems.

Video genre categorization and representation using audio-visual information

2012

We propose an audiovisual approach to video genre classification using content descriptors that exploit audio, color, temporal, and contour information. Audio information is extracted at block level, which has the advantage of capturing local temporal information. At the temporal structure level, we consider action content in relation to human perception. Color perception is quantified using statistics of color distribution, elementary hues, color properties, and relationships between colors. Further, we compute statistics of contour geometry and relationships. The main contribution of our work lies in harnessing the descriptive power of the combination of these descriptors for genre classification. Validation was carried out on over 91 hours of video footage encompassing 7 common video genres, yielding average precision and recall ratios of 87%-100% and 77%-100%, respectively, and an overall average correct classification of up to 97%. Also, experimental comparison as part of the MediaEval 2011 benchmarking campaign demonstrated the superiority of the proposed audiovisual descriptors over other existing approaches. Finally, we discuss a 3D video browsing platform that displays movies using feature-based coordinates and thus regroups them according to genre.
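
To make the color-perception descriptors above more concrete, here is a minimal sketch of the kind of color-distribution statistics involved: a normalized hue histogram plus simple moments, computed per frame with OpenCV. The bin count and chosen statistics are illustrative assumptions; the actual descriptor additionally covers elementary hues, color properties, and inter-color relationships.

```python
# Sketch of simple color-distribution statistics for one frame
# (illustrative only; assumes OpenCV is installed as cv2).
import cv2
import numpy as np

def hue_statistics(frame_bgr, bins=32):
    """Return a normalized hue histogram and its mean/std for one frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0].ravel().astype(np.float32)   # OpenCV hue range: 0..179
    hist, _ = np.histogram(hue, bins=bins, range=(0, 180))
    hist = hist / max(hist.sum(), 1)                # normalize to a distribution
    return hist, float(hue.mean()), float(hue.std())

# Usage on a synthetic frame (replace with frames decoded from a video).
frame = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
hist, mean_hue, std_hue = hue_statistics(frame)
print(hist.shape, mean_hue, std_hue)
```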

Automatic recognition of film genres

Proceedings of the third ACM international conference on Multimedia - MULTIMEDIA '95, 1995

Film genres in digital video can be detected automatically. In a three-step approach, we first analyze the syntactic properties of digital films: color statistics, cut detection, camera motion, object motion, and audio. In a second step, we use these statistics to derive, at a more abstract level, film style attributes such as camera panning and zooming, speech, and music. These are distinguishing properties for film genres, e.g. newscasts vs. sports vs. commercials. In the third and final step, we map the detected style attributes to film genres. Algorithms for the three steps are presented in detail, and we report on initial experience with real videos. Our goal is to automatically classify the large body of existing video for easier access in digital video-on-demand databases.
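
As a rough illustration of the first (syntactic) step, the sketch below implements a simple histogram-difference cut detector: a cut is flagged wherever consecutive gray-level histograms differ sharply. The bin count and threshold are illustrative assumptions, and the paper's actual algorithms are more elaborate.

```python
# Minimal cut detector in the spirit of the syntactic-analysis step
# (an illustrative baseline, not the paper's exact algorithm).
import numpy as np

def detect_cuts(frames, bins=64, threshold=0.4):
    """frames: iterable of 2-D grayscale arrays; returns frame indices of cuts."""
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / max(hist.sum(), 1)
        if prev_hist is not None:
            # L1 distance between normalized histograms, in [0, 2]
            if np.abs(hist - prev_hist).sum() > threshold:
                cuts.append(i)
        prev_hist = hist
    return cuts

# Synthetic clip: a dark scene followed by a bright one -> one cut at frame 5.
clip = [np.full((90, 120), 30, np.uint8)] * 5 + [np.full((90, 120), 220, np.uint8)] * 5
print(detect_cuts(clip))   # -> [5]
```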

A visual-based late-fusion framework for video genre classification

International Symposium on Signals, Circuits and Systems ISSCS2013, 2013

In this paper we investigate the performance of visual features in the context of video genre classification. We propose a late-fusion framework that employs color, texture, structural, and salient region information. Experimental validation was carried out in the context of the MediaEval 2012 Genre Tagging Task using a large data set of more than 2,000 hours of footage and 26 video genres. Results show that the proposed approach significantly improves genre classification performance, outperforming other existing approaches. Furthermore, we show that our approach can help improve the performance of the more efficient text-based approaches.

Content-based Automatic Video Genre Identification

International Journal of Advanced Computer Science and Applications, 2019

Video content is growing enormously with the heavy usage of the internet and social media websites. Proper searching and indexing of such video content is a major challenge. Existing video search relies largely on information provided by the user, such as the video caption, description, and subsequent comments on the video. In such cases, if users provide insufficient or incorrect information about the video genre, the video may not be indexed correctly and may be ignored during search and retrieval. This paper proposes a mechanism to understand the contents of a video and categorize it as Music Video, Talk Show, Movie/Drama, Animation, or Sports. For video classification, the proposed system uses audio features, such as audio signal energy, zero-crossing rate, and spectral flux, and visual features, such as shot boundary, scene count, and actor motion. The system is tested on popular Hollywood, Bollywood, and YouTube videos, achieving an accuracy of 96%.
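
The three audio features named above are straightforward to compute per analysis frame. The numpy-only sketch below shows one common formulation; the frame and hop sizes are illustrative assumptions, not the paper's settings.

```python
# Per-frame audio features: signal energy, zero-crossing rate, spectral flux
# (one standard formulation; frame/hop sizes are illustrative choices).
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def audio_features(x):
    frames = frame_signal(np.asarray(x, dtype=np.float64))
    energy = (frames ** 2).mean(axis=1)                          # signal energy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)   # zero-crossing rate
    mag = np.abs(np.fft.rfft(frames, axis=1))                    # magnitude spectra
    flux = np.r_[0.0, np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))]  # spectral flux
    return energy, zcr, flux

# Usage on one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
energy, zcr, flux = audio_features(np.sin(2 * np.pi * 440 * t))
print(energy.shape, zcr.shape, flux.shape)
```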

Video Genre Classification using Convolutional Recurrent Neural Networks

International Journal of Advanced Computer Science and Applications

A vast amount of media on the internet is in the form of video files with different formats and encodings. Easy identification and sorting of videos becomes a mammoth task if done manually. With an ever-increasing demand for video streaming and download, the video classification problem comes to the fore for managing such large and unstructured data, both over the internet and locally. We present a solution for classifying videos by genre and locality by training a convolutional recurrent neural network. It involves extracting features from video files in the form of frames and audio; the network makes a prediction, and the final output layer places the video in a certain genre. This solution could be applied to a vast number of applications, including but not limited to search optimization, grouping, critic reviews, piracy detection, and targeted advertisements. We expect our fully trained model to identify, with acceptable accuracy, any video or video clip over the internet and thus eliminate the cumbersome problem of manual video classification.
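
A compact example of such a convolutional recurrent network is sketched below in PyTorch: a small CNN encodes each frame, an LSTM aggregates the frame sequence, and a linear output layer scores the genres. All layer sizes and the genre count are illustrative assumptions, not the authors' architecture.

```python
# Minimal convolutional recurrent network for clip-level genre prediction
# (illustrative layer sizes; not the architecture from the paper).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_genres=10, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_genres)        # final genre layer

    def forward(self, clips):                          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.rnn(feats)                    # last hidden state
        return self.head(h[-1])                        # genre logits

logits = CRNN()(torch.randn(2, 8, 3, 64, 64))          # 2 clips of 8 frames each
print(logits.shape)                                    # -> torch.Size([2, 10])
```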

Multilevel profiling of situation and dialogue-based deep networks for movie genre classification using movie trailers

ArXiv, 2021

Automated movie genre classification has emerged as an active and essential area of research and exploration. Short-duration movie trailers provide useful insights about the movie, as their video content carries both cognitive- and affective-level features. Previous approaches focused on either cognitive or affective content analysis. In this paper, we propose a novel multi-modal, situation-, dialogue-, and metadata-based movie genre classification framework that takes both cognition- and affect-based features into consideration. The pre-feature-fusion-based framework takes into account: situation-based features from regular snapshots of a trailer, including nouns and verbs that provide a useful affect-based mapping to the corresponding genres; dialogue (speech) based features from the audio; and metadata, which together provide the relevant information for cognitive and affect-based video analysis. We also develop the English movie trailer dataset (EMTD), which contains 2000 H...

Rethinking movie genre classification with fine-grained semantic clustering

ArXiv, 2020

Movie genre classification is an active research area in machine learning. However, due to the limited labels available, there can be large semantic variations between movies within a single genre definition. We expand these 'coarse' genre labels by identifying 'fine-grained' semantic information within the multi-modal content of movies. By leveraging pre-trained 'expert' networks, we learn the influence of different combinations of modes for multi-label genre classification. Using a contrastive loss, we continue to fine-tune this 'coarse' genre classification network to identify high-level intertextual similarities between the movies across all genre labels. This leads to a more 'fine-grained' and detailed clustering, based on semantic similarities while still retaining some genre information. Our approach is demonstrated on a newly introduced multi-modal dataset of 8,800 movie trailers (37,866,450 frames), MMX-Trailer-20, which includes pre-comput...
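
The contrastive fine-tuning step relies on a standard contrastive objective; the sketch below shows a minimal NT-Xent-style loss over paired embeddings of the same movies. It is an illustrative stand-in under an assumed in-batch pairing, not the paper's exact training code.

```python
# Minimal NT-Xent-style contrastive loss for paired embeddings
# (illustrative; assumes the i-th rows of z1 and z2 describe the same movie).
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two 'views' of the same movies."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # scaled cosine similarities
    targets = torch.arange(z1.size(0))          # i-th row matches i-th column
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(4, 32), torch.randn(4, 32))
print(loss.item())
```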