Categorization of Audio/Video Content using Spectrogram-based CNN
Related papers
Journal of Autonomous Intelligence
This paper reviews an innovative approach to sound classification that exploits image processing techniques applied to spectrogram representations of audio signals. The study shows the effectiveness of incorporating well-established image processing methodologies, such as filtering, segmentation, and pattern recognition, to enhance the feature extraction and classification performance of audio signals when transformed into spectrograms. An overview is provided of the mathematical methods shared by image and spectrogram-based audio processing, focusing on the commonalities between the two domains in terms of underlying principles, techniques, and algorithms. The proposed methodology leverages in particular the power of convolutional neural networks (CNNs) to extract and classify time-frequency features from spectrograms, capitalizing on the advantages of their hierarchical feature learning and robustness to translation and scale variations. Other d...
Using audio as a basis for recognition method using CNN
2020
In the last decade, deep learning (DL) has emerged as a solution to problems that are easy for a human to understand but very difficult to formulate so that computers can solve them. Image recognition is one example of such problems. Convolutional neural networks (CNNs) have emerged as the best solution to image recognition problems. A CNN consists of several convolutional layers and pooling layers. Over the years, many people have experimented with CNN models for audio classification, and the results have been encouraging. In this paper, we review such attempts and analyze them.
A Convolutional Neural Network Approach for Acoustic Scene Classification
This paper presents a novel application of convolutional neural networks (CNNs) for the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrogram. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems which competed in the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), which constitute improvements of 6.4% and 9%, respectively, over the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving on the challenge winner's score by 1%.
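As a rough illustration of the "log-mel" representation this abstract refers to, the sketch below computes mel-spaced band centre frequencies and applies log compression to band energies. It assumes the common HTK-style mel formula; the function names and the parameter values are hypothetical and are not taken from the paper.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_bands, f_min, f_max):
    """Return n_bands centre frequencies (Hz) spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


def log_compress(energies, eps=1e-10):
    """Natural-log compression of non-negative band energies; eps avoids log(0)."""
    return [math.log(e + eps) for e in energies]


# Hypothetical settings: 40 mel bands between 0 Hz and 22050 Hz.
centers = mel_center_frequencies(40, 0.0, 22050.0)
```

Because the bands are evenly spaced in mels, they are narrow at low frequencies and progressively wider at high frequencies, which is the property that motivates the mel scale for audio features.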
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Acoustic Scene Classification (ASC) is one of the core research problems in the field of Computational Sound Scene Analysis. In this work, we present SubSpectralNet, a novel model which captures discriminative features by incorporating frequency band-level differences to model soundscapes. Using mel-spectrograms, we propose the idea of using band-wise crops of the input time-frequency representations and train a convolutional neural network (CNN) on the same. We also propose a modification in the training method for more efficient learning of the CNN models. We first give a motivation for using sub-spectrograms by giving intuitive and statistical analyses and finally we develop a sub-spectrogram based CNN architecture for ASC. The system is evaluated on the public ASC development dataset provided for the "Detection and Classification of Acoustic Scenes and Events" (DCASE) 2018 Challenge. Our best model achieves an improvement of +14% in terms of classification accuracy with respect to the DCASE 2018 baseline system. Code and figures are available at https:
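The band-wise cropping idea described in this abstract can be sketched as follows. This is a minimal illustration, not the SubSpectralNet implementation: the spectrogram is assumed to be a list of rows (one per mel bin), and the function name, `band_size`, and `hop` values are hypothetical.

```python
def band_wise_crops(spectrogram, band_size, hop):
    """Split a (mel bins x frames) spectrogram into overlapping frequency bands.

    spectrogram: list of rows, one row per mel bin, each row a list of frame values.
    band_size:   number of adjacent mel bins in each sub-spectrogram.
    hop:         number of mel bins between the starts of consecutive bands.
    """
    n_bins = len(spectrogram)
    if band_size <= 0 or hop <= 0:
        raise ValueError("band_size and hop must be positive")
    if band_size > n_bins:
        raise ValueError("band_size cannot exceed the number of mel bins")
    crops = []
    for start in range(0, n_bins - band_size + 1, hop):
        # Copy each row so that the crops do not share storage with the input.
        crops.append([row[:] for row in spectrogram[start:start + band_size]])
    return crops


# Toy input: 40 mel bins x 5 frames, where each value equals its bin index.
toy = [[float(b)] * 5 for b in range(40)]
subs = band_wise_crops(toy, band_size=20, hop=10)
```

With these toy settings the bands start at mel bins 0, 10 and 20, so neighbouring sub-spectrograms overlap by half. Each crop could then be passed to its own convolutional branch.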
Audio Recognition using Mel Spectrograms and Convolution Neural Networks
2019
Automatic sound recognition has received heightened research interest in recent years due to its many potential applications. These include automatic labeling of video/audio content and real-time sound detection for robotics. While image classification is a heavily researched topic, sound identification is less mature. In this study, we take advantage of the robust machine learning techniques developed for image classification and apply them to the sound recognition problem. Raw audio data from the Freesound Dataset (FSD) provided by Kaggle is first converted to a spectrogram representation in order to apply these image classification techniques. We test and compare two approaches using deep convolutional neural networks (CNNs): (1) our own CNN architecture and (2) transfer learning using the pre-trained VGG19 network. Using our self-developed architecture, we achieve a label-weighted label-ranking average precision (LWLRAP) score and top-5 accuracy of 0.813 and 88.9%, respectively, whe...
Interspeech 2017, 2017
Enabling smart devices to infer about the environment using audio signals has been one of the several long-standing challenges in machine listening. The availability of public-domain datasets, e.g., Detection and Classification of Acoustic Scenes and Events (DCASE) 2016, enabled researchers to compare various algorithms on standard predefined tasks. Most of the current best performing individual acoustic scene classification systems utilize different spectrogram image based features with a Convolutional Neural Network (CNN) architecture. In this study, we first analyze the performance of a state-of-the-art CNN system for different auditory image and spectrogram features, including Mel-scaled, logarithmically scaled, linearly scaled filterbank spectrograms, and Stabilized Auditory Image (SAI) features. Next, we benchmark an MFCC based Gaussian Mixture Model (GMM) SuperVector (SV) system for acoustic scene classification. Finally, we utilize the activations from the final layer of the CNN to form a SuperVector (SV) and use them as feature vectors for a Probabilistic Linear Discriminative Analysis (PLDA) classifier. Experimental evaluation on the DCASE 2016 database demonstrates the effectiveness of the proposed CNN-SV approach compared to conventional CNNs with a fully connected softmax output layer. Score fusion of individual systems provides up to 7% relative improvement in overall accuracy compared to the CNN baseline system.
IJERT-A Deep Learning CNN Model for TV Broadcast Audio Classification
International Journal of Engineering Research and Technology (IJERT), 2020
https://www.ijert.org/a-deep-learning-cnn-model-for-tv-broadcast-audio-classification https://www.ijert.org/research/a-deep-learning-cnn-model-for-tv-broadcast-audio-classification-IJERTV9IS110145.pdf Among the many electronic devices used in day-to-day life, television plays a predominant role. A method using a deep learning convolutional neural network (CNN) is introduced here to classify TV programs into one of five categories, namely Advertisement, Cartoon, News, Songs and Sports, based on analysis of the audio content. The objective of this work is to develop a CNN architecture that classifies the audio segments effectively. The dataset is created from different television channels using a TV tuner card and by downloading from YouTube channels. The proposed CNN model achieves an accuracy of 95% for TV broadcast audio classification.
An Audio-Based Deep Learning Framework For BBC Television Programme Classification
2021 29th European Signal Processing Conference (EUSIPCO), 2021
This paper proposes a deep learning framework for classification of BBC television programmes using audio. The audio is first transformed into spectrograms, which are fed into a pre-trained Convolutional Neural Network (CNN), obtaining predicted probabilities of sound events occurring in the audio recording. Statistics for the predicted probabilities and detected sound events are then calculated to extract discriminative features representing the television programmes. Finally, the extracted embedded features are fed into a classifier that assigns the programmes to different genres. Our experiments are conducted over a dataset of 6,160 programmes belonging to nine genres labelled by the BBC. We achieve an average classification accuracy of 93.7% over 14-fold cross validation. This demonstrates the efficacy of the proposed framework for the task of audio-based classification of television programmes.
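The step of turning predicted sound-event probabilities into programme-level features might look roughly like the sketch below. The abstract does not say which statistics are used, so the mean, maximum, and population standard deviation here are assumptions, as are the function name and the toy numbers.

```python
import statistics


def clip_features(segment_probs):
    """Summarise per-segment class probabilities into one feature vector.

    segment_probs: list of probability vectors, one per audio segment,
                   each holding one float per sound-event class.
    Returns [mean, max, std] for class 0, then for class 1, and so on.
    """
    if not segment_probs:
        raise ValueError("at least one segment is required")
    n_classes = len(segment_probs[0])
    features = []
    for c in range(n_classes):
        column = [p[c] for p in segment_probs]
        features.extend([
            statistics.fmean(column),   # average activation of the class
            max(column),                # strongest activation in the programme
            statistics.pstdev(column),  # how much the activation varies
        ])
    return features


# Toy example: two segments, two hypothetical sound-event classes.
feats = clip_features([[0.2, 0.8], [0.4, 0.6]])
```

The resulting fixed-length vector does not depend on programme duration, which is what allows an ordinary classifier to consume recordings of different lengths.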
Convolutional Networks Used to Classify Video and Audio Data
Research Papers Faculty of Materials Science and Technology Slovak University of Technology, 2019
Deep learning is a kind of machine learning, and machine learning is a kind of artificial intelligence. Machine learning covers a group of various technologies, and deep learning is one of them. Deep learning is an integral part of current data classification practice. This paper introduces the possibilities of classification using convolutional networks. Experiments focused on audio and video data show different approaches to data classification. Most experiments use the well-known pre-trained AlexNet network with various types of input data pre-processing. However, there are also comparisons of other neural network architectures, and we also show the results of training on small and larger datasets. The paper comprises a description of eight different kinds of experiments. Several training sessions were conducted in each experiment, with different aspects monitored. The focus was put on the effect of batch size on the accuracy of deep learnin...
Crowd emotional sounds: spectrogram-based analysis using convolutional neural network
2019
In this work, we introduce a methodology for the recognition of crowd emotions from crowd speech and sound in mass events. Different emotional categories can be encoded via frequency-amplitude features of emotional crowd speech. The proposed technique uses visual transfer learning applied to the input sound spectrograms. Spectrogram images are generated from snippets of fixed length taken from the original sound clip. The plots are then filtered and normalized with respect to frequency and magnitude and fed to a pretrained Convolutional Neural Network (CNN) for images (AlexNet) integrated with domain-specific categorical layers. The integrated CNN is re-trained with the labeled spectrograms of crowd emotion sounds in order to adapt and fine-tune the recognition of the crowd emotional categories. Preliminary experiments have been conducted on a dataset of publicly available sound clips of different mass events for each class, including Joy, Anger and Neutral. While transfe...
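The fixed-length snippet extraction mentioned in this abstract could be sketched as below. The function name, the optional overlap via `hop`, and the choice to drop a trailing partial snippet are assumptions; the abstract gives no such details.

```python
def fixed_length_snippets(samples, snippet_len, hop=None):
    """Cut a sequence of audio samples into snippets of equal length.

    samples:     sequence of audio samples.
    snippet_len: number of samples per snippet.
    hop:         distance between snippet starts; defaults to snippet_len
                 (no overlap). A trailing partial snippet is dropped.
    """
    if hop is None:
        hop = snippet_len
    if snippet_len <= 0 or hop <= 0:
        raise ValueError("snippet_len and hop must be positive")
    last_start = len(samples) - snippet_len
    return [samples[s:s + snippet_len] for s in range(0, last_start + 1, hop)]


# Toy clip of 10 samples, cut without overlap and with 50% overlap.
clip = list(range(10))
plain = fixed_length_snippets(clip, 4)
overlapped = fixed_length_snippets(clip, 4, hop=2)
```

Each snippet would then be converted to its own spectrogram image, so one labelled clip yields several training examples of identical size.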