SubSpectralNet – Using Sub-spectrogram Based Convolutional Neural Networks for Acoustic Scene Classification

A Convolutional Neural Network Approach for Acoustic Scene Classification

This paper presents a novel application of convolutional neural networks (CNNs) to the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrograms. We also introduce a training method that can be used under particular circumstances to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems that competed in the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), improvements of 6.4% and 9% over the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving on the challenge winner's score by 1%.
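The front end described here, a log-mel spectrogram fed to a small convolutional stack, can be sketched as follows. This is a minimal illustration assuming librosa and PyTorch, with illustrative hyperparameters (40 mel bands, two conv/pool stages); it is not the authors' exact architecture.

```python
import librosa
import torch
import torch.nn as nn

def log_mel(path, sr=22050, n_mels=40):
    """Load a clip and return its log-mel spectrogram as (1, n_mels, frames)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    return torch.from_numpy(librosa.power_to_db(mel)).unsqueeze(0).float()

class SceneCNN(nn.Module):
    """Minimal conv/pool stack ending in logits over scene classes."""
    def __init__(self, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # pool to a fixed-size vector
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                     # x: (batch, 1, mels, frames)
        return self.classifier(self.features(x).flatten(1))
```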

Sound Context Classification based on Joint Learning Model and Multi-Spectrogram Features

International Journal of Computing

This article presents a deep learning framework for Acoustic Scene Classification (ASC), the task of classifying different environments from the sounds they produce. To develop the framework, we first carry out a comprehensive analysis of spectrogram representations extracted from sound scene input, then propose the best multi-spectrogram combination for front-end feature extraction. For back-end classification, we propose a novel joint learning model using a parallel architecture of a Convolutional Neural Network (CNN) and a Convolutional Recurrent Neural Network (C-RNN), which efficiently learns both the spatial features and the temporal sequences of a spectrogram input. The experimental results prove our proposed framework to be general and robust for ASC tasks through three main contributions. Firstly, the most effective spectrogram combination is identified for specific datasets that no previous publication has analyzed. Secondly, our joint learning arc...
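The multi-spectrogram front end, several time-frequency views of the same clip combined before classification, might look roughly like the sketch below. It assumes librosa; the paper's actual spectrogram types and combination rule are not given in this excerpt, so mel and CQT views stacked as channels stand in here.

```python
import numpy as np
import librosa

def multi_spectrogram(y, sr, hop=512, bins=84):
    """Stack two time-frequency views of one clip as image channels."""
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop,
                                       n_mels=bins))
    cqt_db = librosa.amplitude_to_db(
        np.abs(librosa.cqt(y, sr=sr, hop_length=hop, n_bins=bins)))
    # Both transforms are centred with the same hop, so frame counts match;
    # per-channel normalisation keeps the two views on a comparable scale.
    stack = np.stack([mel_db, cqt_db])
    mean = stack.mean(axis=(1, 2), keepdims=True)
    std = stack.std(axis=(1, 2), keepdims=True)
    return (stack - mean) / (std + 1e-8)      # shape: (2, bins, frames)
```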

Acoustic Scene Classification Using a CNN-SuperVector System Trained with Auditory and Spectrogram Image Features

Interspeech 2017, 2017

Enabling smart devices to infer about the environment using audio signals has been one of the several long-standing challenges in machine listening. The availability of public-domain datasets, e.g., Detection and Classification of Acoustic Scenes and Events (DCASE) 2016, enabled researchers to compare various algorithms on standard predefined tasks. Most of the current best performing individual acoustic scene classification systems utilize different spectrogram image based features with a Convolutional Neural Network (CNN) architecture. In this study, we first analyze the performance of a state-of-the-art CNN system for different auditory image and spectrogram features, including Mel-scaled, logarithmically scaled, linearly scaled filterbank spectrograms, and Stabilized Auditory Image (SAI) features. Next, we benchmark an MFCC based Gaussian Mixture Model (GMM) SuperVector (SV) system for acoustic scene classification. Finally, we utilize the activations from the final layer of the CNN to form a SuperVector (SV) and use them as feature vectors for a Probabilistic Linear Discriminant Analysis (PLDA) classifier. Experimental evaluation on the DCASE 2016 database demonstrates the effectiveness of the proposed CNN-SV approach compared to conventional CNNs with a fully connected softmax output layer. Score fusion of individual systems provides up to 7% relative improvement in overall accuracy compared to the CNN baseline system.
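The SuperVector step, collecting the activations of one CNN layer as fixed-length vectors for an external classifier, could be prototyped as below. This is a sketch assuming PyTorch; extract_supervectors is a hypothetical helper, and the PLDA back end itself (not shown) would come from an external implementation.

```python
import torch

def extract_supervectors(model, loader, layer):
    """Collect one layer's activations over a dataset as 'supervector'
    embeddings, to be modelled by an external classifier such as PLDA."""
    feats = []
    # A forward hook captures the chosen layer's output on every forward pass.
    handle = layer.register_forward_hook(
        lambda mod, inp, out: feats.append(out.flatten(1).cpu()))
    model.eval()
    with torch.no_grad():
        for batch, _ in loader:
            model(batch)                      # hook fires here
    handle.remove()
    return torch.cat(feats)                   # (n_clips, dim) matrix
```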

Cross-domain synergy: Leveraging image processing techniques for enhanced sound classification through spectrogram analysis using CNNs

Journal of Autonomous Intelligence

This paper reviews an innovative approach to sound classification that exploits the potential of image processing techniques applied to spectrogram representations of audio signals. This study shows the effectiveness of incorporating well-established image processing methodologies, such as filtering, segmentation, and pattern recognition, to enhance the feature extraction and classification performance of audio signals when transformed into spectrograms. An overview is provided of the mathematical methods shared by both image and spectrogram-based audio processing, focusing on the commonalities between the two domains in terms of the underlying principles, techniques, and algorithms. The proposed methodology leverages in particular the power of convolutional neural networks (CNNs) to extract and classify time-frequency features from spectrograms, capitalizing on the advantages of their hierarchical feature learning and robustness to translation and scale variations. Other d...
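As a rough illustration of applying classic image operations (filtering, segmentation, edge detection) to a spectrogram treated as an image, consider the following SciPy sketch; the 3x3 median filter and the +6 dB threshold are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np
from scipy import ndimage

def enhance_spectrogram(spec_db):
    """Apply classic image operations to a (freq, time) dB spectrogram."""
    smoothed = ndimage.median_filter(spec_db, size=3)     # denoise
    # Simple segmentation: keep time-frequency cells well above the
    # per-frequency background level (a crude foreground/noise separation).
    background = np.median(smoothed, axis=1, keepdims=True)
    mask = smoothed > background + 6.0                    # +6 dB threshold
    edges = ndimage.sobel(smoothed, axis=0)               # spectral edges
    return smoothed, mask, edges
```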

Acoustic Scene Analysis and Classification Using Densenet Convolutional Neural Network

In this paper we present an account of the state of the art in Acoustic Scene Classification (ASC), the task of classifying environmental scenarios through the sounds they produce. Our work aims to classify 50 different outdoor and indoor scenarios using environmental sounds. We use the ESC-50 dataset from the IEEE challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), comprising 2000 different environmental audio recordings. In this method the raw audio data is converted into a Mel-spectrogram and other characteristics such as Tonnetz, Chroma and MFCC. The generated Mel-spectrogram is fed as input to a neural network for training. Our model follows a convolutional neural network structure built from convolution and pooling layers. With a focus on real-time environmental classification and to overcome the problem of low generalization in the model, the paper introduces augmentation, producing noise-modified audio by adding Gaussian white noise. Active researche...
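The Gaussian white-noise augmentation mentioned above is straightforward to reproduce. A minimal NumPy sketch that adds noise at a chosen signal-to-noise ratio (the abstract does not state the SNR used, so it is a parameter here):

```python
import numpy as np

def add_white_noise(y, snr_db=20.0, rng=None):
    """Return a copy of waveform y with Gaussian white noise at a target SNR."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(y ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return (y + noise).astype(y.dtype)
```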

Robust acoustic scene classification using a multi-spectrogram encoder-decoder framework

Digital Signal Processing, 2021

This article proposes an encoder-decoder network model for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording from its acoustic signature. We make use of multiple low-level spectrogram features at the front-end, transformed into higher level features through a well-trained CNN-DNN front-end encoder. The high level features and their combination (via a trained feature combiner) are then fed into different decoder models comprising random forest regression, DNNs and a mixture of experts, for back-end classification. We report extensive experiments to evaluate the accuracy of this framework for various ASC datasets, including LITIS Rouen and IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Tasks 1A & 1B and 2019 Tasks 1A & 1B. The experimental results highlight two main contributions; the first is an effective method for high-level feature extraction from multi-spectrogram input via the novel C-DNN architecture encoder network, and the second is the proposed decoder which enables the framework to achieve competitive results on various datasets. The fact that a single framework is highly competitive for several different challenges is an indicator of its robustness for performing general ASC tasks.
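The trained feature combiner between encoder and decoders is not specified in detail here; one plausible minimal reading, a learned softmax weighting over per-spectrogram embeddings ahead of a small decoder, is sketched below in PyTorch. Treat it as an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FeatureCombiner(nn.Module):
    """Merge per-spectrogram embeddings with learned weights, then decode."""
    def __init__(self, dim, n_views=3, n_classes=10):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_views))   # one gate per view
        self.decoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_classes))

    def forward(self, views):                 # views: (batch, n_views, dim)
        w = torch.softmax(self.weights, dim=0)
        combined = (views * w[None, :, None]).sum(dim=1)   # weighted sum
        return self.decoder(combined)
```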

Audio Recognition using Mel Spectrograms and Convolution Neural Networks

2019

Automatic sound recognition has received heightened research interest in recent years due to its many potential applications. These include automatic labeling of video/audio content and real-time sound detection for robotics. While image classification is a heavily researched topic, sound identification is less mature. In this study, we take advantage of the robust machine learning techniques developed for image classification and apply them to the sound recognition problem. Raw audio data from the Freesound Dataset (FSD) provided by Kaggle is first converted to a spectrogram representation in order to apply these image classification techniques. We test and compare two approaches using deep convolutional neural networks (CNNs): 1.) our own CNN architecture; 2.) transfer learning using the pre-trained VGG19 network. Using our self-developed architecture, we achieve a label-weighted label-ranking average precision (LWLRAP) score and top-5 accuracy of 0.813 and 88.9%, respectively, whe...
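The VGG19 transfer-learning setup can be sketched with torchvision as below. Freezing the convolutional backbone and swapping the final layer is one common recipe; the details (which layers are frozen, how the single-channel input is handled) are assumptions rather than the authors' configuration.

```python
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

def build_transfer_model(n_classes):
    """VGG19 pretrained on ImageNet, head swapped for sound-class logits."""
    model = vgg19(weights=VGG19_Weights.DEFAULT)
    for p in model.features.parameters():
        p.requires_grad = False              # freeze convolutional backbone
    model.classifier[6] = nn.Linear(4096, n_classes)   # new output head
    return model

# Spectrograms are single-channel, so tile them to three channels,
# e.g. x.repeat(1, 3, 1, 1), before feeding the ImageNet backbone.
```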

Sound Context Classification Basing on Join Learning Model and Multi-Spectrogram Features

Cornell University - arXiv, 2020

In this paper, we present a deep learning framework for Acoustic Scene Classification (ASC), the task of classifying scene contexts from environmental input sounds. An ASC system generally comprises two main steps, referred to as front-end feature extraction and back-end classification. In the first step, an extractor is used to derive low-level features from the raw audio signal. Next, the discriminative features extracted are fed into a classifier, which reports classification accuracy. Aiming to develop a robust framework for ASC, we address existing issues in both the front-end and back-end components of an ASC system, and present three main contributions. Firstly, we carry out a comprehensive analysis of spectrogram representations extracted from sound scene input and propose the best multi-spectrogram combinations. In terms of back-end classification, we propose a novel joint learning architecture using parallel convolutional recurrent networks, which is effective at learning the spatial features and temporal sequences of spectrogram input. Finally, good experimental results obtained over the benchmark datasets of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Tasks 1A & 1B, and LITIS Rouen prove that our proposed framework is general and robust for the ASC task.
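A minimal sketch of the parallel CNN/recurrent joint architecture described above, assuming PyTorch. The branch sizes and the GRU-based recurrent path are illustrative stand-ins; the paper's exact C-RNN configuration is not given in this excerpt.

```python
import torch
import torch.nn as nn

class ParallelCNNCRNN(nn.Module):
    """Two branches over one spectrogram: a CNN for spatial patterns and a
    conv + GRU branch for temporal structure, fused before the classifier."""
    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                       # -> (B, 32, 1, 1)
        self.conv1d = nn.Conv1d(n_mels, 64, 3, padding=1)  # frames as sequence
        self.gru = nn.GRU(64, 64, batch_first=True)
        self.classifier = nn.Linear(32 + 64, n_classes)

    def forward(self, x):                     # x: (B, 1, n_mels, frames)
        spatial = self.cnn(x).flatten(1)                   # (B, 32)
        seq = self.conv1d(x.squeeze(1))                    # (B, 64, frames)
        _, h = self.gru(seq.transpose(1, 2))               # h: (1, B, 64)
        return self.classifier(torch.cat([spatial, h[-1]], dim=1))
```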

Spectral images based environmental sound classification using CNN with meaningful data augmentation

Applied Acoustics, 2020

In this study, an effective approach to environmental sound classification based on spectral images, using Convolutional Neural Networks (CNN) with meaningful data augmentation, is proposed. The feature used in this approach is the Mel spectrogram: features are derived from audio clips in the form of spectrogram images. The CNN models used in this experiment are a 7-layer or a 9-layer CNN learned from scratch. In addition, various well-known deep learning structures are considered with transfer learning, following the scheme of freezing the initial layers, training the model, unfreezing the layers, and training the model again with discriminative learning. Three datasets, ESC-10, ESC-50, and Us8k, are considered. For the transfer learning methodology, 11 pre-trained deep learning structures are used. In this study, instead of using the data augmentation schemes available for images, we propose meaningful data augmentation obtained by applying variations to the audio clips directly. The results show the effectiveness, robustness, and high accuracy of the proposed approach. The meaningful data augmentation achieves the highest accuracy with a lower error rate on all datasets when using transfer learning models. Among the models used, ResNet-152 attained 99.04% for the ESC-10 and 99.49% for the Us8k dataset, while DenseNet-161 reached 97.57% for ESC-50. To our knowledge, these are the best results achieved on these datasets.
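The "meaningful" augmentation, varying the clip itself rather than the finished spectrogram image, might look like the following librosa sketch; the specific transforms and ranges here (time stretch, pitch shift, random offset) are assumptions, since the abstract does not enumerate the variations used.

```python
import numpy as np
import librosa

def augment_clip(y, sr, rng=None):
    """Audio-domain augmentations applied before spectrogram extraction,
    instead of image-style transforms on the finished spectrogram."""
    rng = rng or np.random.default_rng()
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    shift = int(rng.integers(0, sr // 10))    # random start offset (<0.1 s)
    return np.roll(y, shift)
```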

A Low-Complexity Deep Learning Framework for Acoustic Scene Classification

2021

In this paper, we present a low-complexity deep learning framework for acoustic scene classification (ASC). The proposed framework can be separated into three main steps: front-end spectrogram extraction, back-end classification, and late fusion of predicted probabilities. First, we use a Mel filter, a Gammatone filter and the Constant Q Transform (CQT) to transform the raw audio signal into spectrograms, where both frequency and temporal features are presented. The three spectrograms are then fed into three individual back-end convolutional neural networks (CNNs), classifying into ten urban scenes. Finally, a late fusion of the three predicted probabilities obtained from the three CNNs is conducted to achieve the final classification result. To reduce the complexity of our proposed CNN network, we apply two model compression techniques: model restriction and decomposed convolution. Our extensive experiments, which are conducted on DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events)...
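Late fusion of the three per-network probability vectors reduces, in its simplest form, to a weighted average. A small NumPy sketch (equal weights assumed, since the fusion weights are not given in the excerpt):

```python
import numpy as np

def late_fusion(prob_mel, prob_gamma, prob_cqt, weights=(1/3, 1/3, 1/3)):
    """Weighted average of per-network class probabilities, each of shape
    (n_clips, n_classes), followed by an argmax over classes."""
    fused = (weights[0] * prob_mel + weights[1] * prob_gamma
             + weights[2] * prob_cqt)
    return fused.argmax(axis=1)               # final predicted scene per clip
```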