A Two-Stage Approach to Device-Robust Acoustic Scene Classification

Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

ArXiv, 2020

In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten fine-grained classes, and (ii) Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging an ad-hoc score combination of two convolutional neural networks (CNNs), which classify the acoustic input first into three classes and then into ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-b...
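The abstract does not give the exact fusion rule, but the idea of combining a three-class and a ten-class CNN can be sketched as below, assuming a fixed mapping from fine to coarse classes and a simple multiplicative re-weighting; the class names and the mapping are illustrative, not taken from the report.

```python
# Minimal sketch of a two-stage score combination, assuming a fixed mapping
# from the ten fine-grained scene classes to three coarse classes and a
# simple multiplicative fusion rule (the exact rule used in the report is
# not specified in the abstract).
import numpy as np

FINE_CLASSES = ["airport", "bus", "metro", "metro_station", "park",
                "public_square", "shopping_mall", "street_pedestrian",
                "street_traffic", "tram"]
# Hypothetical coarse grouping: 0 = indoor, 1 = outdoor, 2 = transportation.
COARSE_OF_FINE = np.array([0, 2, 2, 0, 1, 1, 0, 1, 1, 2])

def combine_scores(p_coarse, p_fine):
    """Re-weight 10-class probabilities by the matching 3-class probability."""
    fused = p_fine * p_coarse[COARSE_OF_FINE]
    return fused / fused.sum()

# Toy example with random softmax outputs standing in for the two CNNs.
rng = np.random.default_rng(0)
p3 = rng.dirichlet(np.ones(3))      # stage-1 CNN (3 coarse classes)
p10 = rng.dirichlet(np.ones(10))    # stage-2 CNN (10 fine classes)
print(FINE_CLASSES[int(np.argmax(combine_scores(p3, p10)))])
```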

Device-Robust Acoustic Scene Classification via Impulse Response Augmentation

arXiv (Cornell University), 2023

The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with a small number of different microphones can make generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with prerecorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. However, we also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
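The augmentation itself is straightforward to sketch. The snippet below convolves a training waveform with a randomly chosen pre-recorded DIR, as described above; the application probability, the energy normalization, and the synthetic data are placeholders rather than the paper's exact settings.

```python
# Sketch of device impulse response (DIR) augmentation: a training waveform
# is convolved with a randomly chosen pre-recorded DIR to simulate an unseen
# recording device.
import numpy as np
from scipy.signal import fftconvolve

def dir_augment(waveform, dirs, p=0.6, rng=np.random.default_rng()):
    """Convolve `waveform` with a random DIR with probability `p`."""
    if rng.random() > p:
        return waveform
    ir = dirs[rng.integers(len(dirs))]
    out = fftconvolve(waveform, ir, mode="full")[: len(waveform)]
    # Roughly preserve the original energy so loudness stays comparable.
    out *= np.sqrt(np.sum(waveform ** 2) / (np.sum(out ** 2) + 1e-12))
    return out

# Toy usage with synthetic data (real DIRs would be loaded from disk).
x = np.random.randn(16000).astype(np.float32)           # 1 s at 16 kHz
dirs = [np.random.randn(512).astype(np.float32) for _ in range(4)]
y = dir_augment(x, dirs)
```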

Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices

Cornell University - arXiv, 2022

In this paper, we present a robust and low-complexity system for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording. We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched-recording-device issue. To further improve performance while keeping the model complexity low, we apply two techniques on top of the baseline system: an ensemble of multiple spectrograms and channel reduction. Extensive experiments on the benchmark DCASE 2020 Task 1A Development dataset show that our best model achieves an accuracy of 69.9% with a low complexity of 2.4M trainable parameters, which is competitive with state-of-the-art ASC systems and suitable for real-life applications on edge devices.
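As a rough illustration of the kind of inception-residual building block such a baseline is built from, the following PyTorch sketch combines parallel 1x1 and 3x3 branches with a residual connection; the actual branch configuration, channel counts, and normalization used by the authors are not specified in the abstract.

```python
# Illustrative inception-style residual block (not the paper's exact design).
import torch
import torch.nn as nn

class InceptionResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        branch_ch = channels // 2
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, branch_ch, kernel_size=1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, branch_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        # 1x1 projection back to the input width so the residual sum works.
        self.project = nn.Conv2d(2 * branch_ch, channels, kernel_size=1)

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch3(x)], dim=1)
        return torch.relu(x + self.project(y))

block = InceptionResBlock(64)
feat = torch.randn(1, 64, 64, 128)     # (batch, channels, freq, time)
print(block(feat).shape)               # torch.Size([1, 64, 64, 128])
```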

A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

arXiv (Cornell University), 2021

We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment by leveraging a recently proposed advanced neural network pruning mechanism, namely the Lottery Ticket Hypothesis (LTH), to find a sub-network neural model with a small number of non-zero parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery can substantially compress an ASC model while attaining superior performance (validation accuracy of 79.4% and log loss of 0.64) compared to its uncompressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" in the DCASE 2021 Challenge Task 1a.
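A minimal sketch of one LTH pruning round is shown below, assuming per-layer magnitude pruning with rewinding to the initial weights; the pruning rate, the training loop between rounds, and the quantization step of Acoustic Lottery are omitted or left as placeholders.

```python
# One Lottery Ticket Hypothesis round: prune the smallest surviving weights
# by magnitude, then rewind the remaining weights to their initial values.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
init_state = copy.deepcopy(model.state_dict())
masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

def prune_and_rewind(model, init_state, masks, rate=0.2):
    """Prune `rate` of the surviving weights per layer, then rewind."""
    for name, param in model.named_parameters():
        if name not in masks:
            continue
        scores = param.detach().abs() * masks[name]
        alive = scores[masks[name] > 0]
        threshold = torch.quantile(alive, rate)
        masks[name] = (scores > threshold).float()
    model.load_state_dict(init_state)            # weight rewinding
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])          # zero out pruned weights
    return model, masks

# In practice the model is trained between pruning rounds; that loop and the
# data augmentation / quantization steps of Acoustic Lottery are not shown.
model, masks = prune_and_rewind(model, init_state, masks)
```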

A Convolutional Neural Network Approach for Acoustic Scene Classification

This paper presents a novel application of convolutional neural networks (CNNs) for the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrograms. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems which competed in the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), which constitute improvements of 6.4% and 9% over the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving on the challenge winner's score by 1%.
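The log-mel front end described here can be sketched with librosa as follows; the sampling rate, frame settings, number of mel bands, and sequence length are illustrative values rather than the paper's exact configuration, and the audio path is a placeholder.

```python
# Log-mel spectrogram extraction and slicing into short fixed-length
# sequences for CNN training (illustrative settings only).
import numpy as np
import librosa

def log_mel(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=60):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # (n_mels, n_frames)

def split_into_sequences(spec, frames=128):
    """Cut the spectrogram into the short sequences the CNN classifies."""
    n = spec.shape[1] // frames
    return np.stack([spec[:, i * frames:(i + 1) * frames] for i in range(n)])
```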

A Robust Framework for Acoustic Scene Classification

Interspeech 2019, 2019

Acoustic scene classification (ASC) using front-end time-frequency features and back-end neural network classifiers has demonstrated good performance in recent years. However, a profusion of systems has arisen to suit different tasks and datasets, utilising different feature and classifier types. This paper aims at a robust framework that can explore and utilise a range of different time-frequency features and neural networks, either singly or merged, to achieve good classification performance. In particular, we exploit three different types of front-end time-frequency feature: log-energy Mel filter, Gammatone filter and constant-Q transform. At the back end, we evaluate an effective two-stage model that exploits a Convolutional Neural Network for pre-trained feature extraction, followed by Deep Neural Network classifiers as a post-trained feature adaptation model and classifier. We also explore the use of a data augmentation technique for these features that effectively generates a variety of intermediate data, reinforcing model learning abilities, particularly for marginal cases. We assess performance on the DCASE2016 dataset, demonstrating good classification accuracies exceeding 90%, significantly outperforming the DCASE2016 baseline and remaining highly competitive with state-of-the-art systems.
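The two-stage back end (a CNN used as a pre-trained feature extractor followed by a separately trained DNN classifier) might look roughly like the sketch below; the layer sizes, pooling, and class count are assumptions for illustration only.

```python
# Illustrative two-stage back end: frozen CNN embeddings feed a trainable DNN.
import torch
import torch.nn as nn

cnn = nn.Sequential(                       # stage 1: pre-trained feature extractor
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
    nn.Flatten())                          # -> 32 * 4 * 4 = 512-dim embedding
dnn = nn.Sequential(                       # stage 2: post-trained classifier
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3), nn.Linear(256, 15))

for p in cnn.parameters():                 # freeze the extractor, train only the DNN
    p.requires_grad = False

spec = torch.randn(8, 1, 128, 431)         # batch of Mel / Gammatone / CQT inputs
with torch.no_grad():
    emb = cnn(spec)
logits = dnn(emb)                          # (8, 15) scene scores
```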

Ensemble of Deep Neural Networks for Acoustic Scene Classification

Deep neural networks (DNNs) have recently achieved great success in a multitude of classification tasks, and ensembles of DNNs have been shown to improve performance further. In this paper, we explore recent state-of-the-art DNNs used for image classification, modify them, and apply them to the task of acoustic scene classification. We conduct a number of experiments on the TUT Acoustic Scenes 2017 dataset to empirically compare these methods. Finally, we show that an ensemble of these DNNs improves the baseline score for DCASE-2017 Task 1 by 10%.
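Assuming the ensemble operates by averaging the per-clip class probabilities of the individual networks (the abstract does not state the exact combination rule), it can be sketched as:

```python
# Probability-level ensemble: average the softmax outputs of several models.
import numpy as np

def ensemble_predict(prob_list):
    """prob_list: list of (n_clips, n_classes) softmax outputs, one per model."""
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return np.argmax(avg, axis=1)

# Toy example: three models, five clips, fifteen TUT 2017 scene classes.
rng = np.random.default_rng(1)
probs = [rng.dirichlet(np.ones(15), size=5) for _ in range(3)]
print(ensemble_predict(probs))
```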

A Low-Complexity Deep Learning Framework for Acoustic Scene Classification

2021

In this paper, we present a low-complexity deep learning framework for acoustic scene classification (ASC). The proposed framework can be separated into three main steps: front-end spectrogram extraction, back-end classification, and late fusion of predicted probabilities. First, we use Mel filter, Gammatone filter and Constant Q Transform (CQT) to transform the raw audio signal into spectrograms, where both frequency and temporal features are presented. The three spectrograms are then fed into three individual back-end convolutional neural networks (CNNs), classifying into ten urban scenes. Finally, a late fusion of the three predicted probabilities obtained from the three CNNs is conducted to achieve the final classification result. To reduce the complexity of our proposed CNN network, we apply two model compression techniques: model restriction and decomposed convolution. Our extensive experiments, which are conducted on DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Eve...
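The decomposed-convolution compression mentioned above is commonly realised as a depthwise-separable factorization; whether the paper uses exactly this form is an assumption, but the sketch below shows the parameter saving such a decomposition gives over a standard convolution.

```python
# Compare a standard 3x3 convolution with a depthwise + pointwise decomposition.
import torch
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
decomposed = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1))                        # pointwise

x = torch.randn(1, 64, 128, 128)
assert standard(x).shape == decomposed(x).shape
print(count_params(standard), "vs", count_params(decomposed))  # ~74k vs ~9k
```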

Audio Scene Classification Using Enhanced Convolutional Neural Networks for DCASE 2021 Challenge

This technical report describes our system proposed for Task 1B - Audio-Visual Scene Classification of the DCASE 2021 Challenge. Our system focuses on classification based on the audio signal. The system has an architecture based on the combination of Convolutional Neural Networks and OpenL3 embeddings. The CNN consists of three stacked 2D convolutional layers that process the log-Mel spectrogram parameters obtained from the input signals. Additionally, OpenL3 embeddings of the input signals are calculated and merged with the output of the CNN stack. The resulting vector is fed to a classification block consisting of three fully connected layers. The mixup augmentation technique is applied to the training data, and binaural data is also used as input to provide additional information. In this report, we describe the proposed systems in detail and compare them to the baseline approach using the provided development datasets.
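The mixup augmentation applied to the training data presumably follows the standard formulation (convex combinations of input pairs and of their one-hot labels, with the mixing weight drawn from a Beta distribution); a minimal sketch, with an illustrative alpha value, is:

```python
# Standard mixup on a batch of spectrograms and one-hot labels.
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng()):
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Toy usage on a batch of log-Mel spectrograms (batch, mels, frames).
x = np.random.randn(16, 64, 500).astype(np.float32)
y = np.eye(10)[np.random.randint(0, 10, size=16)].astype(np.float32)
x_mix, y_mix = mixup_batch(x, y)
```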

A Layer-wise Score Level Ensemble Framework for Acoustic Scene Classification

2018 26th European Signal Processing Conference (EUSIPCO), 2018

Scene classification based on acoustic information is a challenging task due to various factors such as the non-stationary nature of the environment and multiple overlapping acoustic events. In this paper, we address the acoustic scene classification problem using SoundNet, a deep convolutional neural network pre-trained on raw audio signals. We propose a classification strategy that combines scores from each layer, based on the hypothesis that the layers of a deep convolutional network learn complementary information, and that combining this layer-wise information yields better classification than features extracted from any individual layer. In addition, we propose a pooling strategy to reduce the dimensionality of features extracted from different layers of SoundNet. Our experiments on the DCASE 2016 acoustic scene classification dataset reveal the effectiveness of this layer-wise ensemble approach, which provides a relative improvement of approximately 30.85% over the classification accuracy of the best individual SoundNet layer.
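A compact sketch of the layer-wise score ensemble is given below: features from each layer are mean-pooled over time, a separate classifier is trained per layer, and the per-layer scores are summed. The pooling choice, the logistic-regression classifiers, and the random stand-in features (SoundNet itself is not reproduced here) are all assumptions for illustration.

```python
# Layer-wise score ensemble: pool each layer's features over time, train a
# classifier per layer, and combine the per-layer class scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_train, n_test, n_classes = 200, 50, 15
layers_train = [rng.normal(size=(n_train, d, 30)) for d in (64, 128, 256)]
layers_test = [rng.normal(size=(n_test, d, 30)) for d in (64, 128, 256)]
y_train = rng.integers(0, n_classes, size=n_train)

def pool(feats):
    """Mean-pool each layer's activations over the time axis."""
    return feats.mean(axis=2)

scores = np.zeros((n_test, n_classes))
for tr, te in zip(layers_train, layers_test):
    clf = LogisticRegression(max_iter=1000).fit(pool(tr), y_train)
    scores += clf.predict_proba(pool(te))   # accumulate per-layer scores
pred = scores.argmax(axis=1)
```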