On the Effect of Coding Artifacts on Acoustic Scene Classification

Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices

Significant efforts are being invested to bring state-of-the-art classification and recognition to edge devices with extreme resource constraints (memory, speed, and lack of GPU support). Here, we demonstrate the first deep network for acoustic recognition that is small, flexible and compression-friendly yet achieves state-of-the-art performance for raw audio classification. Rather than handcrafting a once-off solution, we present a generic pipeline that automatically converts a large deep convolutional network via compression and quantization into a network for resource-impoverished edge devices. After introducing ACDNet, which produces above state-of-the-art accuracy on ESC-10 (96.65%), ESC-50 (87.10%), UrbanSound8K (84.45%) and AudioEvent (92.57%), we describe the compression pipeline and show that it allows us to achieve 97.22% size reduction and 97.28% FLOP reduction while maintaining close to state-of-the-art accuracies of 96.25%, 83.65%, 78.27% and 89.69% on these datasets. We describe a successful implementation on a standard off-the-shelf microcontroller and, beyond laboratory benchmarks, report successful tests on real-world datasets.
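
A minimal sketch of what such a compress-then-quantize pipeline could look like in PyTorch, assuming magnitude (L1) pruning followed by post-training dynamic quantization; the function name, sparsity level, and layer choices are illustrative, not ACDNet's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def compress(model: nn.Module, sparsity: float = 0.9) -> nn.Module:
    # Zero out the smallest-magnitude weights in every conv/linear layer.
    for module in model.modules():
        if isinstance(module, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            prune.remove(module, "weight")  # make the pruning permanent
    # Post-training dynamic quantization: linear weights stored as int8.
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
```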

Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

arXiv, 2020

In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging an ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input first into three classes and then into ten classes. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-b...
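
As a rough illustration of the ad-hoc two-stage score combination, the sketch below re-weights the ten fine-class probabilities by the matching three-class scores; the fine-to-coarse mapping and all names are assumed for illustration, not taken from the report:

```python
import numpy as np

# Assumed mapping: which of the 3 coarse classes each fine class belongs to.
FINE_TO_COARSE = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

def two_stage_scores(p_coarse: np.ndarray, p_fine: np.ndarray) -> np.ndarray:
    """Re-weight each fine-class probability by its coarse parent's score."""
    combined = p_fine * p_coarse[FINE_TO_COARSE]
    return combined / combined.sum()  # renormalise to a distribution
```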

Environmental Sound Classification on the Edge: Deep Acoustic Networks for Extremely Resource-Constrained Devices

2021

Significant efforts are being invested to bring the classification and recognition powers of desktop and cloud systems directly to edge devices. The main challenge for deep learning on the edge is to handle extreme resource constraints (memory, CPU speed and lack of GPU support). We present an edge solution for audio classification that achieves close to state-of-the-art performance on ESC-50, the same benchmark used to assess large, non-resource-constrained networks. Importantly, we do not specifically engineer the network for edge devices. Rather, we present a universal pipeline that converts a large deep convolutional neural network (CNN) automatically via compression and quantization into a network suitable for resource-impoverished edge devices. We first introduce a new sound classification architecture, ACDNet, that produces above state-of-the-art accuracy on both ESC-10 and ESC-50 (96.75% and 87.05%, respectively). We then compress ACDNet using a novel network-independent ap...
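
The excerpt does not name a specific toolchain, but one plausible final quantization step for such an edge pipeline is TensorFlow Lite full-integer conversion, sketched here with assumed names and a user-supplied list of representative audio batches for calibration:

```python
import tensorflow as tf

def to_int8_tflite(saved_model_dir: str, rep_audio_batches):
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Representative raw-audio batches drive the activation calibration.
    converter.representative_dataset = lambda: (
        [batch] for batch in rep_audio_batches
    )
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # flatbuffer bytes for the microcontroller
```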

A Low-Complexity Deep Learning Framework for Acoustic Scene Classification

2021

In this paper, we present a low-complexity deep learning framework for acoustic scene classification (ASC). The proposed framework can be separated into three main steps: front-end spectrogram extraction, back-end classification, and late fusion of predicted probabilities. First, we use Mel filter, Gammatone filter and Constant Q Transform (CQT) to transform the raw audio signal into spectrograms, where both frequency and temporal features are presented. The three spectrograms are then fed into three individual back-end convolutional neural networks (CNNs), classifying into ten urban scenes. Finally, a late fusion of the three predicted probabilities obtained from the three CNNs is conducted to achieve the final classification result. To reduce the complexity of our proposed CNN network, we apply two model compression techniques: model restriction and decomposed convolution. Our extensive experiments, which are conducted on DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Eve...
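
A condensed sketch of the described front-end and late-fusion steps, assuming librosa for the mel and CQT spectrograms (a gammatone front end would require a separate package, so it is only noted here) and placeholder per-branch classifiers:

```python
import numpy as np
import librosa

def front_ends(y: np.ndarray, sr: int = 44100):
    """Two of the three spectrogram branches; gammatone would be the third."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    cqt = np.abs(librosa.cqt(y, sr=sr))
    return mel, cqt

def late_fusion(branch_probs: list[np.ndarray]) -> int:
    """Average the per-branch class probabilities and pick the arg-max."""
    return int(np.mean(np.stack(branch_probs), axis=0).argmax())
```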

A Two-Stage Approach to Device-Robust Acoustic Scene Classification

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles, delivers an 81.9% average accuracy on multi-device test data and obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights into the patterns learnt by our models.
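
The CAM-based saliency analysis mentioned above can be sketched as follows, assuming a CNN that ends in global average pooling followed by a single linear classifier; tensor shapes and names are illustrative:

```python
import torch

def class_activation_map(feature_maps: torch.Tensor,
                         fc_weight: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """feature_maps: (C, F, T) last-conv output; fc_weight: (num_classes, C)."""
    weights = fc_weight[class_idx]                      # (C,) per-channel weights
    cam = torch.einsum("c,cft->ft", weights, feature_maps)
    cam = torch.relu(cam)                               # keep positive evidence
    return cam / (cam.max() + 1e-8)                     # normalise to [0, 1]
```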

Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices

arXiv, 2022

In this paper, we present a robust and low-complexity system for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording. We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue. To further improve the performance while still satisfying the low-complexity requirement, we apply two techniques to the ASC baseline system: an ensemble of multiple spectrograms and channel reduction. Extensive experiments on the benchmark DCASE 2020 Task 1A development dataset show that our best model achieves an accuracy of 69.9% with a low complexity of 2.4M trainable parameters, which is competitive with state-of-the-art ASC systems and promising for real-life applications on edge devices.
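
A hedged sketch of an inception-residual block of the general kind the abstract names: parallel convolutions of different kernel sizes whose concatenation is added back to the input. The channel split and kernel sizes are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class InceptionResidual(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        branch = channels // 2  # assumes an even channel count
        self.b1 = nn.Conv2d(channels, branch, kernel_size=1)
        self.b3 = nn.Conv2d(channels, branch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.b1(x), self.b3(x)], dim=1)  # inception branches
        return torch.relu(self.bn(out) + x)               # residual shortcut
```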

Pruning vs XNOR-Net: A Comprehensive Study of Deep Learning for Audio Classification on Edge-devices

2021

Deep Learning has celebrated resounding successes in many application areas of relevance to the Internet-of-Things, for example, computer vision and machine listening. To fully harness the power of deep learning for the IoT, these technologies must ultimately be brought directly to the edge. The obvious challenge is that deep learning techniques can only be implemented on strictly resource-constrained edge devices if the models are radically downsized. This task relies on different model compression techniques, such as network pruning, quantization, and the recent advancement of XNOR-Net. This paper examines the suitability of these techniques for audio classification on microcontrollers. We present an XNOR-Net for end-to-end raw audio classification and a comprehensive empirical study comparing this approach with pruning-and-quantization methods. We show that raw audio classification with XNOR yields comparable performance to regular full precision networks for small numbers of clas...
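
The core XNOR-style binarization for raw-audio convolutions can be sketched as below: weights binarized to {-1, +1} with a per-filter scale and a straight-through estimator in the backward pass. This is a generic sketch of the technique, not the paper's exact layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConv1d(nn.Conv1d):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        alpha = w.abs().mean(dim=(1, 2), keepdim=True)   # per-filter scale
        w_bin = torch.sign(w) * alpha                    # {-alpha, +alpha}
        # Straight-through estimator: binarised forward, full-precision grads.
        w_ste = w + (w_bin - w).detach()
        return F.conv1d(x, w_ste, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```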

A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

arXiv, 2021

We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment by leveraging a recently proposed advanced neural network pruning mechanism, the Lottery Ticket Hypothesis (LTH), to find a sub-network neural model with a small number of non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery can largely compress an ASC model while attaining superior performance (validation accuracy of 79.4% and log loss of 0.64) compared to its uncompressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" in the DCASE 2021 Challenge Task 1a.
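
A condensed sketch of the LTH train-prune-rewind loop the abstract describes, in PyTorch; train_fn, the number of rounds, and the per-round pruning fraction are assumed placeholders:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def lottery_ticket(model: nn.Module, train_fn, rounds: int = 5,
                   amount: float = 0.2) -> nn.Module:
    """train_fn(model) is an assumed user-supplied training routine."""
    init = copy.deepcopy(model.state_dict())  # theta_0, kept for rewinding
    for _ in range(rounds):
        train_fn(model)
        with torch.no_grad():
            for name, m in model.named_modules():
                if isinstance(m, (nn.Conv2d, nn.Linear)):
                    prune.l1_unstructured(m, "weight", amount=amount)
                    # Rewind surviving weights to their initial values;
                    # the accumulated mask keeps the model sparse.
                    m.weight_orig.copy_(init[f"{name}.weight"])
    return model
```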

Audio Scene Classification Using Enhanced Convolutional Neural Networks for the DCASE 2021 Challenge

This technical report describes our system proposed for Task 1B (Audio-Visual Scene Classification) of the DCASE 2021 Challenge. Our system focuses on audio-signal-based classification. The system has an architecture based on the combination of convolutional neural networks and OpenL3 embeddings. The CNN consists of three stacked 2D convolutional layers that process the log-Mel spectrogram parameters obtained from the input signals. Additionally, OpenL3 embeddings of the input signals are calculated and merged with the output of the CNN stack. The resulting vector is fed to a classification block consisting of three fully connected layers. The mixup augmentation technique is applied to the training data, and binaural data is also used as input to provide additional information. In this report, we describe the proposed systems in detail and compare them to the baseline approach using the provided development datasets.
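
The mixup augmentation mentioned above can be sketched in a few lines: each batch is replaced by convex combinations of example pairs and their one-hot labels. The Beta parameter here is a typical assumed value, not the report's setting:

```python
import numpy as np

def mixup(x: np.ndarray, y: np.ndarray, alpha: float = 0.2):
    """x: (batch, ...) features; y: (batch, num_classes) one-hot labels."""
    lam = np.random.beta(alpha, alpha)       # mixing coefficient in [0, 1]
    idx = np.random.permutation(len(x))      # random partner for each example
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]
```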