A Deep Residual Network for Large-Scale Acoustic Scene Analysis

AUDIO EVENT DETECTION AND LOCALIZATION WITH MULTITASK REGRESSION NETWORK (Technical Report)

2020

This technical report describes our submission to DCASE 2020 Task 3 (Sound Event Localization and Detection, SELD). In the submission, we propose a multitask regression model in which both (multi-label) event detection and localization are formulated as regression problems, so that the mean squared error loss can be used homogeneously for model training. The deep learning model features a convolutional recurrent neural network (CRNN) architecture coupled with a self-attention mechanism. Experiments on the development set of the challenge's SELD task demonstrate that the proposed system outperforms the DCASE 2020 SELD baseline across all detection and localization metrics, reducing the overall SELD error (the combined metric) by approximately 10% absolute.
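As an illustration of the homogeneous regression formulation the abstract describes, a minimal sketch follows; the tensor shapes, variable names, and the loss-balancing weight are assumptions for the example, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical tensor shapes: event activity as (batch, frames, classes),
# direction-of-arrival (DOA) targets as (batch, frames, 3 * classes).
mse = nn.MSELoss()

def seld_regression_loss(act_pred, act_true, doa_pred, doa_true, weight=1.0):
    """Both sub-tasks share the same MSE criterion; `weight` balances them."""
    detection_loss = mse(act_pred, act_true)      # activity targets in [0, 1]
    localization_loss = mse(doa_pred, doa_true)   # Cartesian DOA regression
    return detection_loss + weight * localization_loss
```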

Convolutional Neural Network based Audio Event Classification

KSII Transactions on Internet and Information Systems, 2018

This paper proposes an audio event classification method based on convolutional neural networks (CNNs). CNNs are well suited to distinguishing complex shapes in images, and the proposed system uses features of the audio signal as an input image to the CNN. Mel-scale filter bank features are extracted from each frame and concatenated over 40 consecutive frames; the concatenated frames are treated as an input image. The output layer of the CNN generates probabilities for each audio event (e.g., dog bark, siren, forest). The event probabilities for all images in an audio segment are accumulated, and the audio event with the highest accumulated probability is taken as the classification result. The proposed method classified thirty audio events with an accuracy of 81.5% on the UrbanSound8K, BBC Sound FX, DCASE 2016, and FREESOUND datasets.
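A rough sketch of the frame-stacking and probability-accumulation scheme described above; librosa, the helper names, and the `cnn_predict` callable are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import librosa  # assumed feature-extraction library; not named in the paper

def segment_to_images(wav_path, n_mels=40, frames_per_image=40):
    """Stack mel filter-bank frames into (n_mels x frames_per_image) 'images'."""
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)               # (n_mels, n_frames)
    n_images = logmel.shape[1] // frames_per_image
    return [logmel[:, i * frames_per_image:(i + 1) * frames_per_image]
            for i in range(n_images)]

def classify_segment(images, cnn_predict):
    """Accumulate per-image class probabilities and pick the top class.
    `cnn_predict` is a hypothetical classifier returning a probability vector."""
    probs = np.sum([cnn_predict(img) for img in images], axis=0)
    return int(np.argmax(probs))
```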

Sound event detection using deep neural networks

TELKOMNIKA Telecommunication Computing Electronics and Control, 2020

We applied various architectures of deep neural networks to sound event detection and compared their performance on two different datasets. A feed-forward neural network (FNN), a convolutional neural network (CNN), a recurrent neural network (RNN), and a convolutional recurrent neural network (CRNN) were implemented using hyper-parameters optimized for each architecture and dataset. The results show that the performance of the networks varied significantly with the learning rate, which can be optimized by conducting a series of experiments on the validation data over predetermined ranges. Among the implemented architectures, the CRNN performed best under all testing conditions, followed by the CNN. Although the RNN was effective in tracking time-correlation information in audio signals, it performed worse than the CNN and the CRNN. Accordingly, further optimization strategies are needed for applying RNNs to sound event detection.

Acoustic Event Classification Using Convolutional Neural Networks

2017

The classification of human-made acoustic events is important for the monitoring and recognition of human activities or critical behavior. In our experiments on acoustic event classification for use in the health-care sector, we defined acoustic events that represent critical situations for elderly people or people with disabilities in ambient assisted living environments, or for patients in hospitals. This contribution presents our work on acoustic event classification using deep learning techniques. We implemented and trained various convolutional neural networks for the extraction of deep feature vectors, making use of current best practices in neural network design, to establish a baseline for acoustic event classification. We convert chunks of audio signals into magnitude spectrograms and treat acoustic events as images. Our data set contains 20 different acoustic events which were collected in two recording sessions combining human and environmental sounds. ...

Convolutional Recurrent Neural Networks for Rare Sound Event Detection

2017

Sound events possess certain temporal and spectral structure in their time-frequency representations. The spectral content for the samples of the same sound event class may exhibit small shifts due to intra-class acoustic variability. Convolutional layers can be used to learn high-level, shift invariant features from time-frequency representations of acoustic samples, while recurrent layers can be used to learn the longer term temporal context from the extracted high-level features. In this paper, we propose combining these two in a convolutional recurrent neural network (CRNN) for rare sound event detection. The proposed method is evaluated over DCASE 2017 challenge dataset of individual sound event samples mixed with everyday acoustic scene samples. CRNN provides significant performance improvement over two other deep learning based methods mainly due to its capability of longer term temporal modeling.
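A minimal PyTorch sketch of the general CRNN pattern the abstract describes, convolutional feature extraction followed by recurrent temporal modelling; the layer sizes and the GRU choice are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Convolutional front end + recurrent temporal modelling (illustrative only)."""
    def __init__(self, n_mels=40, n_classes=1, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),              # pool frequency, keep time resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), hidden, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, n_frames)
        h = self.conv(x)                       # (batch, 64, n_mels // 4, n_frames)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, n_frames, features)
        h, _ = self.gru(h)
        return torch.sigmoid(self.out(h))      # frame-wise event activity
```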

DNN and CNN with Weighted and Multi-task Loss Functions for Audio Event Detection

ArXiv, 2017

This report presents our audio event detection system submitted for Task 2, "Detection of rare sound events", of DCASE 2017 challenge. The proposed system is based on convolutional neural networks (CNNs) and deep neural networks (DNNs) coupled with novel weighted and multi-task loss functions and state-of-the-art phase-aware signal enhancement. The loss functions are tailored for audio event detection in audio streams. The weighted loss is designed to tackle the common issue of imbalanced data in background/foreground classification while the multi-task loss enables the networks to simultaneously model the class distribution and the temporal structures of the target events for recognition. Our proposed systems significantly outperform the challenge baseline, improving F-score from 72.7% to 90.0% and reducing detection error rate from 0.53 to 0.18 on average on the development data. On the evaluation data, our submission obtains an average F1-score of 88.3% and an error rat...
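As a hedged illustration of how a weighted loss can address the background/foreground imbalance mentioned above; the weighting scheme and the foreground weight are assumptions, and the paper's exact weighted and multi-task losses are not reproduced here.

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred, target, fg_weight=10.0):
    """Up-weight the rare foreground frames in a binary cross-entropy loss.
    `fg_weight` is an assumed value, not the one used in the paper."""
    weights = torch.where(target > 0.5,
                          torch.full_like(target, fg_weight),
                          torch.ones_like(target))
    return F.binary_cross_entropy(pred, target, weight=weights)
```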

Classifier Architectures for Acoustic Scenes and Events: Implications for DNNs, TDNNs, and Perceptual Features from DCASE 2016

This paper evaluates neural network (NN) based systems and compares them to Gaussian mixture model (GMM) and hidden Markov model (HMM) approaches for acoustic scene classification (SC) and polyphonic acoustic event detection (AED), applied to data of the "Detection and Classification of Acoustic Scenes and Events 2016" (DCASE'16) challenge, task 1 and task 3, respectively. For both tasks, the use of deep neural networks (DNNs) and features based on an amplitude modulation filterbank and a Gabor filterbank (GFB) is evaluated and compared to standard approaches. For SC, a time-delay NN approach is additionally proposed that enables analysis of long contextual information, similar to recurrent NNs but with training effort comparable to conventional DNNs. The SC system proposed for task 1 of the DCASE'16 challenge attains a recognition accuracy of 77.5%, which is 5.6% higher than the DCASE'16 baseline system. For the AED task, DNNs are adopted in tandem and hybrid approaches, i.e., as part of HMM-based systems. These systems are evaluated on the polyphonic data of task 3 of the DCASE'16 challenge. Several strategies to address the issue of polyphony are considered. It is shown that DNN-based systems perform less accurately than the traditional systems for this task. Best results are achieved using GFB features in combination with a multiclass GMM-HMM back end.

Spatio-Temporal Attention Pooling for Audio Scene Classification

Interspeech 2019, 2019

Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weigh and pool the recurrent output into a final feature vector for classification. The network is trained with between-class examples generated by between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset.
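A minimal sketch of outer-product attention pooling over a recurrent output, in the spirit of the description above; the score layers, softmax placement, and tensor shapes are assumptions rather than the paper's exact attention design.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttentionPool(nn.Module):
    """Pool a recurrent output (batch, time, feat) with an outer-product mask."""
    def __init__(self, n_feat):
        super().__init__()
        self.temporal_score = nn.Linear(n_feat, 1)       # one score per time step
        self.spatial_score = nn.Linear(n_feat, n_feat)   # scores over feature dims

    def forward(self, h):                                   # h: (B, T, n_feat)
        alpha = torch.softmax(self.temporal_score(h), dim=1)            # (B, T, 1)
        beta = torch.softmax(self.spatial_score(h.mean(dim=1)), dim=-1)  # (B, n_feat)
        mask = alpha * beta.unsqueeze(1)     # outer product -> (B, T, n_feat)
        return (mask * h).sum(dim=1)         # pooled feature vector (B, n_feat)
```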

Hierarchic ConvNets Framework for Rare Sound Event Detection

Proceedings of EUSIPCO 2018, 2018

In this paper, we propose a system for rare sound event detection using a hierarchical and multi-scaled approach based on Convolutional Neural Networks (CNN). The task consists of detecting event onsets in artificially generated mixtures. Spectral features are extracted from frames of the acoustic signals; a first event detection stage then operates as a binary classifier at frame rate and proposes to the second stage contiguous blocks of frames that are assumed to contain a sound event. The second stage refines the event detection of the prior network, discarding blocks that contain background sounds wrongly classified by the first stage. Finally, the effective onset time of the active event is obtained. The performance of the algorithm has been assessed with the material provided for the second task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2017. The achieved overall error rate and F-measure, 0.22 and 88.50% respectively on the evaluation dataset, significantly outperform the challenge baseline, and the system offers improved generalization performance with a reduced number of free network parameters w.r.t. other competitive algorithms.
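A schematic sketch of the two-stage proposal-and-refinement idea described above; `block_verify`, the threshold, and the hop size are hypothetical placeholders, not the authors' system.

```python
import numpy as np

def detect_onset(frame_probs, block_verify, frame_hop_s=0.02, threshold=0.5):
    """Two-stage sketch: frame-level proposals refined by a block-level verifier.
    `block_verify(start, end)` is a hypothetical second-stage classifier that
    accepts or rejects a contiguous block of frame indices."""
    active = frame_probs > threshold
    start = None
    for i, a in enumerate(np.append(active, False)):   # sentinel closes last block
        if a and start is None:
            start = i
        elif not a and start is not None:
            if block_verify(start, i):                 # second stage discards false blocks
                return start * frame_hop_s             # onset time in seconds
            start = None
    return None                                        # no verified event
```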

CURE DATASET: LADDER NETWORKS FOR AUDIO EVENT CLASSIFICATION

IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim), Victoria, Canada, August 21-23, 2019

Audio event classification is an important task for several applications such as surveillance and audio, video, and multimedia retrieval. There are approximately 3 million people with hearing loss who cannot perceive events happening around them. This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss. We propose a ladder network based audio event classifier that utilizes 5 s sound recordings derived from the Freesound project. We adopted state-of-the-art convolutional neural network (CNN) embeddings as audio features for this task. We also investigate an extreme learning machine (ELM) for event classification. In this study, the proposed classifiers are compared with a support vector machine (SVM) baseline. We propose signal and feature normalization that aims to reduce the mismatch between different recording scenarios. Firstly, the CNN is trained on weakly labeled AudioSet data. Next, the pre-trained model is adopted as a feature extractor for the proposed CURE corpus. We incorporate the ESC-50 dataset as a second evaluation set. Results and discussions validate the superiority of the ladder network over the ELM and SVM classifiers in terms of robustness and increased classification accuracy. While the ladder network is robust to data mismatches, the simpler SVM and ELM classifiers are sensitive to such mismatches, where the proposed normalization techniques can play an important role. Experimental studies with the ESC-50 and CURE corpora elucidate the differences in dataset complexity and the robustness offered by the proposed approaches.