Convolutional Neural Network based Audio Event Classification

Acoustic Event Classification Using Convolutional Neural Networks

2017

The classification of human-made acoustic events is important for monitoring and recognizing human activities or critical behavior. In our experiments on acoustic event classification for use in the health-care sector, we defined different acoustic events which represent critical events for elderly people or people with disabilities in ambient assisted living environments, or for patients in hospitals. This contribution presents our work on acoustic event classification using deep learning techniques. We implemented and trained various convolutional neural networks for the extraction of deep feature vectors, making use of current best practices in neural network design to establish a baseline for acoustic event classification. We convert chunks of audio signals into magnitude spectrograms and treat acoustic events as images. Our data set contains 20 different acoustic events which were collected in two different recording sessions, combining human and environmental sounds. ...
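
As a concrete illustration of the preprocessing step described above, here is a minimal sketch that turns an audio chunk into a normalized magnitude-spectrogram "image", assuming librosa; the FFT size, hop length, and sample rate are illustrative choices, not the authors' settings.

```python
import numpy as np
import librosa

def chunk_to_spectrogram(chunk, n_fft=1024, hop_length=256):
    """Convert a 1-D audio chunk into a magnitude-spectrogram 'image'."""
    stft = librosa.stft(chunk, n_fft=n_fft, hop_length=hop_length)
    mag = np.abs(stft)                      # magnitude spectrogram
    mag_db = librosa.amplitude_to_db(mag)   # log scale, common for CNN input
    # Normalize to [0, 1] so the CNN sees image-like input
    mag_db -= mag_db.min()
    return mag_db / (mag_db.max() + 1e-8)
```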

A Convolutional Neural Network Approach for Acoustic Scene Classification

This paper presents a novel application of convolutional neural networks (CNNs) to the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrograms. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems which competed in the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), which constitute improvements of 6.4% and 9% with respect to the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving on the challenge winner's score by 1%.
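
A minimal PyTorch sketch of a CNN of the kind described, operating on log-mel spectrogram patches; the layer sizes and class count are assumptions for illustration, not the paper's architecture.

```python
import torch.nn as nn

class SceneCNN(nn.Module):
    """Small CNN over log-mel patches, shape (batch, 1, mels, frames).
    Layer sizes and the class count are illustrative assumptions."""
    def __init__(self, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),        # global pooling over time-frequency
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```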

Identifying the Sound Event Recognition using Machine learning And Artificial Intelligence

Design Engineering, 2021

The analysis of sound information is extremely useful in a variety of applications such as multimedia information retrieval, audio surveillance, audio tagging, and forensic investigations. An audio clip is analyzed in order to detect sound events. Applications of this technology include security systems, smart vehicle navigation, and noise pollution monitoring. Sound Event Recognition (SER) is the focus of this research proposal. Compared to long-duration audio scenes, sound events have a short duration of about 100 to 500 milliseconds. This paper trains and tests a machine-learning model that can be incorporated into an automated data-collection process. The experiments compare a convolutional neural network (CNN), a support vector machine (SVM), a hidden Markov model (HMM), and a random forest.
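
For the classical-classifier side of such a comparison, a hedged sketch using scikit-learn is shown below; the fixed-length feature vectors (e.g., per-clip mean MFCCs), kernel, and tree count are assumptions, and the paper's CNN and HMM components are not reproduced here.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y):
    """X: (n_clips, n_features) feature matrix, y: labels.
    Returns mean 5-fold cross-validation accuracy per model."""
    models = {
        "SVM": SVC(kernel="rbf"),
        "RandomForest": RandomForestClassifier(n_estimators=200),
    }
    return {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}
```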

Continuous robust sound event classification using time-frequency features and deep learning

PLOS ONE, 2017

The automatic detection and recognition of sound events by computers is a requirement for a number of emerging sensing and human-computer interaction technologies. Recent advances in this field have been achieved by machine learning classifiers working in conjunction with time-frequency feature representations. This combination has achieved excellent accuracy for the classification of discrete sounds. The ability to recognise sounds under real-world noisy conditions, called robust sound event classification, is an especially challenging task that has attracted recent research attention. Another aspect of real-world conditions is the classification of continuous, occluded, or overlapping sounds, rather than of short isolated sound recordings. This paper addresses the classification of noise-corrupted, occluded, overlapped, continuous sound recordings. It first proposes a standard evaluation task for such sounds, based upon a common existing method for evaluating isolated sound classification. It then adapts several high-performing isolated-sound classifiers to operate on continuous sound data by incorporating an energy-based event detection front end. Results are reported for each tested system using the new task, providing the first analysis of their performance for continuous sound event detection. In addition, it proposes and evaluates a novel Bayesian-inspired front end for the segmentation and detection of continuous sound recordings prior to classification.
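
A minimal sketch of an energy-based event detection front end of the kind mentioned, assuming per-frame log energies as input; the threshold and minimum event length are illustrative.

```python
import numpy as np

def energy_events(frames_db, threshold_db=-40.0, min_frames=5):
    """Return (start, end) frame ranges where energy exceeds a threshold.
    frames_db: per-frame log energy; threshold and min length are illustrative."""
    active = frames_db > threshold_db
    events, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                       # event onset
        elif not is_active and start is not None:
            if i - start >= min_frames:     # discard very short blips
                events.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_frames:
        events.append((start, len(active)))
    return events
```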

CNN-based Segmentation and Classification of Sound Streams under realistic conditions

Proceedings of the 26th Pan-Hellenic Conference on Informatics

Audio datasets support the training and validation of machine learning algorithms in audio classification problems. Such datasets include different, arbitrarily chosen audio classes. We initially investigate a unifying approach, based on mapping audio classes according to the AudioSet ontology. Using the ESC-10 audio dataset, a tree-like representation of its classes is created. In addition, we employ an audio similarity calculation tool based on the values of extracted features (spectral centroid, spectral flux, and spectral roll-off). This way the audio classes are connected both semantically and in a feature-based manner. Employing the same dataset, ESC-10, we perform sound classification using CNN-based algorithms, after transforming the sound excerpts into images (based on their Mel spectrograms). The YAMNet and VGGish networks are used for audio classification and the accuracy reaches 90%. We extend the classification algorithm with segmentation logic, so that it can be applied to more complex sound excerpts, where multiple sound types are included in a sequential and/or overlapping manner. Quantitative metrics are defined on the behavior of the combined segmentation and classification functionality, including two key parameters for the merging operation: the minimum duration of the identified sounds and the intervals between them. The qualitative metrics are related to the number of sound identification events for a concatenated sound excerpt of the dataset and per each sound class. This way the segmentation logic can operate in a fine- and coarse-grained manner, while the dataset and the individual sound classes are characterized in terms of clearness and distinguishability.
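
A hedged sketch of classification with YAMNet via TensorFlow Hub, in the spirit of the pipeline above rather than the authors' exact code; the input is assumed to be a mono 16 kHz waveform.

```python
import numpy as np
import tensorflow_hub as hub

# YAMNet published on TF-Hub; returns frame-level scores, embeddings,
# and the log-mel spectrogram it computed internally.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def classify(waveform_16k):
    """waveform_16k: float32 numpy array, mono, 16 kHz, range [-1, 1].
    Returns the index of the top class after averaging over frames."""
    scores, embeddings, log_mel = yamnet(waveform_16k)
    return int(np.argmax(scores.numpy().mean(axis=0)))
```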

Early Detection of Continuous and Partial Audio Events Using CNN

Interspeech 2018, 2018

Sound event detection is an extension of the static auditory classification task into continuous environments, where performance depends jointly upon the detection of overlapping events and their correct classification. Several approaches have been published to date which either develop novel classifiers or employ well-trained static classifiers with a detection front-end. This paper takes the latter approach, by combining a proven CNN classifier acting on spectrogram image features, with time-frequency shaped energy detection that identifies seed regions within the spectrogram that are characteristic of auditory energy events. Furthermore, the shape detector is optimised to allow early detection of events as they are developing. Since some sound events naturally have longer durations than others, waiting until completion of entire events before classification may not be practical in a deployed system. The early detection capability of the system is thus evaluated for the classification of partial events. Performance for continuous event detection is shown to be good, with accuracy being maintained well when detecting partial events.
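
One simple way to realize early detection, sketched below under assumptions: classify growing prefixes of an event's spectrogram and commit as soon as the classifier is confident. The step size and confidence threshold are illustrative, and this is not the paper's shape-detector mechanism.

```python
import numpy as np

def early_classify(spec, classify_fn, step=10, conf_threshold=0.8):
    """spec: (mels, frames) spectrogram of a developing event.
    classify_fn maps a (partial) spectrogram to class probabilities.
    Returns (class index, frames consumed before the decision)."""
    probs = classify_fn(spec)               # fallback: the full event
    for end in range(step, spec.shape[1], step):
        partial = classify_fn(spec[:, :end])
        if partial.max() >= conf_threshold:
            return int(np.argmax(partial)), end  # decided early
    return int(np.argmax(probs)), spec.shape[1]
```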

Sound event detection using deep neural networks

TELKOMNIKA Telecommunication Computing Electronics and Control, 2020

We applied various architectures of deep neural networks to sound event detection and compared their performance using two different datasets. A feed-forward neural network (FNN), a convolutional neural network (CNN), a recurrent neural network (RNN), and a convolutional recurrent neural network (CRNN) were implemented using hyper-parameters optimized for each architecture and dataset. The results show that the performance of deep neural networks varied significantly depending on the learning rate, which can be optimized by conducting a series of experiments on the validation data over predetermined ranges. Among the implemented architectures, the CRNN performed best under all testing conditions, followed by the CNN. Although the RNN was effective in tracking the time-correlation information in audio signals, it exhibited inferior performance compared to the CNN and the CRNN. Accordingly, further optimization strategies need to be developed for applying RNNs to sound event detection.
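
A minimal PyTorch sketch of a CRNN in the spirit described: a convolutional front end over log-mel frames followed by a recurrent layer over time. All layer sizes are assumptions, not the optimized hyper-parameters from the paper.

```python
import torch.nn as nn

class CRNN(nn.Module):
    """Illustrative CRNN: CNN over frequency, GRU over time.
    Layer sizes are assumptions, not the paper's configuration."""
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # pool frequency, keep time
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 4), 128,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(256, n_classes)

    def forward(self, x):                    # x: (batch, 1, mels, frames)
        z = self.cnn(x)                      # (batch, 64, mels // 4, frames)
        z = z.permute(0, 3, 1, 2).flatten(2) # (batch, frames, features)
        h, _ = self.rnn(z)
        return self.out(h)                   # frame-wise class scores
```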

An Overview of Audio Event Detection Methods from Feature Extraction to Classification

Applied Artificial Intelligence, 2017

Audio streams, such as news broadcasts, meeting rooms, and special-purpose video, comprise sound from an extensive variety of sources. The detection of audio events, including speech, coughing, gunshots, etc., leads to intelligent audio event detection (AED). With substantial attention geared toward AED for various types of applications, such as security, speech recognition, speaker recognition, home care, and health monitoring, scientists are now more motivated to perform extensive research on AED. The deployment of AED is actually a more complicated task when going beyond exclusively highlighting audio events in terms of feature extraction and classification in order to select the best features with high detection accuracy. To date, a wide range of different detection systems based on intelligent techniques have been utilized to create machine learning-based audio event detection schemes. Nevertheless, previous studies do not encompass any state-of-the-art review of the proficiency and significance of such methods for resolving audio event detection problems. The major contribution of this work entails reviewing and categorizing existing AED schemes into preprocessing, feature extraction, and classification methods. The importance of the algorithms and methodologies, as well as their proficiency and restrictions, are additionally analyzed in this study. This research is expanded by critically comparing audio detection methods and algorithms according to accuracy and false alarms using different types of datasets.
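
A hedged sketch of the generic three-stage AED pipeline the review categorizes (preprocessing, feature extraction, classification); the specific choices here (pre-emphasis, mean MFCCs, SVM) are illustrative stand-ins, not a method endorsed by the survey.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def preprocess(y):
    """Pre-emphasis filter, a common preprocessing step."""
    return np.append(y[0], y[1:] - 0.97 * y[:-1])

def extract_features(y, sr):
    """One fixed-length vector per clip: mean MFCCs over time."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

def train_classifier(clips, labels, sr=16000):
    """clips: list of 1-D waveforms; labels: event classes."""
    X = np.stack([extract_features(preprocess(y), sr) for y in clips])
    return SVC(probability=True).fit(X, labels)
```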

Classifier Architectures for Acoustic Scenes and Events: Implications for DNNs, TDNNs, and Perceptual Features from DCASE 2016

This paper evaluates neural network (NN) based systems and compares them to Gaussian mixture model (GMM) and hidden Markov model (HMM) approaches for acoustic scene classification (SC) and polyphonic acoustic event detection (AED), applied to data of the "Detection and Classification of Acoustic Scenes and Events 2016" (DCASE'16) challenge, task 1 and task 3, respectively. For both tasks, the use of deep neural networks (DNNs) and of features based on an amplitude modulation filterbank and a Gabor filterbank (GFB) is evaluated and compared to standard approaches. For SC, a time-delay NN approach is additionally proposed that enables the analysis of long contextual information, similar to recurrent NNs but with training effort comparable to conventional DNNs. The SC system proposed for task 1 of the DCASE'16 challenge attains a recognition accuracy of 77.5%, which is 5.6% higher than the DCASE'16 baseline system. For the AED task, DNNs are adopted in tandem and hybrid approaches, i.e., as part of HMM-based systems. These systems are evaluated on the polyphonic data of task 3 of the DCASE'16 challenge. Several strategies to address the issue of polyphony are considered. It is shown that DNN-based systems perform less accurately than the traditional systems on this task. The best results are achieved using GFB features in combination with a multiclass GMM-HMM back end.
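
Since a time-delay NN layer can be expressed as a dilated 1-D convolution over feature frames, a minimal PyTorch sketch is given below; the context and dilation values are illustrative, not the paper's configuration.

```python
import torch.nn as nn

class TDNNLayer(nn.Module):
    """Time-delay NN layer as a dilated 1-D convolution over frames.
    Context and dilation are illustrative, not the paper's settings."""
    def __init__(self, in_dim, out_dim, context=2, dilation=1):
        super().__init__()
        # Kernel spans [-context, +context] frames; dilation widens the
        # temporal receptive field without extra parameters.
        self.conv = nn.Conv1d(in_dim, out_dim,
                              kernel_size=2 * context + 1,
                              dilation=dilation,
                              padding=context * dilation)
        self.act = nn.ReLU()

    def forward(self, x):                    # x: (batch, features, frames)
        return self.act(self.conv(x))
```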

CURE DATASET: LADDER NETWORKS FOR AUDIO EVENT CLASSIFICATION

IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim), August 21-23, 2019, Victoria, Canada, 2019

Audio event classification is an important task for several applications such as surveillance, audio, video, and multimedia retrieval. There are approximately 3M people with hearing loss who cannot perceive events happening around them. This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant for people with hearing loss. We propose a ladder network based audio event classifier that utilizes 5 s sound recordings derived from the Freesound project. We adopt state-of-the-art convolutional neural network (CNN) embeddings as audio features for this task. We also investigate an extreme learning machine (ELM) for event classification. In this study, the proposed classifiers are compared with a support vector machine (SVM) baseline. We propose signal and feature normalization that aims to reduce the mismatch between different recording scenarios. First, a CNN is trained on weakly labeled AudioSet data. Next, the pre-trained model is adopted as a feature extractor for the proposed CURE corpus. We incorporate the ESC-50 dataset as a second evaluation set. Results and discussions validate the superiority of the ladder network over the ELM and SVM classifiers in terms of robustness and increased classification accuracy. While the ladder network is robust to data mismatches, the simpler SVM and ELM classifiers are sensitive to such mismatches, where the proposed normalization techniques can play an important role. Experimental studies with the ESC-50 and CURE corpora elucidate the differences in dataset complexity and the robustness offered by the proposed approaches.
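
A hedged sketch of the embedding-plus-baseline stage: a CNN pretrained on AudioSet (here VGGish via TensorFlow Hub, which is an assumption; the paper's exact embedding model may differ) feeds an SVM baseline with feature normalization.

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.svm import SVC

# VGGish published on TF-Hub: maps a waveform to (frames, 128) embeddings.
vggish = hub.load("https://tfhub.dev/google/vggish/1")

def embed(waveform_16k):
    """waveform_16k: float32 mono 16 kHz array -> mean 128-D embedding."""
    return vggish(waveform_16k).numpy().mean(axis=0)

def train_svm_baseline(clips, labels):
    """clips: list of waveforms; labels: event classes."""
    X = np.stack([embed(clip) for clip in clips])
    # Feature normalization, in the spirit of the mismatch reduction above
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    return SVC().fit(X, labels)
```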