Deep Convolutional Neural Networks for Environmental Sound Classification

Automatic Environmental Sound Recognition (AESR) Using Convolutional Neural Network

International Journal of Modern Education and Computer Science, 2020

Automatic Environmental Sound Recognition (AESR) is an essential topic in modern pattern recognition research. A short audio file of a sound event can be converted into a spectrogram image and fed to a Convolutional Neural Network (CNN) for processing. Features generated from that image are used to classify environmental sound events such as sea waves, fire crackling, dog barking, lightning, rain, and many more. We used the log-mel spectrogram auditory feature to train our six-layer stacked CNN model. We evaluated the accuracy of our model in classifying environmental sounds on three publicly available datasets and achieved 92.9% on the UrbanSound8K dataset, 91.7% on the ESC-10 dataset, and 65.8% on the ESC-50 dataset. These results show a remarkable improvement in environmental sound recognition using only a stacked CNN compared to multiple previous works, and also show the efficiency of the log-mel spectrogram feature in sound recognition compared to Mel Frequency Cepstral Coefficients (MFCC), wavelet transformation, and the raw waveform. We also experimented with the newly published Rectified Adam (RAdam) optimizer. Our study further presents a comparative analysis of the adaptive learning rate optimizer Adam and the RAdam optimizer when training the model to correctly classify environmental sounds with an image recognition architecture.
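
The log-mel spectrogram feature named in the abstract above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the sample rate, FFT size, hop length, number of mel bands, and all function names below are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centers are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=22050, n_fft=1024, hop=512, n_mels=64):
    # All default parameter values are illustrative assumptions.
    signal = np.asarray(signal, dtype=float)
    if signal.size < n_fft:
        signal = np.pad(signal, (0, n_fft - signal.size))
    n_frames = 1 + (signal.size - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    # Power spectrum per frame: shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Convert to decibels; the small constant avoids log(0).
    return 10.0 * np.log10(mel_power + 1e-10).T  # (n_mels, n_frames)
```

The resulting array has one row per mel band and one column per frame, so it can be treated as a single-channel image and passed to a 2D CNN.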

A new pyramidal concatenated CNN approach for environmental sound classification

Applied Acoustics, 2020

Recently, there has been increasing interest in Environmental Sound Classification (ESC), an important topic within non-speech audio classification. This study proposes a novel approach based on deep Convolutional Neural Networks (CNN). The proposed approach comprises several stages: pre-processing, deep-learning-based feature extraction, feature concatenation, feature reduction, and classification. In the first stage, the input sound signals are denoised and converted into sound images using the Short-Time Fourier Transform (STFT). After the sound images are formed, pre-trained CNN models are used for deep feature extraction; the VGG16, VGG19, and DenseNet201 models are considered in this stage. The feature extraction is performed in a pyramidal fashion, which makes the dimension of the feature vector quite large. For both dimension reduction and the determination of the most efficient features, a feature selection mechanism is applied after the feature concatenation stage. In the last stage of the proposed method, a Support Vector Machine (SVM) classifier is used. The performance of the proposed method is evaluated on various ESC datasets such as ESC-10, ESC-50, and UrbanSound8K. The experiments show that the proposed method achieves accuracy scores of 94.8%, 81.4%, and 78.14% on the ESC-10, ESC-50, and UrbanSound8K datasets, respectively. The obtained results are also compared with those of state-of-the-art methods.

End-to-End Environmental Sound Classification using a 1D Convolutional Neural Network

Expert Systems with Applications

In this paper, we present an end-to-end approach for environmental sound classification based on a 1D Convolutional Neural Network (CNN) that learns a representation directly from the audio signal. Several convolutional layers are used to capture the signal's fine time structure and learn diverse filters that are relevant to the classification task. The proposed approach can deal with audio signals of any length, as it splits the signal into overlapped frames using a sliding window. Different architectures considering several input sizes are evaluated, including the initialization of the first convolutional layer with a Gammatone filterbank that models the human auditory filter response in the cochlea. The performance of the proposed end-to-end approach in classifying environmental sounds was assessed on the UrbanSound8K dataset, and the experimental results show that it achieves a mean accuracy of 89%. The proposed approach therefore outperforms most state-of-the-art approaches that use handcrafted features or 2D representations as input. Furthermore, it has a small number of parameters compared to other architectures found in the literature, which reduces the amount of data required for training.
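
The sliding-window step described in the abstract above, which lets a fixed-input network handle audio of any length, might look roughly like the following sketch. The function name, the zero-padding of the final frame, and the frame and hop sizes are assumptions for illustration; the abstract does not specify them.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Split a 1D signal into overlapping frames of equal length.

    Frames overlap whenever hop < frame_len. The tail of the signal is
    zero-padded so that every sample falls into at least one frame
    (this padding policy is an assumption, not taken from the paper).
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + int(np.ceil((signal.size - frame_len) / hop))
    padded_len = (n_frames - 1) * hop + frame_len
    signal = np.pad(signal, (0, padded_len - signal.size))
    # Row t of the index matrix selects samples [t*hop, t*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

# Example: 50% overlap on a short toy signal.
frames = split_into_frames(np.arange(10), frame_len=4, hop=2)
# frames has shape (4, 4); consecutive rows share two samples.
```

Each row can then be fed to the 1D network independently, and the per-frame outputs combined into a clip-level decision by whatever rule the model uses.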

Sound Classification Using Convolutional Neural Network and Tensor Deep Stacking Network

IEEE Access, 2019

Sound plays an important role in every aspect of human life. From personal security to critical surveillance, sound is a key element in developing automated systems for these fields. A few systems are already on the market, but their efficiency is a point of concern for their implementation in real-life scenarios. The learning capabilities of deep learning architectures can be used to develop sound classification systems that overcome the efficiency issues of traditional systems. Our aim in this paper is to use deep learning networks to classify environmental sounds based on spectrograms generated from these sounds. We used the spectrogram images of environmental sounds to train a convolutional neural network (CNN) and a tensor deep stacking network (TDSN). We used two datasets for our experiment: ESC-10 and ESC-50. Both systems were trained on these datasets, and the achieved accuracy was 77% and 49% in CNN and 56% in TDSN trained on the ESC-10. From this experiment, it is concluded that the proposed approach for sound classification using the spectrogram images of sounds can be efficiently used to develop sound classification and recognition systems. INDEX TERMS Deep learning, convolutional neural network, tensor deep stacking networks, spectrograms.

Deep Learning For Natural Sound Classification

2019

Nowadays, it is very common to use sensors for controlling the population of different animal species in a natural environment. A large number of sensors can be deployed over wide areas, and they capture information relentlessly, producing a huge amount of data. However, having humans analyse the collected data is a big challenge, and for that reason it is necessary to develop automated technologies to help experts with that task. Within this context, we present an automatic system to detect and classify sounds, especially those generated by birds and insects, among other sounds that can be heard in a natural environment. For the development of the system, it was necessary to generate a sound database. The recorded database consists of field recordings in three different Natural Parks, with sounds of several bird and insect species, as well as background noises. The automated system employs state-of-the-art neural networks for detecting and classifying sound frames. Exper...

Acoustic Scene Analysis and Classification Using Densenet Convolutional Neural Network

In this paper we present an account of the state of the art in Acoustic Scene Classification (ASC), the task of classifying environmental scenarios through the sounds they produce. Our work aims to classify 50 different outdoor and indoor scenarios using environmental sounds. We use a dataset ESC-50 from the IEEE challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), from which we propose to use 2000 different environmental audio recordings. In this method, the raw audio data is converted into a Mel-spectrogram and other characteristics such as Tonnetz, Chroma, and MFCC. The generated Mel-spectrogram is fed as input to a neural network for training. Our model follows a neural network structure built from convolution and pooling layers. With a focus on real-time environmental classification, and to overcome the problem of low generalization in the model, the paper introduces augmentation that produces noise-modified audio by adding Gaussian white noise. Active researche...
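
The noise-based augmentation mentioned in the abstract above can be illustrated with a short sketch. The abstract does not say how the noise level is chosen, so controlling it through a target signal-to-noise ratio (the hypothetical `snr_db` parameter) is an assumption made here for the example.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with additive Gaussian white noise.

    `snr_db` is the desired signal-to-noise ratio in decibels; this way
    of setting the noise level is an assumption, not the paper's method.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: augment a one-second 440 Hz tone at 20 dB SNR.
t = np.arange(22050) / 22050.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = add_white_noise(clean, snr_db=20.0, rng=np.random.default_rng(0))
```

Applying the function with several SNR values to each training clip yields additional noisy variants while leaving the labels unchanged.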

Hybrid Computerized Method for Environmental Sound Classification

IEEE Access, 2020

Classification of environmental sounds plays a key role in security, investigation, and robotics, since the study of the sounds present in a specific environment can yield significant insights. The lack of standardized methods for automatic and effective environmental sound classification (ESC) creates a need that must be urgently satisfied. As a response to this limitation, this paper proposes a hybrid model for automatic and accurate classification of environmental sounds. Optimum allocation sampling (OAS) is used to elicit the informative samples from each class. The representative samples obtained by OAS are turned into spectrograms containing their time-frequency-amplitude representation by using a short-time Fourier transform (STFT). The spectrogram is then given as input to pre-trained AlexNet and Visual Geometry Group (VGG)-16 networks. Multiple deep features are extracted using the pre-trained networks and classified using multiple classification techniques, namely decision tree (fine, medium, coarse kernel), k-nearest neighbor (fine, medium, cosine, cubic, coarse and weighted kernel), support vector machine, linear discriminant analysis, bagged tree, and softmax classifiers. The ESC-10, a ten-class environmental sound dataset, is used for the evaluation of the methodology. Accuracies of 90.1%, 95.8%, 94.7%, 87.9%, 95.6%, and 92.4% are obtained with the decision tree, k-nearest neighbor, support vector machine, linear discriminant analysis, bagged tree, and softmax classifiers, respectively. The proposed method proved to be robust, effective, and promising in comparison with other existing state-of-the-art techniques using the same dataset. INDEX TERMS Environmental sound classification, optimal allocation sampling, spectrogram, convolutional neural network, classification techniques.

An Ensemble One Dimensional Convolutional Neural Network with Bayesian Optimization for Environmental Sound Classification

Applied Sciences

With the growth of deep learning in various classification problems, many researchers have used deep learning methods in environmental sound classification tasks. This paper introduces an end-to-end method for environmental sound classification based on a one-dimensional convolutional neural network with Bayesian optimization and ensemble learning, which learns feature representations directly from the audio signal. Several convolutional layers were used to capture the signal and learn various filters relevant to the classification problem. Our proposed method can deal with any audio signal length, as a sliding window divides the signal into overlapped frames. Bayesian optimization accomplished hyperparameter selection and model evaluation with cross-validation. Multiple models with different settings have been developed based on Bayesian optimization to ensure network convergence in both convex and non-convex optimization. An UrbanSound8K dataset was evaluated for the performance of ...

Urban Sound Classification Using Convolutional Neural Network and Long Short Term Memory Based on Multiple Features

2020 Fourth International Conference On Intelligent Computing in Data Sciences (ICDS), 2020

There are many sounds all around us, and our brain can easily and clearly identify them. Furthermore, our brain processes the received sound signals continuously and provides us with relevant environmental knowledge. Although they do not reach the accuracy of the brain, some smart devices can extract the necessary information from an audio signal with the help of different algorithms, and more and more research is being conducted to increase the accuracy of this information extraction. Over the years, several models such as CNN, ANN, and RCNN, along with many machine learning techniques, have been adopted to classify sound accurately, and in recent years these have shown promising results in distinguishing spectro-temporal pictures. For our research, we use seven features: Chromagram, Mel-spectrogram, Spectral contrast, Tonnetz, MFCC, Chroma CENS, and Chroma cqt. We employed two models for classifying audio signals, LSTM and CNN, and the dataset used for the research is UrbanSound8K. The novelty of the research lies in showing that the LSTM achieves better classification accuracy than the CNN when the MFCC feature is used. Furthermore, we augmented the UrbanSound8K dataset to confirm that the accuracy of the LSTM is higher than that of the CNN on both the original dataset and the augmented one. Moreover, we tested the accuracy of the models based on the features used. This was done by using each of the features separately on each of the models, in addition to the two forms of feature stacking that we performed. The first form of feature stacking contains the features Chromagram, Mel-spectrogram, Spectral contrast, Tonnetz, and MFCC, while the second form contains MFCC, Mel-spectrogram, Chroma cqt, and Chroma stft. Likewise, we stacked features using different combinations to expand our research. In this way it was possible, with our LSTM model, to reach an accuracy of 98.80%, which is state-of-the-art performance.