Deep Convolutional Neural Networks for Environmental Sound Classification

Environmental sound classification using a regularized deep convolutional neural network with data augmentation

The adoption of environmental sound classification (ESC) has increased rapidly in recent years because of its broad range of applications in daily life. ESC, also known as Sound Event Recognition (SER), involves recognizing an audio stream in the context of various environmental sounds. Several common factors make the ESC problem difficult: non-uniform distance between the acoustic source and the microphone, differences in the framework, the presence of numerous sound sources in a recording, and overlapping sound events. This study employs deep convolutional neural networks (DCNN) with regularization and data enhancement, using basic audio features that have proven effective on ESC tasks. The performance of a DCNN with max-pooling (Model-1) and without max-pooling (Model-2) is examined. Three audio feature extraction techniques, Mel spectrogram (Mel), Mel Frequency Cepstral Coefficient (MFCC), and Log-Mel, are considered for the ESC-10, ESC-50, and Urban sound (US8K) datasets. Furthermore, to reduce the risk of overfitting caused by the limited amount of data, this study applies offline data augmentation to enlarge the datasets, in combination with L2 regularization. The evaluation shows that the best accuracy is attained by the proposed DCNN without max-pooling (Model-2) using Log-Mel features on the augmented datasets. For ESC-10, ESC-50, and US8K, the highest accuracies achieved are 94.94%, 89.28%, and 95.37%, respectively. The experimental results show that the proposed approach achieves the best performance on environmental sound classification problems.
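
The L2 regularization mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common form in which a penalty proportional to the sum of squared weights is added to the data loss. The function names, the coefficient `lam`, and the toy weights are hypothetical and are not taken from the paper.

```python
def l2_penalty(weights, lam):
    """Return lam * (sum of squared weights) over all layers.

    `weights` is a list of layers, each a flat list of floats.
    Some formulations include a factor of 1/2; this sketch does not.
    """
    return lam * sum(w * w for layer in weights for w in layer)


def regularized_loss(data_loss, weights, lam=1e-3):
    """Add the L2 penalty to a data loss such as cross-entropy."""
    return data_loss + l2_penalty(weights, lam)


# Toy example: two "layers" with made-up weights.
toy_weights = [[1.0, 2.0], [3.0]]
total = regularized_loss(0.5, toy_weights, lam=0.1)
```

Larger weights increase the penalty, so minimizing the combined loss discourages the large weights that often accompany overfitting on small datasets.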

CnnSound: Convolutional Neural Networks for the Classification of Environmental Sounds

2020 The 4th International Conference on Advances in Artificial Intelligence, 2020

The classification of environmental sounds (ESC) has been increasingly studied in recent years. The main reason is that environmental sounds are part of our daily life, and associating them with the environment we live in matters in several respects: ESC is used in areas such as smart city management, location determination from environmental sounds, surveillance systems, machine hearing, and environment monitoring. ESC is, however, more difficult than the classification of other sounds because many factors generate background noise, which makes the sounds harder to model and classify. The main aim of this study is therefore to develop a more robust convolutional neural network (CNN) architecture. For this purpose, 150 different CNN-based models were designed by varying the number of layers and the values of the tuning parameters used in the layers. To test the accuracy of the models, the Urbansound8k environmental sound database was used. The sounds in...

The Application and Improvement of Deep Neural Networks in Environmental Sound Recognition

Applied Sciences, 2020

Neural networks have achieved great results in sound recognition, and many different kinds of acoustic features have been tried as training input for the network. However, it remains doubtful whether a neural network can efficiently extract features from raw audio signal input. This study improved on the raw-signal-input networks of earlier research by using deeper network architectures, so that the raw signals could be better analyzed in the proposed network. We also discuss several kinds of network settings; with the spectrogram-like conversion, our network reached an accuracy of 73.55% on the open audio dataset “Dataset for Environmental Sound Classification 50” (ESC50). This study also proposed a network architecture that can combine different kinds of network feeds with different features. With the help of global pooling, a flexible fusion method was integrated into the network. Our experiment successfully combined two different networks with differen...

Spectral images based environmental sound classification using CNN with meaningful data augmentation

Applied Acoustics, 2020

In this study, an effective approach to environmental sound classification based on spectral images, using Convolutional Neural Networks (CNN) with meaningful data augmentation, is proposed. The feature used in this approach is the Mel spectrogram: features are defined from audio clips in the form of spectrogram images. The randomly selected CNN models used in this experiment are a 7-layer and a 9-layer CNN trained from scratch. In addition, various well-known deep learning structures are considered with transfer learning, following a procedure of freezing the initial layers, training the model, unfreezing the layers, and training the model again with discriminative learning. Three datasets, ESC-10, ESC-50, and Us8k, are considered. For the transfer learning methodology, 11 pre-trained deep learning structures are used. Instead of using the data augmentation schemes available for images, we propose meaningful data augmentation that applies variations to the audio clips directly. The results show the effectiveness, robustness, and high accuracy of the proposed approach. With transfer learning models, the meaningful data augmentation achieves the highest accuracy with a lower error rate on all datasets. Among the models used, ResNet-152 attained 99.04% on ESC-10 and 99.49% on Us8k, and DenseNet-161 attained 97.57% on ESC-50. To our knowledge, these are the best results achieved on these datasets.

Deep Learning-based Environmental Sound Classification Using Feature Fusion and Data Enhancement

Computers, Materials & Continua

Environmental sound classification (ESC) involves distinguishing an audio stream associated with numerous environmental sounds. Common factors such as differences in the framework, overlapping sound events, and the presence of various sound sources during recording make the ESC task considerably more complicated. This research proposes a deep learning model that improves the recognition rate of environmental sounds and reduces the model training time under limited computation resources. The performance of transformer and convolutional neural network (CNN) models is investigated. Seven audio features, namely chromagram, Mel-spectrogram, tonnetz, Mel-Frequency Cepstral Coefficients (MFCCs), delta MFCCs, delta-delta MFCCs, and spectral contrast, are extracted from the UrbanSound8K, ESC-50, and ESC-10 databases. Moreover, this research employs three data enhancement methods, namely white noise, pitch tuning, and time stretch, to reduce the risk of overfitting caused by the limited number of audio clips. The evaluation of various experiments demonstrates that the best performance was achieved by the proposed transformer model using the seven audio features on the enhanced databases. For UrbanSound8K, ESC-50, and ESC-10, the highest accuracies attained are 0.98, 0.94, and 0.97, respectively. The experimental results reveal that the proposed technique can achieve the best performance for ESC problems.
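
Of the three data enhancement methods named above, white noise is the simplest to sketch. The example below assumes the noise is Gaussian and scaled to a target signal-to-noise ratio. The function name, the `snr_db` parameter, and the test tone are hypothetical, and the abstract does not state how the noise level was chosen. Pitch tuning and time stretch usually rely on a phase vocoder or a dedicated audio library and are not shown.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=None):
    """Return a copy of `signal` with Gaussian white noise added.

    The noise variance is chosen so that the ratio of mean signal power
    to noise variance matches `snr_db` (in decibels).
    """
    if not signal:
        return []
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Example: a 440 Hz tone sampled at 8 kHz, augmented at 20 dB SNR.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1000)]
noisy = add_white_noise(tone, snr_db=20, seed=0)
```

Each augmented copy keeps the label of the original clip, so one recording can yield several training examples.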

Automatic Environmental Sound Recognition (AESR) Using Convolutional Neural Network

International Journal of Modern Education and Computer Science, 2020

Automatic Environmental Sound Recognition (AESR) is an essential topic in modern research in the field of pattern recognition. We can convert a short audio file of a sound event into a spectrogram image and feed that image to a Convolutional Neural Network (CNN) for processing. Features generated from that image are used to classify various environmental sound events such as sea waves, fire crackling, dog barking, lightning, rain, and many more. We used the log-mel spectrogram auditory feature to train our six-layer stacked CNN model. We evaluated the accuracy of our model on three publicly available datasets and achieved an accuracy of 92.9% on the urbansound8k dataset, 91.7% on the ESC-10 dataset, and 65.8% on the ESC-50 dataset. These results show a remarkable improvement in environmental sound recognition using only a stacked CNN compared to multiple previous works, and also show the efficiency of the log-mel spectrogram feature in sound recognition compared to Mel Frequency Cepstral Coefficients (MFCC), Wavelet Transformation, and raw waveform. We also experimented with the newly published Rectified Adam (RAdam) optimizer. Our study also presents a comparative analysis of the Adam adaptive learning rate optimizer and the RAdam optimizer when training the model to correctly classify environmental sounds with an image recognition architecture.
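
The "log" step of the log-mel feature described above can be sketched as follows. The sketch assumes the widely used mel-scale formula 2595 * log10(1 + f / 700) and decibel scaling with a small floor to avoid log(0). The function names and the floor value are hypothetical, and the mel filterbank and the spectrogram computation are omitted.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale (one common variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def power_to_db(power, floor=1e-10):
    """Convert a power value to decibels, clamping at `floor`."""
    return 10.0 * math.log10(max(power, floor))


def log_mel(mel_power_frames):
    """Apply log compression to frames of mel-band power values."""
    return [[power_to_db(p) for p in frame] for frame in mel_power_frames]


# Example with made-up mel-band powers for two frames of three bands.
features = log_mel([[1.0, 100.0, 0.01], [10.0, 0.0, 1.0]])
```

Log compression reduces the dynamic range of the band energies, which tends to make quiet and loud events more comparable as CNN inputs.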

A new pyramidal concatenated CNN approach for environmental sound classification

Applied Acoustics, 2020

Recently, there has been growing interest in Environmental Sound Classification (ESC), which is an important topic within non-speech audio classification. A novel approach based on deep Convolutional Neural Networks (CNN) is proposed in this study. The proposed approach comprises several stages: pre-processing, deep learning based feature extraction, feature concatenation, feature reduction, and classification. In the first stage, the input sound signals are denoised and converted into sound images using the Short-Time Fourier Transform (STFT). After the sound images are formed, pre-trained CNN models are used for deep feature extraction. In this stage, the VGG16, VGG19, and DenseNet201 models are considered. The feature extraction is performed in a pyramidal fashion, which makes the dimension of the feature vector quite large. For both dimension reduction and the selection of the most efficient features, a feature selection mechanism is applied after the feature concatenation stage. In the last stage of the proposed method, a Support Vector Machine (SVM) classifier is used. The efficiency of the proposed method is evaluated on the ESC-10, ESC-50, and UrbanSound8K datasets. The experiments show that the proposed method produced accuracy scores of 94.8%, 81.4%, and 78.14% on the ESC-10, ESC-50, and UrbanSound8K datasets, respectively. The results are also compared with those of state-of-the-art methods.
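
The conversion of a signal into a "sound image" with the STFT can be sketched as below. This is a deliberately naive, stdlib-only version for illustration: it assumes a Hann window and a direct DFT per frame, whereas practical code would use an FFT library. The function names, frame length, and hop size are hypothetical and are not taken from the paper.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len=8, hop=4):
    """Return a list of frames, each a list of magnitudes for bins 0..frame_len//2."""
    window = hann(frame_len)
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        spectrogram.append(mags)
    return spectrogram


# Example: a cosine with exactly 2 cycles per 8-sample frame.
wave = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
spec = stft_magnitude(wave, frame_len=8, hop=4)
```

Each row of `spec` is one time frame and each column one frequency bin, so the result can be rendered as a grayscale image and passed to an image model.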

End-to-End Environmental Sound Classification using a 1D Convolutional Neural Network

Expert Systems with Applications

In this paper, we present an end-to-end approach for environmental sound classification based on a 1D Convolutional Neural Network (CNN) that learns a representation directly from the audio signal. Several convolutional layers are used to capture the signal's fine time structure and learn diverse filters that are relevant to the classification task. The proposed approach can deal with audio signals of any length because it splits the signal into overlapped frames using a sliding window. Different architectures considering several input sizes are evaluated, including the initialization of the first convolutional layer with a Gammatone filterbank that models the human auditory filter response in the cochlea. The performance of the proposed end-to-end approach in classifying environmental sounds was assessed on the UrbanSound8k dataset, and the experimental results show that it achieves a mean accuracy of 89%. The proposed approach therefore outperforms most state-of-the-art approaches that use handcrafted features or 2D representations as input. Furthermore, the proposed approach has a small number of parameters compared to other architectures found in the literature, which reduces the amount of data required for training.
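
Splitting a signal of arbitrary length into overlapped frames with a sliding window, as described above, can be sketched as follows. The function name, the frame length, the hop size, and the zero-padding of the final frame are assumptions for illustration. The abstract does not specify how the tail of the signal is handled.

```python
def split_into_frames(signal, frame_len, hop):
    """Split `signal` into fixed-length frames that overlap when hop < frame_len.

    The last frame is zero-padded so that every sample is covered and
    every frame has exactly `frame_len` samples.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while True:
        frame = list(signal[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))  # pad a short tail
        frames.append(frame)
        if start + frame_len >= len(signal):
            break
        start += hop
    return frames


# Example: 10 samples, frames of 4 samples with 50% overlap.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
```

Because every frame has the same length, a network with a fixed input size can process each frame independently, regardless of the clip duration.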

IJERT-Improved Deep CNN with Reduced Parameters for Automatic Identification of Environmental Sounds

International Journal of Engineering Research and Technology (IJERT), 2019

https://www.ijert.org/improved-deep-cnn-with-reduced-parameters-for-automatic-identification-of-environmental-sounds
https://www.ijert.org/research/improved-deep-cnn-with-reduced-parameters-for-automatic-identification-of-environmental-sounds-IJERTCONV7IS13001.pdf

Deep learning techniques such as Convolutional Neural Networks (CNN) are steadily gaining momentum in environmental sound classification. Despite their excellent performance, CNNs pose a challenge in terms of hardware and memory requirements because of their computationally intensive nature. Recent trends in deep learning research focus on reducing the number of parameters in the deep learning framework without degrading performance. In this paper, we put forward a novel CNN architecture with reduced parameters for automatic environmental sound classification. The proposed architecture offers a parameter reduction of 24.16% and reduces the MAC operations by 20.17%, which indicates reduced computational complexity during hardware deployment. The impact of parameter reduction on model accuracy is analyzed by evaluating the proposed model on a publicly available database. The results indicate that the proposed architecture outperforms state-of-the-art approaches for automatic identification of environmental sounds.
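
To make the parameter and MAC (multiply-accumulate) figures above concrete, the sketch below counts both for a single 2D convolutional layer and computes a relative reduction. The layer shapes are hypothetical and are not the architecture from the paper. The formulas are the standard ones for a dense convolution with one bias per output channel.

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Number of trainable parameters in a 2D convolutional layer."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def conv_macs(k_h, k_w, c_in, c_out, h_out, w_out):
    """Multiply-accumulate operations for one forward pass of the layer."""
    return k_h * k_w * c_in * c_out * h_out * w_out


def reduction_percent(baseline, reduced):
    """Relative reduction from `baseline` to `reduced`, in percent."""
    return (baseline - reduced) / baseline * 100.0


# Hypothetical comparison: shrinking the output channels from 32 to 24.
base_params = conv_params(3, 3, 16, 32)
small_params = conv_params(3, 3, 16, 24)
saving = reduction_percent(base_params, small_params)
```

Parameter count determines memory footprint, while MAC count scales with the output feature map size and approximates compute cost, so the two reductions generally differ.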