Efficient Classification of Environmental Sounds through Multiple Features Aggregation and Data Enhancement Techniques for Spectrogram Images

PERFORMANCE ACCURACY OF CLASSIFICATION ON ENVIRONMENTAL SOUND CLASSIFICATION (ESC_50) DATASET

IJCIRAS, 2020

The classification of audio datasets aims to distinguish between different sources of audio, such as indoor, outdoor, and environmental sounds. The environmental sound classification (ESC-50) dataset consists of a labeled set of 2,000 environmental recordings. The spectral centroid method is applied to extract audio features from the ESC-50 dataset, whose recordings are in waveform audio file (WAV) format. Because decision trees are easy to implement and fast to fit and predict, the proposed system uses coarse and medium trees as classifiers. Five-fold cross-validation is applied to evaluate classifier performance. The system is implemented in MATLAB. The classification accuracy on ESC-50 is 63.8% for the coarse tree and 58.6% for the medium tree.
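As a rough illustration of this pipeline (not the authors' MATLAB implementation), the sketch below computes spectral-centroid summary features with librosa and fits shallow scikit-learn decision trees; the leaf-count caps are assumed analogues of MATLAB's coarse and medium tree presets, and `paths`/`labels` are hypothetical.

```python
# Minimal sketch: spectral-centroid features with shallow decision trees,
# assuming librosa and scikit-learn are available.
import numpy as np
import librosa
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def centroid_features(wav_path):
    """Summarize the per-frame spectral centroid of one WAV clip."""
    y, sr = librosa.load(wav_path, sr=None)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    return [centroid.mean(), centroid.std(), centroid.min(), centroid.max()]

# `paths` and `labels` are hypothetical lists describing the ESC-50 clips:
# X = np.array([centroid_features(p) for p in paths])

# MATLAB's "coarse" and "medium" trees cap the number of splits at roughly
# 4 and 20; max_leaf_nodes is used here as an analogue.
coarse = DecisionTreeClassifier(max_leaf_nodes=5)    # few splits
medium = DecisionTreeClassifier(max_leaf_nodes=21)   # more splits

# Five-fold cross-validation, as in the paper:
# print(cross_val_score(coarse, X, labels, cv=5).mean())
# print(cross_val_score(medium, X, labels, cv=5).mean())
```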

Environmental sound classification using a regularized deep convolutional neural network with data augmentation

The adoption of environmental sound classification (ESC) has increased rapidly in recent years due to its broad range of applications in daily life. ESC, also known as sound event recognition (SER), involves recognizing audio streams related to various environmental sounds. Frequent complications, such as the non-uniform distance between the acoustic source and the microphone, differences in recording conditions, the presence of numerous sound sources in a recording, and overlapping sound events, make the ESC problem complex. This study employs deep convolutional neural networks (DCNN) with regularization and data enhancement, using basic audio features that have proven effective on ESC tasks. The performance of a DCNN with max-pooling (Model-1) and without max-pooling (Model-2) is examined. Three audio feature extraction techniques, Mel spectrogram (Mel), Mel-frequency cepstral coefficients (MFCC), and log-Mel, are considered for the ESC-10, ESC-50, and UrbanSound8K (US8K) datasets. Furthermore, to reduce the risk of overfitting due to the limited amount of data, this study also introduces offline data augmentation techniques, combined with L2 regularization, to enhance the datasets. The performance evaluation shows that the best accuracy is attained by the proposed DCNN without max-pooling (Model-2) using log-Mel features on the augmented datasets. For ESC-10, ESC-50, and US8K, the highest achieved accuracies are 94.94%, 89.28%, and 95.37%, respectively. The experimental results show that the proposed approach achieves the best performance on environmental sound classification problems.
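A minimal Keras sketch of the Model-2 idea follows: log-Mel input, strided convolutions in place of max-pooling, and L2 regularization on each convolution. Layer sizes, the input shape, and the regularization strength are assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only, assuming TensorFlow/Keras and librosa.
import librosa
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def log_mel(wav_path, n_mels=128):
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)  # log-scaled Mel spectrogram

l2 = regularizers.l2(1e-4)  # assumed regularization strength
model = tf.keras.Sequential([
    layers.Input(shape=(128, 216, 1)),  # (mels, frames, channels), assumed
    # Strided convolutions downsample instead of max-pooling layers.
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu",
                  kernel_regularizer=l2),
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu",
                  kernel_regularizer=l2),
    layers.Conv2D(128, 3, strides=2, padding="same", activation="relu",
                  kernel_regularizer=l2),
    layers.Flatten(),
    layers.Dense(50, activation="softmax"),  # 50 classes for ESC-50
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```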

ENVIRONMENTAL SOUND RECOGNITION USING SPECTROGRAM IMAGE FEATURES

IJRET, 2017

Most prior research on audio recognition has focused on speech and music. Only in recent years have works on environmental sound recognition emerged and gained importance. For audio classification, many previous efforts use acoustic features such as Mel-frequency cepstral coefficients (MFCCs), zero crossing rate (ZCR), root mean square energy (RMSE), spectral centroid, spectral bandwidth, and other frequency-domain features derived from the spectrogram of the audio. In this paper, we take a slightly different approach to feature extraction: we summarize short audio clips of about five seconds by segmenting out the most prominent part of the signal. We then compute the spectrogram image of the segmented audio and divide it into sub-bands along the frequency axis. From each sub-band, we extract first-order statistics and Gray-Level Co-occurrence Matrix (GLCM) features. In the classification stage, we combine two support vector machine (SVM) classifiers: the first uses the first-order statistics and GLCM features; the second uses the acoustic features listed above. Their outputs are combined to obtain the final result. We evaluate our approach on two publicly available datasets, ESC-10 and Freiburg-106, with five-fold and ten-fold cross-validation, respectively. Experiments show that the proposed approach outperforms the baselines and provides results comparable to the state of the art.
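The sub-band texture idea might look like the following sketch, assuming librosa and scikit-image; the number of bands, the quantization, and the GLCM parameters are illustrative guesses rather than the paper's settings.

```python
# Rough sketch: split a spectrogram along frequency, then take first-order
# statistics and GLCM texture features from each band.
import numpy as np
import librosa
from skimage.feature import graycomatrix, graycoprops

def subband_features(y, sr, n_bands=4):
    S = np.abs(librosa.stft(y))                        # magnitude spectrogram
    S_db = librosa.amplitude_to_db(S, ref=np.max)
    # Quantize to 8-bit gray levels so the GLCM can be computed.
    rng = S_db.max() - S_db.min() + 1e-9
    img = np.uint8(255 * (S_db - S_db.min()) / rng)
    feats = []
    for band in np.array_split(img, n_bands, axis=0):  # split on freq axis
        feats += [band.mean(), band.std()]             # first-order statistics
        glcm = graycomatrix(band, distances=[1], angles=[0],
                            levels=256, symmetric=True, normed=True)
        for prop in ("contrast", "homogeneity", "energy", "correlation"):
            feats.append(graycoprops(glcm, prop)[0, 0])
    return np.array(feats)
```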

Hybrid Computerized Method for Environmental Sound Classification

IEEE Access, 2020

Classification of environmental sounds plays a key role in security, investigation, and robotics, since studying the sounds present in a specific environment can yield significant insights. The lack of standardized methods for automatic and effective environmental sound classification (ESC) creates a need that urgently requires attention. In response, this paper proposes a hybrid model for automatic and accurate classification of environmental sounds. Optimum allocation sampling (OAS) is used to elicit informative samples from each class. The representative samples obtained by OAS are turned into spectrograms containing their time-frequency-amplitude representation by means of a short-time Fourier transform (STFT). Each spectrogram is then given as input to pre-trained AlexNet and Visual Geometry Group (VGG)-16 networks. Multiple deep features are extracted using the pre-trained networks and classified with several techniques, namely decision trees (fine, medium, and coarse kernels), k-nearest neighbors (fine, medium, cosine, cubic, coarse, and weighted kernels), support vector machines, linear discriminant analysis, bagged trees, and softmax classifiers. ESC-10, a ten-class environmental sound dataset, is used to evaluate the methodology. Accuracies of 90.1%, 95.8%, 94.7%, 87.9%, 95.6%, and 92.4% are obtained with the decision tree, k-nearest neighbor, support vector machine, linear discriminant analysis, bagged tree, and softmax classifiers, respectively. The proposed method proved robust, effective, and promising in comparison with other existing state-of-the-art techniques on the same dataset.

Index Terms: environmental sound classification, optimal allocation sampling, spectrogram, convolutional neural network, classification techniques.
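A hedged PyTorch/torchvision sketch of the deep-feature stage is shown below; the exact layers tapped in the paper may differ, and the normalization constants are the standard ImageNet values.

```python
# Sketch: extract deep features from pre-trained AlexNet and VGG-16, then
# hand the fused vector to a classical classifier such as an SVM.
import torch
from torchvision import models, transforms
from sklearn.svm import SVC

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
vgg16 = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()

prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def deep_features(spectrogram_image):
    """spectrogram_image: a 3-channel PIL image of the STFT spectrogram."""
    x = prep(spectrogram_image).unsqueeze(0)
    with torch.no_grad():
        # Penultimate (4096-dim) activations of each network.
        a = alexnet.classifier[:-1](
            torch.flatten(alexnet.avgpool(alexnet.features(x)), 1))
        v = vgg16.classifier[:-1](
            torch.flatten(vgg16.avgpool(vgg16.features(x)), 1))
    return torch.cat([a, v], dim=1).squeeze(0).numpy()  # fused feature vector

# clf = SVC().fit(features_matrix, labels)   # hypothetical training data
```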

Deep Learning-based Environmental Sound Classification Using Feature Fusion and Data Enhancement

Computers, Materials & Continua

Environmental sound classification (ESC) involves distinguishing audio streams associated with numerous environmental sounds. Common complications, such as differences in recording conditions, overlapping sound events, and the presence of various sound sources during recording, make the ESC task complex. This research proposes a deep learning model to improve the recognition rate of environmental sounds and reduce model training time under limited computational resources. The performance of transformer and convolutional neural network (CNN) models is investigated. Seven audio features, chromagram, Mel spectrogram, tonnetz, Mel-frequency cepstral coefficients (MFCCs), delta MFCCs, delta-delta MFCCs, and spectral contrast, are extracted from the UrbanSound8K, ESC-50, and ESC-10 databases. Moreover, this research also employs three data enhancement methods, namely white noise, pitch tuning, and time stretch, to reduce the risk of overfitting due to the limited number of audio clips. The evaluation of various experiments demonstrates that the best performance is achieved by the proposed transformer model using the seven audio features on the enhanced databases. For UrbanSound8K, ESC-50, and ESC-10, the highest attained accuracies are 0.98, 0.94, and 0.97, respectively. The experimental results reveal that the proposed technique achieves the best performance for ESC problems.
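The seven features and three enhancement methods map fairly directly onto librosa calls; the sketch below assumes librosa, with the augmentation parameters (noise level, pitch steps, stretch rate) chosen for illustration only.

```python
# Sketch: the seven named features averaged over time, plus the three
# data enhancement methods.
import numpy as np
import librosa

def seven_features(y, sr):
    feats = [
        librosa.feature.chroma_stft(y=y, sr=sr),     # chromagram
        librosa.feature.melspectrogram(y=y, sr=sr),  # Mel spectrogram
        librosa.feature.tonnetz(y=y, sr=sr),         # tonnetz
    ]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    feats += [mfcc,
              librosa.feature.delta(mfcc),           # delta MFCCs
              librosa.feature.delta(mfcc, order=2),  # delta-delta MFCCs
              librosa.feature.spectral_contrast(y=y, sr=sr)]
    # Mean over frames, concatenated into one vector per clip.
    return np.concatenate([f.mean(axis=1) for f in feats])

def enhance(y, sr):
    """White noise, pitch tuning, and time stretch (parameters assumed)."""
    yield y + 0.005 * np.random.randn(len(y))                # white noise
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # pitch tuning
    yield librosa.effects.time_stretch(y, rate=1.1)          # time stretch
```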

Comparative Study of MFCC Feature with Different Machine Learning Techniques in Acoustic Scene Classification

The task of labelling an audio sample as recorded in an outdoor or indoor condition is called acoustic scene classification (ASC). ASC uses acoustic information to infer the context of the recorded environment. Since real-world ASC cannot be restricted to indoor environments, a new set of strategies and classification techniques needs to be considered for outdoor environments. In this paper, we present a comparative study of different machine learning classifiers using the Mel-frequency cepstral coefficients (MFCC) feature. We use the DCASE Challenge 2016 dataset to show the properties of the classifiers. Several classifiers can address the ASC task; here we compare k-nearest neighbors (KNN), support vector machines (SVM), decision trees (ID3), and linear discriminant analysis using the MFCC feature. Choosing the best classification methodology and feature extraction is essential for the ASC task. In this comparative study, we extract the MFCC feature from acoustic scene audio and apply it to the different classifiers to identify the advantages of each classifier for the MFCC feature. This paper also proposes an MFCC-moment feature for the ASC task, which incorporates the statistical moment information of the MFCC feature.
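A small scikit-learn harness for such a comparison could look like the sketch below; the MFCC-moment feature is approximated here with per-coefficient mean, variance, skewness, and kurtosis, which is an interpretation of, not necessarily identical to, the paper's definition.

```python
# Illustrative comparison harness, assuming librosa, SciPy, and scikit-learn.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def mfcc_moment(y, sr):
    """Statistical moments of each MFCC coefficient over time (assumed form)."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([m.mean(axis=1), m.var(axis=1),
                           skew(m, axis=1), kurtosis(m, axis=1)])

classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DecisionTree": DecisionTreeClassifier(criterion="entropy"),  # ID3-like
    "LDA": LinearDiscriminantAnalysis(),
}
# X, labels would come from the DCASE 2016 clips (not shown here):
# for name, clf in classifiers.items():
#     print(name, cross_val_score(clf, X, labels, cv=5).mean())
```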

A new pyramidal concatenated CNN approach for environmental sound classification

Applied Acoustics, 2020

Recently, there has been increasing interest in Environmental Sound Classification (ESC), an important topic in non-speech audio classification. A novel approach based on deep Convolutional Neural Networks (CNN) is proposed in this study. The proposed approach covers several stages: pre-processing, deep learning based feature extraction, feature concatenation, feature reduction, and classification. In the first stage, the input sound signals are denoised and converted into sound images using the Short-Time Fourier Transform (STFT). After the sound images are formed, pre-trained CNN models are used for deep feature extraction; in this stage, the VGG16, VGG19, and DenseNet201 models are considered. The feature extraction is performed in a pyramidal fashion, which makes the dimension of the feature vector quite large. For both dimension reduction and the selection of the most efficient features, a feature selection mechanism is applied after the feature concatenation stage. In the last stage of the proposed method, a Support Vector Machines (SVM) classifier is used. The efficiency of the proposed method is evaluated on several ESC datasets: ESC-10, ESC-50, and UrbanSound8K. The experimental results show that the proposed method produces accuracy scores of 94.8%, 81.4%, and 78.14% for the ESC-10, ESC-50, and UrbanSound8K datasets, respectively. The obtained results are also compared with the achievements of state-of-the-art methods.
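The later stages (concatenation, reduction, SVM) could be sketched as follows with scikit-learn; the univariate selector and the value of k are stand-ins, since the paper's actual feature selection mechanism may differ.

```python
# Condensed sketch of feature concatenation, selection, and classification.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# vgg16_f, vgg19_f, dense201_f: hypothetical (n_clips, dim) arrays of deep
# features extracted pyramidally from each pre-trained CNN.
# X = np.hstack([vgg16_f, vgg19_f, dense201_f])   # feature concatenation

pipeline = make_pipeline(
    SelectKBest(f_classif, k=1000),  # keep the 1000 highest-scoring features
    SVC(kernel="linear"),
)
# pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)
```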

Forest Sound Classification Dataset: FSC22

Sensors

The study of environmental sound classification (ESC) has become popular over the years due to the intricate nature of environmental sounds and the evolution of deep learning (DL) techniques. Forest ESC is one use case of ESC, which has been widely experimented with recently to identify illegal activities inside a forest. However, at present, there is a lack of public datasets covering all the possible sounds in a forest environment. Most of the existing experiments have been done using generic environmental sound datasets such as ESC-50, U8K, and FSD50K. Importantly, in DL-based sound classification, a lack of quality data can mislead the model, and the predictions obtained remain questionable. Hence, there is a need for a well-defined benchmark forest environment sound dataset. This paper proposes FSC22, which fills the gap of a benchmark dataset for forest environmental sound classification. It includes 2025 sound clips under 27 acoustic classes, which c...

Audio Classification Based on MFCC and GMM under Noise for Embedded System

Nowadays, digital audio applications are part of our everyday lives. These applications segment the audio stream into categories and respond accordingly to each category. For example, in an IP Network Camera (IPNC) system, when a screaming or window-breaking signal is detected, the system turns its motor toward the source of the abnormal sound. So far, a wide variety of features have been extracted from audio signals in either the temporal or frequency domain. Of these, the Mel-frequency cepstral coefficients (MFCC), which are frequency transformed and logarithmically scaled, appear to be universally recognized as the most generally effective for analyzing the human voice. The most common classification methods used for audio class recognition include Gaussian mixture models (GMM), k-nearest neighbor (k-NN), neural networks (NN), support vector machines (SVM), and hidden Markov models (HMM); the choice of classification method has been shown to be largely insignificant. In this paper, we use Gaussian mixture models (GMM) to classify the audio signal.
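A minimal sketch of per-class GMM classification over MFCC frames follows, assuming librosa and scikit-learn; the number of mixture components and the MFCC order are illustrative choices.

```python
# Sketch: one GMM per audio class trained on MFCC frames; a clip is labelled
# by the class model with the highest likelihood.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(y, sr):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

def train_gmms(clips_by_class, n_components=8):
    """clips_by_class: {label: [(y, sr), ...]} -- hypothetical structure."""
    gmms = {}
    for label, clips in clips_by_class.items():
        frames = np.vstack([mfcc_frames(y, sr) for y, sr in clips])
        gmms[label] = GaussianMixture(n_components).fit(frames)
    return gmms

def classify(y, sr, gmms):
    frames = mfcc_frames(y, sr)
    # Average per-frame log-likelihood under each class model; pick the max.
    return max(gmms, key=lambda label: gmms[label].score(frames))
```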