Audio classification based on sparse coefficients
Related papers
Dictionary learning based sparse coefficients for audio classification with max and average pooling
Digital Signal Processing, 2013
Audio classification is an important problem in signal processing and pattern recognition, with potential applications in audio retrieval, documentation and scene analysis. As in general signal classification systems, it involves both training and classification (or testing) stages. The performance of an audio classification system, such as its complexity and classification accuracy, depends highly on the choice of the signal features and the classifiers. Several features have been widely exploited in existing methods, such as the mel-frequency cepstrum coefficients (MFCCs), line spectral frequencies (LSF) and short-time energy (STE). In this paper, instead of using these well-established features, we explore the potential of sparse features derived from a dictionary of signal atoms using sparse coding (e.g. orthogonal matching pursuit, OMP), where the atoms are adapted directly to the audio training data using the K-SVD dictionary learning algorithm. To reduce the computational complexity, we propose to perform pooling and sampling operations on the sparse coefficients. These operations also maintain a unified dimension for the signal features, regardless of the varying lengths of the training and testing signals. Using the popular support vector machine (SVM) as the classifier, we examine the performance of the proposed classification system on two binary classification problems, namely speech-music classification and male-female speech discrimination, and a multi-class problem, speaker identification. The experimental results show that the sparse (max-pooled and average-pooled) coefficients perform better than the classical MFCC features, in particular for noisy audio data.
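The feature-extraction pipeline this abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary below is a toy orthonormal matrix (the K-SVD learning step is omitted), `omp` is a bare-bones greedy pursuit, and the pooling step follows the max/average scheme the abstract mentions.

```python
import numpy as np

def omp(D, x, k):
    """Greedy orthogonal matching pursuit: select up to k atoms of D
    (columns, assumed unit-norm) to approximate the signal x."""
    residual = x.astype(float).copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best-correlated atom
        if j not in support:
            support.append(j)
        sub = D[:, support]
        sol, *_ = np.linalg.lstsq(sub, x, rcond=None)  # re-fit on support
        residual = x - sub @ sol
    coef[support] = sol
    return coef

# Toy dictionary and two "frames" of a signal; in the paper, D would
# come from K-SVD and the frames from windowed audio.
D = np.eye(4)
frames = [np.array([0.0, 3.0, 0.0, 1.0]), np.array([2.0, 0.0, 0.0, 0.5])]
codes = np.vstack([omp(D, f, k=2) for f in frames])

# Pooling across frames gives a fixed-length feature vector
# regardless of how many frames the audio clip contains.
max_pooled = np.max(np.abs(codes), axis=0)   # -> [2., 3., 0., 1.]
avg_pooled = np.mean(np.abs(codes), axis=0)  # -> [1., 1.5, 0., 0.75]
```

The pooled vectors would then be fed to the SVM; their length depends only on the dictionary size, which is what makes variable-length clips comparable.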
A Study On Feature Analysis Using Sparse Representation For Music Classification
2019
This paper presents a first attempt to classify Myanmar ethnic music using a sparse representation classification (SRC) method, assigning class labels according to traditional ethnic style. Five Myanmar ethnic groups are considered: Kachin, Kayin, Mon, Shan and Rakhine. The system analyses temporal and spectral features, and the results obtained with the sparse representation classifier are compared with those of a K-nearest neighbours (KNN) classifier. Classification performance is evaluated over individual features and feature combinations for both SRC and KNN. With all features across all ethnic classes, the overall SRC accuracy is 64%, better than the overall KNN accuracy of 54%. The full combination of 114 features gives the best per-class result, 82%, for Kayin ethnic songs. The MFCC (std, delta mean) feature combination, tested on all five ethnic classes, yields the best classification accuracy of 75.00% with the SRC classifier, compared with 51.33% for the KNN classifier.
UNSUPERVISED LEARNING OF SPARSE FEATURES FOR SCALABLE AUDIO CLASSIFICATION
In this work we present a system to automatically learn features from audio in an unsupervised manner. Our method first learns an overcomplete dictionary which can be used to sparsely decompose log-scaled spectrograms. It then trains an efficient encoder which quickly maps new inputs to approximations of their sparse representations using the learned dictionary. This avoids expensive iterative procedures usually required to infer sparse codes. We then use these sparse codes as inputs for a linear Support Vector Machine (SVM). Our system achieves 83.4% accuracy in predicting genres on the GTZAN dataset, which is competitive with current state-of-the-art approaches. Furthermore, the use of a simple linear classifier combined with a fast feature extraction system allows our approach to scale well to large datasets.
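The "efficient encoder" idea above can be sketched in one line: instead of running an iterative solver per input, a learned linear filter followed by a shrinkage nonlinearity approximates the sparse code in a single feed-forward pass. The filter `W` and threshold `lam` here are placeholders; in the paper they would be trained so the output matches codes computed with the learned dictionary.

```python
import numpy as np

def soft_threshold_encoder(W, x, lam):
    """One feed-forward pass: linear filtering followed by soft
    thresholding, a cheap stand-in for iterative sparse inference."""
    u = W @ x
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# With an identity filter, entries below the threshold are zeroed out,
# giving a sparse code without any iterative optimisation.
z = soft_threshold_encoder(np.eye(3), np.array([3.0, -0.5, 1.5]), 1.0)
# z is sparse: [2.0, 0.0, 0.5]
```

The resulting codes can be passed straight to a linear SVM, which is what keeps the overall system fast at scale.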
Musical audio analysis using sparse representations
Sparse representations are becoming an increasingly useful tool in the analysis of musical audio signals. In this paper we give an overview of work by ourselves and others in this area, to give a flavour of the work being undertaken and to provide some pointers to further information about this interesting and challenging research topic.
Sparse coding based features for speech units classification
Computer Speech & Language, 2018
In this work, we propose sparse representation based features for speech unit classification tasks. In order to effectively capture the variations in a speech unit, the proposed method employs multiple class-specific dictionaries. Here, the training data belonging to each class is clustered into multiple clusters, and a principal component analysis (PCA) based dictionary is learnt for each cluster. It has been observed that coefficients corresponding to the middle principal components can effectively discriminate among different speech units. Exploiting this observation, we propose to use a transformation function known as weighted decomposition (WD) of principal components, which emphasizes the discriminative information present in the PCA-based dictionary. In this paper, both raw speech samples and mel frequency cepstral coefficients (MFCC) are used as an initial representation for feature extraction. For comparison, various popular dictionary learning techniques, such as K-singular value decomposition (KSVD), simultaneous codeword optimization (SimCO) and greedy adaptive dictionary (GAD), are also employed in the proposed framework. The effectiveness of the proposed features is demonstrated using continuous density hidden Markov model (CDHMM) based classifiers for (i) classification of isolated utterances of the E-set of the English alphabet, (ii) classification of consonant-vowel (CV) segments in the Hindi language and (iii) classification of phonemes from the TIMIT phonetic corpus.
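The per-cluster PCA dictionary step can be sketched compactly (the clustering and the weighted-decomposition re-weighting are omitted; the function name is illustrative):

```python
import numpy as np

def pca_dictionary(X, n_atoms):
    """Learn a PCA-based dictionary for one cluster of training vectors.
    X has shape (n_samples, dim); returns n_atoms principal directions
    as rows. In the paper, the middle components would then be
    emphasised via the weighted-decomposition transformation."""
    Xc = X - X.mean(axis=0)                        # centre the cluster
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_atoms]

# Toy cluster whose variance lies entirely along the first axis:
X = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [-1.0, 0.0]])
atoms = pca_dictionary(X, 1)
# atoms[0] is (up to sign) the unit vector [1, 0]
```

One such dictionary per cluster, concatenated across clusters of a class, gives the class-specific dictionary the abstract describes.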
Analysis of sparse representation based feature on speech mode classification
Interspeech 2018, 2018
Traditional phone recognition systems are developed using read speech. But, in reality, the speech that needs to be processed by a machine is not always in read mode. Therefore, to handle phone recognition in realistic scenarios, three broad modes of speech are considered in this study: read, conversation and extempore. The conversation mode includes informal communication in an unconstrained environment between two or more individuals. In the extempore mode, a person speaks with confidence without the help of notes. Read mode is a formal type of speech in a rigid environment. In this work, we propose a sparse representation based feature for speech mode classification. The effectiveness of sparse representation depends on the dictionary. Therefore, we learn multiple overcomplete dictionaries using the parallel atom-update dictionary learning (PAU-DL) technique to capture the discriminative characteristics present in the considered speech modes. Further, sparse features corresponding to the sequence of speech frames are derived from the learned dictionaries by applying the orthogonal matching pursuit (OMP) algorithm. The proposed sparse features are evaluated on speech corpora consisting of six Indian languages by performing classification of speech modes. The results with the proposed sparse features outperform those of the standard spectral, excitation source and prosodic features.
Proceedings of the SMC Conferences, 2015
Decomposition of the musical signal into the signals of the individual instruments is a fundamental task for musical signal processing. This paper proposes a decomposition algorithm for the musical signal based on non-negative sparse estimation. We estimate the coefficients of a linear combination by assuming that the feature vector of the given musical signal can be approximated as a linear combination of the elements in a pre-trained dictionary. Since the musical signal is considered a mixture of tones from several instruments and only a few tones appear at the same time, the coefficients must be non-negative and sparse if the musical signals are represented by non-negative vectors. In this paper, we use a feature vector based on autocorrelation functions. The experimental results show that the proposed decomposition method can accurately estimate the tone sequence from a musical signal played using two instruments.
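The non-negative estimation step can be sketched with a non-negative least-squares solve. SciPy's `nnls` is used here as a stand-in (the paper's estimator additionally promotes sparsity), and the dictionary and feature vectors below are toy placeholders rather than autocorrelation-based features.

```python
import numpy as np
from scipy.optimize import nnls

# Toy dictionary: each column is the feature vector of one instrument tone.
D = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [1.0, 1.0, 0.0]])

# Observed mixture: 2 units of tone 0 plus 3 units of tone 1.
x = 2.0 * D[:, 0] + 3.0 * D[:, 1]

# Coefficients constrained to be non-negative, as the abstract requires.
coef, residual_norm = nnls(D, x)
# coef recovers the mixing weights: approximately [2, 3, 0]
```

The non-negativity constraint alone already drives many coefficients to zero on such mixtures, which is why it pairs naturally with the sparsity assumption.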
Music genre classification via sparse representations of auditory temporal modulations
A robust music genre classification framework is proposed that combines the rich, psycho-physiologically grounded properties of slow temporal modulations of music recordings with the power of sparse representation-based classifiers. Linear subspace dimensionality reduction techniques are shown to play a crucial role within the framework under study. The proposed method yields a music genre classification accuracy of 91% and 93.56% on the GTZAN and the ISMIR2004 Genre datasets, respectively. Both accuracies outperform any reported accuracy obtained by state-of-the-art music genre classification algorithms on the aforementioned datasets.
Sparse Representation for Signal Classification
In this paper, the application of sparse representation (factorization) of signals over an overcomplete basis (dictionary) to signal classification is discussed. Searching for the sparse representation of a signal over an overcomplete dictionary is achieved by optimizing an objective function that includes two terms: one that measures the signal reconstruction error and another that measures the sparsity. This objective function works well in applications where signals need to be reconstructed, such as coding and denoising. On the other hand, discriminative methods, such as linear discriminant analysis (LDA), are better suited for classification tasks. However, discriminative methods are usually sensitive to corruption in signals because they lack the properties crucial for signal reconstruction. In this paper, we present a theoretical framework for signal classification with sparse representation. The approach combines the discrimination power of discriminative methods with the reconstruction property and the sparsity of sparse representation, which enable it to deal with signal corruptions: noise, missing data and outliers. The proposed approach is therefore capable of robust classification with a sparse representation of signals. The theoretical results are demonstrated on signal classification tasks, showing that the proposed approach outperforms the standard discriminative methods and the standard sparse representation in the case of corrupted signals.
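The two-term objective mentioned above typically takes the following form, with a parameter λ trading off reconstruction error against sparsity (the ℓ₀ "norm" counting nonzeros is often relaxed to the ℓ₁ norm to make the problem convex):

```latex
\min_{\alpha} \; \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_0
```

Here x is the signal, D the overcomplete dictionary, and α the sparse coefficient vector.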
Sparse representation features for speech recognition
2010
In this paper, we explore the use of exemplar-based sparse representations (SRs) to map test features into the linear span of training examples. We show that the frame classification accuracy with these new features is 1.3% higher than that of a Gaussian Mixture Model (GMM), showing that SRs not only move test features closer to the training data but also move them closer to the correct class. Given these new SR features, we train a Hidden Markov Model (HMM) on them and perform recognition.