Novel Approaches to Speech Detection in the Processing of Continuous Audio Streams

Feature extraction for speech and music discrimination

2008 International Workshop on Content-Based Multimedia Indexing, 2008

Driven by demand from information retrieval, video editing, and human-computer interfaces, this paper proposes a novel spectral feature for music and speech discrimination. The scheme attempts to simulate a biological model using the averaged cepstrum, in which human perception tends to pick up the areas of large cepstral change. Cepstral data far from the mean value are exponentially reduced in magnitude. We conduct music/speech discrimination experiments that compare the proposed feature with previously proposed features. Classification based on dynamic time warping shows that the proposed feature gives the best music/speech classification on the test database.
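
The abstract above names dynamic time warping (DTW) as the classifier but gives no details. The following is a minimal sketch of nearest-template classification with DTW, assuming each clip is already a sequence of feature vectors. The function names, the Euclidean local distance, and the template layout are illustrative assumptions, not the authors' implementation, and the proposed cepstral feature itself is not reproduced here.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean local distance (assumed)
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify_by_dtw(query, templates):
    """Return the label of the template closest to `query` under DTW.

    `templates` is a list of (label, sequence) pairs (hypothetical layout).
    """
    return min(templates, key=lambda item: dtw_distance(query, item[1]))[0]
```

Because DTW allows frames to repeat, a query that is a time-stretched copy of a template has distance zero to it, which is the property that makes it usable on clips of different lengths.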

Speech, music and songs discrimination in the context of handsets variability

2002

This paper addresses the discrimination of speech, music, and music with songs in telephony under handset variability. Two systems are proposed. The first system uses three Gaussian Mixture Models (GMMs), one each for speech, music, and songs. Each GMM comprises 8 Gaussians trained on very short sessions. Twenty-six speakers (13 female, 13 male) were randomly chosen from the SPIDRE corpus. The music was obtained from a large data set and comprises various styles. For 138 minutes of testing time, a speech discrimination score of 97.9% is obtained when no channel normalization is used. This performance is obtained with a relatively short analysis frame (32 ms sliding window, buffering of 100 ms). When channel normalization is used, a substantial score reduction (on the order of 10 to 20%) is observed. The second system is designed for applications requiring shorter processing times along with shorter training sessions. It is based on an empirical transformation of the ∆MFCC that enhances the dynamical evolution of tonality. On average it yields an acceptable discrimination rate of 90% (speech/music) and 84% (speech, music, and songs with music).
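
The first system described above scores audio against one GMM per class. A rough sketch of that decision rule follows, assuming diagonal covariances and summing frame log-likelihoods over a segment. The model parameters are hand-set toy values with two components in a two-dimensional space (hypothetical, not trained and not the paper's 8-Gaussian models), and the abstract does not state the covariance type.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log p(x) under a diagonal-covariance GMM for one feature vector x."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(ll)
    peak = max(log_terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def classify_segment(frames, models):
    """Pick the class whose GMM gives the highest total log-likelihood."""
    scores = {
        label: sum(gmm_log_likelihood(x, *params) for x in frames)
        for label, params in models.items()
    }
    return max(scores, key=scores.get)


# Toy models: (weights, means, variances) per class. Values are hypothetical.
models = {
    "speech": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "music": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
```

A third entry for songs would slot into `models` in the same way; longer segments give more frames to sum over and therefore a more stable decision, at the cost of latency.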

Speech and Music Classification and Separation: A Review

2006

The classification and separation of speech and music signals have attracted the attention of many researchers. Classification is needed to build two different libraries, a speech library and a music library, from a stream of sounds. Separation, by contrast, is needed in a cocktail-party problem to separate speech from music and remove the undesired one. In this paper, a review of the existing classification and separation algorithms is presented and discussed. The classification algorithms are divided into three categories: time-domain, frequency-domain, and time-frequency domain approaches. The time-domain approaches used in the literature are the zero-crossing rate (ZCR), the short-time energy (STE), the ZCR and the STE with positive derivative, some of their modified versions, the variance of the roll-off, and neural networks. The frequency-domain approaches are mainly based on the spectral centroid, the variance of the spectral centroid, the spectral flux, the variance of the spectral flux, the roll-off of the spectrum, the cepstral residual, and the delta pitch. The time-frequency domain approaches have not yet been tested thoroughly in the literature, so the spectrogram and the evolutionary spectrum are introduced. Some new algorithms for music and speech separation and segregation are also presented.
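
Three of the frequency-domain features listed above can be sketched in a few lines. This is a minimal illustration using a naive DFT from the standard library so that it is self-contained. The function names, the 85% roll-off fraction, and the unnormalized Euclidean form of the flux are assumptions, since definitions vary between papers and the abstract does not fix them.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT, fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(mag, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz; n is the frame length."""
    total = sum(mag)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(mag)) / total


def spectral_rolloff(mag, sample_rate, n, fraction=0.85):
    """Frequency in Hz below which `fraction` of the total magnitude lies."""
    threshold = fraction * sum(mag)
    running = 0.0
    for k, m in enumerate(mag):
        running += m
        if running >= threshold:
            return k * sample_rate / n
    return (len(mag) - 1) * sample_rate / n


def spectral_flux(mag_prev, mag_cur):
    """Euclidean distance between the magnitude spectra of successive frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mag_prev, mag_cur)))
```

The variance-based features in the list are then the variance of these per-frame values over a longer window.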

Speech/music discrimination using discrete hidden Markov models

2004

A speech/music discrimination system using discrete Hidden Markov Models has been designed. The system has been evaluated using separate training, development and test databases. The discrimination ability was examined using different features and feature combinations, and results are presented as error rates on the development and test databases. Features were chosen from knowledge about the speech signal. Adding the zero crossing rate, autocorrelation function and spectral gravity features to cepstrum coefficients helped to improve the discrimination result, while the cepstrum features were found to be more robust. The impact of the model size on the speech/music discrimination result was evaluated in particular. Different compositions of the database were also explored, with and without a good match. The best result in a mismatched situation, 2.3% error rate on test data, was achieved with 2x13 LFCC (13 Cepstrum coefficients and first time differentials), using 24 states and 96 symbol...
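
A discrete HMM classifier of this kind scores a sequence of quantized symbols under each class model and picks the more likely one. The sketch below shows the scaled forward algorithm for that score. It assumes the feature vectors have already been vector-quantized into integer symbols, and the two toy models (2 states, 2 symbols) are hypothetical values chosen for illustration, far smaller than the configuration reported in the abstract.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log P(obs) under a discrete HMM, using the scaled forward algorithm.

    start[s]    : initial probability of state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)  # rescale each step to avoid underflow
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def classify_sequence(obs, models):
    """Return the label of the HMM that assigns `obs` the highest likelihood."""
    return max(models, key=lambda label: forward_log_likelihood(obs, *models[label]))


# Toy models (hypothetical): "speech" switches state often, "music" is sticky.
emissions = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "speech": ([0.5, 0.5], [[0.2, 0.8], [0.8, 0.2]], emissions),
    "music": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], emissions),
}
```

With these toy parameters a rapidly alternating symbol sequence is assigned to "speech" and a steady one to "music", which mirrors the intuition that speech changes spectral character faster than sustained music.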

A speech/music discriminator based on RMS and zero-crossings

IEEE Transactions on Multimedia, 2005

Over the last several years, major efforts have been made to develop methods for extracting information from audiovisual media, so that they may be stored and retrieved in databases automatically, based on their content. In this work we deal with the characterization of an audio signal, which may be part of a larger audiovisual system or may be autonomous, as for example in the case of an audio recording stored digitally on disk. Our goal was first to develop a system for segmentation of the audio signal, and then to classify each segment into one of two main categories: speech or music. Among the system's requirements are its processing speed and its ability to function in a real-time environment with a short response delay. Because of the restriction to two classes, the characteristics that are extracted are considerably reduced and moreover the required computations are straightforward. Experimental results show that efficiency is exceptionally good, without sacrificing performance.
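
The title of this entry names the two measurements the system relies on, RMS and zero-crossings. A minimal sketch of computing both per frame follows. The framing helper and its parameters are assumptions for illustration, and the segmentation and decision logic that the paper builds on top of these values is not described in the abstract, so it is not reproduced here.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a sample list into overlapping frames, dropping any partial tail."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


def rms(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossings(frame):
    """Number of sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def frame_features(samples, frame_len, hop):
    """Per-frame (rms, zero_crossings) pairs for a whole signal."""
    return [(rms(f), zero_crossings(f)) for f in frame_signal(samples, frame_len, hop)]
```

Both measurements need only one pass over the samples with no transform, which is consistent with the abstract's emphasis on processing speed and real-time operation.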

Speech/music discrimination based on posterior probability features

1999

A hybrid connectionist-HMM speech recognizer uses a neural network acoustic classifier. This network estimates the posterior probability that the acoustic feature vectors at the current time step should be labelled as each of around 50 phone classes. We sought to exploit informal observations of the distinctions in this posterior domain between nonspeech audio and speech segments well-modeled by the network. We describe four statistics that successfully capture these differences, and which can be combined to make a reliable speech/nonspeech categorization that is closely related to the likely performance of the speech recognizer. We test these features on a database of speech/music examples, and our results match the previously-reported classification error, based on a variety of special-purpose features, of 1.4% for 2.5 second segments. We also show that recognizing segments ordered according to their resemblance to clean speech can result in an error rate close to the ideal minimum over all such subsetting strategies.
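
The abstract does not name its four statistics. As an illustration of the kind of quantity that can be computed in the posterior domain, the sketch below shows two plausible candidates: the mean per-frame entropy of the posterior distribution and the mean squared frame-to-frame change. Both the choice of statistics and the function names are assumptions made here, and the input is assumed to be a list of per-frame probability vectors over phone classes.

```python
import math


def mean_entropy(posteriors):
    """Average per-frame entropy (in nats) of the posterior distributions."""
    total = 0.0
    for frame in posteriors:
        total -= sum(p * math.log(p) for p in frame if p > 0)
    return total / len(posteriors)


def mean_dynamism(posteriors):
    """Average squared change in the posteriors between successive frames."""
    diffs = [
        sum((b - a) ** 2 for a, b in zip(prev, cur))
        for prev, cur in zip(posteriors, posteriors[1:])
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

The intuition is that a network trained on speech tends to give confident, quickly changing outputs on clean speech, and flatter, slower-moving outputs on audio it models poorly.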

Structure-Based Speech Classification Using Non-Linear Embedding Techniques

2000

"Usable speech" refers to those portions of corrupted speech that can be used to determine a reasonable number of distinguishing features of the speaker. It has previously been shown that using only voiced segments of speech improves the usable speech detection system, and also that unvoiced speech does not contribute significant information about the speaker(s) for speaker identification. Therefore, using a voiced/unvoiced speech detection system, voiced portions of co-channel speech are usually detected and extracted for use in usable speech extraction systems. The process of human speech production is complex, nonlinear and nonstationary. Its most precise description can only be realized in terms of nonlinear fluid dynamics.

Audio Source Classification using Speaker Recognition Techniques

A practical problem in processing any audio stream is to detect different types of audio and to treat each segment accordingly. This problem may be viewed as a combination of audio segmentation and audio source classification. This paper treats the latter problem, using a Gaussian Mixture Model (GMM). The problem is formulated as one of identification of several music models and two gender models for speech. First, an audio segment is classified as music or speech. Then, the type of musical instrument or the gender of the speaker is tagged. 1400 excerpts of music in different styles from over 70 composers were used together with the speech of 700 male and 700 female speakers. The audio signal was telephone quality, sampled at 8 kHz with µ-law amplitude encoding. A 1% error rate for speech versus music classification and a 1.9% gender classification error rate were achieved at speeds of more than three times real-time on a single core of a multi-core Xeon processor.
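
The µ-law amplitude encoding mentioned above compresses sample amplitudes logarithmically before quantization. The sketch below shows the continuous companding formula with µ = 255 on samples normalized to [-1, 1]. It is an illustration of the principle only, under the assumption that the continuous form is an adequate stand-in: telephony codecs use a segmented 8-bit approximation of this curve, which is not implemented here.

```python
import math

MU = 255.0  # value commonly used for 8-bit telephony mu-law


def mu_law_compress(x, mu=MU):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress: recover the linear sample."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)
```

Quiet samples are boosted strongly by the compression step (an input of 0.01 maps to roughly 0.23), so features computed on such audio are normally extracted after expanding back to linear amplitude.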