Implementation of cepstrum-based voiced/unvoiced classification

Cepstrum-based pitch detection using a new statistical V/UV classification algorithm

IEEE Transactions on Speech and Audio Processing, 1999

An improved cepstrum-based voicing detection and pitch determination algorithm is presented. Voicing decisions are made using a multifeature voiced/unvoiced classification algorithm based on statistical analysis of cepstral peak, zero-crossing rate, and energy of short-time segments of the speech signal. Pitch frequency information is extracted by a modified cepstrum-based method and then carefully refined using pitch tracking, correction, and smoothing algorithms. Performance analysis on a large database indicates considerable improvement relative to the conventional cepstrum method. The proposed algorithm is also shown to be robust to additive noise.
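A minimal sketch of the three frame-level features that drive such a voiced/unvoiced decision (cepstral peak, zero-crossing rate, short-time energy), assuming a Hamming-windowed frame and a 50–400 Hz pitch search range. The statistical classifier and the pitch tracking, correction, and smoothing stages of the paper are not reproduced, and all parameter values here are illustrative.

```python
import numpy as np

def frame_features(frame, fs, fmin=50.0, fmax=400.0):
    """Three frame-level V/UV features: cepstral peak, zero-crossing
    rate, and short-time energy, plus a candidate pitch estimate."""
    w = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(w, n=2 * len(frame))
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)

    # Search for the cepstral peak inside the plausible pitch quefrency range.
    q_lo, q_hi = int(fs / fmax), int(fs / fmin)
    peak_q = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    cepstral_peak = cepstrum[peak_q]

    # Zero-crossing rate (crossings per sample) and log energy in dB.
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    energy = 10.0 * np.log10(np.sum(frame ** 2) + 1e-12)

    # Candidate pitch in Hz; only meaningful if the frame is classified voiced.
    pitch_hz = fs / peak_q
    return cepstral_peak, zcr, energy, pitch_hz
```

In the paper these features feed a statistical multifeature classifier; a simple threshold on each feature already gives a rough voiced/unvoiced split for experimentation.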

A Comparison of Cepstral Features in the Detection of Pathological Voices by Varying the Input and Filterbank of the Cepstrum Computation

IEEE Access, 2021

Automatic voice pathology detection enables objective assessment of pathologies that affect the voice production mechanism. Detection systems have been developed using the traditional pipeline approach (consisting of a feature extraction part and a detection part) and using the modern deep learning-based end-to-end approach. Due to the lack of vast amounts of training data in the study area of pathological voice, the former approach is still a valid choice. In existing detection systems based on the traditional pipeline approach, mel-frequency cepstral coefficient (MFCC) features can be regarded as the de facto standard feature set. In this study, automatic voice pathology detection is investigated by comparing the performance of various MFCC variants derived by considering two factors: the input and the filterbank in the cepstrum computation. For the first factor, three inputs (the voice signal, the glottal source, and the vocal tract) are compared. The glottal source and the vocal tract are estimated using the quasi-closed phase glottal inverse filtering method. For the second factor, the mel-frequency and linear-frequency filterbanks are compared. Experiments were conducted separately using six databases consisting of voices produced by speakers suffering from one of four disorders (dysphonia, Parkinson's disease, laryngitis, or heart failure) and by healthy speakers. A support vector machine (SVM) was used as the classifier. The results show that by combining mel- and linear-frequency cepstral coefficients derived from the glottal source and vocal tract, better overall detection accuracy was obtained compared to the de facto MFCC features derived from the voice signal. Furthermore, this combination provided comparable or better performance than four existing cepstral feature extraction techniques in clean and high signal-to-noise ratio (SNR) conditions. Index Terms: voice disorders, glottal inverse filtering, support vector machine, cepstral coefficients.
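The sketch below illustrates the two factors compared in the paper: cepstral coefficients computed with either a mel-scale or a linear-scale triangular filterbank. The input frame may be the voice signal or a glottal source / vocal tract estimate obtained elsewhere (the quasi-closed phase inverse filtering step is not reproduced). Filterbank size, FFT length, and coefficient count are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(fs, n_fft, n_filters=26, scale="mel"):
    """Triangular filterbank on either a mel or a linear frequency scale."""
    f_max = fs / 2.0
    if scale == "mel":
        edges = mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_filters + 2))
    else:  # linear-frequency filterbank
        edges = np.linspace(0.0, f_max, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)   # rising edge
        fb[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)   # falling edge
    return fb

def cepstral_coeffs(signal_frame, fs, n_fft=512, n_filters=26, n_ceps=13, scale="mel"):
    """MFCC-style coefficients; 'signal_frame' may be the voice signal,
    an estimated glottal source frame, or a vocal tract response frame."""
    spec = np.abs(np.fft.rfft(signal_frame * np.hamming(len(signal_frame)), n_fft)) ** 2
    fb = triangular_filterbank(fs, n_fft, n_filters, scale)
    log_energies = np.log(fb @ spec + 1e-12)
    return dct(log_energies, type=2, norm="ortho")[:n_ceps]
```

Swapping `scale="mel"` for `scale="linear"` gives the linear-frequency variant; concatenating the two coefficient sets over the different inputs corresponds loosely to the feature combinations compared in the paper.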

Robustness of Auditory Teager Energy Cepstrum Coefficients for Classification of Pathological and Normal Voices in Noisy Environments

The Scientific World Journal, 2013

This paper focuses on a robust feature extraction algorithm for automatic classification of pathological and normal voices in noisy environments. The proposed algorithm is based on human auditory processing and the nonlinear Teager-Kaiser energy operator. The robust features, labeled Teager Energy Cepstrum Coefficients (TECCs), are computed in three steps. Firstly, each speech signal frame is passed through a Gammatone or mel-scale triangular filter bank. Then, the absolute value of the Teager energy operator of the short-time spectrum is calculated. Finally, the discrete cosine transform of the log-filtered Teager energy spectrum is applied. These features are used to identify pathological voices with a multilayer perceptron (MLP) neural classifier. We evaluate the developed method using a mixed voice database composed of recorded voice samples from normophonic and dysphonic speakers. In order to show the robustness of the proposed features in detecting pathological voices at different white Gaussian noise levels, we compare their performance with results for clean environments. The experimental results show that TECCs computed from the Gammatone filter bank are more robust in noisy environments than the other extracted features, while their performance in clean environments is practically the same.
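A rough sketch of the three TECC steps described above, assuming a precomputed triangular (mel-scale) filterbank matrix such as the `triangular_filterbank` helper sketched earlier; the Gammatone variant and the exact ordering of the authors' implementation may differ from this approximation.

```python
import numpy as np
from scipy.fft import dct

def teager(x):
    """Discrete Teager-Kaiser energy operator: psi[x_n] = x_n^2 - x_{n-1} x_{n+1}."""
    y = np.zeros_like(x)
    y[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    y[0], y[-1] = y[1], y[-2]   # replicate edge values
    return y

def tecc(frame, fs, filterbank, n_ceps=13):
    """Teager Energy Cepstrum Coefficients for one frame, given a
    precomputed triangular filterbank matrix (n_filters x (n_fft//2 + 1))."""
    n_fft = 2 * (filterbank.shape[1] - 1)
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    filtered = filterbank @ spec               # sub-band energies
    teager_energy = np.abs(teager(filtered))   # |TEO| applied across sub-bands
    return dct(np.log(teager_energy + 1e-12), type=2, norm="ortho")[:n_ceps]
```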

Formant estimation for noise robust vowel recognition based on spectral domain ramp cepstrum model

2012

The opportunity to work with my supervisor Dr. Shaikh Anowarul Fattah, whose dynamic ideas, support and guidance helped me throughout this research, was immense and extraordinary. I want to express my heartiest gratitude to him. Completing the work in time required using the lab facilities of the EEE department at different times. I would also like to thank the Head of the Department of Electrical and Electronic Engineering for allowing me to use the lab facilities. I wish to express my gratitude to Dr. Celia Shahnaz, who was a continuous source of inspiration and support. I express appreciation to my friends Shan, Raju, Partha and Rabu, for their suggestions, support and friendship.

Voice source cepstrum coefficients for speaker identification

2008

We propose a novel feature set for speaker recognition that is based on the voice source signal. The feature extraction process uses closed-phase LPC analysis to estimate the vocal tract transfer function. The LPC spectrum envelope is converted to cepstrum coefficients, which are used to derive the voice source features. Unlike approaches based on inverse filtering, our procedure is robust to LPC analysis errors and low-frequency phase distortion. We have performed text-independent closed-set speaker identification experiments on the TIMIT and the YOHO databases using a standard Gaussian mixture model technique. Compared to using mel-frequency cepstrum coefficients alone, the misclassification rate for the TIMIT database was reduced from 1.51% to 0.16% when the proposed voice source features were added. For the YOHO database the misclassification rate decreased from 13.79% to 10.07%. The new feature vector also compares favourably to other proposed voice source feature sets.
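The closed-phase analysis and the derivation of the source features from the envelope cepstrum are not reproduced here; the sketch below only shows the generic intermediate step of converting an all-pole LPC envelope to cepstrum coefficients via the standard LPC-to-cepstrum recursion, with ordinary autocorrelation-method LPC standing in for closed-phase analysis.

```python
import numpy as np
from scipy.linalg import toeplitz

def lpc_autocorr(frame, order):
    """LPC by the autocorrelation method: solve R a = r for the predictor
    coefficients a_k in x[n] ~ sum_k a_k x[n-k]."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])
    return a  # a[0] .. a[order-1] correspond to a_1 .. a_p

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of the all-pole envelope 1 / (1 - sum_k a_k z^{-k})
    via the standard LPC-to-cepstrum recursion."""
    p = len(a)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c
```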

USE OF CEPSTRUM-BASED PARAMETERS FOR AUTOMATIC PATHOLOGY DETECTION ON SPEECH - Analysis of Performance and Theoretical Justification

Proceedings of the First International Conference on Bio-inspired Systems and Signal Processing, 2008

Most speech signal analysis procedures for automatic pathology detection rely on parameters extracted from time-domain processing. Moreover, calculation of these parameters often requires prior pitch period estimation; their validity therefore depends heavily on the robustness of pitch detection. In this paper, an alternative approach based on cepstral-domain processing is presented which has the advantage of not requiring pitch estimation, thus providing a gain in both simplicity and robustness. While the proposed scheme is similar to solutions based on mel-frequency cepstral parameters already present in the literature, it offers an easier physical interpretation while achieving similar performance.
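The abstract does not list the specific cepstral parameters used, so the sketch below shows one representative pitch-free cepstral measure, a cepstral-peak-prominence-style value, purely as an illustration of cepstral-domain processing that needs no prior pitch estimate; the quefrency search range and FFT length are assumptions and not taken from the paper.

```python
import numpy as np

def cepstral_peak_prominence(frame, fs, fmin=60.0, fmax=330.0, n_fft=2048):
    """Cepstral-peak-prominence-style measure: height of the cepstral peak
    above a regression line fitted to the cepstrum, in dB. No explicit
    pitch estimate is needed; the peak is searched over a broad range."""
    w = frame * np.hamming(len(frame))
    log_mag = 20.0 * np.log10(np.abs(np.fft.rfft(w, n_fft)) + 1e-12)
    ceps = np.fft.irfft(log_mag, n_fft)

    q = np.arange(int(fs / fmax), int(fs / fmin))   # quefrency search range
    peak_idx = q[np.argmax(ceps[q])]

    # Regression line over the search range serves as the baseline.
    slope, intercept = np.polyfit(q, ceps[q], 1)
    return ceps[peak_idx] - (slope * peak_idx + intercept)
```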

A new model for the short-time complex cepstrum of voiced speech

IEEE Transactions on Acoustics, Speech, and Signal Processing, 1986

Traditionally, a very simple model for short-time homomorphic analysis has been used. It is shown that there is no theoretical justification for applying this model to voiced speech and that the model is of limited value for improving cepstral deconvolution procedures. Consequently, a more elaborate model is introduced in which the influence of window length is approximated and the spectral sampling inherent in voiced speech is explicitly represented. As a result, this new model shows that the vocal tract contribution to the complex cepstrum is repeated at every multiple of the pitch quefrency (np) and is multiplied by a double sinc-like distortion D(n). It is shown that in order to achieve deconvolution with a low-time gating system, a cepstral lifter of length np/2 should be used (instead of the usual length "less than np"). Furthermore, the lifter should compensate for the distortion D(n). Unfortunately, the accuracy of straightforward homomorphic deconvolution approximations is limited by aliasing distortion which results from the repeated nature of the vocal tract contribution. Nevertheless, reasonable deconvolution approximations are obtained.
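A simple sketch of low-time liftering using the paper's lifter length of np/2 (half the pitch quefrency in samples) rather than the usual "less than np". The compensation for the distortion D(n) and the complex-cepstrum formulation are not reproduced; the real cepstrum is used for brevity, and the pitch value is assumed known.

```python
import numpy as np

def liftered_envelope(frame, fs, pitch_hz, n_fft=1024):
    """Low-time liftering of the real cepstrum to approximate the vocal
    tract log-spectral envelope, with lifter length n_p/2."""
    w = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(w, n_fft)) + 1e-12)
    ceps = np.fft.irfft(log_mag, n_fft)

    n_p = int(round(fs / pitch_hz))   # pitch quefrency in samples
    cutoff = n_p // 2                 # lifter length n_p / 2, per the paper
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0
    lifter[-cutoff + 1:] = 1.0        # keep the symmetric negative quefrencies

    # Transform the liftered cepstrum back to a smoothed log spectrum.
    envelope = np.fft.rfft(ceps * lifter, n_fft).real
    return envelope
```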

Parametric cepstral analysis for pathological voice assessment

2008

Traditional methods to diagnose laryngeal pathologies, such as laryngoscopy, are considered invasive and uncomfortable. Methods based on acoustic analysis of speech signals have been investigated in order to diminish the number of laryngoscopical exams. Digital signal processing techniques have been used to perform acoustic analysis for vocal quality assessment due to the simplicity and the non-invasive nature of the measurement procedures. Their employment is of special interest, as they can provide an objective diagnosis of pathological voices, and may be used as a complementary tool to laryngoscopy. The reliability and effectiveness of the process of discriminating pathological voices from normal ones depend on the voice characteristics and parameters used to train the classifier. This paper aims at evaluating the performance of Linear Prediction Coding (LPC)-based cepstral analysis in discriminating pathological voices of speakers affected by vocal fold edema. For this purpose, LPC, cepstral, weighted cepstral, delta cepstral, weighted delta cepstral, and mel-cepstral coefficients are used. A vector-quantization-trained distance classifier is used in the discrimination process.
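As an illustration of two of the listed feature types, the sketch below computes sinusoidally weighted (liftered) cepstra and regression-based delta coefficients. The lifter form and delta window width are common textbook choices, not necessarily those used by the authors, and the vector-quantization-trained distance classifier is omitted.

```python
import numpy as np

def weighted_cepstrum(c):
    """Sinusoidal (band-pass) liftering often used as 'weighted cepstrum':
    w_m = 1 + (L/2) sin(pi m / L), for m = 1..L."""
    L = len(c)
    m = np.arange(1, L + 1)
    return (1.0 + 0.5 * L * np.sin(np.pi * m / L)) * c

def delta_features(frames, width=2):
    """Delta (first-order regression) coefficients over a sequence of
    cepstral vectors, where 'frames' has shape (n_frames, n_ceps)."""
    n = np.arange(-width, width + 1)
    denom = np.sum(n ** 2)
    padded = np.pad(frames, ((width, width), (0, 0)), mode="edge")
    return np.stack([
        np.sum(n[:, None] * padded[t:t + 2 * width + 1], axis=0) / denom
        for t in range(frames.shape[0])
    ])
```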