Automatic Speech Segmentation and Recognition using Class-Specific Features

Language independent automatic speech segmentation into phoneme-like units on the base of acoustic distinctive features

Some topics in cognitive infocommunication require the processing of continuous speech, and these often require segmenting the speech signal into phoneme-sized units. This kind of segmentation is necessary when the desired behavior depends on speech timing, such as rhythm or the location of voiced sounds (emotion or mood detection, language learning, acoustic feature visualization). Segmentation systems based on acoustic-phonetic knowledge of speech can be realized in a language-independent way. In this paper we introduce a language-independent solution based on the segmentation of continuous speech into 9 broad phonetic classes. Classification and segmentation were performed using Hidden Markov Models. Three databases were used to evaluate the segmentation systems: the Hungarian MRBA, German KIEL and English TIMIT databases. An average recognition rate of 80% was obtained.

A Speech Recognizer Based on Multiclass SVMs with HMM-Guided Segmentation

Lecture Notes in Computer Science, 2006

Automatic Speech Recognition (ASR) is essentially a problem of pattern classification; however, the time dimension of the speech signal has prevented ASR from being posed as a simple static classification problem. Support Vector Machine (SVM) classifiers could provide an appropriate solution, since they are very well adapted to high-dimensional classification problems. Nevertheless, the use of SVMs for ASR is by no means straightforward, mainly because SVM classifiers require an input of fixed dimension. In this paper we study the use of an HMM-based segmentation as a means of obtaining the fixed-dimension input vectors required by SVMs, in a problem of isolated-digit recognition. Different configurations for all the parameters involved have been tested. We also deal with the problem of multi-class classification (as SVMs are initially binary classifiers), studying two of the most popular approaches: 1-vs-all and 1-vs-1.
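
The two ideas in this abstract, a fixed-dimension input built from a segmentation and the two multi-class schemes, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-segment averaging, the function names and the toy frames are assumptions made for illustration, and the segment boundaries are simply given rather than produced by an HMM.

```python
from itertools import combinations


def fixed_length_vector(frames, boundaries):
    """Average the frames inside each segment and concatenate the means.

    frames: list of equal-length feature vectors, one per time frame.
    boundaries: strictly increasing exclusive segment end indices
                (assumed here to come from an HMM alignment).
    """
    vector, start = [], 0
    for end in boundaries:
        segment = frames[start:end]
        dim = len(segment[0])
        vector.extend(sum(f[d] for f in segment) / len(segment) for d in range(dim))
        start = end
    return vector


def one_vs_all_tasks(classes):
    # One binary problem per class: that class against all the others.
    return [(c, "rest") for c in classes]


def one_vs_one_tasks(classes):
    # One binary problem per unordered pair of classes.
    return list(combinations(classes, 2))


# Toy utterance: 5 frames of 2 features, split into 2 segments (hypothetical data).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
vec = fixed_length_vector(frames, [2, 5])

digits = list(range(10))
n_one_vs_all = len(one_vs_all_tasks(digits))
n_one_vs_one = len(one_vs_one_tasks(digits))
```

Whatever the utterance length, the vector has (number of segments) x (feature dimension) entries, which is what a fixed-input classifier needs. For ten digit classes, 1-vs-all trains 10 binary classifiers and 1-vs-1 trains 45.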

A Review: Automatic Speech Segmentation

Automated segmentation of speech signals has been under research for over 30 years. Many speech processing systems require segmentation of the speech waveform into principal acoustic units. Segmentation is the process of breaking down a speech signal into smaller units. It is a primary step in voice-activated systems such as speech recognition systems, and in the training of speech synthesis systems. Speech segmentation is performed using wavelets, fuzzy methods, Artificial Neural Networks and Hidden Markov Models.

Speech/Non-Speech Segmentation Based on Phoneme Recognition Features

EURASIP Journal on Advances in Signal Processing, 2006

This work assesses different approaches for speech and non-speech segmentation of audio data and proposes a new, high-level representation of audio signals based on phoneme recognition features suitable for speech/non-speech discrimination tasks. Unlike previous model-based approaches, where speech and non-speech classes were usually modeled by several models, we develop a representation where just one model per class is used in the segmentation process. For this purpose, four measures based on consonant-vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation-classification frameworks. The segmentation systems were evaluated on different broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features are better than the standard mel-frequency cepstral coefficients and posterior probability-based features (entropy and dynamism). The proposed features proved to be more robust and less sensitive to different training and unforeseen conditions. Additional experiments with fusion models based on cepstral and the proposed phoneme recognition features produced the highest scores overall, which indicates that the most suitable method for speech/non-speech segmentation is a combination of low-level acoustic features and high-level recognition features.

On the automatic segmentation of speech signals

ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing

For large-vocabulary and continuous speech recognition, the subword-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, automatic segmentation is preferable to manual segmentation, as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.
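
The second approach, detecting spectral change at unit boundaries, can be sketched in a few lines of Python. This is an illustrative simplification and not the method of the paper: the Euclidean distance between adjacent frames, the local-peak rule and the threshold value are all assumptions.

```python
import math


def spectral_change(frames):
    """Euclidean distance between each pair of adjacent feature frames."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def detect_boundaries(frames, threshold):
    """Return frame indices where a new segment is hypothesized to start.

    A boundary is placed where the frame-to-frame change is a local peak
    and at least `threshold` (a hypothetical tuning parameter).
    """
    change = spectral_change(frames)
    boundaries = []
    for i, value in enumerate(change):
        left = change[i - 1] if i > 0 else float("-inf")
        right = change[i + 1] if i + 1 < len(change) else float("-inf")
        if value >= threshold and value > left and value >= right:
            boundaries.append(i + 1)  # the boundary falls just before frame i + 1
    return boundaries


# Toy data: three frames near (0, 0), then three frames near (3, 3).
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
boundaries = detect_boundaries(frames, threshold=1.0)
```

On the toy data the only large jump is between the third and fourth frames, so a single boundary is hypothesized at frame index 3.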

Audio segmentation for speech recognition using segment features

2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009

Audio segmentation is an essential preprocessing step in several audio processing applications, with a significant impact on, for example, speech recognition performance. We introduce a novel framework which combines the advantages of different well-known segmentation methods. An automatically estimated log-linear segment model is used to determine the segmentation of an audio stream in a holistic way by a maximum a posteriori decoding strategy, instead of classifying change points locally. A comparison to other segmentation techniques in terms of speech recognition performance is presented, showing a promising segmentation quality of our approach.

Statistical corpus-based speech segmentation

Eighth International Conference on Spoken …, 2004

An automatic speech segmentation technique is presented that is based on the alignment of a target speech signal with a set of different reference speech signals generated by a specifically designed corpus-based speech synthesis system that additionally generates phoneme boundary markers. Each reference signal is then warped to the target speech signal. By synthesizing and warping many different reference speech signals, each phoneme boundary of the target signal is characterized by a distribution of warped phoneme boundary positions. The boundary distributions are statistically and acoustically processed in order to generate the final segmentation. First, some problems related to manual and automatic phoneme segmentation are addressed. Then the technique of Statistical Corpus-based Segmentation (SCS) is introduced. Finally, intra- and inter-speaker segmentation results are presented.
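
The idea of warping several references onto the target and collecting a distribution of boundary positions can be sketched as follows. This is a toy illustration under stated assumptions, not the SCS system: plain dynamic time warping over one-dimensional features stands in for the warping step, the median stands in for the statistical processing, and all sequences and function names are hypothetical.

```python
from statistics import median


def dtw_path(ref, tgt):
    """Dynamic time warping between two 1-D sequences; returns aligned index pairs."""
    n, m = len(ref), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return path[::-1]


def warp_boundary(path, ref_boundary):
    """First target frame aligned with the reference frame that starts a segment."""
    return min(j for i, j in path if i == ref_boundary)


# Target signal and three references, each with one known boundary marker.
target = [0, 0, 0, 5, 5]
references = [([0, 0, 5, 5], 2), ([0, 5, 5, 5], 1), ([0, 2, 5], 1)]

positions = [warp_boundary(dtw_path(ref, target), b) for ref, b in references]
final_boundary = median(positions)
```

Each reference contributes one warped boundary position; here two references place the boundary at target frame 3 and one at frame 2, and the median of that small distribution is 3.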

Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries

The Journal of the Acoustical Society of America, 2010

Despite using different algorithms, most unsupervised automatic phone segmentation methods achieve similar performance in terms of percentage correct boundary detection. Nevertheless, unsupervised segmentation algorithms are not able to perfectly reproduce manually obtained reference transcriptions. This paper investigates fundamental problems for unsupervised segmentation algorithms by comparing a phone segmentation obtained using only the acoustic information present in the signal with a reference segmentation created by human transcribers. The analyses of the output of an unsupervised speech segmentation method that uses acoustic change to hypothesize boundaries showed that acoustic change is a fairly good indicator of segment boundaries: over two-thirds of the hypothesized boundaries coincide with segment boundaries. Statistical analyses showed that the errors are related to segment duration, sequences of similar segments, and inherently dynamic phones. In order to improve unsupervised automatic speech segmentation, current one-stage bottom-up segmentation methods should be expanded into two-stage segmentation methods that are able to use a mix of bottom-up information extracted from the speech signal and automatically derived top-down information. In this way, unsupervised methods can be improved while remaining flexible and language-independent.
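
The kind of comparison described above, counting how many hypothesized boundaries coincide with reference boundaries, can be sketched as below. The 20 ms tolerance, the greedy one-to-one matching and the toy boundary times are assumptions for illustration; the abstract does not state the matching rule the paper uses.

```python
def boundary_precision(hypothesized, reference, tolerance_ms=20):
    """Fraction of hypothesized boundaries lying within tolerance of a reference boundary.

    Times are in integer milliseconds. Each reference boundary can be
    matched at most once (greedy, in time order).
    """
    if not hypothesized:
        return 0.0
    unused = sorted(reference)
    hits = 0
    for h in sorted(hypothesized):
        match = next((r for r in unused if abs(r - h) <= tolerance_ms), None)
        if match is not None:
            unused.remove(match)
            hits += 1
    return hits / len(hypothesized)


# Hypothetical boundary times in milliseconds.
hypothesized = [100, 250, 410, 600]
reference = [110, 400, 610, 800]
precision = boundary_precision(hypothesized, reference)
```

In the toy data three of the four hypothesized boundaries have a reference boundary within 20 ms, giving a value of 0.75.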

A Comparison of Different Approaches to Automatic Speech Segmentation

2002

We compare different methods for obtaining accurate speech segmentations starting from the corresponding orthography. The complete segmentation process can be decomposed into two basic steps. First, a phonetic transcription is automatically produced with the help of large vocabulary continuous speech recognition (LVCSR). Then, the phonetic information and the speech signal serve as input to a speech segmentation tool. We compare two automatic approaches to segmentation, based on the Viterbi and the Forward-Backward algorithm respectively. Further, we develop different techniques to cope with biases between automatic and manual segmentations. Experiments were performed to evaluate the generation of phonetic transcriptions as well as the different speech segmentation methods.
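
The Viterbi-based segmentation step can be illustrated with a small forced-alignment sketch: given a known phone sequence and a per-frame cost for each phone, find the cheapest monotone assignment of frames to phones. This is a simplified illustration, not the tool used in the paper: the cost matrix is hypothetical, there are no transition probabilities, and the Forward-Backward variant is not shown.

```python
def forced_align(costs):
    """Viterbi alignment of frames to a fixed left-to-right phone sequence.

    costs[t][k] is the cost of frame t under the k-th phone of the transcription.
    Returns one phone index per frame; the path starts at phone 0, ends at the
    last phone, and at each frame either stays or advances by one phone.
    Assumes there are at least as many frames as phones.
    """
    n_frames, n_phones = len(costs), len(costs[0])
    inf = float("inf")
    best = [[inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    best[0][0] = costs[0][0]
    for t in range(1, n_frames):
        for k in range(n_phones):
            stay = best[t - 1][k]
            advance = best[t - 1][k - 1] if k > 0 else inf
            if advance < stay:
                best[t][k], back[t][k] = advance + costs[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + costs[t][k], k
    # Trace back from the last phone at the last frame.
    states = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        states.append(back[t][states[-1]])
    states.reverse()
    return states


def boundaries_from_states(states):
    """Frame indices at which the aligned phone changes."""
    return [t for t in range(1, len(states)) if states[t] != states[t - 1]]


# Hypothetical costs: 5 frames, 2 phones; low cost means a good acoustic match.
costs = [[0, 1], [0, 1], [1, 0], [1, 0], [1, 0]]
states = forced_align(costs)
segment_starts = boundaries_from_states(states)
```

On the toy costs the first two frames are assigned to the first phone and the remaining three to the second, so the single phone boundary is placed at frame 2.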

Automatic Parameter Estimation for a Context-Independent Speech Segmentation Algorithm

Lecture Notes in Computer Science, 2002

In the framework of a recently introduced algorithm for speech phoneme segmentation, a novel strategy has been developed for comparing different speech encoding methods and for finding the parameters that are optimal for the algorithm. The automatic procedure that implements this strategy makes it possible to improve on previously reported performance and lays the basis for a more accurate comparison between the investigated segmentation