Multi-stream parameterization for structural speech recognition

On invariant structural representation for speech recognition: theoretical validation and experimental improvement

Interspeech 2009, 2009

One of the most challenging problems in speech recognition is dealing with the inevitable acoustic variations caused by non-linguistic factors. Recently, an invariant structural representation of speech was proposed [1], in which the non-linguistic variations are effectively removed through modeling the dynamic and contrastive aspects of speech signals. This paper describes our recent progress on this problem. Theoretically, we prove that the maximum-likelihood-based decomposition leads to the same structural representation for a sequence and its transformed version. Practically, we introduce a method of discriminant analysis of the eigen-structure to address two limitations of structural representations, namely high dimensionality and overly strong invariance. In the first experiment, we evaluate the proposed method on the recognition of connected Japanese vowels. The proposed method achieves a recognition rate of 99.0%, which is higher than those of the previous structure-based recognition methods [2, 3, 4] and word HMMs. In the second experiment, we examine the robustness of structural representations to vocal tract length (VTL) differences. The experimental results indicate that structural representations are much more robust to VTL changes than HMMs. Moreover, the proposed method is about 60 times faster than the previous ones.
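As a toy illustration of the invariance property claimed above (a hypothetical sketch, not the paper's actual formulation): if each acoustic event is modeled as a Gaussian, the Bhattacharyya distance between two events is unchanged by an invertible affine transform of the feature space, so a "structure" built from pairwise event distances is identical for an utterance and its transformed version.

```python
import math

def bhattacharyya_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between 1-D Gaussians N(m1, v1) and N(m2, v2)."""
    return ((m1 - m2) ** 2 / (4.0 * (v1 + v2))
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))

def structure(events):
    """Upper-triangular matrix of pairwise event distances (the 'structure')."""
    n = len(events)
    return [[bhattacharyya_1d(*events[i], *events[j])
             for j in range(i + 1, n)] for i in range(n)]

# Three acoustic events as (mean, variance) pairs.
events = [(0.0, 1.0), (2.0, 0.5), (5.0, 2.0)]
# Affine distortion x -> a*x + b, standing in for a speaker/channel difference.
a, b = 1.7, -3.0
warped = [(a * m + b, a * a * v) for (m, v) in events]

s1, s2 = structure(events), structure(warped)  # identical up to rounding
```

Both the Mahalanobis-like first term and the log-determinant term cancel the transform, which is why the two structures agree.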

Structure-based and template-based automatic speech recognition - comparing parametric and non-parametric approaches

This paper provides an introductory tutorial for the Interspeech07 special session on "Structure-Based and Template-Based Automatic Speech Recognition". The purpose of the special session is to bring together researchers with a special interest in novel techniques aimed at overcoming the weaknesses of HMMs for acoustic modeling in speech recognition. Numerous such approaches have been taken over the past dozen years, and they can be broadly classified into structure-based (parametric) and template-based (non-parametric) ones. In this paper, we provide an overview of both approaches, focusing on the incorporation of long-range temporal dependencies of the speech features and of phonetic detail into speech recognition algorithms. We provide a high-level survey of major existing work and systems using these two types of "beyond-HMM" frameworks. The contributed papers in this special session elaborate further on the related topics.

Structural Classification Methods Based on Weighted Finite-State Transducers for Automatic Speech Recognition

IEEE Transactions on Audio, Speech, and Language Processing, 2000

The potential of structural classification methods for automatic speech recognition (ASR) has attracted the speech community because they can realize unified modeling of the acoustic and linguistic aspects of recognizers. However, structural classification approaches involve well-known tradeoffs between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, the features considered in calculating the likelihood of each hypothesis must be restricted to the same form as the conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible, since most decoding techniques only require that their likelihood functions be factorizable for each decoder arc and time frame. In this paper, we compare two methods for structural classification with the WFST-based features: the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results on the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems were already trained with discriminative training techniques (e.g., MPE).
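To make the factorizability requirement concrete, here is a minimal structured perceptron for sequence labeling (a toy sketch, not the paper's WFST implementation): each arc's score is an emission-like feature plus a transition-like feature, so the total score factorizes per arc and frame, exactly the form a one-pass Viterbi decoder needs.

```python
LABELS = ["A", "B"]

def arc_score(w, obs, prev, cur):
    # Score factorizes per arc and time frame: an emission-like feature
    # plus a transition-like feature, weighted by w.
    return w.get(("em", cur, obs), 0.0) + w.get(("tr", prev, cur), 0.0)

def viterbi(w, xs):
    V = {lab: arc_score(w, xs[0], "<s>", lab) for lab in LABELS}
    back = []
    for x in xs[1:]:
        nxt, bp = {}, {}
        for cur in LABELS:
            best = max(LABELS, key=lambda p: V[p] + arc_score(w, x, p, cur))
            nxt[cur] = V[best] + arc_score(w, x, best, cur)
            bp[cur] = best
        back.append(bp)
        V = nxt
    lab = max(LABELS, key=lambda l: V[l])
    out = [lab]
    for bp in reversed(back):
        lab = bp[lab]
        out.append(lab)
    return out[::-1]

def perceptron_update(w, xs, gold, pred):
    # Add features of the gold path, subtract features of the predicted path;
    # features on positions where the two paths agree cancel out.
    prev_g = prev_p = "<s>"
    for x, g, p in zip(xs, gold, pred):
        for key, sgn in [(("em", g, x), 1.0), (("tr", prev_g, g), 1.0),
                         (("em", p, x), -1.0), (("tr", prev_p, p), -1.0)]:
            w[key] = w.get(key, 0.0) + sgn
        prev_g, prev_p = g, p

# Tiny separable training set: observation "x" maps to "A", "y" to "B".
data = [(["x", "x", "y"], ["A", "A", "B"]), (["y", "x"], ["B", "A"])]
w = {}
for _ in range(20):
    for xs, gold in data:
        pred = viterbi(w, xs)
        if pred != gold:
            perceptron_update(w, xs, gold, pred)
```

A CRF would replace the argmax update with gradient steps on the log-partition, but the same per-arc factorization carries over.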

Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics

Interspeech 2007, 2007

Speech acoustics vary due to differences in gender, age, microphone, room, transmission line, and a variety of other factors. In speech recognition research, to deal with these inevitable non-linguistic variations, acoustic models of individual phonemes are trained on thousands of speakers recorded in different acoustic conditions. Recently, a novel representation of speech dynamics was proposed [1, 2], in which the above non-linguistic factors are effectively removed from speech, much as pitch information is removed from a spectrum by smoothing it. This representation captures only speaker- and microphone-invariant speech dynamics; no absolute or static acoustic properties, such as spectra, are used, since with such properties speaker identity inevitably remains in the representation. In our previous study, the new representation was applied to recognizing a sequence of isolated vowels [3]. The proposed method, with a single training speaker, outperformed conventional HMMs trained on more than four thousand speakers, even in the case of noisy speech. The current paper shows the initial results of applying the dynamic representation to recognizing continuous speech, that is, connected vowels.
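A minimal sketch of recognizing with contrasts only (a hypothetical toy, not the authors' system): represent each utterance by the pairwise distances between its acoustic events and classify by nearest template. Euclidean contrasts are unchanged by a constant feature-space bias, the cepstral-domain analogue of a channel difference, so the distorted utterance still matches its template.

```python
import math

def contrasts(events):
    """Flattened upper-triangular matrix of pairwise event distances."""
    return [math.dist(events[i], events[j])
            for i in range(len(events)) for j in range(i + 1, len(events))]

def structure_distance(u, v):
    """Euclidean distance between two utterances' contrast vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(contrasts(u), contrasts(v))))

# Toy vowel-sequence templates; each event is a 2-D spectral feature.
templates = {
    "aiu": [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)],
    "aue": [(1.0, 0.0), (3.0, 1.0), (2.0, 3.0)],
}
# Test utterance: the "aiu" events shifted by a constant bias (channel effect).
# Absolute positions change, contrasts do not.
test = [(x + 0.8, y - 0.5) for (x, y) in templates["aiu"]]
best = min(templates, key=lambda k: structure_distance(templates[k], test))
```

No absolute spectral value ever enters the match; only inter-event contrasts do.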

Leveraging phonetic context dependent invariant structure for continuous speech recognition

2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP), 2014

Speech acoustics intrinsically vary due to linguistic and non-linguistic factors. The invariant structure extracted from a given utterance is a long-span acoustic representation in which acoustic variation caused by non-linguistic factors can be reasonably removed. It expresses the spectral contrasts between acoustic events in an utterance. In previous studies, the invariant structure was leveraged in continuous speech recognition to rerank the N-best candidates hypothesized by a traditional automatic speech recognition (ASR) system. Using the invariant structure features for reranking proved effective; however, the features were defined and labeled in a phonetic-context-independent way. In this paper, the use of phonetic context to define invariant structure features is examined. The proposed method is tested on two tasks, continuous-digit speech recognition and large vocabulary continuous speech recognition (LVCSR), and the performances are improved relatively by 4.7% and 1.2%, respectively.
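The reranking step described above can be sketched as a simple score interpolation (hypothetical scores and weight; the paper's actual feature functions are richer): each N-best hypothesis gets a combined score mixing the ASR log-score with a structure score, and the list is re-sorted.

```python
def rerank(nbest, structure_score, weight=0.3):
    """Rerank N-best hypotheses by interpolating the ASR log-score with a
    structure score (higher is better for both)."""
    return sorted(nbest,
                  key=lambda h: (1 - weight) * h["asr"]
                                + weight * structure_score[h["text"]],
                  reverse=True)

# Hypothetical 3-best list with ASR log-scores and structure scores.
nbest = [{"text": "five nine", "asr": -10.0},
         {"text": "five five", "asr": -10.4},
         {"text": "nine nine", "asr": -12.0}]
structure_score = {"five nine": -4.0, "five five": -1.0, "nine nine": -5.0}
best = rerank(nbest, structure_score)[0]["text"]
# The structure score overturns the ASR top hypothesis.
```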

Flexible multi-stream framework for speech recognition using multi-tape finite-state transducers

2006

We present an approach to general multi-stream recognition utilizing multi-tape finite-state transducers (FSTs). The approach is novel in that each of the multiple "streams" of features can represent either a sequence (e.g., fixed- or variable-rate frames) or a directed acyclic graph (e.g., containing hypothesized phonetic segmentations). Each transition of the multi-tape FST specifies the models to be applied to each stream and the degree of feature-stream asynchrony to allow. We show how this framework can easily represent the two-stream variable-rate landmark and segment modeling utilized by our baseline SUMMIT speech recognizer. We present experiments merging standard hidden Markov models (HMMs) with landmark models on the Wall Street Journal speech recognition task, and find that some degree of asynchrony can be critical when combining different types of models. We also present experiments performing audio-visual speech recognition on the AV-TIMIT task.
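The effect of allowing bounded asynchrony can be sketched with a small dynamic program (a hypothetical toy, not the multi-tape FST machinery itself): the two stream indices advance monotonically, their difference is capped, and the best path over per-cell combined scores is found. When one stream genuinely lags the other, a strict-synchrony decoder cannot reach the high-scoring cells at all.

```python
def align(score, n, m, max_async=1):
    """Best monotone path of combined per-cell scores from (0, 0) to
    (n-1, m-1); the two stream indices may differ by at most max_async."""
    NEG = float("-inf")
    D = [[NEG] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if abs(i - j) > max_async:
                continue  # asynchrony beyond the allowed bound
            prev = max(D[i - 1][j] if i else NEG,            # advance stream 1
                       D[i][j - 1] if j else NEG,            # advance stream 2
                       D[i - 1][j - 1] if i and j else NEG,  # advance both
                       0.0 if (i, j) == (0, 0) else NEG)
            D[i][j] = prev + score(i, j)
    return D[n - 1][m - 1]

# Hypothetical combined score that is high when stream 2 (say, visual)
# lags stream 1 (audio) by exactly one frame.
lagged = lambda i, j: 1.0 if j == i + 1 else 0.0
sync_score = align(lagged, 4, 4, max_async=0)   # strict synchrony
async_score = align(lagged, 4, 4, max_async=1)  # one frame of slack
```

With `max_async=0` the path is pinned to the diagonal and collects no score; one frame of slack recovers the lagged alignment.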

A segmental-feature HMM for continuous speech recognition based on a parametric trajectory model

Speech Communication, 2002

In this paper, we propose a new acoustic model for characterizing segmental features and an algorithm based upon a general framework of hidden Markov models (HMMs). The segmental features are represented as a trajectory of observed vector sequences by a polynomial regression function. To obtain the polynomial trajectory from speech segments, we modify the design matrix to include transitional information for contiguous frames. We also propose methods for estimating the likelihood of a given segment and trajectory parameters. The observation probability of a given segment is represented as the relation between the segment likelihood and the estimation error of the trajectories. The estimation error of a trajectory is considered the weight of the likelihood of a given segment in a state. This weight represents the probability of how well the corresponding trajectory characterizes the segment. The proposed model can be regarded as a generalization of a conventional HMM and a parametric trajectory model. We conducted several experiments to establish the effectiveness of the proposed method and the characteristics of the segmental features. The recognition results on the TIMIT database demonstrate that the performance of segmental-feature HMM (SFHMM) is better than that of a conventional HMM.
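A minimal sketch of the trajectory idea (degree-1 fit for brevity; the paper uses general polynomial regression over vector sequences): fit a least-squares line to a segment's frames and use the fit residual as the quantity that weights the segment likelihood, so that segments well described by their trajectory count for more.

```python
def fit_line(ys):
    """Least-squares linear trajectory y = a*t + b over frames t = 0..n-1.
    Assumes the segment has at least two frames."""
    n = len(ys)
    ts = range(n)
    mt = sum(ts) / n
    my = sum(ys) / n
    a = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
         / sum((t - mt) ** 2 for t in ts))
    return a, my - a * mt

def trajectory_error(ys):
    """Mean squared residual of the fitted trajectory; a low error means
    the trajectory characterizes the segment well."""
    a, b = fit_line(ys)
    return sum((y - (a * t + b)) ** 2 for t, y in enumerate(ys)) / len(ys)

rising = [0.0, 1.1, 1.9, 3.0]   # near-linear formant movement
noisy = [0.0, 3.0, -1.0, 2.0]   # poorly described by a line
```

Higher-degree polynomials follow the same pattern with a larger design matrix.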

Hierarchical Multi-stream Posterior Based Speech Recognition System

Lecture Notes in Computer Science, 2006

In this paper, we present initial results towards boosting posterior-based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on the "state gamma posterior" definition (typically used in standard HMM training), extended to the case of multi-stream HMMs. This approach provides a new, principled theoretical framework for hierarchical estimation and use of posteriors, multi-stream feature combination, and the integration of appropriate context and prior knowledge into posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced-vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvements compared to state-of-the-art Tandem systems.
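For reference, the single-stream "state gamma posterior" is the standard forward-backward quantity; a minimal sketch for a discrete-output HMM (toy parameters, not the paper's multi-stream extension) shows the posteriors that would then be fed as features to the HMM/GMM layer.

```python
def forward_backward(pi, A, B, obs):
    """Gamma posteriors gamma[t][i] = P(state = i | whole observation sequence)
    for a discrete-output HMM with initial probs pi, transitions A, emissions B."""
    n, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([sum(alpha[-1][j] * A[j][i] for j in range(n))
                      * B[i][obs[t]] for i in range(n)])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(n)) for i in range(n)]
    gammas = []
    for t in range(T):
        w = [alpha[t][i] * beta[t][i] for i in range(n)]
        z = sum(w)
        gammas.append([x / z for x in w])  # normalize per frame
    return gammas

# Toy 2-state HMM with two output symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
g = forward_backward(pi, A, B, [0, 0, 1])
```

Because beta folds in the remaining observations, each gamma reflects the whole utterance, which is precisely the "acoustic context" the abstract refers to.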

Structure-Based Speech Classification Using Non-Linear Embedding Techniques

2000

"Usable speech" refers to those portions of corrupted speech that can be used to determine a reasonable number of distinguishing features of the speaker. It has previously been shown that using only the voiced segments of speech improves a usable speech detection system, and also that unvoiced speech does not contribute significant information about the speaker(s) for speaker identification. Therefore, using a voiced/unvoiced speech detection system, the voiced portions of co-channel speech are usually detected and extracted for use in usable speech extraction systems. The process of human speech production is complex, nonlinear, and nonstationary; its most precise description can only be realized in terms of nonlinear fluid dynamics.
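The voiced/unvoiced front end mentioned above is commonly built from frame energy and zero-crossing rate; a minimal sketch with hypothetical thresholds (not this paper's detector) is:

```python
import math

def frame_features(frame):
    """Mean energy and zero-crossing rate of one frame of samples."""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)
    return energy, zcr

def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Voiced frames: high energy, low zero-crossing rate.
    Thresholds here are illustrative, not tuned values."""
    energy, zcr = frame_features(frame)
    return energy > energy_thresh and zcr < zcr_thresh

# Synthetic frames at 8 kHz: a 200 Hz sinusoid (voiced-like) and a
# low-energy alternating-sign signal (unvoiced-like).
sr = 8000
voiced = [0.5 * math.sin(2 * math.pi * 200 * t / sr) for t in range(240)]
unvoiced = [0.05 * (-1) ** t for t in range(240)]
```

Real detectors add smoothing across frames and often a pitch-based cue, but the energy/ZCR contrast is the core of the decision.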

Models of speech dynamics in a segmental-HMM recognizer using intermediate linear representations

2002

A theoretical and experimental analysis of a simple multilevel segmental HMM is presented in which the relationship between symbolic (phonetic) and surface (acoustic) representations of speech is regulated by an intermediate (articulatory) layer, where speech dynamics are modeled using linear trajectories. Three formant-based parameterizations and measured articulatory positions are considered as intermediate representations, from the TIMIT and MOCHA corpora respectively. The articulatory-to-acoustic mapping was performed by between 1 and 49 linear transformations. Results of phone-classification experiments demonstrate that, by appropriate choice of intermediate parameterization and mappings, it is possible to achieve close to optimal performance.
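The intermediate-layer idea can be sketched as fitting one linear articulatory-to-acoustic map per phone class and scoring a segment by how well each map explains it (a hypothetical 1-D toy; the paper uses up to 49 multivariate transformations over formant or MOCHA articulatory data):

```python
def fit_affine(xs, ys):
    """Least-squares affine map y ~ w*x + c from an intermediate
    (articulatory) value to an acoustic value."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

def sse(affine, xs, ys):
    """Sum of squared acoustic prediction errors under one linear map."""
    w, c = affine
    return sum((y - (w * x + c)) ** 2 for x, y in zip(xs, ys))

# Two phone classes with different articulatory-to-acoustic relations.
cls1 = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.1, 4.9, 7.0])  # roughly y = 2x + 1
cls2 = ([0.0, 1.0, 2.0, 3.0], [5.0, 4.0, 3.1, 2.0])  # roughly y = -x + 5
maps = {"c1": fit_affine(*cls1), "c2": fit_affine(*cls2)}

# Classify a new (articulatory, acoustic) segment by its best-fitting map.
seg = ([0.5, 1.5], [2.0, 4.0])  # consistent with y = 2x + 1
best = min(maps, key=lambda k: sse(maps[k], *seg))
```

Increasing the number of maps refines the piecewise-linear approximation of the articulatory-to-acoustic relation, which is the tradeoff the phone-classification experiments explore.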