Weighted finite-state transducers in speech recognition
Related papers
Juicer: A weighted finite state transducer speech decoder
2006
A major component in the development of any speech recognition system is the decoder. As task complexities and, consequently, system complexities have continued to increase, the decoding problem has become an increasingly significant part of the overall speech recognition system development effort, with efficient decoder design contributing significantly to improving the trade-off between decoding time and search errors. In this paper we present "Juicer" (from transducer), a large vocabulary continuous speech recognition (LVCSR) decoder based on weighted finite-state transducers (WFSTs). We begin with a discussion of the need for open-source, state-of-the-art decoding software in LVCSR research and how this led to the development of Juicer, followed by a brief overview of decoding techniques and major issues in decoder design. We present Juicer and its major features, emphasising its potential not only as a critical component in the development of LVCSR systems, but also as an important research tool in itself, being based around the flexible WFST paradigm. We also provide results of benchmarking tests carried out to date, demonstrating that in many respects Juicer, while still in its early development, is already achieving state-of-the-art performance. These benchmarking tests not only demonstrate the utility of Juicer in its present state, but are also being used to guide future development; hence, we conclude with a brief discussion of some of the extensions that are currently under way or being considered for Juicer.
IEEE Transactions on Audio, Speech, and Language Processing, 2000
The potential of structural classification methods for automatic speech recognition (ASR) has been attracting the speech community since they can realize the unified modeling of acoustic and linguistic aspects of recognizers. However, the structural classification approaches involve well-known trade-offs between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, the features considered when calculating the likelihood of each hypothesis must be restricted to the same form as the conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible since most decoding techniques only require that their likelihood functions be factorizable over each decoder arc and time frame. In this paper, we compare two methods for structural classification with the WFST-based features: the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved the ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems are already trained with discriminative training techniques (e.g., MPE).
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
We propose a generalized dynamic composition algorithm for weighted finite-state transducers (WFSTs), which avoids the creation of non-coaccessible paths, performs weight look-ahead, and does not impose any constraints on the topology of the WFSTs. Experimental results on the Wall Street Journal (WSJ1) 20k-word trigram task show that at 17% WER (moderately wide beam width), the decoding time of the proposed approach is about 48% and 65% of that of the other two dynamic composition approaches. In comparison with static composition, at the same level of 17% WER, we observe a reduction of about 60% in memory requirements, with an increase of about 60% in decoding time due to the extra overhead of dynamic composition.
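The core idea of dynamic (on-the-fly) composition described above can be illustrated with a small sketch: composed states are expanded only when the search actually reaches them, so parts of the static composition that are never visited are never built. The representation, example transducers, and function name below are invented for illustration; real systems (e.g. OpenFst) add epsilon filters and look-ahead on top of this basic scheme.

```python
# Toy sketch of lazy WFST composition in the tropical semiring.
# A transducer is a dict: state -> list of (in_label, out_label, weight, next_state).

def compose_lazy(A, B, start_a, start_b):
    """Expand composed states (qa, qb) only when reached, so unreachable
    parts of the static composition are never materialized."""
    start = (start_a, start_b)
    arcs = {}                  # composed state -> list of composed arcs
    stack = [start]
    seen = {start}
    while stack:
        qa, qb = stack.pop()
        out = []
        for (i1, o1, w1, na) in A.get(qa, []):
            for (i2, o2, w2, nb) in B.get(qb, []):
                if o1 == i2:   # match A's output label with B's input label
                    nxt = (na, nb)
                    out.append((i1, o2, w1 + w2, nxt))  # tropical: weights add
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        arcs[(qa, qb)] = out
    return arcs

# A maps 'a'->'x' then 'b'->'y'; B maps 'x'->'X' then 'y'->'Y'.
A = {0: [('a', 'x', 1.0, 1)], 1: [('b', 'y', 0.5, 2)]}
B = {0: [('x', 'X', 0.2, 1)], 1: [('y', 'Y', 0.3, 2)]}
C = compose_lazy(A, B, 0, 0)
print(C[(0, 0)])   # [('a', 'X', 1.2, (1, 1))]
```

Only the three composed states reachable from the start are ever created; a static composition would enumerate all state pairs up front.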
Language model combination and adaptation using weighted finite state transducers
2010
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. They can be used either to represent different genres or tasks found in diverse text sources, or to capture stochastic properties of different linguistic symbol sequences, for example, syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal changes to decoding tools are needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences.
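The LM combination the paper realizes with WFST operations is, at its simplest, linear interpolation: a weighted union of component models, with -log(lambda) costs on the epsilon arcs entering each component. A minimal sketch of the underlying arithmetic, using invented toy unigram tables and an assumed mixture weight, might look like:

```python
import math

def interpolate(lm1, lm2, lam):
    """Return -log probabilities of the mixture lam*lm1 + (1-lam)*lm2.
    In WFST terms: the weight of each word in the union of the two
    weighted acceptors, entered through arcs costing -log(lam) and
    -log(1-lam)."""
    vocab = set(lm1) | set(lm2)
    return {w: -math.log(lam * lm1.get(w, 0.0) + (1 - lam) * lm2.get(w, 0.0))
            for w in vocab}

# Toy unigram models for two "genres" (invented numbers).
lm_news = {'market': 0.6, 'goal': 0.4}
lm_sport = {'market': 0.1, 'goal': 0.9}
mix = interpolate(lm_news, lm_sport, 0.5)
# Mixture P(goal) = 0.5*0.4 + 0.5*0.9 = 0.65, stored as -log(0.65).
```

Expressing this as a WFST union rather than in LM-training code is what lets the paper support many combination configurations without changing the decoder.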
A discriminative model for continuous speech recognition based on Weighted Finite State Transducers
2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010
This paper proposes a discriminative model for speech recognition that directly optimizes the parameters of a speech model represented in the form of a decoding graph. In the process of recognition, a decoder, given an input speech signal, searches for an appropriate label sequence among possible combinations from separate knowledge sources of speech, e.g., acoustic, lexicon, and language models. It is more reasonable to use an integrated knowledge source, which is composed of these models and forms the overall space to be searched by a decoder, than to use separate ones. This paper aims to estimate a speech model composed in this way directly in the search network, unlike discriminative training approaches, which estimate parameters in the acoustic or language model layers. Our approach is formulated as the weight parameter optimization of log-linear distributions on the decoding arcs of a Weighted Finite State Transducer (WFST) to efficiently handle a large static network. The weight parameters are estimated by an averaged perceptron algorithm. The experimental results show that, especially when the model size is small, the proposed approach provided better recognition performance than conventional maximum likelihood training, and comparable to or slightly better performance than discriminative training approaches.
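The averaged-perceptron idea described above can be sketched in a few lines: each decoding arc is a feature, a path's score is the sum of its arc weights, and weights are nudged toward the reference path whenever decoding picks the wrong one. The candidate paths, arc names, and training data below are toy values, not the paper's setup.

```python
# Minimal averaged-perceptron sketch over "arc" features.
# examples: list of (candidate_paths, reference_path), paths = tuples of arcs.

def perceptron_train(examples, epochs=3):
    w = {}          # arc -> current weight
    total = {}      # running sum of weights, for averaging
    steps = 0
    for _ in range(epochs):
        for candidates, reference in examples:
            # "Decode": pick the candidate path with the highest current score.
            best = max(candidates, key=lambda p: sum(w.get(a, 0.0) for a in p))
            if best != reference:
                for a in reference:      # reward arcs on the reference path
                    w[a] = w.get(a, 0.0) + 1.0
                for a in best:           # penalize arcs on the wrong path
                    w[a] = w.get(a, 0.0) - 1.0
            for a, v in w.items():
                total[a] = total.get(a, 0.0) + v
            steps += 1
    return {a: v / steps for a, v in total.items()}   # averaged weights

# Two candidate paths; the second is the reference transcription.
ex = [([('h', 'a', 't'), ('c', 'a', 't')], ('c', 'a', 't'))]
avg = perceptron_train(ex)
```

After training, the arcs unique to the reference path carry positive averaged weight and those unique to the competing path carry negative weight, so the decoder prefers the reference.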
Interspeech 2004, 2004
This paper proposes a new on-the-fly composition algorithm for Weighted Finite-State Transducers (WFSTs) in large-vocabulary continuous-speech recognition. In general on-the-fly composition, two transducers are composed during decoding, and a Viterbi search is performed over the composed search space. In this new method, a Viterbi search is performed over only the first of the two transducers. The second transducer is used solely to rescore the hypotheses generated during the search. Since this rescoring is very efficient, the total amount of computation in the new method is almost the same as when using only the first transducer. In a 30k-word vocabulary spontaneous lecture speech transcription task, our proposed method significantly outperformed the general on-the-fly composition method. Furthermore, the speed of our method was slightly faster than that of decoding with a single fully composed and optimized WFST, while consuming only 20% of the memory required for decoding with the single WFST. Finally, we have achieved one-pass real-time speech recognition with an extremely large vocabulary of 1.8 million words.
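The search-then-rescore split described above is easy to sketch: the first pass (over the first transducer only) yields hypotheses with partial costs, and the second transducer merely adds its own cost to each surviving hypothesis. The hypotheses, costs, and backoff value below are invented toy data, not the paper's models.

```python
# Sketch: rescore first-pass hypotheses with a second (e.g. language-model)
# transducer, instead of searching the full composed space.

def rescore(hypotheses, second_model):
    """hypotheses: list of (word_seq, first_pass_cost).
    second_model: dict word_seq -> additional cost from the second transducer.
    Unseen sequences get an assumed backoff cost of 10.0 (toy value)."""
    rescored = [(seq, cost + second_model.get(seq, 10.0))
                for seq, cost in hypotheses]
    return min(rescored, key=lambda x: x[1])   # lowest total cost wins

hyps = [(('recognize', 'speech'), 4.0),
        (('wreck', 'a', 'nice', 'beach'), 3.5)]
g = {('recognize', 'speech'): 1.0,
     ('wreck', 'a', 'nice', 'beach'): 6.0}
best = rescore(hyps, g)   # ('recognize', 'speech') at total cost 5.0
```

Note how the second model overturns the first-pass ranking: the cheaper first-pass hypothesis loses once the second transducer's cost is added, which is exactly why rescoring recovers most of the accuracy of full composition at a fraction of the work.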
Finite-State Transducers in Language and Speech Processing
Finite-state machines have been used in various domains of natural language processing. We consider here the use of a type of transducer that supports very efficient programs: sequential transducers. We recall classical theorems and give new ones characterizing sequential string-to-string transducers. Transducers that output weights also play an important role in language and speech processing. We give a specific study of string-to-weight transducers, including algorithms for determinizing and minimizing these transducers very efficiently, characterizations of the transducers admitting determinization, and the corresponding algorithms. Some applications of these algorithms in speech recognition are described and illustrated.
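The determinization of string-to-weight transducers mentioned above can be sketched for the tropical (min, +) semiring: each deterministic state is a set of original states paired with residual weights, each output arc takes the minimum attainable weight, and the remainders are carried forward. The representation and example automaton are invented for illustration; the algorithm terminates only for determinizable machines (the toy one here is acyclic).

```python
# Hedged sketch of weighted determinization in the tropical semiring.
# arcs: dict state -> list of (label, weight, next_state).

def determinize(arcs, start):
    init = frozenset([(start, 0.0)])        # (state, residual weight) pairs
    det, stack = {}, [init]
    while stack:
        S = stack.pop()
        if S in det:
            continue
        by_label = {}
        for q, r in S:
            for label, w, q2 in arcs.get(q, []):
                by_label.setdefault(label, []).append((r + w, q2))
        det[S] = {}
        for label, pairs in by_label.items():
            w_min = min(w for w, _ in pairs)          # emit the cheapest weight
            rem = {}                                  # keep min residual per state
            for w, q2 in pairs:
                rem[q2] = min(rem.get(q2, float('inf')), w - w_min)
            S2 = frozenset(rem.items())
            det[S][label] = (w_min, S2)
            stack.append(S2)
    return init, det

# Nondeterministic: two 'a' arcs leave state 0 with different weights.
arcs = {0: [('a', 1.0, 1), ('a', 3.0, 2)],
        1: [('b', 0.0, 3)],
        2: [('b', 0.0, 3)]}
start, det = determinize(arcs, 0)
w, S2 = det[start]['a']   # one deterministic 'a' arc of weight 1.0;
                          # the extra 2.0 survives as a residual on state 2
```

The two nondeterministic 'a' arcs collapse into a single arc carrying the minimum weight, with the difference stored as a residual, so every string keeps exactly its original minimum-cost weight.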
Flexible multi-stream framework for speech recognition using multi-tape finite-state transducers
2006
We present an approach to general multi-stream recognition utilizing multi-tape finite-state transducers (FSTs). The approach is novel in that each of the multiple "streams" of features can represent either a sequence (e.g., fixed- or variable-rate frames) or a directed acyclic graph (e.g., containing hypothesized phonetic segmentations). Each transition of the multi-tape FST specifies the models to be applied to each stream and the degree of feature stream asynchrony to allow. We show how this framework can easily represent the 2-stream variable-rate landmark and segment modeling utilized by our baseline SUMMIT speech recognizer. We present experiments merging standard hidden Markov models (HMMs) with landmark models on the Wall Street Journal speech recognition task, and find that some degree of asynchrony can be critical when combining different types of models. We also present experiments performing audio-visual speech recognition on the AV-TIMIT task.
Use of Weighted Finite State Transducers in Part of Speech Tagging
Computing Research Repository, 1997
This paper addresses issues in part of speech disambiguation using finite-state transducers and presents two main contributions to the field. One of them is the use of finite-state machines for part of speech tagging. Linguistic and statistical information is represented in terms of weights on transitions in weighted finite-state transducers. Another contribution is the successful combination of techniques -- linguistic and statistical -- for
The Titech large vocabulary WFST speech recognition system
2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007
In this paper we present evaluations of the large vocabulary speech decoder we are currently developing at the Tokyo Institute of Technology. Our goal is to build a fast, scalable, flexible decoder to operate on weighted finite state transducer (WFST) search spaces. Even though the development of the decoder is still in its infancy, we have already implemented an impressive feature set and are achieving good accuracy and speed on a large vocabulary spontaneous speech task. We have developed a technique that allows parts of the decoder to be run on the graphics processor, which can lead to a very significant speed-up.