Flexible multi-stream framework for speech recognition using multi-tape finite-state transducers (original) (raw)

We present an approach to general multi-stream recognition utilizing multi-tape finite-state transducers (FSTs). The approach is novel in that each of the multiple "streams" of features can represent either a sequence (e.g., fixed-or variable-rate frames) or a directed acyclic graph (e.g., containing hypothesized phonetic segmentations). Each transition of the multi-tape FST specifies the models to be applied to each stream and the degree of feature stream asynchrony to allow. We show how this framework can easily represent the 2-stream variable-rate landmark and segment modeling utilized by our baseline SUMMIT speech recognizer. We present experiments merging standard hidden Markov models (HMMs) with landmark models on the Wall Street Journal speech recognition task, and find that some degree of asynchrony can be critical when combining different types of models. We also present experiments performing audio-visual speech recognition on the AV-TIMIT task.