Expanded Lattice Embeddings for Spoken Document Retrieval on Informal Meetings
Related papers
A critical assessment of spoken utterance retrieval through approximate lattice representations
2008
This paper compares the performance of position-specific posterior lattices (PSPL) and confusion networks applied to spoken utterance retrieval, and tests these recent proposals against several baselines in two disparate domains. These lossy methods provide compact representations that generalize the original segment lattices and provide greater recall and robustness, but have yet to be evaluated against each other in multiple WER conditions for spoken utterance retrieval. Our comparisons suggest that while PSPL and confusion networks have comparable recall, the former is slightly more precise, although its merit appears to be coupled to the assumptions of low-frequency search queries and low-WER environments.
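The position-specific posterior lattice (PSPL) representation compared above aggregates, for each sentence position, the posterior probability of each word occurring there. A minimal sketch of that aggregation, using a weighted N-best list as a toy stand-in for a true lattice (the input format and names are illustrative assumptions):

```python
from collections import defaultdict

def pspl_table(weighted_hyps):
    """Aggregate word posteriors per sentence position.

    weighted_hyps: list of (posterior_weight, [words]) pairs, an N-best
    approximation of a lattice (hypothetical input format).
    Returns {position: {word: posterior}}.
    """
    table = defaultdict(lambda: defaultdict(float))
    for weight, words in weighted_hyps:
        for pos, word in enumerate(words):
            table[pos][word] += weight
    return {p: dict(ws) for p, ws in table.items()}

hyps = [
    (0.6, ["spoken", "document", "retrieval"]),
    (0.3, ["spoken", "documents", "retrieval"]),
    (0.1, ["broken", "document", "retrieval"]),
]
table = pspl_table(hyps)
# "document" at position 1 accumulates 0.6 + 0.1 = 0.7
```

A retrieval engine can then score a query term by its accumulated posterior at each position rather than by its presence in a single 1-best transcript, which is where the extra recall comes from.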
Word/sub-word lattices decomposition and combination for speech recognition
2008
This paper presents the benefit of using multiple lexical units in the post-processing stage of an ASR system. Since the use of sub-word units can reduce the high out-of-vocabulary rate and compensate for the lack of text resources in statistical language modeling, we propose several methods to decompose, normalize and combine word and sub-word lattices generated from different ASR systems. By using a sub-word information table, every word in a lattice can be decomposed into sub-word units. These decomposed lattices can be combined into a common lattice in order to generate a confusion network. This lattice combination scheme results in an absolute syllable error rate reduction of about 1.4% over the sentence MAP baseline method for a Vietnamese ASR task. The proposed method also outperforms N-best list combination and voting.
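The table-driven word-to-sub-word decomposition described above can be sketched as expanding each word arc into a chain of sub-word arcs (the arc tuple format, the intermediate-state naming, and the even score split are illustrative assumptions, not the paper's exact scheme):

```python
def decompose_lattice(word_arcs, subword_table):
    """Expand each word arc into a chain of sub-word arcs.

    word_arcs: list of (start, end, word, score) tuples (hypothetical format).
    subword_table: maps a word to its sub-word units, e.g. syllables.
    Unknown words are kept as single-unit arcs.
    """
    out = []
    for start, end, word, score in word_arcs:
        units = subword_table.get(word, [word])
        for i, unit in enumerate(units):
            # Fresh intermediate states between consecutive sub-word units;
            # the arc score is split evenly (one simple convention).
            s = start if i == 0 else (start, end, i - 1)
            e = end if i == len(units) - 1 else (start, end, i)
            out.append((s, e, unit, score / len(units)))
    return out

arcs = [(0, 1, "hanoi", -2.0)]
table = {"hanoi": ["ha", "noi"]}
sub_arcs = decompose_lattice(arcs, table)
```

Once word and sub-word lattices share the same unit inventory, they can be unioned into a common lattice and collapsed into a confusion network.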
2006
Most ASR systems adopt an all-in-one approach: acoustic model, lexicon and language model are all applied simultaneously, thus forming a single large search space. This way, both lexicon and language model help in constraining the search at an early stage which greatly improves its efficiency. However, such close integration comes at a cost: all resources must be kept simple. Achieving higher accuracy in unconstrained LVCSR tasks will require more complex resources while at the same time the 'unconstrainedness' of the task reduces the effectiveness of the all-in-one approach. Therefore, we propose a modular two-layered architecture. First, a pure acoustic-phonemic search generates a dense phone network. Next a robust decoder finds those words from the lexicon that match well with the phone sequences encoded in the phone network. In this paper we investigate the properties the robust word decoder must have and we propose an efficient search algorithm.
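The second layer's job, matching lexicon entries against phone sequences from the first layer, can be sketched as a fuzzy substring search with an edit-distance tolerance (a toy stand-in for the robust word decoder investigated in the paper; function names and the tolerance parameter are illustrative assumptions):

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (1-D rolling row)."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[-1]

def robust_word_match(phones, lexicon, max_err=1):
    """Find lexicon words whose pronunciation matches a span of the phone
    sequence within max_err edits, tolerating first-layer phone errors."""
    hits = []
    for word, pron in lexicon.items():
        for i in range(len(phones)):
            for j in range(i + 1, len(phones) + 1):
                if edit_distance(phones[i:j], pron) <= max_err:
                    hits.append((i, j, word))
    return hits
```

A real implementation would search the dense phone network rather than a single phone string, and prune spans dynamically instead of enumerating all of them.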
Cross-site and intra-site ASR system combination: Comparisons on lattice and 1-best methods
2006
We evaluate system combination techniques for automatic speech recognition using systems from multiple sites that participated in the TC-STAR 2006 Evaluation. Both lattice and 1-best combination techniques are tested for cross-site and intra-site tasks. For pairwise combinations the lattice-based approaches can outperform 1-best ROVER with confidence scores, but 1-best ROVER results are equal (or even better) when combining three or four systems.
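The voting stage of 1-best ROVER mentioned above can be sketched as a per-slot majority vote over aligned hypotheses. This assumes the alignment step has already been done (real ROVER builds a word transition network by iterative dynamic-programming alignment, and can weight votes by confidence scores, neither of which is shown here):

```python
from collections import Counter

def rover_vote(aligned_hyps, null="@"):
    """Majority vote over pre-aligned 1-best hypotheses.

    aligned_hyps: equal-length word lists, with `null` marking slots
    where a system hypothesized no word.
    """
    result = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != null:
            result.append(word)
    return result

systems = [
    ["the", "cat", "@"],
    ["the", "bat", "sat"],
    ["the", "cat", "sat"],
]
```

Per-slot voting is why ROVER keeps improving as more systems are added: with three or four systems, majorities become more reliable, matching the paper's finding that 1-best ROVER catches up with lattice combination.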
Evaluating ASR Output for Information Retrieval
Measurement, 2007
Within the context of international benchmarks and collection specific projects, much work on spoken document retrieval has been done in recent years. In 2000 the issue of automatic speech recognition for spoken document retrieval was declared 'solved' for the broadcast news domain. Many collections, however, are not in this domain and automatic speech recognition for these collections may contain specific new challenges. This requires a method to evaluate automatic speech recognition optimization schemes for these application areas. Traditional measures such as word error rate and story word error rate are not ideal for this. In this paper, three new metrics are proposed. Their behaviour is investigated on a cultural heritage collection and performance is compared to traditional measurements on TREC broadcast news data.
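The traditional word error rate that the proposed metrics are measured against is the word-level Levenshtein distance normalized by reference length; a minimal implementation for reference:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / ref length,
    computed with a standard dynamic-programming edit-distance table."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

The paper's point is that this measure weights all words equally, whereas retrieval performance depends mostly on content words; hence the need for IR-oriented alternatives.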
On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models
2021
Hybrid automatic speech recognition (ASR) models are typically sequentially trained with CTC or LF-MMI criteria. However, they have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybrid acoustic modeling (AM). In this framework, we show that LF-MMI is a powerful training criterion applicable to both limited-context and full-context models, for wordpiece/mono-char/bi-char/chenone units, with both HMM/CTC topologies. From this framework, we propose three novel training schemes: chenone(ch)/wordpiece(wp)-CTC-bMMI, and wordpiece(wp)-HMM-bMMI with different advantages in training performance, decoding efficiency and decoding time-stamp accuracy. The advantages of different training schemes are evaluated comprehensively on Librispeech, and wp-CTC-bMMI and ch-CTC-bMMI are evalu...
The THISL SDR system at TREC-9
2001
This paper describes our participation in the TREC-9 Spoken Document Retrieval (SDR) track. The THISL SDR system consists of a realtime version of a hybrid connectionist/HMM large vocabulary speech recognition system and a probabilistic text retrieval system. This paper describes the configuration of the speech recognition and text retrieval systems, including segmentation and query expansion. We report our results for development tests using the TREC-8 queries, and for the TREC-9 evaluation.
Evaluation of phone lattice based speech decoding
Interspeech 2009
Previously, we proposed a flexible two-layered speech recogniser architecture, called FLaVoR. In the first layer an unconstrained, task independent phone recogniser generates a phone lattice. Only in the second layer the task specific lexicon and language model are applied to decode the phone lattice and produce a word level recognition result. In this paper, we present a further evaluation of the FLaVoR architecture. The performance of a classical single-layered architecture and the FLaVoR architecture are compared on two recognition tasks, using the same acoustic, lexical and language models. On the large vocabulary Wall Street Journal 5k and 20k benchmark tasks, the two-layered architecture resulted in slightly but not significantly better word error rates. On a reading error detection task for a reading tutor for children, the FLaVoR architecture clearly outperformed the single-layered architecture.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of what could be the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, being a recommended alternative to overcome the limitations of working with automatically generated transcripts.
Hill climbing on speech lattices: A new rescoring framework
2011
We describe a new approach for rescoring speech lattices - with long-span language models or wide-context acoustic models - that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique that starts with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively searches a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that, to achieve the same reduction in error rate using a better estimated, higher order language model, our technique evaluates two orders of magnitude fewer utterance-length hypotheses than conventional N-best rescoring. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.
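The hill climbing loop described above can be sketched over edit-distance-1 neighborhoods. In the paper the neighborhood is restricted to word sequences actually present in the lattice and is built with finite-state machinery; the unconstrained version below, with an arbitrary scoring callable, is an illustrative simplification:

```python
def neighborhood(hyp, vocab):
    """All word sequences at edit distance 1 from hyp:
    single-word deletions, substitutions, and insertions from vocab."""
    hyp = tuple(hyp)
    neigh = set()
    for i in range(len(hyp)):
        neigh.add(hyp[:i] + hyp[i + 1:])               # deletion
        for w in vocab:
            neigh.add(hyp[:i] + (w,) + hyp[i + 1:])    # substitution
    for i in range(len(hyp) + 1):
        for w in vocab:
            neigh.add(hyp[:i] + (w,) + hyp[i:])        # insertion
    neigh.discard(hyp)
    return neigh

def hill_climb(start, vocab, score):
    """Greedy ascent: move to the best-scoring neighbor until no neighbor
    improves on the current hypothesis, then return it (a local optimum)."""
    cur = tuple(start)
    while True:
        best = max(neighborhood(cur, vocab), key=score)
        if score(best) <= score(cur):
            return cur
        cur = best
```

Each iteration evaluates only a local neighborhood under the rescoring model, which is the source of the claimed savings over scoring an entire N-best list of full hypotheses.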