Trigger-Based Language Model Construction by Combining Different Corpora (6th Spoken Language Symposium)

Trigger-based language model construction by combining different corpora

2004

In this paper we study the trigger-based language model, which can capture dependencies between words over longer spans than the n-gram language model. In language modeling, a training corpus that matches the target task is typically small and therefore insufficient for reliable probability estimates, while large corpora are often too general to capture task dependency. The proposed approach addresses this generality-sparseness trade-off by constructing a trigger-based language model in which task-dependent trigger pairs are first extracted from the corpus that matches the task, and the occurrence probabilities of the pairs are then estimated from both the task corpus and a large text corpus to avoid data sparseness. We report evaluation results on the Corpus of Spontaneous Japanese (CSJ).
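As a rough illustration of the approach described above, the sketch below selects trigger pairs by pointwise mutual information (PMI) within a fixed history window on the small task corpus, and then interpolates each pair's co-occurrence probability between the task corpus and a large general corpus. All function names, the window size, and the interpolation weight `lam` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: extract task-dependent trigger pairs by PMI, then
# estimate each pair's probability from both corpora by linear interpolation.
from collections import Counter
import math

def cooccurrence_counts(sentences, window=10):
    """Count single words and within-window word pairs."""
    unigrams, pairs = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        for i, w in enumerate(sent):
            for v in sent[i + 1 : i + 1 + window]:
                pairs[(w, v)] += 1
    return unigrams, pairs

def select_trigger_pairs(task_sents, top_k=1000, window=10):
    """Rank within-window pairs from the task corpus by PMI and keep the top_k."""
    uni, pairs = cooccurrence_counts(task_sents, window)
    n_words, n_pairs = sum(uni.values()), sum(pairs.values())
    def pmi(pair, c):
        a, b = pair
        return math.log((c / n_pairs) / ((uni[a] / n_words) * (uni[b] / n_words)))
    ranked = sorted(pairs.items(), key=lambda kv: pmi(*kv), reverse=True)
    return [p for p, _ in ranked[:top_k]]

def pair_probabilities(trigger_pairs, task_sents, general_sents, lam=0.7, window=10):
    """Interpolate pair probabilities estimated on the task and general corpora."""
    _, task_pairs = cooccurrence_counts(task_sents, window)
    _, gen_pairs = cooccurrence_counts(general_sents, window)
    task_total = sum(task_pairs.values()) or 1
    gen_total = sum(gen_pairs.values()) or 1
    return {
        p: lam * task_pairs[p] / task_total + (1 - lam) * gen_pairs[p] / gen_total
        for p in trigger_pairs
    }
```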

Dependency Language Modeling

1997

This report summarizes the work of the Dependency Language Modeling group at the 1996 Summer Speech Workshop at the Center for Language and Speech Processing at Johns Hopkins University (WS96). We motivate and describe a novel statistical language model that models the syntactic dependencies between words. The model is formulated in the maximum entropy framework, which expresses statistical constraints on the frequencies of various types of dependencies, as well as the standard N-gram statistics. We describe how this model was applied to the recognition of spontaneous English speech from the Switchboard corpus. Due to implementation constraints, only a reduced version of our model could be tested so far. The model gave a modest improvement over an N-gram baseline model. A by-product of the project is the Maximum Entropy Modeling Toolkit (MEMT), a freely available software package for domain-independent maximum entropy modeling.
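To make the modeling idea concrete, here is a hedged sketch of a conditional log-linear (maximum entropy) next-word distribution whose features mix ordinary n-gram indicators with a long-distance dependency indicator. The feature set, the `head` argument, and the weight dictionary are placeholders; the actual WS96 system estimated its parameters with the MEMT toolkit, which is not reproduced here.

```python
# Illustrative log-linear next-word model combining n-gram and dependency features.
import math

def features(history, head, word):
    """history: previous two words; head: the syntactic head exposed by a parse."""
    return {
        ("bigram", history[-1], word): 1.0,
        ("trigram", history[-2], history[-1], word): 1.0,
        ("dep", head, word): 1.0,          # long-distance dependency feature
    }

def maxent_prob(history, head, word, vocab, weights):
    """P(word | history, head) under a maximum entropy model with given weights."""
    def score(w):
        return sum(weights.get(f, 0.0) * v for f, v in features(history, head, w).items())
    z = sum(math.exp(score(w)) for w in vocab)  # normalization over the vocabulary
    return math.exp(score(word)) / z
```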

A comparison of various approaches for using probabilistic dependencies in language modeling

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '03, 2003


Proposal for a mutual-information based language model

1994

We propose a probabilistic language model intended to overcome some of the limitations of the well-known n-gram models, namely the strong dependence of the model's parameter values on the discourse domain and the constant size of word context taken into account. The new model is based on the mutual information (MI) measure of the correlation between events and derives a hierarchy of categories from unlabelled training text. It has close analogies to the bi-gram model and is therefore explained by comparing it with this model.
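The following toy sketch shows one way such an MI-driven hierarchy could be built: classes start as single words, and the pair of classes with the highest adjacency mutual information is merged repeatedly. This greedy agglomerative procedure is an assumption for illustration only, not the paper's actual algorithm.

```python
# Toy MI-based category hierarchy from unlabelled text (illustrative only).
from collections import Counter
import math

def class_pmi(class_of, bigrams, unigrams):
    """Pointwise MI between adjacent class pairs under the current clustering."""
    cls_uni, cls_bi = Counter(), Counter()
    for w, c in unigrams.items():
        cls_uni[class_of[w]] += c
    for (a, b), c in bigrams.items():
        cls_bi[(class_of[a], class_of[b])] += c
    n_uni, n_bi = sum(cls_uni.values()), sum(cls_bi.values())
    return {
        (x, y): math.log((c / n_bi) / ((cls_uni[x] / n_uni) * (cls_uni[y] / n_uni)))
        for (x, y), c in cls_bi.items() if x != y
    }

def build_hierarchy(tokens, num_merges=5):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    class_of = {w: w for w in unigrams}              # start with one class per word
    merges = []
    for _ in range(num_merges):
        pmi = class_pmi(class_of, bigrams, unigrams)
        if not pmi:
            break
        (x, y), _ = max(pmi.items(), key=lambda kv: kv[1])
        merged = f"({x}+{y})"
        merges.append((x, y, merged))
        class_of = {w: merged if c in (x, y) else c for w, c in class_of.items()}
    return merges                                     # the merge order defines the hierarchy
```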

Context-sensitive statistics for improved grammatical language models

Proceedings of the National …, 1994

We develop a language model using probabilistic context-free grammars (PCFGs) that is “pseudo context-sensitive” in that the probability that a non-terminal N expands using a rule T depends on N's parent. We give the equations for estimating the necessary probabilities using a ...
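A minimal sketch of the parent-conditioned estimation is given below, assuming the training trees have been flattened into (parent, nonterminal, rule) triples; the interpolation with the plain PCFG estimate is a hypothetical backoff added for robustness, not necessarily the smoothing used in the paper.

```python
# Sketch: relative-frequency estimate of P(rule | N, parent(N)) with a
# hypothetical backoff to the plain PCFG estimate P(rule | N).
from collections import Counter

def estimate_rule_probs(trees_as_rules, alpha=0.5):
    """trees_as_rules: iterable of (parent_label, lhs_label, rhs_tuple) triples."""
    ctx_rule, ctx = Counter(), Counter()
    lhs_rule, lhs = Counter(), Counter()
    for parent, n, rhs in trees_as_rules:
        ctx_rule[(parent, n, rhs)] += 1
        ctx[(parent, n)] += 1
        lhs_rule[(n, rhs)] += 1
        lhs[n] += 1

    def prob(parent, n, rhs):
        base = lhs_rule[(n, rhs)] / lhs[n] if lhs[n] else 0.0
        if ctx[(parent, n)] == 0:
            return base                               # unseen context: fall back to PCFG
        cond = ctx_rule[(parent, n, rhs)] / ctx[(parent, n)]
        return alpha * cond + (1 - alpha) * base      # interpolate with the PCFG estimate

    return prob
```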

Enhancing Language Models in Statistical Machine Translation with Backward N-grams and Mutual Information Triggers

2011

In this paper, with a belief that a language model that embraces a larger context provides better prediction ability, we present two extensions to standard n-gram language models in statistical machine translation: a backward language model that augments the conventional forward language model, and a mutual information trigger model which captures long-distance dependencies that go beyond the scope of standard n-gram language models. We integrate the two proposed models into phrase-based statistical machine translation and conduct experiments on large-scale training data to investigate their effectiveness. Our experimental results show that both models are able to significantly improve translation quality and collectively yield an improvement of up to 1 BLEU point over a competitive baseline.
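As a hedged illustration of the forward/backward combination, the sketch below trains an add-one smoothed bigram model normally and a second one on reversed sentences, then interpolates their log-probabilities. The model class, smoothing, and interpolation weight `mu` are simplifications; in the paper these scores are integrated as features in a phrase-based SMT decoder rather than used in isolation.

```python
# Forward + backward bigram scoring (simplified illustration).
import math
from collections import Counter

class Bigram:
    """Add-one smoothed bigram model; train the backward model on reversed text."""
    def __init__(self, sentences):
        self.uni, self.bi = Counter(), Counter()
        for s in sentences:
            padded = ["<s>"] + s + ["</s>"]
            self.uni.update(padded)
            self.bi.update(zip(padded, padded[1:]))
        self.v = len(self.uni)

    def logprob(self, sent):
        padded = ["<s>"] + sent + ["</s>"]
        return sum(
            math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.v))
            for a, b in zip(padded, padded[1:])
        )

def combined_score(fwd, bwd, sent, mu=0.5):
    """Interpolate the forward log-prob with the backward model's score of the reversed sentence."""
    return mu * fwd.logprob(sent) + (1 - mu) * bwd.logprob(list(reversed(sent)))

# Hypothetical usage: the backward model is simply trained on reversed sentences.
# fwd = Bigram(train_sents)
# bwd = Bigram([list(reversed(s)) for s in train_sents])
```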

Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

In natural language, several sequences of words are very frequent. A classical language model, like an n-gram, does not adequately take such sequences into account, because it underestimates their probabilities. A better approach consists in modelling word sequences as if they were individual dictionary elements. Sequences are considered as additional entries of the word lexicon, on which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which contribute an additional type of linguistic dependency. In addition, perplexity is used in order to make the decision of selecting a potential sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are more significant linguistically. The originality of this model, compared with commonly used trigger approaches, is the use of word sequences to estimate the trigger pairs instead of limiting them to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20000 words and a corpus of 43 million words. The use of the word sequences proposed by our algorithm reduces perplexity by more than 16% compared to models limited to single words. The introduction of these word sequences into our dictation machine improves accuracy by approximately 15%.
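A toy sketch of the perplexity-based selection decision follows, assuming a candidate sequence is kept only if rewriting the text with the merged token lowers perplexity on held-out data; the unigram model, add-one smoothing, and normalization by the original word count are simplifications introduced here for clarity.

```python
# Perplexity-driven acceptance of a candidate word sequence (toy version).
import math
from collections import Counter

def merge_sequence(tokens, seq):
    """Rewrite a token stream, replacing occurrences of seq with one merged token."""
    out, i, n = [], 0, len(seq)
    joined = "_".join(seq)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == tuple(seq):
            out.append(joined)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def unigram_perplexity(train, heldout, norm=None):
    """Add-one smoothed unigram perplexity, optionally normalized by a fixed word count."""
    counts = Counter(train)
    total, v = len(train), len(counts) + 1
    lp = sum(math.log((counts[w] + 1) / (total + v)) for w in heldout)
    return math.exp(-lp / (norm or len(heldout)))

def keep_candidate(train, heldout, seq):
    """Accept seq only if merging it lowers held-out perplexity (same normalization)."""
    norm = len(heldout)
    before = unigram_perplexity(train, heldout, norm)
    after = unigram_perplexity(merge_sequence(train, seq),
                               merge_sequence(heldout, seq), norm)
    return after < before
```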

Rapid language model development for new task domains

1998

Data sparseness has been regularly indicted as the primary problem in statistical language modelling. We go one step further to consider the situation when no text data is available for the target domain. We present two techniques for building efficient language models quickly for new domains. The first technique is based on using a context-free grammar to generate a corpus of word collocations. The second is an adaptation technique based on using out-of-domain corpora to estimate target-domain language models.
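The second technique can be illustrated with a small interpolation sketch: an out-of-domain model is mixed with a model estimated on the (possibly grammar-generated) in-domain corpus, and the mixture weight is tuned to minimize perplexity on a development set. The unigram models and the coarse grid search are assumptions made here to keep the example short.

```python
# Hedged sketch: adapt an out-of-domain model by interpolating it with an
# in-domain model, tuning the mixture weight on a development set.
import math
from collections import Counter

def unigram_model(tokens):
    counts, total, v = Counter(tokens), len(tokens), len(set(tokens)) + 1
    return lambda w: (counts[w] + 1) / (total + v)      # add-one smoothing

def interpolated(p_in, p_out, lam):
    return lambda w: lam * p_in(w) + (1 - lam) * p_out(w)

def tune_weight(in_domain, out_domain, dev, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the interpolation weight with the lowest perplexity on dev."""
    p_in, p_out = unigram_model(in_domain), unigram_model(out_domain)
    def ppl(model):
        return math.exp(-sum(math.log(model(w)) for w in dev) / len(dev))
    return min(grid, key=lambda lam: ppl(interpolated(p_in, p_out, lam)))
```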

Structure and performance of a dependency language model

1997

We present a maximum entropy language model that incorporates both syntax and semantics via a dependency grammar. Such a grammar expresses the relations between words by a directed graph. Because the edges of this graph may connect words that are arbitrarily far apart in a sentence, this technique can incorporate the predictive power of words that lie outside of bigram or trigram range. We have built several simple dependency models, as we call them, and tested them in a speech recognition experiment. We report experimental results for these models here, including one that has a small but statistically significant advantage (p < .02) over a bigram language model.