Trigger-Based Language Model Construction by Combining Different Corpora (6th Spoken Language Symposium)

Trigger-based language model construction by combining different corpora

2004

In this paper we study the trigger-based language model, which can capture dependencies between words over longer spans than the n-gram language model. In language modeling, a training corpus that matches the target task is typically small and therefore insufficient for reliable probability estimates, while large corpora are often too general to capture task dependency. The proposed approach addresses this generality-sparseness trade-off by constructing a trigger-based language model in which task-dependent trigger pairs are first extracted from the corpus that matches the task, and the occurrence probabilities of the pairs are then estimated from both the task corpus and a large text corpus to avoid data sparseness. We report evaluation results on the Corpus of Spontaneous Japanese (CSJ).
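As a rough illustration of the approach described above, the sketch below selects trigger pairs by pointwise mutual information (PMI) within a fixed history window on the small task corpus, and then interpolates each pair's co-occurrence probability between the task corpus and a large general corpus. All function names, the window size, and the interpolation weight `lam` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: extract task-dependent trigger pairs by PMI, then
# estimate each pair's probability from both corpora by linear interpolation.
from collections import Counter
import math

def cooccurrence_counts(sentences, window=10):
    """Count single words and within-window word pairs."""
    unigrams, pairs = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        for i, w in enumerate(sent):
            for v in sent[i + 1 : i + 1 + window]:
                pairs[(w, v)] += 1
    return unigrams, pairs

def select_trigger_pairs(task_sents, top_k=1000, window=10):
    """Rank within-window pairs from the task corpus by PMI and keep the top_k."""
    uni, pairs = cooccurrence_counts(task_sents, window)
    n_words, n_pairs = sum(uni.values()), sum(pairs.values())
    def pmi(pair, c):
        a, b = pair
        return math.log((c / n_pairs) / ((uni[a] / n_words) * (uni[b] / n_words)))
    ranked = sorted(pairs.items(), key=lambda kv: pmi(*kv), reverse=True)
    return [p for p, _ in ranked[:top_k]]

def pair_probabilities(trigger_pairs, task_sents, general_sents, lam=0.7, window=10):
    """Interpolate pair probabilities estimated on the task and general corpora."""
    _, task_pairs = cooccurrence_counts(task_sents, window)
    _, gen_pairs = cooccurrence_counts(general_sents, window)
    task_total = sum(task_pairs.values()) or 1
    gen_total = sum(gen_pairs.values()) or 1
    return {
        p: lam * task_pairs[p] / task_total + (1 - lam) * gen_pairs[p] / gen_total
        for p in trigger_pairs
    }
```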

Dependency Language Modeling

1997

This report summarizes the work of the Dependency Language Modeling group at the 1996 Summer Speech Workshop at the Center for Language and Speech Processing at Johns Hopkins University (WS96). We motivate and describe a novel statistical language model that models the syntactic dependencies between words. The model is formulated in the maximum entropy framework, which expresses statistical constraints on the frequencies of various types of dependencies, as well as the standard N-gram statistics. We describe how this model was applied to the recognition of spontaneous English speech from the Switchboard corpus. Due to implementation constraints, only a reduced version of our model could be tested so far. The model gave a modest improvement over an N-gram baseline model. A by-product of the project is the Maximum Entropy Modeling Toolkit (MEMT), a freely available software package for domain-independent maximum entropy modeling.
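To make the modeling idea concrete, here is a hedged sketch of a conditional log-linear (maximum entropy) next-word distribution whose features mix ordinary n-gram indicators with a long-distance dependency indicator. The feature set, the `head` argument, and the weight dictionary are placeholders; the actual WS96 system estimated its parameters with the MEMT toolkit, which is not reproduced here.

```python
# Illustrative log-linear next-word model combining n-gram and dependency features.
import math

def features(history, head, word):
    """history: previous two words; head: the syntactic head exposed by a parse."""
    return {
        ("bigram", history[-1], word): 1.0,
        ("trigram", history[-2], history[-1], word): 1.0,
        ("dep", head, word): 1.0,          # long-distance dependency feature
    }

def maxent_prob(history, head, word, vocab, weights):
    """P(word | history, head) under a maximum entropy model with given weights."""
    def score(w):
        return sum(weights.get(f, 0.0) * v for f, v in features(history, head, w).items())
    z = sum(math.exp(score(w)) for w in vocab)  # normalization over the vocabulary
    return math.exp(score(word)) / z
```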

A comparison of various approaches for using probabilistic dependencies in language modeling

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '03, 2003


Proposal for a mutual-information based language model

1994

We propose a probabilistic language model intended to overcome some of the limitations of the well-known n-gram models, namely the strong dependence of the model's parameter values on the discourse domain and the constant size of word context taken into account. The new model is based on the mutual information (MI) measure of the correlation between events and derives a hierarchy of categories from unlabelled training text. It has close analogies to the bi-gram model and is therefore explained by comparing it with this model.
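The following toy sketch shows one way such an MI-driven hierarchy could be built: classes start as single words, and the pair of classes with the highest adjacency mutual information is merged repeatedly. This greedy agglomerative procedure is an assumption for illustration only, not the paper's actual algorithm.

```python
# Toy MI-based category hierarchy from unlabelled text (illustrative only).
from collections import Counter
import math

def class_pmi(class_of, bigrams, unigrams):
    """Pointwise MI between adjacent class pairs under the current clustering."""
    cls_uni, cls_bi = Counter(), Counter()
    for w, c in unigrams.items():
        cls_uni[class_of[w]] += c
    for (a, b), c in bigrams.items():
        cls_bi[(class_of[a], class_of[b])] += c
    n_uni, n_bi = sum(cls_uni.values()), sum(cls_bi.values())
    return {
        (x, y): math.log((c / n_bi) / ((cls_uni[x] / n_uni) * (cls_uni[y] / n_uni)))
        for (x, y), c in cls_bi.items() if x != y
    }

def build_hierarchy(tokens, num_merges=5):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    class_of = {w: w for w in unigrams}              # start with one class per word
    merges = []
    for _ in range(num_merges):
        pmi = class_pmi(class_of, bigrams, unigrams)
        if not pmi:
            break
        (x, y), _ = max(pmi.items(), key=lambda kv: kv[1])
        merged = f"({x}+{y})"
        merges.append((x, y, merged))
        class_of = {w: merged if c in (x, y) else c for w, c in class_of.items()}
    return merges                                     # the merge order defines the hierarchy
```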

Context-sensitive statistics for improved grammatical language models

Proceedings of the National …, 1994

We develop a language model using probabilistic context-free grammars (PCFGs) that is “pseudo context-sensitive” in that the probability that a non-terminal N expands using a rule T depends on N's parent. We give the equations for estimating the necessary probabilities using a ...
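A minimal sketch of the parent-conditioned estimation is given below, assuming the training trees have been flattened into (parent, nonterminal, rule) triples; the interpolation with the plain PCFG estimate is a hypothetical backoff added for robustness, not necessarily the smoothing used in the paper.

```python
# Sketch: relative-frequency estimate of P(rule | N, parent(N)) with a
# hypothetical backoff to the plain PCFG estimate P(rule | N).
from collections import Counter

def estimate_rule_probs(trees_as_rules, alpha=0.5):
    """trees_as_rules: iterable of (parent_label, lhs_label, rhs_tuple) triples."""
    ctx_rule, ctx = Counter(), Counter()
    lhs_rule, lhs = Counter(), Counter()
    for parent, n, rhs in trees_as_rules:
        ctx_rule[(parent, n, rhs)] += 1
        ctx[(parent, n)] += 1
        lhs_rule[(n, rhs)] += 1
        lhs[n] += 1

    def prob(parent, n, rhs):
        base = lhs_rule[(n, rhs)] / lhs[n] if lhs[n] else 0.0
        if ctx[(parent, n)] == 0:
            return base                               # unseen context: fall back to PCFG
        cond = ctx_rule[(parent, n, rhs)] / ctx[(parent, n)]
        return alpha * cond + (1 - alpha) * base      # interpolate with the PCFG estimate

    return prob
```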

Enhancing Language Models in Statistical Machine Translation with Backward N-grams and Mutual Information Triggers

2011

In this paper, with a belief that a language model that embraces a larger context provides better prediction ability, we present two extensions to standard n-gram language models in statistical machine translation: a backward language model that augments the conventional forward language model, and a mutual information trigger model which captures long-distance dependencies that go beyond the scope of standard n-gram language models. We integrate the two proposed models into phrase-based statistical machine translation and conduct experiments on large-scale training data to investigate their effectiveness. Our experimental results show that both models are able to significantly improve translation quality and collectively yield an improvement of up to 1 BLEU point over a competitive baseline.
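As a hedged illustration of the forward/backward combination, the sketch below trains an add-one smoothed bigram model normally and a second one on reversed sentences, then interpolates their log-probabilities. The model class, smoothing, and interpolation weight `mu` are simplifications; in the paper these scores are integrated as features in a phrase-based SMT decoder rather than used in isolation.

```python
# Forward + backward bigram scoring (simplified illustration).
import math
from collections import Counter

class Bigram:
    """Add-one smoothed bigram model; train the backward model on reversed text."""
    def __init__(self, sentences):
        self.uni, self.bi = Counter(), Counter()
        for s in sentences:
            padded = ["<s>"] + s + ["</s>"]
            self.uni.update(padded)
            self.bi.update(zip(padded, padded[1:]))
        self.v = len(self.uni)

    def logprob(self, sent):
        padded = ["<s>"] + sent + ["</s>"]
        return sum(
            math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.v))
            for a, b in zip(padded, padded[1:])
        )

def combined_score(fwd, bwd, sent, mu=0.5):
    """Interpolate the forward log-prob with the backward model's score of the reversed sentence."""
    return mu * fwd.logprob(sent) + (1 - mu) * bwd.logprob(list(reversed(sent)))

# Hypothetical usage: the backward model is simply trained on reversed sentences.
# fwd = Bigram(train_sents)
# bwd = Bigram([list(reversed(s)) for s in train_sents])
```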

Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

In natural language, several sequences of words are very frequent. A classical language model, like an n-gram, does not adequately take such sequences into account, because it underestimates their probabilities. A better approach consists in modelling word sequences as if they were individual dictionary elements. Sequences are considered as additional entries of the word lexicon, on which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which contribute an additional type of linguistic dependency. In addition, perplexity is used in order to make the decision of selecting a potential sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are more significant linguistically. The originality of this model, compared with commonly used trigger approaches, is the use of word sequences to estimate the trigger pairs instead of limiting them to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20000 words and a corpus of 43 million words. The use of the word sequences proposed by our algorithm reduces perplexity by more than 16% compared to models limited to single words. The introduction of these word sequences into our dictation machine improves accuracy by approximately 15%.
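A toy sketch of the perplexity-based selection decision follows, assuming a candidate sequence is kept only if rewriting the text with the merged token lowers perplexity on held-out data; the unigram model, add-one smoothing, and normalization by the original word count are simplifications introduced here for clarity.

```python
# Perplexity-driven acceptance of a candidate word sequence (toy version).
import math
from collections import Counter

def merge_sequence(tokens, seq):
    """Rewrite a token stream, replacing occurrences of seq with one merged token."""
    out, i, n = [], 0, len(seq)
    joined = "_".join(seq)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == tuple(seq):
            out.append(joined)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def unigram_perplexity(train, heldout, norm=None):
    """Add-one smoothed unigram perplexity, optionally normalized by a fixed word count."""
    counts = Counter(train)
    total, v = len(train), len(counts) + 1
    lp = sum(math.log((counts[w] + 1) / (total + v)) for w in heldout)
    return math.exp(-lp / (norm or len(heldout)))

def keep_candidate(train, heldout, seq):
    """Accept seq only if merging it lowers held-out perplexity (same normalization)."""
    norm = len(heldout)
    before = unigram_perplexity(train, heldout, norm)
    after = unigram_perplexity(merge_sequence(train, seq),
                               merge_sequence(heldout, seq), norm)
    return after < before
```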

Rapid language model development for new task domains

1998

Data sparseness has been regularly indicted as the primary problem in statistical language modelling. We go one step further to consider the situation when no text data is available for the target domain. We present two techniques for building efficient language models quickly for new domains. The first technique is based on using a context-free grammar to generate a corpus of word collocations. The second is an adaptation technique based on using out-of-domain corpora to estimate target-domain language models.
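The second technique can be illustrated with a small interpolation sketch: an out-of-domain model is mixed with a model estimated on the (possibly grammar-generated) in-domain corpus, and the mixture weight is tuned to minimize perplexity on a development set. The unigram models and the coarse grid search are assumptions made here to keep the example short.

```python
# Hedged sketch: adapt an out-of-domain model by interpolating it with an
# in-domain model, tuning the mixture weight on a development set.
import math
from collections import Counter

def unigram_model(tokens):
    counts, total, v = Counter(tokens), len(tokens), len(set(tokens)) + 1
    return lambda w: (counts[w] + 1) / (total + v)      # add-one smoothing

def interpolated(p_in, p_out, lam):
    return lambda w: lam * p_in(w) + (1 - lam) * p_out(w)

def tune_weight(in_domain, out_domain, dev, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the interpolation weight with the lowest perplexity on dev."""
    p_in, p_out = unigram_model(in_domain), unigram_model(out_domain)
    def ppl(model):
        return math.exp(-sum(math.log(model(w)) for w in dev) / len(dev))
    return min(grid, key=lambda lam: ppl(interpolated(p_in, p_out, lam)))
```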

Structure and performance of a dependency language model

1997

We present a maximum entropy language model that incorporates both syntax and semantics via a dependency grammar. Such a grammar expresses the relations between words by a directed graph. Because the edges of this graph may connect words that are arbitrarily far apart in a sentence, this technique can incorporate the predictive power of words that lie outside of bigram or trigram range. We have built several simple dependency models, as we call them, and tested them in a speech recognition experiment. We report experimental results for these models here, including one that has a small but statistically significant advantage (p < .02) over a bigram language model.