SRILM - an extensible language modeling toolkit
Related papers
A Language Modelling Tool for Statistical NLP
Anais do V Workshop em Tecnologia da …, 2007
In recent years the use of statistical language models (SLMs) has become widespread in most NLP fields. In this work we introduce jNina, a basic language modelling tool to aid the development of Machine Translation systems and many other text-generating applications. The tool allows for the quick comparison of multiple text outputs (e.g., alternative translations of a single source) based on a given SLM, and enables the user to build and evaluate her own SLMs from any corpora provided.
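The comparison workflow this abstract describes — scoring alternative outputs against an SLM built from user-supplied corpora — can be sketched with a toy smoothed bigram model. This is an illustration of the general idea only, not jNina's actual API; all names below are invented:

```python
import math
from collections import Counter

def train_bigram(corpus, alpha=1.0):
    """Train an add-alpha smoothed bigram model from tokenized sentences
    and return a sentence log-probability function."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)

    def logprob(sent):
        toks = ["<s>"] + sent + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
            for a, b in zip(toks, toks[1:])
        )
    return logprob

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog".split(),
]
score = train_bigram(corpus)

# Rank alternative outputs of a hypothetical MT system by LM score.
candidates = ["the cat sat on the rug".split(),
              "rug the on sat cat the".split()]
best = max(candidates, key=score)
print(" ".join(best))  # the fluent candidate wins
```

The fluent candidate is preferred because all of its bigrams were observed in the training corpus, while the scrambled one relies entirely on smoothing mass.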
Language models: where are the bottlenecks?
AISB Quarterly, 1994
Statistical, parsing, database, and other methods of bringing contextual information to bear on the recognition task are described in a uniform framework in which the central data structure mediating between the recognition and the contextual components is a segment lattice, a directed graph that contains the alternative segments and their confidence/probability ranking. Explicit measures of the value of such segment lattices and the correctness of language models are proposed, and the dominant technologies are critically evaluated.
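The segment lattice described above can be illustrated as a small directed acyclic graph whose edges carry alternative segments with log-domain confidence scores; the contextual component then searches for the best-scoring path. This is a hypothetical toy example, not the paper's actual data structure:

```python
# Nodes are time points; each edge is (segment, next_node, log_score).
edges = {
    0: [("recognize", 2, -1.2), ("wreck a", 2, -1.8)],
    2: [("speech", 4, -0.9), ("nice speech", 4, -1.5)],
}

def best_path(node, goal=4):
    """Return (score, segments) of the highest-scoring path to goal."""
    if node == goal:
        return 0.0, []
    return max(
        ((s + best_path(nxt, goal)[0], [seg] + best_path(nxt, goal)[1])
         for seg, nxt, s in edges[node]),
        key=lambda t: t[0],
    )

print(best_path(0))
```

The winning path here is the one with the least total log-domain cost; a real recognizer would combine acoustic and language-model scores on these edges.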
5 Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
1995
An automatic word classification system has been designed which processes word unigram and bigram frequency statistics extracted from a corpus of natural language utterances. The system implements a binary top-down form of word clustering which employs an average class mutual information metric. Resulting classifications are hierarchical, allowing variable class granularity. Words are represented as structural tags: unique n-bit numbers whose most significant bit-patterns incorporate class information. Access to a structural tag immediately provides access to all classification levels for the corresponding word. The classification system has successfully revealed some of the structure of English, from the phonemic to the semantic level. The system has been compared, directly and indirectly, with other recent word classification systems. Class-based interpolated language models have been constructed to exploit the extra information supplied by the classifications, and some experiments have shown that the new models improve model performance.
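The structural-tag idea — an n-bit number whose most significant bits encode the coarsest class splits, so that a prefix comparison yields class membership at any granularity — can be sketched as follows. Tag width, tree paths, and words are invented for illustration:

```python
def structural_tag(path, width=8):
    """Pack a root-to-leaf path in a binary cluster tree into an n-bit tag.
    path is a sequence of 0/1 branch choices; the most significant bits
    hold the coarsest class splits."""
    tag = 0
    for bit in path:
        tag = (tag << 1) | bit
    return tag << (width - len(path))  # left-align within the tag word

def same_class(tag_a, tag_b, level, width=8):
    """Two words share a class at `level` if their top `level` bits agree."""
    shift = width - level
    return (tag_a >> shift) == (tag_b >> shift)

# hypothetical 8-bit tags for three words
cat = structural_tag([0, 1, 1, 0])
dog = structural_tag([0, 1, 1, 1])
run = structural_tag([1, 0, 0, 1])

print(same_class(cat, dog, 3))  # shared class three levels down
print(same_class(cat, run, 1))  # split apart at the very top
```

Because class membership at every level is a bit-prefix test, no table lookup is needed to move between class granularities, which is the access property the abstract highlights.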
Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach
In natural language, certain sequences of words are very frequent. A classical language model, like an n-gram, does not adequately account for such sequences, because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements. Sequences are treated as additional entries of the word lexicon, over which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which introduce additional types of linguistic dependency. In addition, perplexity is used to make the decision to select a potential sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are more linguistically significant. The originality of this model, compared with commonly used trigger approaches, is the use of word sequences to estimate the trigger pairs without being limited to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words. The use of the word sequences proposed by our algorithm reduces perplexity by more than 16% compared to models limited to single words. The introduction of these word sequences into our dictation machine improves accuracy by approximately 15%.
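The information-theoretic phrase selection step can be illustrated with pointwise mutual information over adjacent word pairs — a simplified stand-in for the paper's actual criteria; the threshold and corpus below are invented:

```python
import math
from collections import Counter

def candidate_phrases(corpus, threshold=1.0):
    """Rank adjacent word pairs by pointwise mutual information (PMI);
    pairs above the threshold become candidate multi-word lexicon entries."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        total += len(sent)
    phrases = {}
    for (a, b), n in bigrams.items():
        # PMI = log2( p(a,b) / (p(a) * p(b)) ), estimated from counts
        pmi = math.log2(n * total / (unigrams[a] * unigrams[b]))
        if pmi > threshold:
            phrases[(a, b)] = round(pmi, 2)
    return phrases

corpus = [
    "new york is a big city".split(),
    "he moved to new york".split(),
    "a new car and a new house".split(),
]
phrases = candidate_phrases(corpus)
print(phrases)  # ("new", "york") scores high: the pair co-occurs reliably
```

A production system, as in the paper, would additionally filter candidates by grammatical class and confirm each selection by its effect on perplexity.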
Statistical feature language model
Proc. ICSLP, 2004
Statistical language models are widely used in automatic speech recognition in order to constrain the decoding of a sentence. Most of these models derive from the classical n-gram paradigm. However, the production of a word depends on a large set of linguistic features ...
Source Code: Querying and Serving N-gram Language Models with Python
Statistical n-gram language modeling is a very important technique in Natural Language Processing (NLP) and Computational Linguistics used to assess the fluency of an utterance in any given language. It is widely employed in several important NLP applications such as Machine Translation and Automatic Speech Recognition. However, the most commonly used toolkit (SRILM) to build such language models on a large scale is written entirely in C++ which presents a challenge to an NLP developer or researcher whose primary language of choice is Python. This article first provides a gentle introduction to statistical language modeling. It then describes how to build a native and efficient Python interface (using SWIG) to the SRILM toolkit such that language models can be queried and used directly in Python code. Finally, it also demonstrates an effective use case of this interface by showing how to leverage it to build a Python language model server. Such a server can prove to be extremely useful when the language model needs to be queried by multiple clients over a network: the language model must only be loaded into memory once by the server and can then satisfy multiple requests. This article supplements the primary article and provides the entire set of source code listings along with appropriate technical comments where necessary. Some of the listings may already be included with the primary article (in complete or excerpted form) but are reproduced here for the sake of completeness.
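The server pattern described here — load the model once, then answer scoring requests from many clients over a network — can be sketched with Python's standard `socketserver` module. The scoring function below is a placeholder for the SWIG-wrapped SRILM calls the article actually uses:

```python
import socket
import socketserver
import threading

# Stand-in for the SWIG-wrapped SRILM model: a real server would call
# the wrapped n-gram probability functions here instead.
def logprob(sentence):
    return -2.0 * len(sentence.split())  # hypothetical per-word cost

class LMHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Protocol: one sentence per line in, one log-probability out.
        for line in self.rfile:
            sent = line.decode().strip()
            self.wfile.write(f"{logprob(sent)}\n".encode())

# The model is loaded once by the server process and shared by clients.
server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), LMHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

with socket.create_connection(("127.0.0.1", port)) as conn:
    conn.sendall(b"the cat sat\n")
    reply = conn.makefile().readline().strip()
print(reply)  # score computed server-side
server.shutdown()
```

The threading server lets multiple clients query concurrently while only one copy of the (potentially multi-gigabyte) model sits in memory, which is the motivation the abstract gives.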
Towards Improved Language Model Evaluation Measures
1999
Much recent research has demonstrated that the correlation between a language model's perplexity and its effect on the word error rate of a speech recognition system is not as strong as was once thought. This represents a major problem for those involved in developing language models. This paper describes the development of new measures of language model quality. These measures retain the ease of computation and task independence that are perplexity's strengths, yet are considerably better correlated with word error rate. This paper also shows that mixture-based language models are improved by applying interpolation weights which are optimised with respect to these new measures, rather than a maximum likelihood criterion.
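For reference, perplexity — the baseline measure whose ease of computation these new measures aim to retain — is just the exponentiated average per-word cross-entropy. The per-word probabilities below are invented for illustration:

```python
def perplexity(log2_probs):
    """Perplexity = 2 ** (average per-word cross-entropy in bits)."""
    entropy = -sum(log2_probs) / len(log2_probs)
    return 2 ** entropy

# hypothetical per-word log2 probabilities for a four-word test sentence
log2_probs = [-6.0, -7.5, -5.0, -9.5]
print(perplexity(log2_probs))  # → 128.0 (average cost is 7 bits/word)
```

The appeal of this quantity is exactly what the abstract notes: it needs only the model's probabilities on test text, with no recognizer in the loop; its weakness is that it ignores acoustic confusability, which is why it correlates imperfectly with word error rate.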
Improving language models by using distant information
2007 9th International Symposium on Signal Processing and Its Applications, 2007
This study examines an original way of taking advantage of distant information in statistical language models. We show that it is possible to use n-gram models that consider histories different from those used during training. These models are called crossing context models. Our study deals with classical and distant n-gram models. A mixture of four models is proposed and evaluated. A bigram linear mixture achieves an improvement of 14% in terms of perplexity. Moreover, the trigram mixture outperforms the standard trigram by 5.6%. These improvements have been obtained without increasing the complexity of standard n-gram models. The resulting mixture language model has been integrated into a speech recognition system. Its evaluation shows a slight improvement in terms of word error rate on the data used for the francophone evaluation campaign ESTER. Finally, the impact of the proposed crossing context language models on performance is presented across various speakers.
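The linear mixture idea can be illustrated by interpolating per-word probabilities from a standard and a distant n-gram model and comparing perplexities. All probabilities and weights below are invented, not the paper's estimates:

```python
import math

def perplexity(probs):
    """Perplexity from per-word probabilities."""
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

def mixture(p_std, p_dist, lam=0.7):
    """Linear interpolation of a standard and a distant n-gram model."""
    return [lam * a + (1 - lam) * b for a, b in zip(p_std, p_dist)]

# hypothetical per-word probabilities from each component model
p_std  = [0.20, 0.01, 0.30, 0.05]
p_dist = [0.10, 0.15, 0.05, 0.20]

ppl_std = perplexity(p_std)
ppl_mix = perplexity(mixture(p_std, p_dist))
print(round(ppl_std, 2), round(ppl_mix, 2))  # mixture perplexity is lower
```

The gain comes from complementarity: where the standard model assigns a very low probability, the distant-context model often does not, so interpolation smooths out the worst predictions without changing the component models themselves.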
One billion word benchmark for measuring progress in statistical language modeling
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 74.4. A combination of techniques leads to 37% reduction in perplexity, or 11% reduction in cross-entropy (bits), over that baseline.
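The two reported reductions are consistent with each other, since perplexity is the exponentiated cross-entropy: a 37% perplexity drop from the 74.4 baseline corresponds to roughly an 11% drop in bits. A quick check:

```python
import math

# perplexity = 2 ** cross_entropy, so the two reported reductions
# should agree: a 37% perplexity drop should be ~11% fewer bits.
baseline_ppl = 74.4
combined_ppl = baseline_ppl * (1 - 0.37)  # 37% perplexity reduction

H_base = math.log2(baseline_ppl)          # ~6.22 bits per word
H_comb = math.log2(combined_ppl)
reduction = 100 * (H_base - H_comb) / H_base
print(round(reduction, 1))  # ~11% reduction in cross-entropy
```

This is also why large relative perplexity gains translate to much smaller relative cross-entropy gains: the logarithm compresses the scale.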