Subsequence similarity language models

Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

In natural language, some word sequences are very frequent. A classical language model such as an n-gram does not adequately account for these sequences, because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements: sequences are treated as additional entries in the word lexicon, over which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. The method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependencies. In addition, perplexity is used to make the decision to select a candidate sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are more linguistically significant. The originality of this model, compared with commonly used trigger approaches, is that it uses word sequences, rather than only single words, to estimate the trigger pairs. Experimental tests, in terms of perplexity and recognition rate, are carried out with a vocabulary of 20000 words and a corpus of 43 million words. The word sequences proposed by our algorithm reduce perplexity by more than 16% compared with models limited to single words. Introducing these word sequences into our dictation machine improves accuracy by approximately 15%.
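The idea of promoting frequent word sequences to lexicon entries can be illustrated with a minimal sketch. It assumes pointwise mutual information as the information-theoretic criterion; the abstract does not specify the exact measure, and the grammatical-class filtering and perplexity check it mentions are omitted. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (bits).

    Assumption: PMI stands in for the paper's unspecified criterion.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def merge_pair(tokens, pair, joiner="_"):
    """Rewrite the corpus so each occurrence of `pair` is one lexicon entry."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy corpus (hypothetical data, for illustration only).
corpus = "new york is big and new york is busy and the city is big".split()
scores = pmi_bigrams(corpus)
best = max(scores, key=scores.get)
merged = merge_pair(corpus, best)
print(best, merged)
```

After merging, an n-gram model trained on the rewritten corpus treats the selected sequence as a single unit. Repeating the score-and-merge step would yield longer sequences.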

Statistical language modeling based on variable-length sequences

Computer Speech & Language, 2003

... Then, an evaluation of these language models in terms of perplexity and word error rate obtained with our ASR system MAUD (Fohr, Haton, Mari, Smaïli, & Zitouni, 1997; Zitouni & Smaïli, 1997) is reported in Section 5. Finally, we give in ... la Syrie → Damas,, RPR → gaullisme,. ...

SRILM - an extensible language modeling toolkit

2002

SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.

One billion word benchmark for measuring progress in statistical language modeling

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 74.4. A combination of techniques leads to 37% reduction in perplexity, or 11% reduction in cross-entropy (bits), over that baseline.
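The relation between the two reductions quoted above can be checked with a short calculation, since cross-entropy in bits is the base-2 logarithm of perplexity. The sketch below uses only the figures from the abstract (baseline perplexity 74.4, 37% perplexity reduction). Because the 37% figure is rounded, the derived combined perplexity is approximate.

```python
import math

baseline_ppl = 74.4    # unpruned Kneser-Ney 5-gram, from the abstract
ppl_reduction = 0.37   # rounded figure from the abstract

# Perplexity implied for the combined model (approximate).
combined_ppl = baseline_ppl * (1 - ppl_reduction)

# Cross-entropy in bits per word is log2 of perplexity.
baseline_bits = math.log2(baseline_ppl)
combined_bits = math.log2(combined_ppl)

entropy_reduction = 1 - combined_bits / baseline_bits
print(f"relative cross-entropy reduction: {entropy_reduction:.1%}")
```

The result is roughly 11%, which agrees with the abstract. A large relative drop in perplexity corresponds to a much smaller relative drop in cross-entropy because of the logarithm.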

Character-based Language Model

Language modelling and other natural language processing tasks are usually based on words. I present here a more general yet simpler approach to language modelling using arbitrarily small units of text data: the character-based language model (CBLM). In this paper I describe the underlying data structure of the model, evaluate the model using standard measures (entropy, perplexity) and, as a proof of concept and an extrinsic evaluation, present a random sentence generator based on the model.
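To make the notion of a character-level model and its perplexity concrete, here is a minimal sketch. It is a fixed-order character n-gram model with add-one smoothing, not the CBLM data structure described in the paper, which the abstract does not detail. The class name, the `^` padding symbol, and the smoothing choice are all assumptions.

```python
import math
from collections import Counter, defaultdict


class CharNGramModel:
    """Fixed-order character n-gram model (illustrative, not the CBLM)."""

    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(Counter)  # context -> next-char counts
        self.alphabet = set()

    def _pad(self, text):
        # '^' is an assumed start-of-text padding symbol.
        return "^" * (self.order - 1) + text

    def train(self, text):
        padded = self._pad(text)
        self.alphabet.update(text)
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            self.counts[context][padded[i]] += 1

    def prob(self, context, char):
        # Add-one smoothing; the extra slot roughly reserves mass for
        # characters never seen in training.
        seen = self.counts.get(context, Counter())
        vocab = len(self.alphabet) + 1
        return (seen[char] + 1) / (sum(seen.values()) + vocab)

    def perplexity(self, text):
        padded = self._pad(text)
        log_sum = 0.0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            log_sum += math.log2(self.prob(context, padded[i]))
        return 2 ** (-log_sum / len(text))


model = CharNGramModel(order=3)
model.train("abababababab")
print(model.perplexity("ababab"), model.perplexity("bbaabb"))
```

Text that follows the training pattern receives a lower per-character perplexity than text that breaks it, which is the behaviour the standard measures are meant to capture.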

Improving n-gram modeling using distance-related unit association maximum entropy language modeling

1999

Abstract: In this paper, a distance-related unit association maximum entropy (DUAME) language modeling is proposed. This approach can model an event (unit subsequence) using the co-occurrence of full distance unit association (UA) features so that it is able to pursue a functional approximation to higher order N-gram with significantly less memory requirement. A smoothing strategy related to this modeling will also be discussed. Preliminary experimental results have shown that DUAME modeling is...

Similarity Based Smoothing In Language Modeling

2000

In this paper, we improve our previously proposed Similarity Based Smoothing (SBS) algorithm. The idea of SBS is to map words or parts of sentences to a Euclidean space, and approximate the language model in that space. The bottleneck of the original algorithm was training a regularized logistic regression model, which was incapable of dealing with real

Improving Statistical Language Model Performance With Automatically Generated Word Hierarchies

Computational Linguistics, 1996

An automatic word-classification system has been designed that uses word unigram and bigram frequency statistics to implement a binary top-down form of word clustering and employs an average class mutual information metric. Words are represented as structural tags: n-bit numbers, the most significant bit-patterns of which incorporate class information. The classification system has revealed some of the lexical structure of English, as well as some phonemic and semantic structure. The system has been compared, directly and indirectly, with other recent word-classification systems. We see our classification as a means towards the end of constructing multilevel class-based interpolated language models. We have built some of these models and carried out experiments that show a 7% drop in test set perplexity compared to a standard interpolated trigram language model.
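The average class mutual information metric mentioned above can be sketched as follows. This computes only the metric for a given word-to-class assignment, using the standard definition over adjacent class pairs. It does not implement the paper's binary top-down clustering, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def average_class_mutual_information(tokens, word_to_class):
    """Mutual information (bits) between the classes of adjacent words."""
    classes = [word_to_class[w] for w in tokens]
    pairs = Counter(zip(classes, classes[1:]))
    total = sum(pairs.values())

    # Marginal counts of the left and right class in each adjacent pair.
    left, right = Counter(), Counter()
    for (c1, c2), n in pairs.items():
        left[c1] += n
        right[c2] += n

    mi = 0.0
    for (c1, c2), n in pairs.items():
        p_joint = n / total
        p_left = left[c1] / total
        p_right = right[c2] / total
        mi += p_joint * math.log2(p_joint / (p_left * p_right))
    return mi


# Toy data (hypothetical): compare a sensible classing with a trivial one.
tokens = "the cat sat the dog sat the cat ran the dog ran".split()
sensible = {"the": "D", "cat": "N", "dog": "N", "sat": "V", "ran": "V"}
trivial = {w: "X" for w in set(tokens)}

good = average_class_mutual_information(tokens, sensible)
bad = average_class_mutual_information(tokens, trivial)
print(good, bad)
```

A clustering procedure would search over assignments for one that maximizes this value. Putting every word in a single class yields zero, since the class of one word then says nothing about the next.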