A Maximum Entropy Approach for Semantic Language Modeling
Related papers
Compact maximum entropy language models
Proceedings of the IEEE workshop on automatic …
In language modeling we are always confronted with a sparse data problem. The Maximum Entropy formalism makes it possible to fully integrate complementary statistical properties of limited corpora. The focus of the present paper is twofold. First, the new smoothing technique of LM-induced marginals is introduced and discussed. We then highlight the advantages of combining robust features and show that the brute-force inclusion of too many constraints may degrade performance due to overtraining effects. Very good LMs may be trained on the basis of pair correlations supplemented by heavily pruned n-grams, especially if word-based and class-based features are combined. Tests were carried out on the German Verbmobil task and on WSJ data. Test-set perplexities were reduced by 3-7% and the number of free parameters by 60-75%, while overtraining effects were considerably reduced.
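As a point of reference for this and several of the abstracts below, a conditional maximum entropy LM has the familiar log-linear form (a generic sketch, not the exact parameterization of the paper); the word-pair, class-based, and pruned n-gram constraints mentioned above would enter as feature functions f_i:

```latex
% Generic conditional maximum entropy LM (sketch; feature set is illustrative)
p_\Lambda(w \mid h) = \frac{\exp\!\Big(\sum_i \lambda_i f_i(h, w)\Big)}{Z_\Lambda(h)},
\qquad
Z_\Lambda(h) = \sum_{w'} \exp\!\Big(\sum_i \lambda_i f_i(h, w')\Big)
```

Each f_i(h, w) is typically a binary indicator (e.g., a word-pair or class n-gram event), and the weights λ_i are trained so that the model's feature expectations match the marginals observed in the corpus, or, for the smoothing technique above, marginals induced by another LM.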
Investigating linguistic knowledge in a maximum entropy token-based language model
2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007
We present a novel language model capable of incorporating various types of linguistic information as encoded in the form of a token, a (word, label)-tuple. Using tokens as hidden states, our model is effectively a hidden Markov model (HMM) producing sequences of words with trivial output distributions. The transition probabilities, however, are computed using a maximum entropy model to take advantage of potentially overlapping features. We investigated different types of labels with a wide range of linguistic implications. These models outperform Kneser-Ney smoothed n-gram models both in terms of perplexity on standard datasets and in terms of word error rate for a large vocabulary speech recognition system.
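A compact way to read the model described above (a sketch based only on the abstract, with our own notation): with tokens t_i = (w_i, ℓ_i) as hidden states and each token deterministically emitting its own word, the probability of a word sequence marginalizes over label sequences, while the transitions are maximum entropy models over overlapping features:

```latex
% Token-based LM sketch: tokens t = (word, label) are hidden states,
% emission is trivial, transitions are a maximum entropy model.
P(w_1^n) = \sum_{\ell_1^n} \prod_{i=1}^{n} p_{\mathrm{ME}}\big(t_i \mid t_{i-1}\big),
\qquad
p_{\mathrm{ME}}(t_i \mid t_{i-1}) =
  \frac{\exp\big(\sum_k \lambda_k f_k(t_{i-1}, t_i)\big)}
       {\sum_{t'} \exp\big(\sum_k \lambda_k f_k(t_{i-1}, t')\big)}
```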
Toward a unified approach to statistical language modeling for Chinese
ACM Transactions on Asian Language Information Processing, 2002
This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
Machine Learning, 2005
Current statistical machine translation systems are mainly based on statistical word lexicons. However, these models are usually context-independent, so the translation of a source word must be disambiguated using other probability distributions (distortion distributions and statistical language models). One efficient way to add contextual information to the statistical lexicons is maximum entropy modeling. In that framework, the context is introduced through feature functions that allow us to automatically learn context-dependent lexicon models.
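In that framework, a context-dependent lexicon model can be sketched as follows (the notation is ours, for illustration): the probability of a target word t given a source word s and its context c is

```latex
% Context-dependent maximum entropy lexicon model (illustrative form)
p(t \mid s, c) =
  \frac{\exp\big(\sum_k \lambda_k f_k(t, s, c)\big)}
       {\sum_{t'} \exp\big(\sum_k \lambda_k f_k(t', s, c)\big)}
```

where a feature f_k(t, s, c) might, for example, fire when a particular word appears in a window around s, letting the lexicon itself disambiguate translations from context rather than relying only on distortion and language model scores.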
Improving n-gram modeling using distance-related unit association maximum entropy language modeling
1999
Abstract: In this paper, a distance-related unit association maximum entropy (DUAME) language model is proposed. This approach models an event (unit subsequence) using the co-occurrence of full-distance unit association (UA) features, so that it can pursue a functional approximation to higher-order N-grams with a significantly smaller memory requirement. A smoothing strategy related to this modeling is also discussed. Preliminary experimental results have shown that DUAME modeling is...
Improving language models by using distant information
2007 9th International Symposium on Signal Processing and Its Applications, 2007
This study examines how to take advantage of distant information in statistical language models. We show that it is possible to use n-gram models conditioned on histories different from those used during training; these models are called crossing context models. Our study deals with classical and distant n-gram models. A mixture of four models is proposed and evaluated. A bigram linear mixture achieves a 14% improvement in terms of perplexity, and the trigram mixture outperforms the standard trigram by 5.6%. These improvements are obtained without increasing the complexity of standard n-gram models. The resulting mixture language model has been integrated into a speech recognition system, where it achieves a slight improvement in word error rate on the data used for the French-language evaluation campaign ESTER. Finally, the impact of the proposed crossing context language models on performance is analyzed across different speakers.
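The mixture of classical and distant (crossing context) n-gram models described above amounts to a linear interpolation. Below is a minimal sketch; the component models, weights, and toy probability tables are illustrative placeholders, not the paper's configuration:

```python
# Linear mixture of "classical" and "distant" bigram models:
# p(w | h) = sum_m lambda_m * p_m(w | h_m), where each component conditions
# on a different part of the history (immediate vs. distant word).

def mixture_prob(word, history, components, weights):
    """components: callables p_m(word, history) -> probability.
    weights: interpolation coefficients, assumed to sum to 1."""
    return sum(lam * p(word, history) for lam, p in zip(weights, components))

# Toy components: a classical bigram (previous word) and a distant bigram
# (word two positions back); unseen events get a small floor probability.
def classical_bigram(word, history):
    table = {("like", "i"): 0.2, ("modeling", "language"): 0.3}
    return table.get((word, history[-1]), 1e-4)

def distant_bigram(word, history):
    table = {("modeling", "like"): 0.1}
    return table.get((word, history[-2]), 1e-4)

p = mixture_prob("modeling", ["i", "like", "language"],
                 [classical_bigram, distant_bigram], [0.7, 0.3])
print(p)  # 0.7 * 0.3 + 0.3 * 0.1 = 0.24
```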
A Weighted Maximum Entropy Language Model for Text
Proceedings of the 2nd …, 2005
The maximum entropy (ME) approach has been used extensively for various natural language processing tasks, such as language modeling, part-of-speech tagging, text segmentation, and text classification. Previous work in text classification has applied maximum entropy modeling with binary-valued features or counts of feature words. In this work, we present a method that applies maximum entropy modeling to text classification in a different way: weights are used both to select the features of the model and to emphasize the importance of each feature in the classification task. Using the chi-square (χ²) test to assess the contribution of each candidate feature, we rank the features by their χ² values and select those with the highest scores as the features of the model. Instead of using maximum entropy modeling in the classical way, we use the χ² values to weight the features of the model and thus give a different importance to each of them. The method has been evaluated on the Reuters-21578 dataset for text classification tasks, giving very promising results and performing comparably to some of the state-of-the-art systems in the classification field.
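A minimal, self-contained sketch of the χ²-based feature ranking step described above; the 2x2 contingency computation is the standard one, while the toy data, the top-k cutoff, and the way the scores would be reused as weights in the ME model are our assumptions for illustration:

```python
# Rank candidate feature words by the chi-square statistic of a 2x2 table
# (word present/absent vs. document in class/not in class), keep the
# top-ranked words, and reuse their chi-square scores as feature weights.

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 contingency table."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

def rank_features(docs, labels, target_class, top_k=100):
    """docs: list of token lists; labels: parallel list of class labels."""
    in_class = [set(d) for d, y in zip(docs, labels) if y == target_class]
    out_class = [set(d) for d, y in zip(docs, labels) if y != target_class]
    vocab = set().union(*(set(d) for d in docs))
    scores = {}
    for w in vocab:
        n11 = sum(w in d for d in in_class)    # word present, in class
        n01 = sum(w in d for d in out_class)   # word present, not in class
        n10 = len(in_class) - n11              # word absent, in class
        n00 = len(out_class) - n01             # word absent, not in class
        scores[w] = chi_square(n11, n10, n01, n00)
    # Top-k words become the selected features; their scores can then act
    # as per-feature weights in the weighted ME classifier.
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

docs = [["oil", "price", "rises"], ["stocks", "fall"], ["oil", "output", "cut"]]
labels = ["crude", "market", "crude"]
print(rank_features(docs, labels, "crude", top_k=3))
```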
Investigation of Maximum Entropy Hybrid Language Models for Open Vocabulary German and Polish LVCSR
For languages like German and Polish, the high number of word inflections leads to high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Thus, one of the main challenges in large vocabulary continuous speech recognition (LVCSR) is recognizing an open vocabulary. In this paper, we investigate the use of mixed types of sub-word units in the same recognition lexicon: morphemic or syllabic units combined with their pronunciations (graphones), plain graphemic morphemes or syllables, and full words. In addition, we investigate the suitability of hybrid mixed-unit N-grams as features for a maximum entropy LM, along with adaptation. We achieve significant improvements in recognizing OOVs and word error rate reductions for German and Polish LVCSR compared to the conventional full-word approach and a state-of-the-art mixed-type hybrid N-gram LM.
Proposal for a mutual-information based language model
1994
We propose a probabilistic language model intended to overcome some of the limitations of the well-known n-gram models, namely the strong dependence of the model's parameter values on the discourse domain and the fixed size of the word context taken into account. The new model is based on the mutual information (MI) measure of the correlation between events and derives a hierarchy of categories from unlabelled training text. It has close analogies to the bigram model and is therefore explained by comparison with that model.
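For reference, the (pointwise) mutual information on which such a model is built scores the correlation of two events, e.g., a word and a preceding word or category (this is the standard definition, not anything specific to this proposal):

```latex
% Pointwise mutual information between events x and y
\mathrm{MI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}
```

Positive values indicate that y makes x more likely than its unigram probability alone would suggest, which is the same quantity a bigram model captures implicitly via P(x | y) / P(x).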
One billion word benchmark for measuring progress in statistical language modeling
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 74.4. A combination of techniques leads to 37% reduction in perplexity, or 11% reduction in cross-entropy (bits), over that baseline.
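The two reported reductions are consistent with the standard relation between perplexity and cross-entropy, PPL = 2^H; a quick check using only the numbers given above:

```latex
% Perplexity vs. cross-entropy check (numbers from the abstract)
\mathrm{PPL} = 2^{H}, \qquad H_{\text{base}} = \log_2 74.4 \approx 6.22 \ \text{bits}
```

A 37% perplexity reduction gives 74.4 × 0.63 ≈ 46.9, i.e. H ≈ log₂ 46.9 ≈ 5.55 bits, which is roughly an 11% reduction in cross-entropy, matching the figure quoted in the abstract.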