Character-based Language Model

A random text model for the generation of statistical language invariants

2007

A novel random text generation model is introduced. Unlike previous random text models, which mainly aim at producing a Zipfian distribution of word frequencies, our model also takes the properties of neighboring co-occurrence into account and introduces the notion of sentences in random text. After pointing out the deficiencies of related models, we provide a generation process that takes neither the Zipfian distribution of word frequencies nor the small-world structure of the neighboring co-occurrence graph as a constraint. Nevertheless, these distributions emerge in the process. The distributions obtained with the random generation model are compared to a sample of natural language data, showing high agreement on word length and sentence length as well. This work proposes a plausible model for the emergence of large-scale characteristics of language without assuming a grammar or semantics.

Models of English text

Proceedings DCC '97. Data Compression Conference, 1997

The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models for English text have been the result of research into compression. Not only is this an important application of such models but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character based models, word based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.
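
The link between model quality and compression can be illustrated with a small sketch. It is not the model from the paper: it assumes a character bigram model with add-one smoothing, and the function names `train_char_bigram` and `bits_per_char` are hypothetical. The average number of bits per character that the model assigns to a text approximates the size an ideal coder driven by that model would achieve, so lower is better.

```python
import math
from collections import Counter


def train_char_bigram(text):
    """Count character bigrams and their left-context characters."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    alphabet = sorted(set(text))
    return pair_counts, context_counts, alphabet


def bits_per_char(model, text):
    """Average code length, in bits per predicted character, of `text`."""
    if len(text) < 2:
        raise ValueError("need at least two characters to score a bigram model")
    pair_counts, context_counts, alphabet = model
    # Assumption: add-one smoothing, with one extra slot reserved for
    # characters never seen in training (a simplification).
    v = len(alphabet) + 1
    total_bits = 0.0
    for prev, cur in zip(text, text[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        total_bits -= math.log2(p)
    return total_bits / (len(text) - 1)


if __name__ == "__main__":
    model = train_char_bigram("abababab")
    # Text that follows the training pattern should cost fewer bits.
    print(bits_per_char(model, "abab"), bits_per_char(model, "aabb"))
```

Word-based models and models that use parts of speech could be scored the same way, which is what makes compression a common yardstick across the three classes.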

Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

In natural language, several sequences of words are very frequent. A classical language model, such as an n-gram model, does not adequately take such sequences into account, because it underestimates their probabilities. A better approach consists in modelling word sequences as if they were individual dictionary elements. Sequences are considered as additional entries of the word lexicon, on which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information theoretic criteria, which ensure a high statistical consistency, and on French grammatical classes, which include additional types of linguistic dependencies. In addition, perplexity is used to make the decision of selecting a potential sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are more significant linguistically. The originality of this model, compared with the commonly used trigger approaches, is the use of word sequences to estimate the trigger pair without limiting itself to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20000 words and a corpus of 43 million words. The use of word sequences proposed by our algorithm reduces perplexity by more than 16% compared to models that are limited to single words. The introduction of these word sequences in our dictation machine improves the accuracy by approximately 15%.
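
The idea of treating a frequent word sequence as a single lexicon entry can be sketched as follows. The abstract does not state which information theoretic criterion is used, so this sketch assumes pointwise mutual information (PMI) as a stand-in and ignores the grammatical classes and the perplexity check. The names `best_pair_by_pmi` and `merge_pair` are hypothetical.

```python
import math
from collections import Counter


def best_pair_by_pmi(tokens, min_count=2):
    """Return the adjacent word pair with the highest PMI, or None."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    best, best_score = None, float("-inf")
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs give unreliable estimates
        # PMI = log2 p(a, b) / (p(a) p(b)), using n as the total
        # for both unigram and bigram counts (an approximation).
        pmi = math.log2(c * n / (unigrams[a] * unigrams[b]))
        if pmi > best_score:
            best, best_score = (a, b), pmi
    return best


def merge_pair(tokens, pair):
    """Rewrite the token stream so that `pair` becomes one lexicon entry."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    corpus = "the new york and the new york or the cat".split()
    pair = best_pair_by_pmi(corpus)
    print(pair, merge_pair(corpus, pair))
```

Repeating the select-and-merge step would build longer sequences; an n-gram model trained on the rewritten stream then sees each merged sequence as one word.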

IRJET-Comparative Study of Statistical and Neural Network Language Modelling in Sentence Generation

IRJET, 2020

Natural Language Processing (NLP) is a field in machine learning concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language. NLP is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. In this work, we propose a comparative study of statistical and neural network language models based on their accuracy. Both a statistical language model and a neural network language model are built. The developed models predict the expected outcomes based on the input given to the model. The input is converted from speech to text by automatic speech recognition. The outcome is used to identify which language model performs better.

A Maximum Entropy Approach for Semantic Language Modeling

2006

The conventional n-gram language model exploits only the immediate context of historical words without exploring long-distance semantic information. In this paper, we present a new information source extracted from latent semantic analysis (LSA) and adopt the maximum entropy (ME) principle to integrate it into an n-gram language model. With the ME approach, each information source serves as a set of constraints, which should be satisfied to estimate a hybrid statistical language model with maximum randomness. For comparative study, we also carry out knowledge integration via linear interpolation (LI). In the experiments on the TDT2 Chinese corpus, we find that the ME language model that combines the features of trigram and semantic information achieves a 17.9% perplexity reduction compared to the conventional trigram language model, and it outperforms the LI language model. Furthermore, in evaluation on a Mandarin speech recognition task, the ME and LI language models reduce the character error rate by 16.9% and 8.5%, respectively, over the bigram language model.
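
Of the two integration methods mentioned, linear interpolation is the simpler one and can be sketched in a few lines. This is only an illustration of the general technique, not the paper's implementation: the function name `interpolate`, the weight `lam`, and the toy distributions are assumptions. The maximum entropy combination is not shown, since it requires iterative training of constraint weights.

```python
def interpolate(p_ngram, p_semantic, lam):
    """Mix two next-word distributions: lam * n-gram + (1 - lam) * semantic.

    Both arguments map words to probabilities for the same history.
    If each input sums to 1, the mixture also sums to 1.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(p_ngram) | set(p_semantic)
    return {
        w: lam * p_ngram.get(w, 0.0) + (1.0 - lam) * p_semantic.get(w, 0.0)
        for w in vocab
    }


if __name__ == "__main__":
    # Toy distributions (made up for illustration).
    p_trigram = {"bank": 0.5, "river": 0.5}
    p_lsa = {"bank": 0.2, "river": 0.8}
    print(interpolate(p_trigram, p_lsa, lam=0.5))
```

In practice the weight would be tuned on held-out data, for example by choosing the value that minimizes perplexity.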

Statistical feature language model

Statistical language models are widely used in automatic speech recognition in order to constrain the decoding of a sentence. Most of these models derive from the classical n-gram paradigm. However, the production of a word depends on a large set of linguistic features: lexical, syntactic, semantic, etc. Moreover, in some natural languages the gender and number of the left context affect the production of the next word. Therefore, it seems attractive to design a language model based on a variety of word features. We present in this paper a new statistical language model, called Statistical Feature Language Model (SFLM), based on this idea. In SFLM a word is considered as an array of linguistic features, and the model is defined in a way similar to the n-gram model. Experiments carried out for French have shown an improvement in terms of perplexity and predicted words.

Open-Lexicon Language Modeling Combining Word and Character Levels

2014 14th International Conference on Frontiers in Handwriting Recognition, 2014

In this paper we investigate different n-gram language models that are defined over an open lexicon. We introduce a character-level language model and combine it with a standard word-level language model in a backoff fashion. The character-level language model is redefined and renormalized to assign zero probability to words from a fixed vocabulary. Furthermore we present a way to interpolate language models created at the word and character levels. The computation of character-level probabilities incorporates the across-word context. We compare perplexities on all words from the test set and on in-lexicon and OOV words separately on corpora of English and Arabic text.
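
The backoff from the word level to a renormalized character level can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the paper's model: it uses unigram probabilities at both levels (the paper uses n-grams with across-word context), and the names `char_word_prob`, `open_lexicon_prob`, `END`, and `<unk>` are hypothetical. The key step is dividing by one minus the character-level mass of in-vocabulary words, so that the character model assigns zero probability to words the word-level model already covers.

```python
END = "</w>"  # assumed end-of-word marker for the character model


def char_word_prob(word, char_probs):
    """Probability of spelling `word` character by character, then stopping."""
    p = char_probs[END]
    for ch in word:
        p *= char_probs.get(ch, 0.0)
    return p


def open_lexicon_prob(word, word_probs, char_probs, unk="<unk>"):
    """Word-level probability, backing off to characters for OOV words."""
    if word != unk and word in word_probs:
        return word_probs[word]
    # Mass the character model would spend on in-vocabulary spellings.
    in_vocab_mass = sum(
        char_word_prob(v, char_probs) for v in word_probs if v != unk
    )
    # Renormalize so the character model covers only OOV spellings,
    # then scale by the word-level probability reserved for <unk>.
    oov_prob = char_word_prob(word, char_probs) / (1.0 - in_vocab_mass)
    return word_probs[unk] * oov_prob


if __name__ == "__main__":
    chars = {"a": 0.4, "b": 0.4, END: 0.2}
    words = {"ab": 0.9, "<unk>": 0.1}
    print(open_lexicon_prob("ab", words, chars))  # in-lexicon word
    print(open_lexicon_prob("ba", words, chars))  # OOV word
```

With this renormalization the probabilities of in-lexicon words and of all possible OOV spellings together sum to one, which is what makes perplexities on in-lexicon and OOV words comparable.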

Statistical language modeling based on variable-length sequences

Computer Speech & Language, 2003

... Then, an evaluation of these language models in terms of perplexity and word error rate obtained with our ASR system MAUD (Fohr, Haton, Mari, Smaı̈li, & Zitouni, 1997; Zitouni & Smaı̈li, 1997) is reported in Section 5. Finally, we give in ... la Syrie → Damas,, RPR → gaullisme,. ...

Comprehensive Review of Large Language Models and its Applications

Nanotechnology Perfection, 2024

Large Language Models (LLMs), artificial intelligence systems designed for understanding and generating human language, have shown remarkable proficiency across a wide variety of complex natural language processing (NLP) tasks. This manuscript surveys the field and examines its most significant recent innovations. We first explain the fundamental principles underlying these models and then categorize them systematically by architecture. We next compare the dominant methodologies, setting out their respective advantages and disadvantages with respect to architectural design and empirical results in practical applications. We also explore prospective avenues for further research. Finally, we critically evaluate the practical implications of LLM performance and assessment, considering their environmental impact and the broader implications of their deployment in real-world scenarios.