Character-based Language Model

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract

Language modelling, like many other natural language processing tasks, is usually based on words. Here I present a more general yet simpler approach to language modelling that uses arbitrarily small units of text data: the character-based language model (CBLM). In this paper I describe the model's underlying data structure, evaluate the model using standard measures (entropy, perplexity), and, as a proof of concept and an extrinsic evaluation, present a random sentence generator based on the model.
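
The abstract does not reproduce the model's data structure, so the following is only an illustrative companion: a minimal character n-gram sketch in Python covering the three ingredients mentioned above (training on raw characters, entropy/perplexity evaluation, and random text generation). The class name CharNGram, add-one smoothing, and the fixed 256-symbol alphabet are assumptions made for the sketch, not the paper's method.

    import math
    import random
    from collections import defaultdict, Counter

    class CharNGram:
        """Toy character n-gram model with add-one smoothing (illustration only)."""

        def __init__(self, order=5):
            self.order = order
            self.counts = defaultdict(Counter)   # context string -> next-character counts

        def train(self, text):
            text = "\x02" * (self.order - 1) + text   # begin-of-text padding
            for i in range(len(text) - self.order + 1):
                ctx, nxt = text[i:i + self.order - 1], text[i + self.order - 1]
                self.counts[ctx][nxt] += 1

        def prob(self, ctx, ch):
            c = self.counts[ctx]
            return (c[ch] + 1) / (sum(c.values()) + 256)   # add-one over a byte-sized alphabet

        def cross_entropy(self, text):
            text = "\x02" * (self.order - 1) + text
            n = len(text) - self.order + 1
            bits = -sum(math.log2(self.prob(text[i:i + self.order - 1], text[i + self.order - 1]))
                        for i in range(n))
            return bits / n                      # bits per character

        def generate(self, length=200):
            out = "\x02" * (self.order - 1)
            for _ in range(length):
                c = self.counts[out[-(self.order - 1):]]
                if not c:
                    break
                chars, weights = zip(*c.items())
                out += random.choices(chars, weights=weights)[0]
            return out[self.order - 1:]

Perplexity is then two raised to the per-character cross-entropy, e.g. 2 ** lm.cross_entropy(held_out_text) after lm = CharNGram(); lm.train(training_text).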

A Maximum Entropy Approach for Semantic Language Modeling

2006

The conventional n-gram language model exploits only the immediate context of historical words without exploring long-distance semantic information. In this paper, we present a new information source extracted from latent semantic analysis (LSA) and adopt the maximum entropy (ME) principle to integrate it into an n-gram language model. With the ME approach, each information source serves as a set of constraints, which should be satisfied to estimate a hybrid statistical language model with maximum randomness. For comparative study, we also carry out knowledge integration via linear interpolation (LI). In the experiments on the TDT2 Chinese corpus, we find that the ME language model that combines the features of trigram and semantic information achieves a 17.9% perplexity reduction compared to the conventional trigram language model, and it outperforms the LI language model. Furthermore, in evaluation on a Mandarin speech recognition task, the ME and LI language models reduce the character error rate by 16.9% and 8.5%, respectively, over the bigram language model.
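
As a hedged sketch of the two combination schemes just described (the exact feature definitions are the paper's), linear interpolation (LI) and maximum-entropy (ME) integration of an n-gram source and an LSA semantic source can be written as

    P_{LI}(w \mid h) = \lambda \, P_{n\text{-gram}}(w \mid h) + (1 - \lambda) \, P_{LSA}(w \mid h)

    P_{ME}(w \mid h) = \frac{1}{Z_\Lambda(h)} \exp\!\Big( \sum_i \lambda_i f_i(h, w) \Big),
    \qquad Z_\Lambda(h) = \sum_{w'} \exp\!\Big( \sum_i \lambda_i f_i(h, w') \Big)

where the f_i are constraint (feature) functions derived from the trigram and LSA information sources, and the weights \lambda_i are trained so that the model's expected feature values match those observed in the training data.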

Statistical feature language model

Statistical language models are widely used in automatic speech recognition in order to constrain the decoding of a sentence. Most of these models derive from the classical n-gram paradigm. However, the production of a word depends on a large set of linguistic features: lexical, syntactic, semantic, etc. Moreover, in some natural languages the gender and number of the left context affect the production of the next word. Therefore, it seems attractive to design a language model based on a variety of word features. We present in this paper a new statistical language model, called the Statistical Feature Language Model (SFLM), based on this idea. In SFLM a word is considered as an array of linguistic features, and the model is defined in a way similar to the n-gram model. Experiments carried out for French show an improvement in terms of perplexity and predicted words.
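
A hedged sketch of the SFLM idea described above: treating each word as an array of K linguistic features (lexical form, part of speech, gender, number, ...), an n-gram-style factorization over feature arrays can be written as

    w_i = (f_i^1, \dots, f_i^K), \qquad
    P(w_1 \dots w_n) \approx \prod_{i=1}^{n}
    P\big( f_i^1, \dots, f_i^K \;\big|\; f_{i-1}^{1..K}, \dots, f_{i-m+1}^{1..K} \big)

i.e. the usual order-m conditioning, applied to feature arrays rather than to surface word forms; the exact decomposition and smoothing of the joint feature probability are the paper's.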

A Compositional Approach to Language Modeling

arXiv (Cornell University), 2016

Traditional language models treat language as a finite state automaton on a probability space over words. This is a very strong assumption when modeling something inherently complex such as language. In this paper, we challenge this by showing how the linear chain assumption inherent in previous work can be translated into a sequential composition tree. We then propose a new model that marginalizes over all possible composition trees thereby removing any underlying structural assumptions. As the partition function of this new model is intractable, we use a recently proposed sentence level evaluation metric Contrastive Entropy to evaluate our model. Given this new evaluation metric, we report more than 100% improvement across distortion levels over current state of the art recurrent neural network based language models.
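
In a hedged notation for the construction described above, the linear-chain factorization P(w_1 \dots w_n) = \prod_i P(w_i \mid w_{<i}) corresponds to a single left-to-right sequential composition tree, whereas the proposed model marginalizes over all composition trees:

    P(w_1 \dots w_n) = \sum_{T \in \mathcal{T}(w_1 \dots w_n)} P(w_1 \dots w_n, T)

with \mathcal{T} the set of admissible composition trees. It is the normalizer of this sum that the abstract reports as intractable, which motivates evaluating with the sentence-level Contrastive Entropy metric instead of perplexity.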

Open-Lexicon Language Modeling Combining Word and Character Levels

2014 14th International Conference on Frontiers in Handwriting Recognition, 2014

In this paper we investigate different n-gram language models that are defined over an open lexicon. We introduce a character-level language model and combine it with a standard word-level language model in a backoff fashion. The character-level language model is redefined and renormalized to assign zero probability to words from a fixed vocabulary. Furthermore we present a way to interpolate language models created at the word and character levels. The computation of character-level probabilities incorporates the across-word context. We compare perplexities on all words from the test set and on in-lexicon and OOV words separately on corpora of English and Arabic text.
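
A hedged sketch of the backoff combination described above (the paper's exact renormalization and weights are not reproduced here): with a fixed word vocabulary V and an OOV word spelled as characters c_1 \dots c_m,

    P(w \mid h) =
    \begin{cases}
      P_{word}(w \mid h) & \text{if } w \in V \\
      \beta(h) \, P_{char}(w \mid h) & \text{if } w \notin V
    \end{cases},
    \qquad
    P_{char}(w \mid h) = \prod_{j=1}^{m} P(c_j \mid c_{<j}, h)

where P_{char} is renormalized so that character strings forming in-vocabulary words receive zero probability, \beta(h) is the backoff mass, and the across-word context h enters the character-level probabilities; an interpolated variant instead mixes the two levels as \lambda P_{word} + (1 - \lambda) P_{char}.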

Statistical language modeling based on variable-length sequences

Computer Speech & Language, 2003

... Then, an evaluation of these language models in terms of perplexity and word error rate obtained with our ASR system MAUD (Fohr, Haton, Mari, Smaïli, & Zitouni, 1997; Zitouni & Smaïli, 1997) is reported in Section 5. Finally, we give in ... la Syrie → Damas, RPR → gaullisme ...

Comprehensive Review of Large Language Models and its Applications

Nanotechnology Perfection, 2024

It has become increasingly evident that Large Language Models (LLMs), sophisticated artificial intelligence systems designed for understanding and generating human language, have demonstrated remarkable proficiency across a wide variety of complex natural language processing (NLP) tasks. This manuscript offers a comprehensive examination of the field of LLMs and surveys the most significant recent innovations, advancements, and breakthroughs within this rapidly evolving discipline. We first elucidate the fundamental principles and core concepts underpinning these models and then categorize them systematically according to their architectural characteristics. We then conduct a comparative analysis of the dominant methodologies, delineating their respective advantages and disadvantages with regard to architectural design and the empirical results they yield in practical applications. We also identify prospective avenues for further research and domains that could benefit from future inquiry. Finally, the manuscript critically assesses the performance and evaluation of LLMs, their environmental impact, and the broader implications of their deployment in real-world scenarios.

A Weighted Maximum Entropy Language Model for Text

Proceeding of the 2nd …, 2005

The maximum entropy (ME) approach has been extensively used for various natural language processing tasks, such as language modeling, part-of-speech tagging, text segmentation and text classification. Previous work in text classification has used maximum entropy modeling with binary-valued features or counts of feature words. In this work, we present a method that applies maximum entropy modeling to text classification in a different way than it has been used so far, using weights both to select the features of the model and to emphasize the importance of each of them in the classification task. We use the chi-square test to assess the contribution of each candidate feature, rank the features by their chi-square values, and select those with the highest scores as the features of the model. Instead of using maximum entropy modeling in the classical way, we use the chi-square values to weight the features of the model, thus giving a different importance to each of them. The method has been evaluated on the Reuters-21578 dataset for text classification tasks, giving very promising results and performing comparably to some state-of-the-art systems in the classification field.
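
A minimal sketch of the weighting idea described above, assuming a bag-of-words representation and using scikit-learn's chi2 scorer with multinomial logistic regression standing in for the maximum-entropy classifier; the function name train_weighted_me, the cut-off k and the choice of libraries are illustrative assumptions, not the paper's implementation.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2
    from sklearn.linear_model import LogisticRegression

    def train_weighted_me(texts, labels, k=1000):
        vec = CountVectorizer()
        X = vec.fit_transform(texts)                 # candidate features = word counts
        scores = np.nan_to_num(chi2(X, labels)[0])   # chi-square score of each feature
        top = np.argsort(scores)[::-1][:k]           # keep the k highest-ranked features
        Xk = X[:, top].multiply(scores[top])         # weight each kept feature by its score
        clf = LogisticRegression(max_iter=1000)      # multinomial logistic regression (ME)
        clf.fit(Xk, labels)
        return vec, top, scores[top], clf

At prediction time the same selected columns and chi-square weights would be applied to the test-document counts before calling clf.predict.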

A Stochastic Language Model for Automatic Generation of Arabic Sentences

2006

Language modeling aims to capture general knowledge expressed in natural language. To this end, automatic sentence generation is an important operation in natural language processing: it can serve as the basis for various applications such as automatic translation and continuous speech recognition. In this article, we present a stochastic model that allows us to measure the probability of generating a sentence in Arabic from a set of words. The model rests on the assumption that a sentence is built from syntactic and semantic levels that are independent, which allows us to model each level with an appropriate model. The parameters of this model are estimated on a training corpus manually annotated with syntactic labels.
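
A hedged rendering of the factorization implied by the independence assumption above: writing t_1 \dots t_n for the syntactic labels of a candidate word sequence w_1 \dots w_n, one natural two-level decomposition (the paper's exact parameterization may differ) is

    P(w_1 \dots w_n) \;\approx\;
    \underbrace{\prod_{i=1}^{n} P(t_i \mid t_{i-1})}_{\text{syntactic level}}
    \;\times\;
    \underbrace{\prod_{i=1}^{n} P(w_i \mid t_i)}_{\text{lexical/semantic level}}

with both sets of probabilities estimated from the manually labelled training corpus.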

Stochastic text generators do not understand language, but they can help us get there

Draft, 2024

Large language models (LLMs) are essentially a massive bottom-up, data-driven experiment in reverse engineering of language at scale. With the massive amount of text ingested, the neural networks underlying these LLMs managed to learn the distribution of “ordinary spoken language” in such a way that they can subsequently draw on that distribution to generate grammatically correct and semantically coherent text in response to user prompts. As impressive as they seem, however, LLMs do not truly understand language, and their impressive ‘generative’ capabilities are not an indication of language competency. To accurately test the language understanding capabilities of these LLMs, we should prompt them with a snippet of text and embed a query that questions their understanding of the input text. Done properly, it becomes clear that these massive stochastic hashtables do not ‘understand’ language. However, as a massive associative memory of how humans sensibly talk about the world they live in, LLMs can be a powerful reverse engineering tool that can help us uncover the conceptual structure that seems to be implicitly assumed in our linguistic communication. The result of this process, which has been previously suggested by several luminaries in the philosophy of language and the philosophy of mind, is no less than the discovery of the ontology of natural language and the conceptual structure of the language of thought.

Language modeling via stochastic processes

2022

Modern language models can generate high-quality short texts. However, they often meander or are incoherent when generating longer texts. These issues arise from the next-token-only language modeling objective. To address these issues, we introduce Time Control (TC), a language model that implicitly plans via a latent stochastic process. TC does this by learning a representation which maps the dynamics of how text changes in a document to the dynamics of a stochastic process of interest. Using this representation, the language model can generate text by first implicitly generating a document plan via a stochastic process, and then generating text that is consistent with this latent plan. Compared to domain-specific methods and fine-tuning GPT2 across a variety of text domains, TC improves performance on text infilling and discourse coherence. On long text generation settings, TC preserves the text structure both in terms of ordering (up to +40% better) and text length consistency (up to +17% better). Human evaluators also prefer TC's output 28.6% more than the baselines.
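
The abstract names the latent process only abstractly, so the following is merely a hedged illustration of "first implicitly generating a document plan via a stochastic process": it samples a discretized bridge trajectory between an encoded start state and an encoded end state. The bridge construction, the function name and the fixed noise scale are assumptions for illustration, not necessarily TC's exact process.

    import numpy as np

    def sample_latent_plan(z_start, z_end, num_steps, sigma=1.0, seed=0):
        """Sample a latent trajectory that starts at z_start, ends at z_end and
        wanders with scale sigma in between; a decoder would then generate one
        text segment per latent step, conditioned on that step."""
        rng = np.random.default_rng(seed)
        z_start = np.asarray(z_start, dtype=float)
        z_end = np.asarray(z_end, dtype=float)
        plan, b = [z_start], z_start
        for t in range(1, num_steps + 1):
            remaining = num_steps - t + 1              # steps left until the endpoint
            mean = b + (z_end - b) / remaining         # drift toward the endpoint
            var = sigma ** 2 * (remaining - 1) / remaining
            b = mean + rng.normal(0.0, np.sqrt(var), size=b.shape)
            plan.append(b)
        return np.stack(plan)                          # shape: (num_steps + 1, latent_dim)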

A random text model for the generation of statistical language invariants

2007

A novel random text generation model is introduced. Unlike previous random text models, which mainly aim at producing a Zipfian distribution of word frequencies, our model also takes the properties of neighboring co-occurrence into account and introduces the notion of sentences in random text. After pointing out the deficiencies of related models, we provide a generation process that takes neither the Zipfian distribution of word frequencies nor the small-world structure of the neighboring co-occurrence graph as a constraint. Nevertheless, these distributions emerge in the process. The distributions obtained with the random generation model are compared to a sample of natural language data, showing high agreement also on word length and sentence length. This work proposes a plausible model for the emergence of large-scale characteristics of language without assuming a grammar or semantics.

Models of English text

Proceedings DCC '97. Data Compression Conference, 1997

The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models for English text have been the result of research into compression. Not only is this an important application of such models but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character based models, word based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.
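
The link between compression and model quality exploited above can be written out: a model that assigns probability P(c_i \mid c_1 \dots c_{i-1}) to each character can, with an arithmetic coder, compress the text to approximately its cross-entropy

    H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(c_i \mid c_1 \dots c_{i-1}) \quad \text{bits per character},

so fewer bits per character means a better model of English text; the corresponding per-character perplexity is 2^{H}.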

Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

In natural language, several sequences of words are very frequent. A classical language model, like the n-gram, does not adequately take such sequences into account, because it underestimates their probabilities. A better approach consists in modelling word sequences as if they were individual dictionary elements: sequences are considered as additional entries of the word lexicon, on which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependencies. In addition, perplexity is used in order to make the decision to select a potential sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are more significant linguistically. The originality of this model, compared with commonly used trigger approaches, is the use of word sequences to estimate the trigger pairs without being limited to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words. The use of the word sequences proposed by our algorithm reduces perplexity by more than 16% compared to models limited to single words. The introduction of these word sequences in our dictation machine improves accuracy by approximately 15%.
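
As a hedged sketch of the phrase-selection step described above, the snippet below scores adjacent word pairs by pointwise mutual information, one possible information-theoretic criterion; the paper additionally exploits French grammatical classes and a perplexity-based decision, which this toy selection omits.

    import math
    from collections import Counter

    def candidate_bigram_phrases(tokens, min_count=5, top_k=100):
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        scored = []
        for (w1, w2), c in bigrams.items():
            if c < min_count:                      # ignore rare, statistically unreliable pairs
                continue
            pmi = math.log2((c / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
            scored.append((pmi, w1 + "_" + w2))
        return [phrase for _, phrase in sorted(scored, reverse=True)[:top_k]]

Selected phrases would then be added to the lexicon as single entries before the n-gram model is re-estimated over the rewritten corpus.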

IRJET-Comparative Study of Statistical and Neural Network Language Modelling in Sentence Generation

IRJET, 2020

Natural Language Processing (NLP) is a field of machine learning concerned with a computer's ability to understand, analyze, manipulate, and potentially generate human language. NLP is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. In this work, we present a comparative study of statistical and neural network language models based on the accuracy of their performance. Both a statistical language model and a neural network language model are built. The developed models predict the expected outcomes based on the input given to them. The input is converted from speech to text by automatic speech recognition. The outcomes are used to identify which language model performs better.

A Language Modelling Tool for Statistical NLP

Anais do V Workshop em Tecnologia da …, 2007

In recent years the use of statistical language models (SLMs) has become widespread in most NLP fields. In this work we introduce jNina, a basic language modelling tool to aid the development of Machine Translation systems and many other text-generating applications. The tool allows for the quick comparison of multiple text outputs (e.g., alternative translations of a single source) based on a given SLM, and enables the user to build and evaluate her own SLMs from any corpora provided.

A New Language Model Based on Possibility Theory

Lecture Notes in Computer Science, 2018

Language modeling is a very important step in several NLP applications. Most current language models are based on probabilistic methods. In this paper, we propose a new language modeling approach based on possibility theory. Our goal is to suggest a method for estimating the possibility of a word sequence and to test this new approach in a machine translation system. We propose a word-sequence possibilistic measure, which can be estimated from a corpus. We proceeded in two ways: first, we checked the behavior of the new approach compared with existing work. Second, we compared the new language model with the probabilistic one used in statistical MT systems. The results, in terms of the METEOR metric, show that the possibilistic language model is better than the probabilistic one. However, in terms of BLEU and TER scores, the probabilistic model remains better.

One billion word benchmark for measuring progress in statistical language modeling

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 74.4. A combination of techniques leads to 37% reduction in perplexity, or 11% reduction in cross-entropy (bits), over that baseline.
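
The two reductions quoted above are consistent, as a short worked check shows (perplexity is two raised to the cross-entropy in bits):

    \mathrm{PPL} = 2^{H}, \qquad H_{\text{KN5}} = \log_2 74.4 \approx 6.22 \text{ bits};
    \qquad 0.63 \times 74.4 \approx 46.9, \qquad \log_2 46.9 \approx 5.55 \text{ bits},
    \qquad \frac{6.22 - 5.55}{6.22} \approx 11\%.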

Stochastic automata for language modeling

Computer Speech & Language, 1996

Stochastic language models are widely used in spoken language understanding to recognize and interpret the speech signal: the speech samples are decoded into word transcriptions by means of acoustic and syntactic models and then interpreted according to a semantic model. Both for speech recognition and understanding, search algorithms use stochastic models to extract the most likely uttered sentence and its corresponding interpretation. The design of the language models has to be effective in order to strongly constrain the search algorithms, and has to be efficient to comply with storage space limits. In this work we present the Variable N-gram Stochastic Automaton (VNSA) language model, which provides a unified formalism for building a wide class of language models. First, this approach allows for the use of accurate language models for large vocabulary speech recognition by using the standard search algorithm in the one-pass Viterbi decoder. Second, the unified formalism is an effective approach to incorporating different sources of information for computing the probability of word sequences. Third, the VNSAs are well suited for those applications where speech and language decoding cascades are implemented through weighted rational transductions. The VNSAs have been compared to standard bigram and trigram language models, and their reduced set of parameters does not by any means affect performance in terms of perplexity. The design of a stochastic language model through the VNSA is described and applied to word and phrase class-based language models. The effectiveness of VNSAs has been tested within the Air Travel Information System (ATIS) task to build the language model for the speech recognition and language understanding system.

A Survey of Text Generation Models

International Journal of Computer Applications

Natural language processing (NLP) is a field of linguistics and computer science which focuses on the interaction of humans and computers. The main aim of natural language processing is to make sure that a computer can understand what a human says and possibly get key insights from the auditory data. Natural language text production is a well-known sub-part of NLP which focuses on converting auditory data from spoken languages into text. This survey aims to shed light on crucial details about the past, present and future of text production algorithms, and to provide a comprehensive overview of how different machine learning techniques are being investigated and studied for different NLP applications. Finally, some important research gaps identified through the review are highlighted as the study draws to a close. This study also aims to synthesize a guide for beginners in this field and to point them towards related research and popular practices.