Language models for contextual error detection and correction
Misspelling Correction with Pre-trained Contextual Language Model
2020 IEEE 19th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)
Spelling irregularities, now known as spelling mistakes, have existed for centuries. As humans, we can understand most misspelled words from their position in the sentence, their perceived pronunciation, and the surrounding context. Computer systems lack this kind of contextual inference: while many programs provide spelling correction functionality, most do not take context into account. Moreover, Artificial Intelligence systems behave according to the data they are trained on. Since many current Natural Language Processing (NLP) systems are trained on grammatically correct text, they are vulnerable to adversarial examples, yet correctly spelled text is crucial for learning. In this paper, we investigate how spelling errors can be corrected in context with the pre-trained language model BERT. We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections. Our results demonstrate that, when combined properly, BERT's contextual word embeddings and edit distance can effectively correct spelling errors.
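As an illustration of the edit-distance half of such a pipeline (the BERT masked-language-model ranking step is omitted here, and the function names are ours, not the paper's), a minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def candidates(misspelling: str, lexicon, max_dist: int = 2):
    # Candidate corrections within a small edit distance, nearest first;
    # a contextual model (e.g. BERT) would then re-rank this short list.
    scored = [(levenshtein(misspelling, w), w) for w in lexicon]
    return [w for d, w in sorted(scored) if d <= max_dist]
```

For example, `candidates("spel", ["spell", "spill", "apple"])` keeps only the two lexicon words within distance 2, ordered by closeness.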
A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, 2019
Spelling correction has attracted a lot of attention in the NLP community. However, models have usually been evaluated on artificially created or proprietary corpora. A publicly available corpus of authentic misspellings, annotated in context, is still lacking. To address this, we present and release an annotated data set of 6,121 spelling errors in context, based on a corpus of essays written by English language learners. We also develop a minimally supervised context-aware approach to spelling correction. It achieves strong results on our data: 88.12% accuracy. This approach can also be trained with a minimal amount of annotated data (performance drops by less than 1%). Furthermore, this approach allows easy portability to new domains. We evaluate our model on data from a medical domain and demonstrate that it rivals the performance of a model trained and tuned on in-domain data.
Scaling up context-sensitive text correction
2001
The main challenge in an effort to build a realistic system with context-sensitive inference capabilities, beyond accuracy, is scalability. This paper studies this problem in the context of a learning-based approach to context-sensitive text correction: the task of fixing spelling errors that result in valid words, such as substituting to for too, or casual for causal. Research papers on this problem have developed algorithms that achieve fairly high accuracy, in many cases over 90%.
Four types of context for automatic spelling correction
Traitement Automatique des Langues (TAL), 53:3, 61-99., 2012
This paper presents an investigation of using four types of contextual information to improve the accuracy of automatic correction of single-token non-word misspellings. The task is framed as contextually informed re-ranking of correction candidates. Immediate local context is captured by word n-gram statistics from a Web-scale language model. The second approach measures how well a candidate correction fits the semantic fabric of the local lexical neighborhood, using a very large Distributional Semantic Model. In the third approach, recognizing a misspelling as an instance of a recurring word can be useful for re-ranking. The fourth approach looks at context beyond the text itself: if the approximate topic can be known in advance, spelling correction can be biased towards that topic. The effectiveness of the proposed methods is demonstrated with an annotated corpus of 3,000 student essays from international high-stakes English language assessments. The paper also describes an implemented system that achieves high accuracy on this task.
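The first type of context, local word n-gram statistics, can be sketched as a candidate re-ranker. The bigram counts and smoothing below are toy stand-ins for a Web-scale model, not the paper's actual data:

```python
import math

# Toy bigram counts standing in for a Web-scale language model
# (assumption: a real system uses counts from far larger corpora).
BIGRAMS = {("a", "piece"): 50, ("piece", "of"): 80,
           ("a", "peace"): 2,  ("peace", "of"): 1}

def bigram_logprob(w1, w2, vocab_size=10_000):
    # Add-one smoothing so unseen bigrams get a small nonzero probability.
    total = sum(c for (a, _), c in BIGRAMS.items() if a == w1)
    return math.log((BIGRAMS.get((w1, w2), 0) + 1) / (total + vocab_size))

def rerank(left, cands, right):
    # Rank candidate corrections by how well they fit the local context.
    return sorted(cands,
                  key=lambda c: bigram_logprob(left, c) + bigram_logprob(c, right),
                  reverse=True)
```

Given the context "a _ of", this prefers the candidate piece over peace, since the surrounding bigrams are far more frequent.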
Generalized Character-Level Spelling Error Correction
We present a generalized discriminative model for spelling error correction which targets character-level transformations. While operating at the character level, the model makes use of word-level and contextual information. In contrast to previous work, the proposed approach learns to correct a variety of error types without guidance of manually selected constraints or language-specific features. We apply the model to correct errors in Egyptian Arabic dialect text, achieving 65% reduction in word error rate over the input baseline, and improving over the earlier state-of-the-art system.
New Language Models for Spelling Correction
The International Arab Journal of Information Technology
Correcting spelling errors based on the context is a fairly significant problem in Natural Language Processing (NLP) applications. The majority of work on introducing context into the spelling correction process uses n-gram language models. However, these models fail in several cases to give adequate probabilities for the suggested corrections of a misspelled word in a given context. To resolve this issue, we propose two new language models inspired by stochastic language models combined with edit distance. A first phase finds the lexicon words orthographically close to the erroneous word, and a second phase ranks and limits these suggestions. We have applied the new approach to the Arabic language, taking into account its characteristic of strong contextual connections between distant words in a sentence. To evaluate our approach, we have developed textual data processing applications, namely the extraction of distant transi...
2001
We report on the comparison of different strategies for correcting spelling errors resulting in non-existent words. Unlike interactive spelling checkers, where usually only the left context is available, the system we developed takes advantage of the entire context surrounding the misspelling. Moreover, unlike traditional systems based exclusively on a string-to-string edit distance and a word language model, we explore the use of part-of-speech information for selecting candidates. In conclusion, we show that spelling correction improves when the context is extended. The best results are obtained by combining a part-of-speech filter with a word language model and using both the left and right adjacent contexts.
Large scale experiments on correction of confused words
Proceedings 24th Australian Computer Science Conference. ACSC 2001
This paper describes a new approach to automatically learning contextual knowledge for spelling and grammar correction. We aim particularly to deal with cases where the words are all in the dictionary, so it is not obvious that there is an error. Traditional approaches are dictionary based, or use elementary tagging or partial parsing of the sentence to obtain context knowledge. Our approach uses affix information and only the most frequent words to reduce the complexity, in terms of training time and running time, of context-sensitive spelling correction. We build large-scale confused word sets based on keyboard adjacency and apply our new approach to learn the contextual knowledge needed to detect and correct them. We explore the performance of autocorrection under conditions where significance and probability are set by the user.
A Winnow-Based Approach to Context-Sensitive Spelling Correction
Machine Learning, 1999
A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts refer to only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. We present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting "to" for "too", "casual" for "causal", etc. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) The primary reason that WinSpell outperforms BaySpell is that WinSpell learns a better linear separator; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set.
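A minimal sketch of the multiplicative Winnow update rule underlying this line of work (the full WinSpell algorithm also combines Winnow variants with weighted-majority voting, which is omitted here; the threshold and promotion factor are illustrative choices, not the paper's):

```python
def winnow_train(examples, n_features, threshold=None, alpha=2.0):
    # Winnow keeps positive weights and updates them multiplicatively:
    # promote active features on false negatives, demote on false positives.
    theta = threshold if threshold is not None else n_features / 2
    w = [1.0] * n_features
    for _ in range(20):  # a few passes over the training data
        for x, y in examples:  # x: 0/1 feature vector, y: 0/1 label
            pred = 1 if sum(w[i] for i in range(n_features) if x[i]) >= theta else 0
            if pred == 0 and y == 1:      # promotion step
                for i in range(n_features):
                    if x[i]:
                        w[i] *= alpha
            elif pred == 1 and y == 0:    # demotion step
                for i in range(n_features):
                    if x[i]:
                        w[i] /= alpha
    return w, theta

def winnow_predict(w, theta, x):
    return 1 if sum(wi for wi, xi in zip(w, x) if xi) >= theta else 0
```

The multiplicative updates let Winnow converge quickly when only a small subset of a very high-dimensional feature space is relevant, which is exactly the regime the abstract describes for contextual features.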
Global Spelling Correction in Context using Language Models: Application to the Arabic Language
International Journal of Computing and Digital Systems
Automatic spelling correction is a very important task used in many Natural Language Processing (NLP) applications such as Optical Character Recognition (OCR), Information Retrieval, etc. There are many approaches able to detect and correct misspelled words. These approaches can be divided into two main categories: contextual and context-free approaches. In this paper, we propose a new contextual spelling correction method applied to the Arabic language, without loss of generality for other languages. The method is based on both the Viterbi algorithm and a probabilistic model built with a new estimate of n-gram language models combined with the edit distance. The probabilistic model is learned with an Arabic multipurpose corpus. The originality of our work consists in handling global and simultaneous correction of a set of many erroneous words within sentences. The experiments carried out demonstrate the performance of our proposal, giving encouraging results for the correction of several spelling errors in a given context. The method achieves a correction accuracy of up to 93.6% when evaluating the first correction suggestion. It is able to take into account strong links between distant words carrying meaning in a given context. The high correction accuracy of our method allows for its integration into many applications.
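Viterbi decoding over a lattice of correction candidates, as used in the method above, can be sketched as follows. The scoring function is deliberately abstract: the paper's system would supply smoothed n-gram probabilities combined with edit distance, whereas the toy scores and names below are ours:

```python
def viterbi_correct(candidate_lists, bigram_logprob):
    # Dynamic programming over the candidate lattice: for each candidate at
    # position t, keep the best-scoring path that ends in it, so the whole
    # sentence is corrected jointly rather than word by word.
    best = {c: (bigram_logprob("<s>", c), [c]) for c in candidate_lists[0]}
    for cands in candidate_lists[1:]:
        new_best = {}
        for c in cands:
            score, path = max(
                ((s + bigram_logprob(prev, c), p) for prev, (s, p) in best.items()),
                key=lambda t: t[0])
            new_best[c] = (score, path + [c])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Invented toy scores; a real system would use smoothed n-gram estimates.
LP = {("<s>", "I"): -1.0, ("I", "am"): -1.0, ("am", "here"): -1.0,
      ("I", "an"): -5.0, ("an", "here"): -5.0, ("<s>", "l"): -6.0}
def lp(a, b):
    return LP.get((a, b), -10.0)
```

For the lattice `[["I", "l"], ["am", "an"], ["here", "hare"]]`, this selects the globally best path `["I", "am", "here"]`: a locally plausible candidate at one position can be rejected because of the words it forces elsewhere.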
Korektor – A System for Contextual Spell-Checking and Diacritics Completion
2012
We present Korektor - a flexible and powerful purely statistical text correction tool for Czech that goes beyond a traditional spell checker. We use a combination of several language models and an error model to offer the best ordering of correction proposals and also to find errors that cannot be detected by simple spell checkers, namely spelling errors that happen to be homographs of existing word forms. Our system works also without any adaptation as a diacritics generator with the best reported results for Czech text. The design of Korektor contains no language-specific parts other than trained statistical models, which makes it highly suitable to be trained for other languages with available resources. The evaluation demonstrates that the system is a state-of-the-art tool for Czech, both as a spell checker and as a diacritics generator. We also show that these functions combine into a potential aid in the error annotation of a learner corpus of Czech.
Context-aware Stand-alone Neural Spelling Correction
Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Existing natural language processing systems are vulnerable to noisy inputs resulting from misspellings. In contrast, humans can easily infer the correct words from their misspellings and surrounding context. Inspired by this, we address the stand-alone spelling correction problem, which only corrects the spelling of each token without additional token insertion or deletion, by utilizing both spelling information and global context representations. We present a simple yet powerful solution that jointly detects and corrects misspellings as a sequence labeling task by fine-tuning a pre-trained language model. Our solution outperforms the previous state-of-the-art result by 12.8% absolute F0.5 score.
Grammatical and context-sensitive error correction using a statistical machine translation framework
Producing electronic rather than paper documents has considerable benefits, such as easier organization and data management. Therefore, automatic writing assistance tools such as spell and grammar checkers/correctors can increase the quality of electronic texts by removing noise and correcting erroneous sentences. The different kinds of errors in a text can be categorized into spelling, grammatical, and real-word errors. In this article, we present a language-independent approach based on a statistical machine translation framework to develop a proofreading tool, which detects grammatical errors as well as context-sensitive spelling mistakes (real-word errors). A hybrid model for grammar checking is suggested by combining the mentioned approach with an existing rule-based grammar checker. Experimental results on both English and Persian indicate that the proposed statistical method and the rule-based grammar checker are complementary in detecting and correcting syntactic errors. The results of the hybrid grammar checker, applied to some English texts, show an improvement of about 24% in recall with almost unchanged precision. Experiments on a real-world data set show that state-of-the-art results are achieved for grammar checking and context-sensitive spell checking for the Persian language.
Proc. 2nd Workshop Robust Methods in …, 2002
This article presents a robust probabilistic method for the detection of context-sensitive spelling errors. The algorithm identifies less-frequent grammatical constructions and attempts to transform them into more-frequent constructions while retaining similar syntactic structure. If the transformations result in low-frequency constructions, the text is likely to contain an error. A first unsupervised approach uses only information derived from
Neural Network Translation Models for Grammatical Error Correction
2016
Phrase-based statistical machine translation (SMT) systems have previously been used for the task of grammatical error correction (GEC) to achieve state-of-the-art accuracy. The superiority of SMT systems comes from their ability to learn text transformations from erroneous to corrected text, without explicitly modeling error types. However, phrase-based SMT systems suffer from limitations of discrete word representation, linear mapping, and lack of global context. In this paper, we address these limitations by using two different yet complementary neural network models, namely a neural network global lexicon model and a neural network joint model. These neural networks can generalize better by using continuous space representation of words and learn non-linear mappings. Moreover, they can leverage contextual information from the source sentence more effectively. By adding these two components, we achieve statistically significant improvement in accuracy for grammatical error correc...
Correcting real-word spelling errors by restoring lexical cohesion
Spelling errors that happen to result in a real word in the lexicon cannot be detected by a conventional spelling checker. We present a method for detecting and correcting many such errors by identifying tokens that are semantically unrelated to their context and are spelling variations of words that would be related to the context. Relatedness to context is determined by a previously proposed measure of semantic distance. We tested the method on an artificial corpus of errors; it achieved a recall of 23 to 50% and a precision of 18 to 25%.
Towards a single proposal in spelling correction
Proceedings of the 17th …, 1998
The study presented here relies on the integrated use of different kinds of knowledge in order to improve first-guess accuracy in non-word context-sensitive correction for general unrestricted texts. State-of-the-art spelling correction systems, e.g.
Applying Winnow to Context-Sensitive Spelling Correction
Computing Research Repository, 1996
Multiplicative weight-updating algorithms such as Winnow have been studied extensively in the COLT literature, but only recently have people started to use them in applications. In this paper, we apply a Winnow-based algorithm to a task in natural language: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting "to" for "too", "casual" for "causal", and so on. Previous approaches to this problem have been statistics-based; we compare Winnow to one of the more successful such approaches, which uses Bayesian classifiers. We find that: (1) When the standard (heavily pruned) set of features is used to describe problem instances, Winnow performs comparably to the Bayesian method; (2) When the full (unpruned) set of features is used, Winnow is able to exploit the new features and convincingly outperform Bayes; and (3) When a test set is encountered that is dissimilar to the training set, Winnow is better than Bayes at adapting to the unfamiliar test set, using a strategy we will present for combining learning on the training set with unsupervised learning on the (noisy) test set.
Proceedings of the …, 2011
We propose a novel way of incorporating dependency parse and word co-occurrence information into a state-of-the-art web-scale n-gram model for spelling correction. The syntactic and distributional information provides extra evidence in addition to that provided by a web-scale n-gram corpus, and especially helps with data sparsity problems. Experimental results show that introducing syntactic features into n-gram based models significantly reduces errors, by up to 12.4% over the current state-of-the-art.
Context's impact on the automatic spelling correction
International Journal of Artificial Intelligence and Soft Computing
This paper aims to shed light on a mechanism that exploits topical context information to improve the accuracy of automatic spell-checking systems. The study addresses a problem encountered by autocorrect spell-checking systems: the intended correction may be located in the last position of the suggestion list. We have implemented a set of techniques to build a context-oriented spelling corrector. The designed corrector essentially uses a dictionary that contains a probability distribution of word occurrences in various contexts; the latter is constructed from a collection of documents available on the internet.