New Language Models for Spelling Correction

Global Spelling Correction in Context using Language Models: Application to the Arabic Language

International Journal of Computing and Digital Systems

Automatic spelling correction is an important task in many Natural Language Processing (NLP) applications, such as Optical Character Recognition (OCR) and information retrieval. Many approaches can detect and correct misspelled words, and they fall into two main categories: contextual and context-free. In this paper, we propose a new contextual spelling correction method applied to the Arabic language, without loss of generality for other languages. The method is based on both the Viterbi algorithm and a probabilistic model built with a new estimate of n-gram language models combined with the edit distance. The probabilistic model is learned from an Arabic multipurpose corpus. The originality of our work lies in handling the global and simultaneous correction of several erroneous words within a sentence. The experiments show encouraging results for the correction of several spelling errors in a given context. The method achieves a correction accuracy of up to 93.6% when only the first correction suggestion is evaluated. It is able to take into account strong links between distant, meaning-bearing words in a given context. The high correction accuracy of our method allows for its integration into many applications.
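To illustrate how a Viterbi search can correct several misspelled words in one sentence at once, here is a minimal sketch. It is not the paper's method: the paper's "new estimate" of n-gram probabilities is not given in the abstract, so an add-one smoothed bigram model stands in for it, and the names `viterbi_correct`, `make_bigram_logp`, `max_dist` and `penalty` are hypothetical. The toy corpus is English purely for readability.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def make_bigram_logp(sentences, vocab):
    """Add-one smoothed bigram log-probabilities (an assumed stand-in model)."""
    context, bigram = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split()
        for a, b in zip(toks, toks[1:]):
            context[a] += 1
            bigram[(a, b)] += 1
    size = len(vocab)
    return lambda a, b: math.log((bigram[(a, b)] + 1) / (context[a] + size))

def viterbi_correct(words, vocab, logp, max_dist=2, penalty=2.0):
    """Pick the jointly best correction sequence for the whole sentence."""
    # Hidden states at each position: vocabulary words close to the observed word.
    cands = [[v for v in vocab if edit_distance(w, v) <= max_dist] or [w]
             for w in words]
    # Emission score: each edit operation costs `penalty` log-units (assumption).
    emit = lambda obs, c: -penalty * edit_distance(obs, c)
    best = {c: (logp("<s>", c) + emit(words[0], c), [c]) for c in cands[0]}
    for t in range(1, len(words)):
        step = {}
        for c in cands[t]:
            # Best predecessor for candidate c under the bigram model.
            score, path = max(((s + logp(p, c), pth) for p, (s, pth) in best.items()),
                              key=lambda x: x[0])
            step[c] = (score + emit(words[t], c), path + [c])
        best = step
    return max(best.values(), key=lambda x: x[0])[1]

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = {w for s in corpus for w in s.split()}
logp = make_bigram_logp(corpus, vocab)
result = viterbi_correct("the cta sat on teh mat".split(), vocab, logp)
print(" ".join(result))
```

Because every position is decoded jointly, a correction chosen for one word influences the score of its neighbours, which is the property the abstract describes as global and simultaneous correction.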

Arabic Spelling Error Detection and Correction

Arabic Spelling Error Detection and Correction, 2015

A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.

Arabic spellchecking: a depth-filtered composition metric to achieve fully automatic correction

International Journal of Electrical and Computer Engineering (IJECE), 2023

Digital environments for human learning have evolved considerably in recent years thanks to advances in information technologies. Computer assistance for text creation and editing tools represents a future market in which natural language processing (NLP) concepts will be used. This is particularly the case for the automatic correction of spelling mistakes, which data operators use daily. Unfortunately, these spellcheckers are regarded as writing aids: they cannot perform this task automatically without the user's assistance. In this paper, we suggest a filtered composition metric based on the weighting of two lexical similarity distances in order to achieve auto-correction. The approach has two phases. The first correction phase combines two well-known distances: the edit distance, weighted by the relative proximity of keys on the Arabic keyboard and by the calligraphic similarity between letters of the Arabic alphabet, and the Jaro-Winkler distance, which is used to better weight and filter solutions that have the same metric. The second phase acts as a booster of the first: it applies a probabilistic bigram language model to candidate solutions that still have the same lexical similarity measure after the first correction phase. Evaluating our filtered composition measure on a dataset of errors yielded an auto-correction rate of 96%. This is an open access article under the CC BY-SA license.
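The first phase described above can be sketched as a weighted edit distance with Jaro-Winkler similarity as a tie-breaker. This is only an illustration under stated assumptions: the abstract does not give the actual weights or the exact combination rule, so the substitution cost of 0.5 and the sort order are hypothetical, keyboard proximity is left out, and `SHAPE_GROUPS` lists only letter pairs that differ by their dots.

```python
# Letters sharing a base shape and differing only in dots (toy table).
SHAPE_GROUPS = ["بتث", "جحخ", "دذ", "رز", "سش", "صض", "طظ", "عغ"]

def shape_sub_cost(a, b):
    """Substitution cost: 0 if equal, 0.5 if similar shape (assumed weight), else 1."""
    if a == b:
        return 0.0
    return 0.5 if any(a in g and b in g for g in SHAPE_GROUPS) else 1.0

def weighted_edit_distance(s, t, sub_cost=shape_sub_cost):
    """Levenshtein distance with a pluggable substitution cost."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [float(i)]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0,
                           prev[j - 1] + sub_cost(cs, ct)))
        prev = cur
    return prev[-1]

def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s)):
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            out_of_order += s[i] != t[k]
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def rank(observed, lexicon):
    """Sort by weighted distance; break ties with higher Jaro-Winkler similarity."""
    return sorted(lexicon, key=lambda w: (weighted_edit_distance(observed, w),
                                          -jaro_winkler(observed, w)))

ranked = rank("كتات", ["كتان", "كاتب", "كتاب"])
print(ranked[0])
```

Candidates that remain tied after both measures would, in the approach described above, be passed to the bigram language model of the second phase.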

Arabic Spelling Correction using Supervised Learning

Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014

In this work, we address the problem of spelling correction in the Arabic language using the new corpus provided by the QALB (Qatar Arabic Language Bank) project, an annotated corpus of sentences with errors and their corrections. The corpus contains edit, add before, split, merge, add after, move and other error types. We are concerned with the first four error types, as they account for more than 90% of the spelling errors in the corpus. The proposed system uses a separate model for each error type and then integrates all the models into an efficient and robust system that achieves an overall recall of 0.59, precision of 0.58 and F1 score of 0.58 on the development set, including all the error types. Our system participated in the QALB 2014 shared task "Automatic Arabic Error Correction" and achieved an F1 score of 0.6, earning sixth place out of nine participants.

Efficient Weighted Edit Distance and N-gram Language Models to Improve Spelling Correction of Segmentation Errors

International Journal of Advanced Computer Science and Applications, 2021

Most research on the correction of spelling errors does not tackle errors caused by the misuse of the space (deletion or insertion of a space). Leaving this type of error untreated makes the sentences that contain it ambiguous and hard to understand. In this article, we propose a new approach that corrects errors due to the insertion of a space in a word and, at the same time, corrects other types of editing errors. This approach is based on the edit distance and uses bigram language models to correct words in context. A test conducted on hundreds of erroneous words (with an inserted space and/or simple editing errors) made it possible to assess the relevance and validity of the methods developed to correct this type of error. Compared with other existing approaches, the approaches proposed in this article provide a very important clarification and reminder.
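A rough sketch of the idea of repairing an inserted space together with an ordinary editing error: adjacent tokens are joined, the joined string is matched against the lexicon within a small edit distance, and bigram counts with the previous word break ties. The function name `merge_split_words`, the greedy left-to-right strategy and the `max_dist` threshold are assumptions for illustration, not the article's algorithm.

```python
from collections import Counter

def edit_distance(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def merge_split_words(tokens, vocab, bigram_counts, max_dist=1):
    """Greedily rejoin word fragments separated by a spurious space."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        # Only try a merge when at least one of the two tokens is unknown.
        if i + 1 < len(tokens) and not (tok in vocab and tokens[i + 1] in vocab):
            joined = tok + tokens[i + 1]
            cands = [v for v in vocab if edit_distance(joined, v) <= max_dist]
            if cands:
                prev = out[-1] if out else "<s>"
                # Prefer the closest word, then the one most frequent after `prev`.
                best = min(cands, key=lambda v: (edit_distance(joined, v),
                                                 -bigram_counts[(prev, v)], v))
                out.append(best)
                i += 2
                continue
        out.append(tok)
        i += 1
    return out

vocab = {"check", "the", "spelling", "spell", "now"}
bigram_counts = Counter({("the", "spelling"): 3})  # toy counts, hypothetical
fixed = merge_split_words("check the spe ling now".split(), vocab, bigram_counts)
print(" ".join(fixed))
```

Here "spe ling" is joined to "speling", which lies one edit away from "spelling", so the space error and the missing letter are repaired in a single step.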

Four types of context for automatic spelling correction

Traitement Automatique des Langues (TAL), 53:3, 61-99, 2012

Flor M. (2012). Four types of context for automatic spelling correction. Traitement Automatique des Langues (TAL), 53:3, 61-99. This paper presents an investigation of four types of contextual information for improving the accuracy of automatic correction of single-token non-word misspellings. The task is framed as contextually-informed re-ranking of correction candidates. Immediate local context is captured by word n-gram statistics from a Web-scale language model. The second approach measures how well a candidate correction fits in the semantic fabric of the local lexical neighborhood, using a very large Distributional Semantic Model. In the third approach, recognizing a misspelling as an instance of a recurring word can be useful for re-ranking. The fourth approach looks at context beyond the text itself: if the approximate topic is known in advance, spelling correction can be biased towards the topic. The effectiveness of the proposed methods is demonstrated with an annotated corpus of 3,000 student essays from international high-stakes English language assessments. The paper also describes an implemented system that achieves high accuracy on this task.

Improved Spelling Error Detection and Correction for Arabic

A spelling error detection and correction application is based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system.

Improving SpellChecking: an effective Ad-Hoc probabilistic lexical measure for general typos

Indonesian Journal of Electrical Engineering and Computer Science

Since human beings first learned to write, mistakes made in writing words have occupied a privileged place in linguistic studies, bringing new disciplines such as spelling and dictation into school curricula. According to the exhaustive studies we have carried out on spelling errors made when typing Arabic texts, very few research works deal with typographical errors caused specifically by the insertion or omission of the blank space in words. Moreover, spelling correction software remains ineffective at handling this type of error. Failure to process errors due to the insertion or omission of a blank space between and within words leads to ambiguity and to misunderstanding of the meaning of the typed text. To remedy this limitation, we propose in this article an ad-hoc probabilistic method based jointly on two approaches. The first approach treats the errors due to deletion or missing of blank-space betwe...

Using Part-of-Speech and Word-Sense Disambiguation for Boosting String-Edit Distance Spelling Correction

Lecture Notes in Computer Science, 2001

We report on the design of a system for correcting spelling errors that result in non-existent words. The system aims at improving the editing of medical reports. Unlike traditional systems, it considers both semantic and syntactic contexts. The system is organized in three steps. The first module is based on a context-independent string-to-string edit distance calculation. The second module, based on the morpho-syntactic context, attempts to rank the candidate set provided by the first module more relevantly. Finally, a third contextual module processes words with the same part-of-speech by applying contextual word-sense disambiguation. Modules 2 and 3 use both hand-written rules and data-driven Markovian matrices. A final evaluation shows a significant improvement compared to context-free spelling correction.

Toward filling the gap between interactive and fully-automatic spelling correction using the linguistic context

2001

We report on the comparison of different strategies for correcting spelling errors that result in non-existent words. Unlike interactive spelling checkers, where usually only the left context is available, the system we developed takes advantage of the entire context surrounding the misspelling. Moreover, unlike traditional systems, which are based exclusively on a string-to-string edit distance and a word language model, we explore the use of the part-of-speech for selecting candidates. In conclusion, we show that spelling correction improves when the context is extended. The best results are obtained when combining a part-of-speech filter with a word language model, and using both the left and right adjacent contexts.
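The combination reported as best, a part-of-speech filter followed by a word language model over the left and right neighbours, might look roughly like the sketch below. The tag lexicon `pos_of`, the table `allowed_tag_pairs` and the toy bigram counts are all hypothetical; the abstract does not describe how the filter or the language model is actually built.

```python
from collections import Counter

def rank_candidates(left, right, candidates, bigrams, pos_of, allowed_tag_pairs):
    """Filter candidates by part-of-speech, then score with both adjacent words."""
    # POS filter: keep candidates whose tag may follow the tag of the left word.
    kept = [c for c in candidates
            if (pos_of[left], pos_of[c]) in allowed_tag_pairs]
    kept = kept or list(candidates)  # fall back if the filter removes everything

    def score(c):
        # Add-one counts so an unseen bigram does not zero out the product.
        return (bigrams[(left, c)] + 1) * (bigrams[(c, right)] + 1)

    return sorted(kept, key=lambda c: (-score(c), c))

# Toy data for the misspelling in "a hihg fever" (all values are made up).
pos_of = {"a": "DET", "high": "ADJ", "hide": "VERB", "sigh": "NOUN"}
allowed_tag_pairs = {("DET", "ADJ"), ("DET", "NOUN")}
bigrams = Counter({("a", "high"): 5, ("high", "fever"): 8, ("a", "sigh"): 2})
ranking = rank_candidates("a", "fever", ["hide", "sigh", "high"],
                          bigrams, pos_of, allowed_tag_pairs)
print(ranking)
```

An interactive checker that sees only the left context would drop the second factor of the score, which is the contrast the abstract draws.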