Si Lhoussain Aouragh | UNIVERSITE MOHAMED V RABAT

Papers by Si Lhoussain Aouragh

Integration of data sources in an automatic corrector of Arabic texts

2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), 2016

Unlike French and English, the richness and ambiguity of written Arabic cause a great many errors in texts. The purpose of this article is to address the tolerance of certain errors in Arabic texts and to develop a system for their automatic detection and correction. This work combines the Levenshtein Distance (LD) with bi-context language models trained on a large corpus. The method covers the automatic detection and correction of both non-word errors and real-word errors, and consists of two parts. The first part extracts the context information from the training corpus, that is, it builds the vocabulary and the database of the bi-context language models. The second part implements the automatic detection and correction of the incorrect words. The experimental results show a successful extraction of the context information from the corpus, and they demonstrate that the system reduces the rate of errors in Arabic texts.
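The detection-and-correction pipeline described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: it uses a toy vocabulary and the plain Levenshtein distance for candidate generation, and it omits the bi-context language-model ranking.

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def detect_and_suggest(word, vocabulary, max_distance=2):
    # a word absent from the vocabulary is flagged as a non-word error;
    # suggestions are vocabulary entries within the edit-distance bound
    if word in vocabulary:
        return []
    return sorted(w for w in vocabulary if levenshtein(word, w) <= max_distance)

vocab = {"model", "corpus", "context", "correct"}
print(detect_and_suggest("corect", vocab))  # → ['correct']
```

In the paper's system, the shortlist produced this way would then be reranked by the bi-context language models rather than returned alphabetically.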

A new spell-checking approach based on the user profile

International Journal of Computing and Digital Systems

This paper presents a new spell-checking approach based on the user profile that can be applied to any language. For this purpose, and for the specific case of Arabic, spelling errors are studied and divided into 18 types, and a relationship model between users and their errors is derived. The proposed architecture first assigns appropriate profile values for the current user, then corrects misspelled words by applying the spelling rules; the remaining words are corrected based on the probability given by the adopted model of the profile values. To show the efficiency of our profile-based approach, we conducted an experiment with a corpus of 11,908 words containing 1,888 errors. Our approach suggests the correct word 88.43% of the time and ranks it within the first four positions 75.14% of the time. Moreover, using the same corpus we compared our implemented tool with two existing ones; ours ranked better than Sahehly 69.79% of the time and better than MS Word 77.63% of the time.
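As a rough illustration of the profile idea, the sketch below ranks candidate corrections by how probable the user's profile says each error type is. The three error types and the length-based classifier are hypothetical simplifications of the paper's 18 types.

```python
# Hypothetical illustration: rank candidates by how often this user's
# profile says they make the corresponding error type.
def classify_error(wrong: str, candidate: str) -> str:
    # crude error typing by length difference (the paper distinguishes 18 types)
    if len(wrong) < len(candidate):
        return "deletion"      # the user dropped a character
    if len(wrong) > len(candidate):
        return "insertion"     # the user added a character
    return "substitution"

def rank_by_profile(wrong, candidates, profile):
    # profile: error type -> probability that this user makes that error
    return sorted(candidates,
                  key=lambda c: profile.get(classify_error(wrong, c), 0.0),
                  reverse=True)

profile = {"substitution": 0.6, "deletion": 0.3, "insertion": 0.1}
print(rank_by_profile("cat", ["cart", "cut", "at"], profile))
# → ['cut', 'cart', 'at']
```

A user prone to substitutions thus sees substitution-style candidates first, which is the core of the personalization.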

A new estimate of the n-gram language model

Procedia Computer Science, 2021

The Large Annotated Corpus for the Arabic Language (LACAL)

Studies in Computational Intelligence, 2022

Global Spelling Correction in Context using Language Models: Application to the Arabic Language

International Journal of Computing and Digital Systems

Automatic spelling correction is an important task used in many Natural Language Processing (NLP) applications such as Optical Character Recognition (OCR), information retrieval, etc. Many approaches can detect and correct misspelled words; they fall into two main categories: contextual and context-free. In this paper, we propose a new contextual spelling correction method applied to the Arabic language, without loss of generality for other languages. The method is based on the Viterbi algorithm and a probabilistic model built with a new estimate of n-gram language models combined with the edit distance. The probabilistic model is learned from an Arabic multipurpose corpus. The originality of our work lies in the global and simultaneous correction of many erroneous words within a sentence. The experiments carried out demonstrate the performance of our proposal, giving encouraging results for the correction of several spelling errors in a given context. The method achieves a correction accuracy of up to 93.6% when evaluating the first correction suggestion, and it can take into account strong links between distant meaning-bearing words in a given context. This high correction accuracy allows for its integration into many applications.
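A minimal sketch of Viterbi decoding over a per-word candidate lattice, combining a bigram language model with a channel (edit-distance-derived) probability. The probabilities below are toy values; the paper's actual n-gram estimate and Arabic corpus are not reproduced here.

```python
import math

def viterbi_correct(observed, candidates, bigram_p, channel_p):
    """Choose one candidate per observed word so that the product of
    bigram language-model and channel probabilities is maximal."""
    def lp(d, k):
        return math.log(d.get(k, 1e-9))  # floor unseen events

    # initialise with the sentence-start symbol
    best = {c: (lp(bigram_p, ("<s>", c)) + lp(channel_p, (observed[0], c)), [c])
            for c in candidates[0]}
    for i in range(1, len(observed)):
        nxt = {}
        for c in candidates[i]:
            score, path = max(
                ((s + lp(bigram_p, (p, c)) + lp(channel_p, (observed[i], c)),
                  pth + [c])
                 for p, (s, pth) in best.items()),
                key=lambda t: t[0])
            nxt[c] = (score, path)
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

bigram_p = {("<s>", "the"): 0.5, ("<s>", "ten"): 0.1,
            ("the", "cat"): 0.4, ("the", "cut"): 0.05,
            ("ten", "cat"): 0.01, ("ten", "cut"): 0.01}
channel_p = {("teh", "the"): 0.8, ("teh", "ten"): 0.2,
             ("cat", "cat"): 0.9, ("cat", "cut"): 0.1}
print(viterbi_correct(["teh", "cat"], [["the", "ten"], ["cat", "cut"]],
                      bigram_p, channel_p))  # → ['the', 'cat']
```

Because all words are decoded jointly, a correction chosen for one word influences the choices for its neighbours, which is what "global and simultaneous" correction means here.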

Improving SpellChecking: an effective Ad-Hoc probabilistic lexical measure for general typos

Indonesian Journal of Electrical Engineering and Computer Science

Since human beings began learning to write, typing mistakes have occupied a privileged place in linguistic studies, introducing new disciplines such as spelling and dictation into school curricula. According to exhaustive studies that we have carried out on spell-checking errors made in typing Arabic texts, very few research works deal with typographical errors specifically caused by the insertion or omission of the blank space in words. Moreover, spelling correction software remains ineffective in handling this type of error. Failure to process errors due to the insertion or omission of the blank space between and within words leads to ambiguity and incomprehension of the meaning of the typed text. To remedy this limitation, we propose in this article an ad-hoc probabilistic method based jointly on two approaches. The first approach treats the errors due to deletion or omission of the blank space betwe...
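A toy sketch of blank-space repair against a vocabulary: splitting a run-together token and merging wrongly separated fragments. The paper's probabilistic scoring is replaced here by simple vocabulary-membership tests.

```python
def fix_spaces(tokens, vocab):
    """Repair blank-space typos against a known vocabulary: split a
    run-together token, or merge two fragments that form one word."""
    out, i = [], 0
    while i < len(tokens):
        t = tokens[i]
        if t not in vocab:
            # missing space: look for a split point giving two known words
            split = next(((t[:k], t[k:]) for k in range(1, len(t))
                          if t[:k] in vocab and t[k:] in vocab), None)
            if split:
                out.extend(split)
                i += 1
                continue
            # inserted space: merge with the next token if that is a word
            if i + 1 < len(tokens) and t + tokens[i + 1] in vocab:
                out.append(t + tokens[i + 1])
                i += 2
                continue
        out.append(t)
        i += 1
    return out

vocab = {"spell", "checker", "works", "well"}
print(fix_spaces(["spellchecker", "wor", "ks"], vocab))
# → ['spell', 'checker', 'works']
```

A probabilistic version, as in the paper, would score competing split/merge hypotheses instead of accepting the first one that yields known words.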

New Language Models for Spelling Correction

The International Arab Journal of Information Technology

Correcting spelling errors based on context is a significant problem in Natural Language Processing (NLP) applications. Most work that introduces context into the spelling-correction process uses n-gram language models. However, these models fail in several cases to give adequate probabilities for the suggested corrections of a misspelled word in a given context. To resolve this issue, we propose two new language models inspired by stochastic language models combined with edit distance. A first phase finds the words of the lexicon orthographically close to the erroneous word, and a second phase ranks and limits these suggestions. We have applied the new approach to the Arabic language, taking into account its specificity of having strong contextual connections between distant words in a sentence. To evaluate our approach, we have developed textual data processing applications, namely the extraction of distant transi...
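The two phases can be illustrated as follows, with `difflib` similarity standing in for the paper's edit-distance search and raw corpus frequency standing in for its language-model ranking (both are simplifications).

```python
import difflib

def suggest(word, lexicon_freq, top_k=3):
    # Phase 1: lexicon words orthographically close to the erroneous word
    close = difflib.get_close_matches(word, lexicon_freq, n=10, cutoff=0.6)
    # Phase 2: rank the shortlist (here by corpus frequency) and limit it
    return sorted(close, key=lambda w: lexicon_freq[w], reverse=True)[:top_k]

lexicon_freq = {"language": 120, "luggage": 8, "lineage": 5, "langue": 15}
print(suggest("lnguage", lexicon_freq))  # 'language' ranks first
```

Separating candidate generation from ranking keeps the expensive similarity search small while letting the ranking model use richer (e.g. contextual) evidence.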

Adaptation de la distance de Levenshtein pour la correction orthographique contextuelle (Adapting the Levenshtein Distance for Contextual Spelling Correction)

A Stochastic Language Model for Automatic Generation of Arabic Sentences

Language modeling aims to capture general knowledge expressed in natural language. To this end, the automatic generation of sentences is an important operation in natural language processing: it can serve as the basis for various applications such as machine translation and continuous speech recognition. In this article, we present a stochastic model that measures the probability of generating an Arabic sentence from a set of words. The model rests on the observation that a sentence involves independent syntactic and semantic levels, which allows each level to be modeled with an appropriate model. The parameters of this model are estimated on a training corpus manually annotated with syntactic labels.
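A sketch of the two-level factorization: the sentence probability is the product of a tag-sequence (syntactic) model and word-given-tag (lexical) probabilities. The probabilities are toy values and English words stand in for Arabic ones.

```python
import math

def sentence_logprob(words, tags, tag_bigram, word_given_tag):
    """Factor P(sentence) into an independent syntactic level (the tag
    sequence) and a lexical level P(word | tag), HMM-style."""
    lp, prev = 0.0, "<s>"
    for w, t in zip(words, tags):
        lp += math.log(tag_bigram.get((prev, t), 1e-9))   # syntactic level
        lp += math.log(word_given_tag.get((t, w), 1e-9))  # lexical level
        prev = t
    return lp

tag_bigram = {("<s>", "NOUN"): 0.4, ("<s>", "VERB"): 0.1,
              ("NOUN", "VERB"): 0.5, ("VERB", "NOUN"): 0.4}
word_given_tag = {("NOUN", "rain"): 0.05, ("VERB", "falls"): 0.03}

good = sentence_logprob(["rain", "falls"], ["NOUN", "VERB"],
                        tag_bigram, word_given_tag)
bad = sentence_logprob(["falls", "rain"], ["VERB", "NOUN"],
                       tag_bigram, word_given_tag)
print(good > bad)  # the likelier word order scores higher
```

Generating a sentence then amounts to choosing, among the orderings of the given word set, the one this score prefers.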

Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique

Studies in Computational Intelligence

Lemmatization is a key preprocessing step and an important component of many natural language applications. For the Arabic language, lemmatization is a complex task due to the richness of Arabic morphology. In this paper, we present a new lemmatizer that combines a lexicon-based approach with a machine-learning-based approach to obtain the lemma. The lexicon-based step provides a context-free lemmatization, and the most appropriate lemma given the sentence context is selected using a Hidden Markov Model. Evaluations of the developed lemmatizer yield over 91% accuracy, outperforming state-of-the-art Arabic lemmatizers.
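A greedy sketch of the lexicon-plus-context idea, with a bigram score standing in for the paper's Hidden Markov Model and English stand-ins for Arabic words.

```python
def lemmatize(words, lemma_lexicon, lemma_bigram):
    """Lexicon lookup gives candidate lemmas; the in-context choice is
    made greedily here (the paper uses a full HMM instead)."""
    prev, out = "<s>", []
    for w in words:
        candidates = lemma_lexicon.get(w, [w])  # unknown word: keep as-is
        best = max(candidates, key=lambda l: lemma_bigram.get((prev, l), 0.0))
        out.append(best)
        prev = best
    return out

# 'saw' is ambiguous: lemma 'see' (verb) or 'saw' (noun); context decides.
lemma_lexicon = {"saw": ["see", "saw"], "wood": ["wood"]}
lemma_bigram = {("<s>", "see"): 0.3, ("<s>", "saw"): 0.1, ("see", "wood"): 0.2}
print(lemmatize(["saw", "wood"], lemma_lexicon, lemma_bigram))
# → ['see', 'wood']
```

The lexicon narrows the search to valid lemmas; the sequence model only has to disambiguate among them.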

Automatic Identification of Moroccan Colloquial Arabic

Communications in Computer and Information Science

Language identification is an NLP task that aims to predict the language of a given text. Many attempts have been made to address this topic for the Arabic dialects. In this paper, we present our approach to building a language identification system that distinguishes between Moroccan Colloquial Arabic and standard Arabic using two different methods. The first is rule-based and relies on stop-word frequency, while the second is statistical and uses several machine learning classifiers. The results show that the statistical approach outperforms the rule-based approach, and that the Support Vector Machine classifier is more accurate than the other statistical classifiers. Our goal in this paper is to pave the way toward building advanced Moroccan-dialect NLP tools such as a morphological analyzer and a machine translation system.
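The rule-based method can be sketched as follows; the stop-word lists here are Latin-script stand-ins (English vs. French) for the Arabic and Moroccan lists the paper uses.

```python
def identify(text, stopwords_by_lang):
    """Rule-based variant: score each language by the fraction of tokens
    that are stop words of that language; pick the highest score."""
    tokens = text.lower().split()
    scores = {lang: sum(t in sw for t in tokens) / max(len(tokens), 1)
              for lang, sw in stopwords_by_lang.items()}
    return max(scores, key=scores.get)

# toy stop-word lists standing in for the Arabic/Moroccan ones
stopwords = {"english": {"the", "is", "on"}, "french": {"le", "est", "sur"}}
print(identify("le chat est sur la table", stopwords))  # → french
```

The statistical variant in the paper replaces these hand-picked lists with features learned by classifiers such as an SVM, which is why it generalizes better.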

A Light Arabic POS Tagger Using a Hybrid Approach

Morpho-Syntactic Tagging System Based on the Patterns Words for Arabic Texts

Text tagging is an important tool for various applications in natural language processing, namely the morphological and syntactic analysis of texts, indexing and information retrieval, vocalization of Arabic texts, and probabilistic language modeling (the n-class model). However, tagging systems based on lexicons of limited size are consequently unable to handle unknown words. To overcome this problem, we develop in this paper a new system based on the patterns of unknown words and the hidden Markov model. The experiments are carried out on a set of labeled texts, a set of 3,800 patterns, and 52 morpho-syntactic tags, in order to estimate the parameters of the new HMM.
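A greedy sketch of pattern-based tagging of unknown words, using English suffixes as stand-ins for the 3,800 Arabic patterns and collapsing the HMM to a local emission × transition score.

```python
def tag(words, pattern_emit, tag_bigram, tagset):
    """Greedy stand-in for the HMM tagger: an unknown word backs off to a
    'pattern' (here simply its longest known suffix) so it still gets an
    emission class, then emission x transition picks the tag."""
    def pattern(w):
        return next((w[-k:] for k in range(3, 0, -1) if w[-k:] in pattern_emit),
                    "<unk>")
    prev, out = "<s>", []
    for w in words:
        p = pattern(w)
        best = max(tagset,
                   key=lambda t: pattern_emit.get(p, {}).get(t, 1e-9)
                                 * tag_bigram.get((prev, t), 1e-9))
        out.append(best)
        prev = best
    return out

pattern_emit = {"ing": {"VERB": 0.8, "NOUN": 0.2},
                "s": {"NOUN": 0.7, "VERB": 0.3}}
tag_bigram = {("<s>", "NOUN"): 0.5, ("<s>", "VERB"): 0.2,
              ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.3,
              ("VERB", "NOUN"): 0.5, ("VERB", "VERB"): 0.1}
print(tag(["dogs", "running"], pattern_emit, tag_bigram, ["NOUN", "VERB"]))
# → ['NOUN', 'VERB']
```

Because tagging depends on the pattern rather than the word itself, words never seen in training still receive a sensible morpho-syntactic tag.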
