IMPROVING RULE-BASED METHOD FOR ARABIC POS TAGGING USING HMM TECHNIQUE

Hybrid Part-Of-Speech Tagger for Non-Vocalized Arabic Text

Part-of-speech (POS) tagging plays a crucial role in many areas of natural language processing (NLP), including speech recognition, natural language parsing, information retrieval, and multi-word term extraction. This paper proposes an efficient and accurate POS tagging technique for Arabic using a hybrid approach. Because of ambiguity, the Arabic rule-based method suffers from misclassified and unanalyzed words. To overcome these two problems, we propose a Hidden Markov Model (HMM) integrated with the Arabic rule-based method. Our POS tagger generates a set of three POS tags: Noun, Verb, and Particle. The proposed technique uses the contextual information of words together with a variety of features that help predict the POS classes. To evaluate its accuracy, the proposed method was trained and tested on two corpora: the Holy Quran Corpus and the Kalimat Corpus for undiacritized Classical Arabic. The experiment results demonst...

A hidden Markov model-based POS tagger for Arabic

… of the 8th International Conference on the …, 2006

This paper presents a part-of-speech (POS) tagger for Arabic. The tagger resolves POS ambiguity in Arabic text by using a statistical language model, built from an Arabic corpus, as a Hidden Markov Model (HMM). The paper describes the characteristics of the Arabic language and the POS tag set that was selected. It then introduces the methodology followed to develop the HMM for Arabic. The proposed HMM POS tagger has been tested and has achieved a state-of-the-art performance of 97%.
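To make the HMM idea concrete, the sketch below decodes the most likely tag sequence with the Viterbi algorithm. It is only an illustration, not the tagger described in the abstract: the three-tag set, the Buckwalter-style transliterated words, and every probability are invented toy values, and `viterbi` is a hypothetical helper name.

```python
import math

TAGS = ["Noun", "Verb", "Particle"]

# Toy parameters, invented for illustration only (not taken from any paper).
START = {"Noun": 0.5, "Verb": 0.3, "Particle": 0.2}
TRANS = {
    "Noun": {"Noun": 0.4, "Verb": 0.3, "Particle": 0.3},
    "Verb": {"Noun": 0.6, "Verb": 0.1, "Particle": 0.3},
    "Particle": {"Noun": 0.7, "Verb": 0.2, "Particle": 0.1},
}
# "ktb" is ambiguous without vowels (noun or verb); context must decide.
EMIT = {
    "Noun": {"ktb": 0.2, "Alwld": 0.4, "Albyt": 0.4},
    "Verb": {"ktb": 0.9},
    "Particle": {"fy": 1.0},
}
FLOOR = 1e-6  # small probability for word/tag pairs never seen together


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []

    def emit(tag, word):
        return math.log(EMIT[tag].get(word, FLOOR))

    # Log-probability of the best path ending in each tag at the first word.
    scores = {t: math.log(START[t]) + emit(t, words[0]) for t in TAGS}
    backpointers = []
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for t in TAGS:
            # Best previous tag for reaching tag t at this position.
            prev = max(TAGS, key=lambda p: scores[p] + math.log(TRANS[p][t]))
            new_scores[t] = scores[prev] + math.log(TRANS[prev][t]) + emit(t, word)
            pointers[t] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag.
    path = [max(TAGS, key=lambda t: scores[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


print(viterbi(["ktb", "Alwld", "fy", "Albyt"]))
```

With these toy numbers the ambiguous first word is resolved as a verb because of the words that follow it, which is the kind of contextual disambiguation an HMM tagger relies on.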

Hidden Markov Model Tagger for Applications Based Arabic Text: A review

International Journal of Computation and Applied Sciences IJOCAAS, 2019

The immense increase in the use of Arabic for transmitting information on the internet has made the language a focus of researchers and commercial developers. Developing an efficient Arabic POS tagger is not an easy task because of the complexity of the language itself and the challenges of tag disambiguation and unknown words. This paper explores and reviews the use of Hidden Markov Model based part-of-speech taggers for Arabic text. It also discusses the implementation of POS taggers for other languages. The study examined a group of research papers that applied part-of-speech tagging to Arabic using the Hidden Markov Model. The results show that a large number of researchers achieved high accuracy rates in classifying parts of speech. Handi and Alshamsi achieved high accuracy rates of 97.6% and 97.4% respectively. Kadim obtained an average accuracy of 75.38% for a Parallel Hidden Markov Model.

A Review of Part of Speech Tagger for Arabic Language

The aim of this paper is to review the implementation of part-of-speech (POS) taggers for the Arabic language, which will help in building an accurate corpus for Arabic. Many researchers have designed and implemented POS taggers using different approaches such as rule-based, neural network, decision tree, transformation-based, and Hidden Markov Model methods. Arabic is the mother tongue of more than 400 million people and is one of the most important natural languages in the world. Arabic text collections that contain opinions, drawn from social networks such as blogs, Facebook, and Twitter as well as from the Holy Quran and Hadith discussion groups, are therefore of interest and need further analysis. Arabic is also one of the richest languages and is the main language of more than 24 countries. The paper reports that the developed taggers labeled the words in the training dataset with accuracies between 84% and 99%, which improves the annotated Arabic corpus and its applications.

Probabilistic Arabic part of speech tagger with unknown words handling

2016

Part-of-speech (POS) tagging is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With little training data, guessing the POS of unknown words is the main problem. This problem becomes more serious in languages such as Arabic that have a very large vocabulary and a rich, complex morphology. To handle this problem in the Arabic POS tagger, we studied the effect of integrating a lexicon-based morphological analyzer to improve the tagger's performance. Moreover, several lexical models have been empirically defined, implemented, and evaluated. These models are based essentially on the internal structure and the formation process of Arabic words. Furthermore, several combinations of these models have been presented. The POS tagger has been trained with a training corpus of 29300 words and it uses a tagset ...
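One common way to guess tags for unknown words is to back off to word endings seen in training. The sketch below shows that generic idea only; it is not one of the lexical models evaluated in the paper. The function names, the tiny training list (Buckwalter-style transliteration), and the default tag are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_len=3):
    """Count how often each word-final substring (up to max_len chars) carries each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]][tag] += 1
    return counts


def guess_tag(word, counts, default="Noun", max_len=3):
    """Guess a tag from the longest suffix of `word` that was seen in training."""
    for n in range(min(max_len, len(word)), 0, -1):
        suffix = word[-n:]
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return default  # no suffix evidence at all


# Hypothetical toy training data, invented for this example.
training = [
    ("mElmwn", "Noun"), ("mhndswn", "Noun"), ("mdrsp", "Noun"), ("jAmEp", "Noun"),
    ("ktbt", "Verb"), ("drst", "Verb"), ("yktbwn", "Verb"),
]
counts = train_suffix_guesser(training)
print(guess_tag("mktbp", counts))  # unseen word ending in "p"
print(guess_tag("lEbt", counts))   # unseen word ending in "bt"
```

In a full tagger the guesser would supply emission estimates for unknown words rather than a single hard tag, so that the HMM can still use context to choose among candidates.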

Statistical Part-of-Speech Tagger for Traditional Arabic Texts

Journal of Computer Science, 2009

Problem statement: This study presented the development of an Arabic part-of-speech tagger that can be used for analyzing and annotating traditional Arabic texts, especially the Quran text. Approach: It is part of a project on the computerization of the Holy Quran. One of the main objectives of this project was to build a textual corpus of the Holy Quran. Results: Since an appropriate textual version of the Holy Quran was prepared and morphologically analyzed in other stages of this project, this work focused on its annotation by developing and using an appropriate tagger. The developed tagger employs an approach that combines morphological analysis with Hidden Markov Models (HMMs) based on the Arabic sentence structure. The morphological analysis is used to reduce the size of the tag lexicon by segmenting Arabic words into their prefixes, stems, and suffixes; this is possible because Arabic is a derivational language. The HMM, on the other hand, is used to represent the Arabic sentence structure in order to take the linguistic combinations into account. For these purposes, an appropriate tagging system has been proposed that represents the main Arabic parts of speech in a hierarchical manner, allowing easy expansion whenever needed. Each tag in this system represents a possible state of the HMM, and the transitions between tags (states) are governed by the syntax of the sentence. A corpus of traditional texts, extracted from books of the third century (Hijri), was manually morphologically analyzed and tagged using the developed tagset. Conclusion/Recommendations: This corpus was then used for training and testing the model. Experiments conducted on this dataset gave a recognition rate of about 96%, which is very promising given the amount of data tagged so far and used in training. Since the Holy Quran corpus is still under revision, no significant experiments were made on it. However, preliminary tests conducted on the seven verses of Al-Fatiha showed an encouraging accuracy rate.
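The segmentation step can be pictured with a small affix-stripping sketch. This is a deliberately naive illustration, not the morphological analyzer used in the study: the affix lists are a tiny hand-picked sample in Buckwalter-style transliteration, the minimum stem length is an assumed heuristic, and `segment` is a hypothetical name.

```python
# Prefix slots in the order they attach: conjunction, preposition, definite article.
# At most one prefix is taken from each slot (an assumed simplification).
PREFIX_SLOTS = [("w", "f"), ("b", "l"), ("Al",)]
SUFFIXES = ("wn", "yn", "At", "hA", "hm", "h")  # tried in this order, longer ones first
MIN_STEM = 3  # assumed heuristic: never leave a stem shorter than three letters


def segment(word):
    """Split a transliterated word into (prefixes, stem, suffixes)."""
    prefixes, suffixes = [], []
    stem = word
    for slot in PREFIX_SLOTS:
        for prefix in slot:
            if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
                prefixes.append(prefix)
                stem = stem[len(prefix):]
                break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
            suffixes.append(suffix)
            stem = stem[:-len(suffix)]
            break
    return prefixes, stem, suffixes


for word in ["wAlmElmwn", "wbAlktAb", "ktAbhA", "wld"]:
    print(word, segment(word))
```

Because many surface forms collapse onto the same stem, the tag lexicon only needs entries for stems and a short list of affixes, which is the size reduction the abstract refers to. A real analyzer must also handle spelling changes at morpheme boundaries and stems that merely look like they start with a prefix.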

Survey For Arabic Part of Speech Tagging based on Machine Learning

Iraqi Journal of Science

Arabic is the native tongue of more than 400 million people around the world, and it carries important religious and international weight. The Arabic language has taken its share of the huge technological explosion that has swept the world, and it therefore needs to be addressed with natural language processing applications and tasks. This paper surveys the most recent research related to Arabic Part of Speech (APoS) tagging, pointing to the tagging methods used for Arabic, with the aim of supporting the construction of corpora for the language. Many AI researchers have performed POS tagging using various machine-learning methods, such as Hidden-Markov-Model (HMM), Brill, Maximum-Match (MM), decision tree, bee colony, Neural-Network (NN), and other hybrid methods. This survey groups a number of published papers based on the Arabic Language Applications (ALP) towards tagging related problems utilized and appro...

ARABIC PART-OF-SPEECH TAGGING USING THE SENTENCE STRUCTURE

2000

This paper presents a system for Arabic part-of-speech tagging that combines morphological analysis with a Hidden Markov Model (HMM) and relies on the Arabic sentence structure. On the one hand, the morphological analysis is used to reduce the size of the tag lexicon by segmenting Arabic words into their prefixes, stems, and suffixes due to the fact that Arabic is

Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic

British Journal of Applied Science & Technology, 2017

As a morphologically rich language, Arabic poses special challenges to part-of-speech (POS) tagging. Words in Arabic texts often contain several segments, each with its own POS category. The choice of the segmentation level or input unit, word-based or morpheme-based, is a major issue in designing any Arabic natural language processing system. In word-based approaches, words are used as the atomic units of the language, and composite POS tags are assigned to words. Large amounts of training data are therefore required to ensure statistical significance, and these approaches suffer from data sparseness and unknown words. In morpheme-based approaches, the morpheme components of words are used as the atomic units. This, however, results in a high ambiguity rate and a small context for resolving that ambiguity, because the span of the n-gram might be limited to a single word. This paper compares and contrasts the morpheme-based and word-based statistical POS tagging strategies. It evaluates the tagging performance of three statistical models, namely the Arabic HMM POS tagger with the prefix guessing models, the Arabic HMM POS tagger with the linear interpolation guessing models, and the TnT tagger, given training data from both morpheme-based and word-based tokenization levels. It also studies the influence of each choice on the

A Grammatically and Structurally Based Part of Speech (Pos) Tagger for Arabic Language

Zenodo (CERN European Organization for Nuclear Research), 2022

In this paper we report on an experimental, syntactically and morphologically driven rule-based Arabic tagger. The tagger is developed using the grammatical rules of the Arabic language. It requires no pre-tagged text and is built from a primitive set of lexicon items along with extensive grammatical and structural rules. It is tested and compared to the Stanford tagger in terms of both accuracy and performance (speed). The results are quite comparable to those of the Stanford tagger, with a marginal difference in accuracy favoring the developed tagger and a huge difference in speed of execution. The newly developed tagger, named the MTE Tagger, has been tested and evaluated. To evaluate its tagging accuracy, a set of Arabic text was manually prepared and annotated. The developed tagger uses no pre-annotated datasets, apart from a simple lexicon consisting of lists of words representing closed word types such as demonstrative nouns, pronouns, or some particles. For the evaluation, the tagger was run on multiple datasets and the results were compared to those of the Stanford tagger. In particular, both taggers (the MTE and the Stanford) were run on a set of 1226 sentences with close to 20,000 tokens that was human annotated and verified to serve as a testbed. The results were very encouraging: in both test runs, the MTE tagger outperformed the Stanford tagger, with an accuracy of 87.88% versus 86.67% for the Stanford tagger. In terms of tagging speed, the MTE tagger's performance in comparison to the Stanford tagger was on average 1:50. Further accuracy improvements are possible in future work as the set of rules is further optimized and integrated, and as more properties of the Arabic language, such as end of word discretization, are used.