A Proposed Adaptive Scheme for Arabic Part-of Speech Tagging (original) (raw)
Related papers
The study described in this paper belongs to the area of computational linguistics. Computational linguistics is a field of artificial intelligence dealing with the logical modeling of natural language from a computational perspective. It unites two areas that are quite different in appearance, computer science and natural languages. Computational linguistics might be considered as a synonym of automatic processing of natural language, since the main task of computational linguistics is just the construction of computer programs to process words and texts in natural language. There are many areas that may be considered as properly included within the discipline of computational linguistics. One of these areas is part-of-speech tagging (POS-tagging). POS-tagging is considered as a process for automatically assigning the proper grammatical tag to each word of a written text according to its appearance on the text. Thus, the task of POS-tagging is attaching appropriate grammatical or morpho-syntactical category labels to each word, token, symbol, abbreviation and even punctuation mark in a corpus. POS-tagging is usually the first step in linguistic analysis. Also, it is very important intermediate step to build many natural language processing applications. It could be used in spell checking and correcting systems, speech recognition systems, information retrieval systems and text-to-speech synthesis systems.
Automated Tagging System And Tagset Design For Arabic Text
This paper presents diacritics rule-based part-of-speech (POS) tagger which automatically tags a partially vocalized Arabic text. The aim is to remove ambiguity and to enable accurate fast automated tagging system. A tagset is being designed in support of this system. Tagset design is at an early stage of research related to automatic morphosyntactic annotation in Arabic language. Preliminary results of the tagset design have been reported in this paper. Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the letters of Arabic word. This feature plays a great role in adding linguistic attributes to Arabic words and in indicating pronunciation and grammatical function of the words. This feature enriches the language syntactically while removing a great deal of morphological and semantically ambiguities.
A Grammatically and Structurally Based Part of Speech (Pos) Tagger for Arabic Language
Zenodo (CERN European Organization for Nuclear Research), 2022
In this paper we report on an experimental syntactically and morphologically driven rule-based Arabic tagger. The tagger is developed using Arabic language grammatical rules and regulations. The tagger requires no pre-tagged text and is developed using a primitive set of lexicon items along with extensive grammatical and structural rules. It is tested and compared to Stanford tagger both in terms of accuracy and performance (speed). Obtained results are quite comparable to Stanford tagger performance with marginal difference favoring the developed tagger in accuracy with huge difference in terms of speed of execution. The newly developed tagger named MTE Tagger has been tested and evaluated. For the evaluation of its accuracy of tagging, a set of Arabic text was manually prepared and annotated. Compared to Stanford tagger, the MTE tagger performance was quite comparable. The developed tagger makes use of no pre-annotated datasets, except of some simple lexicon consisting of list of words representing closed word types like demonstrative nouns or pronouns or some particles. For the purpose of evaluation of the new tagger, it was run on multiple datasets and results were compared to those of Stanford tagger. In particular, both taggers (the MTE and the Stanford) were run on a set of 1226 sentences with close to 20,000 tokens that was human annotated and verified to serve as testbed. The results were very encouraging where in both test runs, the MTE tagger outperformed the Stanford tagger in terms of accuracy of 87.88% versus 86.67% for the Stanford tagger. In terms of speed of tagging and in comparison Stanford tagger, MTE Taggers' performance was on average 1:50. More improved accuracy is possible in future work as the set of rules are further optimized, integrated and more of Arabic language properties such as end of word discretization are used.
Full automatic Arabic text tagging system
the proceedings of the International Conference on Information Technology and Natural Sciences, Amman/Jordan, 2003
Part-of-Speech tagging is the process of assigning grammatical part-of-speech tags to words based on their context. Many automated tagging systems have been developed for English and many other western languages, and for some Asian languages, and have achieved accuracy rates ranging from 95% to 98%. A tagged corpus has more useful information than untagged corpus; so, tagged corpus can be used to extract grammatical and linguistic information from the corpus. Then, it can be used for many ...
2013
The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of grammar and Arabic, in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in analysis of morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared; and we review generic design criteria for corpus tag sets. For a morphologically-rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and possible values. In our analysis, a tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash '-' represents a feature not relevant to a given word. The first character shows the main Parts of Speech, from: noun, verb, particle, punctuation, and Other (residual); these last two are an extension to the traditional three classes to handle modern texts. 'Noun' in Arabic subsumes what are traditionally referred to in English as 'noun' and 'adjective'. The characters 2, 3, and 4 are used to represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3), 21 subclasses of particle (letter 4). Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively. The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10) case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would Word Structure 6.1 (2013): 43-99 count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.
Developing a tagset for automated POS tagging in Arabic
2006
Arabic language has much more syntactical and morphological information. Diacritics, which are marks placed over and below the letters of Arabic word, play a great role in adding linguistic attributes to Arabic word in part-of-speech tagging system. This paper describes a tagset that were built based on the inflectional morphology system which derived from traditional Arabic grammatical theory. The tagset developed represent an early stage of research related to automatic morphosyntactic annotation in Arabic language. This paper aims to present a general tagset for use in diacritics-based automated tagging system that is underdevelopment by the author.
ARABIC PART-OF-SPEECH TAGGING USING THE SENTENCE STRUCTURE
2000
This paper presents a system for Arabic Part.Of.Spe ech Tagging, which combines morphological analysis with Hidden Markov Model (HMM) and relies on the Arabic sentence structure. On the one hand, the morphological analysis is used to reduce the size of the tags lexicon by segmenting Arabic words in their prefixes, stems, and suffixes due to the fact that Arabic is
A Review of Part of Speech Tagger for Arabic Language
The aim of this paper is to review the implementation of Part of Speech (POS) Tagger for Arabic Language which will help in building accurate corpus for Arabic Language. Many researchers have been design and implement POS using different machine learning methods like Rule Based, Neural Network, Decision Tree, Transformation-Based, and Hidden Markov Model. Arabic is the mother tongue of more than 400 million people. It is one of the most important natural languages in the world. Therefore, an arranging Arabic content records that contain suppositions, interpersonal organization like online journals, Facebook, tweeter, Holy Quran, Hadith exchange groups is interested and needed a significance estimation investigation. Albeit Arabic one of the richest dialect and turn into the main dialect for more than 24 country. This paper proven that the created tagger is accurately labeled the words in the preparing dataset between 84% and 99%, which is enhancing the commented on Arabic corpus and its applications.
Towards a standard Part of Speech tagset for the Arabic language
2017
Part of Speech (PoS) tagging is still not very well investigated with respect to the Arabic language. Determining the PoS tags of a word in a particular context is difficult, primarily because there is no use of diacritics in most of contemporary texts. Consequently, the same word may be spelled in different ways. Further, detecting the difference between Arabic derivatives represents a very challenging issue for the majority of PoS taggers. Hence, the task of tagging the correct PoS tags requires advanced processing and the use of considerable resources. This study aims to design detailed hierarchical levels of the Arabic tagset categories and their relationships. These hierarchical levels allow easier expansion when required and produce more accurate and precise results. They are based on a comparative study and important references in Arabic grammar; they are also validated by experts in this field. In addition, the proposed tagset is implemented in a PoS tagger and tested via various experiments. We believe that our study makes a significant contribution to the literature because this work is an advancement in the direction of achieving a standard, rich, and comprehensive tagset for Arabic.
A morphological-syntactical analysis approach for Arabic textual tagging
2008
Part-of-Speech (POS) tagging is the process of labeling or classifying each word in written text with its grammatical category or part-of-speech, i.e. noun, verb, preposition, adjective, etc. It is the most common disambiguation process in the field of Natural Language Processing (NLP). POS tagging systems are often preprocessors in many NLP applications. The Arabic language has a valuable and an important feature, called diacritics, which are marks placed over and below the letters of the word. An Arabic text is partiallyvocalised 1 when the diacritical mark is assigned to one or maximum two letters in the word. Diacritics in Arabic texts are extremely important especially at the end of the word. They help determining not only the correct POS tag for each word in the sentence, but also in providing full information regarding the inflectional features, such as tense, number, gender, etc. for the sentence words. They add semantic information to words which helps with resolving ambigu...