Arabic Word Class Tagging Based on the Analysis of Affix Structure
Related papers
A morphological-syntactical analysis approach for Arabic textual tagging
2008
Part-of-Speech (POS) tagging is the process of labeling or classifying each word in written text with its grammatical category or part of speech, i.e. noun, verb, preposition, adjective, etc. It is the most common disambiguation process in the field of Natural Language Processing (NLP). POS tagging systems are often preprocessors in many NLP applications. The Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the letters of the word. An Arabic text is partially vocalised when the diacritical mark is assigned to one or at most two letters in the word. Diacritics in Arabic texts are extremely important, especially at the end of the word. They help not only in determining the correct POS tag for each word in the sentence, but also in providing full information regarding the inflectional features, such as tense, number, gender, etc. for the sentence words. They add semantic information to words which helps with resolving ambigu...
A Rule-Based Approach for Tagging Non-Vocalized Arabic Words
In this work, we present a tagging system which classifies the words in a non-vocalized Arabic text into their tags. The proposed tagging system passes through three levels of analysis. The first level is a lexical analyzer that is composed of a lexicon containing all fixed words and particles such as prepositions and pronouns. The second level is a morphological analyzer which relies on word structure, using patterns and affixes to determine word class. The third level is a syntax analyzer, or grammatical tagger, which assigns grammatical tags to words based on their context or the position of the word in the sentence. The syntax analyzer level consists of two stages: the first stage depends on specific keywords that indicate the tag of the following word; the second stage is the reversed parsing technique, which scans the available grammars of the Arabic language to get the class of a single ambiguous word in the sentence. We have tested the proposed system on a corpus consisting of 2,355 words. Experimental results showed that the proposed system achieved a success rate approaching 94% of the total number of words in the sample used in the study.
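The layered design described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny lexicon, the affix rules, the keyword table, and the tag names are hypothetical and are not taken from the paper. The reversed parsing stage is not modelled.

```python
# Toy three-level tagger: lexicon lookup, then affix rules, then keyword context.
# All entries below are hypothetical examples, not the paper's actual resources.

# Level 1: closed-class words with fixed tags.
LEXICON = {"في": "PREP", "من": "PREP", "إلى": "PREP", "هو": "PRON", "لم": "PART"}

# Level 2: affix rules (definite article prefix, feminine suffix).
PREFIX_RULES = [("ال", "NOUN")]
SUFFIX_RULES = [("ة", "NOUN")]

# Level 3, stage 1: keywords that indicate the tag of the following word.
KEYWORD_NEXT_TAG = {"لم": "VERB", "في": "NOUN", "من": "NOUN", "إلى": "NOUN"}


def morphological_tag(word):
    """Return a tag suggested by an affix rule, or None if no rule fires."""
    for prefix, tag in PREFIX_RULES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return tag
    for suffix, tag in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return tag
    return None


def tag_sentence(words):
    """Tag each word by trying the three levels in order."""
    tags = []
    for i, word in enumerate(words):
        tag = LEXICON.get(word)                       # level 1
        if tag is None:
            tag = morphological_tag(word)             # level 2
        if tag is None and i > 0:
            tag = KEYWORD_NEXT_TAG.get(words[i - 1])  # level 3, stage 1
        tags.append(tag or "UNKNOWN")
    return tags


sentence = ["لم", "يذهب", "الولد", "إلى", "المدرسة"]
print(list(zip(sentence, tag_sentence(sentence))))
```

Words that no level can resolve are left as `UNKNOWN` in this sketch; in the system described above, such residual ambiguity is what the second syntactic stage is meant to address.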
Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers
Journal for Language Technology and Computational Linguistics, 2017
Focusing on Classical Arabic, this paper in its first part evaluates morphological analysers and POS taggers that are available freely for research purposes, are designed for Modern Standard Arabic (MSA) or Classical Arabic (CA), are able to analyse all forms of words, and have academic credibility. We list and compare supported features of each tool, and how they differ in the format of the output, segmentation, Part-of-Speech (POS) tags and morphological features. We demonstrate a sample output of each analyser against one CA fully-vowelized sentence. This evaluation serves as a guide in choosing the best tool that suits research needs. In the second part, we report the accuracy and coverage of tagging a set of classical Arabic vocabulary extracted from classical texts. The results show a drop in the accuracy and coverage and suggest an ensemble method might increase accuracy and coverage for classical Arabic. (Alosaimy, Atwell; JLCL 2017, Band 32 (1), 1-26)
Classical Arabic is the "liturgical" language that Muslims around the world use in religious practice. CA is also known as "Fussha" (the clearest), which Arabic grammarians build their rules upon. One variant of CA is Quranic Arabic, which is worded from CA, but differs in the sense that it is believed by Muslims to be the direct word of Allah. As time passed, different spoken variants of Classical Arabic emerged and people needed a standard form of communication: Modern Standard Arabic (MSA). MSA is recognised as the formal and standard written Arabic. MSA is the language currently employed in media and education (Bin-Muqbil, 2006). Even though the morphology of MSA is inherited from CA, two studies showed that CA is not compatible with MSA taggers and vice versa. S. Rabiee (2011) tried to adapt several taggers by training them on a classical Arabic corpus, the Quranic Arabic Corpus (QAC), and then tested them on MSA. The accuracy achieved in tagging a 66-word MSA sample was "not impressive": 73% was achieved. Alrabiah et al. (2014) compared MADA (Habash et al.) and AlKhalil (Boudchiche et al., 2016), both designed for MSA, in order to annotate the KSUCCA corpus. Using five samples from different genres of CA, an evaluation of these two systems showed a drop in their accuracy by 10-15%. This shows that current taggers need to be adapted for CA and their dictionaries need to include more classical vocabulary. We extend this evaluation to examine the coverage and accuracy of the surveyed tools. The next section reviews relevant work. The third and fourth sections list evaluated POS taggers and MAs in detail. The fifth section compares those tools by their features and demonstrates such differences on one tagged sentence. The last section reports the accuracy and coverage on a collection of classical vocabulary.
2. Related work
Several previous studies surveyed the linguistic resources available for researchers in the field of Arabic NLP. Atwell et al. (2004) conducted a survey on the available MAs and came up with 10 different analysers. The authors concluded their survey by pointing out that most of those analysers are not freely available or are hard to use. Maegaard (2004) surveyed the state-of-the-art language resources including MAs and POS taggers. The Basic Language Resource Kit (BLARK) project (2010) listed 7 MAs, three of which are commercial software. Sawalha (2011) listed 6 MAs with his proposal of a new fine-grained morphological analyser, three of which are freely available. Albared et al. (2009) surveyed the "POS tagging" techniques with a focus on Arabic: MSA and dialects. None was designed for classical Arabic. Those techniques were criticized as assuming a closed vocabulary, which might not be the case with classical Arabic. Al-Sughaiyer and Al-Kharashi (2004) conducted a survey of Arabic "morphological analysis" techniques and classified the efforts in analysing Arabic morphology into four categories: table lookup, linguistic (using finite state automaton or traditional grammar), combinatorial and pattern-based.
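The abstract of this paper suggests that an ensemble method might increase accuracy and coverage for classical Arabic. One simple way to combine taggers is majority voting, sketched below. This is not the authors' method: the function name, the use of `None` for "no analysis", the tie-breaking rule, and the assumption that all taggers have already been mapped onto one shared tag set are illustrative choices.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-word tags from several taggers.

    predictions: list of tag sequences, one per tagger, all the same length.
    A tag of None means that tagger produced no analysis for the word.
    Ties are broken in favour of the tagger listed first (an assumption).
    """
    if not predictions:
        raise ValueError("at least one tagger output is required")
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all tag sequences must have the same length")

    combined = []
    for tags in zip(*predictions):
        proposed = [tag for tag in tags if tag is not None]
        if not proposed:
            combined.append(None)  # no tagger covers this word
            continue
        counts = Counter(proposed)
        top = max(counts.values())
        # First proposal (in tagger order) that reaches the top count.
        combined.append(next(tag for tag in proposed if counts[tag] == top))
    return combined


tagger_a = ["NOUN", "VERB", None]
tagger_b = ["NOUN", "NOUN", None]
tagger_c = ["ADJ", "VERB", "PREP"]
print(majority_vote([tagger_a, tagger_b, tagger_c]))
```

The third word shows the coverage effect: a word that two taggers cannot analyse still receives a tag as long as one tagger covers it.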
Morphological Segmentation and Part-of-Speech Tagging for the Arabic Heritage
ACM Transactions on Asian and Low-Resource Language Information Processing, 2018
We annotate 60,000 words of Classical Arabic (CA) with topics in philosophy, religion, literature, and law with fine-grain segment-based morphological descriptions. We use these annotations for building a morphological segmenter and part-of-speech (POS) tagger for CA. With character-level classification and features from the word and its lexical context, the segmenter achieves a word accuracy of 96.8% with the main issue being a high rate of out-of-vocabulary words. A token-based POS tagger achieves an accuracy of 96.22% with 97.72% on known tokens despite the small size of the corpus. An error analysis shows that most of the tagging errors are results of segmentation and that quality improves with more data being added. The morphological segmenter and tagger have a wide range of potential applications in processing CA, a low-resource variety of the language.
Morpho-Syntactic Tagging System Based on the Patterns Words for Arabic Texts
2002
Text tagging is a very important tool for various applications in natural language processing, namely the morphological and syntactic analysis of texts, indexation and information retrieval, "vocalization" of Arabic texts, and probabilistic language models (n-class models). However, such systems, which rely on a limited set of lexemes, are consequently unable to handle unknown words. To overcome this problem, we develop in this paper a new system based on the patterns of unknown words and the hidden Markov model (HMM). The experiments use a set of labeled texts, a set of 3,800 patterns, and 52 morpho-syntactic tags to estimate the parameters of the new HMM.
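To make the combination of word patterns and a hidden Markov model concrete, here is a minimal Viterbi decoding sketch in which each word has already been reduced to a pattern label and the HMM emits patterns rather than word forms. The two tags, the three pattern labels, and every probability are invented for illustration; the paper's 52 tags, 3,800 patterns, and estimated parameters are not reproduced here.

```python
def viterbi(observations, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a sequence of pattern labels."""
    # best[t][tag] = (probability of the best path ending in tag, previous tag)
    best = [{
        tag: (start_p[tag] * emit_p[tag].get(observations[0], 0.0), None)
        for tag in tags
    }]
    for obs in observations[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(obs, 0.0), prev)
                for prev in tags
            )
        best.append(column)

    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy model: pattern labels are ASCII stand-ins for Arabic patterns.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.5, "VERB": 0.5}
TRANS_P = {
    "NOUN": {"NOUN": 0.6, "VERB": 0.4},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT_P = {
    "NOUN": {"faa3il": 0.5, "maf3uul": 0.4, "yaf3al": 0.1},
    "VERB": {"yaf3al": 0.8, "faa3il": 0.2},
}

patterns = ["yaf3al", "faa3il", "maf3uul"]
print(viterbi(patterns, TAGS, START_P, TRANS_P, EMIT_P))
```

Because emissions are defined over patterns, a word that never appeared in training can still be scored as long as its pattern is known, which is the motivation stated in the abstract.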
ARABIC PART-OF-SPEECH TAGGING USING THE SENTENCE STRUCTURE
2000
This paper presents a system for Arabic Part-of-Speech Tagging, which combines morphological analysis with Hidden Markov Model (HMM) and relies on the Arabic sentence structure. On the one hand, the morphological analysis is used to reduce the size of the tags lexicon by segmenting Arabic words in their prefixes, stems, and suffixes due to the fact that Arabic is
Arabic Natural Language Processing Workshop (WANLP 2022) at EMNLP 2022, 2022
This paper sheds light on in-progress work on building a morphological analyzer for Egyptian Arabic (EGY). To build such a tool, a tag-set schema is developed based on a corpus of 527,000 EGY words covering different sources and genres. This tag-set schema is used in morphologically annotating about 318,940 words according to their contexts. Each annotated word is associated with its suitable prefix(es), original stem, tag, suffix(es), glossary, number, gender, definiteness, and conventional lemma and stem. These morphologically annotated words, in turn, are used in developing the proposed morphological analyzer, where the morphological lexicons and the compatibility tables are extracted and tested. The system is compared with one of the best EGY morphological analyzers, CALIMA.
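The abstract mentions morphological lexicons and compatibility tables without giving their format. The sketch below assumes the layout commonly associated with Buckwalter-style analyzers: three lexicons (prefixes, stems, suffixes) whose entries carry categories, and three tables listing which category pairs may co-occur. All entries, category names, and transliterations are hypothetical and are not taken from the paper or from CALIMA.

```python
# Hypothetical lexicons: surface form -> morphological category.
PREFIXES = {"": "P0", "Al": "P_DET", "bi": "P_PROG"}
STEMS = {"bayt": "N", "yiktib": "IV"}
SUFFIXES = {"": "S0", "uh": "S_PRON"}

# Hypothetical compatibility tables: allowed category pairs.
PREFIX_STEM = {("P0", "N"), ("P_DET", "N"), ("P0", "IV"), ("P_PROG", "IV")}
STEM_SUFFIX = {("N", "S0"), ("N", "S_PRON"), ("IV", "S0"), ("IV", "S_PRON")}
PREFIX_SUFFIX = {
    ("P0", "S0"), ("P0", "S_PRON"),
    ("P_DET", "S0"),  # a definite noun takes no pronoun suffix here
    ("P_PROG", "S0"), ("P_PROG", "S_PRON"),
}


def analyze(word):
    """Return every (prefix, stem, suffix) split licensed by the tables."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
                continue
            p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
            if (p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX and (p, x) in PREFIX_SUFFIX:
                analyses.append((prefix, stem, suffix))
    return analyses


for word in ["Albayt", "biyiktib", "baytuh", "Albaytuh"]:
    print(word, analyze(word))
```

Keeping the constraints in tables rather than in code means that extracting new lexicon entries or category pairs from annotated data does not require changing the analysis routine.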
SALMA: Standard Arabic Language Morphological Analysis
2013
Morphological analyzers are preprocessors for text analysis. Many Text Analytics applications need them to perform their tasks. This paper reviews the SALMA-Tools (Standard Arabic Language Morphological Analysis) [1]. The SALMA-Tools is a collection of open-source standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. Tag-assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly probabilistic taggers which require training data, if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. The SALMA-Tagger is a fine-grained morphological analyzer which mainly depends on linguistic information extracted from traditional Arabic grammar books and prior-knowledge broad-coverage lexical resources: the SALMA-ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA-Tag Set is a standard tag set for encoding, which captures long-established traditional fine-grained morphological features of Arabic, in a notation format intended to be compact yet transparent.
ARABIC NOUN MORPHOLOGY: A COMPUTATIONAL STUDY
Annamalai University, 2021
This thesis investigates noun formation in Arabic from a computational point of view. It is about the computational morphological generation and analysis of Arabic nouns. The study first gives a descriptive analysis of Arabic noun morphology based on the stem-based approach, which satisfies the linguistic description and the computational formalization. Both derivational and inflectional systems are discussed in detail. Morphotactics, morphophonemics and the orthography of Arabic nouns are also addressed. The study then presents a computational implementation of Arabic nouns based on the rule-based approach to computational morphology. The overall system is implemented using the NooJ toolkit that supports both finite-state automata (FSA) and pushdown automata (PDA). The system of morphological generation and analysis consists of three components: a lexicon, morphotactics, and rules. The lexicon component lists lexical items (indivisible words and affixes), the morphotactics component encodes constraints on morpheme ordering, and the rules component maps the lexical representations to the surface representation and vice versa. Morphological, morphophonemic, orthographic, and other rules are encoded as two-level rules. The input of the system consists of a main editable lexicon of lemmas taken from three sources: the Buckwalter Arabic morphological analyzer lexicon, and the Arramooz and Alghani Azzahir machine-readable dictionaries. The output of the system is a fully annotated lexicon of inflected noun forms (compiled into a single type of finite-state transducers (FSTs)). This generated lexicon is subsequently used for morphological analysis. Finally, the study presents the evaluation of the system. Three common measures are used to evaluate the performance of the system: accuracy, precision, and recall. The evaluation task consists of conducting two empirical experiments.
The first experiment evaluates the system performing morphological analysis on diacritized Arabic words. The system’s accuracy, precision, and recall on diacritized Arabic words are 90.4%, 98.3%, and 88.9%, respectively. The second experiment evaluates the system on undiacritized words. The accuracy, precision, and recall obtained in this experiment are 94.7%, 96.7%, and 91.6%, respectively. The average of the measures for the two experiments has also been calculated: the average accuracy, precision, and recall are 92.55%, 97.5%, and 90.25%, respectively. In general, the results are promising and show the ability of the system to deal with different types of Arabic text, both diacritized and undiacritized. This system can provide a detailed analysis and morphological tags for nouns in Arabic text corpora. It divides the analyzed word into three parts (i.e., proclitics/prefixes, stem, and suffixes/enclitics) and gives each part a detailed morphological feature tag, or possibly multiple tags if the parts have multiple clitics or affixes.
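The reported averages can be checked directly from the two sets of figures quoted above. The short sketch below only redoes that arithmetic; the dictionary and function names are illustrative.

```python
# Figures as reported in the abstract (percentages).
diacritized = {"accuracy": 90.4, "precision": 98.3, "recall": 88.9}
undiacritized = {"accuracy": 94.7, "precision": 96.7, "recall": 91.6}


def average_scores(first, second):
    """Average each measure over the two experiments, rounded to 2 decimals."""
    return {name: round((first[name] + second[name]) / 2, 2) for name in first}


averages = average_scores(diacritized, undiacritized)
for name, value in averages.items():
    print(f"{name}: {value}%")
```

The computed values match the averages stated in the abstract (92.55%, 97.5%, and 90.25%).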
2013
The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of grammar and Arabic, in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in analysis of morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared; and we review generic design criteria for corpus tag sets. For a morphologically-rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and possible values. In our analysis, a tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash '-' represents a feature not relevant to a given word. The first character shows the main Parts of Speech, from: noun, verb, particle, punctuation, and Other (residual); these last two are an extension to the traditional three classes to handle modern texts. 'Noun' in Arabic subsumes what are traditionally referred to in English as 'noun' and 'adjective'. The characters 2, 3, and 4 are used to represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3), 21 subclasses of particle (letter 4). Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively. 
The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora. (Word Structure 6.1 (2013): 43-99)
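The positional layout described in this abstract (22 characters, one feature per position, with '-' marking a feature that is not relevant) lends itself to a simple decoder. The sketch below takes the feature order from the description above, but the position names are paraphrased, and the value letters in the example tag are placeholders rather than actual SALMA codes, which the abstract does not list.

```python
# Feature carried by each of the 22 tag positions, in order (names paraphrased).
SALMA_POSITIONS = [
    "main part of speech",                # 1
    "noun subclass",                      # 2
    "verb subclass",                      # 3
    "particle subclass",                  # 4
    "other (residual)",                   # 5
    "punctuation",                        # 6
    "gender",                             # 7
    "number",                             # 8
    "person",                             # 9
    "inflectional morphology",            # 10
    "case or mood",                       # 11
    "case and mood marks",                # 12
    "definiteness",                       # 13
    "voice",                              # 14
    "emphasized and non-emphasized",      # 15
    "transitivity",                       # 16
    "rational",                           # 17
    "declension and conjugation",         # 18
    "unaugmented and augmented",          # 19
    "number of root letters",             # 20
    "verb root",                          # 21
    "noun type by final letters",         # 22
]


def decode_tag(tag):
    """Map a 22-character tag to {feature name: value letter}, skipping '-'."""
    if len(tag) != len(SALMA_POSITIONS):
        raise ValueError(f"expected {len(SALMA_POSITIONS)} characters, got {len(tag)}")
    return {
        name: value
        for name, value in zip(SALMA_POSITIONS, tag)
        if value != "-"
    }


# Placeholder tag: the letters are hypothetical, not real SALMA value codes.
example_tag = "nq----ms" + "-" * 14
print(decode_tag(example_tag))
```

A fixed-width positional tag of this kind can be compared feature by feature, which is what makes mapping other tag sets onto it straightforward.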