Automated Tagging System And Tagset Design For Arabic Text

Abstract

This paper presents a diacritics rule-based part-of-speech (POS) tagger that automatically tags partially vocalized Arabic text. The aim is to remove ambiguity and to enable accurate, fast automated tagging. A tagset is being designed to support this system. Tagset design is at an early stage of research on automatic morphosyntactic annotation of Arabic, and preliminary results of the tagset design are reported in this paper. Arabic has a valuable and important feature, diacritics, which are marks placed above and below the letters of an Arabic word. Diacritics play a major role in adding linguistic attributes to Arabic words and in indicating the pronunciation and grammatical function of each word. They enrich the language syntactically while removing a great deal of morphological and semantic ambiguity.
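
To make "partially vocalized" concrete, the following minimal sketch separates the diacritics of a word from its base letters and reports how much of the word is marked. It is an illustration only, not the tagger described in the paper: the function names, the three labels, and the choice to treat the Unicode range U+064B to U+0652 as the set of diacritics are assumptions made for this example.

```python
# Assumption: the marks U+064B..U+0652 (tanwin forms, fatha, damma, kasra,
# shadda, sukun) are treated as "diacritics" for this illustration.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return (base, marks); marks[i] holds the diacritics on base letter i."""
    base = []
    marks = []
    for ch in word:
        if ch in ARABIC_DIACRITICS:
            if marks:  # ignore a stray mark that precedes any letter
                marks[-1] += ch
        else:
            base.append(ch)
            marks.append("")
    return "".join(base), marks


def vocalization_level(word):
    """Hypothetical helper: rough label for how many letters carry a mark."""
    _, marks = split_diacritics(word)
    marked = sum(1 for m in marks if m)
    if marked == 0:
        return "unvocalized"
    if marked == len(marks):
        return "every letter marked"
    return "partially vocalized"


# Three spellings of the same three-letter word with 3, 1 and 0 fathas.
full = "\u0643\u064E\u062A\u064E\u0628\u064E"
partial = "\u0643\u064E\u062A\u0628"
bare = "\u0643\u062A\u0628"
for w in (full, partial, bare):
    print(split_diacritics(w)[0], vocalization_level(w))
```

A rule-based tagger could inspect `marks` (for example, the marks on the final letter) when choosing a tag; which rules the paper actually applies is not stated in the abstract.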


Figure 3. Matching-Pattern Example

Pattern-Match-Algorithm
Step-1: Let N = the number of patterns (P) that have the same length as the word (W) ... Call each P
Step-2: For i = 1 to N, compute Sim(W, P(i)), where Sim (similarity) is the number of identical letters between W and P
Step-3: Return the P that has Max(Sim)
Step-4: Return the tag (T) of W
Step-5: Exit
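
The steps of the Pattern-Match-Algorithm in Figure 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the pattern table, its tag names, and the function names are hypothetical, and "identical letters" is interpreted here as letters that agree at the same position, which the figure does not spell out.

```python
# Hypothetical pattern table: morphological pattern -> illustrative tag.
# Neither the patterns chosen nor the tag names come from the paper's tagset.
PATTERNS = {
    "فاعل": "ACTIVE_PARTICIPLE",
    "فعيل": "ADJECTIVE",
    "مفعول": "PASSIVE_PARTICIPLE",
    "تفعيل": "VERBAL_NOUN",
}


def similarity(word, pattern):
    """Step-2: count positions where word and pattern share the same letter.

    Assumption: position-wise comparison of two equal-length strings.
    """
    return sum(1 for w, p in zip(word, pattern) if w == p)


def match_pattern(word, patterns=PATTERNS):
    """Return (best_pattern, tag) for word, or None if no pattern has its length."""
    # Step-1: keep only the patterns with the same length as the word.
    candidates = [p for p in patterns if len(p) == len(word)]
    if not candidates:
        return None
    # Step-3: pick the pattern with maximum similarity (first one wins on ties).
    best = max(candidates, key=lambda p: similarity(word, p))
    # Step-4: return the tag attached to that pattern.
    return best, patterns[best]


print(match_pattern("كاتب"))
print(match_pattern("مكتوب"))
```

With this table, the four-letter word كاتب shares one position with فاعل and none with فعيل, so it receives the tag attached to فاعل. The sketch expects undiacritized input; a vocalized word would first need its diacritics separated from the base letters.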