Arabic Word Stemming Algorithms and Retrieval Effectiveness

Abstract

Document retrieval in Information Retrieval Systems (IRS) is concerned with retrieving documents relevant to a user's information need. The better a system understands the contents of documents, the more effective its retrieval outcomes will be, but understanding content is a very complex task. Conventional IRS apply algorithms that can only approximate the meaning of document contents through a keyword approach based on the vector space model. Keywords may be stemmed or unstemmed. Stemming and conflating keywords during retrieval is a step towards applying semantic technology in IRS. Word stemming is part of morphological analysis in natural language processing, which precedes syntactic and semantic analysis. We have developed algorithms for Arabic stemming and incorporated them into our experimental system in order to measure retrieval effectiveness. The results show that retrieval effectiveness increases when stemming is used.

Key takeaways

Stemming algorithms significantly enhance document retrieval effectiveness in Arabic Information Retrieval Systems (IRS).
The study utilizes a dataset of 6,236 documents from the Quran for experimental validation.
Arabic word formation involves complex morphological structures, including prefixes, suffixes, and infixes.
Our proposed stemming algorithm outperforms Al-Omari's algorithm in retrieval effectiveness.
The research confirms that stemming is crucial for improving semantic understanding in IRS.

Figures (10)

In written Arabic, the definite article and some prepositions are attached to the word. Thus the definite article ال and some prepositions are treated as a group of prefixes, alongside the subject markers ا, ي, and ت (alif, ya, and ta). Arabic allows up to three consecutive prefixes from this group to precede a word. For example, the word وبالأطفال (and with the children) contains three prefixes: و (and), ب (with), and ال (the). Table 2 shows the prefixes that belong to this group.

Affixes in Arabic are prefixes, suffixes (or postfixes), and infixes (morphemes). Prefixes are attached at the beginning of a word, suffixes are attached at the end, and infixes are found in the middle. For example, the Arabic word الطالبات (al-talibat), which means the female students, consists of the elements shown in Table 1: the root, the prefix, the suffix, and the infix.
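The rule that up to three consecutive prefixes may precede a word can be sketched in Python as follows. This is only an illustration under stated assumptions: the prefix inventory `PREFIXES`, the function name `strip_prefixes`, and the minimum remaining length of three letters are hypothetical choices, not the contents of the paper's Table 2 or the authors' algorithm.

```python
# Hypothetical prefix inventory for illustration only; the paper's
# Table 2 is not reproduced in this text.
PREFIXES = ("ال", "و", "ف", "ب", "ك", "ل")


def strip_prefixes(word, max_prefixes=3, min_stem_len=3):
    """Strip up to `max_prefixes` consecutive prefixes from `word`.

    A prefix is removed only if at least `min_stem_len` letters remain,
    an assumed guard against stripping letters that belong to the root.
    Returns the remaining word and the list of removed prefixes.
    """
    removed = []
    while len(removed) < max_prefixes:
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
                removed.append(prefix)
                word = word[len(prefix):]
                break
        else:
            # No prefix matched on this pass, so stop early.
            break
    return word, removed


# "and with the children": three prefixes (and, with, the) are removed.
remainder, removed = strip_prefixes("وبالأطفال")
print(remainder, removed)
```

The length guard is a deliberately crude heuristic: it keeps a short word such as ولد (boy), whose first letter belongs to the root, from being truncated, but it would not prevent every over-stripping error on longer words.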

Table 4: Examples of Word Letters that match the Arabic Affixes

The stemming algorithm takes an Arabic word (not a stop word) as input and outputs the extracted root (or stem). When the algorithm cannot find a root for a word, the word itself is taken as the root. Such cases are few, and their number depends on the quality of the proposed algorithm.
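The input/output contract described above, including the fallback to the word itself, might look like the following sketch. It is not the authors' algorithm, whose rules are not given in this text: the affix lists, the one-infix-letter heuristic, the `roots` lexicon, and the names `candidates` and `stem` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical affix inventories, for illustration only.
PREFIXES = ("", "ال", "و", "ب")
SUFFIXES = ("", "ات", "ون", "ة")
INFIXES = ("ا", "و", "ي")


def candidates(word):
    """Yield candidate roots by removing one prefix, one suffix, and
    optionally one interior infix letter from `word`."""
    for prefix, suffix in product(PREFIXES, SUFFIXES):
        core_len = len(word) - len(prefix) - len(suffix)
        if word.startswith(prefix) and word.endswith(suffix) and core_len >= 3:
            core = word[len(prefix):len(word) - len(suffix)]
            yield core
            # Try dropping a single infix letter from an interior position.
            for i in range(1, len(core) - 1):
                if core[i] in INFIXES:
                    yield core[:i] + core[i + 1:]


def stem(word, roots):
    """Return the first candidate found in the `roots` lexicon, or the
    word itself when no root is found (the fallback described above)."""
    for candidate in candidates(word):
        if candidate in roots:
            return candidate
    return word


roots = {"طلب"}                  # hypothetical one-entry root lexicon
print(stem("الطالبات", roots))    # root found in the lexicon
print(stem("كتاب", roots))        # no root found: the word is returned
```

Checking candidates against a lexicon is one simple way to decide whether a root has been "found"; a rule-based stemmer could equally use pattern matching, and the source does not say which approach the authors take.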

Table 8: The Quran’s chapters with their corresponding total number of verses

Table 10: Results of the Experiments for 10 Quran Chapters

Table 11: Stemming Errors on Ten Chapters of the Quran

Table 12: Distribution of Errors in Quran Data Set

Fig. 2: Average Recall-Precision Graph for Conflation and Non-conflation Methods on Arabic Texts

References (21)

Lovins, J. B. Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics. 1968. 11(1-2): 22-31.
Harman, D. How Effective is Suffixing? Journal of the American Society for Information Science. 1991. 42(1): 7-15.
van Rijsbergen, C. J. Information Retrieval. London: Butterworths. 1979.
Sembok, T. M. T., Yusoff, M. & Ahmad, F. A Malay Stemming Algorithm for Information Retrieval. Proceedings of the 4th International Conference and Exhibition on Multi-Lingual Computing. 1994.
Lennon, M. An Evaluation of Some Conflation Algorithms for Information Retrieval. Journal of Information Science. 1981. 3: 177-183.
Ahmad, F., Yusoff, M. & Sembok, T. M. T. Experiments with a Malay Stemming Algorithm. Journal of the American Society for Information Science. 1996.
Porter, M. F. An Algorithm for Suffix Stripping. Program. 1980. 14(3): 130-137.
Al-Kharashi, I. A. & Evens, M. W. Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System. Journal of the American Society for Information Science. 1994. 45(8): 548-560.
Khoja, S. Stemming Arabic Text. http://zeus.cs.pacificu.edu/shereen/research.htm
Larkey, L., Ballesteros, L. & Connell, M. Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis. SIGIR 2002. pp. 275-282.
Darwish, K. Building a Shallow Morphological Analyzer in One Day. ACL Workshop on Computational Approaches to Semitic Languages.
Chen, A. & Gey, F. Building an Arabic Stemmer for Information Retrieval. TREC-2002.
El-Sadany, T. A. & Hashish, M. A. An Arabic Morphological System. IBM Systems Journal. 1989. 28(4): 600-612.
Hilal, Y. Automatic Processing of Arabic Language and Applications. Proceedings of the Arabic Language Processing Using Computer Conference. 1990. pp. 213-219.
Shahein, H. I. & Youssef, S. A. A Model for Morphology as a Production System. Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English. 1990.
Alserhan, H. M. & Ayesh, A. S. An Application of Neural Network for Extracting Arabic Word Roots. Proceedings of the 10th WSEAS International Conference on COMPUTERS. 2006.
Mesleh, A. Support Vector Machines Based Arabic Language Text Classification System: Feature Selection Comparative Study. Proceedings of the 12th WSEAS International Conference on Applied Mathematics. 2007.
Shquier, M. M. A. & Sembok, T. M. T. Word Agreement and Ordering in English-Arabic Machine Translation. Proceedings of ITSim 2008: International Symposium on IT. 2008. Volume 1, 26-28.
Awajan, A. A Rule-Based Morphological Analyzer of Arabic Word. WSEAS Transactions on Computers. 2003. 2(2).
Ghwanmeh, S. Effect of Excessive Letter Location in Arabic Lexical Items: A Stemmer Algorithm Approach. WSEAS Transaction. 2011.
Al-Omari, H. ALMAS: An Arabic Language Morphological Analyzer System. National University of Malaysia, Bangi, Selangor. 1994.

Proceedings of the World Congress on Engineering 2013 Vol III, WCE 2013, July 3-5, 2013, London, U.K. ISBN: 978-988-19252-9-9. ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online).