A New Distributional Semantic Model for Classical Arabic

An Empirical Study On The Holy Quran Based On A Large Classical Arabic Corpus

2014

Distributional semantics is an empirical approach to natural language processing and acquisition that models word meaning using word distribution statistics gathered from huge corpora. Many distributional semantic models are available in the literature, but none of them has so far been applied to the Quran, nor to Classical Arabic in general. This paper reports the construction of a very large corpus of Classical Arabic that will be used as a basis for studying the distributional lexical semantics of the Quran and Classical Arabic. It also reports the results of two empirical studies: the first applies a number of probabilistic distributional semantic models to automatically identify lexical collocations in the Quran, and the second applies those same models to the Classical Arabic corpus to test their ability to capture lexical collocations and co-occurrences for a number of the corpus words. Results show that the MI.log_freq association measure achieved the best results in extracting significant co-occurrences and collocations from both small and large Classical Arabic corpora, while the mutual information association measure achieved the worst results.
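
The abstract does not spell out its formulas, but the two measures compared are standard. A minimal sketch, assuming MI is pointwise mutual information over adjacent word pairs and MI.log_freq weights PMI by the log of the pair's co-occurrence frequency (the "salience" variant common in collocation work):

```python
import math
from collections import Counter

def association_scores(tokens):
    """Score adjacent word pairs with PMI and an assumed MI.log_freq.

    PMI(x, y)         = log2( p(x, y) / (p(x) * p(y)) )
    MI.log_freq(x, y) = PMI(x, y) * log(f(x, y) + 1)   # assumed form
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), f_xy in bigrams.items():
        p_xy = f_xy / (n - 1)
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[(x, y)] = {"MI": pmi, "MI.log_freq": pmi * math.log(f_xy + 1)}
    return scores
```

Ranking pairs by each score makes the paper's finding easy to reproduce in miniature: plain PMI is maximal for pairs seen only once, so it promotes noise, while the log-frequency factor suppresses such hapax pairs.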

Arabic wordnet: Semi-automatic extensions using bayesian inference

Proceedings of the …, 2008

This presentation focuses on the semi-automatic extension of Arabic WordNet (AWN) using lexical and morphological rules and applying Bayesian inference. We briefly report on the current status of AWN and propose a way of extending its coverage by taking advantage of a limited set of highly productive Arabic morphological rules for deriving a range of semantically related word forms from verb entries. The application of this set of rules, combined with the use of bilingual Arabic-English resources and Princeton's WordNet, allows the generation of a graph representing the semantic neighbourhood of the original word. In previous work, a set of associations between the hypothesized Arabic words and English synsets was proposed on the basis of this graph. Here, a novel approach to extending AWN is presented whereby a Bayesian network is automatically built from the graph and then used as an inference mechanism for scoring the set of candidate associations. Both on its own and in combination with the previous technique, this new approach has led to improved results.
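
The abstract does not give the network structure or conditional distributions, so the following is only a sketch under strong assumptions: a two-layer network in which candidate synsets are parent nodes, derived Arabic word forms are children with noisy-OR conditional distributions, and candidates are scored by their posterior given the word forms actually produced by the morphological rules.

```python
import itertools

def noisy_or_posterior(priors, links, observed):
    """Score candidate synsets by posterior given observed word forms.

    priors   : {synset: P(synset)}                 prior for each candidate
    links    : {(synset, word): P(word | synset)}  edge strengths from the graph
    observed : set of word forms the morphological rules generated

    Assumed two-layer noisy-OR network: a word node is 'on' if any parent
    synset activates it. Posteriors are computed by brute-force enumeration
    over synset states, which is fine for small neighbourhood graphs.
    """
    synsets = list(priors)
    words = {w for (_, w) in links}
    post = {s: 0.0 for s in synsets}
    total = 0.0
    for states in itertools.product([0, 1], repeat=len(synsets)):
        # prior probability of this joint synset configuration
        p = 1.0
        for s, on in zip(synsets, states):
            p *= priors[s] if on else 1 - priors[s]
        # noisy-OR likelihood of the observed evidence
        for w in words:
            p_off = 1.0
            for s, on in zip(synsets, states):
                if on:
                    p_off *= 1 - links.get((s, w), 0.0)
            p_w = 1 - p_off
            p *= p_w if w in observed else 1 - p_w
        total += p
        for s, on in zip(synsets, states):
            if on:
                post[s] += p
    return {s: post[s] / total for s in synsets}
```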

Contribution to Semantic Analysis of Arabic Language

Advances in Artificial Intelligence, 2012

We propose a new approach for determining the adequate sense of Arabic words. To this end, we propose an algorithm based on information-retrieval measures to identify the context of use that is closest to the sentence containing the word to be disambiguated. The contexts of use are sets of sentences that indicate a particular sense of the ambiguous word. These contexts are generated using the words that define the senses of the ambiguous words, the exact string-matching algorithm, and the corpus. We use measures from the field of information retrieval, Harman, Croft, and Okapi, combined with the Lesk algorithm, to assign the correct sense from among those proposed.
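
The exact Harman, Croft and Okapi weighting formulas are not reproduced here; as a stand-in, the sketch below scores each context of use by an IDF-weighted Lesk overlap with the sentence to be disambiguated, which captures the same idea of combining IR term weights with Lesk-style matching.

```python
import math

def lesk_idf(sentence_tokens, contexts, doc_freq, n_docs):
    """Pick the sense whose context of use best matches the sentence.

    contexts : {sense: list of token lists}  sentences illustrating each sense
    doc_freq : {token: document frequency}   gathered from the corpus
    Overlapping tokens are weighted by IDF, a simplified stand-in for
    the Harman/Croft/Okapi term weights the paper combines with Lesk.
    """
    def idf(t):
        return math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1))

    query = set(sentence_tokens)
    best, best_score = None, float("-inf")
    for sense, sents in contexts.items():
        # weighted overlap with the closest context sentence for this sense
        score = max(sum(idf(t) for t in query & set(s)) for s in sents)
        if score > best_score:
            best, best_score = sense, score
    return best
```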

KSUCCA: A Key To Exploring Arabic Historical Linguistics

2014

Classical Arabic forms the basis of Arabic linguistic theory and is well understood by the educated Arabic reader. It differs in many ways from Modern Standard Arabic, which is more simplified in its lexical, syntactic, morphological, phraseological and semantic structure. The King Saud University Corpus of Classical Arabic is a pioneering corpus of around 50 million words of Classical Arabic. It was initially constructed for the purpose of studying the distributional lexical semantics of the Quran and Classical Arabic; however, it is designed in a general way that also makes it appropriate for other research in linguistics and computational linguistics. In this paper, we briefly describe the structure of our corpus and then demonstrate how it can be used to depict some aspects of Arabic language change between the classical and modern periods.
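
As an illustration of the kind of diachronic comparison the paper describes, here is a hedged sketch contrasting relative frequencies (per million tokens) of selected words in a classical versus a modern corpus; corpus loading and tokenisation are assumed, and the add-one smoothing only avoids division by zero:

```python
from collections import Counter

def freq_per_million(tokens):
    """Relative frequency of each word, normalised per million tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c * 1_000_000 / n for w, c in counts.items()}

def change_ratio(classical_tokens, modern_tokens, words):
    """Classical-to-modern frequency ratio for selected words.

    Ratios far from 1.0 flag candidate items of diachronic change.
    """
    f_cls = freq_per_million(classical_tokens)
    f_mod = freq_per_million(modern_tokens)
    return {w: (f_cls.get(w, 0.0) + 1) / (f_mod.get(w, 0.0) + 1)
            for w in words}
```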

An application of distributional semantics for the analysis of the Holy Quran

2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), 2016

In this contribution we illustrate the methodology and the results of an experiment we conducted by applying Distributional Semantics Models to the analysis of the Holy Quran. Our aim was to gather information on the potential differences in meaning that the same words might take on when used in Modern Standard Arabic as opposed to their usage in the Quran. To do so we used the Penn Arabic Treebank as a contrastive corpus.
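
The abstract does not name the specific distributional model, so the sketch below substitutes word2vec (via gensim) and compares a word's nearest neighbours in a model trained on the Quran against one trained on a contrastive MSA corpus such as the Penn Arabic Treebank; low neighbour overlap flags a potential meaning difference. Variable names and hyperparameters are illustrative only.

```python
from gensim.models import Word2Vec

def neighbour_shift(quran_sents, msa_sents, word, topn=10):
    """Compare a word's nearest neighbours across two corpora.

    quran_sents / msa_sents: lists of token lists. Trains one model
    per corpus (word2vec as a stand-in for the paper's unnamed DSM)
    and returns both neighbour sets plus their Jaccard overlap.
    """
    m_quran = Word2Vec(quran_sents, vector_size=100, window=5, min_count=2)
    m_msa = Word2Vec(msa_sents, vector_size=100, window=5, min_count=2)
    n1 = {w for w, _ in m_quran.wv.most_similar(word, topn=topn)}
    n2 = {w for w, _ in m_msa.wv.most_similar(word, topn=topn)}
    return n1, n2, len(n1 & n2) / len(n1 | n2)
```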

Building Related Words in Indonesian and English Translation of Al-Qur’an Vocabulary Based on Distributional Similarity

Jurnal Teknologi Informasi dan Terapan, 2020

The Qur'an is the Muslim holy book, the primary source and guide for Muslims, consisting of 114 surahs, 30 juz and around 6,200 verses. Finding and summarising the relationships between the meanings of words in the Qur'an takes a long time when done manually from a dictionary, encyclopedia or thesaurus of Qur'anic vocabulary, in which each word entry is linked to other words. This final project discusses the interrelations and semantic correspondences between words in the Qur'an, helping to find interrelated words within it, using distributional similarity built on word embeddings. Word relatedness is measured with semantic similarity, one of the problems studied in Natural Language Processing (NLP); the similarity measure computes the proximity of word vectors using cosine similarity. Words are converted into vector form using fastText, which is the develo...
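
A minimal sketch of the pipeline the abstract outlines, using gensim's FastText implementation and cosine similarity; the variable `verses` and the query word are hypothetical placeholders:

```python
import numpy as np
from gensim.models import FastText

# Train subword-aware embeddings on tokenized verses
# (hypothetical variable `verses`: a list of token lists).
model = FastText(verses, vector_size=100, window=5, min_count=1)

def cosine(w1, w2):
    """Cosine similarity between two word vectors."""
    u, v = model.wv[w1], model.wv[w2]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Related words: nearest neighbours by cosine similarity
# ("rahmah" is a hypothetical transliterated query word).
related = model.wv.most_similar("rahmah", topn=10)
```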

How Different Is Arabic from Other Languages? The Relationship between Word Frequency and Lexical Coverage

This study examines Zipf's law as a predictor of the relationship between word frequency and lexical coverage in Arabic. Zipf's law has been applied to a number of languages, such as English, French and Greek, and has revealed useful information. However, word derivation processes are far more regular and extensive in Arabic than in English, and it is suspected that how words are defined may significantly affect the outcome of this kind of analysis: the concept of the lemma as applied to English could quite credibly be redrawn for Arabic. In this study, Arabic lemmatised frequency lists generated from a large Web-based corpus have been used to calculate coverage. Results show that Zipf's law does apply in Arabic, and the findings suggest that the most frequent 9,000 lemmatised words provide approximately 95% coverage, while 14,000 words give nearly 98% coverage. These results suggest that the relationship between word frequency and coverage in Arabic is comparable, to a certain degree, to that in English and Greek, but not to that in French. However, the definition of the lemma used in this study is probably more relevant to European languages than to Arabic, and if it were changed it would significantly change the results.
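
Given a lemmatised frequency list, the coverage figures reported above reduce to a cumulative-sum calculation. A small sketch (the frequency dictionary is assumed to come from a Web-based corpus like the one the study uses):

```python
def coverage_curve(lemma_freqs, cutoffs=(9_000, 14_000)):
    """Cumulative lexical coverage of the top-N lemmas.

    lemma_freqs: {lemma: corpus frequency}, as read from a lemmatised
    frequency list. Returns {N: fraction of corpus tokens covered}.
    """
    freqs = sorted(lemma_freqs.values(), reverse=True)
    total = sum(freqs)
    out, running = {}, 0
    for rank, f in enumerate(freqs, start=1):
        running += f
        if rank in cutoffs:
            out[rank] = running / total
    return out
```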

Arabic Word Semantic Similarity

2012

This paper is concerned with the production of an Arabic word semantic similarity benchmark dataset, the first of its kind for Arabic, developed specifically to assess the accuracy of word semantic similarity measures. Semantic similarity is an essential component of numerous applications in fields such as natural language processing, artificial intelligence, linguistics, and psychology. Most of the reported work has been done for English; to the best of our knowledge, there is no word similarity measure developed specifically for Arabic. In this paper, an Arabic benchmark dataset of 70 word pairs is presented. New methods and the best available techniques have been used in this study to produce the Arabic dataset, including selecting and creating materials, collecting human ratings from a representative sample of participants, and calculating the overall ratings. This dataset will make a substantial contribution to future work in the field of Arabic WSS ...
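
The abstract stops short of the evaluation step, but benchmark datasets of this kind are normally used by correlating a measure's scores with the collected human ratings. A sketch, assuming parallel score lists for the 70 word pairs:

```python
from scipy.stats import pearsonr, spearmanr

def evaluate_measure(machine_scores, human_ratings):
    """Correlate a similarity measure with the benchmark's human ratings.

    Both arguments are parallel lists, one value per word pair in the
    70-pair dataset. Pearson and Spearman correlations are the usual
    figures of merit for word-similarity benchmarks.
    """
    r, _ = pearsonr(machine_scores, human_ratings)
    rho, _ = spearmanr(machine_scores, human_ratings)
    return {"pearson": r, "spearman": rho}
```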

Mining the Web for the Induction of a Dialectical Arabic Lexicon

LREC. European Language Resources …, 2010

This paper describes the first phase of building a lexicon of Egyptian Cairene Arabic (ECA), one of the most widely understood dialects in the Arab world, and Modern Standard Arabic (MSA). Each ECA entry is mapped to its MSA synonym, Part-of-Speech (POS) tag and top-ranked contexts based on Web queries; each entry is thus provided with basic syntactic and semantic information for a generic lexicon compatible with multiple NLP applications. Moreover, through their MSA synonyms, ECA entries gain access to the NLP tools and resources available for MSA, which are considerably more plentiful. Using an associationist approach based on the correlations between word co-occurrence patterns in both dialects, we change the direction of the acquisition process from parallel to circular to overcome a bottleneck of current research on Arabic dialects, namely the lack of parallel corpora, and to alleviate the accuracy problems of using unrelated Web documents, which are more readily available. Manually evaluated on 1,000 word entries by two native speakers of the ECA and MSA varieties, the proposed approach achieves a promising F-measure of 70.9%. In discussing the proposed algorithm, different semantic issues are highlighted for upcoming phases of the induction of a more comprehensive ECA-MSA lexicon.
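
The reported F-measure can be reproduced mechanically once the manual judgements are encoded as a gold set. A simplified sketch, treating both the induced lexicon and the evaluators' judgements as sets of (ECA word, MSA synonym) pairs, which flattens the paper's two-annotator setup:

```python
def f_measure(proposed, gold):
    """Precision, recall and F1 of proposed ECA-MSA synonym pairs
    against a manually judged gold set (both are sets of tuples)."""
    tp = len(proposed & gold)
    precision = tp / len(proposed) if proposed else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```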