Arabic Natural Language Processing Research Papers
Data imbalance is a frequently occurring problem in classification tasks, where the number of samples in one category far exceeds the number in others. Quite often the minority class is the one of greatest importance, representing the concepts of interest, and it is often the hardest to obtain in real-life scenarios and applications. Imagine a customer dataset for bank loans: the majority of instances belong to the non-defaulter class and only a small number of customers are labeled as defaulters, yet in such highly imbalanced datasets classification accuracy matters more on the defaulter labels than on the non-defaulter ones. A lack of sufficient samples across the class labels causes poor classification performance when training the model. Synthetic data generation and oversampling techniques such as SMOTE and ADASYN can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While such techniques have proved useful for synthetic...
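As a concrete illustration of the oversampling techniques the abstract names, here is a minimal sketch using scikit-learn and imbalanced-learn; the 95/5 "loan" split is an invented stand-in for the paper's data, not its actual dataset.

```python
# Minimal sketch: rebalance a 95/5 class split with SMOTE and ADASYN.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Synthetic "loan" data: ~95% non-defaulters (0), ~5% defaulters (1).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
print(Counter(y))                     # e.g. Counter({0: 1897, 1: 103})

X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y_sm), Counter(y_ad))   # minority class oversampled toward parity
```

Both methods interpolate new minority samples between existing neighbours, which is exactly where the overfitting and noise concerns raised above come from.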
Life is a blessing that God granted to all His creatures. God has excelled in creating this life in its most beautiful form, making the differences among people, and the differences in graces among them, a cornerstone for enjoying life. He created the nature around us, which fascinates hearts; some enjoy hearing the melody of nature, the chirping of birds, and the chants of the seas. The situation is different for the deaf and dumb, because daily life confronts them with various difficulties: some are similar to what other individuals face, while others are their own, caused by the problems arising from their disability. The deaf and dumb try to carry on living by adapting to the conditions of their lives, sometimes succeeding and sometimes failing. Their greater challenge is to find every possible way to communicate with society easily and confidently. They face many problems because most of the individuals they interact with lack knowledge of sign language. This research aims to overcome this problem by proposing a communication application for the deaf and dumb. The application will help the public understand the sign language used by deaf and dumb individuals, so that this group can communicate with people who do not know sign language without feeling embarrassed.
People often use a single word with multiple senses, yielding a different meaning depending on the sentence in which it is used. The main goal of this work is to disambiguate an ambiguous word located in a given sentence, thereby uncovering the actual sense the word carries. This process is called Word Sense Disambiguation (WSD), an essential and ongoing subject in NLP. The domain plays a very helpful role in uncovering the actual sense of an ambiguous word. A number of approaches to WSD take a wider semantic space of ambiguous words into account [10]. This semantic space can be represented as a specific domain, task, or application [10]. Domain information, being one of them, is especially advantageous in the disambiguation process; thus, this work explores the role of the domain in disambiguation. The work also includes a score allotment to all the senses of an ambiguous word based on their semantic relation with it, which later helps in obtaining the actual sense of the word.

Keywords: Disambiguation, natural language processing, word sense

I. INTRODUCTION

In NLP, ambiguity has become a barrier to human language understanding. The best solution to overcome this barrier is the Word Sense Disambiguation process. For instance, consider the word 'bank', which has two senses: one is "financial institution" and the other is "a river edge". Now consider the following two statements: "Willows lined the bank of the stream" and "I went to the bank to get a home loan". To a human, the difference between these two senses of 'bank' is clearly understandable: it means 'river bank' in the first sentence and 'financial institution' in the second. But this is not the case for a machine, so a different solution is needed. Different solutions to WSD have been proposed; they can generally be divided into supervised and knowledge-based approaches. Of these two, the knowledge-based approach has seen rapid development in recent years [12]. Also, the availability of abundant information from different knowledge resources has narrowed the gap between the two approaches [12]. Hence, a knowledge-based approach is the one adopted in our work. The main idea of this approach is to make extensive use of WordNet and a semantic space, such as specific domains, to obtain the actual sense of the ambiguous word. Thus, the main purpose of our work is to use domain information as the best possible way to uncover the actual meaning of an ambiguous word used in a sentence. There is a hypothesis that domain information provides a powerful way to establish semantic relations among word senses [11]; thus, it can be used profitably in disambiguation. We can regard a domain as a set of words with strong semantic relations between them, so an approach that uses domain information to obtain the actual sense of an ambiguous word makes sense. Here, the basic approach we adopted was a distinct score allotment for each of the senses that fall under the semantic space of the ambiguous word's domain.
This score allotment is not random; rather, the sense that is closer to the word in a particular sentence is allotted the highest score, while the remaining senses are scored in decreasing order accordingly. This paved the way to attain the result later. This approach also consumes less time than the normal disambiguation process, so the efficiency of the project is increased as well.

II. LITERATURE SURVEY

S. G. Kolte and S. G. Bhirud approach word sense disambiguation using WordNet domains: WordNet serves as the database defining the domains, and the words in the given sentence are taken as parameters that help detect the domain through domain-oriented text analysis. They follow an unsupervised method, disambiguating nouns first using POS tags, and obtain results; the drawback is that the approach fails for a word having more than one sense in the same domain [1]. A fully unsupervised word sense disambiguation system using dependency knowledge on a specific domain was proposed in 2010; the system uses domain-specific knowledge and performs above the first-sense baseline. The authors showed that WSD can be achieved with an unsupervised method, without compromising relative to supervised approaches, by using easily available unannotated text from the internet and other sources, with good results [2].
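To make the domain-scoring idea concrete, here is a simplified, Lesk-style stand-in for the method described above: each WordNet sense of an ambiguous word is scored by the overlap between its gloss and a domain vocabulary. The domain word sets are invented for illustration and are not the paper's actual resource; NLTK's WordNet data is assumed to be installed.

```python
# Score each sense of a word by gloss overlap with a domain word set,
# then rank senses in decreasing order of score.
from nltk.corpus import wordnet as wn  # requires the nltk wordnet data

FINANCE = {"money", "loan", "deposit", "funds", "financial"}
NATURE = {"river", "stream", "water", "slope", "land"}

def score_senses(word, domain_words):
    scores = []
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        scores.append((len(gloss & domain_words), sense))
    # Highest-overlap sense first; the rest follow in decreasing order,
    # mirroring the decreasing score allotment described above.
    return sorted(scores, key=lambda s: s[0], reverse=True)

print(score_senses("bank", FINANCE)[0])  # financial sense ranks first
print(score_senses("bank", NATURE)[0])   # river-bank sense ranks first
```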
There are a number of machine learning algorithms for recognizing Arabic characters. In this paper, we investigate a range of strategies for combining multiple machine learning algorithms for the task of Arabic character recognition, where we are faced with imperfect and dimensionally variable input characters. We present two different strategies for combining multiple machine learning algorithms: a manual backoff strategy and an ensemble learning strategy. We report the performance of individual algorithms and of combined algorithms on recognizing Arabic characters. Experimental results show that combined confidence-based strategies can produce more accurate results than each algorithm produces by itself, and even than those exhibited by the majority-voting combination.
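A hedged sketch of the two combination strategies on generic data may clarify the distinction: majority voting via scikit-learn's VotingClassifier, and a manual confidence-based backoff that trusts the primary model only when its predicted probability clears a threshold. The dataset, model choices, and the 0.6 threshold are illustrative assumptions, not the paper's Arabic character setup.

```python
# Two ways to combine classifiers: confidence-based backoff and voting.
from sklearn.datasets import load_digits
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

primary = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
backup = LogisticRegression(max_iter=5000).fit(Xtr, ytr)

# Manual backoff: fall back to the backup model on low-confidence cases.
proba = primary.predict_proba(Xte)
pred = primary.classes_[proba.argmax(axis=1)]
low_conf = proba.max(axis=1) < 0.6            # threshold is an assumption
pred[low_conf] = backup.predict(Xte[low_conf])

# Ensemble strategy: soft majority voting over three learners.
vote = VotingClassifier([("rf", RandomForestClassifier(random_state=0)),
                         ("lr", LogisticRegression(max_iter=5000)),
                         ("nb", GaussianNB())], voting="soft")
vote.fit(Xtr, ytr)
print((pred == yte).mean(), vote.score(Xte, yte))
```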
Legal texts play an essential role in any organisation, be it public or private, where each actor must be aware of, and comply with, regulations. However, because of the difficulty of the legal domain, actors prefer to rely on an expert rather than search for the regulation in a collection of documents. In this paper, we use a rule-based approach built on the contextual exploration method for the semantic annotation of Algerian legal texts written in Arabic. We are interested in the specification of the semantic information of the provision types obligation, permission, and prohibition, and of the arguments role and action. A preliminary experiment presented promising results for the specification of provision types.
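As a minimal illustration of rule-based provision tagging in the spirit of contextual exploration, the sketch below lets surface trigger words select a provision type for a sentence. The Arabic trigger lists are rough illustrative assumptions, not the paper's actual linguistic resources.

```python
# Tag a sentence with a provision type from surface trigger words.
import re

# Order matters: negated permission ("لا يجوز") must be tested before
# plain permission ("يجوز"), so prohibition rules come first.
RULES = {
    "prohibition": ["يحظر", "يمنع", "لا يجوز"],   # "is prohibited", "may not"
    "obligation":  ["يجب", "يتعين", "يلزم"],      # "must", "is required to"
    "permission":  ["يجوز", "يمكن"],              # "may", "can"
}

def annotate(sentence):
    for label, triggers in RULES.items():
        if any(re.search(t, sentence) for t in triggers):
            return label
    return "none"

print(annotate("لا يجوز التنازل عن هذا الحق"))   # -> prohibition
```

Real contextual exploration also inspects the context around each trigger to resolve such conflicts, rather than relying on rule ordering alone.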
Selection of an industrial robot for a specific purpose is one of the most challenging problems in the modern manufacturing environment. Selection decisions have become more multifaceted with the continuous incorporation of advanced features and facilities, as decision makers in the manufacturing environment must assess a wide variety of alternatives against a set of conflicting criteria. Various Multiple Criteria Decision Making (MCDM) approaches are available to assist the selection procedure. The present investigation endeavours to unravel the robot selection dilemma by employing the newly proposed Multiplicative Model of Multiple Criteria Analysis (MMMCA). MMMCA is a novel model in which all performance ratings are converted into numerical values greater than or equal to unity and all non-benefit ratings are converted into the benefit category. Each normalized weight is used as the exponent of its corresponding normalized rating, and the weighted ratings are multiplied to obtain the resultant ...
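A short sketch of the multiplicative scoring idea as described above, implemented as a weighted-product computation: cost criteria are inverted into benefit form, ratings are rescaled to be at least 1, and each normalized weight acts as an exponent on its rating. The robot data and weights below are invented purely for illustration.

```python
# Weighted-product scoring over three robots and three criteria.
import numpy as np

ratings = np.array([[60., 0.40, 2540.],    # rows: robots
                    [ 6., 0.15, 1016.],    # cols: cost, repeatability, load
                    [45., 0.18, 1727.]])
benefit = np.array([False, False, True])   # cost/repeatability: lower is better
weights = np.array([0.3, 0.3, 0.4])        # assumed normalized weights

r = ratings.copy()
r[:, ~benefit] = 1.0 / r[:, ~benefit]      # convert non-benefit to benefit
r = r / r.min(axis=0)                      # rescale so every rating >= 1
scores = np.prod(r ** weights, axis=1)     # weights as exponents, then product
print(scores.argmax(), scores)             # index of the best alternative
```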
Islam is the second largest and the fastest growing religion. Islamic Law, Sharia, is a profound component of the day-to-day lives of Muslims. This creates many queries, about specific problems, that require answers, or Fatwas. While the sources of Sharia are available to anyone, it often requires a highly qualified person, the Mufti, to provide a Fatwa. To be certified for Fatwa, a Mufti must undergo a long and sophisticated education process. With followers of Islam representing almost 25% of the planet's population and generating a large volume of queries, and with the sophistication of the Mufti qualification process creating a shortage of Muftis, we have a supply-demand problem that calls for automation. This motivates the application of Artificial Intelligence (AI) to automated Islamic Fatwa. In this work, we explore the potential of AI, machine learning, and deep learning, with technologies like Natural Language Processing (NLP), to help automate Islamic Fatwa. We start by surveying the state of the art (SoTA) of NLP and explore potential use-cases for the problems of question answering and text classification in Islamic Fatwa automation. We present the first and major enabling component for AI application to Islamic Fatwa: the data. We build the largest dataset for Islamic Fatwa to date, spanning the most widely used Fatwa websites. Moreover, we present baseline systems for topic classification, topic modelling, and retrieval-based question answering, to set the direction for future research and benchmarking on our dataset. Finally, we release our dataset and baselines to the public domain, to help advance future research in the area.
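One of the named baselines, retrieval-based question answering, can be sketched in a few lines: index past questions with TF-IDF and return the answer of the nearest match. The two toy question-answer pairs are invented placeholders, not items from the released dataset.

```python
# Nearest-question retrieval: answer a query with the stored answer
# of the most similar indexed question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [("ما حكم صيام يوم الجمعة منفردا؟", "answer text 1 ..."),
         ("هل يجوز الجمع بين الصلاتين للمسافر؟", "answer text 2 ...")]
questions = [q for q, _ in pairs]

vec = TfidfVectorizer()
Q = vec.fit_transform(questions)

def answer(query):
    sims = cosine_similarity(vec.transform([query]), Q)
    return pairs[sims.argmax()][1]

print(answer("حكم صيام الجمعة"))  # returns the closest stored answer
```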
Research on tools for automating the proofreading of Arabic text has received much attention in recent years. There is an increasing demand for applications that can detect and correct Arabic spelling and grammatical errors to improve the quality of Arabic text content and application input. Our review of previous studies indicates that few Arabic spell-checking research efforts appropriately address the detection and correction of ill-formed words that do not conform to the Arabic morphology system. Even fewer systems address the detection and correction of erroneous well-formed Arabic words that are either contextually or semantically inconsistent within the text. We introduce an approach that employs deep neural network technology for error detection in Arabic text. We have developed a systematic framework for spelling and grammar error detection, as well as correction at the word level, based on a bidirectional long short-term memory (BiLSTM) network and word embeddings, with a polynomial network classifier on top of the system. To obtain conclusive results, we have developed the most significant gold-standard annotated corpus to date, containing 15 million fully inflected Arabic words. The data were collected from diverse text sources and genres, and every erroneous and ill-formed word has been annotated, validated, and manually revised by Arabic specialists. This valuable asset is available to the Arabic natural language processing research community. The experimental results confirm that our proposed system significantly outperforms Microsoft Word 2013 and OpenOffice Ayaspell 3.4, which have been used in the literature for evaluating similar research.
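The overall model shape, embeddings feeding a bidirectional LSTM whose per-token states are classified, can be sketched in PyTorch. The paper's polynomial network classifier is replaced here by a plain linear head, and all sizes are arbitrary; this is a structural sketch, not the authors' implementation.

```python
# Minimal BiLSTM token tagger: per-word correct/erroneous decision.
import torch
import torch.nn as nn

class ErrorTagger(nn.Module):
    def __init__(self, vocab=10000, dim=128, hidden=256, tags=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, tags)   # stand-in classifier head

    def forward(self, token_ids):                 # (batch, seq_len)
        states, _ = self.lstm(self.emb(token_ids))
        return self.head(states)                  # (batch, seq_len, tags)

model = ErrorTagger()
logits = model(torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 2])
```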
Arabic, like other highly inflected languages, encodes a large amount of information in its morphology and word structure. In this work, we propose two embedding strategies that modify the tokenization phase of traditional word embedding models (Word2Vec) and contextual word embedding models (BERT) to take Arabic's relatively complex morphology into account. In Word2Vec, we segment words into subwords at training time and then compose word-level representations from the subwords at test time. We train our embeddings on Arabic Wikipedia and show that they perform better than a standard Word2Vec model on multiple Arabic natural language processing datasets while being approximately 60% smaller. Moreover, we showcase our embeddings' ability to produce accurate representations of out-of-vocabulary words that were never encountered before. In BERT, we modify the tokenization layer of Google's pretrained multilingual BERT model by incorporating morphological information. By doing so, we achieve state-of-the-art performance on two Arabic NLP datasets without pretraining.
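The compose-from-subwords idea can be sketched with gensim: train Word2Vec over segmented units, then build a word vector at test time by averaging its subword vectors. The toy segmenter below just fakes a prefix split; the paper uses real morphological segmentation, and the tiny corpus is invented.

```python
# Train on subword units, compose word vectors from them at test time.
import numpy as np
from gensim.models import Word2Vec

def segment(word):                    # placeholder segmenter (assumption)
    return [word[:2], word[2:]] if len(word) > 3 else [word]

corpus = [["الكتاب", "على", "الطاولة"], ["قرأت", "الكتاب", "الجديد"]]
seg_corpus = [[s for w in sent for s in segment(w)] for sent in corpus]

model = Word2Vec(seg_corpus, vector_size=50, min_count=1, epochs=50)

def word_vector(word):                # OOV-safe composition by averaging
    subs = [s for s in segment(word) if s in model.wv]
    return np.mean([model.wv[s] for s in subs], axis=0)

print(word_vector("الكتاب").shape)    # (50,)
```

Because unseen words often share subwords with seen ones, this composition is what gives the out-of-vocabulary coverage mentioned above.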
To counter fraudulent job postings on the internet, we aim to minimize the number of such frauds through a machine learning approach that predicts the chances of a job being fake, so that candidates can stay alert and make informed decisions. The model uses NLP to analyze the sentiment and patterns in the job posting, and a TF-IDF vectorizer for feature extraction. We use the Synthetic Minority Oversampling Technique (SMOTE) to balance the data, and for classification we use a Random Forest, which predicts the output with high accuracy, runs efficiently even on large datasets, and helps prevent overfitting. The final model takes in any relevant job posting data and determines whether the job is real or fake.
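A condensed sketch of the described pipeline, TF-IDF features, SMOTE to balance the classes, and a Random Forest classifier, follows; the two toy postings stand in for a real labelled job-advert corpus.

```python
# TF-IDF -> SMOTE -> Random Forest, chained with imblearn's Pipeline
# so oversampling happens only on training folds.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Earn $5000 weekly from home, no experience, pay a small fee",
         "Software engineer, 3+ years Python, health insurance, onsite"]
labels = [1, 0]                      # 1 = fraudulent, 0 = genuine

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("smote", SMOTE(random_state=0)),
                 ("rf", RandomForestClassifier(random_state=0))])
# With a real dataset: pipe.fit(train_texts, train_labels)
#                      pipe.predict(new_posting_texts)
```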
Handout of classroom activities and homework for absolute beginners
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), which is a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, time, etc. In this paper we describe in detail a Hidden Markov Model (HMM) based machine learning approach to identifying named entities. The main idea behind using an HMM to build an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed; they are dynamic in nature, and one can define them according to one's interest. The corpus used by our NER system is also not domain specific.
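A toy Viterbi decoder shows the HMM mechanics behind such a tagger; the states (entity labels) are user-defined, matching the point above that the state set is dynamic. All probabilities here are invented for illustration.

```python
# Viterbi decoding over a user-chosen label set with fake probabilities.
import numpy as np

states = ["O", "PER", "LOC"]                    # any label set works
start = np.log([0.8, 0.1, 0.1])                 # P(first state)
trans = np.log([[0.7, 0.2, 0.1],                # P(next state | current)
                [0.5, 0.4, 0.1],
                [0.5, 0.1, 0.4]])

def viterbi(emis_log):                          # emis_log: (T, n_states)
    dp, back = start + emis_log[0], []
    for t in range(1, len(emis_log)):
        cand = dp[:, None] + trans + emis_log[t]
        back.append(cand.argmax(axis=0))        # best previous state per label
        dp = cand.max(axis=0)
    path = [int(dp.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return [states[i] for i in reversed(path)]

# Invented per-word emission log-probs for "John lives in Paris".
emis = np.log([[0.05, 0.9, 0.05], [0.8, 0.1, 0.1],
               [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
print(viterbi(emis))                            # ['PER', 'O', 'O', 'LOC']
```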
The 3rd International Conference on Semantic & Natural Language Processing (SNLP 2022) will provide an excellent international forum for sharing knowledge and results in the theory, methodology, and applications of Semantic and Natural Language Computing. The conference looks for significant contributions to all major fields of Natural Language Computing, in both theoretical and practical aspects. Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, surveying works, and industrial experiences that describe significant advances in the following areas, but are not limited to:
This research examines the status of the Arabic language in Malaysia and the means of developing it through language planning.
In recent years, we have witnessed the rapid development of deep neural networks and distributed representations in natural language processing. However, the application of neural networks to resume parsing lacks systematic investigation. In this study, we propose an end-to-end pipeline for resume parsing based on neural network classifiers and distributed embeddings. The pipeline leverages position-wise line information and the integrated meaning of each text block. Coordinated line classification by both a line type classifier and a line label classifier effectively segments a resume into predefined text blocks. The proposed pipeline joins text block segmentation with the identification of resume facts, in which various sequence labelling classifiers perform named entity recognition within the labelled text blocks. A comparative evaluation of four sequence labelling classifiers confirmed the superiority of BLSTM-CNNs-CRF in the named entity recognition task. A further comparison with three published resume parsers also confirmed the effectiveness of our text block classification method.
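As a small stand-in for the first stage (line classification), each resume line can be labelled with a block type from its text features; real systems add position-wise features and neural classifiers. The example lines and labels below are invented.

```python
# Classify resume lines into block types with character n-gram TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lines = ["John Doe", "Software Engineer at Acme, 2019-2023",
         "B.Sc. Computer Science, MIT", "Python, SQL, Docker"]
labels = ["contact", "experience", "education", "skills"]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression(max_iter=1000))
clf.fit(lines, labels)
print(clf.predict(["M.Sc. Data Science, Stanford"]))  # likely 'education'
```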
Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate languages that humans use naturally to address computers.
This article describes a new approach for preprocessing vowelized and unvowelized Arabic texts in order to prepare them for Natural Language Processing (NLP) purposes. The approach is rule-based and made up of four phases: text tokenization, word light stemming, word morphological analysis, and text annotation. The first phase preprocesses the input text in order to isolate the words and represent them in a formal way. The second phase applies a light stemmer in order to extract the stem of each word by eliminating the prefixes and suffixes. The third phase is a rule-based morphological analyzer that determines the root and the morphological pattern for each extracted stem. The last phase produces an annotated text in which each word is tagged with its morphological attributes. The preprocessor presented in this paper is capable of dealing with vowelized and unvowelized words, and provides the input words along with the relevant linguistic information needed by different applications. It is designed to be used with NLP applications such as machine translation, text summarization, text correction, information retrieval, and automatic vowelization of Arabic text.
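A toy light stemmer in the spirit of the second phase strips common Arabic prefixes and suffixes by surface matching. The affix lists are a small illustrative subset, not the paper's full rule set.

```python
# Strip one prefix and one suffix, keeping the stem above a minimum length.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ه", "ة"]

def light_stem(word, min_len=3):
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والمكتبات"))  # -> مكتب
```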
On social media platforms, hate speech can be a cause of "cyber conflict", which can affect social life at both the individual level and the country level. Hateful and antagonistic content propagated via social networks has the potential to cause harm and suffering to individuals and to lead to social tension and disorder beyond cyberspace. However, social networks cannot control all the content that users post. For this reason, there is a demand for automatic detection of hate speech. This demand is particularly acute when the content is written in complex languages (e.g., Arabic). Arabic text is known for its challenges, its complexity, and the scarcity of its resources. This paper presents a background on hate speech and related detection approaches. In addition, recent contributions on hate speech and related antisocial behaviour topics are reviewed. Finally, challenges and recommendations for the Arabic hate speech detection problem are presented.
Apple Siri, Amazon Alexa and, more recently, Google Duplex are just a few of the most well-known instances of NLP at work, and there are estimated to be eight billion digital voice assistants in use owing to their popularity. With such a large amount of data generated from these interactions, further research and development of Natural Language Processing is possible, and it can be used in many industries such as healthcare, technology, and business. Sentiment analysis, with the help of Natural Language Processing, helps industries process these huge datasets faster and more efficiently, and can be used, for example, in the healthcare industry to diagnose patients and develop diagnostic models for detecting chronic disease in its early stages. Natural language processing refers to a system's capacity to process sentences written in natural languages such as English, rather than in specialized artificial computer languages; it is a branch of artificial intelligence that examines data in order to better comprehend natural human language. NLP is a branch of machine learning that focuses on analyzing the nuances of how people communicate with one another, so that it may replicate how we speak to and with one another. It is impossible to overestimate the relevance of computer communication in ordinary language, and NLP is poised to revolutionize customer service and beyond. While computers have always been extremely useful for abstract tasks such as quantification, and while computing technologies enable extremely rapid and precise communication channels, machines have never been satisfactory at understanding how and why we interact. NLP is dedicated to deciphering the interaction between humans and machines through the use of language. Today's consumers generate a vast amount of data through digital means. Because the data is created across multiple platforms, it is tough for businesses to keep track of it and respond. NLP solutions give businesses a reliable way to keep track of a large amount of user content across several digital channels. Companies can actively engage customers by replying to their queries or comments using NLP techniques. NLP tools also assist businesses in installing various content filters, such as filtering undesired information or implementing spam filters on websites or social media pages. Businesses utilize NLP models to keep the quality of discussion high as a result; such models can also automatically flag "banned" words or vulgarism. NLP can be used to handle enormous volumes of text data at unprecedented speeds using cloud/distributed computing, and machines can now interpret language in a more consistent and unbiased manner.
With the recent developments in the field of Natural Language Processing, there has been a rise in the use of different architectures for Neural Machine Translation. Transformer architectures are used to achieve state-of-the-art accuracy, but they are very computationally expensive to train, and not everyone has a setup with high-end GPUs and other resources. We train our models on low computational resources and investigate the results. As expected, transformers outperformed other architectures, but there were some surprising results: transformers with more encoders and decoders took more time to train yet achieved lower BLEU scores. The LSTM performed well in the experiment and took comparatively less time to train than the transformers, making it suitable for situations with time constraints.
Kridantas play a vital role in understanding the Sanskrit language. Kridantas include nouns, adjectives, and indeclinable words called avyayas. Kridantas are formed from a root and certain suffixes called Krits, and sometimes they occur with certain prefixes. Many morphological analyzers lack a complete analysis of Kridantas. This paper describes a novel approach to dealing completely with Kridantas.
This article presents an English mnemonic that assists in teaching the pronunciation differences of the sun and moon letters of the Arabic alphabet when combined with the definite article. (Key words: Arabic pedagogy, sun letters, moon... more
This article presents an English mnemonic that assists in teaching the pronunciation differences of the sun and moon letters of the Arabic alphabet when combined with the definite article. (Key words: Arabic pedagogy, sun letters, moon letters, second language acquisition, definite article)
The method of deriving (istinbat) fiqh rulings in Islam is one of the methods of establishing rulings, whether in the aspects of worship, marriage, transactions (muamalat), or criminal law. Imam al-Shafie, as is well known, is one of the fiqh scholars who derived many rulings in Islam. There are five sources used by Imam al-Shafie in deriving fiqh rulings: the Quran and the Sunnah, ijmak (consensus), the opinions of the Companions of Prophet Muhammad SAW that were agreed upon and those that were not, and qias (analogy). Consulting all of these sources requires understanding and precision, because the reference sources are in Arabic. Accordingly, this paper aims to examine the importance of learning and mastering the Arabic language in the effort to derive fiqh rulings according to Imam al-Shafie. In conclusion, learning and mastering Arabic is fundamental for the fuqaha' in deriving fiqh rulings.
In this digital era, social media is an important tool for information dissemination, and Twitter is a popular social media platform. Social media analytics helps make informed decisions based on people's needs and opinions. This information, when properly perceived, provides valuable insights into different domains, such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised algorithm for discovering hidden patterns in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to coronavirus discussions. This study compares regular LDA and LDA based on collapsed Gibbs sampling (LDAMallet). The experiments use different data processing steps, including with and without trigrams and with and without hashtags. The study provides a comprehensive analysis of LDA for short text messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags helps improve topic inference, yielding a better coherence score.
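A compact gensim sketch of the compared setup: train LDA on tokenised tweets and report a coherence score. The two toy tweets replace a real pooled/un-pooled corpus, and since LdaMallet needs the external Mallet toolkit, plain LdaModel stands in here.

```python
# LDA over tokenised tweets plus a c_v coherence score.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

tweets = [["corona", "vaccine", "dose", "hospital"],
          ["lockdown", "school", "online", "class"]]
dic = Dictionary(tweets)
bow = [dic.doc2bow(t) for t in tweets]

lda = LdaModel(bow, num_topics=2, id2word=dic, random_state=0, passes=10)
coh = CoherenceModel(model=lda, texts=tweets, dictionary=dic,
                     coherence="c_v").get_coherence()
print(lda.print_topics(), coh)
```

Pooling in the sense used above means concatenating tweets that share a hashtag into one document before building `bow`, which gives LDA longer documents to learn from.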
Lexical resources are abundant for accelerating and simplifying sentiment analysis in English; in Arabic, there are few resources, and they are not comprehensive. Most current research efforts for constructing an Arabic Sentiment Lexicon (ASL) depend on a large number of lexical entries. However, coverage of Arabic sentiment expressions can instead be achieved using refined regular expressions rather than a large number of lexical entries. This paper presents an ASL that is more comprehensive than the existing lexicons, covering many expressions in different dialects including Franco-Arabic, while at the same time being more compact. The paper also shows how to integrate different lexicons and refine them. To enrich the lexical entries with robust morphological and syntactical information, regular expressions, sentiment polarity weights, and n-gram terms have been attached to each lexical entry.
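A small illustration of the regular-expression idea: one pattern can cover many inflected surface forms that would otherwise need separate lexicon entries. The patterns and weights below are invented examples, including a Franco-Arabic one.

```python
# Lexicon entries as (compiled pattern, polarity weight) pairs.
import re

LEXICON = [
    (re.compile(r"جميل(ة|ين|ات)?"), +1.0),   # "beautiful" + inflections
    (re.compile(r"سي(ء|ئة|ئين)"), -1.0),     # "bad" + inflections
    (re.compile(r"7elwa?"), +0.8),            # Franco-Arabic "nice"
]

def polarity(text):
    return sum(w for pat, w in LEXICON if pat.search(text))

print(polarity("الجو جميل والمكان 7elw"))  # -> 1.8
```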
Parsing of Arabic sentences is considered a necessary component in many applications that rely on natural language processing (NLP) techniques, such as automatic translation, information retrieval, and automatic summarization. The study of diacritical marks plays an important role in the formation of meaning, because parsing helps in understanding the meaning and the relationships between sentence parts.
In this paper, an intelligent hybrid Arabic parser relying on a Genetic Algorithm (GA) and an expert system containing the Arabic grammar has been designed and implemented. The text is segmented into its constituent sentences. Initial solutions (chromosomes) are encoded, an initial population is generated, and genetic operations are applied. A search and inference engine applying a hybrid control structure that combines forward chaining and backward chaining has been designed. The results of the morphological and syntactic analyzers are relied on to evaluate solutions; evaluation is performed through a fitness function to reach the optimal solution, which is the true parse of the sentence's words. The parser has been tested on many Arabic sentences, and the results have been evaluated using the precision criterion. The results of this new and novel system prove that it is able to parse Arabic sentences correctly and with high accuracy. This opens a broad horizon for the understanding and automatic processing of Arabic text.
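A toy genetic-algorithm search over per-word parse tags echoes the hybrid design above: candidate tag sequences evolve, and a simple fitness function stands in for the expert system's grammar checks. The tags, rules, and sentence are all invented for illustration.

```python
# Evolve tag sequences; fitness rewards grammar-consistent adjacent pairs.
import random
random.seed(0)

words = ["ذهب", "الولد", "إلى", "المدرسة"]
TAGS = ["verb", "subject", "prep", "object"]
GOLD_RULES = {("verb", "subject"), ("subject", "prep"), ("prep", "object")}

def fitness(chrom):
    return sum((a, b) in GOLD_RULES for a, b in zip(chrom, chrom[1:]))

pop = [[random.choice(TAGS) for _ in words] for _ in range(30)]
for _ in range(50):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]
    children = []
    for _ in range(20):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, len(words))          # one-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < 0.2:                      # mutation
            child[random.randrange(len(words))] = random.choice(TAGS)
        children.append(child)
    pop = parents + children
print(max(pop, key=fitness))  # typically ['verb', 'subject', 'prep', 'object']
```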
The continuous rapid growth of electronic Arabic content in social media channels, and on Twitter in particular, poses an opportunity for opinion mining research. Nevertheless, it is hindered by the lack of sentiment analysis resources and by the challenges of Arabic language text analysis. This study introduces an Arabic Jordanian Twitter corpus in which tweets are annotated as either positive or negative. It investigates different supervised machine learning sentiment analysis approaches applied to Arabic users' social media posts on general subjects, written in either Modern Standard Arabic (MSA) or Jordanian dialect. Experiments are conducted to evaluate different weighting schemes, stemming, and N-gram techniques and scenarios. The experimental results provide the best scenario for each classifier and indicate that an SVM classifier using the term frequency-inverse document frequency (TF-IDF) weighting scheme with stemming and bigram features outperforms the best scenario of the Naive Bayes classifier. Furthermore, this study's results outperform results from comparable related work.
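A hedged sketch of the winning configuration reported above: TF-IDF with unigram-plus-bigram features feeding an SVM, compared against Naive Bayes. Stemming is omitted for brevity, and the two tweets are invented placeholders, not corpus items.

```python
# Compare SVM and Naive Bayes over the same TF-IDF n-gram features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["الخدمة ممتازة والتطبيق سريع", "تجربة سيئة جدا ولا أنصح به"]
labels = ["positive", "negative"]

svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
nb = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
svm.fit(tweets, labels)
nb.fit(tweets, labels)
print(svm.predict(["تطبيق ممتازة وسريع"]))  # likely 'positive'
```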
Computer Science & Information Technology (CS & IT) is an open-access, peer-reviewed Computer Science Conference Proceedings (CSCP) series that welcomes conferences to publish their proceedings / post-conference proceedings. The series focuses on publishing high-quality papers to help the scientific community, furthering our goal of preserving and disseminating scientific knowledge. Conference proceedings are accepted for publication in CS & IT - CSCP based on peer-reviewed full papers and revised short papers that target the international scientific community and the latest IT trends. Our mission is to provide the most valuable publication service.