New Approach for Text Mining of Arabic on the Web
Related papers
IJCSIS, 2017
Abstract—Online blogs allow their users to write and read text-based posts known as "articles" and have become one of the most commonly used social networks. However, an important problem is that the articles returned when searching for a topic phrase are sorted only by recency, not relevancy. This forces the user to read through the articles manually in order to understand what they primarily say about the particular topic. Some strategies have been developed for clustering English text, but Arabic text clustering is still an active research area. A major challenge in article clustering is the extremely high dimensionality. In this paper we propose a new method for feature reduction using stemming, the Arabic WordNet dictionary, and Arabic diacritics, as well as a new method for measuring similarity using Arabic WordNet relations to enhance clustering accuracy.
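The abstract names stemming and diacritic handling as the feature-reduction step; the sketch below illustrates that idea only, using NLTK's ISRI Arabic stemmer as a stand-in and omitting the Arabic WordNet lookup. The toy corpus and choice of stemmer are assumptions, not the paper's implementation.

```python
# Sketch only: collapse Arabic surface forms to stems before vectorization, as a
# stand-in for the paper's stemming/diacritics/Arabic WordNet feature reduction.
from nltk.stem.isri import ISRIStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = ISRIStemmer()

def stem_text(text):
    # ISRI stemming also normalizes away diacritics while reducing tokens to stems.
    return " ".join(stemmer.stem(tok) for tok in text.split())

docs = ["...Arabic article text...", "...another article..."]  # placeholder corpus
reduced = [stem_text(d) for d in docs]

# Stemmed documents typically yield far fewer distinct features.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reduced)
print(len(vectorizer.vocabulary_), "features after stemming")
```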
A New Model for Arabic Text Clustering by Word Embedding and Arabic Word Net
Saudi J Eng Technol, 2019; 4(10): 401-406
A major challenge in article clustering is high dimensionality, because it directly affects accuracy. The task is becoming more important due to the huge amount of textual information available online. In this paper, we propose using the Arabic WordNet dictionary to extract, select, and reduce the features. Additionally, we use the Word2Vec embedding model as a feature weighting technique. Finally, hierarchical clustering is used for the clustering step. Our method combines the Arabic WordNet dictionary with word embeddings and additionally applies discretization. The experimental results show that this method is effective and improves clustering accuracy. Keywords: Machine Learning, Clustering, CBOW, SKIP-GRAM, Word Embedding, Arabic Word Net Dictionary.
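As a rough illustration of the pipeline the abstract names (word embeddings plus hierarchical clustering), the sketch below trains a gensim Word2Vec model (sg=1 for Skip-gram, sg=0 for CBOW), averages word vectors per document, and applies agglomerative clustering. The toy documents, the averaging step, and the parameter values are assumptions, not the paper's settings.

```python
# Sketch only: embed words with gensim (gensim 4.x API), average per document,
# then apply hierarchical (agglomerative) clustering.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import AgglomerativeClustering

docs = [["سوق", "اقتصاد"], ["فريق", "مباراة"], ["أسهم", "سوق"]]  # placeholder tokens

# sg=1 trains Skip-gram; sg=0 trains CBOW.
model = Word2Vec(docs, vector_size=100, window=5, min_count=1, sg=1)

def doc_vector(tokens):
    # Represent a document by the mean of its word embeddings.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d) for d in docs])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)
```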
Reinforcing Arabic Language Text Clustering: Theory and Application
2016
This paper presents a novel approach for automatic Arabic text clustering. The proposed method combines two well-known information retrieval techniques: latent semantic indexing (LSI) and the cosine similarity measure. The standard LSI technique generates the textual feature vectors based on word co-occurrences; the proposed method instead generates the feature vectors using the cosine measures between the documents. The goal is to obtain high quality textual clusters based on semantically rich features for the benefit of linguistic applications. The performance of the proposed method is evaluated using an Arabic corpus that contains 1,000 documents belonging to 10 topics (100 documents for each topic). For clustering, we used the expectation-maximization (EM) unsupervised clustering technique to cluster the corpus's documents into ten groups. The experimental results show that the proposed method outperforms the standard LSI method by about 15%.
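A minimal sketch of the described combination, under stated assumptions: LSI via truncated SVD, document-document cosine similarities used as the feature vectors, and EM clustering with scikit-learn's GaussianMixture standing in for the EM step. The toy corpus and component counts are placeholders.

```python
# Sketch only: LSI features from truncated SVD, cosine similarities between
# documents as the feature vectors, and EM clustering via GaussianMixture.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.mixture import GaussianMixture

docs = ["market stock price", "team match goal", "economy market price"]  # placeholder

tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)

# Feature vectors built from document-document cosine measures rather than raw
# co-occurrence-based LSI features.
sim_features = cosine_similarity(lsi)

labels = GaussianMixture(n_components=2).fit_predict(sim_features)
print(labels)
```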
International Journal of Electrical and Computer Engineering (IJECE), 2018
There is a huge amount of Arabic text available online that requires organization. As a result, there are many applications of natural language processing (NLP) concerned with text organization; one of them is text classification (TC). TC makes it easier to deal with unorganized text by classifying documents into suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for classifying Arabic texts, where Arabic is a complex language due to its vocabulary: it is one of the richest languages in the world, with many linguistic bases. Research in Arabic language processing is scarce compared to English. As a result, these problems represent challenges in the classification and organization of Arabic text. Text classification helps in accessing documents or information that has already been classified into specific classes or categories. In addition, classifying documents helps search engines decrease the number of documents to consider, making it easier to search and match documents against queries.
Enhancement of Arabic Text Classification Using Semantic Relations of Arabic WordNet
Journal of Computer Science, 2015
When it comes to Arabic text documents, Text Categorization (TC) becomes a challenge. TC is needed for clustering purposes in order to complete text mining. Given the nature of the Arabic language, extracting roots or stems from the breakdown of multiple Arabic words and phrases is an important task before applying TC. The results obtained by applying the proposed algorithm are compared with the results of three popular algorithms: the Khoja stemmer, the Light stemmer, and the Root extractor. The performance of these three techniques is evaluated and compared based on the accuracy of a Naive Bayesian classifier. The obtained results demonstrate that these techniques are not as promising as expected. Therefore, we decided to consider a position tagger and conceptual representation to answer the question: which approach enhances Arabic TC performance? Arabic WordNet (AWN) is used as a lexical and semantic resource. The performance of the new "Has-hyponym" relation suggested in this work is compared with other, already used representations such as Synset, term + Synset, all Synsets, and bag-of-words to demonstrate its effectiveness. From the experimental results, it was found that the newly suggested relation improved Arabic text classification, raising the macro-averaged F1 to 0.75437 compared with the performance of the other approaches.
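The sketch below shows only the evaluation setting named in the abstract (a Naive Bayes classifier scored by macro-averaged F1); the expand_with_hyponyms function is a hypothetical placeholder for the AWN "Has-hyponym" enrichment, which is not implemented here, and the documents are toy placeholders.

```python
# Sketch only: Naive Bayes text categorization scored with macro-averaged F1.
# expand_with_hyponyms is a hypothetical placeholder for AWN "Has-hyponym" enrichment.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def expand_with_hyponyms(text):
    # Placeholder: the paper augments terms with Arabic WordNet concepts here.
    return text

docs = ["match goal team", "stock market price", "team player goal", "market economy price"]
labels = ["sport", "economy", "sport", "economy"]

X = CountVectorizer().fit_transform(expand_with_hyponyms(d) for d in docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

clf = MultinomialNB().fit(X_train, y_train)
print("macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```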
Stemming Effectiveness in Clustering of Arabic Documents
International Journal of Computer Applications, 2012
Clustering is an important task that gives good results in information retrieval (IR); it aims to automatically put similar documents in one cluster. Stemming is an important technique used for feature selection: it reduces the many redundant features that share the same root in root-based stemming, or the same syntactical form in light stemming. Stemming has many advantages: it reduces document size, increases processing speed, and is used in many applications such as information retrieval. In this paper, we evaluate stemming techniques in the clustering of Arabic-language documents and determine the most efficient one for preprocessing Arabic, which is more complex than most other languages. The evaluation uses three settings: root-based stemming, light stemming, and no stemming. K-means, one of the most famous and widely used clustering algorithms, is applied for clustering. Evaluation is based on recall, precision, and F-measure. The experimental results show that light stemming achieves the best results in terms of recall, precision, and F-measure compared with the other stemming settings.
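A minimal sketch of the evaluated setup, assuming TF-IDF weighting: light-stemmed text, K-means clustering, and precision/recall/F-measure against known topic labels. The light_stem function and the toy corpus are hypothetical placeholders, and a real evaluation would first align cluster labels with topics.

```python
# Sketch only: light-stemmed text, TF-IDF weighting, K-means clustering, and
# precision/recall/F-measure against known topic labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import precision_score, recall_score, f1_score

def light_stem(text):
    # Hypothetical placeholder: strip common Arabic prefixes/suffixes rather than
    # extracting full roots.
    return text

docs = ["match goal team", "team player goal", "stock market price", "market economy price"]
true_topics = [0, 0, 1, 1]

X = TfidfVectorizer().fit_transform(light_stem(d) for d in docs)
pred = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Cluster labels are arbitrary; a real evaluation aligns them with topics first.
print(precision_score(true_topics, pred, average="macro"),
      recall_score(true_topics, pred, average="macro"),
      f1_score(true_topics, pred, average="macro"))
```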
A New Algorithm for Arabic Document Clustering Utilizing Maximal Wordsets
Revue d'intelligence artificielle, 2024
Arabic document clustering (ADC) is a critical task in Arabic Natural Language Processing (ANLP), with applications in text mining, information retrieval, Arabic search engines, sentiment analysis, topic modeling, document summarization, and user review analysis. In spite of the critical need for ADC, the available ADC algorithms have achieved limited success according to the evaluation metrics used for clustering. This paper proposes a novel method for clustering Arabic documents. The method leverages Maximal Frequent Wordsets (MFWs), which are extracted using the FPMax algorithm, a data mining technique adept at identifying significant recurring word patterns within the documents. These MFWs serve as features for a new clustering approach that groups documents based on content similarity. Each MFW serves as a data structure housing features, their respective clustering strengths, and the corresponding documents, reducing the clustering process to a mere measurement of similarity. The proposed approach produces clustering results for varying numbers of clusters in one training session. The effectiveness of the proposed method is assessed using two well-known benchmark datasets (CNN and OSAC), achieving accuracies of 80% and 81%, respectively. This approach offers a promising contribution to the field of ANLP.
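The MFW extraction step can be sketched with mlxtend's fpmax implementation, treating each tokenized document as a transaction of words; the paper's clustering on top of the wordsets is not reproduced here, and the documents and support threshold are placeholders.

```python
# Sketch only: extract maximal frequent wordsets (MFWs) with mlxtend's FPMax,
# treating each tokenized document as a transaction of words.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpmax

docs = [["اقتصاد", "سوق", "نمو"], ["اقتصاد", "سوق"], ["رياضة", "فريق"]]  # placeholder

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(docs).transform(docs), columns=te.columns_)

# Maximal frequent wordsets: frequent wordsets that have no frequent superset.
mfws = fpmax(onehot, min_support=0.5, use_colnames=True)
print(mfws)
```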
Influence of stemming on Clustering of Arabic texts: Comparative Study in Document Retrieval
International Journal of Computer Applications, 2013
This paper first studies the influence of stemming on the quality of Arabic text clustering, and then describes testing an approach based on this clustering to improve Document Retrieval (DR). A classical local document system generally employs statistical methods to calculate the similarity between the submitted query and each document in the target collection, and finally provides an ordered list of documents (hit list). In the present approach, the collection is first clustered, and the list of returned documents is then constructed from the formed clusters, based on which cluster representative is nearest to the user's query. The choice of the Arabic language is motivated by its very particular morpho-syntactic characteristics.
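A minimal sketch of the retrieval idea, assuming TF-IDF vectors and K-means centroids as the cluster representatives: the query is routed to the nearest representative and only that cluster's documents are ranked, rather than scoring the whole collection. The corpus and query are placeholders.

```python
# Sketch only: route the query to the nearest cluster representative (K-means
# centroid) and rank just that cluster's documents.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["market stock price", "economy market price", "team match goal", "player team goal"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10).fit(X)

query = vec.transform(["market price"])
best_cluster = int(np.argmax(cosine_similarity(query, km.cluster_centers_)))

# Build the hit list only from documents in the selected cluster.
members = np.where(km.labels_ == best_cluster)[0]
scores = cosine_similarity(query, X[members]).ravel()
hit_list = members[np.argsort(scores)[::-1]]
print(hit_list)
```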
Adapt Clustering Methods for Arabic Documents
American Journal of Information Systems, 2013
This research paper develops a new clustering method (FWC) and further proposes a new approach to filtering data collected from internet resources. The focus of this paper is clustering, which groups data instances into subsets in such a manner that similar instances are grouped together while different instances belong to different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled, reducing the gigantic size of the retrieved data. This has been done by removing dissimilar text files and grouping similar documents into homogeneous clusters. A collection of Arabic text files totaling 974 MB has been gathered, processed, analyzed, and filtered using common clustering methods. These clustering methods are reviewed, divided into hierarchical, partitioning, density-based, model-based, and soft-computing methods. Following the review, the challenges of performing clustering on large data sets are discussed and tested with the proposed method. Two experiments were conducted to establish the effectiveness of the FWC method, and the obtained results show that the new FWC method suggested in this paper produced better results and outperformed existing clustering methods.
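The FWC method itself is not detailed in this abstract; the sketch below only illustrates, under assumed parameters, the filtering idea it describes: documents whose best similarity to the rest of the collection falls below a threshold are dropped before clustering the remainder.

```python
# Sketch only: drop documents whose best similarity to the rest of the collection
# falls below an assumed threshold, then cluster what remains. This is a generic
# illustration of the described filtering, not the FWC method itself.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = ["market stock price", "economy market price", "team match goal",
        "player team goal", "unrelated random noise"]  # placeholder
X = TfidfVectorizer().fit_transform(docs)

sim = cosine_similarity(X)
np.fill_diagonal(sim, 0.0)  # ignore self-similarity

THRESHOLD = 0.1  # assumed cut-off, not taken from the paper
keep = np.where(sim.max(axis=1) >= THRESHOLD)[0]

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[keep])
print(keep, labels)
```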
Document Length Variation in the Vector Space Clustering of News in Arabic: A Comparison of Methods
2020
This article is concerned with addressing the effect of document length variation on measuring semantic similarity in the text clustering of news in Arabic. Despite the development of different approaches for addressing the issue, there is no single strong conclusion recommending one approach. Furthermore, many of these have not been tested for the clustering of news in Arabic. The problem is that different length normalization methods can yield different analyses of the same data set, and there is no obvious way of selecting the best one. The choice of an inappropriate method, however, has negative impacts on the accuracy and thus the reliability of clustering performance. Given the lack of agreement and disparity of opinions, we set out to comprehensively evaluate the existing normalization techniques to determine empirically which one is best for normalizing text length to improve the text clustering performance of news in Arabic. For this purpose, a corpus of 693 stories representing different categories and of different lengths is designed. The data is analyzed using different document length normalization methods along with vector space clustering (VSC), and the analysis whose clustering structure agrees most closely with the bibliographic information of the news stories is then selected. The analysis of the data indicates that the clustering structure based on the byte length normalization method is the most accurate one. One main problem with this method, however, is that the lexical variables within the data set are not ranked, which makes it difficult to retain only the most distinctive lexical features for generating clustering structures based on semantic similarity. The study therefore proposes the integration of TF-IDF for ranking the words within all the documents so that only those with the highest TF-IDF values are retained. It can finally be concluded that the proposed model proved effective in improving the performance of the byte normalization method and thus the performance and reliability of news clustering in Arabic. The findings of the study can also be extended to IR applications in Arabic. The proposed model can be used to support Arabic retrieval systems in finding the most relevant documents for a given query based on semantic similarity rather than document length.
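As a rough sketch of the proposed combination, and not the paper's implementation: term counts are scaled by each document's byte length, TF-IDF ranks the terms so that only the highest-weighted ones are retained, and the reduced vectors are clustered in the vector space. The corpus, the number of retained terms, and the cluster count are placeholders.

```python
# Sketch only: byte-length normalization of term counts plus TF-IDF ranking to
# retain the top-k terms before vector space clustering.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import KMeans

docs = ["short story text", "a much longer news story about markets and prices",
        "another story about teams and matches"]  # placeholder

cv = CountVectorizer()
counts = cv.fit_transform(docs)

# Byte-length normalization: scale each document's counts by its size in bytes.
byte_lengths = np.array([len(d.encode("utf-8")) for d in docs], dtype=float)
normalized = counts.toarray() / byte_lengths[:, None]

# Rank terms by total TF-IDF weight and keep only the top-k columns.
tfidf = TfidfTransformer().fit_transform(counts)
top_k = np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[::-1][:100]

labels = KMeans(n_clusters=2, n_init=10).fit_predict(normalized[:, top_k])
print(labels)
```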