Wesam Ashour | Islamic University of Gaza
Papers by Wesam Ashour
International Journal of Intelligent Systems and Applications
The detection of outliers in text documents is a highly challenging task, primarily due to the unstructured nature of documents and the curse of dimensionality. Text document outliers refer to text data that deviates from the text found in other documents belonging to the same category. Mining text document outliers has wide applications in various domains, including spam email identification, digital libraries, medical archives, enhancing the performance of web search engines, and cleaning corpora used in document classification. To address the issue of dimensionality, it is crucial to employ feature selection techniques that reduce the large number of features without compromising their representativeness of the domain. In this paper, we propose a hybrid density-based approach that incorporates mutual information for text document outlier detection. The proposed approach utilizes normalized mutual information to identify the most distinct features that characterize the target doma...
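As a rough illustration of how normalized mutual information can rank features, the sketch below scores a binary term-presence feature against document category labels. The function names, the toy data, and the choice of normalization (dividing by the geometric mean of the two entropies) are assumptions made for this example; the paper's exact definition is not given in the abstract.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Mutual information (in bits) between two aligned discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def normalized_mi(xs, ys):
    """MI divided by sqrt(H(X) * H(Y)); this normalization is an assumption."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a constant feature carries no information
    return mutual_information(xs, ys) / math.sqrt(hx * hy)

# Toy data (hypothetical): four documents, two categories, two candidate terms.
labels = ["sport", "sport", "news", "news"]
term_a = [1, 1, 0, 0]  # present only in "sport" documents
term_b = [1, 0, 1, 0]  # spread evenly across categories

score_a = normalized_mi(term_a, labels)
score_b = normalized_mi(term_b, labels)
```

In this toy case `term_a` separates the categories perfectly and `term_b` not at all, so a selection step that keeps the highest-scoring terms would retain `term_a`.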
Lecture notes in networks and systems, Jul 13, 2022
2019 IEEE 7th Palestinian International Conference on Electrical and Computer Engineering (PICECE)
Basic Sequential Algorithm Scheme (BSAS) is a sequential algorithm for data clustering. It is suitable for unraveling compact datasets. The BSAS algorithm is sensitive to the order of data presentation; different clustering results may be produced if the input data are presented in a different order. Because the number of clusters in the result varies with the value of the threshold, running the algorithm multiple times is one way to obtain an optimal threshold. In this paper, BSAS is optimized using the Ant Colony Optimization (ACO) algorithm to solve the order sensitivity problem. The proposed algorithm obtains the best order from the ACO algorithm, which is based on calculating minimum distances between points, and passes this order to the BSAS algorithm as its input order. Finally, the proposed algorithm is compared and verified using the Sum of Squared Errors (SSE). The experimental results show that the proposed algorithm improves on the BSAS algorithm.
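The order sensitivity described in this abstract can be seen in a minimal sketch of the textbook BSAS procedure, shown below with a sum-of-squared-errors check. It does not include the ACO ordering step from the paper; the function names, the threshold value, and the toy points are assumptions chosen for illustration.

```python
import math

def centroid(members):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    dims = len(members[0])
    return tuple(sum(p[d] for p in members) / len(members) for d in range(dims))

def bsas(points, threshold, max_clusters):
    """Textbook BSAS: one pass over the points in the order given."""
    clusters = []
    for p in points:
        if clusters:
            dists = [math.dist(p, centroid(c)) for c in clusters]
            nearest = min(range(len(clusters)), key=dists.__getitem__)
        # Open a new cluster when the nearest one is too far away
        # and the cluster budget is not yet exhausted.
        if not clusters or (dists[nearest] > threshold
                            and len(clusters) < max_clusters):
            clusters.append([p])
        else:
            clusters[nearest].append(p)
    return clusters

def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroid(c)) ** 2 for c in clusters for p in c)

# The same three 1-D points presented in two different orders.
order_1 = [(0.0,), (1.0,), (2.0,)]
order_2 = [(0.0,), (2.0,), (1.0,)]

result_1 = bsas(order_1, threshold=1.6, max_clusters=3)
result_2 = bsas(order_2, threshold=1.6, max_clusters=3)
```

With these values the first order yields a single cluster and the second yields two, and the second has the lower SSE, which is the kind of difference an ordering heuristic would try to exploit.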
The work provided in this thesis, unless otherwise referenced, is the researcher's own work, and has not been submitted by others elsewhere for any other degree or qualification.
Clustering and segmentation algorithms that use a Gaussian kernel function to construct the affinity matrix, such as spectral clustering algorithms, suffer from poor estimation of the Parzen window width σ. The final results depend on this parameter and differ each time it is changed. In this paper we present a new algorithm for estimating σ using optimization techniques. We construct a vector of σ values, each corresponding to the i-th row of a dissimilarity matrix, which is used to construct an affinity matrix using the Gaussian kernel function. Our algorithm shows that choosing σ² according to the formula derived in the paper is the optimum estimation, and we introduce more than one approach to calculating a global value for σ from this vector. The affinity matrix produced by our algorithm is very informative and contains additional information, such as the number of clusters.
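To make the construction concrete, the sketch below builds a Gaussian-kernel affinity matrix from a dissimilarity matrix using one scale value per row. The per-row scale used here (the mean distance from a point to all others) is only a placeholder assumption, as is the pairing of scales in the kernel; neither is the estimate proposed in the paper, whose formula is not reproduced in this listing.

```python
import math

def pairwise_distances(points):
    """Euclidean dissimilarity matrix for a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]

def row_scales(dist):
    """Placeholder scale per row: mean distance to the other points.

    Assumption for illustration only, not the paper's estimator.
    """
    n = len(dist)
    return [sum(row) / (n - 1) for row in dist]

def gaussian_affinity(dist, sigmas):
    """A[i][j] = exp(-d_ij^2 / (2 * sigma_i * sigma_j))."""
    n = len(dist)
    return [
        [math.exp(-dist[i][j] ** 2 / (2.0 * sigmas[i] * sigmas[j]))
         for j in range(n)]
        for i in range(n)
    ]

# Two well-separated groups of points (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
dist = pairwise_distances(points)
sigmas = row_scales(dist)
affinity = gaussian_affinity(dist, sigmas)
```

Entries between points of the same group come out close to 1 and entries across groups are much smaller, which is the block structure that spectral methods rely on; how sharp that structure is depends directly on the choice of scale.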
Journal of Engineering Research and Technology, 2017
In this paper we propose a clustering method based on a combination of Particle Swarm Optimization (PSO) and the inverse weighted clustering (IWC) algorithm. It is shown how PSO can be used to find the centroids of a user-specified number of clusters; the method essentially uses PSO to refine the clusters formed by IWC. The PSO algorithm has been shown to converge successfully during the initial stages of a global search, but around the global optimum the search process becomes very slow. In contrast, the IWC algorithm can achieve faster convergence to the optimum solution. Experimental results show that the proposed technique has much potential to improve the clustering process.
This research presents and compares the impact of text preprocessing, which has not been addressed before, on Arabic text classification using popular text classification algorithms: Decision Tree, K-Nearest Neighbors, Support Vector Machines, and Naïve Bayes and its variations. Text preprocessing includes applying different term weighting schemes and Arabic morphological analysis (stemming and light stemming). We implemented and integrated Arabic morphological analysis tools within the leading open source machine learning tools Weka and RapidMiner. Text classification algorithms are applied to seven Arabic corpora (3 collected in-house and 4 existing corpora). Experimental results show: (1) Light stemming with term pruning is the best feature reduction technique. (2) Support Vector Machines and Naïve Bayes variations outperform the other algorithms. (3) Weighting schemes affect the performance of distance-based classifiers.
Text mining has drawn more and more attention recently; it has been applied to different domains including web mining, opinion mining, and sentiment analysis. Text pre-processing is an important stage in text mining. The major obstacle in text mining is the very high dimensionality and the large size of text data. Natural language processing and morphological tools can be employed to reduce the dimensionality and size of text data. In addition, there are many term weighting schemes available in the literature that may be used to enhance the representation of text as feature vectors. In this paper, we study the impact of text pre-processing and different term weighting schemes on Arabic text classification. In addition, we develop new combinations of term weighting schemes to be applied to Arabic text for classification purposes.
International Journal of Intelligent Systems and Applications, 2012
Clustering of huge spatial databases is an important problem that tries to track the dense regions in the feature space, to be used in data mining, knowledge discovery, or efficient information retrieval. A clustering approach should be efficient and able to detect clusters of arbitrary shapes, because spatial objects cannot simply be abstracted as isolated points: they have different boundaries, sizes, volumes, and locations. In this paper we use the discrete wave atom transformation technique in clustering to achieve more accurate results. By using multi-resolution transformations such as wavelets and wave atoms, we can effectively identify arbitrarily shaped clusters at different degrees of accuracy. Experimental results on very large data sets show the efficiency and effectiveness of the proposed wave atom based clustering approach compared to other recent clustering methods. The experimental results show that we obtain more accurate results and denoised output than the other methods.
Specifying an address or assigning a specific classification to a page of text is a fairly easy process, but what if there were many such pages, amounting to a huge number of documents? The process becomes difficult and exhausting for the human mind. Automatic text classification is the ideal solution to this problem, identifying a category for each document automatically. This can be achieved by machine learning, by building a model that contains all possible attributes (features) of the text. But as the number of attributes grows, we have to pick the distinguishing features, so that a model can be created to represent the large number of attributes (thousands of attributes). To deal with the high dimensionality of the original dataset, we use a feature selection process to reduce it by deleting the irrelevant attributes (words), while the remaining features still contain the relevant information needed in the classification process. In this research, a new approach which is Binary Partic...
The K-means clustering algorithm is one of the best-known algorithms used in clustering; nevertheless, it has many disadvantages, as it may converge to a local optimum depending on its random initialization of prototypes. We propose an enhancement to the initialization process of K-means, which uses statistical information from the data set to initialize the prototypes. We show that our algorithm gives valid clusters, and that it decreases error and time. General Terms: Data Mining, Unsupervised Learning, Data Clustering.
The BIRCH algorithm is a clustering algorithm suitable for very large data sets. In the algorithm, a CF-tree is built in which all entries in each leaf node must satisfy a uniform threshold T, and the CF-tree is rebuilt at each stage with a different threshold. However, using a single threshold causes many shortcomings in the BIRCH algorithm. In this paper we propose a solution to this shortcoming by using multiple thresholds instead of a single threshold.
International Journal of Software Engineering and Its Applications
International Journal of Computer Applications
The Travelling Salesman Problem (TSP) is a well-known NP-hard problem that aims to find the shortest route that visits each city once and finally returns to the starting city. The Ant Colony Optimization (ACO) technique gives a good solution to TSP; however, it takes a lot of computational time. In this paper, a novel algorithm is proposed to solve TSP. Adaptive Affinity Propagation (AAP) is used to optimize the performance of Ant Colony Optimization. The basic idea of the proposed approach is to group cities into many clusters using AAP and then find the optimal path for each cluster separately using ACO. Thus, the computational time decreases. Experimental results show that the proposed algorithm performs better than ACO in terms of computational time and optimal path length.
International Journal of Knowledge-based and Intelligent Engineering Systems
We discuss one of the shortcomings of the standard K-means algorithm: its tendency to converge to a local rather than a global optimum. This is often accommodated by means of different random restarts of the algorithm; however, in this paper, we attack the problem by amending the performance function of the algorithm in such a way as to incorporate global information into the performance function. We do this in three different manners and show on artificial data sets that the resulting algorithms are less initialisation-dependent than the standard K-means algorithm. We also show how to create a family of topology-preserving manifolds using these algorithms and an underlying constraint on the positioning of the prototypes.
International Journal of Computer Science and Information Technology, 2016
The amount of text data mining in the world and in our lives seems to be ever increasing, with no end in sight. Text Data Mining is defined as the process of deriving high-quality information from text. It has been applied to different fields including pattern mining, opinion mining, and web mining. The concept of Text Data Mining here is based around the global stemming of different forms of Arabic words. Stemming is defined as the method of reducing inflected (or derived) words to their word stem, base, or root form. We use the REP-Tree to improve text representation. In addition, we test new combinations of weighting schemes to be applied to Arabic text data for classification purposes. For processing, the WEKA workbench is used. The results in the paper on a data set from the BBC-Arabic website also show the efficiency and accuracy of the REP-Tree in Arabic text classification.
International Journal of Signal Processing, Image Processing and Pattern Recognition, 2013
In this work, we develop a new method of setting the input-to-reservoir and reservoir-to-reservoir weights in echo state machines. We use a clustering technique which we have previously developed as a pre-processing stage to set the reservoir parameters, which at this stage are prototypes. We then use these prototypes as weights in the standard architecture while setting the reservoir-to-output weights in a standard manner. We show results on a variety of data sets from the literature which show that this method outperforms a standard random echo state machine.
Clustering is widely used to explore and understand large collections of data. The K-means clustering method is one of the most popular approaches due to its ease of use and simplicity of implementation. In this book, the researcher introduces the Distance-based Initialization Method for the K-means clustering algorithm (DIMK-means), which is developed to carefully select a set of initial centroids that gives high-accuracy results, compared to the random selection of initial centroids in the standard K-means clustering method, which gives low-accuracy results. The researcher also introduces the Density-based Split-and-Merge K-means clustering Algorithm (DSMK-means), which is developed to address the stability problems of K-means clustering and to improve clustering performance when dealing with datasets that contain clusters with different complex shapes and noise or outliers. Based on a large set of experiments, this research concluded that the developed algorithms are more capable of finding high-accuracy results than other algorithms.
We consider the problem of visualisation of high dimensional multivariate time series. A data analyst, in creating a two dimensional projection of such a time series, might hope to gain some intuition into the structure of the original high dimensional data set. We review a method for visualising time series data using an extension of Echo State Networks (ESNs). The method uses the multidimensional scaling criterion in order to create a visualisation of the time series after its representation in the reservoir of the ESN. We illustrate the method with two dimensional maps of a financial time series. The method is then compared with a mapping which uses a fixed latent space and a novel objective function.