Feature Selection using Genetic Programming

Aggressive and effective feature selection using genetic programming

2012

One of the major challenges in automatic classification is dealing with high-dimensional data. Several dimensionality reduction strategies, including popular feature selection metrics such as Information Gain and χ², have already been proposed to deal with this situation. However, these strategies are not well suited when the data is very skewed, a common situation in real-world data sets. This occurs when the number of samples in one class is much larger than in the others, causing common feature selection metrics to be biased towards the features observed in the largest class. In this paper, we propose the use of Genetic Programming (GP) to implement an aggressive, yet very effective, selection of attributes. Our GP-based strategy is able to largely reduce dimensionality while dealing effectively with skewed data. To this end, we exploit some of the most common feature selection metrics and, with GP, combine their results into new sets of features, obtaining a better unbiased estimate of the discriminative power of each feature. Our proposal was evaluated against each individual feature selection metric used in our GP-based solution (namely, Information Gain, χ², Odds Ratio, and Correlation Coefficient) using the k8 cancer-rescue mutants data set, a very unbalanced collection of examples of the p53 protein. For this data set, our solution not only increases the efficiency of the learning algorithms, with an aggressive reduction of the input space, but also significantly increases their accuracy.
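As a rough illustration of one of the metrics this abstract combines, the χ² score of a binary feature can be computed from a 2×2 feature/class contingency table. The sketch below is a generic textbook formulation, not the paper's implementation:

```python
# Toy chi-squared feature-selection score for a binary feature vs. a binary
# class, computed from a 2x2 contingency table. Illustrative only; real
# pipelines typically rely on a library implementation.

def chi2_score(n11, n10, n01, n00):
    """Chi-squared statistic for the 2x2 table:
    n11: feature present, positive class   n10: feature present, negative class
    n01: feature absent,  positive class   n00: feature absent,  negative class
    """
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00   # feature-present / feature-absent totals
    col1, col0 = n11 + n01, n10 + n00   # positive / negative class totals
    chi2 = 0.0
    for obs, r, c in [(n11, row1, col1), (n10, row1, col0),
                      (n01, row0, col1), (n00, row0, col0)]:
        exp = r * c / n                 # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    return chi2

# Skewed toy data: 5 positives vs. 95 negatives; a feature that appears in
# 4 positives and 1 negative.
score = chi2_score(4, 1, 1, 94)        # ≈ 62.33
```

GP would take such per-feature scores from several metrics as inputs and evolve a combined ranking.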

Genetic algorithms as a strategy for feature selection

Journal of Chemometrics, 1992

Genetic algorithms were created as an optimization strategy to be used especially when complex response surfaces do not allow the use of better-known methods (simplex, experimental design techniques, etc.). This paper shows that these algorithms, conveniently modified, can also be a valuable tool in solving the feature selection problem. The subsets of variables selected by genetic algorithms are generally more efficient than those obtained by classical methods of feature selection, since they can produce a better result using fewer features.

Data Mining Feature Subset Weighting and Selection Using Genetic Algorithms

2012

We present a simple genetic algorithm (sGA), developed under the Genetic Rule and Classifier Construction Environment (GRaCCE), to solve the feature subset selection and weighting problems and achieve better classification accuracy with the k-nearest neighbor (KNN) algorithm. Our hypothesis is that weighting the features affects the performance of the KNN algorithm and yields a better classification accuracy rate than binary selection alone. The weighted-sGA algorithm uses real-valued chromosomes to find the weights for features, and the binary-sGA uses integer-valued chromosomes to select the subset of features from the original feature set. A repair algorithm is developed for the weighted-sGA to guarantee the feasibility of chromosomes; by feasibility we mean that the gene values in a chromosome must sum to 1. To calculate the fitness value of each chromosome in the population, we use the KNN algorithm as our fitness function. The Euclidean distance from one individual to the others is calculated in the d-dimensional feature space to classify an unknown instance. GRaCCE searches for good feature subsets and their associated weights. The feature weights are then multiplied by the normalized feature values, and these new values are used to calculate the distance between instances.
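The repair step and weighted distance described above can be sketched as follows (the function names and the normalization choice are illustrative assumptions, not the paper's code):

```python
import math

def repair(weights):
    """Repair a real-valued chromosome so its gene values sum to 1."""
    total = sum(weights)
    if total == 0:                       # degenerate chromosome: fall back to uniform
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def weighted_euclidean(x, y, weights):
    """Euclidean distance after multiplying feature values by their weights,
    as the abstract describes."""
    xs = [w * a for w, a in zip(weights, x)]
    ys = [w * b for w, b in zip(weights, y)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)))

w = repair([2.0, 1.0, 1.0])              # -> [0.5, 0.25, 0.25]
d = weighted_euclidean([1, 0, 0], [0, 0, 0], w)   # -> 0.5
```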

Addressing Optimisation Challenges for Datasets with Many Variables, Using Genetic Algorithms to Implement Feature Selection

AI, Computer Science and Robotics Technology

This article provides an optimisation method using a Genetic Algorithm approach to apply feature selection techniques to large data sets and improve accuracy. This is achieved through improved classification and a reduced number of features, and it furthermore aids in interpreting the model. A clinical dataset, based on heart failure, is used to illustrate the nature of the problem and to show the effectiveness of the techniques developed. Clinical datasets are sometimes characterised as having many variables. For instance, blood biochemistry data comprises more than 60 variables, which has led to complexity in developing outcome predictions using machine learning and other algorithms. Hence, techniques to make them more tractable are required. Genetic Algorithms can provide an efficient method of low computational complexity for effectively selecting features. In this paper, a way to estimate the number of required variables is presented, and a genetic algorithm is used in a “wrapper” form...

A genetic algorithm-based method for feature subset selection

Soft Computing, 2007

As a commonly used technique in data preprocessing, feature selection selects a subset of informative attributes or variables to build models describing data. By removing redundant, irrelevant, or noisy features, feature selection can improve the predictive accuracy and the comprehensibility of the predictors or classifiers. Many feature selection algorithms with different selection criteria have been introduced by researchers. However, no single criterion has proved best for all applications. In this paper, we propose a framework based on a genetic algorithm (GA) for feature subset selection that combines various existing feature selection methods. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for a particular inductive learning algorithm of interest when building the classifier. We conducted experiments using three data sets and three existing feature selection methods. The experimental results demonstrate that our approach is robust and effective at finding subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm.
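One simple way a GA fitness function might combine several feature selection criteria for a candidate subset, as the abstract describes, is an averaged relevance score with a subset-size penalty. The aggregation scheme and the scores below are illustrative assumptions, not the paper's framework:

```python
# Hedged sketch: combining multiple per-feature criterion scores into a
# single GA fitness for a candidate feature subset.

def combined_fitness(subset, criteria_scores, size_penalty=0.01):
    """subset: indices of the chosen features.
    criteria_scores: one list of per-feature scores per criterion,
    each assumed pre-normalized to [0, 1]."""
    if not subset:
        return 0.0
    # Average each criterion over the subset, then average across criteria.
    per_criterion = [sum(scores[i] for i in subset) / len(subset)
                     for scores in criteria_scores]
    relevance = sum(per_criterion) / len(per_criterion)
    return relevance - size_penalty * len(subset)   # prefer small subsets

ig  = [0.9, 0.2, 0.7, 0.1]    # made-up information-gain scores
chi = [0.8, 0.3, 0.6, 0.2]    # made-up chi-squared scores
f = combined_fitness([0, 2], [ig, chi])   # -> 0.73
```

The GA then evolves bit-string chromosomes (one bit per feature) under this fitness, so several criteria shape the search at once.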

An empirical study of feature selection for classification using genetic algorithm

International Journal of Advanced Intelligence Paradigms, 2018

Feature selection is one of the most important pre-processing steps in data mining, pattern recognition and machine learning. Features are eliminated either because they are irrelevant or because they are redundant. As per the literature, most approaches combine the above objectives into a single numeric measure. In this paper, by contrast, the problem of finding an optimal feature subset is formulated as a multi-objective problem. The concept of redundancy is further refined with the concept of a threshold value. Additionally, an objective of maximising entropy is added. An extensive empirical study is set up using 33 publicly available datasets. A 12% improvement in classification accuracy is reported in the multi-objective setup. The other suggested refinements are also shown to improve the performance measure. The performance improvement is statistically significant, as found by pairwise t-tests and Friedman's test.
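The entropy objective and the threshold-based redundancy check mentioned above might be sketched as follows (the exact formulations are assumptions, not the paper's definitions):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a discretized feature's value sequence;
    one candidate form of the entropy-maximisation objective."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def redundant(corr, threshold=0.9):
    """Flag a feature pair as redundant when the magnitude of their
    correlation exceeds a chosen threshold."""
    return abs(corr) > threshold

h = entropy(["a", "a", "b", "b"])   # uniform over 2 symbols -> 1.0 bit
```

A multi-objective GA would then trade off such an entropy term against relevance and the redundancy constraint rather than collapsing them into one number.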

An Implementation of genetic algorithm based feature selection approach over medical datasets

International Journal of Engineering and Technology, 2017

One of the heuristic approaches that can be applied to many real-world applications to attain optimized solutions is the Genetic Algorithm (GA). Feature selection techniques in data mining also take advantage of genetic algorithms for extracting meaningful attributes from high-dimensional datasets. This paper presents an improved genetic algorithm for the feature selection process to enhance classification or clustering results. A Multiple Linear Regression (MLR) technique is employed as the fitness function to identify the most influential attributes for building a knowledge prediction model. The results of the MLR-GA have outperformed existing feature selection algorithms in terms of accuracy.

Keywords: Genetic Algorithm, Feature Selection, MLR, Fitness Function, High Dimensional

I. INTRODUCTION

Data mining comprises techniques that analyze the meaningful facts hidden in large data stores. The tasks of data mining can be organized into data collection, cleaning, data extraction, and predictive or descriptive analysis of data, of which the extraction of meaningful data is the most important task for enhancing the outcome. Data extraction, also called feature selection, can be viewed as a search technique that proposes new subsets of features through an evaluation measure based on ranking the attributes. The ranking of attributes is computed by selecting the subset of features that minimizes the error rate. Exhaustive feature selection performs a complete search over the space to identify the best attributes, which increases the computational cost for all but the smallest of feature sets. Although feature selection algorithms are mainly classified into filter and wrapper methods, nowadays evolutionary algorithms have also been implemented to obtain optimized features. Evolutionary algorithms are meta-heuristic algorithms that evolve population-based optimized solutions in a search space.

The genetic algorithm is the first and best-known evolutionary algorithm; it imitates natural genetic processes such as selection, crossover and mutation for the reproduction of chromosomes. The algorithm is initialized with a few random individuals, called the initial population (IP), which are initially considered sub-optimal solutions to be refined in consecutive iterations. The fitness of the IP is then calculated to assess each individual's ability in the prediction task. During each iteration, a set of the best individuals is chosen to breed new offspring through crossover and mutation operations, and the fitness of the new offspring is calculated so that they replace the individuals with the lowest fitness values. In this paper, a novel genetic algorithm is proposed to calculate the fitness of the attributes and extract the features that contribute most to the prediction task.

II. REVIEW OF LITERATURE

Abualigah et al. [1] proposed a genetic algorithm-based unsupervised feature selection method (FSGATC) for extracting the subset of features that yields accurate clusters. The authors used the mean absolute difference as the fitness function, uniform two-point as the mutation operator, and a probability-based parameter as the crossover operator. They claimed that FSGATC improved the performance of text clustering with the highest accuracy, and suggested combining the proposed work with other meta-heuristic algorithms to improve the global search and find more accurate clusters. Kashyap et al. [2] proposed a multi-objective genetic algorithm for feature selection with the objectives of maximizing the Laplacian score, which analyzes the importance or relevance of features, and minimizing the inter-attribute correlation, which analyzes feature dependency. The authors set the mutation, crossover and Laplacian-score probabilities to 0.7, 0.05 and 0.5 respectively to perform the multi-objective GA operations, and claimed that their method achieved feature-set reduction by removing
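The GA workflow described in this entry (random initial population, fitness evaluation, selection, crossover, mutation, replacement) can be sketched generically. The stand-in fitness below is an illustrative assumption, not the paper's MLR-based function:

```python
import random

# Minimal binary-chromosome GA loop over feature masks. Each chromosome is a
# bit string: 1 = feature selected, 0 = dropped.
random.seed(0)
N_FEATURES, POP, GENS = 8, 20, 30
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]   # pretend mask of the truly useful features

def fitness(chrom):
    # Stand-in fitness: reward agreement with the "useful" mask. A real
    # implementation would score the subset with a model (e.g. MLR error).
    return sum(1 for g, t in zip(chrom, TARGET) if g == t)

def tournament(pop, k=3):
    # Selection: best of k randomly drawn individuals.
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    point = random.randrange(1, len(a))          # one-point crossover
    return a[:point] + b[point:]

def mutate(chrom, rate=0.05):
    return [1 - g if random.random() < rate else g for g in chrom]

# Initial population of random bit strings, then evolve for GENS generations.
pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(tournament(pop), tournament(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
```

Swapping `fitness` for a model-based score (regression error, cluster quality, classifier accuracy) turns this skeleton into the wrapper-style approaches the surveyed papers describe.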