A Comparative Study on Bioinformatics Feature Selection and Classification
Related papers
International Journal of Computer Science & Engineering Survey, 2011
DNA microarray technology has modernized biological research: scientists can now measure the expression levels of thousands of genes simultaneously in a single experiment. Gene expression profiles, which represent the state of a cell at the molecular level, have great potential as a medical diagnosis tool. However, compared to the number of genes involved, the available training data sets generally have a fairly small sample size for classification. These training data limitations pose a challenge to certain classification methodologies. Feature selection techniques can be used to extract the marker genes that influence classification accuracy by eliminating the unwanted noisy and redundant genes. This paper presents a review of feature selection techniques that have been employed in microarray-based cancer classification, and of the predominant role of SVM in cancer classification.
In gene expression datasets, classification is a high-dimensional and risky task, since a large number of features are irrelevant and redundant. Classification requires both a feature selection method and a classifier; hence this paper proposes a method for choosing a suitable combination of attribute selection and classification algorithms for good accuracy as well as computational efficiency, generalization performance and feature interpretability. In this paper, a comparative study is carried out using some well-known feature selection methods such as FCBF, ReliefF,
2017 10th International Conference on Electrical and Electronics Engineering (ELECO), 2017
In this study, three different feature selection algorithms are compared using Support Vector Machines as the classifier for cancer classification through gene expression data. The ability of the feature selection algorithms to select an optimal gene subset for a cancer type is evaluated by the classification ability of the selected genes. A publicly available microarray dataset provides the gene expression values. The selected gene subsets were able to classify subtypes of the considered cancer type with high accuracy, showing that these feature selection methods are applicable to biomarker gene selection.
Gene expression data classification using genetic algorithm-based feature selection
TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES, 2021
In this study, hybrid methods are proposed for feature selection and classification of gene expression datasets. In the proposed genetic algorithm/support vector machine (GA-SVM) and genetic algorithm/k nearest neighbor (GA-KNN) hybrid methods, the genetic algorithm is improved using Pearson's correlation coefficient, Relief-F, or mutual information. The crossover and selection operations of the genetic algorithm are specialized. Eight different gene expression datasets are used for the classification process. The classification performance of the proposed methods is compared with the traditional GA-KNN and GA-SVM wrapper methods and with other studies in the literature. The classification results demonstrate that the proposed methods obtain higher accuracy rates than the other methods on all datasets.
A Novel Feature Selection for Gene Expression Data
Proceedings of the 9th Joint Conference on Information Sciences (JCIS), 2006
The feature selection process can be considered a problem of global combinatorial optimization in machine learning, which reduces the number of features, removes irrelevant, noisy and redundant data, and results in an acceptable classification accuracy. Therefore, a good feature selection method based on the number of features investigated for sample classification is needed in order to speed up processing, improve predictive accuracy, and avoid incomprehensibility. In this paper, particle swarm optimization (PSO) is used to implement feature selection, and the K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) serves as an evaluator of PSO. Support vector machines (SVMs) with the one-versus-rest method serve as the classifier for the classification problem. Experimental results show that our method simplifies features effectively and obtains a higher classification accuracy compared to the other classification methods from the literature.
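The evaluator role described in this abstract (K-NN with LOOCV scoring a candidate feature subset) can be sketched as below. This is only an illustrative sketch, not the paper's implementation: the function name `loocv_knn_accuracy`, the binary-mask encoding of a candidate subset, the Euclidean distance, and the default `k=1` are all assumptions, and the PSO search and the SVM classifier are not shown.

```python
import math


def loocv_knn_accuracy(samples, labels, mask, k=1):
    """Leave-one-out accuracy of k-NN using only features where mask[j] is 1.

    samples: list of feature vectors (one per sample)
    labels:  list of class labels, same length as samples
    mask:    list of 0/1 flags, one per feature (assumed subset encoding)
    """
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty subset cannot classify anything

    correct = 0
    for i, query in enumerate(samples):
        # Distances from the held-out sample to every other sample.
        dists = []
        for m, other in enumerate(samples):
            if m == i:
                continue
            d = math.sqrt(sum((query[j] - other[j]) ** 2 for j in selected))
            dists.append((d, labels[m]))
        dists.sort(key=lambda t: t[0])

        # Majority vote among the k nearest neighbours.
        votes = [lab for _, lab in dists[:k]]
        predicted = max(sorted(set(votes)), key=votes.count)
        if predicted == labels[i]:
            correct += 1
    return correct / len(samples)
```

In a wrapper search of this kind, an optimizer would call such a function once per candidate mask and keep the masks with the highest score.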
A novel feature selection method to improve classification of gene expression data
2004
This paper introduces a novel method for selecting a minimum number of genes (features) for a classification problem based on gene expression data, with an objective function that maximises the classification accuracy. The method uses a hybrid of the Pearson correlation coefficient (PCC) and signal-to-noise ratio (SNR) methods combined with an evolving classification function (ECF). First, the correlation coefficients between genes in a set of thousands are calculated. Genes that are highly correlated across samples are considered either dependent or coregulated and form a group (a cluster). The signal-to-noise ratio (SNR) method is applied to rank the correlated genes in this group according to their discriminative power towards the classes. Genes with the highest SNR are used in a preliminary feature set as representatives of each group.
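The grouping and ranking steps described in this abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' code: the correlation threshold of 0.9, the greedy grouping rule, the two-class SNR definition |mean0 - mean1| / (std0 + std1), and every function name are hypothetical, and the evolving classification function (ECF) is not shown.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / (sx * sy)


def snr(values, labels):
    """Signal-to-noise ratio of one gene for a two-class problem (labels 0/1)."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    denom = pstdev(a) + pstdev(b)
    if denom == 0:
        return 0.0 if mean(a) == mean(b) else math.inf
    return abs(mean(a) - mean(b)) / denom


def group_genes(genes, threshold=0.9):
    """Greedy grouping (assumed rule): a gene joins the first group whose
    seed gene it correlates with at or above the threshold."""
    groups = []
    for idx, profile in enumerate(genes):
        for group in groups:
            if abs(pearson(genes[group[0]], profile)) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups


def select_representatives(genes, labels, threshold=0.9):
    """Pick the highest-SNR gene from each correlated group."""
    groups = group_genes(genes, threshold)
    return [max(g, key=lambda i: snr(genes[i], labels)) for g in groups]
```

Here `genes` is a list of per-gene expression profiles (one value per sample), so the output is a list of gene indices forming the preliminary feature set.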
Review On Feature Selection Approaches Using Gene Expression Data
Feature selection has become an elementary tool for processing high-dimensional data. DNA microarray technology is used to study a large number of genes simultaneously, which helps in determining the expression levels of the genes. Gene selection from high-dimensional gene expression data is imperative for the prediction and classification of disease. Gene expression data can be represented as a matrix and usually contains irrelevant, redundant and noisy data, which makes the study and analysis of the data very difficult. The prime purpose of feature selection approaches is to overcome the curse of dimensionality and to improve the performance and accuracy of classification and clustering algorithms by eliminating these irrelevant features and reducing noise. This paper explains the taxonomy of feature selection methods, stating their respective pros and cons. It also presents a review of a few feature selection approaches, mainly those that have been proposed over the past few years.
Effectiveness of Feature Selection and Classification Techniques for Gene Expression Data Analysis
2011
Gene expression data is characterized by high dimensionality and a small number of samples. Reducing the dimensionality is essential for effective analysis of the samples and for efficient knowledge discovery. There is a tradeoff between feature selection and maintaining acceptable accuracy. The target is to find the reduction level, or compact set of features, which once used for knowledge discovery will lead to acceptable accuracy. Realizing the importance of dimensionality reduction for gene expression data, this paper presents a novel framework which integrates dimensionality reduction with classification for gene expression data analysis. In other words, we present techniques for feature selection and demonstrate their effectiveness once coupled with data mining techniques for knowledge discovery. We concentrate on four feature selection techniques, namely chi-square, consistency subset, clustering-based and community-based. The effectiveness of the feature reduction techniques ...
A novel approach for automatic gene selection and classification of gene based colon cancer datasets
2014 International Conference on Emerging Technologies (ICET), 2014
Colon cancer heavily changes the composition of human genes (expressions). The deviation in the chemical composition of genes can be exploited to automatically diagnose colon cancer. The major challenge in the analysis of human gene-based datasets is their large dimensionality. Therefore, efficient techniques are needed to select discerning genes. In this research article, we propose a novel classification technique that exploits the variations in gene expressions to classify colon gene samples into normal and malignant classes, and that tackles the large dimensionality of gene-based datasets. Previously, individual feature selection techniques have been used for the selection of discerning gene expressions; however, their performance is limited. In this research study, we propose a feed-forward gene selection technique in which two feature selection techniques are used one after the other. The genes selected by the first technique are fed as input to the second feature selection technique, which selects genes from the given gene subset. The selected genes are then classified using the linear kernel of support vector machines (SVM). The feed-forward approach to gene selection has shown improved performance. The proposed technique has been tested on three standard colon cancer datasets, and improved performance has been observed. The feed-forward method of gene selection substantially reduces the size of gene-based datasets, thereby reducing the computational time to a great extent. The performance of the proposed technique has also been compared with existing techniques for colon cancer diagnosis, and improved performance has been observed. Therefore, we hope that the proposed technique can be effectively used for the diagnosis of colon cancer.
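The feed-forward idea in this abstract, where the output of one feature selection technique becomes the input of a second, can be illustrated with a small sketch. The abstract does not name the two techniques, so the two filters used here (a variance filter followed by an absolute class-mean gap filter) are stand-ins chosen purely for illustration, and all function names and the `k1`/`k2` parameters are hypothetical. The final linear-kernel SVM step is not shown.

```python
from statistics import mean, pvariance


def top_k(scores, k):
    """Positions of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def variance_scores(samples, gene_ids):
    """Stage-1 stand-in filter: population variance of each gene."""
    return [pvariance([s[g] for s in samples]) for g in gene_ids]


def mean_gap_scores(samples, labels, gene_ids):
    """Stage-2 stand-in filter: |mean(normal) - mean(malignant)| per gene."""
    scores = []
    for g in gene_ids:
        normal = [s[g] for s, lab in zip(samples, labels) if lab == 0]
        malignant = [s[g] for s, lab in zip(samples, labels) if lab == 1]
        scores.append(abs(mean(normal) - mean(malignant)))
    return scores


def feed_forward_select(samples, labels, k1, k2):
    """Keep k1 genes with the first filter, then k2 of those with the second."""
    all_genes = list(range(len(samples[0])))
    stage1_scores = variance_scores(samples, all_genes)
    stage1 = [all_genes[i] for i in top_k(stage1_scores, k1)]
    # The second filter only ever sees the genes that survived stage 1.
    stage2_scores = mean_gap_scores(samples, labels, stage1)
    return [stage1[i] for i in top_k(stage2_scores, k2)]
```

Because the second filter scores only `k1` genes rather than the full gene set, a more expensive second-stage method becomes affordable, which is consistent with the reduction in computational time the abstract reports.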
A Comparative Analysis of Feature Extraction Methods for Classifying Colon Cancer Microarray Data
ICST Transactions on Scalable Information Systems, 2017
Feature extraction is a proficient method for reducing dimensions in the analysis and prediction of cancer classification. The microarray procedure has shown great importance in fetching informative genes that need enhancement in diagnosis. Analyzing microarray data is a challenging task because the datasets are high-dimensional with few samples and contain many noisy or irrelevant genes and missing data. In this paper, a comparative study is proposed to demonstrate the effectiveness of feature extraction as a dimensionality reduction process, and it concludes by investigating the most efficient approach that can be used to enhance the classification of microarray data. Principal Component Analysis (PCA) as an unsupervised technique and Partial Least Squares (PLS) as a supervised technique are considered, and a Support Vector Machine (SVM) classifier was applied to the dataset. The overall result shows that the PLS algorithm provides an improved performance of about 95.2% accuracy compared to the PCA algorithm.