Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method

Gene Selection from Gene Expression Data using Genetic Algorithm for Cancer Classification

2006

Abstract: Constantly improving gene expression technologies offer the ability to measure the expression levels of thousands of genes in parallel. Gene expression data are expected to significantly aid in the development of efficient cancer diagnosis. A key issue that needs to be addressed is the selection of a small number of genes that contribute to a disease from the thousands of genes measured on microarrays, which are inherently noisy.

A Three-Stage Method to Select Informative Genes from Gene Expression Data in Classifying Cancer Classes

2010

Abstract: Gene selection for cancer classification faces a major problem arising from properties of the data, such as the small number of samples compared to the huge number of genes, irrelevant genes, and noisy data. Hence, this paper aims to select a near-optimal (small) subset of informative genes that is most relevant for cancer classification. To achieve this aim, a three-stage method is proposed.

Reliability Analysis of Classification of Gene Expression Data Using Efficient Gene Selection Techniques

Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminates biological samples of different types. Classification of tissue samples as tumor or normal is one of the applications of microarray technology, and gene selection plays an important role in it. In this paper, we propose a two-stage selection algorithm for genomic data that combines existing statistical gene selection techniques with the ROC scores of SVM and k-NN classifiers. The motivation for using a Support Vector Machine is that DNA microarray problems can be very high dimensional and have very few training samples, a situation particularly well suited to an SVM approach. In the first stage, genes with similar expression profiles are grouped into distinct clusters, the cluster quality and the discriminative score of each gene are calculated using statistical techniques, and informative genes are then selected from these clusters based on the cluster quality and discriminative score. In the second stage, the effectiveness of this technique is investigated by comparing the ROC scores of SVMs that use different kernel functions with those of k-NN classifiers. Leave-one-out cross-validation (LOOCV) is then used to validate the techniques.
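The evaluation step described above (a ROC score obtained under leave-one-out cross-validation) can be illustrated with a minimal pure-Python sketch. It covers only a k-NN classifier; the SVM kernels and the clustering stage are not reproduced. The function names `roc_auc` and `loocv_knn_scores`, the use of squared Euclidean distance, and the use of the fraction of positive neighbours as the decision score are assumptions made for this illustration, not details taken from the abstract.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    sample receives the higher score; ties count as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loocv_knn_scores(X, y, k=3):
    """Leave-one-out k-NN scores: each sample is scored by the fraction of
    its k nearest *other* samples that carry label 1.

    X is a list of samples (each a list of expression values for the
    already selected genes); y holds 0/1 class labels.
    """
    scores = []
    for i in range(len(X)):
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
            for j in range(len(X)) if j != i
        )[:k]
        scores.append(sum(y[j] for _, j in neighbours) / k)
    return scores


# Hypothetical toy data: one selected gene, three samples per class.
X_toy = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y_toy = [0, 0, 0, 1, 1, 1]
print(roc_auc(loocv_knn_scores(X_toy, y_toy, k=2), y_toy))
```

Because each sample is held out before its neighbours are found, the resulting scores are not contaminated by the sample's own label, which is the point of using LOOCV when only a few samples are available.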

Combining Bayesian Networks, k Nearest Neighbours Algorithm and Attribute Selection for Gene Expression Data Analysis

2004

In recent years, there has been a large growth in gene expression profiling technologies, which are expected to provide insight into cancer-related cellular processes. Machine Learning algorithms, which are extensively applied in many real-world areas, are still not popular in the Bioinformatics community. We report on the successful application of a combination of two supervised Machine Learning methods, Bayesian Networks and the k Nearest Neighbours algorithm, to cancer class prediction problems in three DNA microarray datasets of huge dimensionality (Colon, Leukemia and NCI-60). The essential gene selection process in microarray domains is performed by a sequential search engine, and the selected genes are then used for learning the Bayesian Network model. Once the genes are selected for the Bayesian Network paradigm, we combine this paradigm with the well-known k-NN algorithm in order to improve the classification accuracy.

A novel feature selection method to improve classification of gene expression data

2004

This paper introduces a novel method for selecting a minimum number of genes (features) for a classification problem based on gene expression data, with the objective of maximising classification accuracy. The method uses a hybrid of the Pearson correlation coefficient (PCC) and signal-to-noise ratio (SNR) methods combined with an evolving classification function (ECF). First, the correlation coefficients between genes in a set of thousands are calculated. Genes that are highly correlated across samples are considered either dependent or coregulated and form a group (a cluster). The SNR method is then applied to rank the correlated genes in each group according to their discriminative power towards the classes. The genes with the highest SNR are used in a preliminary feature set as representatives of each group.
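The grouping and ranking steps described above can be sketched as follows. This is a simplified illustration, not the paper's procedure: the abstract gives no formulas, so the SNR definition used here (absolute difference of class means divided by the sum of class standard deviations), the correlation threshold of 0.9, the greedy grouping strategy, and all function names are assumptions. The evolving classification function (ECF) stage is omitted.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def snr(values, labels):
    """Two-class signal-to-noise ratio |mu0 - mu1| / (sd0 + sd1) (assumed form)."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    denom = pstdev(a) + pstdev(b)
    return abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0


def select_representatives(expr, labels, threshold=0.9):
    """Pick one representative gene per group of correlated genes.

    expr maps gene name -> expression values across samples. Genes are
    visited in descending SNR order; each unassigned gene seeds a group
    and absorbs every unassigned gene whose PCC with it is >= threshold,
    so the seed is the highest-SNR member of its group.
    """
    ranked = sorted(expr, key=lambda g: snr(expr[g], labels), reverse=True)
    assigned, representatives = set(), []
    for g in ranked:
        if g in assigned:
            continue
        representatives.append(g)
        assigned.add(g)
        for h in ranked:
            if h not in assigned and pearson(expr[g], expr[h]) >= threshold:
                assigned.add(h)
    return representatives


# Hypothetical toy data: g2 closely tracks g1, g3 is unrelated to the classes.
expr_toy = {
    "g1": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "g2": [1.5, 2.0, 3.5, 7.0, 8.5, 9.0],
    "g3": [5.0, 1.0, 4.0, 2.0, 5.0, 3.0],
}
labels_toy = [0, 0, 0, 1, 1, 1]
print(select_representatives(expr_toy, labels_toy))
```

Visiting genes in descending SNR order keeps the sketch short, at the cost of making the groups depend on which gene happens to seed them; a full clustering step would avoid that dependence.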

Gene expression data classification using genetic algorithm-based feature selection

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES, 2021

In this study, hybrid methods are proposed for feature selection and classification of gene expression datasets. In the proposed genetic algorithm/support vector machine (GA-SVM) and genetic algorithm/k nearest neighbor (GA-KNN) hybrid methods, the genetic algorithm is improved using Pearson's correlation coefficient, Relief-F, or mutual information. The crossover and selection operations of the genetic algorithm are specialized. Eight different gene expression datasets are used for the classification process. The classification performance of the proposed methods is compared with that of the traditional GA-KNN and GA-SVM wrapper methods and of other studies in the literature. The classification results demonstrate that the proposed methods obtain higher accuracy rates than the other methods on all datasets.
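For orientation, a traditional GA-KNN wrapper of the kind used as the baseline above can be sketched in a few lines of pure Python. This is a generic illustration and not the improved method of the study: the correlation, Relief-F, and mutual-information enhancements are not included, and the specific operators (binary tournament selection, crossover by sampling from the union of the parents' genes, single-gene mutation, elitism), the parameter defaults, and the function names are all assumptions.

```python
import random
from collections import Counter


def loocv_knn_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of a k-NN classifier restricted to `genes`."""
    correct = 0
    for i in range(len(X)):
        dists = sorted(
            (sum((X[i][g] - X[j][g]) ** 2 for g in genes), j)
            for j in range(len(X)) if j != i
        )
        votes = Counter(y[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


def ga_select(X, y, subset_size=2, pop_size=10, generations=15,
              mutation_rate=0.2, k=3, seed=0):
    """Search for a gene subset of fixed size that maximises LOOCV k-NN accuracy."""
    rng = random.Random(seed)
    n_genes = len(X[0])

    def fitness(chrom):
        return loocv_knn_accuracy(X, y, chrom, k)

    # Each chromosome is a list of distinct gene indices.
    pop = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)]  # elitism: keep the current best
        while len(next_pop) < pop_size:
            # Binary tournament selection of two parents.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: draw the child's genes from the parents' union.
            child = rng.sample(list(set(p1) | set(p2)), subset_size)
            # Mutation: swap one gene for a gene not already in the child.
            if rng.random() < mutation_rate:
                unused = [g for g in range(n_genes) if g not in child]
                if unused:
                    child[rng.randrange(subset_size)] = rng.choice(unused)
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    return sorted(best), fitness(best)


# Hypothetical toy data: 8 samples x 5 genes; only gene 0 separates the classes.
X_toy = [
    [0.0, 3, 1, 4, 1], [0.1, 5, 9, 2, 6], [0.2, 5, 3, 5, 8], [0.3, 9, 7, 9, 3],
    [5.0, 2, 3, 8, 4], [5.1, 6, 2, 6, 4], [5.2, 3, 3, 8, 3], [5.3, 2, 7, 9, 5],
]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
print(ga_select(X_toy, y_toy))
```

The fitness function is recomputed on every call to keep the sketch compact; with real microarray data the LOOCV evaluation dominates the run time, so caching fitness values per chromosome would be a natural first change.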

A model for gene selection and classification of gene expression data

Gene expression data are expected to be of significant help in the development of efficient cancer diagnosis and classification platforms. One problem arising from these data is how to select a small subset of genes from thousands of genes and a few samples that are inherently noisy. This research aims to select a small subset of informative genes from the gene expression data which will maximize the classification accuracy. A model for gene selection and classification has been developed by using a filter approach, and an improved hybrid of the genetic algorithm and a support vector machine classifier. We show that the classification accuracy of the proposed model is useful for the cancer classification of one widely used gene expression benchmark data set.

Feature (Gene) Selection in Gene Expression-Based Tumor Classification

Molecular Genetics and Metabolism, 2001

There is increasing interest in changing the emphasis of tumor classification from morphologic to molecular. Gene expression profiles may offer more information than morphology and provide an alternative to morphology-based tumor classification systems. Gene selection involves a search for gene subsets that are able to discriminate tumor tissue from normal tissue, and may have either clear biological interpretation or some implication in the molecular mechanism of the tumorigenesis. Gene selection is a fundamental issue in gene expression-based tumor classification. In the formation of a discriminant rule, the number of genes is large relative to the number of tissue samples. Too many genes can harm the performance of the tumor classification system and increase the cost as well. In this report, we discuss criteria and illustrate techniques for reducing the number of genes and selecting an optimal (or near optimal) subset of genes from an initial set of genes for tumor classification. The practical advantages of gene selection over other methods of reducing the dimensionality (e.g., principal components) include its simplicity, future cost savings, and higher likelihood of being adopted in a clinical setting. We analyze the expression profiles of 2000 genes in 22 normal and 40 colon tumor tissues, 5776 sequences in 14 human mammary epithelial cells and 13 breast tumors, and 6817 genes in 47 acute lymphoblastic leukemia and 25 acute myeloid leukemia samples. Through these three examples, we show that using 2 or 3 genes can achieve more than 90% accuracy of classification. This result implies that after initial investigation of tumor classification using microarrays, a small number of selected genes may be used as biomarkers for tumor classification, or may have some relevance in tumor development and serve as a potential drug target. In this report we also show that stepwise Fisher's linear discriminant function is a practicable method for gene expression-based tumor classification.
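For reference, the two-class Fisher criterion behind the discriminant function mentioned above can be written in its standard textbook form. The notation is chosen here and is not taken from the abstract: $\mathbf{m}_1$ and $\mathbf{m}_2$ are the class mean expression vectors over the currently selected genes, $S_W$ is the within-class scatter matrix, and $\mathbf{w}$ is the weight vector. The rule by which the stepwise procedure adds or drops genes is not given in the abstract and is not reproduced.

```latex
S_W = \sum_{c=1}^{2} \sum_{\mathbf{x} \in \text{class } c}
      (\mathbf{x} - \mathbf{m}_c)(\mathbf{x} - \mathbf{m}_c)^{\top},
\qquad
J(\mathbf{w}) = \frac{\left(\mathbf{w}^{\top}(\mathbf{m}_1 - \mathbf{m}_2)\right)^{2}}
                     {\mathbf{w}^{\top} S_W \, \mathbf{w}},
\qquad
\mathbf{w}^{*} \propto S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)
```

With only 2 or 3 genes selected, $S_W$ is a 2×2 or 3×3 matrix and is typically invertible even with a few dozen samples, which is one reason a small gene subset makes the discriminant practicable.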

Selection of relevant genes in cancer diagnosis based on their prediction accuracy

Artificial Intelligence in Medicine, 2007

One of the main problems in cancer diagnosis using DNA microarray data is selecting genes relevant to the pathology by analyzing their expression profiles in tissues under two different phenotypical conditions. The question we pose is the following: how do we measure the relevance of a single gene in a given pathology? Methods: A gene is relevant for a particular disease if we are able to correctly predict the occurrence of the pathology in new patients on the basis of its expression level only. In other words, a gene is informative for the disease if its expression levels are useful for training a classifier able to generalize, that is, able to correctly predict the status of new patients. In this paper we present a selection-bias-free, statistically well-founded method for finding relevant genes on the basis of their classification ability. Results: We applied the method to a colon cancer data set and produced a list of relevant genes, ranked on the basis of their prediction accuracy. We found, out of more than 6500 available genes, 54 overexpressed in normal tissues and 77 overexpressed in tumor tissues having prediction accuracy greater than 70% at the 0.05 p-value level. Conclusions: The relevance of the selected genes was assessed (a) statistically, evaluating the p-value of the estimated prediction accuracy of each gene; (b)