Knowledge-based Analysis of Microarray Gene Expression Data By Using Support Vector Machines

Support Vector Machine Classification of Microarray Gene Expression Data

Proceedings of the …, 2000

We introduce a new method of functionally classifying genes using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). We describe SVMs that use different similarity metrics including a ...

Classification of Microarray Gene Expression Data Using a New Binary Support Vector System

2005 International Conference on Neural Networks and Brain, 2005

Classification of yeast genes based on their expression levels obtained from microarray hybridization experiments is an important and challenging application domain in data mining and knowledge discovery. Over the past decade, neural networks and support vector machines (SVMs) have achieved good results for gene classification. This paper presents a methodology which uses two neural networks to classify unseen genes based on their expression levels. To remove some of the noise and deal with the imbalanced class distribution of the dataset, the data are first pre-processed by data cleaning, data transformation and over-sampling with the SMOTE algorithm. Thereafter, two neural networks with different architectures are trained using Scaled Conjugate Gradient in two different ways: 1) the training-validation-testing approach and 2) 10-fold cross-validation. Experimental results show that this methodology outperforms the previous best-performing SVM for this problem and 8 other classifiers: 3 SVMs, C4.5, Bayesian network, Naive Bayes, K-NN and JRip.
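The over-sampling step mentioned in the abstract can be illustrated with a minimal sketch of the SMOTE idea: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration, not the paper's implementation; the function name `smote`, its parameters and the random choice of the seed sample are assumptions made for the example.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class vectors.

    Simplified sketch: the base sample is drawn at random rather than
    iterating over every minority sample. Requires at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Sort the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, the new samples stay inside the region already covered by the minority class.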

Classification of Microarray Gene Expression Data

Classification of yeast genes based on their expression levels obtained from microarray hybridization experiments is an important and challenging application domain in data mining and knowledge discovery. Over the past decade, neural networks and support vector machines (SVMs) have achieved good results for gene classification. This paper presents a methodology which uses two neural networks to classify unseen genes based on their expression levels. To remove some of the noise and deal with the imbalanced class distribution of the dataset, the data are first pre-processed by data cleaning, data transformation and over-sampling with the SMOTE algorithm. Thereafter, two neural networks with different architectures are trained using Scaled Conjugate Gradient in two different ways: 1) the training-validation-testing approach and 2) 10-fold cross-validation. Experimental results show that this methodology outperforms the previous best-performing SVM for this problem and 8 other classifiers: 3 SVMs, C4.5, Bayesian network, Naive Bayes, K-NN and JRip.

A cascading support vector machines system for gene expression data classification

Intelligent Systems, …, 2004

Abstract   Microarray technology provides the ability of monitoring the gene expression levels of thousands of genes in parallel. Gene expression data classification applies for diseases' diagnosis or prediction. We propose a novel intelligent system for the classification of multiclass gene expression data. It is based on a cascading Support Vector Machines (SVM) scheme and utilizes Welch's t-test for the detection of differentially expressed genes. The system was applied for the discrimination of normal and lung cancer subtypes' specimens. The overall accuracy achieved was 98.5%. The results show that the proposed system can be efficiently used for microarray data analysis.

A HYBRID OF GENETIC ALGORITHM AND SUPPORT VECTOR MACHINE FOR FEATURES SELECTION AND CLASSIFICATION OF GENE EXPRESSION MICROARRAY

Constantly improving gene expression technology offers the ability to measure the expression levels of thousands of genes in parallel. Gene expression data is expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. A key issue that needs to be addressed is the selection of a small number of genes that contribute to a disease from the thousands of genes measured on microarrays, which are inherently noisy. This work deals with finding the small subset of informative genes from gene expression microarray data that maximizes classification accuracy. This paper introduces a new hybrid algorithm combining a Genetic Algorithm and a Support Vector Machine for gene selection and classification. We show that the classification accuracy of the proposed algorithm is superior to that of a number of current state-of-the-art methods on two widely used benchmark datasets. The informative genes from the best subset are validated by comparing them with biological results reported by biologists and computer scientists, in order to explore their biological plausibility.

Gene Expression Data Classification using Support Vector Machine and Mutual Information-based Gene Selection

DNA microarray technology can monitor the expression levels of thousands of genes simultaneously during important biological processes and across collections of related samples. Knowledge gained through microarray data analysis is increasingly important because it is useful for phenotype classification of diseases. This paper presents an effective method for gene classification using a Support Vector Machine (SVM). SVM is a supervised learning algorithm capable of solving complex classification problems. Mutual information (MI) between the genes and the class label is used to identify the informative genes. The selected genes are used to train the SVM classifier, and its performance is evaluated using Leave-One-Out Cross-Validation (LOOCV). The proposed approach is evaluated on two cancer microarray datasets. The simulation study shows that the proposed approach reduces the dimension of the input features by identifying the most informative gene subset and improves classification accuracy compared to other approaches.
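The pipeline described above (score genes by mutual information with the class label, keep the top ones, evaluate by leave-one-out cross-validation) can be sketched as follows. Several choices here are assumptions made to keep the example dependency-free: expression values are binarized at the median before computing MI, and a 1-nearest-neighbour rule stands in for the SVM classifier. All function names are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import median


def mutual_information(x, y):
    """Mutual information in bits between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def select_top_genes(samples, labels, k):
    """Return indices of the k genes with the highest MI with the labels."""
    scores = []
    for g in range(len(samples[0])):
        column = [s[g] for s in samples]
        cut = median(column)
        binary = [int(v > cut) for v in column]  # assumed discretization
        scores.append(mutual_information(binary, labels))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]


def nearest_neighbour(train_x, train_y, x):
    """1-NN stand-in classifier (the paper uses an SVM instead)."""
    dists = [sum((a - b) ** 2 for a, b in zip(p, x)) for p in train_x]
    return train_y[dists.index(min(dists))]


def loocv_accuracy(samples, labels, classify):
    """Hold out each sample once, train on the rest, return the hit rate."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += classify(train_x, train_y, samples[i]) == labels[i]
    return hits / len(samples)


samples = [[0.1, 5.0], [0.2, 1.0], [0.9, 5.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
top = select_top_genes(samples, labels, k=1)
reduced = [[s[g] for g in top] for s in samples]
accuracy = loocv_accuracy(reduced, labels, nearest_neighbour)
```

For brevity the genes are selected once on the full dataset; to avoid an optimistic bias in the accuracy estimate, the selection would normally be repeated inside each leave-one-out fold.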

A comparative study of different machine learning methods on microarray gene expression data

2008

Background: Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However, there is a lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results. Results: In this study, we compared the efficiency of classification methods including SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods. V-fold cross-validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods, including K-means, DBC, and EM clustering, were applied to the datasets and their efficiency was analysed. Further, the efficiency of feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF was compared. In each case these methods were applied to eight different binary (two-class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers. Conclusions: We presented a study in which we compared some of the commonly used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets and compared how they performed in class prediction of test datasets. We reported that the choice of feature selection method, the number of genes in the gene list, and the number of cases (samples) substantially influence classification success. Based on the features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. The results revealed the importance of feature selection in accurately classifying new samples, and showed how an integrated feature selection and classification algorithm performs and that it is capable of identifying significant genes.

A comprehensive survey on computational learning methods for analysis of gene expression data

arXiv (Cornell University), 2022

Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analysis for classification of sample observations, or discovery of feature genes requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied for the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values, and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery along with their evaluation parameters are described in detail. We believe that this detailed review will help the users to select appropriate methods for preprocessing and analysis of their data based on the expected outcome.

Classification Techniques in Gene Expression Microarray Data

IJCSMC, 2018

Cancer is a common and heterogeneous disease affecting people of all ages. Gene expression data can help to better understand cancer and other types of disease. Building a classification system from a gene expression dataset that can properly classify new samples is a challenging task, because gene expression data is usually composed of dozens of samples characterized by thousands of genes. This paper sheds light on different classification methods used in classifying gene expression data, including SVM, NB, C4.5 and some of the state-of-the-art techniques.

Computational intelligence approach for gene expression data mining and classification

The exploration of high-dimensional gene expression microarray data demands powerful analytical tools. Our data mining software, VISual Data Analyzer (VISDA) for cluster discovery, reveals many distinguishing patterns among gene expression profiles. The model-supported hierarchical data exploration tool has two complementary schemes: discriminatory dimensionality reduction for structure-focused data visualization, and cluster decomposition by probabilistic clustering. Reducing dimensionality generates the visualization of the complete data set at the top level. This data set is then partitioned into subclusters that can consequently be visualized at lower levels and, if necessary, partitioned again. These approaches produce different visualizations that are compared against known phenotypes from the microarray experiments. For class prediction on cancers using microarray data, Multilayer Perceptrons (MLPs) are trained and optimized, whose architecture and parameters are regularized and initialized by weighted Fisher Criterion (wFC)-based Discriminatory Component Analysis (DCA). The prediction performance is compared and evaluated via multifold cross-validation.