Data mining techniques for cancer detection using serum proteomic profiling

Comparison of Supervised Classification Methods for Protein Profiling in Cancer Diagnosis

Cancer Informatics, 2007

A key challenge in clinical proteomics of cancer is the identification of biomarkers that could allow detection, diagnosis and prognosis of the disease. Recent advances in mass spectrometry and proteomic instrumentation offer a unique chance to rapidly identify these markers. These advances pose considerable challenges, similar to those created by microarray-based investigation, for the discovery of patterns of markers from high-dimensional data, specific to each pathologic state (e.g. normal vs cancer). We propose a three-step strategy to select important markers from high-dimensional mass spectrometry data using surface enhanced laser desorption/ionization (SELDI) technology. The first two steps are the selection of the most discriminating biomarkers with a construction of different classifiers. Finally, we compare and validate their performance and robustness using different supervised classification methods such as Support Vector Machine, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Neural Networks, Classification Trees and Boosting Trees. We show that the proposed method is suitable for analysing high-throughput proteomics data and that the combination of logistic regression and Linear Discriminant Analysis outperforms the other methods tested.

Comparative Analysis of Data Mining Algorithms for Cancer Gene Expression Data

International Journal of Advanced Computer Science and Applications

Cancer is among the most challenging disorders to diagnose, and experts still struggle to detect it at an early stage. Gene selection is important for identifying the different cancer-causing parameters. Two of the deadliest cancers, colorectal cancer and breast cancer, are found in men and women, respectively. This study aims at predicting cancer at an early stage with the help of cancer bioinformatics. Given the complexity of the disease's metabolism, signaling, and interactions, cancer bioinformatics is one strategy for applying bioinformatics technologies such as data mining to cancer detection. The goal of the proposed study is to compare support vector machine, random forest, decision tree, artificial neural network, and logistic regression for prediction on malignant cancer gene expression data. WEKA is used to analyze the data with these algorithms. The findings show that smart computational data mining techniques could be used to detect cancer recurrence in patients. Finally, the strategies that yielded the best results were identified.

A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns

GENOME INFORMATICS SERIES, 2002

Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on features chosen by these methods, error rates of several classification algorithms were obtained for analysis. Our results demonstrate the importance of feature selection in accurately classifying new samples.

Feature selection and machine learning with mass spectrometry data for distinguishing cancer and non-cancer samples

Statistical Methodology, 2006

This is a comparative study of various clustering and classification algorithms as applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of a feature selection tool is the collection of marginal p-values obtained from t-tests for testing the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of selecting a cutoff in terms of the overall Type 1 error rate control on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by the Random Forest algorithm of Breiman. Using a data set of proteomic analysis of serum from ovarian cancer patients and serum from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm-feature selection tool-cutoff criteria combination on the performance as measured by an appropriate error rate measure.
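The marginal-test feature selection step described above can be illustrated with a minimal sketch. It ranks each m/z position by the magnitude of Welch's t-statistic rather than by p-values with a Type 1 error cutoff, which is a simplification chosen to stay within the Python standard library. The function names and the toy intensities are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples.

    Assumes each sample has at least two values and non-zero variance.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_features(cancer, control):
    """Return feature indices ordered by |t|, most discriminating first.

    cancer, control: lists of spectra, each a list of intensities
    measured at the same m/z positions.
    """
    n_features = len(cancer[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row in cancer]
        b = [row[j] for row in control]
        scores.append((abs(welch_t(a, b)), j))
    return [j for _, j in sorted(scores, reverse=True)]


# Toy data (made up): only the first m/z position separates the groups.
cancer = [[5.1, 1.0, 3.0], [4.9, 1.2, 2.0], [5.3, 0.9, 4.0]]
control = [[1.0, 1.1, 3.1], [1.2, 0.9, 2.2], [0.9, 1.0, 3.9]]
ranking = rank_features(cancer, control)
top_feature = ranking[0]
```

A classifier would then be trained only on the leading entries of `ranking`; how many to keep plays the role of the cutoff studied in the abstract.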

Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques

2020

Early detection of cancer increases the probability of recovery. This paper presents an intelligent decision support system (IDSS) for the early diagnosis of cancer based on gene expression profiles collected using DNA microarrays. Such datasets pose a challenge because of the small number of samples (no more than a few hundred) relative to the large number of genes (on the order of thousands). Therefore, a method of reducing the number of features (genes) that are not relevant to the disease of interest is necessary to avoid overfitting. The proposed methodology uses the information gain (IG) to select the most important features from the input patterns. Then, the selected features (genes) are reduced by applying the grey wolf optimization (GWO) algorithm. Finally, the methodology employs a support vector machine (SVM) classifier for cancer type classification. The proposed methodology was applied to two datasets (Breast and Colon) and was evaluated based on its classification accu...
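As an illustration of the first stage only, the sketch below computes the information gain of a single gene after splitting its expression values at a threshold. The threshold-based split, the function names, and the toy values are assumptions made for this example; the grey wolf optimization and SVM stages are not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Weighted entropy of the non-empty partitions after the split.
    remainder = sum(len(p) / n * entropy(p) for p in (left, right) if p)
    return entropy(labels) - remainder


labels = ["normal", "normal", "tumor", "tumor"]
# Hypothetical expression values for two genes across the four samples.
informative_gene = [0.1, 0.2, 0.8, 0.9]
uninformative_gene = [0.1, 0.9, 0.2, 0.8]

ig_informative = information_gain(informative_gene, labels, threshold=0.5)
ig_uninformative = information_gain(uninformative_gene, labels, threshold=0.5)
```

Genes would be ranked by this score and only the highest-scoring ones passed on to the later stages.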

Cancer diagnosis using proteomic patterns

Expert Review of Molecular Diagnostics, 2003

The advent of proteomics has brought with it the hope of discovering novel biomarkers that can be used to diagnose diseases, predict susceptibility and monitor progression. Much of this effort has focused upon the mass spectral identification of the thousands of proteins that populate complex biosystems such as serum and tissues. A revolutionary approach in proteomic pattern analysis has emerged as an effective method for the early diagnosis of diseases such as ovarian cancer. Proteomic pattern analysis relies on the pattern of proteins observed and does not rely on the identification of a traceable biomarker. Hundreds of clinical samples per day can be analyzed utilizing this technology, which has the potential to be a novel, highly sensitive diagnostic tool for the early detection of cancer.

A Comparison of Methods for Classifying Clinical Samples Based on Proteomics Data: A Case Study for Statistical and Machine Learning Approaches

PLoS ONE, 2011

The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so-called "omics" disciplines of the biological sciences. Such variability is uncovered by implementation of multivariable data mining techniques, which fall into two primary categories: machine learning strategies and statistical approaches. Typically, proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n ≪ p constraint and, as such, require pretreatment to reduce the dimensionality prior to classification. Recently, machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This is a problem that might be solved using a statistical model-based approach in which not only is the importance of each individual protein explicit, but the proteins are also combined into a readily interpretable classification rule without relying on a black box approach. Here we incorporate the statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.

Multiple approaches to data-mining of proteomic data based on statistical and pattern classification methods

PROTEOMICS, 2003

The data-mining challenge presented is composed of two fundamental problems. Problem one is the separation of forty-one subjects into two classifications based on the data produced by the mass spectrometry of protein samples from each subject. Problem two is to find the specific differences between protein expression data of two sets of subjects. In each problem, one group of subjects has a disease, while the other group is nondiseased. Each problem was approached with the intent to introduce a new and potentially useful tool to analyze protein expression from mass spectrometry data. A variety of methodologies, both conventional and nonconventional were used in the analysis of these problems. The results presented show both overlap and discrepancies. What is important is the breadth of the techniques and the future direction this analysis will create.

A hybrid feature subset selection algorithm for analysis of high correlation proteomic data

Journal of Medical Signals & Sensors, 2012

A major problem in the treatment of cancer is the lack of a suitable technique for early diagnosis of the disease. Ovarian cancer is a widespread disease among women, and its early diagnosis can greatly reduce the mortality rate. [1] With current diagnostic tools, the disease is diagnosed at an advanced clinical stage in more than 80% of patients, and the 5-year survival is only 35% after late-stage presentation. [2] It is known that the pathological changes within an organ can be reflected as proteomic patterns in biological fluids such as plasma, serum, and urine. [3] Surface-enhanced laser desorption and ionization time-of-flight mass spectrometry (SELDI-TOF MS) has been used to provide proteomic profiles from biological fluids. [4-6] Mass spectrum data analysis is a fast and rather inexpensive procedure to diagnose the disease, and it may potentially allow cancer screening without any complication at the time of diagnosis. In many screening tasks, the input data are represented by a very large number of features of...

Abstract: Pathological changes within an organ can be reflected as proteomic patterns in biological fluids such as plasma, serum, and urine. Surface-enhanced laser desorption and ionization time-of-flight mass spectrometry (SELDI-TOF MS) has been used to generate proteomic profiles from biological fluids. Mass spectrometry yields redundant, noisy data in which most data points are irrelevant features for differentiating between cancer and normal cases. In this paper, we propose a hybrid feature subset selection algorithm based on maximum discrimination and minimum correlation, coupled with peak scoring criteria. Our algorithm has been applied to two independent SELDI-TOF MS datasets of ovarian cancer obtained from the NCI-FDA clinical proteomics databank. The proposed algorithm was used to extract a set of proteins as potential biomarkers in each dataset. We applied linear discriminant analysis to identify the important biomarkers. The selected biomarkers were able to successfully distinguish the ovarian cancer patients from the noncancer control group with an accuracy of 100%, a sensitivity of 100%, and a specificity of 100% in the two datasets. The hybrid algorithm has the advantage that it increases the reproducibility of the selected biomarkers and is able to find a small set of proteins with high discrimination power.
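The abstract does not give the scoring formulas, so the following is only a rough sketch of the general idea of combining maximum discrimination with minimum correlation. It assumes a Fisher-style score as the discrimination measure and a Pearson correlation threshold (`max_corr`, a hypothetical parameter) as the redundancy filter; the peak scoring criteria are omitted, and the data are invented.

```python
from math import sqrt
from statistics import mean, pvariance


def fisher_score(a, b):
    """Squared mean difference over summed within-class variance."""
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b) + 1e-12)


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(X, y, k, max_corr=0.9):
    """Greedily pick up to k discriminating, mutually uncorrelated features."""
    cols = list(zip(*X))

    def score(j):
        a = [v for v, label in zip(cols[j], y) if label == 1]
        b = [v for v, label in zip(cols[j], y) if label == 0]
        return fisher_score(a, b)

    chosen = []
    # Visit features from most to least discriminating.
    for j in sorted(range(len(cols)), key=score, reverse=True):
        # Keep a feature only if it is not redundant with those already kept.
        if all(abs(pearson(cols[j], cols[c])) < max_corr for c in chosen):
            chosen.append(j)
        if len(chosen) == k:
            break
    return chosen


# Toy data (made up): f1 is a noisy copy of f0, f2 is weakly informative.
f0 = [5.0, 5.2, 4.8, 1.0, 1.2, 0.8]
f1 = [5.1, 5.4, 4.7, 1.1, 1.4, 0.7]
f2 = [2.0, 1.0, 3.0, 1.5, 0.5, 2.5]
X = [list(row) for row in zip(f0, f1, f2)]
y = [1, 1, 1, 0, 0, 0]  # 1 = cancer, 0 = control
selected = select_features(X, y, k=2)
```

In this toy case the redundant copy `f1` is skipped even though it scores higher than `f2`, which is the behavior a minimum-correlation criterion is meant to produce.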

Performance Analysis and Evaluation of Different Data Mining Algorithms used for Cancer Classification

Classification algorithms from data mining have been successfully applied in recent years to predict cancer based on gene expression data. The microarray is a powerful diagnostic tool that can capture gene expression information for all the human genes in a cell at once. Various classification algorithms can be applied to such microarray data to devise methods that predict the occurrence of a tumor. However, the accuracy of such methods differs according to the classification algorithm used. Identifying the best classification algorithm among all those available is a challenging task. In this study, we made a comprehensive comparative analysis of 14 different classification algorithms, and their performance was evaluated using 3 different cancer data sets. The results indicate that none of the classifiers outperformed all others in terms of accuracy across all 3 data sets. Most of the algorithms performed better as the size of the data set increased. We recommend that users not commit to a single classification method but instead evaluate several classification algorithms and select the best-performing one.
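The recommendation to evaluate several classifiers on the same data can be sketched with a small cross-validation loop. The two classifiers used here (nearest centroid and 1-nearest-neighbour) are simple stand-ins chosen so the example needs only the Python standard library; they are not claimed to be among the 14 algorithms in the study, and the data are invented.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_centroid(train_X, train_y, x):
    """Predict the label whose class mean is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda l: dist(centroids[l], x))


def one_nn(train_X, train_y, x):
    """Predict the label of the single closest training sample."""
    i = min(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return train_y[i]


def cv_accuracy(classify, X, y, k=4):
    """k-fold cross-validated accuracy using interleaved folds."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        correct += sum(classify(train_X, train_y, X[i]) == y[i]
                       for i in test_idx)
    return correct / len(X)


# Toy expression profiles (made up): two well-separated classes.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [1.1, 1.2],
     [3.0, 3.1], [2.9, 3.2], [3.2, 2.8], [3.1, 3.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

results = {
    "nearest_centroid": cv_accuracy(nearest_centroid, X, y),
    "one_nn": cv_accuracy(one_nn, X, y),
}
best = max(results, key=results.get)
```

On real data the accuracies would generally differ between classifiers and between data sets, so the comparison would be repeated per data set before choosing a method.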