Machine learning methods for predictive proteomics
Related papers
Robust SVM-Based Biomarker Selection with Noisy Mass Spectrometric Proteomic Data
Lecture Notes in Computer Science, 2006
Computational analysis of mass spectrometric (MS) proteomic data from sera is of potential relevance for diagnosis, prognosis, choice of therapy, and study of disease activity. To this aim, feature selection techniques based on machine learning can be applied for detecting potential biomarkers and biomarker patterns. A key issue concerns the interpretability and robustness of the output results given by such techniques. In this paper we propose a robust method for feature selection with MS proteomic data. The method consists of the sequential application of a filter feature selection algorithm, RELIEF, followed by multiple runs of a wrapper feature selection technique based on support vector machines (SVM), where each run is obtained by changing the class label of one support vector. Frequencies of features selected over the runs are used to identify features which are robust with respect to perturbations of the data. This method is tested on a dataset produced by a specific MS technique, called MALDI-TOF MS. Two classes have been artificially generated by spiking. Moreover, the samples have been collected at different storage durations. Leave-one-out cross validation (LOOCV) applied to the resulting dataset indicates that the proposed feature selection method is capable of identifying highly discriminatory proteomic patterns.
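The filter step named in this abstract can be illustrated with a minimal sketch of the basic RELIEF weighting rule for two classes. It is not the authors' implementation: the function name `relief_weights`, the toy data, and the choice to visit every sample once instead of sampling at random are all assumptions made for illustration. The SVM wrapper runs with flipped support-vector labels are not shown.

```python
def relief_weights(X, y):
    """Basic RELIEF feature weights for a two-class dataset (illustrative sketch).

    X is a list of equal-length feature vectors and y the class labels.
    Every sample is used once as the reference instance, so the result
    is deterministic. Each class needs at least two samples.
    """
    n, p = len(X), len(X[0])
    # Scale differences by each feature's range so the weights are comparable.
    spans = []
    for j in range(p):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def dist(a, b):
        return sum(abs(a[j] - b[j]) / spans[j] for j in range(p))

    w = [0.0] * p
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))      # nearest same-class sample
        miss = min(misses, key=lambda k: dist(X[i], X[k]))   # nearest other-class sample
        for j in range(p):
            # Reward features that differ on the miss and agree on the hit.
            gain = abs(X[i][j] - X[miss][j]) - abs(X[i][j] - X[hit][j])
            w[j] += gain / (spans[j] * n)
    return w


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [1.0, 0.8], [0.9, 0.2], [1.1, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
```

On this toy data the discriminative feature receives a positive weight and the noise feature a negative one, which is the ranking behaviour a filter step relies on before a wrapper method is applied.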
A major goal of clinical proteomics is the identification of disease-specific biomarker panels that can be obtained from mass spectral analyses of body fluids, such as blood serum or cerebrospinal fluid (CSF). In this paper, we present preliminary results from the application of symbolic machine learning and data mining techniques to three high-throughput proteomic datasets: publicly available ovarian cancer data and pre-release datasets for amyotrophic lateral sclerosis (ALS) and trisomy 21 (Down's syndrome).
PLoS ONE, 2011
The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so-called "omics" disciplines of the biological sciences. Such variability is uncovered by implementation of multivariable data mining techniques, which fall into two primary categories: machine learning strategies and statistically based approaches. Typically, proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n ≪ p constraint and, as such, require pretreatment to reduce the dimensionality prior to classification. Recently, machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This is a problem that might be solved using a statistical model-based approach, where not only is the importance of each individual protein explicit, but the proteins are also combined into a readily interpretable classification rule without relying on a black box approach. Here we incorporate the statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.
Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data
BMC bioinformatics, 2006
Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data. We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination or SVM-RFE), paying special attention to the ability to recover the true informative genes/biomarkers and the robustness to outliers in the data. Simulation experiments show that an improvement of 5% to approximately 20% over SVM-RFE can be achieved with regard to these properties. The SVM-based methods are also compared with a conventional univariate method and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-M...
A Workflow for Preprocessing and Proteomic Biomarker Identification on Mass-Spectrometry Data
Biomedical Engineering, 2010
A core technology in proteomics is mass spectroscopy (MS), which permits the measurement of thousands of proteins/peptides simultaneously. Sophisticated data mining methods are necessary to identify highly predictive proteomic biomarker candidates in generated MS spectra that are specific to a certain disease. However, before analysis can be started, the preprocessing of raw mass spectra is an essential task, mainly due to the presence of background signals in the spectra such as electrical and chemical noise. In this work we present a new data mining workflow for the identification of proteomic biomarker candidates using mass spectrometry data. The workflow includes two major steps: 1) the preprocessing of raw spectra, and 2) the identification of highly discriminating candidate masses using a 3-step feature selection approach that combines the advantages of efficient filter and effective wrapper techniques. With the proposed workflow we were able to identify putative candidate biomarkers in a life-threatening human disease using matrix-assisted laser desorption/ionization imaging MS (MALDI-IMS).
Data mining in proteomic mass spectrometry
Clinical Proteomics, 2006
The application of data mining to proteomic data from mass spectrometry has gained much interest in recent years. Advances made in proteomics and mass spectrometry have resulted in a considerable amount of data that cannot be easily visualized or interpreted. Mass spectral proteomic datasets are typically high dimensional but with small sample size. Consequently, advanced artificial intelligence and machine learning algorithms are increasingly being used for knowledge discovery from such datasets. Their overall goal is to extract useful information that leads to the identification of protein biomarker candidates. Such biomarkers could potentially have diagnostic value as tools for early detection, diagnosis, and prognosis of many diseases. The purpose of this review is to focus on the current trends in mining mass spectral proteomic data. Special emphasis is placed on the critical steps involved in the analysis of surface-enhanced laser desorption/ionization mass spectrometry proteomic data. Examples are drawn from previously published studies, and relevant data mining terminology and techniques are explained.
Comparison of Supervised Classification Methods for Protein Profiling in Cancer Diagnosis
Cancer Informatics, 2007
A key challenge in clinical proteomics of cancer is the identification of biomarkers that could allow detection, diagnosis and prognosis of the diseases. Recent advances in mass spectrometry and proteomic instrumentation offer a unique chance to rapidly identify these markers. These advances pose considerable challenges, similar to those created by microarray-based investigation, for the discovery of patterns of markers from high-dimensional data, specific to each pathologic state (e.g. normal vs cancer). We propose a three-step strategy to select important markers from high-dimensional mass spectrometry data using surface enhanced laser desorption/ionization (SELDI) technology. The first two steps are the selection of the most discriminating biomarkers with a construction of different classifiers. Finally, we compare and validate their performance and robustness using different supervised classification methods such as Support Vector Machine, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Neural Networks, Classification Trees and Boosting Trees. We show that the proposed method is suitable for analysing high-throughput proteomics data and that the combination of logistic regression and Linear Discriminant Analysis outperforms the other methods tested.
Mining whole-sample mass spectrometry proteomics data for biomarkers - An overview
Expert Systems With Applications, 2009
Biomarkers are proteins or other components of a clinical sample whose measured intensity alters in response to a biological change such as an infection or disease, and which may therefore be useful for prediction and diagnosis. Proteomics is the science of discovering, identifying and understanding such components using tools such as mass spectrometry. In this paper we aim to provide a concise overview of designing and conducting an MS proteomics study in such a way as to allow statistical analysis that may lead to the discovery of novel markers. We provide a summary of the various stages that make up such an experiment, highlighting the need for experimental goals to be decided upon in advance. We discuss issues in experimental design at the sample collection stage, and good practice for standardising protocols within the proteomics laboratory. We then describe approaches to the data mining stage of the experiment, including the processing steps that transform a raw mass spectrum into a useable form. We propose a permutation-based procedure for determining the significance of reported error rates. Finally, because of its advantage in speed and low cost, we suggest that MS proteomics may be a good candidate for an early primary screening approach to disease diagnosis, identifying areas of risk and making referrals for more specific tests without necessarily making a diagnosis in its own right. Our discussion is illustrated with examples drawn from experiments on bovine blood serum designed to pinpoint novel biomarkers for bovine tuberculosis.
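The idea of judging a reported error rate against permuted labels can be sketched as a generic label-permutation test. This is an assumption about the general technique, not the procedure the paper defines: the nearest-centroid classifier, the function names `loocv_error` and `permutation_p_value`, the number of permutations, and the toy data are all hypothetical.

```python
import random


def loocv_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier (stand-in classifier)."""
    errors = 0
    for i in range(len(X)):
        train = [(x, c) for k, (x, c) in enumerate(zip(X, y)) if k != i]
        centroids = {}
        for label in sorted(set(c for _, c in train)):
            rows = [x for x, c in train if c == label]
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        # Predict the class whose centroid is closest to the held-out sample.
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])),
        )
        errors += pred != y[i]
    return errors / len(X)


def permutation_p_value(X, y, n_perm=200, seed=0):
    """Share of label permutations whose LOOCV error is at least as low as observed."""
    rng = random.Random(seed)
    observed = loocv_error(X, y)
    labels = list(y)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if loocv_error(X, labels) <= observed:
            as_good += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (as_good + 1) / (n_perm + 1)


# Hypothetical toy data: two well-separated groups of four samples each.
X = [[0.0, 0.2], [0.1, 0.1], [0.2, 0.3], [0.1, 0.4],
     [1.0, 0.9], [0.9, 1.1], [1.1, 1.0], [1.0, 1.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
observed_error, p_value = permutation_p_value(X, y)
```

A small p-value suggests the observed error rate would be unlikely if the labels carried no information. With very few samples the number of distinct labelings is limited, which bounds how small the p-value can get.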
Feature selection for classification with proteomic data of mixed quality
2005
In this paper we assess experimentally the performance of two state-of-the-art feature selection methods, called RFE and RELIEF, when used for classifying pattern proteomic samples of mixed quality. The data are generated by spiking human sera to artificially create differentiable sample groups, and by handling samples at different storage temperatures. We consider two types of classifiers: support vector machines (SVM) and k-nearest neighbour (kNN). Results of leave-one-out cross validation (LOOCV) experiments indicate that RELIEF selects more stable feature subsets than RFE over the runs, where the selected features are mainly spiked ones. However, RFE outperforms RELIEF in terms of (average LOOCV) accuracy, both when combined with SVM and with kNN. Perfect LOOCV accuracy is obtained by RFE combined with 1NN. Almost all the samples that are wrongly classified by the algorithms have high storage temperature. The results of experiments on these data indicate that when samples of mixed quality are analyzed computationally, selection of only relevant (spiked) features does not necessarily correspond to the highest classification accuracy.
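The stability of a feature subset "over the runs" can be made concrete by counting how often each feature is selected across leave-one-out runs. The sketch below assumes a simple stand-in filter (absolute gap between class means) in place of RFE or RELIEF; the function names and the toy data are hypothetical.

```python
from collections import Counter


def top_k_by_mean_gap(X, y, k):
    """Indices of the k features with the largest gap between the two class means."""
    classes = sorted(set(y))
    gaps = []
    for j in range(len(X[0])):
        means = []
        for c in classes:
            vals = [x[j] for x, label in zip(X, y) if label == c]
            means.append(sum(vals) / len(vals))
        gaps.append(abs(means[0] - means[1]))
    return sorted(range(len(gaps)), key=lambda j: -gaps[j])[:k]


def selection_frequency(X, y, k):
    """Fraction of leave-one-out runs in which each feature is among the top k."""
    counts = Counter()
    n = len(X)
    for i in range(n):
        # Drop sample i and rerun the selection on the remaining samples.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        counts.update(top_k_by_mean_gap(X_train, y_train, k))
    return {j: counts[j] / n for j in range(len(X[0]))}


# Hypothetical toy data: feature 0 is discriminative, features 1 and 2 are noise.
X = [[0.0, 0.5, 0.2], [0.1, 0.4, 0.6], [0.2, 0.6, 0.4],
     [1.0, 0.5, 0.5], [0.9, 0.6, 0.3], [1.1, 0.4, 0.4]]
y = [0, 0, 0, 1, 1, 1]
frequency = selection_frequency(X, y, k=1)
```

A feature with frequency close to 1 is selected in nearly every run and can be called stable in this sense. Frequencies spread thinly over many features indicate that the selection depends strongly on which sample was left out.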
Biostatistics, 2003
With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of 'signature' protein profiles specific to each pathologic state (e.g. normal vs. cancer) or differential profiles between experimental conditions (e.g. treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data-analytic strategy for discovering protein biomarkers based on such high-dimensional mass spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique.