Biomarker discovery in MALDI-TOF serum protein profiles using discrete wavelet transformation

Support vector classification of proteomic profile spectra based on feature extraction with the bi-orthogonal discrete wavelet transform

Computing and Visualization in Science, 2009

Automatic classification of high-resolution mass spectrometry data has increasing potential to support physicians in diagnosing diseases such as cancer. The proteomic data exhibit variations among different disease states. A precise and reliable classification of mass spectra is essential for successful diagnosis and treatment, so the process used to obtain such classification results is crucial. In this paper such a method is explained and a corresponding semi-automatic parameterization procedure is derived, yielding a simple, straightforward procedure for assigning mass spectra to a particular disease state. The method is based on an initial preprocessing stage applied to the whole set of spectra, followed by the bi-orthogonal discrete wavelet transform (DWT) for feature extraction. The approximation coefficients calculated from the scaling function exhibit a high peak pattern matching property and feature a denoising of the spectrum. The discriminating coefficients, selected by the Kolmogorov-Smirnov test, are finally used as features for training and testing a support vector machine with both a linear and a radial basis kernel. For comparison, the peak areas obtained with the ClinProt-System [33] were analyzed using the same support vector machines. The introduced approach was evaluated on clinical MALDI-MS data sets with two classes each, originating from cancer studies. The cross-validated error rates using the wavelet coefficients were better than those obtained from the peak areas.

Keywords: Bi-orthogonal wavelet transform · Mass spectrometry · Clinical proteomics · Support vector machine

Communicated by G. Wittum.

Footnote: In this contribution the classifications were calculated using LIB-SVM © (Version 2.8, ...
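The feature-extraction idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Haar wavelet in place of the bi-orthogonal wavelet named in the abstract (Haar needs no filter tables), it ranks coefficients by the raw two-sample Kolmogorov-Smirnov statistic rather than a p-value, and it stops before the support vector machine step. The function names and the `levels` and `top_k` parameters are hypothetical.

```python
from bisect import bisect_right


def haar_approximation(signal, levels=1):
    """Return Haar approximation coefficients after `levels` decompositions.

    Assumption: Haar stands in for the bi-orthogonal wavelet of the paper.
    """
    coeffs = list(signal)
    for _ in range(levels):
        if len(coeffs) % 2:
            # Pad odd-length input by repeating the last value.
            coeffs.append(coeffs[-1])
        coeffs = [(coeffs[i] + coeffs[i + 1]) / 2 ** 0.5
                  for i in range(0, len(coeffs), 2)]
    return coeffs


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def select_coefficients(class_a, class_b, levels=1, top_k=2):
    """Rank approximation coefficients by how well they separate two classes."""
    feats_a = [haar_approximation(s, levels) for s in class_a]
    feats_b = [haar_approximation(s, levels) for s in class_b]
    n = len(feats_a[0])
    scores = [ks_statistic([f[j] for f in feats_a], [f[j] for f in feats_b])
              for j in range(n)]
    ranked = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores
```

The selected coefficient indices would then serve as the feature columns passed to a classifier. Each decomposition level halves the number of features, which is where the denoising and dimensionality reduction come from.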

A data-mining approach to biomarker identification from protein profiles using discrete stationary wavelet transform

Journal of Zhejiang University-science B, 2008

Objective: To develop a new bioinformatic tool based on a data-mining approach for extracting the most informative proteins, which could be used to find potential biomarkers for the detection of cancer. Methods: Two independent datasets from serum samples of 253 ovarian cancer and 167 breast cancer patients were used. The samples were examined by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS). The datasets were used to extract the informative proteins using a data-mining method in the discrete stationary wavelet transform domain. As a dimensionality reduction procedure, hard thresholding was applied to reduce the number of wavelet coefficients. In addition, a distance measure was used to select the most discriminative coefficients. To find the potential biomarkers from the selected wavelet coefficients, we applied the inverse discrete stationary wavelet transform combined with a two-sided t-test. Results: From the ovarian cancer dataset, a set of five proteins was detected as potential biomarkers that could be used to distinguish the cancer patients from the healthy cases with accuracy, sensitivity, and specificity of 100%. From the breast cancer dataset, a set of eight proteins was found as potential biomarkers that could separate the healthy cases from the cancer patients with accuracy of 98.26%, sensitivity of 100%, and specificity of 95.6%. Conclusion: The results show that the new bioinformatic tool can be used in combination with high-throughput proteomic data such as SELDI-TOF MS to find potential biomarkers with high discriminative power.
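A minimal sketch of the transform, threshold, and inverse-transform steps described in the Methods above. It assumes a single-level Haar stationary (undecimated) transform with circular boundary handling, which the abstract does not specify; sign and normalization conventions differ between libraries. All function names are hypothetical.

```python
SQRT2 = 2 ** 0.5


def stationary_haar(signal):
    """One level of an undecimated Haar transform (circular boundary).

    Both outputs have the same length as the input, unlike the decimated DWT.
    """
    n = len(signal)
    approx = [(signal[i] + signal[(i + 1) % n]) / SQRT2 for i in range(n)]
    detail = [(signal[i] - signal[(i + 1) % n]) / SQRT2 for i in range(n)]
    return approx, detail


def hard_threshold(coeffs, threshold):
    """Zero coefficients with magnitude below `threshold`; keep the rest as-is."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


def inverse_stationary_haar(approx, detail):
    """Invert `stationary_haar` by averaging the two redundant reconstructions."""
    n = len(approx)
    out = []
    for i in range(n):
        from_this = (approx[i] + detail[i]) / SQRT2          # uses pair (i, i+1)
        from_prev = (approx[i - 1] - detail[i - 1]) / SQRT2  # uses pair (i-1, i)
        out.append((from_this + from_prev) / 2)
    return out
```

Thresholding the detail coefficients before inverting gives a reconstructed spectrum in the original m/z domain, which is what allows selected coefficients to be mapped back to candidate proteins.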

Comparison of Supervised Classification Methods for Protein Profiling in Cancer Diagnosis

Cancer Informatics, 2007

A key challenge in clinical proteomics of cancer is the identification of biomarkers that could allow detection, diagnosis and prognosis of the disease. Recent advances in mass spectrometry and proteomic instrumentation offer a unique chance to rapidly identify these markers. These advances pose considerable challenges, similar to those created by microarray-based investigation, for the discovery of patterns of markers from high-dimensional data that are specific to each pathologic state (e.g. normal vs cancer). We propose a three-step strategy to select important markers from high-dimensional mass spectrometry data obtained with surface-enhanced laser desorption/ionization (SELDI) technology. The first two steps are the selection of the most discriminating biomarkers and the construction of different classifiers. Finally, we compare and validate their performance and robustness using different supervised classification methods such as Support Vector Machine, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Neural Networks, Classification Trees and Boosting Trees. We show that the proposed method is suitable for analysing high-throughput proteomics data and that the combination of logistic regression and Linear Discriminant Analysis outperforms the other methods tested.

Analysis of mass spectral serum profiles for biomarker selection

Bioinformatics/computer Applications in The Biosciences, 2005

Motivation: Mass spectrometric profiles of peptides and proteins obtained by current technologies are characterized by complex spectra, high dimensionality and substantial noise. These characteristics make it challenging to discover proteins and protein profiles that distinguish disease states, e.g. cancer patients from healthy individuals. We present low-level methods for the processing of mass spectral data and a machine learning method that combines support vector machines with particle swarm optimization for biomarker selection. Results: The proposed method identified mass points that achieved high prediction accuracy in distinguishing liver cancer patients from healthy individuals in SELDI-QqTOF profiles of serum.

Biomarker Signature Discovery From Mass Spectrometry Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2000

Mass spectrometry-based high-throughput proteomics is used for protein analysis and clinical diagnosis. Many machine learning methods have been used to construct classifiers based on mass spectrometry data for discriminating between cancer stages. However, the classifiers generated by machine learning techniques such as SVM typically lack biological interpretability. We present an innovative technique for automated discovery of signatures optimized to characterize various cancer stages. We validate our signature discovery algorithm on one new colorectal cancer MALDI-TOF dataset and two well-known ovarian cancer SELDI-TOF datasets. In all these cases, our signature-based classifiers performed better than, or at least as well as, four benchmark machine learning algorithms including SVM and KNN.

Study on Preprocessing and Classifying Mass Spectral Raw Data Concerning Human Normal and Disease Cases

Lecture Notes in Computer Science, 2006

Mass spectrometry is becoming an important tool in biological sciences. Tissue samples or easily obtained biological fluids (serum, plasma, urine) are analysed by a variety of mass spectrometry methods, producing spectra characterized by very high dimensionality and a high level of noise. Here we address a feature extraction method for mass spectra which consists of two main steps. In the first step, an algorithm for low-level preprocessing of mass spectra is applied, including denoising with the Shift-Invariant Discrete Wavelet Transform (SIDWT), smoothing, baseline correction, peak detection and normalization of the resulting peak lists. After this step, we claim to have reduced the dimensionality and redundancy of the initial mass spectra representation while keeping all the meaningful features (potential biomarkers) required for disease-related proteomic patterns to be identified. In the second step, the peak lists are aligned and fed to a Support Vector Machine (SVM), which classifies the mass spectra. This procedure was applied to SELDI-QqTOF spectral data collected from normal and ovarian cancer serum samples. The classification performance was assessed for distinct values of the parameters involved in the feature extraction pipeline. The method described here for low-level preprocessing of mass spectra results in 98.3% sensitivity, 98.3% specificity and an AUC (Area Under Curve) of 0.981 in spectra classification.
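The low-level preprocessing chain listed in the abstract above (smoothing, baseline correction, normalization, peak detection) might look roughly like the sketch below. It makes several simplifying assumptions that are not from the paper: a moving average replaces the SIDWT denoising, the baseline is estimated as a rolling minimum, normalization is by total ion current, and peaks are plain local maxima. Function names and window sizes are hypothetical.

```python
def moving_average(y, half_window=1):
    """Smooth by averaging each point with its neighbours (window shrinks at edges)."""
    out = []
    for i in range(len(y)):
        lo, hi = max(0, i - half_window), min(len(y), i + half_window + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out


def subtract_baseline(y, half_window=3):
    """Estimate the baseline as a rolling minimum and subtract it."""
    out = []
    for i in range(len(y)):
        lo, hi = max(0, i - half_window), min(len(y), i + half_window + 1)
        out.append(y[i] - min(y[lo:hi]))
    return out


def normalize_tic(y):
    """Scale intensities so they sum to 1 (total ion current normalization)."""
    total = sum(y)
    return [v / total for v in y] if total > 0 else list(y)


def detect_peaks(y, min_height=0.0):
    """Return indices of local maxima at or above `min_height`."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > y[i - 1] and y[i] >= y[i + 1] and y[i] >= min_height]


def preprocess(y, smooth_window=1, baseline_window=3, min_height=0.0):
    """Run the full chain and return (normalized spectrum, peak indices)."""
    smoothed = moving_average(y, smooth_window)
    corrected = subtract_baseline(smoothed, baseline_window)
    normalized = normalize_tic(corrected)
    return normalized, detect_peaks(normalized, min_height)
```

The baseline window should be wider than the widest expected peak; otherwise the rolling minimum follows the peak itself and flattens it.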

Protocols for disease classification from mass spectrometry data

PROTEOMICS, 2003

We report our results in classifying protein matrix‐assisted laser desorption/ionization‐time of flight mass spectra obtained from serum samples into diseased and healthy groups. We discuss in detail five of the steps in preprocessing the mass spectral data for biomarker discovery, as well as our criterion for choosing a small set of peaks for classifying the samples. Cross‐validation studies with four selected proteins yielded misclassification rates in the 10–15% range for all the classification methods. Three of these proteins or protein fragments are down‐regulated and one up‐regulated in lung cancer, the disease under consideration in this data set. When cross‐validation studies are performed, care must be taken to ensure that the test set does not influence the choice of the peaks used in the classification. Misclassification rates are lower when both the training and test sets are used to select the peaks used in classification versus when only the training set is used. This ...
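The caution about cross-validation in the abstract above can be made concrete with a short sketch: peak selection is repeated inside every fold, using only that fold's training samples, so the held-out spectrum never influences which peaks are chosen. The scoring rule (difference of class means over pooled spread) and the nearest-centroid classifier are stand-ins chosen for brevity, not the methods used in the paper; labels are assumed to be 0 and 1.

```python
from statistics import mean, pvariance


def rank_features(X, y, k):
    """Return indices of the k features with the largest class separation."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = (pvariance(a) + pvariance(b)) ** 0.5 + 1e-12
        scores.append(abs(mean(a) - mean(b)) / spread)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid is closest on the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean(r[j] for r in rows) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(X, y, k):
    """Leave-one-out error with feature selection redone inside each fold."""
    errors = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Selection sees only the training fold, never the held-out sample.
        features = rank_features(X_train, y_train, k)
        if nearest_centroid_predict(X_train, y_train, X[i], features) != y[i]:
            errors += 1
    return errors / len(X)
```

Calling `rank_features` once on the full data set before the loop would reproduce the optimistic bias the abstract warns about.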

Ovarian Cancer Diagnosis Using Discrete Wavelet Transform Based Feature Extraction from Serum Proteomic Patterns

2006

Pathological changes within an organ might be reflected in proteomic patterns in serum. Mass spectrometry is becoming an important tool for generating these proteomic patterns. Mass spectrometry yields complex functional data for which the features of scientific interest are the peaks. Due to this complexity, a higher-order analysis such as the wavelet transform is needed to uncover the differences in proteomic patterns. We applied a wavelet-based feature extraction method to available data and used a filter approach to feature subset selection in order to identify the appropriate biomarkers from reconstructed mass spectra. Using different classification algorithms, our approach yielded an accuracy of 98%, specificity of 97%, and sensitivity of 100%.
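Accuracy, sensitivity, and specificity, the three figures reported above, all derive from the same confusion-matrix counts. The sketch below shows the standard definitions; the function name and the convention that label 1 means "cancer" are assumptions for illustration.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # fraction of true cases detected
        "specificity": tn / (tn + fp),   # fraction of healthy cases cleared
    }
```

Sensitivity of 100% with specificity below 100%, as in the abstract, means every cancer sample was flagged while some healthy samples were flagged as well.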

Robust SVM-Based Biomarker Selection with Noisy Mass Spectrometric Proteomic Data

Lecture Notes in Computer Science, 2006

Computational analysis of mass spectrometric (MS) proteomic data from sera is of potential relevance for diagnosis, prognosis, choice of therapy, and study of disease activity. To this end, feature selection techniques based on machine learning can be applied for detecting potential biomarkers and biomarker patterns. A key issue concerns the interpretability and robustness of the results produced by such techniques. In this paper we propose a robust method for feature selection with MS proteomic data. The method consists of the sequential application of a filter feature selection algorithm, RELIEF, followed by multiple runs of a wrapper feature selection technique based on support vector machines (SVM), where each run is obtained by changing the class label of one support vector. Frequencies of features selected over the runs are used to identify features which are robust with respect to perturbations of the data. This method is tested on a dataset produced by a specific MS technique, MALDI-TOF MS. Two classes have been artificially generated by spiking. Moreover, the samples have been collected at different storage durations. Leave-one-out cross validation (LOOCV) applied to the resulting dataset indicates that the proposed feature selection method is capable of identifying highly discriminatory proteomic patterns.
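The filter stage named in the abstract above, RELIEF, can be sketched as follows. This is a basic two-class variant that visits every sample once instead of drawing random samples, uses a range-scaled Manhattan distance, and assumes each class has at least two samples. It does not cover the SVM wrapper stage or the label-flipping perturbation; the function name is hypothetical.

```python
def relief_weights(X, y):
    """Return one Relief weight per feature; larger means more discriminative."""
    n, d = len(X), len(X[0])
    # Scale differences by each feature's range so features are comparable.
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0
             for j in range(d)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def nearest(i, same_class):
        candidates = [k for k in range(n)
                      if k != i and (y[k] == y[i]) == same_class]
        return min(candidates,
                   key=lambda k: sum(diff(j, X[i], X[k]) for j in range(d)))

    weights = [0.0] * d
    for i in range(n):
        hit, miss = nearest(i, True), nearest(i, False)
        for j in range(d):
            # Reward features that differ across classes, penalize ones
            # that differ within a class.
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return weights
```

Features with weights at or below zero vary as much within a class as between classes and would be dropped before the wrapper stage.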

Feature selection and machine learning with mass spectrometry data for distinguishing cancer and non-cancer samples

Statistical Methodology, 2006

This is a comparative study of various clustering and classification algorithms as applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of a feature selection tool is the collection of marginal p-values obtained from t-tests for testing the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of selecting a cutoff in terms of the overall Type 1 error rate control on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by the Random Forest algorithm of Breiman. Using a data set of proteomic analysis of serum from ovarian cancer patients and serum from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm-feature selection tool-cutoff criteria combination on the performance as measured by an appropriate error rate measure.
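The marginal t-test selection described above can be sketched as a per-m/z Welch t-statistic followed by a cutoff. Two assumptions are made here that the abstract does not state: the overall Type 1 error rate is controlled with a Bonferroni correction, and the two-sided p-value uses a normal approximation to the t distribution (reasonable only for larger sample sizes). Function names are hypothetical.

```python
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    gap = mean(a) - mean(b)
    if se == 0:
        # Degenerate case: no within-group spread at all.
        return 0.0 if gap == 0 else float("inf") if gap > 0 else float("-inf")
    return gap / se


def approx_p_value(t):
    """Two-sided p-value from a normal approximation (an assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def select_mz(cancer, control, alpha=0.05):
    """Return indices of m/z features significant after Bonferroni correction."""
    m = len(cancer[0])
    cutoff = alpha / m  # Bonferroni: split alpha across all m tests
    selected = []
    for j in range(m):
        t = welch_t([r[j] for r in cancer], [r[j] for r in control])
        if approx_p_value(t) < cutoff:
            selected.append(j)
    return selected
```

Raising or lowering `alpha` changes how many m/z features reach the downstream clustering or classification algorithm, which is the trade-off the study examines.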