Wavelet-based procedures for proteomic mass spectrometry data processing

Mathematical Framework and Wavelets Applications in Proteomics for Cancer Study

Series in Mathematical Biology and Medicine, 2008

Cancer is a proteomic disease. Although MALDI-TOF mass spectrometry allows direct measurement of the protein signature of tissue, blood, or other biological samples and holds tremendous potential for disease diagnosis and treatment, key challenges remain in the processing of proteomic data. In this chapter, we introduce a wavelet-based mathematical framework and computational tools for proteomic data processing, feature selection, and statistical analysis in cancer studies.

WaveletQuant, an improved quantification software based on wavelet signal threshold de-noising for labeled quantitative proteomic analysis

BMC Bioinformatics, 2010

Background: Quantitative proteomics technologies have been developed to comprehensively identify and quantify proteins in two or more complex samples. Quantitative proteomics based on differential stable isotope labeling is one of these quantification technologies. Mass spectrometric data generated for peptide quantification are often noisy, and peak detection and definition require various smoothing filters to remove noise in order to achieve accurate peptide quantification. Many traditional smoothing filters, such as the moving average filter, Savitzky-Golay filter and Gaussian filter, have been used to reduce noise in MS peaks. However, limitations of these filtering approaches often result in inaccurate peptide quantification. Here we present the WaveletQuant program, based on wavelet theory, for better or alternative MS-based proteomic quantification. Results: We developed a novel discrete wavelet transform (DWT) and a 'Spatial Adaptive Algorithm' to remove noise and to identify true peaks. We programmed and compiled WaveletQuant using Visual C++ 2005 Express Edition. We then incorporated the WaveletQuant program in the Trans-Proteomic Pipeline (TPP), a commonly used open source proteomics analysis pipeline. Conclusions: We showed that WaveletQuant was able to quantify more proteins, and to quantify them more accurately, than ASAPRatio, a program that performs quantification in the TPP pipeline, first using known mixed ratios of yeast extracts and then using a data set from ovarian cancer cell lysates. The program and its documentation can be downloaded from our website at http://systemsbiozju.org/data/WaveletQuant.
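To make the idea of wavelet threshold de-noising concrete, here is a minimal sketch in Python. It is not the WaveletQuant implementation: the abstract does not specify the wavelet or the 'Spatial Adaptive Algorithm', so the Haar wavelet, the soft-thresholding rule, and the universal threshold used below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import statistics

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT (input length must be even)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t; small ones become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(signal, levels=3):
    """Multi-level Haar de-noising with an assumed universal threshold."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)  # details[0] holds the finest scale
    # Assumption: noise level estimated from the median absolute
    # finest-scale detail coefficient (a common robust estimate).
    sigma = statistics.median(abs(c) for c in details[0]) / 0.6745
    t = sigma * math.sqrt(2.0 * math.log(len(signal)))
    details = [soft_threshold(d, t) for d in details]
    for d in reversed(details):
        approx = haar_idwt(approx, d)
    return approx
```

Calling `denoise(intensities, levels=3)` on a list of intensities returns a smoothed list of the same length. Unlike a moving average, only the detail coefficients are shrunk, so large, sharp features survive thresholding better than they would under uniform smoothing.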

Support vector classification of proteomic profile spectra based on feature extraction with the bi-orthogonal discrete wavelet transform

Computing and Visualization in Science, 2009

Automatic classification of high-resolution mass spectrometry data has increasing potential to support physicians in the diagnosis of diseases like cancer. The proteomic data exhibit variations among different disease states. A precise and reliable classification of mass spectra is essential for successful diagnosis and treatment. The underlying process to obtain such reliable classification results is a crucial point. In this paper such a method is explained and a corresponding semi-automatic parameterization procedure is derived. Thereby a simple, straightforward classification procedure to assign mass spectra to a particular disease state is derived. The method is based on an initial preprocessing stage of the whole set of spectra followed by the bi-orthogonal discrete wavelet transform (DWT) for feature extraction. The approximation coefficients calculated from the scaling function exhibit a high peak pattern matching property and feature a denoising of the spectrum. The discriminating coefficients, selected by the Kolmogorov-Smirnov test, are finally used as features for training and testing a support vector machine with both a linear and a radial basis kernel. For comparison, the peak areas obtained with the ClinProt-System [33] were analyzed using the same support vector machines. The introduced approach was evaluated on clinical MALDI-MS data sets with two classes each, originating from cancer studies. The cross-validated error rates using the wavelet coefficients were better than those obtained from the peak areas.

Keywords: Bi-orthogonal wavelet transform · Mass spectrometry · Clinical proteomics · Support vector machine

Communicated by G. Wittum.

Footnote: In this contribution the classifications were calculated using LIB-SVM © (Version 2.8,
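The coefficient selection step described in this abstract can be sketched as follows. This is an illustrative example only: it assumes the wavelet approximation coefficients have already been computed for each spectrum, and it ranks them by the two-sample Kolmogorov-Smirnov statistic. The function names and the ranking rule are hypothetical; the paper's actual selection criterion and the SVM training step are not shown.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every value <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def rank_features(class_a, class_b):
    """Rank coefficient indices by decreasing KS statistic.

    class_a and class_b are lists of feature vectors, one vector of
    (assumed precomputed) wavelet coefficients per spectrum.
    """
    n_features = len(class_a[0])
    scores = [
        ks_statistic([v[k] for v in class_a], [v[k] for v in class_b])
        for k in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda k: scores[k], reverse=True)
    return order, scores
```

The highest-ranked indices would then be passed to a classifier. How many coefficients to keep is a tuning choice that this sketch leaves open.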

Mathematical Tools and Statistical Techniques for Proteomic Data Mining

2010

Proteomics is the study of and the search for information about proteins. The development of mass spectrometry (MS) techniques such as matrix-assisted laser desorption ionization (MALDI) time-of-flight (TOF) MS and imaging mass spectrometry (IMS) has greatly sped up proteomics studies. At the same time, MS and IMS applications in medical science give rise to many challenges in mathematics and statistics regarding MS and IMS data analysis, including data preprocessing, classification, and biomarker discovery. In this paper, we review recent developments in mathematical techniques and statistical tools for MS- and IMS-based proteomic data mining, including wavelet-based MS data preprocessing and multivariate statistical methods for IMS data classification and biomarker discovery.

Feature extraction in the analysis of proteomic mass spectra

PROTEOMICS, 2006

Feature extraction or biomarker selection is a critical step in disease diagnosis and knowledge discovery based on protein MS. Many studies have discussed the classification methods applied in proteomics; however, few address feature extraction in detail. In this paper, we developed a systematic approach for the extraction of mass spectrum peak apex and peak area with special emphasis on noise filtration and peak calibration. Application to a head and neck cancer data set generated at the Eastern Virginia Medical School [

Mass spectrometry data processing using zero-crossing lines in multi-scale of Gaussian derivative wavelet

Bioinformatics, 2010

Motivation: Peaks are the key information in mass spectrometry (MS), which has been increasingly used to discover disease-related proteomic patterns. Peak detection is an essential step for MS-based proteomic data analysis. Recently, several peak detection algorithms have been proposed. However, these algorithms have three major deficiencies: (i) when the noise is removed, true signal may be removed with it; (ii) the baseline removal step may get rid of true peaks and create new false peaks; (iii) in the peak quantification step, a threshold of signal-to-noise ratio (SNR) is usually used to remove false peaks; however, noise estimations in SNR calculation are often inaccurate in either the time or the wavelet domain. In this article, we propose new algorithms to solve these problems. First, we use a bivariate shrinkage estimator in the stationary wavelet domain to avoid removing true peaks in the denoising step. Second, without baseline removal, zero-crossing lines in multi-scale of deri...
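The zero-crossing idea can be illustrated with a small single-scale sketch. The spectrum is convolved with the first derivative of a Gaussian, and a peak apex is reported where the response changes sign from positive to negative. This is an assumption-laden simplification rather than the authors' algorithm: it uses one scale instead of zero-crossing lines across several scales, and the `min_height` filter and all function names are hypothetical.

```python
import math

def gaussian_derivative_kernel(sigma):
    """Samples of the first derivative of a Gaussian with scale sigma."""
    radius = int(math.ceil(4 * sigma))
    return [-t / sigma ** 2 * math.exp(-t * t / (2 * sigma ** 2))
            for t in range(-radius, radius + 1)]

def convolve_same(signal, kernel):
    """Discrete convolution with edge replication; output matches input length."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i - (k - radius)       # convolution index: x[i - t]
            j = min(max(j, 0), n - 1)  # clamp to the valid range
            acc += w * signal[j]
        out.append(acc)
    return out

def find_peaks(signal, sigma=2.0, min_height=0.0):
    """Return indices where the smoothed derivative crosses from + to -."""
    resp = convolve_same(signal, gaussian_derivative_kernel(sigma))
    peaks = []
    for i in range(len(resp) - 1):
        if resp[i] > 0.0 and resp[i + 1] <= 0.0:
            # Choose the sample closer to the actual zero of the response.
            apex = i if abs(resp[i]) <= abs(resp[i + 1]) else i + 1
            if signal[apex] >= min_height:
                peaks.append(apex)
    return peaks
```

Because the derivative-of-Gaussian kernel sums to zero, a constant offset in the spectrum does not move the zero crossings, which is one reason a method of this kind can work without a separate baseline removal step. A multi-scale version would repeat the search for several values of `sigma` and keep crossings that persist across scales.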

Biomarker discovery in MALDI-TOF serum protein profiles using discrete wavelet transformation

Bioinformatics, 2009

Automatic classification of high-resolution mass spectrometry proteomic data has increasing potential in the early diagnosis of cancer. We propose a new procedure of biomarker discovery in serum protein profiles based on: (i) discrete wavelet transformation of the spectra; (ii) selection of discriminative wavelet coefficients by a statistical test and (iii) building and evaluating a support vector machine classifier by double cross-validation with attention to the generalizability of the results. In addition to the evaluation results (total recognition rate, sensitivity and specificity), the procedure provides the biomarker patterns, i.e. the parts of spectra which discriminate cancer and control individuals. The evaluation was performed on matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) serum protein profiles of 66 colorectal cancer patients and 50 controls. Results: Our procedure provided a high recognition rate (97.3%), sensitivity (98.4%) and specificity (95.8%). The extracted biomarker patterns mostly represent the peaks expressing mean differences between the cancer and control spectra. However, we showed that the discriminative power of a peak is not simply expressed by its mean height and cannot be derived by comparison of the mean spectra. The obtained classifiers have high generalization power as measured by the number of support vectors. This prevents overfitting and contributes to the reproducibility of the results, which is required to find biomarkers differentiating cancer patients from healthy individuals. Availability: The data and scripts used in this study are available at
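The evaluation scheme mentioned above, double cross-validation, can be sketched as a nested splitting plan in which the inner folds used for model selection never see the outer test samples. The sketch below is illustrative only: the fold counts, the lack of stratification, the function names, and the reading of "total recognition rate" as overall accuracy are assumptions, and no classifier is trained.

```python
def k_fold_indices(indices, k):
    """Split a list of sample indices into k (train, test) pairs."""
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, test))
    return pairs

def double_cross_validation(n_samples, outer_k, inner_k):
    """Build a nested plan: inner folds are drawn from outer training data only."""
    plan = []
    for outer_train, outer_test in k_fold_indices(list(range(n_samples)), outer_k):
        inner_splits = k_fold_indices(outer_train, inner_k)
        plan.append((outer_train, outer_test, inner_splits))
    return plan

def recognition_metrics(true_labels, predicted_labels, positive=1):
    """Accuracy, sensitivity and specificity (both classes must be present)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "recognition_rate": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

For the 116 profiles in the study (66 patients and 50 controls), `double_cross_validation(116, outer_k=5, inner_k=4)` yields five outer folds, each with four inner splits for choosing classifier parameters. The outer test predictions are then pooled and scored with `recognition_metrics`.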

A data-mining approach to biomarker identification from protein profiles using discrete stationary wavelet transform

Journal of Zhejiang University-SCIENCE B, 2008

Objective: To develop a new bioinformatic tool based on a data-mining approach for extracting the most informative proteins, which could be used to find potential biomarkers for the detection of cancer. Methods: Two independent datasets from serum samples of 253 ovarian cancer and 167 breast cancer patients were used. The samples were examined by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS). The datasets were used to extract the informative proteins using a data-mining method in the discrete stationary wavelet transform domain. As a dimensionality reduction procedure, the hard thresholding method was applied to reduce the number of wavelet coefficients. In addition, a distance measure was used to select the most discriminative coefficients. To find the potential biomarkers using the selected wavelet coefficients, we applied the inverse discrete stationary wavelet transform combined with a two-sided t-test. Results: From the ovarian cancer dataset, a set of five proteins was detected as potential biomarkers that could be used to distinguish the cancer patients from the healthy cases with accuracy, sensitivity, and specificity of 100%. From the breast cancer dataset, a set of eight proteins was found as potential biomarkers that could separate the healthy cases from the cancer patients with accuracy of 98.26%, sensitivity of 100%, and specificity of 95.6%. Conclusion: The results show that the new bioinformatic tool can be used in combination with high-throughput proteomic data such as SELDI-TOF MS to find potential biomarkers with high discriminative power.
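The processing chain in the Methods can be sketched in simplified form: a stationary (undecimated) wavelet transform, hard thresholding of the coefficients, the inverse transform, and a t statistic comparing two groups. The sketch rests on several assumptions that the abstract does not confirm: a single-level Haar wavelet with circular boundaries, thresholding of detail coefficients only, and Welch's unequal-variance form of the t statistic. The p-value computation and the distance-based coefficient selection are omitted, and all function names are hypothetical.

```python
import math
import statistics

def haar_swt(signal):
    """Single-level undecimated Haar transform with circular boundaries."""
    n = len(signal)
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[(i + 1) % n]) / s for i in range(n)]
    detail = [(signal[i] - signal[(i + 1) % n]) / s for i in range(n)]
    return approx, detail

def haar_iswt(approx, detail):
    """Inverse of haar_swt: average the two redundant estimates of each sample."""
    s = math.sqrt(2.0)
    out = []
    for i in range(len(approx)):
        from_here = (approx[i] + detail[i]) / s          # from pair (i, i+1)
        from_prev = (approx[i - 1] - detail[i - 1]) / s  # from pair (i-1, i)
        out.append(0.5 * (from_here + from_prev))
    return out

def hard_threshold(coeffs, t):
    """Keep coefficients with magnitude above t unchanged; zero the rest."""
    return [c if abs(c) > t else 0.0 for c in coeffs]

def welch_t(group_a, group_b):
    """Welch t statistic for two groups (sample variances, unequal sizes)."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

def reduce_spectrum(signal, t):
    """Transform, hard-threshold the detail coefficients, and reconstruct."""
    approx, detail = haar_swt(signal)
    return haar_iswt(approx, hard_threshold(detail, t))
```

After each spectrum is reduced with `reduce_spectrum`, `welch_t` can be evaluated at each m/z position across the cancer and control groups, and positions with large absolute values flagged as candidates. Unlike soft thresholding, hard thresholding leaves the surviving coefficients unshrunk, so the amplitudes of retained features are preserved.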