Prediction of high-responding peptides for targeted protein assays by mass spectrometry
Vincent A Fusaro et al. Nat Biotechnol. 2009 Feb.
Abstract
Protein biomarker discovery produces lengthy lists of candidates that must subsequently be verified in blood or other accessible biofluids. Use of targeted mass spectrometry (MS) to verify disease- or therapy-related changes in protein levels requires the selection of peptides that are quantifiable surrogates for proteins of interest. Peptides that produce the highest ion-current response (high-responding peptides) are likely to provide the best detection sensitivity. Identification of the most effective signature peptides, particularly in the absence of experimental data, remains a major resource constraint in developing targeted MS-based assays. Here we describe a computational method that uses protein physicochemical properties to select high-responding peptides and demonstrate its utility in identifying signature peptides in plasma, a complex proteome with a wide range of protein concentrations. Our method, which employs a Random Forest classifier, facilitates the development of targeted MS-based assays for biomarker verification or any application where protein levels need to be measured.
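The predictor described above scores peptides by their physicochemical properties. As a minimal, hypothetical sketch (not the authors' feature set, which is far larger; see Fig. 4), the kind of per-peptide properties such a classifier might consume can be computed directly from the sequence — here length, mean Kyte–Doolittle hydrophobicity (GRAVY), and counts of basic and acidic residues:

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def peptide_features(pep):
    """Illustrative physicochemical features for one tryptic peptide.
    This is a toy subset, not the ESP predictor's actual feature vector."""
    n = len(pep)
    return {
        'length': n,
        'gravy': sum(KYTE_DOOLITTLE[a] for a in pep) / n,  # mean hydropathy
        'basic_residues': sum(pep.count(a) for a in 'KRH'),
        'acidic_residues': sum(pep.count(a) for a in 'DE'),
    }
```

A feature table of this shape, one row per peptide, is what a Random Forest classifier would be trained on to output a probability of high response.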
Figures
Figure 1
ESP application and model development overview. (a) A typical proteomic workflow to select signature peptides for targeted protein analysis using MRM. Candidate proteins are experimentally analyzed, and five signature peptides per protein are selected based primarily on high peptide response and sequence composition, among other factors. After optimization, the remaining peptides are referred to as validated MRM peptides. (b) We digest each candidate protein in silico (no missed cleavages, 600–2,800 Da) to produce a set of predicted tryptic peptides. Peptide sequences are input into the ESP predictor, and we select the five peptides with the highest probability of response for each protein. To validate the ESP predictions, we compare the top five predicted peptides to the experimentally determined five highest-responding peptides from a, denoted by asterisks (3 out of 5, in this example). (c) We developed the ESP predictor using peptides from an experimental analysis of a yeast lysate. We trained the ESP predictor using Random Forest on 90% of the peptides and held out 10% to test the model, referred to as Yeast test. We split the data at the protein level to keep the training and test data completely separate and to avoid any bias from training and testing the model on peptides from the same protein.
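The in silico digest in (b) can be sketched as follows, assuming standard trypsin rules (cleave C-terminal to K or R, but not before P) and the stated 600–2,800 Da mass window. The residue masses and function names here are illustrative, not taken from the authors' implementation:

```python
# Average residue masses (Da) for the 20 standard amino acids
RESIDUE_MASS = {
    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
    'T': 101.10, 'C': 103.14, 'L': 113.16, 'I': 113.16, 'N': 114.10,
    'D': 115.09, 'Q': 128.13, 'K': 128.17, 'E': 129.12, 'M': 131.19,
    'H': 137.14, 'F': 147.18, 'R': 156.19, 'Y': 163.18, 'W': 186.21,
}
WATER = 18.02  # H2O added to each free peptide

def tryptic_digest(protein, min_mass=600.0, max_mass=2800.0):
    """Fully digest a protein in silico (no missed cleavages) and keep
    only peptides inside the [min_mass, max_mass] window."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        # trypsin cleaves after K or R, except when the next residue is P
        if aa in 'KR' and not (i + 1 < len(protein) and protein[i + 1] == 'P'):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal fragment
    return [p for p in peptides
            if min_mass <= sum(RESIDUE_MASS[a] for a in p) + WATER <= max_mass]
```

For example, `tryptic_digest("MKTAYIAKQR")` yields only `["TAYIAK"]`, because the short fragments `MK` and `QR` fall below the 600 Da floor.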
Figure 2
ESP predictor validation and method comparison. ESP predictions outperform existing computational models and are statistically significant for all validation data sets based on a random permutation test. We plotted the mean number of cumulative correctly predicted peptides (Ts) for random combinations of 1–20 proteins. We calculated the 95% confidence interval of the mean, but the error bars were too small to display. The null distribution for _P_-value calculation is derived using a predictor that randomly selects the top five high-responding peptides for a protein (Supplementary Fig. 2). (a) ESP predictor performance on multiple validation sets, with the performance of a random predictor shown in gray. Each validation set produces its own set of random distributions, depending on the number of peptides per protein. We grouped all random distributions into a single shaded area. (b) ESP predictions on plasma validation sets. The samples represent undepleted plasma, plasma with the 14 most-abundant proteins depleted, and depleted plasma then fractionated using SCX (also referred to as MudPIT). Random selection of the top five peptides produced the gray area. (c) Comparison between the ESP predictor, proteotypic predictors and random predictions on a HeLa GeLC-MS cell lysate. (d) Comparison between the ESP predictor, proteotypic predictors and random predictions on a depleted and fractionated plasma sample. This is the sample type most commonly used for MRM biomarker verification. See Tables 1 and 2 for more details. STEPP, SVM technique for evaluating proteotypic peptides.
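The evaluation described here can be sketched as follows, under stated assumptions: the per-protein score counts how many of the five predicted peptides appear among the five highest-responding observed peptides, and the null distribution comes from drawing five peptides at random. Names and data shapes are illustrative, not the authors' code:

```python
import random

def top5_overlap(predicted_scores, observed_responses):
    """Count predicted top-5 peptides that are also observed top-5.
    Both arguments are dicts mapping peptide sequence -> value."""
    pred_top = sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:5]
    obs_top = sorted(observed_responses, key=observed_responses.get, reverse=True)[:5]
    return len(set(pred_top) & set(obs_top))

def permutation_pvalue(observed_overlap, peptides, observed_responses,
                       n_perm=10_000, seed=0):
    """Fraction of random top-5 picks that match the observed top-5
    at least as well as the predictor did."""
    rng = random.Random(seed)
    obs_top = set(sorted(observed_responses,
                         key=observed_responses.get, reverse=True)[:5])
    hits = 0
    for _ in range(n_perm):
        if len(set(rng.sample(peptides, 5)) & obs_top) >= observed_overlap:
            hits += 1
    return hits / n_perm
```

Summing `top5_overlap` over a random combination of proteins gives a cumulative score of the same kind as the Ts plotted in this figure.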
Figure 3
ESP predictions translate into experimentally validated MRM peptides. For each protein, we performed an in silico digest (600–2,800 Da) and ensured that the top five peptides predicted by the ESP predictor were unique in the Swiss-Prot human database. Although additional filtering criteria could easily be applied after analysis with the ESP predictor, we opted for no filtering (except top-five uniqueness) to demonstrate the simplicity of using the ESP predictor to select candidate signature peptides to configure an MRM-MS assay. For all plots, peptides are sorted by the ESP predicted probability of response (_y_-axis). The actual rank order of measured peptide response is shown in Supplementary Table 2. (a) The ESP predictor correctly selected all three validated MRM peptides (filled black circles) out of the five predicted candidate signature peptides for troponin I. (b) The ESP predictor correctly selected two validated MRM peptides out of the five predicted candidate signature peptides for IL-33. In a and b, two representative proteins not found in the GPM database are shown. (c) GPM correctly selected all four of the validated MRM peptides among the top five. Three peptides are common between the ESP predictor and GPM. (d) Only two peptides were suggested by GPM, of which only one was a validated MRM peptide. In c and d, two representative proteins are shown where we overlaid the MRM peptides suggested by GPM (open red circles). Example d highlights the limitations of relying solely on database predictions because two validated MRM peptides would have been missed.
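The uniqueness filter mentioned above — keeping only peptides that map to a single protein in the database — reduces to a substring search over the proteome. A minimal sketch, with a toy dict standing in for Swiss-Prot:

```python
def is_unique(peptide, proteome):
    """True if the peptide sequence occurs in exactly one protein.
    proteome: dict of protein id -> sequence (stand-in for Swiss-Prot)."""
    return sum(peptide in seq for seq in proteome.values()) == 1
```

A production filter would also have to respect the digestion rules (the match must be a valid tryptic peptide of the other protein, not just a substring), which this sketch deliberately omits.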
Figure 4
Analysis of important physicochemical properties in predicting high-responding peptides. (a) The yeast training set was randomly split into training (80%) and test (20%) sets to produce 100 different Random Forest models (1,000 trees) at each step of halving the number of important properties. The box plot shows the test set error distribution. (b) The stability of property importance improves with an increased number of trees in the Random Forest model. For a given number of trees, five models were built and the pairwise Spearman rank correlation coefficient of determination (_R_2) was calculated for the ranked list of important features (error bars, ± 1 s.d.). (c) The top 35 features from the ESP predictor using 50,000 trees are listed.
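The stability measure in (b) — the _R_2 of the Spearman rank correlation between two orderings of the same feature set — can be computed directly from the two ranked lists. A minimal sketch (assuming no tied ranks, so the classic sum-of-squared-rank-differences formula applies; names are illustrative):

```python
def spearman_r2(ranking_a, ranking_b):
    """R^2 of the Spearman rank correlation between two orderings of the
    same features (each a list of feature names, most important first).
    Assumes no ties, so rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(ranking_a)
    rank_b = {feat: i for i, feat in enumerate(ranking_b)}
    d2 = sum((i - rank_b[feat]) ** 2 for i, feat in enumerate(ranking_a))
    rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
    return rho ** 2
```

Averaging this quantity over all pairs of the five models built at a given tree count gives a stability curve of the kind plotted in (b): identical rankings give 1.0, and the value drops as the importance orderings diverge.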
Grants and funding
- R01 GM074024/GM/NIGMS NIH HHS/United States
- R01 CA126219/CA/NCI NIH HHS/United States
- U01 HL081341/HL/NHLBI NIH HHS/United States
- U24 CA126476/CA/NCI NIH HHS/United States