A five-level classification system for proteoform identifications

ABRF Proteome Informatics Research Group (iPRG) 2016 Study: Inferring Proteoforms from Bottom-up Proteomics Data

Journal of biomolecular techniques : JBT, 2018

This report presents the results from the 2016 Association of Biomolecular Resource Facilities Proteome Informatics Research Group (iPRG) study on proteoform inference and false discovery rate (FDR) estimation from bottom-up proteomics data. For this study, 3 replicate Q Exactive Orbitrap liquid chromatography-tandem mass spectrometry datasets were generated from each of 4 samples spiked with different equimolar mixtures of small recombinant proteins selected to mimic pairs of homologous proteins. Participants were given raw data and a sequence file and asked to identify the proteins and provide estimates on the FDR at the proteoform level. As part of this study, we tested a new submission system with a format validator running on a virtual private server (VPS) and allowed methods to be provided as executable R Markdown or IPython Notebooks. The task was perceived as difficult, and only eight unique submissions were received, although those who participated did well with no one meth...

Differential Proteomics via Probabilistic Peptide Identification Scores

Analytical Chemistry, 2005

Relative quantitation is key to enable differential proteomics and hence answer biological questions by comparing samples. Classical approaches involve stable isotope labeling with/without spiked standards. Although stable isotopes may lead to precise results, their application is not straightforward. In Proteomics, 2004, 4, 2333-2351, we proposed an approach where we summed peptide identification scores to derive a semiquantitative abundance indicator. In this study, we combine such an indicator with a statistical test to detect differentially expressed proteins. We demonstrate the effectiveness of this method by using mixtures of purified proteins and human plasma spiked with proteins at low-nanomolar concentrations. The impact of the number of repeated experiments is discussed, and we show that the statistical test we use performs well with two to three repetitions, whereas a classical t-test would require at least four repetitions to achieve the same performance. Typically, 2.5-5-fold changes are detected with 90-95% confidence in human plasma. The method is finally characterized by deriving estimates of its false positive and negative rates. This new characterization is valid for a wider class of methods such as spectrum sampling (Liu, H.; Sadygov, R. G.; Yates, J. R. III. Anal. Chem. 2004, 76, 4193-4201).
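The idea of summing peptide identification scores into a semiquantitative abundance indicator can be illustrated with a minimal Python sketch. The function names, the `(protein, score)` input layout, and the plain ratio used to compare two samples are assumptions made for illustration only; the statistical test described in the abstract is not reproduced here.

```python
from collections import defaultdict


def summed_score_abundance(peptide_hits):
    """Sum peptide identification scores per protein.

    peptide_hits: iterable of (protein_id, score) pairs, one per
    identified peptide (hypothetical input layout).
    """
    totals = defaultdict(float)
    for protein_id, score in peptide_hits:
        totals[protein_id] += score
    return dict(totals)


def fold_change(indicator_a, indicator_b, protein_id):
    """Ratio of summed scores between two samples for one protein.

    Returns None when the protein is missing or has a zero indicator
    in sample B, since no ratio can be formed in that case.
    """
    a = indicator_a.get(protein_id, 0.0)
    b = indicator_b.get(protein_id, 0.0)
    if b == 0.0:
        return None
    return a / b


# Toy data: two samples with made-up scores.
sample_a = summed_score_abundance([("P1", 10.0), ("P1", 5.0), ("P2", 3.0)])
sample_b = summed_score_abundance([("P1", 5.0), ("P2", 3.0)])
ratio_p1 = fold_change(sample_a, sample_b, "P1")
```

In practice such a ratio would be evaluated across repeated experiments with a statistical test rather than read off from a single pair of runs.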

Quantifying homologous proteins and proteoforms

Many proteoforms – arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes – have distinct biological functions, such as histone PTM proteoforms. However, their quantification by existing bottom-up mass-spectrometry (MS) methods is undermined by peptide-specific biases. To avoid these biases, we developed and implemented a first-principles model (HIquant) for quantifying proteoform stoichiometries. We characterized when MS data allow inferring proteoform stoichiometries by HIquant, derived an algorithm for optimal inference, and demonstrated experimentally high accuracy in quantifying fractional PTM occupancy without using external standards, even in the challenging case of the histone modification code. A HIquant server is implemented at: https://web.northeastern.edu/slavov/2014_HIquant/

The Proteomics Identifications database: 2010 update

Nucleic Acids Research, 2010

The Proteomics Identifications database (PRIDE, http://www.ebi.ac.uk/pride) at the European Bioinformatics Institute has become one of the main repositories of mass spectrometry-derived proteomics data. For the last 2 years, PRIDE data holdings have grown substantially, comprising 60 different species, more than 2.5 million protein identifications, 11.5 million peptides and over 50 million spectra by September 2009. We here describe several new and improved features in PRIDE, including the revised submission process, which now includes direct submission of fragment ion annotations. Correspondingly, it is now possible to visualize spectrum fragmentation annotations on tandem mass spectra, a key feature for compliance with journal data submission requirements. We also describe recent developments in the PRIDE BioMart interface, which now allows integrative queries that can join PRIDE data to a growing number of biological resources such as Reactome, Ensembl, InterPro and UniProt. This ability to perform extremely powerful across-domain queries will certainly be a cornerstone of future bioinformatics analyses. Finally, we highlight the importance of data sharing in the proteomics field, and the corresponding integration of PRIDE with other databases in the ProteomExchange consortium.

Proteome-Wide Identification of Proteins and Their Modifications with Decreased Ambiguities and Improved False Discovery Rates Using Unique Sequence Tags

Analytical Chemistry, 2008

Identifying proteins correctly and with known levels of confidence remains a significant challenge for proteomics. Random or decoy peptide databases are increasingly being used to estimate the false discovery rate (FDR), e.g., from liquid chromatography-tandem mass spectrometry (LC-MS/MS) analyses of tryptic digests. We show that this approach can significantly underestimate the FDR, and describe an approach for more confident protein identifications that uses unique partial sequences derived from a combination of database searching and amino acid residue sequencing using high accuracy MS/MS data. Applied to a Saccharomyces cerevisiae tryptic digest, the approach provided 3,132 confident peptide identifications (∼5% modified in some fashion), covering 575 proteins with an estimated zero FDR. The conventional approach provided 3,359 peptide identifications and 656 proteins with 0.3% FDR based upon a decoy database analysis. However, the present approach revealed ∼5% of the 3,359 identifications to be incorrect, and many more as potentially ambiguous (e.g., due to not considering certain amino acid substitutions and modifications). In addition, 677 peptides and 39 proteins were identified that had been missed by conventional analysis, including non-tryptic peptides, peptides with various expected/unexpected chemical modifications, known/unknown posttranslational modifications, single nucleotide polymorphisms or gene encoding errors, and multiple modifications of individual peptides.

Keywords: precise proteomics; high-precision tandem mass spectrometry; unique sequences; post-translational modifications and mutants; false discovery rate and identification ambiguity

A currently popular strategy utilizes a comparably sized "decoy" set of "false" peptides to estimate the level of incorrect identifications for a particular set of filtering criteria.
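The decoy-based estimate mentioned above is commonly computed as the number of decoy matches passing a score filter divided by the number of target matches passing the same filter. The following Python sketch shows that generic calculation under the assumption of a decoy set comparable in size to the target set; the function name and the `(score, is_decoy)` input layout are hypothetical, and this is not the unique-sequence-tag method of the paper, which reports that the decoy estimate can understate the true error rate.

```python
def decoy_fdr(matches, threshold):
    """Estimate FDR as (#decoy hits) / (#target hits) at a score threshold.

    matches: iterable of (score, is_decoy) pairs (hypothetical layout).
    Assumes the decoy set is comparable in size to the target set.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in matches:
        if score < threshold:
            continue  # filtered out by the score cut
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if targets == 0:
        return 0.0  # nothing accepted, so no estimate beyond zero
    return decoys / targets


# Toy data with made-up scores.
toy_matches = [(50, False), (45, False), (40, True), (30, False), (20, True)]
fdr_loose = decoy_fdr(toy_matches, 35)   # 1 decoy, 2 targets pass
fdr_strict = decoy_fdr(toy_matches, 45)  # 0 decoys, 2 targets pass
```

Raising the threshold lowers the estimated FDR at the cost of accepting fewer identifications, which is the trade-off that any set of filtering criteria has to balance.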