Prediction of Missed Cleavage Sites in Tryptic Peptides Aids Protein Identification in Proteomics
Related papers
Fast and accurate identification of semi-tryptic peptides in shotgun proteomics
Bioinformatics, 2008
Motivation: One of the major problems in shotgun proteomics is the low peptide coverage when analyzing complex protein samples. Identifying more peptides, e.g. non-tryptic peptides, may increase the peptide coverage and improve protein identification and/or quantification that are based on the peptide identification results. Searching for all potential non-tryptic peptides is, however, time consuming for shotgun proteomics data from complex samples, and poses a challenge for routine data analysis. Results: We hypothesize that non-tryptic peptides are mainly created from the truncation of regular tryptic peptides before separation. We introduce the notion of truncatability of a tryptic peptide, i.e. the probability that the peptide is identified in its truncated form, and build a predictor to estimate a peptide's truncatability from its sequence. We show that our predictions achieve useful accuracy, with the area under the ROC curve ranging from 76% to 87%, and can be used to filter the sequence database for identifying truncated peptides. After filtering, only a limited number of tryptic peptides with the highest truncatability are retained for non-tryptic peptide searching. By applying this method to the identification of semi-tryptic peptides, we show that a significant number of such peptides can be identified within a searching time comparable to that of tryptic peptide identification.
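To make the idea of a truncated (semi-tryptic) peptide concrete, the following is a minimal sketch that enumerates the truncated forms of a single tryptic peptide while keeping one tryptic terminus intact. The function name `semi_tryptic_forms` and the `min_len` cutoff are illustrative assumptions; the sketch does not reproduce the truncatability predictor described in the abstract.

```python
def semi_tryptic_forms(peptide, min_len=6):
    """Enumerate truncated forms of a tryptic peptide.

    Each form keeps exactly one of the two original termini.
    min_len is a hypothetical lower bound on peptide length.
    """
    forms = set()
    for i in range(1, len(peptide) - min_len + 1):
        forms.add(peptide[i:])   # N-terminal truncation, C-terminus kept
        forms.add(peptide[:-i])  # C-terminal truncation, N-terminus kept
    return sorted(forms)


if __name__ == "__main__":
    for form in semi_tryptic_forms("ACDEFGHIK"):
        print(form)
```

The number of candidates grows with peptide length, so searching all of them for every tryptic peptide enlarges the search space considerably, which is the cost that filtering by predicted truncatability is meant to reduce.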
A specialised proteomic database for comparing matrix-assisted laser desorption/ionization-time of flight mass spectrometry data of tryptic peptides with corresponding sequence database segments
Proteomics, 2001
We have developed a specialised proteomic database for the analysis of matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF MS) data derived from tryptic peptides of Sinorhizobium meliloti proteins. This database currently contains the amino acid sequence data of the proteins predicted from the complete chromosome, MALDI-TOF MS data from proteolytic peptides of about 400 tryptically digested proteins, and the results of a search of the MALDI-TOF MS spectra against the chromosomal amino acid sequences. The database made it possible to access and compare the sequences of theoretical tryptic peptides that correspond to MALDI-TOF peaks in the mass spectrum with predicted tryptic peptides from identified proteins that could not be matched to MALDI-TOF peaks. A comparison of the molecular weights, isoelectric points and amino acid compositions of the identified and nonidentified peptides is presented. We also show how the system can assist in the development of an automated scoring function that facilitates and consolidates protein identification.
2010
Database search is the most popular approach used for the identification of peptides in contemporary shotgun proteomics; it utilizes only mass spectrometric data. In this work, we introduce three criteria for the verification of peptide identification; these are based on the analysis of data orthogonal to tandem mass spectra. The first one utilizes chromatographic retention times of peptides. The development of such approaches has been hindered by the relatively low accuracy of retention time prediction algorithms. In this work, we suggest the use of two independent models of the liquid chromatography of peptides, which increase the reliability of the results. The second criterion utilizes the mean number of missed tryptic cleavages per peptide. The third one results from the analysis of the difference between theoretical and experimentally measured peptide masses. The proposed criteria were applied to the tandem mass spectra of tryptic peptides from rat kidney tissue, which were processed by the Mascot search engine. All the criteria showed that Mascot significantly overestimated the reliability of an identification. This conclusion was supported by the PeptideProphet algorithm.
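The second and third criteria lend themselves to a short illustration. The sketch below computes the mean number of missed tryptic cleavages over a list of peptides and the relative mass error in parts per million. It assumes the common convention that trypsin cleaves after K or R unless the next residue is P; the abstract does not state which rule the authors used, and the function names are hypothetical.

```python
def count_missed_cleavages(peptide):
    """Count internal K/R residues not followed by P.

    The C-terminal residue is ignored because it is the expected
    cleavage site. The proline exception is an assumed convention.
    """
    return sum(
        1
        for i in range(len(peptide) - 1)
        if peptide[i] in "KR" and peptide[i + 1] != "P"
    )


def mean_missed_cleavages(peptides):
    """Average number of missed cleavages over a non-empty peptide list."""
    return sum(count_missed_cleavages(p) for p in peptides) / len(peptides)


def ppm_error(measured_mass, theoretical_mass):
    """Relative difference between measured and theoretical mass, in ppm."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


if __name__ == "__main__":
    identified = ["AKDEFR", "AKPDEFR", "ARKDEK"]
    print(mean_missed_cleavages(identified))
    print(ppm_error(1000.001, 1000.0))
```

A set of identifications whose mean missed-cleavage count or mass-error distribution departs strongly from what is expected for correct matches would hint at a higher share of false positives than the search engine reports.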
Mass spectrometry-based shotgun proteomics approaches are currently considered the technology of choice for large-scale proteogenomics due to their high throughput, good availability and relative ease of use. Protein mixtures are first digested with a protease, e.g. trypsin, and the resultant peptides are analyzed using liquid chromatography-tandem mass spectrometry. Proteins and peptides are identified from the resultant tandem mass spectra by de novo interpretation of the spectra or by searching databases of putative sequences. Since these data represent the expressed proteins in the sample, they can be used to infer novel proteogenomic features when mapped to the genome. However, high-throughput mass spectrometry instruments can readily generate hundreds of thousands, perhaps millions, of spectra, and the size of genomic databases, such as six-frame translated genome databases, is enormous. Therefore, computational demands are very high, and there is potential inaccuracy in peptide identification due to the large search space. These issues are considered the main challenges that limit the utilization of this approach. In this review, we highlight the efforts of the proteomics and bioinformatics communities to develop methods, algorithms and software tools that facilitate peptide sequence identification from databases in large-scale proteogenomic studies.
Computational prediction of proteotypic peptides for quantitative proteomics
2007
Mass spectrometry-based quantitative proteomics has become an important component of biological and clinical research. Although such analyses typically assume that a protein's peptide fragments are observed with equal likelihood, only a few so-called 'proteotypic' peptides are repeatedly and consistently identified for any given protein present in a mixture. Using >600,000 peptide identifications generated by four proteomic platforms, we empirically identified >16,000 proteotypic peptides for 4,030 distinct yeast proteins. Characteristic physicochemical properties of these peptides were used to develop a computational tool that can predict proteotypic peptides for any protein from any organism, for a given platform, with >85% cumulative accuracy. Possible applications of proteotypic peptides include validation of protein identifications, absolute quantification of proteins, annotation of coding sequences in genomes, and characterization of the physical principles governing key elements of mass spectrometric workflows (e.g., digestion, chromatography, ionization and fragmentation).
Analytical Chemistry, 2008
Identifying proteins and their modification states with known levels of confidence remains a significant challenge for proteomics. Random or decoy peptide databases are increasingly being used to estimate the false discovery rate (FDR), e.g., from liquid chromatography-tandem mass spectrometry (LC-MS/MS) analyses of tryptic digests. We show that this approach can significantly underestimate the FDR and describe an approach for more confident protein identifications that uses unique partial sequences derived from a combination of database searching and amino acid residue sequencing using high-accuracy MS/MS data. Applied to a Saccharomyces cerevisiae tryptic digest, the approach provided 3 132 confident peptide identifications (∼5% modified in some fashion), covering 575 proteins with an estimated zero FDR. The conventional approach provided 3 359 peptide identifications and 656 proteins with 0.3% FDR based upon a decoy database analysis. However, the present approach revealed ∼5% of the 3 359 identifications to be incorrect and many more as potentially ambiguous (e.g., due to not considering certain amino acid substitutions and modifications). In addition, 677 peptides and 39 proteins were identified that had been missed by conventional analysis, including nontryptic peptides, peptides with a variety of expected/unexpected chemical modifications, known/unknown post-translational modifications, single nucleotide polymorphisms or gene encoding errors, and multiple modifications of individual peptides.
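For readers unfamiliar with decoy-based estimates, the sketch below shows one simple and widely used form of the calculation: the ratio of decoy hits to target hits at or above a score threshold. This is an illustrative estimator only, with hypothetical names and toy scores; the abstract does not specify the exact formula behind the 0.3% figure, and it argues that estimates of this kind can be too optimistic.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR as decoys / targets among matches scoring >= threshold.

    psms: list of (score, is_decoy) tuples from a target plus decoy search.
    Returns 0.0 when no target match passes the threshold.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    # Toy scores, not real search results.
    psms = [(50, False), (45, False), (40, True), (38, False), (30, False), (20, True)]
    print(decoy_fdr(psms, threshold=35))
    print(decoy_fdr(psms, threshold=45))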
A Novel Algorithm for Validating Peptide Identification from a Shotgun Proteomics Search Engine
Journal of Proteome Research, 2013
Liquid chromatography coupled with tandem mass spectrometry has revolutionized the proteomic analysis of complexes, cells, and tissues. In a typical proteomic analysis, the tandem mass spectra from an LC/MS/MS experiment are assigned to a peptide by a search engine that compares the experimental MS/MS peptide data to theoretical peptide sequences in a protein database. The peptide-spectrum matches are then used to infer a list of identified proteins in the original sample. However, search engines often fail to distinguish between correct and incorrect peptide assignments. In this study, we designed and implemented a novel algorithm called De-Noise to reduce the number of incorrect peptide matches and maximize the number of correct peptides at a fixed false discovery rate using a minimal number of scoring outputs from the SEQUEST search engine. The novel algorithm uses a three-step process: data cleaning, data refining through an SVM-based decision function, and a final data refining step based on proteolytic peptide patterns. Using proteomics data generated on different types of mass spectrometers, we optimized the De-Noise algorithm based on the resolution and mass accuracy of the mass spectrometer employed in the LC/MS/MS experiment. Our results demonstrate that De-Noise improves peptide identification compared to other methods used to process the peptide sequence matches assigned by SEQUEST. Because De-Noise uses a limited number of scoring attributes, it can be easily implemented with other search engines.
PROTEOMICS, 2001
Identification of proteins from the mass spectra of peptide fragments generated by proteolytic cleavage using database searching has become one of the most powerful techniques in proteome science, capable of rapid and efficient protein identification. Using computer simulation, we have studied how the application of chemical derivatisation techniques may improve the efficiency of protein identification from mass spectrometric data. These approaches enhance ion yield and lead to the promotion of specific ions and fragments, yielding additional database search information. The impact of three alternative techniques has been assessed by searching representative proteome databases for both single proteins and simple protein mixtures. For example, by reliably promoting fragmentation of singly-charged peptide ions at aspartic acid residues after homoarginine derivatisation, 82% of yeast proteins can be unambiguously identified from a single typical peptide-mass datum, with a measured mass accuracy of 50 ppm, by using the associated secondary ion data. The extra search information also provides a means to confidently identify proteins in protein mixtures where only limited data are available. Furthermore, the inclusion of limited sequence information for the peptides can compensate and exceed the search efficiency available via high accuracy searches of around 5 ppm, suggesting that this is a potentially useful approach for simple protein mixtures routinely obtained from two-dimensional gels.
Peptide-mass fingerprinting and the ideal covering set for protein characterisation
Electrophoresis, 1997
The rules that govern the dynamics of protein characterisation by peptide-mass fingerprinting (PMF) were investigated through multiple interrogations of a nonredundant protein database. This was achieved by analysing the efficiency of identifying each entry in the entire database via perfect in silico digestion with a series of 20 pseudo-endoproteinases cutting at the carboxy terminal of each amino acid residue, and the multiple cutters: trypsin, chymotrypsin and Glu-C. The distribution of peptide fragment masses generated by endoproteinase digestion was examined with a view to designing better approaches to protein characterisation by PMF. On average, and for both common and rare cutters, the combination of approximately two fragments was sufficient to identify most database entries. However, the rare cutters left more entries unidentified in the database. Total coverage of the entire database could not be achieved with one enzymatic cutter alone, nor when all 23 cutters were used together. Peptide fragments of > 5000 Da had little effect on the outcome of PMF to correctly characterise database entries, while those with low mass (near to 350 Da in the case of trypsin) were found to be of most utility. The most frequently occurring fragments were also found in this lower mass region. The maximum size of uncut database entries (those not containing a specific amino acid residue) ranged from 52 908 Da to 258 314 Da, while the failure rate for a single cutter in identifying database entries varied from 10 865 (8.4%) to 23 290 (18.1%). PMF is likely to be a mainstay of any high-throughput protein screening strategy for large-scale proteome analysis. A better understanding of the merits and limitations of this technique will allow researchers to optimise their protein characterisation procedures.
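The in silico digestion described above can be sketched in a few lines. The function below cuts a sequence on the carboxy-terminal side of every residue in a given set, so a single-residue set models one pseudo-endoproteinase and "KR" gives a simplified trypsin. The optional `missed_cleavages` parameter is an addition for illustration (the abstract assumes perfect digestion), the proline exception is deliberately omitted, and all names and the example sequence are hypothetical.

```python
def digest(sequence, cleave_after, missed_cleavages=0):
    """Cut `sequence` C-terminal to every residue found in `cleave_after`.

    missed_cleavages=0 models perfect digestion. Larger values also emit
    peptides spanning up to that many uncut sites.
    """
    # Cut positions: sequence start, the index after each cleavage residue
    # (the last residue is skipped to avoid an empty trailing peptide),
    # and sequence end.
    cuts = [0]
    cuts += [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in cleave_after]
    cuts.append(len(sequence))

    peptides = []
    for start in range(len(cuts) - 1):
        for skipped in range(missed_cleavages + 1):
            end = start + 1 + skipped
            if end >= len(cuts):
                break
            peptides.append(sequence[cuts[start]:cuts[end]])
    return peptides


if __name__ == "__main__":
    seq = "MKWVTFRISLLK"
    print(digest(seq, "KR"))                      # simplified trypsin
    print(digest(seq, "L"))                       # pseudo-endoproteinase after L
    print(digest(seq, "KR", missed_cleavages=1))  # allow one missed cleavage
```

A sequence that contains none of the residues in `cleave_after` is returned as a single uncut peptide, which corresponds to the "uncut database entries" mentioned in the abstract.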
In silico analysis of accurate proteomics, complemented by selective isolation of peptides
Journal of Proteomics, 2011
Protein identification by mass spectrometry is mainly based on MS/MS spectra and the accuracy of molecular mass determination. However, the high complexity and dynamic range of proteomic samples from any species surpass the separation capacity and detection power of the most advanced multidimensional liquid chromatographs and mass spectrometers. Only a tiny portion of signals is selected for MS/MS experiments, and a considerable number of them still do not provide reliable peptide identification. In this article, an in silico analysis of a novel methodology for peptide and protein identification is described. The approach is based on mass accuracy, isoelectric point (pI), retention time (tR) and N-terminal amino acid determination as protein identification criteria, independently of high-quality MS/MS spectra. When the methodology was combined with selective isolation methods, the number of unique peptides and identified proteins increased. Finally, to demonstrate the feasibility of the methodology, an OFFGEL-LC-MS/MS experiment was also implemented. We compared the more reliable peptides identified with MS/MS information with peptides identified using three experimental features (pI, tR, molecular mass). Also, two theoretical assumptions from MS/MS identification (selective isolation of peptides and N-terminal amino acid) were analyzed. Our results show that, using the information provided by these features and selective isolation methods, we could find 93% of the high-confidence proteins identified by MS/MS, with a false-positive rate lower than 5%.