Onco-proteogenomics: a novel approach to identify cancer-specific mutations combining proteomics and transcriptome deep sequencing

Dynamic Spectrum Quality Assessment and Iterative Computational Analysis of Shotgun Proteomic Data: Toward More Efficient Identification of Post-translational Modifications, Sequence Polymorphisms, and Novel Peptides

In mass spectrometry-based proteomics, frequently hundreds of thousands of MS/MS spectra are collected in a single experiment. Of these, a relatively small fraction is confidently assigned to peptide sequences, whereas the majority of the spectra are not further analyzed. Spectra are not assigned to peptides for diverse reasons. These include deficiencies of the scoring schemes implemented in the database search tools, sequence variations (e.g. single nucleotide polymorphisms) or omissions in the database searched, post-translational or chemical modifications of the peptide analyzed, or the observation of sequences that are not anticipated from the genomic sequence (e.g. splice forms, somatic rearrangement, and processed proteins). To increase the amount of information that can be extracted from proteomic MS/MS datasets we developed a robust method that detects high quality spectra within the fraction of spectra unassigned by conventional sequence database searching and computes a quality score for each spectrum. We also demonstrate that iterative search strategies applied to such detected unassigned high quality spectra significantly increase the number of spectra that can be assigned from datasets and that biologically interesting new insights can be gained from existing data. Molecular & Cellular Proteomics 5:652–670, 2006.

Proteomics, the systematic identification and characterization of all proteins expressed in a cell, has become a key analytical approach in the life sciences (1). The dramatic progress of proteomic research over the last decade has been catalyzed by several, seemingly independent developments. First, the wealth of genomic sequence information generated by large scale sequencing projects and the development of computational gene prediction and annotation tools have produced sequence databases that are expected to contain most coding gene regions. These databases can be searched with proteomic data and constrain the proteomic search space (2). Second, technological improvements in mass spectrometry and peptide and protein separation techniques allow rapid and sensitive protein identification from minute amounts of complex biological samples (for reviews, see Refs. 1, 3, and 4). Third, the development of computational tools for the assignment of MS/MS spectra to peptide sequences and the statistical validation of these assignments support the consistent analysis of large datasets with no or minimal human intervention (5). Collectively these developments resulted in the emergence of shotgun proteomics, a strategy based on the combination of tandem mass spectrometry-based peptide sequencing and sequence database searching, which now routinely permits the identification of hundreds to thousands of proteins in a single experiment.

Shotgun proteomics creates significant computational challenges (5–8). Large numbers (on the order of 10^5) of MS/MS spectra acquired in each experiment need to be computationally processed to identify peptides that produced them and to infer what proteins were present in the original sample. In most high throughput studies, peptide identification is performed by searching MS/MS spectra against protein sequence databases. A number of automated database search tools have been developed for that purpose, including commercial and open source programs (9–17). These programs correlate the experimental MS/MS spectra with theoretical fragmentation patterns of peptides obtained from a sequence database and use various scoring schemes to find the best matching peptide sequence. This high throughput protein identification process, however, is prone to false positives resulting from incorrect peptide assignments to MS/MS spectra by the database search tools (5, 18–21). The problem of false positives has received significant attention in recent years. As a result, statistical approaches and computational tools were developed for assigning confidence measures to peptide and protein identifications and for estimating the false identification rates. These tools reduce the need for time-consuming manual verification of peptide assignments

XMAn – A Homo sapiens Mutated-Peptide Database for the MS Analysis of Cancerous Cell States

Analytical Chemistry, 2014

To enable the identification of mutated peptide sequences in complex biological samples, two novel cancer- and disease-related protein databases were developed in this work, with mutation information collected from several public resources such as COSMIC, IARC P53, OMIM, and UniProtKB. In-house Perl scripts were used to search and process the data, and to translate each gene-level mutation into a mutated peptide sequence. The cancer and disease mutation databases comprise a total of 872,125 and 27,148 peptide entries from 25,642 and 2,913 proteins, respectively. A description line for each entry provides the parent protein ID and name, the cDNA- and protein-level mutation site and type, the originating database, and the disease or cancer tissue type and corresponding hits. The two databases are FASTA formatted to enable data retrieval by commonly used tandem MS search engines. While the largest number of mutations was encountered for the amino acids A/D/E/G/L/P/R/S, the global mutation profiles closely replicate the outcome of the 1000 Genomes Project aimed at cataloguing natural mutations in the human population. The affected proteins were primarily involved in transcription regulation, splicing, protein synthesis/folding/binding, redox/energy production, adhesion/motility, and to some extent in DNA damage repair and signaling. The applicability of the database to identifying the presence of mutated peptides was investigated with MCF-7 breast cancer cell extracts.
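The translation of a protein-level mutation into a FASTA-formatted mutated peptide can be illustrated with a minimal Python sketch. This is not the authors' Perl pipeline: the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages), the header layout, and the names `apply_missense`, `tryptic_peptides`, `mutated_peptide_fasta`, `P_TOY`, and `toy_source` are all assumptions made for illustration.

```python
def apply_missense(protein_seq, position, ref_aa, alt_aa):
    """Substitute one residue; position is 1-based, as in p.A4V."""
    if protein_seq[position - 1] != ref_aa:
        raise ValueError(f"expected {ref_aa} at position {position}")
    return protein_seq[:position - 1] + alt_aa + protein_seq[position:]


def tryptic_peptides(seq):
    """Cleave after K or R unless followed by P (simplified trypsin rule).

    Returns (0-based start, peptide) pairs with no missed cleavages.
    """
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        is_last = i == len(seq) - 1
        if is_last or (aa in "KR" and seq[i + 1] != "P"):
            peptides.append((start, seq[start:i + 1]))
            start = i + 1
    return peptides


def mutated_peptide_fasta(protein_id, protein_seq, position, ref_aa, alt_aa, source):
    """Return a FASTA entry for the tryptic peptide that carries the mutation."""
    # Digest the mutated sequence so that mutations creating or removing
    # a K/R cleavage site are reflected in the peptide boundaries.
    mutated = apply_missense(protein_seq, position, ref_aa, alt_aa)
    for start, peptide in tryptic_peptides(mutated):
        if start < position <= start + len(peptide):
            # Hypothetical header layout: parent ID | mutation | source database
            header = f">{protein_id}|p.{ref_aa}{position}{alt_aa}|{source}"
            return f"{header}\n{peptide}\n"
    raise ValueError("mutation position lies outside the sequence")


# Toy protein and mutation, invented for the example
print(mutated_peptide_fasta("P_TOY", "MKTAYRAGLK", 4, "A", "V", "toy_source"))
```

A real database build would also have to handle nonsense mutations, missed cleavages, and redundant peptides, none of which this sketch attempts.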

Dynamic Spectrum Quality Assessment and Iterative Computational Analysis of Shotgun Proteomic Data: Toward More Efficient Identification of Post-translational Modifications, Sequence Polymorphisms, and Novel Peptides

Molecular & Cellular Proteomics, 2005

In mass spectrometry-based proteomics, frequently hundreds of thousands of MS/MS spectra are collected in a single experiment. Of these, a relatively small fraction is confidently assigned to peptide sequences, whereas the majority of the spectra are not further analyzed. Spectra are not assigned to peptides for diverse reasons. These include deficiencies of the scoring schemes implemented in the database search tools, sequence variations (e.g. single nucleotide polymorphisms) or omissions in the database searched, post-translational or chemical modifications of the peptide analyzed, or the observation of sequences that are not anticipated from the genomic sequence (e.g. splice forms, somatic rearrangement, and processed proteins). To increase the amount of information that can be extracted from proteomic MS/MS datasets we developed a robust method that detects high quality spectra within the fraction of spectra unassigned by conventional sequence database searching and computes a quality score for each spectrum. We also demonstrate that iterative search strategies applied to such detected unassigned high quality spectra significantly increase the number of spectra that can be assigned from datasets and that biologically interesting new insights can be gained from existing data.

Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry

Genome biology, 2005

A crucial aim upon the completion of the human genome is the verification and functional annotation of all predicted genes and their protein products. Here we describe the mapping of peptides derived from accurate interpretations of protein tandem mass spectrometry (MS) data to eukaryotic genomes and the generation of an expandable resource for integration of data from many diverse proteomics experiments. Furthermore, we demonstrate that peptide identifications obtained from high-throughput proteomics can be integrated on a large scale with the human genome. This resource could serve as an expandable repository for MS-derived proteome information.

XMAn v2: a database of Homo sapiens mutated peptides

Bioinformatics, 2020

The 'Unknown Mutation Analysis (XMAn)' database is a compilation of Homo sapiens mutated peptides in FASTA format, that was constructed for facilitating the identification of protein sequence alterations by tandem mass spectrometry detection. The database comprises 2 539 031 non-redundant mutated entries from 17 599 proteins, of which 2 377 103 are missense and 161 928 are nonsense mutations. It can be used in conjunction with search engines that seek the identification of peptide amino acid sequences by matching experimental tandem mass spectrometry data to theoretical sequences from a database.

InsPecT: Identification of Posttranslationally Modified Peptides from Tandem Mass Spectra

Analytical Chemistry, 2005

Reliable identification of post-translational modifications is key to understanding various cellular regulatory processes. We describe a tool, InsPecT, to identify post-translational modifications using tandem mass spectrometry data. InsPecT constructs database filters that proved to be very successful in genomics searches. Given an MS/MS spectrum S and a database D, a database filter selects a small fraction of database D that is guaranteed (with high probability) to contain a peptide that produced S. InsPecT uses peptide sequence tags as efficient filters that reduce the size of the database by a few orders of magnitude while retaining the correct peptide with very high probability. In addition to filtering, InsPecT also uses novel algorithms for scoring and validating in the presence of modifications, without explicit enumeration of all variants.
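The idea of using short sequence tags as a database filter can be sketched in Python as follows. This is a simplified illustration, not InsPecT's actual algorithm: it assumes the tags have already been derived from the spectrum, ignores the flanking masses that real tags carry, and does not treat isobaric residues (such as I/L) as equivalent. The names `build_kmer_index` and `filter_by_tags` and the toy database are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k=3):
    """Map every length-k substring to the set of proteins that contain it."""
    index = defaultdict(set)
    for protein_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(protein_id)
    return index


def filter_by_tags(index, tags):
    """Keep only proteins that contain at least one of the spectrum's tags."""
    candidates = set()
    for tag in tags:
        candidates |= index.get(tag, set())
    return candidates


# Toy database, invented for the example
database = {"P1": "MKTAYRAGLK", "P2": "GGSWEVLPK", "P3": "AAYRAGQQ"}
index = build_kmer_index(database, k=3)
print(sorted(filter_by_tags(index, ["YRA"])))
```

Only the surviving candidates would then be scored in full against the spectrum, which is where the reduction in search space comes from.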

Improving gene annotation using peptide mass spectrometry

Genome Research, 2007

Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.
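Searching peptides against a genome without prior gene annotation starts from translating the DNA in all six reading frames. The Python sketch below shows only that step, under the assumption of the standard genetic code. It ignores splicing, which the work above handles with its own database construction, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_peptide(dna, peptide):
    """List the frames whose translation contains the peptide."""
    frames = six_frame_translation(dna.upper())
    return [name for name, protein in frames.items() if peptide in protein]


print(six_frame_translation("ATGGCCAAGTAA"))
```

A peptide identified by MS/MS that maps to a frame outside any annotated gene is the kind of evidence used to propose novel or extended exons.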

De Novo Sequencing of Unique Sequence Tags for Discovery of Post-Translational Modifications of Proteins

Analytical Chemistry, 2008

De novo sequencing is a spectrum analysis approach for mass spectrometry data to discover posttranslational modifications in proteins; however, such an approach is still in its infancy and is still not widely applied to proteomic practices due to its limited reliability. In this work, we describe a de novo sequencing approach for the discovery of protein modifications based on identification of the proteome UStags. The de novo information was obtained from Fourier-transform tandem mass spectrometry data for peptides and polypeptides from a yeast lysate, and the de novo sequences obtained were selected based on filter levels designed to provide a limited yet high quality subset of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags' prefix and suffix sequences and the UStags themselves) were used to infer possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances within several yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. To determine false discovery rates, two random (false) databases were independently used for sequence matching, and ∼3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity of the approach were investigated and described. The combined de novo-UStag approach complements the UStag method previously reported by enabling the discovery of new protein modifications. 
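The two steps described above, checking that a tag is unique to a single predicted protein and then comparing the residues around it with the predicted sequence, can be sketched in Python. This is a toy illustration under strong assumptions: it handles only residue substitutions, not mass shifts from modifications, and the function names and the two-protein proteome are invented for the example.

```python
def proteins_containing(tag, proteome):
    """Return the IDs of all predicted proteins that contain the tag."""
    return [pid for pid, seq in proteome.items() if tag in seq]


def is_unique_tag(tag, proteome):
    """A tag qualifies only if it occurs in exactly one predicted protein."""
    return len(proteins_containing(tag, proteome)) == 1


def compare_flanks(observed, tag, protein_seq):
    """Align the observed sequence to the protein at the tag and list
    substitutions as (1-based protein position, predicted, observed)."""
    o = observed.index(tag)
    p = protein_seq.index(tag)
    diffs = []
    for k in range(-o, len(observed) - o):
        pi = p + k
        if 0 <= pi < len(protein_seq) and observed[o + k] != protein_seq[pi]:
            diffs.append((pi + 1, protein_seq[pi], observed[o + k]))
    return diffs


# Toy proteome and de novo sequence, invented for the example
proteome = {"Y1": "MSTAGKVLEDWR", "Y2": "MSTAGQQPLNWR"}
print(is_unique_tag("VLED", proteome))
print(compare_flanks("GRVLEDW", "VLED", proteome["Y1"]))
```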
The UStag method for unambiguous peptide and polypeptide identification has recently been demonstrated for the analysis of enzymatically (e.g., tryptic) digested cell lysates (1) and for the determination of natural intracellular proteolysis (degradation) of proteins (2) using accurate Fourier-transform tandem mass spectrometry (FT-MS/MS) data. Sequences are determined to be UStags when the accurately measured consecutive fragments reveal these sequences to be unique in the genome for single proteins. The UStags reported (1, 2) are assigned for the candidates that have the top closest spectral similarities to the MS/MS measurement (e.g., candidates ranked from Sequest). An advantage of such a database search-UStag approach is that it produces sequence identities with extremely low false discovery rates for peptides/polypeptides having a large range of lengths and with various amino acid termini (1, 2). Also, this approach is capable of identifying unknown or unexpected changes, deviations, and errors from the predicted protein sequences (1). However, the amino acid changes, deviations, and errors either on the UStag's prefix (i.e., the part of sequence prior to a UStag in the sequencing direction

In silico analysis of accurate proteomics, complemented by selective isolation of peptides

Journal of Proteomics, 2011

Protein identification by mass spectrometry is mainly based on MS/MS spectra and the accuracy of molecular mass determination. However, the high complexity and dynamic range of proteomic samples from any species surpass the separation capacity and detection power of even the most advanced multidimensional liquid chromatographs and mass spectrometers. Only a tiny portion of signals is selected for MS/MS experiments, and a considerable number of these still do not provide reliable peptide identification. In this article, an in silico analysis of a novel methodology for peptide and protein identification is described. The approach uses mass accuracy, isoelectric point (pI), retention time (t_R), and N-terminal amino acid determination as protein identification criteria, independently of high quality MS/MS spectra. When the methodology was combined with selective isolation methods, the number of unique peptides and identified proteins increased. Finally, to demonstrate the feasibility of the methodology, an OFFGEL-LC-MS/MS experiment was also implemented. We compared the more reliable peptides identified with MS/MS information with peptides identified from three experimental features (pI, t_R, molecular mass). Also, two theoretical assumptions from MS/MS identification (selective isolation of peptides and N-terminal amino acid) were analyzed. Our results show that, using the information provided by these features and selective isolation methods, we could find 93% of the high confidence proteins identified by MS/MS with a false-positive rate lower than 5%.
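Two of the criteria named above, accurate mass and the N-terminal amino acid, are simple enough to sketch in Python. The residue masses are standard monoisotopic values rounded to five decimals. The ppm tolerance, the function names, and the toy peptide list are assumptions for illustration, and pI and retention time are left out because they would need prediction models that the abstract does not specify.

```python
# Monoisotopic residue masses in daltons (standard values, rounded)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates_by_mass(observed_mass, peptides, tol_ppm=5.0, n_term=None):
    """Peptides within tol_ppm of the observed mass, optionally restricted
    to those starting with a known N-terminal residue."""
    hits = []
    for pep in peptides:
        error_ppm = abs(peptide_mass(pep) - observed_mass) / observed_mass * 1e6
        if error_ppm <= tol_ppm and (n_term is None or pep[0] == n_term):
            hits.append(pep)
    return hits


# Toy candidates: GASK and SAGK are isobaric, so mass alone cannot separate them
peptides = ["GASK", "SAGK", "PEPTIDE"]
observed = peptide_mass("GASK")
print(candidates_by_mass(observed, peptides))
print(candidates_by_mass(observed, peptides, n_term="G"))
```

The second call shows why an extra feature helps: peptides with the same composition share a mass, and the N-terminal residue removes one of the two candidates here.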

Proteome-Wide Identification of Proteins and Their Modifications with Decreased Ambiguities and Improved False Discovery Rates Using Unique Sequence Tags

Analytical Chemistry, 2008

Identifying proteins correctly and with known levels of confidence remains a significant challenge for proteomics. Random or decoy peptide databases are increasingly being used to estimate the false discovery rate (FDR), e.g., from liquid chromatography-tandem mass spectrometry (LC-MS/MS) analyses of tryptic digests. We show that this approach can significantly underestimate the FDR, and describe an approach for more confident protein identifications that uses unique partial sequences derived from a combination of database searching and amino acid residue sequencing using high accuracy MS/MS data. Applied to a Saccharomyces cerevisiae tryptic digest, the approach provided 3,132 confident peptide identifications (∼5% modified in some fashion), covering 575 proteins with an estimated zero FDR. The conventional approach provided 3,359 peptide identifications and 656 proteins with 0.3% FDR based upon a decoy database analysis. However, the present approach revealed ∼5% of the 3,359 identifications to be incorrect, and many more to be potentially ambiguous (e.g., due to not considering certain amino acid substitutions and modifications). In addition, 677 peptides and 39 proteins were identified that had been missed by conventional analysis, including non-tryptic peptides, peptides with various expected/unexpected chemical modifications, known/unknown posttranslational modifications, single nucleotide polymorphisms or gene encoding errors, and multiple modifications of individual peptides.

Keywords: precise proteomics; high-precision tandem mass spectrometry; unique sequences; post-translational modifications and mutants; false discovery rate and identification ambiguity

A currently popular strategy utilizes a comparably sized "decoy" set of "false" peptides to estimate the level of incorrect identifications for a particular set of filtering criteria (5). While
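The conventional decoy-based estimate discussed above can be written down in a few lines of Python. This is a generic sketch of the common estimator (decoy hits divided by target hits above a score threshold), assuming a decoy set of comparable size to the target set. It is not the unique-sequence-tag method of the paper, which argues that this estimate can be too optimistic. The function names and scores are invented for the example.

```python
def estimate_fdr(psms, threshold):
    """psms: iterable of (score, is_decoy). FDR ~ decoys / targets at or above threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(psms, max_fdr):
    """Lowest observed score whose estimated FDR does not exceed max_fdr."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if estimate_fdr(psms, score) <= max_fdr:
            best = score
    return best


# Toy peptide-spectrum matches: (search score, matched a decoy sequence?)
psms = [(9.0, False), (8.0, False), (7.0, True),
        (6.0, False), (5.0, False), (4.0, True)]
print(estimate_fdr(psms, 5.0))
print(threshold_for_fdr(psms, 0.25))
```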