A novel computational approach applicable to human microbiome studies – urinary tract microbiome example
Related papers
Randomized Sequence Databases for Tandem Mass Spectrometry Peptide and Protein Identification
OMICS: A Journal of Integrative Biology, 2005
Tandem mass spectrometry (MS/MS) combined with database searching is currently the most widely used method for high-throughput peptide and protein identification. Many different algorithms, scoring criteria, and statistical models have been used to identify peptides and proteins in complex biological samples, and many studies, including our own, describe the accuracy of these identifications, using at best generic terms such as "high confidence." False positive identification rates for these criteria can vary substantially with changing organisms under study, growth conditions, sequence databases, experimental protocols, and instrumentation; therefore, study-specific methods are needed to estimate the accuracy (false positive rates) of these peptide and protein identifications. We present and evaluate methods for estimating false positive identification rates based on searches of randomized databases (reversed and reshuffled). We examine the use of separate searches of a forward and then a randomized database, and combined searches of a randomized database appended to a forward sequence database. Estimated error rates from randomized database searches are first compared against actual error rates from MS/MS runs of known protein standards. These methods are then applied to biological samples of the model microorganism Shewanella oneidensis strain MR-1. Based on the results obtained in this study, we recommend the use of combined searches of a reshuffled database appended to a forward sequence database as a means of providing quantitative estimates of false positive identification rates of peptides and proteins. This will allow researchers to set criteria and thresholds to achieve a desired error rate and provide the scientific community with direct and quantifiable measures of peptide and protein identification accuracy as opposed to vague assessments such as "high confidence."
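The combined forward-plus-randomized search described in this abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `DECOY_` prefix, the function names, the `(protein_name, score)` match format, and in particular the estimator (decoy hits divided by forward hits above a score threshold), which is one common convention and not necessarily the formula used in the paper.

```python
import random
from typing import Dict, List, Tuple

DECOY_PREFIX = "DECOY_"  # hypothetical naming convention for randomized entries


def reversed_decoy(seq: str) -> str:
    """Return the reversed sequence (a 'reversed' randomized entry)."""
    return seq[::-1]


def shuffled_decoy(seq: str, rng: random.Random) -> str:
    """Return a reshuffled sequence with the same amino acid composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def build_combined_database(targets: Dict[str, str], seed: int = 0) -> Dict[str, str]:
    """Append one reshuffled decoy per forward sequence to the forward database."""
    rng = random.Random(seed)
    combined = dict(targets)
    for name, seq in targets.items():
        combined[DECOY_PREFIX + name] = shuffled_decoy(seq, rng)
    return combined


def estimate_false_positive_rate(matches: List[Tuple[str, float]], threshold: float) -> float:
    """Estimate the false positive rate among forward hits scoring >= threshold.

    Assumes decoy hits are as frequent as incorrect forward hits, so the
    estimate is (number of decoy hits) / (number of forward hits).
    """
    accepted = [name for name, score in matches if score >= threshold]
    decoy_hits = sum(1 for name in accepted if name.startswith(DECOY_PREFIX))
    forward_hits = len(accepted) - decoy_hits
    if forward_hits == 0:
        return 0.0
    return decoy_hits / forward_hits
```

In practice one would scan a range of thresholds with `estimate_false_positive_rate` and keep the lowest one that meets the desired error rate, which is the kind of threshold setting the abstract argues for.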
Mass spectrometry-based shotgun proteomics approaches are currently considered the technology of choice for large-scale proteogenomics due to high throughput, good availability and relative ease of use. Protein mixtures are first digested with a protease, e.g. trypsin, and the resultant peptides are analyzed using liquid chromatography-tandem mass spectrometry. Proteins and peptides are identified from the resultant tandem mass spectra by de novo interpretation of the spectra or by searching databases of putative sequences. Since these data represent the expressed proteins in the sample, they can be used to infer novel proteogenomic features when mapped to the genome. However, high-throughput mass spectrometry instruments can readily generate hundreds of thousands, perhaps millions, of spectra, and the size of genomic databases, such as six-frame translated genome databases, is enormous. Therefore, computational demands are very high, and there is potential inaccuracy in peptide identification due to the large search space. These issues are considered the main challenges that limit the utilization of this approach. In this review, we highlight the efforts of the proteomics and bioinformatics communities to develop methods, algorithms and software tools that facilitate peptide sequence identification from databases in large-scale proteogenomic studies.
Microbiome, 2021
Background: A few recent large efforts significantly expanded the collection of human-associated bacterial genomes, which now contains thousands of entities including reference complete/draft genomes and metagenome assembled genomes (MAGs). These genomes provide a useful resource for studying the functionality of the human-associated microbiome and its relationship with human health and diseases. One application of these genomes is to provide a universal reference for database search in metaproteomic studies, when matched metagenomic/metatranscriptomic data are unavailable. However, a greater collection of reference genomes may not necessarily result in better peptide/protein identification because the increase of search space often leads to fewer spectrum-peptide matches, not to mention the drastic increase of computation time. Methods: Here, we present a new approach that uses two steps to optimize the use of the reference genomes and MAGs as the universal reference for human gut me...
Analytical Chemistry, 2000
We derive and validate a simple statistical model that predicts the distribution of false matches between peaks in matrix-assisted laser desorption/ionization mass spectrometry data and proteins in proteome databases. The model allows us to calculate the significance of previously reported microorganism identification results. In particular, for Δm = ±1.5 Da, we find that the computed significance levels are sufficient to demonstrate the ability to identify microorganisms, provided the number of candidate microorganisms is limited to roughly three Escherichia coli-like or roughly 10 Bacillus subtilis-like microorganisms (in the sense of having roughly the same number of proteins per unit-mass interval). We conclude that, given the cluttered and incomplete nature of the data, it is likely that neither simple ranking nor simple hypothesis testing will be sufficient for truly robust microorganism identification over a large number of candidate microorganisms.
PROTEOMICS, 2002
With the recent rapid expansion of DNA and protein sequence databases, intensive efforts are underway to interpret the linear genetic information of DNA in terms of function, structure and control of biological processes. The systematic identification and quantification of the expressed proteins has proven particularly powerful in this regard. Large-scale protein identification is usually achieved by automated liquid chromatography-tandem mass spectrometry (LC-MS/MS) of complex peptide mixtures and sequence database searching of the resulting spectra [1]. As generating large numbers of sequence-specific mass spectra (collision-induced dissociation, CID, spectra) has become a routine operation, research has shifted from the generation of sequence database search results to their validation. Here we describe in detail a novel probabilistic model and a score function that ranks the quality of the match between tandem mass spectral data and a peptide sequence in a database. We document the performance of the algorithm on a reference data set and in comparison with another sequence database search tool.
OLAV: Towards high‐throughput tandem mass spectrometry data identification
PROTEOMICS, 2003
Mass spectrometry combined with database searching has become the preferred method for identifying proteins in proteomics projects. Proteins are digested by one or several enzymes to obtain peptides, which are analyzed by mass spectrometry. We introduce a new family of scoring schemes, named OLAV, aimed at identifying peptides in a database from their tandem mass spectra. OLAV scoring schemes are based on signal detection theory, and exploit mass spectrometry information more extensively than previously existing schemes. We also introduce a new concept of structural matching that uses pattern detection methods to better separate true from false positives. We show the superiority of OLAV scoring schemes compared to MASCOT, a widely used identification program. We believe that this work introduces a new way of designing scoring schemes that are especially adapted to high‐throughput projects such as GeneProt large‐scale human plasma project, where it is impractical to check all identif...
A Scalable Parallel Approach for Peptide Identification from Large-Scale Mass Spectrometry Data
2009
Identifying peptides, which are short polymeric chains of amino acid residues in a protein sequence, is of fundamental importance in systems biology research. The most popular approach to identify peptides is through database search. In this approach, an experimental spectrum ("query") generated from fragments of a target peptide using mass spectrometry is computationally compared with a database of already known protein sequences. The goal is to detect database peptides that are most likely to have generated the target peptide. The exponential growth rates and overwhelming sizes of biomolecular databases make this an ideal application to benefit from parallel computing. However, the present generation of software tools is not expected to scale to the magnitudes and complexities of data that will be generated in the next few years. This is because they are all either serial algorithms or parallel strategies that have been designed over inherently serial methods, thereby incurring high space and time requirements. In this paper, we present an efficient parallel approach for peptide identification through database search. Three key factors distinguish our approach from that of existing solutions: i) (space) Given p processors and a database with N residues, we provide the first space-optimal algorithm (O(N/p)) under the distributed memory machine model; ii) (time) Our algorithm uses a combination of parallel techniques such as one-sided communication and masking of communication with computation to ensure that the overhead introduced due to parallelism is minimal; and iii) (quality) The run-time savings achieved using parallel processing have allowed us to incorporate highly accurate statistical models that have previously been demonstrated to ensure high quality prediction, albeit on smaller scale data. We present the design and evaluation of two different algorithms to implement our approach.
Experimental results using 2.65 million microbial proteins show linear scaling up to 128 processors of a Linux commodity cluster, with parallel efficiency at ∼50%. We expect that this new approach will be critical to meet the data-intensive and qualitative demands stemming from this important application domain.
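As a rough reading of the reported figure, and assuming the conventional definition of parallel efficiency (the abstract does not state which definition it uses), an efficiency of about 50% on 128 processors corresponds to a speedup of roughly 64 over a single processor. The symbols T_1, T_p, S and E below are introduced here for illustration only:

```latex
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p}
\quad\Longrightarrow\quad
S(128) = E(128) \cdot 128 \approx 0.5 \times 128 = 64
```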
Matching peptide mass spectra to EST and genomic DNA databases
Trends in Biotechnology, 2001
The use of mass spectrometry data to search molecular sequence databases is a well-established method for protein identification. The technique can be extended to searching raw genomic sequences, providing experimental confirmation or correction of predicted coding sequences, and has the potential to identify novel genes and elucidate splicing patterns.
Evaluating Peptide Mass Fingerprinting-based Protein Identification
Genomics, Proteomics & Bioinformatics, 2007
Identification of proteins by mass spectrometry (MS) is an essential step in proteomic studies and is typically accomplished by either peptide mass fingerprinting (PMF) or amino acid sequencing of the peptide. Although sequence information from MS/MS analysis can be used to validate PMF-based protein identification, it may not be practical when analyzing a large number of proteins and when high-throughput MS/MS instrumentation is not readily available. At present, a vast majority of proteomic studies employ PMF. However, there are huge disparities in criteria used to identify proteins using PMF. Therefore, to reduce incorrect protein identification using PMF, and also to increase confidence in PMF-based protein identification without accompanying MS/MS analysis, definitive guiding principles are essential. To this end, we propose a value-based scoring system that provides guidance on evaluating when PMF-based protein identification can be deemed sufficient without accompanying amino acid sequence data from MS/MS analysis.